git.vger.kernel.org archive mirror
* Does content provenance matter?
@ 2012-05-05 20:49 Kelly Dean
  2012-05-07  8:23 ` Thomas Rast
  2012-05-07 23:12 ` Jakub Narebski
  0 siblings, 2 replies; 10+ messages in thread
From: Kelly Dean @ 2012-05-05 20:49 UTC (permalink / raw)
  To: git

Suppose you make dirs B and C, copy file X into B and C, insert "foo" somewhere into B/X and the same place into C/X, and commit. Now, you copy "foo" from B/X into the same place in the original X, and commit again. Git doesn't record the information about whether "foo" was copied from B or C, and this is intentional, on the theory that just content, not provenance, is what matters.
Suppose instead, you branch master to new branches B and C, insert "foo" into B/X, commit, insert "foo" into C/X, and commit. Now, you merge B back into master. Git records that master contains "foo" because B contained it rather than because C contained it, on the theory that not only content, but also provenance, matters.
Does provenance actually matter, or not? The reason git doesn't record it in the first case isn't simply that your editor didn't store that information (and the editor didn't store it because it isn't customary to store it, and there's no standard way to store it); even if the editor were to store the information (e.g. as metadata for X; details not relevant) and a patch to git were submitted for it to record this metadata, the git maintainers would presumably reject this patch, on the basis that it violates git's design specification which says that provenance doesn't matter. For the same reason, git intentionally doesn't distinguish the cases of renaming a file or directory vs. deleting it and creating a new one with the same content, as has already been thoroughly debated.
The basic question is, if provenance doesn't matter, then why does a git commit record its parent(s)? Why not omit this information, and figure it out at search time (by looking at all commits with older timestamps), the same as you're supposed to figure out renames at search time and figure out the movement of lines within/among files at search time (by looking at all files in the parent commit(s))? (If speed is an issue, then use an index, but this doesn't require putting such derivative information in the commit record.)

* Re: Does content provenance matter?
  2012-05-05 20:49 Does content provenance matter? Kelly Dean
@ 2012-05-07  8:23 ` Thomas Rast
  2012-05-07 21:43   ` Kelly Dean
  2012-05-08  0:08   ` Junio C Hamano
  2012-05-07 23:12 ` Jakub Narebski
  1 sibling, 2 replies; 10+ messages in thread
From: Thomas Rast @ 2012-05-07  8:23 UTC (permalink / raw)
  To: Kelly Dean; +Cc: git

Kelly Dean <kellydeanch@yahoo.com> writes:

> [copying B/X over to C/X is not recorded as such], on the theory that
> just content, not provenance, is what matters.

> [merging branches *is* recorded], on the theory that not only content,
> but also provenance, matters.

> The basic question is, if provenance doesn't matter, then why does a
> git commit record its parent(s)? Why not omit this information, and
> figure it out at search time (by looking at all commits with older
> timestamps), the same as you're supposed to figure out renames at
> search time and figure out the movement of lines within/among files at
> search time (by looking at all files in the parent commit(s))?

What's the difference between the following series of commits?

  Foo
  Bar
  Revert Bar

and

  Foo

You claim that they're the same, because the tree state after each is
the same.  But I learned that Bar was broken, and recorded it for all to
see.

-- 
Thomas Rast
trast@{inf,student}.ethz.ch

* Re: Does content provenance matter?
  2012-05-07  8:23 ` Thomas Rast
@ 2012-05-07 21:43   ` Kelly Dean
  2012-05-07 22:14     ` PJ Weisberg
  2012-05-08  0:08   ` Junio C Hamano
  1 sibling, 1 reply; 10+ messages in thread
From: Kelly Dean @ 2012-05-07 21:43 UTC (permalink / raw)
  To: Thomas Rast; +Cc: git

--- On Mon, 5/7/12, Thomas Rast <trast@student.ethz.ch> wrote:
> What's the difference between the following series of
> commits?
>
>   Foo
>   Bar
>   Revert Bar
>
> and
>
>   Foo
>
> You claim that they're the same, because the tree state
> after each is
> the same.  But I learned that Bar was broken, and
> recorded it for all to see.
No, I don't claim they're the same. Different commits have different timestamps (and different commit messages, but that's not useful for automatic searching to find which commits are derived from which others). Consider if "Revert Bar" and "Bar" didn't point to their parents; could you still deduce from them that Bar was broken? Yes--on the basis of the commit timestamps (which show their temporal order) and the contents of the trees which the commits point to (which show that Revert Bar undoes a change made in Bar).

* Re: Does content provenance matter?
  2012-05-07 21:43   ` Kelly Dean
@ 2012-05-07 22:14     ` PJ Weisberg
  2012-05-07 23:13       ` Kelly Dean
  0 siblings, 1 reply; 10+ messages in thread
From: PJ Weisberg @ 2012-05-07 22:14 UTC (permalink / raw)
  To: Kelly Dean; +Cc: Thomas Rast, git

On Mon, May 7, 2012 at 2:43 PM, Kelly Dean <kellydeanch@yahoo.com> wrote:
> --- On Mon, 5/7/12, Thomas Rast <trast@student.ethz.ch> wrote:
>> What's the difference between the following series of
>> commits?
>>
>>   Foo
>>   Bar
>>   Revert Bar
>>
>> and
>>
>>   Foo
>>
>> You claim that they're the same, because the tree state
>> after each is
>> the same.  But I learned that Bar was broken, and
>> recorded it for all to see.
> No, I don't claim they're the same. Different commits have different timestamps (and different commit messages, but that's not useful for automatic searching to find which commits are derived from which others). Consider if "Revert Bar" and "Bar" didn't point to their parents; could you still deduce from them that Bar was broken? Yes--on the basis of the commit timestamps (which show their temporal order) and the contents of the trees which the commits point to (which show that Revert Bar undoes a change made in Bar).

But there could be any number of unrelated commits newer than "Bar"
but older than "Revert Bar" on other branches.  Even if you could
trust the timestamps to be accurate (you can't), you still can't
determine a commit's parent unambiguously.
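
To make the ambiguity concrete, here is a minimal Python sketch. It is
not how git works; the Commit class, the timestamps, and the
"Unrelated work" commit are all made up for illustration. Given only
timestamps, every older commit is an equally plausible parent:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    # Hypothetical record: a commit stripped of its parent pointers.
    name: str
    timestamp: int  # seconds, as reported by the committer's own clock


def candidate_parents(child, commits):
    """Return the names of all commits older than `child`.

    Timestamps alone cannot narrow this list down any further.
    """
    return [c.name for c in commits if c.timestamp < child.timestamp]


history = [
    Commit("Foo", 100),
    Commit("Bar", 200),
    Commit("Unrelated work on another branch", 250),
    Commit("Revert Bar", 300),
]
revert = history[-1]

# Three candidates, and nothing in the timestamps says which one is the parent.
print(candidate_parents(revert, history))
```

And that is with clocks that agree; a skewed clock would reorder the
list as well.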

-PJ

Gehm's Corollary to Clark's Law: Any technology distinguishable from
magic is insufficiently advanced.

* Re: Does content provenance matter?
  2012-05-05 20:49 Does content provenance matter? Kelly Dean
  2012-05-07  8:23 ` Thomas Rast
@ 2012-05-07 23:12 ` Jakub Narebski
  1 sibling, 0 replies; 10+ messages in thread
From: Jakub Narebski @ 2012-05-07 23:12 UTC (permalink / raw)
  To: Kelly Dean; +Cc: git

Kelly Dean <kellydeanch@yahoo.com> writes:

> The basic question is, if provenance doesn't matter, then why does a
> git commit record its parent(s)? Why not omit this information, and
> figure it out at search time (by looking at all commits with older
> timestamps),

Because it is not possible to do this in a reliable way in the face of
parallel, concurrent, branchy development by multiple developers... some
of whom may have badly set clocks.

Also, multiple merges of the same branch.
-- 
Jakub Narebski

* Re: Does content provenance matter?
  2012-05-07 22:14     ` PJ Weisberg
@ 2012-05-07 23:13       ` Kelly Dean
  2012-05-08  0:03         ` Andrew Ardill
  2012-05-08  9:23         ` Philip Oakley
  0 siblings, 2 replies; 10+ messages in thread
From: Kelly Dean @ 2012-05-07 23:13 UTC (permalink / raw)
  To: PJ Weisberg; +Cc: git

--- On Mon, 5/7/12, PJ Weisberg <pj@irregularexpressions.net> wrote:
> But there could be any number of unrelated commits newer than "Bar"
> but older than "Revert Bar" on other branches.  Even if you could
> trust the timestamps to be accurate (you can't), you still can't
> determine a commit's parent unambiguously.
Therefore, provenance does matter, and it must be explicitly recorded because it can't necessarily be correctly and fully deduced from content alone. And git does record inter-commit provenance.
However, git doesn't record intra-commit provenance, as I mentioned in my original message. My question is: why this discrepancy? Either provenance matters, or it doesn't; why record it in one case but not the other?
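
As a rough illustration of what "recording inter-commit provenance" means here, the Python sketch below builds commit objects in git's object format (a "tree" line, zero or more "parent" lines, author, committer, message) and hashes them with SHA-1 the way git names objects. The identity string and the commit messages are assumptions made up for this example, and this is a sketch rather than git's implementation. Two commits with the same tree and message but different parents get different names:

```python
import hashlib


def git_object_id(kind, body):
    """Name an object as git does: SHA-1 over '<kind> <size>' + NUL + body."""
    header = f"{kind} {len(body)}\0".encode()
    return hashlib.sha1(header + body).hexdigest()


# Hypothetical identity and timestamp, used only for this illustration.
IDENT = "A U Thor <author@example.com> 1336250940 +0000"


def commit_body(tree, parents, message):
    """Serialize a commit: tree, parent line(s), author, committer, message."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines += [f"author {IDENT}", f"committer {IDENT}", "", message, ""]
    return "\n".join(lines).encode()


EMPTY_TREE = git_object_id("tree", b"")  # the tree with no entries

root = git_object_id("commit", commit_body(EMPTY_TREE, [], "Foo"))
with_parent = git_object_id("commit", commit_body(EMPTY_TREE, [root], "Bar"))
without_parent = git_object_id("commit", commit_body(EMPTY_TREE, [], "Bar"))

# Same tree, same message, same timestamps: only the parent line differs.
print(with_parent != without_parent)
```

The parent pointer is therefore part of the commit's identity, not a derived annotation stored alongside it.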

* Re: Does content provenance matter?
  2012-05-07 23:13       ` Kelly Dean
@ 2012-05-08  0:03         ` Andrew Ardill
  2012-05-08  9:23         ` Philip Oakley
  1 sibling, 0 replies; 10+ messages in thread
From: Andrew Ardill @ 2012-05-08  0:03 UTC (permalink / raw)
  To: Kelly Dean; +Cc: PJ Weisberg, git

On 8 May 2012 09:13, Kelly Dean <kellydeanch@yahoo.com> wrote:
>
> --- On Mon, 5/7/12, PJ Weisberg <pj@irregularexpressions.net> wrote:
> > But there could be any number of unrelated commits newer than "Bar"
> > but older than "Revert Bar" on other branches.  Even if you could
> > trust the timestamps to be accurate (you can't), you still can't
> > determine a commit's parent unambiguously.
> Therefore, provenance does matter, and it must be explicitly recorded
> because it can't necessarily be correctly and fully deduced from content
> alone. And git does record inter-commit provenance.
> However, git doesn't record intra-commit provenance, as I mentioned in my
> original message. My question is: why this discrepancy? Either provenance
> matters, or it doesn't; why record it in one case but not the other?

I don't think it is firmly decided that provenance is not important in
the intra-commit scope, rather that as you stated such information is
not available to us.

My understanding is that git makes a best-effort guess at tracking the
flow of content through the repository. If content is moved, by
deleting it in one place and adding it in another, that is easy to see
in git. However, if content is merely added, and that same content
occurs in multiple places in the repository, there is no sane way of
knowing where that content came from.
Even if the added content occurred in only one other place,
you would need to check every single file for every single hunk added
in every single commit in order to determine just where this
content came from. Why stop there, though? It's possible we are copying
the content from some other branch we don't have checked out at the
moment, so every time we commit, let's search the entire repository's
history for an occurrence of each hunk we are adding. That way lies
madness.

With regard to file renames, all that has been shown so far is that
provenance matters for commit parents. Nothing about the similarities
between the commit parent and rename situations you mention leads me
to conclude that because provenance is important to one it is
important to the other.

Indeed, one of the arguments against provenance being important in the
file rename case is that generally we can determine this information
from the existing information, as opposed to the general commit parent
case. There are additional arguments, such as simply recording file
name changes doesn't capture many situations we would like to know
about, for example when a single file is split into two files.
Tracking the content of those files, and hence being able to deduce
where their content came from, solves this and the general rename
situation. Trying to guess which file was 'renamed' and which is 'new'
when a file is actually split into two new files would lead to
misleading and incomplete information in the end.
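
As a minimal sketch of the idea (not git's actual rename detection;
the file names, contents, and the 0.5 threshold are assumptions chosen
for this example), the Python below scores each added file against each
deleted file by content similarity. When one file is split into two,
both new files trace back to the same source, which a single
"renamed from" record could not express:

```python
from difflib import SequenceMatcher


def similarity(old_lines, new_lines):
    """Similarity of two line lists, from 0.0 (disjoint) to 1.0 (identical)."""
    return SequenceMatcher(None, old_lines, new_lines, autojunk=False).ratio()


def guess_origins(deleted, added, threshold=0.5):
    """Map each added path to its most similar deleted path, if similar enough."""
    origins = {}
    for new_path, new_lines in added.items():
        scored = [(similarity(old, new_lines), path) for path, old in deleted.items()]
        if not scored:
            continue
        score, best = max(scored)
        if score >= threshold:
            origins[new_path] = best
    return origins


# Hypothetical commit: big.c is deleted and its halves reappear as two files.
deleted = {"big.c": ["a", "b", "c", "d"]}
added = {
    "first.c": ["a", "b"],
    "second.c": ["c", "d"],
    "notes.txt": ["x", "y"],
}

print(guess_origins(deleted, added))
```

Both halves are attributed to big.c, and the unrelated file is left
alone, without anything having been recorded at commit time.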

So just because provenance matters in some situations doesn't mean it
matters in all (at least in the way we have been applying 'matters'),
furthermore there are additional reasons why the existing
content-tracking system is beneficial. Extra layers of rename encoding
or the 'heritage of data chunks' would be extra work with little added
benefit (though there are a few corner cases, from memory, where
automatic rename detection fails and so /some/ benefit would be seen).

Regards,

Andrew Ardill

* Re: Does content provenance matter?
  2012-05-07  8:23 ` Thomas Rast
  2012-05-07 21:43   ` Kelly Dean
@ 2012-05-08  0:08   ` Junio C Hamano
  2012-05-08  0:11     ` Junio C Hamano
  1 sibling, 1 reply; 10+ messages in thread
From: Junio C Hamano @ 2012-05-08  0:08 UTC (permalink / raw)
  To: Thomas Rast; +Cc: Kelly Dean, git

Thomas Rast <trast@student.ethz.ch> writes:

> Kelly Dean <kellydeanch@yahoo.com> writes:
>
>> [copying B/X over to C/X is not recorded as such], on the theory that
>> just content, not provenance, is what matters.
>
>> [merging branches *is* recorded], on the theory that not only content,
>> but also provenance, matters.
>
>> The basic question is, if provenance doesn't matter, then why does a
>> git commit record its parent(s)? Why not omit this information, and
>> figure it out at search time (by looking at all commits with older
>> timestamps), the same as you're supposed to figure out renames at
>> search time and figure out the movement of lines within/among files at
>> search time (by looking at all files in the parent commit(s))?
>
> What's the difference between the following series of commits?
>
>   Foo
>   Bar
>   Revert Bar
>
> and
>
>   Foo
>
> You claim that they're the same, because the tree state after each is
> the same.  But I learned that Bar was broken, and recorded it for all to
> see.

I am not sure if that is what the original poster was claiming.

But a more illustrative situation to consider is this.  What if the change
were not just "copy B/X to C/X", but was "concatenate the first half of
B/X and the second half of C/X to create a new D/X".

As it happens, because our commit records the whole tree state and its
parent commit, the "content provenance" of what is in D/X is precisely
tracked.  Look at the tree of the parent commit and look at the result,
and you will notice the first half of D/X is identical to the first half
of B/X before the commit and the second half of D/X is identical to the
second half of C/X before the commit.

In a situation where "provenance" is disputed, it does not matter if D/X
was created by mechanically running

	head -n $n B/X >D/X
	tail -n $n C/X >>D/X
        
or if you typed the file afresh.  You could try to argue "No, your honour,
I did not copy from these two files.  I typed it myself from scratch and
there is no plagiarism involved.  They are all my words."  But in the end,
by comparing the tree state before your change and after your change, it
would be very clear to any sane person that D/X is identical to the first
half of B/X and the second half of C/X.
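
The comparison described above can be sketched in a few lines of
Python. This is only an illustration using the standard difflib
module, not anything git itself runs, and the file contents and the
value of n are made up. Given the "before" contents of B/X and C/X and
the "after" contents of D/X, it reports which runs of lines in D/X
also appear in each source:

```python
from difflib import SequenceMatcher


def copied_ranges(source, target):
    """Return (start_in_target, length) for each run of target lines found in source."""
    matcher = SequenceMatcher(None, source, target, autojunk=False)
    # get_matching_blocks() ends with a zero-length sentinel; skip it.
    return [(m.b, m.size) for m in matcher.get_matching_blocks() if m.size > 0]


# Hypothetical file contents, one list element per line.
b_x = ["alpha", "beta", "gamma", "delta"]
c_x = ["one", "two", "three", "four"]
n = 2

# What "head -n $n B/X" followed by "tail -n $n C/X" would produce.
d_x = b_x[:n] + c_x[-n:]

print(copied_ranges(b_x, d_x))  # lines of D/X that match B/X
print(copied_ranges(c_x, d_x))  # lines of D/X that match C/X
```

The result is the same whether D/X was produced by the two commands
or typed in by hand, which is the point: the trees before and after
the commit carry all the information needed.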

Also see http://article.gmane.org/gmane.comp.version-control.git/217 aka
one of the most important messages in the history of the Git mailing list
for inspiration.

* Re: Does content provenance matter?
  2012-05-08  0:08   ` Junio C Hamano
@ 2012-05-08  0:11     ` Junio C Hamano
  0 siblings, 0 replies; 10+ messages in thread
From: Junio C Hamano @ 2012-05-08  0:11 UTC (permalink / raw)
  To: Thomas Rast, Kelly Dean; +Cc: git

Junio C Hamano <gitster@pobox.com> writes:

> But a more illustrative situation to consider is this.  What if the change
> were not just "copy B/X to C/X", but was "concatenate the first half of
> B/X and the second half of C/X to create a new D/X".

A crucial question to the original poster was missing here:

    If "a system that tracks provenance better than Git" wanted to record
    something in such a situation, what does it record and how is the
    recorded information used?
    
> As it happens, because our commit records the whole tree state and its
> parent commit, the "content provenance" of what is in D/X is precisely
> tracked.  Look at the tree of the parent commit and look at the result,
> and you will notice the first half of D/X is identical to the first half
> of B/X before the commit and the second half of D/X is identical to the
> second half of C/X before the commit.
>
> In a situation where "provenance" is disputed, it does not matter if D/X
> was created by mechanically running
>
> 	head -n $n B/X >D/X
> 	tail -n $n C/X >C/X

Typo: "tail -n $n C/X >>D/X"

>         
> or if you typed the file afresh.  You could try to argue "No, your honour,
> I did not copy from these two files.  I typed it myself from scratch and
> there is no plagiarism involved.  They are all my words."  But in the end,
> by comparing the tree state before your change and after your change, it
> would be very clear to any sane person that D/X is identical to the first
> half of B/X and the second half of C/X.
>
> Also see http://article.gmane.org/gmane.comp.version-control.git/217 aka
> one of the most important messages in the history of the Git mailing list
> for inspiration.

* Re: Does content provenance matter?
  2012-05-07 23:13       ` Kelly Dean
  2012-05-08  0:03         ` Andrew Ardill
@ 2012-05-08  9:23         ` Philip Oakley
  1 sibling, 0 replies; 10+ messages in thread
From: Philip Oakley @ 2012-05-08  9:23 UTC (permalink / raw)
  To: Kelly Dean, PJ Weisberg; +Cc: Git List

From: "Kelly Dean" <kellydeanch@yahoo.com> Sent: Tuesday, May 08, 2012 12:13 AM
> --- On Mon, 5/7/12, PJ Weisberg <pj@irregularexpressions.net> wrote:
>> But there could be any number of unrelated commits newer than "Bar"
>> but older than "Revert Bar" on other branches.  Even if you could
>> trust the timestamps to be accurate (you can't), you still can't
>> determine a commit's parent unambiguously.
> Therefore, provenance does matter, and it must be explicitly recorded
> because it can't necessarily be correctly and fully deduced from content
> alone. And git does record inter-commit provenance.
> However, git doesn't record intra-commit provenance, as I mentioned in my
> original message. My question is: why this discrepancy?

>  Either provenance matters, or it doesn't;

The logic error is here. It is not an either/or: there are many other
available choices for deciding the points at which the many levels of
provenance quality decay (e.g. see [1]).

People eventually give up caring at some level of detail/history, each in a
different place ;-) It's a choice. e.g. Have you noticed all high
performance cars (Porsche?) need brightly coloured brake cylinders with
carefully specified paint jobs - why? At some point we give up caring how
someone got a few (how few?) characters into a file... It's not right, but
it's not wrong either.

I've worked with systems (e.g. DOORS) that record every keystroke, and
recorded every hunk at the undo/redo level, but for little benefit.

Git takes the approach of having lightweight (easy) branching with easy
commits and local history re-writing (rebase), to give users the ability
to balance between their WIP (work in progress) and their public record,
but with strong verification of any given history (e.g. "My Master"). It's a
choice as to who and where to blame.

>     why record it in one case but not the other?

Philip
[1] Measuring and Managing Technological Knowledge,
IEEE Engineering Management Review Winter 1997 p77-88.
Reprinted from  Sloan Management Review, Fall 1994
http://sloanreview.mit.edu/the-magazine/1994-fall/3615/measuring-and-managing-technological-knowledge/
