* git index: how does it work?
@ 2009-08-05 16:21 Shaun Cutts
  2009-08-05 18:00 ` Junio C Hamano
  2009-08-05 18:21 ` Sverre Rabbelier
  0 siblings, 2 replies; 9+ messages in thread
From: Shaun Cutts @ 2009-08-05 16:21 UTC (permalink / raw)
  To: git

Hello,

I am wondering if someone could explain and/or point me to an explanation of how
the git index works.

For instance, suppose I have a tracked file: "foo.c"

1) [I modify "foo.c"]
2) git add foo.c
3) [modify again]
4) git commit -m "blah blah"

Since I don't include the "-a" switch, the version I added on step 2 is
committed. But how does the index keep track of these changes? Does the index
file actually contain the hunks of "foo.c" that have been modified? Or is there
a "temporary" blob created, which the index points to? 

In either case, is there some interface to access these hunks and/or get a
reference to the blob?

Thanks,

-- Shaun

PS I'm considering writing an extension to git where the "diff" understands the
semantics of certain types of files: hunks wouldn't just be textual blobs but
would try to represent a minimal change from one version to the next based on an
edit distance, so that, e.g. changing the location of a function would be
represented by a "move" edit, rather than two text changes.

I have been building a prototype as a wrapper around git, intervening to store
extra information, etc before passing commands on to git. Blobs, commits, etc
are nice abstractions I can leave as is, but the index seems sort of foggy to
me. Any advice appreciated!


* Re: git index: how does it work?
  2009-08-05 16:21 git index: how does it work? Shaun Cutts
@ 2009-08-05 18:00 ` Junio C Hamano
  2009-08-12 11:52   ` Shaun Cutts
  2009-08-05 18:21 ` Sverre Rabbelier
  1 sibling, 1 reply; 9+ messages in thread
From: Junio C Hamano @ 2009-08-05 18:00 UTC (permalink / raw)
  To: Shaun Cutts; +Cc: git

Shaun Cutts <shaun@cuttshome.net> writes:

> I am wondering if someone could explain and/or point me to an explanation of how
> the git index works.
>
> For instance, suppose I have a tracked file: "foo.c"
>
> 1) [I modify "foo.c"]
> 2) git add foo.c
> 3) [modify again]
> 4) git commit -m "blah blah"
>
> Since I don't include the "-a" switch, the version I added on step 2 is
> committed. But how does the index keep track of these changes? Does the index
> file actually contain the hunks of "foo.c" that have been modified? Or is there
> a "temporary" blob created, which the index points to? 

Step 2 hashes foo.c and creates a blob object and registers it to the
index.  Step 4 writes out the index as a tree and makes a commit out of
it.
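
To illustrate the "hashes foo.c and creates a blob" part, here is a minimal
Python sketch of how a blob's object name is derived. It assumes the SHA-1
object format, where the hashed data is the header "blob <size>\0" followed
by the file content. The function name blob_id is made up for this example;
"git hash-object foo.c" should report the same name for the same bytes.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Return the SHA-1 object name of a blob holding `content`.

    Assumes the SHA-1 object format: "blob <size>\\0" + content.
    """
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


if __name__ == "__main__":
    # The empty blob has a well-known object name.
    print(blob_id(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Because the name depends only on the content, adding the same bytes twice
yields the same blob, and the index only needs to store that name next to
the path.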

Running this sequence might be instructive.

        1$ edit foo.c
        2$ git add foo.c
        2a$ git ls-files -s foo.c
        2b$ git diff foo.c
        2c$ git diff --cached foo.c
        3$ edit foo.c
        3a$ git ls-files -s foo.c
        3b$ git diff foo.c
        3c$ git diff --cached foo.c
        4$ git commit -m 'half-edit of foo.c'
        4a$ git ls-files -s foo.c
        4b$ git ls-tree HEAD foo.c
        4c$ git diff foo.c
        4d$ git diff --cached foo.c

 - 2a shows the actual blob object that was created out of foo.c in step 2.

 - 2b shows the difference between that blob (now in the index) and foo.c,
   which should be empty.

 - 2c shows the difference between the HEAD commit and the index, which
   should show your edit in step 1.

 - 3a shows the blob in the index; you haven't run "git add" again, so it
   should show the same blob as 2a.

 - 3b shows the difference between the index and foo.c, which should show
   the edit in step 3.

 - 3c shows the difference between the HEAD commit and the index, which
   should show your edit in step 1.

 - 4a shows the blob in the index; committing does not change the index,
   so it should still show the same blob as 2a.

 - 4b shows the blob in the committed tree and the blob object should be
   identical to 2a.

 - 4c shows the difference between the index and foo.c, which should show
   the edit in step 3.

 - 4d shows the difference between the HEAD commit and the index, which
   should now be empty.
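
The expectations above can be reproduced with a toy model. This is only a
sketch, not how git is implemented: it treats the working tree, the index
and the HEAD tree as three dictionaries for a single path, and all names in
it (blob_id, worktree, index, head and the two helper functions) are
invented for the illustration.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """SHA-1 object name of a blob (assumes the SHA-1 object format)."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# Three states. The worktree holds file content; index and HEAD hold blob names.
worktree = {"foo.c": b"v0\n"}
index = {"foo.c": blob_id(b"v0\n")}
head = {"foo.c": blob_id(b"v0\n")}


def worktree_differs_from_index(path: str) -> bool:
    """Roughly what "git diff <path>" reports on."""
    return blob_id(worktree[path]) != index[path]


def index_differs_from_head(path: str) -> bool:
    """Roughly what "git diff --cached <path>" reports on."""
    return index[path] != head[path]


def snapshot(path: str) -> tuple:
    return (worktree_differs_from_index(path), index_differs_from_head(path))


worktree["foo.c"] = b"v1\n"                  # step 1: edit foo.c
index["foo.c"] = blob_id(worktree["foo.c"])  # step 2: git add foo.c
after_add = snapshot("foo.c")                # 2b empty, 2c shows the edit

worktree["foo.c"] = b"v2\n"                  # step 3: edit foo.c again
after_edit = snapshot("foo.c")               # 3b and 3c both show something

head = dict(index)                           # step 4: commit writes the index out
after_commit = snapshot("foo.c")             # 4c shows step 3, 4d empty

if __name__ == "__main__":
    print(after_add, after_edit, after_commit)
```

Each tuple is (worktree differs from index, index differs from HEAD), so
the three snapshots correspond to the 2b/2c, 3b/3c and 4c/4d pairs.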


* Re: git index: how does it work?
  2009-08-05 16:21 git index: how does it work? Shaun Cutts
  2009-08-05 18:00 ` Junio C Hamano
@ 2009-08-05 18:21 ` Sverre Rabbelier
  2009-08-05 19:31   ` Shaun Cutts
  1 sibling, 1 reply; 9+ messages in thread
From: Sverre Rabbelier @ 2009-08-05 18:21 UTC (permalink / raw)
  To: Shaun Cutts; +Cc: Git List, Johannes Schindelin, Daniel Barkalow

Heya,

On Wed, Aug 5, 2009 at 09:21, Shaun Cutts<shaun@cuttshome.net> wrote:
> PS I'm considering writing an extension to git where the "diff" understands the
> semantics of certain types of files: hunks wouldn't just be textual blobs but
> would try to represent a minimal change from one version to the next based on an
> edit distance, so that, e.g. changing the location of a function would be
> represented by a "move" edit, rather than two text changes.

This sounds very similar to what Daniel was discussing in "[PATCH 2/3
v3] Use an external program to implement fetching with curl git" [0],
if you're truly interested in doing this, please do keep me posted
(and I suspect Dscho might also be interested in being cc-ed) :).

[0] http://thread.gmane.org/gmane.comp.version-control.git/124503

-- 
Cheers,

Sverre Rabbelier


* Re: git index: how does it work?
  2009-08-05 18:21 ` Sverre Rabbelier
@ 2009-08-05 19:31   ` Shaun Cutts
  0 siblings, 0 replies; 9+ messages in thread
From: Shaun Cutts @ 2009-08-05 19:31 UTC (permalink / raw)
  To: Sverre Rabbelier; +Cc: Git List, Johannes Schindelin, Daniel Barkalow

I'll be happy to keep you posted....

... I'll put up a description once I get things worked out a bit more.  
It will take me a month or two, though, probably.

... but as a quickie... :)

The general idea is to use actual syntax parsing to understand what  
happens in particular files, but be able to fall back on text if  
necessary. (Maybe "smarter text" as described by Daniel would be an  
intermediate fallback step.)

No matter what the target language, files have a hierarchical  
organization (at least as far as I am going to care about :)). My idea  
is to write a "delta" in yaml with the tree-edit operations, as a  
universal representation of changes. This could be edited by the user  
if necessary -- for example: a move with edits in it might not be  
detected, but the user could explicitly replace the delete/add pair  
with the move/edit. Tools would be provided to verify that the edited  
deltas actually produce the changes stated (& update them to capture  
the next set of deltas, etc.)

Suggestions from you guys as to the best way to tie this in would be  
greatly appreciated. I think the analysis of particular file types  
should only be loosely coupled with the rest of the system, though, as  
otherwise it will create a rats' nest.

Ideally, there would be a mechanism for an outside diff tool to  
specify "these are the hunks", and to register a utility to apply  
them... the smart diff tool would use the yaml tree-operation format  
and have its own registry (or config section) for how to analyze  
particular file types.

The diff tool would also be coupled with a merge tool... in general,  
it would be nice if there were more hooks for providing specialized  
diff & merge.

-- Shaun

* Re: git index: how does it work?
  2009-08-05 18:00 ` Junio C Hamano
@ 2009-08-12 11:52   ` Shaun Cutts
  2009-08-12 17:47     ` Sverre Rabbelier
  2009-08-12 20:31     ` Junio C Hamano
  0 siblings, 2 replies; 9+ messages in thread
From: Shaun Cutts @ 2009-08-12 11:52 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

Junio,

Your advice was very helpful.

Digging in, however, I find I still am in the dark on one point: how  
does the index track renamed files, and how to query it for  
information about them?

For instance, if I add a 5th step to the sequence:

5) git mv foo.c bar.c

Then I am told by "git status" that the file is renamed, but I can't  
seem to elicit this info using "git ls-files". Under some circumstances
even "git status" lists a new and deleted file after a rename.

Are renames being tracked by the index, and is there a more basic  
interface than "status" to query about them?

Thanks for any help,
--- Shaun

* Re: git index: how does it work?
  2009-08-12 11:52   ` Shaun Cutts
@ 2009-08-12 17:47     ` Sverre Rabbelier
  2009-08-12 18:45       ` Shaun Cutts
  2009-08-12 20:31     ` Junio C Hamano
  1 sibling, 1 reply; 9+ messages in thread
From: Sverre Rabbelier @ 2009-08-12 17:47 UTC (permalink / raw)
  To: Shaun Cutts; +Cc: Junio C Hamano, git

Heya,

On Wed, Aug 12, 2009 at 04:52, Shaun Cutts<shaun@cuttshome.net> wrote:
> Are renames being tracked by the index, and is there a more basic interface
> than "status" to query about them?

Nope, git never explicitly tracks renames. Try this:
$ mv foo bar
$ git rm --cached foo
$ git add bar
$ git status

It'll tell you that you renamed foo to bar, even if you never executed 'git mv'.

This is because git does rename _detection_, that is, it'll notice
that you have another file with (almost) the same contents, so it
assumes you did a rename.

-- 
Cheers,

Sverre Rabbelier


* Re: git index: how does it work?
  2009-08-12 17:47     ` Sverre Rabbelier
@ 2009-08-12 18:45       ` Shaun Cutts
  2009-08-12 19:39         ` Björn Steinbrink
  0 siblings, 1 reply; 9+ messages in thread
From: Shaun Cutts @ 2009-08-12 18:45 UTC (permalink / raw)
  To: Sverre Rabbelier; +Cc: Git List

Aha ---

that explains it, then.

Is there a lower-level interface to rename detection than via  
"status"? And... um... hmmm.... how does it work? The hash codes don't  
help for "almost" the same. Is there an approximate string matching  
algorithm built in somewhere?

Thanks,

-- Shaun

* Re: git index: how does it work?
  2009-08-12 18:45       ` Shaun Cutts
@ 2009-08-12 19:39         ` Björn Steinbrink
  0 siblings, 0 replies; 9+ messages in thread
From: Björn Steinbrink @ 2009-08-12 19:39 UTC (permalink / raw)
  To: Shaun Cutts; +Cc: Sverre Rabbelier, Git List

[Please don't top-post, fixed that for you...]

On 2009.08.12 20:45:48 +0200, Shaun Cutts wrote:
> On Aug 12, 2009, at 7:47 PM, Sverre Rabbelier wrote:
> >Heya,
> >
> >On Wed, Aug 12, 2009 at 04:52, Shaun Cutts<shaun@cuttshome.net> wrote:
> >>Are renames being tracked by the index, and is there a more
> >>basic interface
> >>than "status" to query about them?
> >
> >Nope, git never explicitly tracks renames. Try this:
> >$ mv foo bar
> >$ git rm --cached foo
> >$ git add bar
> >$ git status
> >
> >It'll tell you that you renamed foo to bar, even if you never
> >executed 'git mv'.
> >
> >This is because git does rename _detection_, that is, it'll notice
> >that you have another file with (almost) the same contents, so it
> >assumes you did a rename.
>
> Aha ---
> 
> that explains it, then.
> 
> Is there a lower-level interface to rename detection than via
> "status"? And... um... hmmm.... how does it work? The hash codes
> don't help for "almost" the same. Is there an approximate string
> matching algorithm built in somewhere?

Roughly, it works like this:
The files are split into small chunks, those are hashed, and if there
are chunks with the same hash in both files, those chunks are treated as
being common to both files. The more the files have in common, the
higher the similarity score. See estimate_similarity() for details.

As the splitting also happens at newlines this has some interesting
effects, for example, you can completely reorder the lines in a file
after renaming it, and git will still detect it as a rename.
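
To make the reordering effect concrete, here is a deliberately simplified
Python sketch. It is not git's estimate_similarity(): it assumes chunks are
split only at newlines, uses the lines themselves as dictionary keys instead
of computing explicit chunk hashes, and scores by shared bytes over the size
of the larger file. The function name line_similarity is invented for this
example.

```python
from collections import Counter


def line_similarity(old: bytes, new: bytes) -> float:
    """Share of bytes found in both files, treating each line as one chunk.

    Simplified stand-in for a chunk-hash similarity score; chunk order is
    ignored because only the multiset of chunks is compared.
    """
    old_chunks = Counter(old.splitlines(keepends=True))
    new_chunks = Counter(new.splitlines(keepends=True))
    common = old_chunks & new_chunks  # multiset intersection
    shared_bytes = sum(len(chunk) * count for chunk, count in common.items())
    larger = max(len(old), len(new))
    return 1.0 if larger == 0 else shared_bytes / larger


if __name__ == "__main__":
    original = b"alpha\nbeta\ngamma\n"
    reordered = b"gamma\nalpha\nbeta\n"
    print(line_similarity(original, reordered))
    print(line_similarity(original, b"something else\n"))
```

Since only the set of chunks matters and not their order, the reordered
file scores as identical to the original in this model.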

Björn


* Re: git index: how does it work?
  2009-08-12 11:52   ` Shaun Cutts
  2009-08-12 17:47     ` Sverre Rabbelier
@ 2009-08-12 20:31     ` Junio C Hamano
  1 sibling, 0 replies; 9+ messages in thread
From: Junio C Hamano @ 2009-08-12 20:31 UTC (permalink / raw)
  To: Shaun Cutts; +Cc: git

Shaun Cutts <shaun@cuttshome.net> writes:

> Are renames being tracked by the index, and is there a more basic
> interface than "status" to query about them?

No.  Neither the index nor git in general ever records renames.  git records
contents, not content changes.  The index records a state, and so does the
tree object pointed at by the HEAD commit.

When you ask for "status", git will notice that you have lost a file, and
added a new one, between these two states, by comparing them.  The
contents of these lost files and added files are then compared, and ones
with similar contents are paired up.

That way, you do not have to use "git mv A B" to "rename" A to B.  You can
just as well "mv A B; git rm A; git add B", and get the same outcome,
exactly because git does not record renames.

Instead, we deduce from the result that you renamed a file.

The tree-vs-index comparison "git status" does to figure all this out is
"git diff-index -M --cached HEAD".

As it should be obvious from the above description,

	git diff-index -M --cached HEAD -- A

is *NOT* the way for you to ask about "possible renames of A".  You need
to run the diff for the whole tree without path limitation so that you can
pair deletions and creations up in order to deduce renames.
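
The pairing step, and the reason path limitation defeats it, can be sketched
in Python. This is a toy model under stated assumptions, not git's code: the
two states are plain dictionaries mapping path to content, similarity is a
simplified line-based score, and the names similarity, detect_renames and
the 0.5 threshold are all chosen for the illustration.

```python
from collections import Counter


def similarity(old: bytes, new: bytes) -> float:
    """Simplified score: shared line bytes over the size of the larger file."""
    common = Counter(old.splitlines(keepends=True)) & Counter(
        new.splitlines(keepends=True)
    )
    shared = sum(len(chunk) * count for chunk, count in common.items())
    larger = max(len(old), len(new))
    return 1.0 if larger == 0 else shared / larger


def detect_renames(head: dict, index: dict, threshold: float = 0.5) -> list:
    """Pair paths missing from `index` with paths new in `index`."""
    deleted = sorted(path for path in head if path not in index)
    added = sorted(path for path in index if path not in head)
    renames = []
    for old_path in deleted:
        # Pick the most similar newly added path, if there is any candidate.
        best = max(
            added,
            key=lambda path: similarity(head[old_path], index[path]),
            default=None,
        )
        if best is not None and similarity(head[old_path], index[best]) >= threshold:
            renames.append((old_path, best))
            added.remove(best)
    return renames


head = {"A": b"one\ntwo\nthree\n", "C": b"other\n"}
index = {"B": b"one\ntwo\nthree\nfour\n", "C": b"other\n"}

whole_tree = detect_renames(head, index)

# Limiting both states to path "A" hides the creation of "B".
limited = detect_renames(
    {p: c for p, c in head.items() if p == "A"},
    {p: c for p, c in index.items() if p == "A"},
)

if __name__ == "__main__":
    print(whole_tree, limited)
```

With the whole tree, the deletion of A is paired with the creation of B.
With the states limited to A, only the deletion is visible and there is
nothing left to pair it with.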

