* git index: how does it work?
From: Shaun Cutts @ 2009-08-05 16:21 UTC (permalink / raw)
To: git
Hello,
I am wondering if someone could explain and/or point me to an explanation of how
the git index works.
For instance, suppose I have a tracked file: "foo.c"
1) [I modify "foo.c"]
2) git add foo.c
3) [modify again]
4) git commit -m "blah blah"
Since I don't include the "-a" switch, the version I added on step 2 is
committed. But how does the index keep track of these changes? Does the index
file actually contain the hunks of "foo.c" that have been modified? Or is there
a "temporary" blob created, which the index points to?
In either case, is there some interface to access these hunks and/or get a
reference to the blob?
Thanks,
-- Shaun
PS I'm considering writing an extension to git where the "diff" understands the
semantics of certain types of files: hunks wouldn't just be textual blobs but
would try to represent a minimal change from one version to the next based on an
edit distance, so that, e.g. changing the location of a function would be
represented by a "move" edit, rather than two text changes.
I have been building a prototype as a wrapper around git, intervening to store
extra information, etc., before passing commands on to git. Blobs, commits, etc.
are nice abstractions I can leave as is, but the index seems sort of foggy to
me. Any advice appreciated!
* Re: git index: how does it work?
From: Junio C Hamano @ 2009-08-05 18:00 UTC (permalink / raw)
To: Shaun Cutts; +Cc: git
Shaun Cutts <shaun@cuttshome.net> writes:
> I am wondering if someone could explain and/or point me to an explanation of how
> the git index works.
>
> For instance, suppose I have a tracked file: "foo.c"
>
> 1) [I modify "foo.c"]
> 2) git add foo.c
> 3) [modify again]
> 4) git commit -m "blah blah"
>
> Since I don't include the "-a" switch, the version I added on step 2 is
> committed. But how does the index keep track of these changes? Does the index
> file actually contain the hunks of "foo.c" that have been modified? Or is there
> a "temporary" blob created, which the index points to?
Step 2 hashes foo.c, creates a blob object from its contents, and registers
that blob in the index. Step 4 writes out the index as a tree object and
makes a commit out of it.
Running this sequence might be instructive.
1$ edit foo.c
2$ git add foo.c
2a$ git ls-files -s foo.c
2b$ git diff foo.c
2c$ git diff --cached foo.c
3$ edit foo.c
3a$ git ls-files -s foo.c
3b$ git diff foo.c
3c$ git diff --cached foo.c
4$ git commit -m 'half-edit of foo.c'
4a$ git ls-files -s foo.c
4b$ git ls-tree HEAD foo.c
4c$ git diff foo.c
4d$ git diff --cached foo.c
- 2a shows the actual blob object that was created out of foo.c in step 2.
- 2b shows the difference between that blob (now in the index) and foo.c,
  which should be empty.
- 2c shows the difference between the HEAD commit and the index, which
  should show your edit in step 1.
- 3a shows the blob in the index; you haven't added, so it should show
  the same as 2a.
- 3b shows the difference between the index and foo.c, which should show
  the edit in step 3.
- 3c shows the difference between the HEAD commit and the index, which
  should show your edit in step 1.
- 4a shows the blob in the index; you haven't added, so it should show
  the same as 2a.
- 4b shows the blob in the committed tree and the blob object should be
  identical to 2a.
- 4c shows the difference between the index and foo.c, which should show
  the edit in step 3.
- 4d shows the difference between the HEAD commit and the index, which
  should now be empty.
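For anyone who wants to replay the sequence above non-interactively, here is a
rough sketch. It assumes a throwaway repository made with mktemp, made-up file
contents, and a placeholder identity; "edit foo.c" is simulated by appending a
line. The variable names (blob_2a and so on) are only labels matching the
steps above, not anything git defines.

```shell
#!/bin/sh
# Sketch: replay steps 1-4 in a scratch repository and record the blob IDs.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name Example              # placeholder identity
git config user.email example@example.com
echo 'int main(void) { return 0; }' > foo.c
git add foo.c
git commit -q -m 'initial foo.c'

# Steps 1 and 2: edit, then stage. "git add" writes a blob object and
# records its ID in the index entry for foo.c.
echo '/* first edit */' >> foo.c
git add foo.c
blob_2a=$(git ls-files -s foo.c | awk '{print $2}')

# Step 3: edit again without staging. The index entry is unchanged.
echo '/* second edit */' >> foo.c
blob_3a=$(git ls-files -s foo.c | awk '{print $2}')

# Step 4: commit. The tree of HEAD now names the blob staged in step 2.
git commit -q -m 'half-edit of foo.c'
blob_4b=$(git ls-tree HEAD foo.c | awk '{print $3}')

echo "2a: $blob_2a"
echo "3a: $blob_3a"
echo "4b: $blob_4b"

# The staged content is an ordinary object and can be read back by ID.
git cat-file -p "$blob_2a"

# The working tree still carries the second edit (compare 4c).
git diff --quiet foo.c || echo "foo.c differs from the index"
```

The three IDs printed should be the same, which is the point of 3a, 4a and 4b:
the index holds a reference to a complete blob, not a set of hunks.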
* Re: git index: how does it work?
From: Sverre Rabbelier @ 2009-08-05 18:21 UTC (permalink / raw)
To: Shaun Cutts; +Cc: Git List, Johannes Schindelin, Daniel Barkalow
Heya,
On Wed, Aug 5, 2009 at 09:21, Shaun Cutts<shaun@cuttshome.net> wrote:
> PS I'm considering writing an extension to git where the "diff" understands the
> semantics of certain types of files: hunks wouldn't just be textual blobs but
> would try to represent a minimal change from one version to the next based on an
> edit distance, so that, e.g. changing the location of a function would be
> represented by a "move" edit, rather than two text changes.
This sounds very similar to what Daniel was discussing in "[PATCH 2/3
v3] Use an external program to implement fetching with curl git" [0].
If you're truly interested in doing this, please do keep me posted
(and I suspect Dscho might also be interested in being cc-ed) :).
[0] http://thread.gmane.org/gmane.comp.version-control.git/124503
--
Cheers,
Sverre Rabbelier
* Re: git index: how does it work?
From: Shaun Cutts @ 2009-08-05 19:31 UTC (permalink / raw)
To: Sverre Rabbelier; +Cc: Git List, Johannes Schindelin, Daniel Barkalow
I'll be happy to keep you posted....
... I'll put up a description once I get things worked out a bit more.
It will take me a month or two, though, probably.
... but as a quickie... :)
The general idea is to use actual syntax parsing to understand what
happens in particular files, but be able to fall back on text if
necessary. (Maybe "smarter text" as described by Daniel would be an
intermediate fallback step.)
No matter what the target language, files have a hierarchical
organization (at least as far as I am going to care about :)). My idea
is to write a "delta" in yaml with the tree-edit operations, as a
universal representation of changes. This could be edited by the user
if necessary -- for example: a move with edits in it might not be
detected, but the user could explicitly replace the delete/add pair
with the move/edit. Tools would be provided to verify that the edited
deltas actually produce the changes stated (& update them to capture
the next set of deltas, etc.)
Suggestions from you guys as to the best way to tie this in would be
greatly appreciated. I think the analysis of particular file types
should only be loosely coupled with the rest of the system, though, as
otherwise it will create a rats' nest.
Ideally, there would be a mechanism for an outside diff tool to
specify "these are the hunks", and to register a utility to apply
them... the smart diff tool would use the yaml tree-operation format
and have its own registry (or config section) for how to analyze
particular file types.
The diff tool would also be coupled with a merge tool... in general,
it would be nice if there were more hooks for providing specialized
diff & merge.
-- Shaun
* Re: git index: how does it work?
From: Shaun Cutts @ 2009-08-12 11:52 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git
Junio,
Your advice was very helpful.
Digging in, however, I find I still am in the dark on one point: how
does the index track renamed files, and how to query it for
information about them?
For instance, if I add a 5th step to the sequence:
5) git mv foo.c bar.c
Then I am told by "git status" that the file is renamed, but I can't
seem to elicit this info using "git ls-files". Under some circumstances
even "git status" lists a new and deleted file after a rename.
Are renames being tracked by the index, and is there a more basic
interface than "status" to query about them?
Thanks for any help,
--- Shaun
* Re: git index: how does it work?
From: Sverre Rabbelier @ 2009-08-12 17:47 UTC (permalink / raw)
To: Shaun Cutts; +Cc: Junio C Hamano, git
Heya,
On Wed, Aug 12, 2009 at 04:52, Shaun Cutts<shaun@cuttshome.net> wrote:
> Are renames being tracked by the index, and is there a more basic interface
> than "status" to query about them?
Nope, git never explicitly tracks renames. Try this:
$ mv foo bar
$ git rm --cached foo
$ git add bar
$ git status
It'll tell you that you renamed foo to bar, even if you never executed 'git mv'.
This is because git does rename _detection_, that is, it'll notice
that you have another file with (almost) the same contents, so it
assumes you did a rename.
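A self-contained version of that experiment, as a sketch only: the scratch
repository, the file contents and the placeholder identity are assumptions,
and "git status --short" is used here just to get a one-line report.

```shell
#!/bin/sh
# Sketch: a rename done with plain mv is still reported as a rename.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name Example              # placeholder identity
git config user.email example@example.com
printf 'line %s of some content\n' 1 2 3 4 5 6 7 8 > foo
git add foo
git commit -q -m 'add foo'
blob_before=$(git ls-files -s foo | awk '{print $2}')

# Rename without ever running "git mv".
mv foo bar
git rm -q --cached foo
git add bar
blob_after=$(git ls-files -s bar | awk '{print $2}')

# The index lists only bar, pointing at the same blob foo used to have.
# Nothing in it says "renamed from foo".
git ls-files -s

# status compares HEAD with the index and pairs the deletion with the
# addition, so it reports a rename.
status_line=$(git status --short)
echo "$status_line"
```

The index entry for bar carries the same blob ID that foo had, and nothing
else; the "renamed" line comes purely from comparing the two states.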
--
Cheers,
Sverre Rabbelier
* Re: git index: how does it work?
From: Shaun Cutts @ 2009-08-12 18:45 UTC (permalink / raw)
To: Sverre Rabbelier; +Cc: Git List
Aha ---
that explains it, then.
Is there a lower-level interface to rename detection than via
"status"? And... um... hmmm.... how does it work? The hash codes don't
help for "almost" the same. Is there an approximate string matching
algorithm built in somewhere?
Thanks,
-- Shaun
* Re: git index: how does it work?
From: Björn Steinbrink @ 2009-08-12 19:39 UTC (permalink / raw)
To: Shaun Cutts; +Cc: Sverre Rabbelier, Git List
[Please don't top-post, fixed that for you...]
On 2009.08.12 20:45:48 +0200, Shaun Cutts wrote:
> On Aug 12, 2009, at 7:47 PM, Sverre Rabbelier wrote:
> >Heya,
> >
> >On Wed, Aug 12, 2009 at 04:52, Shaun Cutts<shaun@cuttshome.net> wrote:
> >>Are renames being tracked by the index, and is there a more
> >>basic interface
> >>than "status" to query about them?
> >
> >Nope, git never explicitly tracks renames. Try this:
> >$ mv foo bar
> >$ git rm --cached foo
> >$ git add bar
> >$ git status
> >
> >It'll tell you that you renamed foo to bar, even if you never
> >executed 'git mv'.
> >
> >This is because git does rename _detection_, that is, it'll notice
> >that you have another file with (almost) the same contents, so it
> >assumes you did a rename.
>
> Aha ---
>
> that explains it, then.
>
> Is there a lower-level interface to rename detection than via
> "status"? And... um... hmmm.... how does it work? The hash codes
> don't help for "almost" the same. Is there an approximate string
> matching algorithm built in somewhere?
Roughly, it works like this:
The files are split into small chunks, those are hashed, and if there
are chunks with the same hash in both files, those chunks are treated as
being common to both files. The more the files have in common, the
higher the similarity score. See estimate_similarity() for details.
Because the splitting also happens at newlines, this has some interesting
effects. For example, you can completely reorder the lines in a file
after renaming it, and git will still detect it as a rename.
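The reordering effect can be tried out with something like the following
sketch. The file names, the generated contents and the use of "sort -r" to
shuffle the lines are all just assumptions for the demonstration.

```shell
#!/bin/sh
# Sketch: rename a file and reorder every line; git should still pair
# the old and new paths as a rename.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name Example              # placeholder identity
git config user.email example@example.com

# Forty distinct lines, so that each chunk hashes differently.
i=1
while [ "$i" -le 40 ]; do
    echo "this is line number $i of the file"
    i=$((i + 1))
done > old.txt
git add old.txt
git commit -q -m 'add old.txt'

# "Rename" the file and put its lines in a different order.
sort -r old.txt > new.txt
rm old.txt
git add -A

# The blobs differ, so this is not an exact-match rename ...
old_blob=$(git rev-parse HEAD:old.txt)
new_blob=$(git hash-object new.txt)
echo "old blob: $old_blob"
echo "new blob: $new_blob"

# ... but the similarity estimate still pairs the two paths.
result=$(git diff --cached -M --name-status)
echo "$result"
```

The status letter in the last line of output should be R, even though the two
blob IDs are different.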
Björn
* Re: git index: how does it work?
From: Junio C Hamano @ 2009-08-12 20:31 UTC (permalink / raw)
To: Shaun Cutts; +Cc: git
Shaun Cutts <shaun@cuttshome.net> writes:
> Are renames being tracked by the index, and is there a more basic
> interface than "status" to query about them?
No. Neither the index nor git in general ever records renames. Git records
contents, not content changes. The index records a state, and so does the
tree object pointed at by the HEAD commit.
When you ask for "status", git will notice that you have lost a file, and
added a new one, between these two states, by comparing them. The
contents of these lost files and added files are then compared, and ones
with similar contents are paired up.
That way, you do not have to use "git mv A B" to "rename" A to B. You can
just as well "mv A B; git rm A; git add B", and get the same outcome,
exactly because git does not record renames.
Instead, we track renames by deducing from the result that you renamed something.
The tree-vs-index comparison "git status" does to figure all this out is
"git diff-index -M --cached HEAD".
As should be obvious from the above description,
git diff-index -M --cached HEAD -- A
is *NOT* the way for you to ask about "possible renames of A". You need
to run the diff for the whole tree without path limitation so that you can
pair deletions and creations up in order to deduce renames.
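To see the difference the path limitation makes, a sketch along these lines
may help. The scratch repository and file contents are assumptions, and
--name-status is added here only to keep the output to one line per path.

```shell
#!/bin/sh
# Sketch: rename detection needs both sides of the pair in view.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name Example              # placeholder identity
git config user.email example@example.com
printf 'line %s of some content\n' 1 2 3 4 5 6 7 8 > A
git add A
git commit -q -m 'add A'
git mv A B

# Whole tree: the deletion of A and the creation of B are both visible,
# so they can be paired up as a rename.
whole_tree=$(git diff-index -M --cached --name-status HEAD)

# Limited to A: only the deletion is visible, so there is nothing to
# pair it with.
limited=$(git diff-index -M --cached --name-status HEAD -- A)

echo "whole tree:   $whole_tree"
echo "limited to A: $limited"
```

The first line should report a rename from A to B, the second only a deletion
of A.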
Thread overview: 9+ messages
2009-08-05 16:21 git index: how does it work? Shaun Cutts
2009-08-05 18:00 ` Junio C Hamano
2009-08-12 11:52 ` Shaun Cutts
2009-08-12 17:47 ` Sverre Rabbelier
2009-08-12 18:45 ` Shaun Cutts
2009-08-12 19:39 ` Björn Steinbrink
2009-08-12 20:31 ` Junio C Hamano
2009-08-05 18:21 ` Sverre Rabbelier
2009-08-05 19:31 ` Shaun Cutts