All of lore.kernel.org
 help / color / mirror / Atom feed
* git reset for index restoration?
@ 2014-05-22 16:22 David Turner
  2014-05-22 16:46 ` Jeff King
  2014-05-22 16:46 ` Elijah Newren
  0 siblings, 2 replies; 22+ messages in thread
From: David Turner @ 2014-05-22 16:22 UTC (permalink / raw)
  To: git mailing list

If I have a git repository with a clean working tree, and I delete the
index, then I can use git reset (with no arguments) to recreate it.
However, when I do recreate it, it doesn't come back the same.  I have
not analyzed this in detail, but the effect is that commands like git
status take much longer because they must read objects out of a pack
file.  In other words, the index seems to not realize that the index (or
at least most of it) represents the same state as HEAD.  If I do git
reset --hard, the index is restored to the original state (it's
byte-for-byte identical), and the pack file is no longer read.

Before I try to dig in to why this should be so, does anyone happen to
know off the top of their head?  Does this constitute a bug in git, or a
bug in my understanding of git?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: git reset for index restoration?
  2014-05-22 16:22 git reset for index restoration? David Turner
@ 2014-05-22 16:46 ` Jeff King
  2014-05-22 18:08   ` David Turner
  2014-05-22 16:46 ` Elijah Newren
  1 sibling, 1 reply; 22+ messages in thread
From: Jeff King @ 2014-05-22 16:46 UTC (permalink / raw)
  To: David Turner; +Cc: git mailing list

On Thu, May 22, 2014 at 12:22:43PM -0400, David Turner wrote:

> If I have a git repository with a clean working tree, and I delete the
> index, then I can use git reset (with no arguments) to recreate it.
> However, when I do recreate it, it doesn't come back the same.  I have
> not analyzed this in detail, but the effect is that commands like git
> status take much longer because they must read objects out of a pack
> file.  In other words, the index seems to not realize that the index (or
> at least most of it) represents the same state as HEAD.  If I do git
> reset --hard, the index is restored to the original state (it's
> byte-for-byte identical), and the pack file is no longer read.

Are you sure it's reading a packfile? I would expect that it is rather
reading the files in the working tree. One of the functions of the index
is to maintain a cache of the sha1 of the working tree file content and
the stat information. Once that is populated, a subsequent "git status"
can then just lstat() the files to see that its sha1 cache is up to
date.

-Peff

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: git reset for index restoration?
  2014-05-22 16:22 git reset for index restoration? David Turner
  2014-05-22 16:46 ` Jeff King
@ 2014-05-22 16:46 ` Elijah Newren
  2014-05-22 18:17   ` David Turner
  1 sibling, 1 reply; 22+ messages in thread
From: Elijah Newren @ 2014-05-22 16:46 UTC (permalink / raw)
  To: David Turner; +Cc: git mailing list

On Thu, May 22, 2014 at 9:22 AM, David Turner <dturner@twopensource.com> wrote:
> If I have a git repository with a clean working tree, and I delete the
> index, then I can use git reset (with no arguments) to recreate it.
> However, when I do recreate it, it doesn't come back the same.  I have
> not analyzed this in detail, but the effect is that commands like git
> status take much longer because they must read objects out of a pack
> file.  In other words, the index seems to not realize that the index (or
> at least most of it) represents the same state as HEAD.  If I do git
> reset --hard, the index is restored to the original state (it's
> byte-for-byte identical), and the pack file is no longer read.
>
> Before I try to dig in to why this should be so, does anyone happen to
> know off the top of their head?  Does this constitute a bug in git, or a
> bug in my understanding of git?

It's not a bug.  The index has additional stat-info it tracks about
files -- inode number, mtime, etc. -- so that it can quickly check
whether files are up to date without comparing full file contents in
the working copy to the relevant version from .git/objects.

'git reset' means make the index match the commit in HEAD.  It implies
nothing about the working copy files, which could be different.
Although you say that you have a clean working tree, git doesn't check
to verify that, so it has to treat these files as stat-dirty until the
next operation (e.g. git status) fills these in -- an operation that
involves full comparison of the files in the working copy with the
relevant version of the file from under .git/objects. (You may find
'git update-index --refresh' helpful here.)

git reset --hard means not only make the index match the commit in
HEAD, but change files in the working copy to match as well.  In such
a case, git will know that the index matches the working copy (since
it is enforcing it), so it can update all the stat-info in the index
and future git-status operations will be fast.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: git reset for index restoration?
  2014-05-22 16:46 ` Jeff King
@ 2014-05-22 18:08   ` David Turner
  2014-05-22 18:23     ` Jeff King
  0 siblings, 1 reply; 22+ messages in thread
From: David Turner @ 2014-05-22 18:08 UTC (permalink / raw)
  To: Jeff King; +Cc: git mailing list

On Thu, 2014-05-22 at 12:46 -0400, Jeff King wrote:
> On Thu, May 22, 2014 at 12:22:43PM -0400, David Turner wrote:
> 
> > If I have a git repository with a clean working tree, and I delete the
> > index, then I can use git reset (with no arguments) to recreate it.
> > However, when I do recreate it, it doesn't come back the same.  I have
> > not analyzed this in detail, but the effect is that commands like git
> > status take much longer because they must read objects out of a pack
> > file.  In other words, the index seems to not realize that the index (or
> > at least most of it) represents the same state as HEAD.  If I do git
> > reset --hard, the index is restored to the original state (it's
> > byte-for-byte identical), and the pack file is no longer read.
> 
> Are you sure it's reading a packfile?

Well, it's calling inflate(), and strace says it is reading
e.g. .git/objects/pack/pack-....{idx,pack}.

So, I would say so.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: git reset for index restoration?
  2014-05-22 16:46 ` Elijah Newren
@ 2014-05-22 18:17   ` David Turner
  2014-05-22 18:39     ` Jeff King
  0 siblings, 1 reply; 22+ messages in thread
From: David Turner @ 2014-05-22 18:17 UTC (permalink / raw)
  To: Elijah Newren; +Cc: git mailing list

On Thu, 2014-05-22 at 09:46 -0700, Elijah Newren wrote:
> On Thu, May 22, 2014 at 9:22 AM, David Turner <dturner@twopensource.com> wrote:
> > If I have a git repository with a clean working tree, and I delete the
> > index, then I can use git reset (with no arguments) to recreate it.
> > However, when I do recreate it, it doesn't come back the same.  I have
> > not analyzed this in detail, but the effect is that commands like git
> > status take much longer because they must read objects out of a pack
> > file.  In other words, the index seems to not realize that the index (or
> > at least most of it) represents the same state as HEAD.  If I do git
> > reset --hard, the index is restored to the original state (it's
> > byte-for-byte identical), and the pack file is no longer read.
> >
> > Before I try to dig in to why this should be so, does anyone happen to
> > know off the top of their head?  Does this constitute a bug in git, or a
> > bug in my understanding of git?
> 
> It's not a bug.  The index has additional stat-info it tracks about
> files -- inode number, mtime, etc. -- so that it can quickly check
> whether files are up to date without comparing full file contents in
> the working copy to the relevant version from .git/objects.
> 
> 'git reset' means make the index match the commit in HEAD. 

Sure, so why does git status the need to read the pack file, given that
we already know that the index matches HEAD?

>  It implies
> nothing about the working copy files, which could be different.
> Although you say that you have a clean working tree, git doesn't check
> to verify that, so it has to treat these files as stat-dirty until the
> next operation (e.g. git status) fills these in -- an operation that
> involves full comparison of the files in the working copy with the
> relevant version of the file from under .git/objects. (You may find
> 'git update-index --refresh' helpful here.)

In fact, git status does not write the index (at least in this context).
And what is slow is not (only) checking over the working tree, but
reading the packs.  There should be no need to read files from the ODB
at all, since the index knows those objects' SHA1s, and it can check
those against the working tree's SHA1s.  Interestingly, if strace is to
be believed, git status doesn't even do check the working trees SHA1s
after a git reset; maybe reset already sets the stat bits?

So that's why I'm confused.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: git reset for index restoration?
  2014-05-22 18:08   ` David Turner
@ 2014-05-22 18:23     ` Jeff King
  2014-05-22 19:26       ` David Turner
  0 siblings, 1 reply; 22+ messages in thread
From: Jeff King @ 2014-05-22 18:23 UTC (permalink / raw)
  To: David Turner; +Cc: git mailing list

On Thu, May 22, 2014 at 02:08:16PM -0400, David Turner wrote:

> On Thu, 2014-05-22 at 12:46 -0400, Jeff King wrote:
> > On Thu, May 22, 2014 at 12:22:43PM -0400, David Turner wrote:
> >
> > > If I have a git repository with a clean working tree, and I delete the
> > > index, then I can use git reset (with no arguments) to recreate it.
> > > However, when I do recreate it, it doesn't come back the same.  I have
> > > not analyzed this in detail, but the effect is that commands like git
> > > status take much longer because they must read objects out of a pack
> > > file.  In other words, the index seems to not realize that the index (or
> > > at least most of it) represents the same state as HEAD.  If I do git
> > > reset --hard, the index is restored to the original state (it's
> > > byte-for-byte identical), and the pack file is no longer read.
> >
> > Are you sure it's reading a packfile?
>
> Well, it's calling inflate(), and strace says it is reading
> e.g. .git/objects/pack/pack-....{idx,pack}.
>
> So, I would say so.

That seems odd that we would be spending extra time there. We do
inflate() the trees in order to diff the index against HEAD, but we
shouldn't need to inflate any blobs.

Here it is for me (on linux.git):

  [before, warm cache]
  $ time perf record -q git status >/dev/null
  real    0m0.192s
  user    0m0.080s
  sys     0m0.108s

  $ perf report | grep -v '#' | head -5
     7.46%      git  [kernel.kallsyms]   [k] __d_lookup_rcu
     4.55%      git  libz.so.1.2.8       [.] inflate
     3.53%      git  libc-2.18.so        [.] __memcmp_sse4_1
     3.46%      git  [kernel.kallsyms]   [k] security_inode_getattr
     3.29%      git  git                 [.] memihash

  $ time git reset
  real    0m0.080s
  user    0m0.036s
  sys     0m0.040s

So status is pretty quick, and the time is going to lstat in the kernel,
and some tree inflation. Reset is fast, because it has nothing much to
do. Now let's kill off the index's stat cache:

  $ rm .git/index
  $ time perf record -q git reset
  real    0m0.967s
  user    0m0.780s
  sys     0m0.180s

That took a while. What was it doing?

  $ perf report | grep -v '#' | head -5
     3.23%      git  [kernel.kallsyms]   [k] copy_user_enhanced_fast_string
     1.74%      git  libcrypto.so.1.0.0  [.] 0x000000000007e010            
     1.60%      git  [kernel.kallsyms]   [k] __d_lookup_rcu                
     1.51%      git  [kernel.kallsyms]   [k] page_fault                    
     1.44%      git  libc-2.18.so        [.] __memcmp_sse4_1               

Reading files and sha1. We hash the working-tree files here (reset
doesn't technically need to refresh the index from the working tree to
copy entries from HEAD into the index, but it does it so it can do fancy
things like tell you about which files are now out-of-date).

Now how does stat fare after this?

  $ time perf record -q git status >/dev/null
  real    0m0.189s
  user    0m0.088s
  sys     0m0.096s

Looks about the same as before to me.

Note that if you use "read-tree" instead of "reset", it _just_ loads the
index, and doesn't touch the working tree. If you then run "git status",
then _that_ command has to refresh the index, and it will pay the
hashing cost. Like:

  $ rm .git/index
  $ time git read-tree HEAD
  real    0m0.084s
  user    0m0.064s
  sys     0m0.016s
  $ time git status >/dev/null
  real    0m0.833s
  user    0m0.712s
  sys     0m0.112s

All of this is behaving as I would expect. Can you show us a set of
commands that deviate from this?

-Peff

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: git reset for index restoration?
  2014-05-22 18:17   ` David Turner
@ 2014-05-22 18:39     ` Jeff King
  2014-05-22 19:07       ` David Turner
  0 siblings, 1 reply; 22+ messages in thread
From: Jeff King @ 2014-05-22 18:39 UTC (permalink / raw)
  To: David Turner; +Cc: Elijah Newren, git mailing list

On Thu, May 22, 2014 at 02:17:22PM -0400, David Turner wrote:

> In fact, git status does not write the index (at least in this context).
> And what is slow is not (only) checking over the working tree, but
> reading the packs.  There should be no need to read files from the ODB
> at all, since the index knows those objects' SHA1s, and it can check
> those against the working tree's SHA1s.  Interestingly, if strace is to
> be believed, git status doesn't even do check the working trees SHA1s
> after a git reset; maybe reset already sets the stat bits?

Keep in mind that "status" is running _two_ diffs. One between HEAD and
the index, and one between the index and the working tree.

Your timings might be more interesting just run "git diff-index --cached
HEAD", which is the index->HEAD half of that, ignoring the working tree
state entirely.

Naively, running that diff will involve walking the trees and the index
in parallel, which means opening a bunch of tree objects. The cache-tree
index extension, however, records tree sha1s, which means we can avoid
recursing into parts of the comparison.

Comparing the two:

  $ rm .git/index
  $ git reset
  $ time git diff-index --cached HEAD
  real    0m0.044s
  user    0m0.040s
  sys     0m0.000s

  $ git reset --hard
  $ time git diff-index --cached HEAD
  real    0m0.012s
  user    0m0.008s
  sys     0m0.000s

does show some improvement. Perhaps "git reset" is not writing out the
cache-tree extension?

-Peff

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: git reset for index restoration?
  2014-05-22 18:39     ` Jeff King
@ 2014-05-22 19:07       ` David Turner
  2014-05-22 19:09         ` Jeff King
  0 siblings, 1 reply; 22+ messages in thread
From: David Turner @ 2014-05-22 19:07 UTC (permalink / raw)
  To: Jeff King; +Cc: Elijah Newren, git mailing list

On Thu, 2014-05-22 at 14:39 -0400, Jeff King wrote:
> does show some improvement. Perhaps "git reset" is not writing out the
> cache-tree extension?
 
Yes, that seems to be exactly what is going on; the two indexes are
identical up to the point where the TREE extension appears.

Thanks for clearing that up!  

Do you think that this is a bug in git reset?

(I'm also going to reply to your other mail because it seems like it
maybe points up some related bug)

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: git reset for index restoration?
  2014-05-22 19:07       ` David Turner
@ 2014-05-22 19:09         ` Jeff King
  2014-05-22 19:30           ` Jeff King
  0 siblings, 1 reply; 22+ messages in thread
From: Jeff King @ 2014-05-22 19:09 UTC (permalink / raw)
  To: David Turner; +Cc: Elijah Newren, git mailing list

On Thu, May 22, 2014 at 03:07:49PM -0400, David Turner wrote:

> On Thu, 2014-05-22 at 14:39 -0400, Jeff King wrote:
> > does show some improvement. Perhaps "git reset" is not writing out the
> > cache-tree extension?
>  
> Yes, that seems to be exactly what is going on; the two indexes are
> identical up to the point where the TREE extension appears.
> 
> Thanks for clearing that up!  
> 
> Do you think that this is a bug in git reset?

Possibly. There is a call to prime_cache_tree in builtin/reset.c, which
looks like it should trigger during a "mixed" or "hard" reset (and
without arguments, you should have a mixed reset). But it doesn't seem
to get called. I haven't traced it further.

-Peff

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: git reset for index restoration?
  2014-05-22 18:23     ` Jeff King
@ 2014-05-22 19:26       ` David Turner
  0 siblings, 0 replies; 22+ messages in thread
From: David Turner @ 2014-05-22 19:26 UTC (permalink / raw)
  To: Jeff King; +Cc: git mailing list

On Thu, 2014-05-22 at 14:23 -0400, Jeff King wrote:
> On Thu, May 22, 2014 at 02:08:16PM -0400, David Turner wrote:
> 
> > On Thu, 2014-05-22 at 12:46 -0400, Jeff King wrote:
> > > On Thu, May 22, 2014 at 12:22:43PM -0400, David Turner wrote:
> > >
> > > > If I have a git repository with a clean working tree, and I delete the
> > > > index, then I can use git reset (with no arguments) to recreate it.
> > > > However, when I do recreate it, it doesn't come back the same.  I have
> > > > not analyzed this in detail, but the effect is that commands like git
> > > > status take much longer because they must read objects out of a pack
> > > > file.  In other words, the index seems to not realize that the index (or
> > > > at least most of it) represents the same state as HEAD.  If I do git
> > > > reset --hard, the index is restored to the original state (it's
> > > > byte-for-byte identical), and the pack file is no longer read.
> > >
> > > Are you sure it's reading a packfile?
> >
> > Well, it's calling inflate(), and strace says it is reading
> > e.g. .git/objects/pack/pack-....{idx,pack}.
> >
> > So, I would say so.
> 
> That seems odd that we would be spending extra time there. We do
> inflate() the trees in order to diff the index against HEAD, but we
> shouldn't need to inflate any blobs.

I just did a fresh clone of linux.git, and it doesn't have TREE (and
spends time in inflate).  Then I did a reset --hard, and it gained TREE
(and stopped spending time in inflate).  Then I did a checkout -b, and
TREE was lost again (time in inflate).  hard reset again (to HEAD, so no
change), and TREE returns and status is fast again.

Committing doesn't seem to create a TREE extension where one doesn't
exist.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: git reset for index restoration?
  2014-05-22 19:09         ` Jeff King
@ 2014-05-22 19:30           ` Jeff King
  2014-05-22 21:34             ` Junio C Hamano
  0 siblings, 1 reply; 22+ messages in thread
From: Jeff King @ 2014-05-22 19:30 UTC (permalink / raw)
  To: David Turner; +Cc: Junio C Hamano, Elijah Newren, git mailing list

[+cc Junio for cache-tree expertise]

On Thu, May 22, 2014 at 03:09:59PM -0400, Jeff King wrote:

> > > does show some improvement. Perhaps "git reset" is not writing out the
> > > cache-tree extension?
> [...]
> 
> Possibly. There is a call to prime_cache_tree in builtin/reset.c, which
> looks like it should trigger during a "mixed" or "hard" reset (and
> without arguments, you should have a mixed reset). But it doesn't seem
> to get called. I haven't traced it further.

So here's what's happening. The prime_cache_tree call is in reset_index,
and was added by:

  commit 6c52ec8a9ab48b50fc8bf9559467d5a4cf7eee3b
  Author: Thomas Rast <trast@student.ethz.ch>
  Date:   Tue Dec 6 18:43:39 2011 +0100
  
      reset: update cache-tree data when appropriate
      
      In the case of --mixed and --hard, we throw away the old index and
      rebuild everything from the tree argument (or HEAD).  So we have an
      opportunity here to fill in the cache-tree data, just as read-tree
      did.
      
      Signed-off-by: Thomas Rast <trast@student.ethz.ch>
      Signed-off-by: Junio C Hamano <gitster@pobox.com>

But that was counteracted by:

  commit 3fde386a40f38dbaa684c17603e71909b862d021
  Author: Martin von Zweigbergk <martinvonz@gmail.com>
  Date:   Mon Jan 14 21:47:51 2013 -0800
  
      reset [--mixed]: use diff-based reset whether or not pathspec was given
      
      Thanks to b65982b (Optimize "diff-index --cached" using cache-tree,
      2009-05-20), resetting with paths is much faster than resetting
      without paths. Some timings for the linux-2.6 repo to illustrate this
      (best of five, warm cache):
      
              reset       reset .
      real    0m0.219s    0m0.080s
      user    0m0.140s    0m0.040s
      sys     0m0.070s    0m0.030s
      
      These two commands should do the same thing, so instead of having the
      user type the trailing " ." to get the faster do_diff_cache()-based
      implementation, always use it when doing a mixed reset, with or
      without paths (so "git reset $rev" would also be faster).
      
      Timing "git reset" shows that it indeed becomes as fast as
      "git reset ." after this patch.
      
      Signed-off-by: Martin von Zweigbergk <martinvonz@gmail.com>
      Signed-off-by: Junio C Hamano <gitster@pobox.com>

We never call reset_index now, because we handle it via diff.  We could
call prime_cache_tree in this case, but I'm not sure if that is a good
idea, because it primes it from scratch (and so it opens up all those
trees that we are trying to avoid touching). I'm not sure if there's an
easy way to update it incrementally; I don't know the cache-tree code
very well.

-Peff

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: git reset for index restoration?
  2014-05-22 19:30           ` Jeff King
@ 2014-05-22 21:34             ` Junio C Hamano
  2014-05-22 21:53               ` David Turner
  0 siblings, 1 reply; 22+ messages in thread
From: Junio C Hamano @ 2014-05-22 21:34 UTC (permalink / raw)
  To: Jeff King; +Cc: David Turner, Elijah Newren, git mailing list

Jeff King <peff@peff.net> writes:

> [+cc Junio for cache-tree expertise]
> ...
> We never call reset_index now, because we handle it via diff.  We could
> call prime_cache_tree in this case, but I'm not sure if that is a good
> idea, because it primes it from scratch (and so it opens up all those
> trees that we are trying to avoid touching). I'm not sure if there's an
> easy way to update it incrementally; I don't know the cache-tree code
> very well.

The cache-tree is designed to start in a well-populated state,
allowing you to efficiently smudge the part you touched by
invalidating while keeping the parts you haven't touched intact.

What is missing in its API is a more fine-grained support to let us
say "it has degraded too much and we need to bring it into a
well-populated state again for it to be truly useful as an
optimization."  There are only two modes of support to revive a
degraded cache-tree, one being write_cache_as_tree(), in which case
we have to compute necessary tree object names anyway (so there is
no point discarding the result of the computation), and the other
being calls to prime-cache-tree, in which we happen to know that the
whole index contents must match the whole tree structure represented
by one tree object.

Both aim to restore the cache-tree into a fully-populated state, and
there is no support to populate it "well enough" by doing anything
incremental.  You can call write-tree side incremental, because it
does reuse what is still valid without recomputing tree objects for
them---but the result is a fully-populated state.

Adding a more fine-grain support is not against the overall design,
but it was unclear what such additional API functions should look
like, and where we can call them safely, at least back when we were
actively improving it.  Two that comes to my mind are:

 - We know that the subtrees down in this directory are degraded too
   much; write-tree only the subtrees that correspond to this
   directory without restoring other parts of the tree.

 - We just populated the index with the subtrees in this directory
   and know that they should match the tree hierarchy exactly.
   prime-cache-tree only the parts without affecting other parts of
   the tree.

As with calls to existing (whole-tree) prime-cache-tree, the latter
is an error-prone optimization---I think we had cases where we said
"after this operation, we know that the index must exactly match the
tree we used to muck with the index" and added a call, and later
discovered that "must exactly match" was not true.

The former forces recomputation, so there is much less safety
concern.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: git reset for index restoration?
  2014-05-22 21:34             ` Junio C Hamano
@ 2014-05-22 21:53               ` David Turner
  2014-05-22 21:58                 ` Junio C Hamano
  0 siblings, 1 reply; 22+ messages in thread
From: David Turner @ 2014-05-22 21:53 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, Elijah Newren, git mailing list

On Thu, 2014-05-22 at 14:34 -0700, Junio C Hamano wrote:
> Jeff King <peff@peff.net> writes:
> 
> > [+cc Junio for cache-tree expertise]
> > ...
> > We never call reset_index now, because we handle it via diff.  We could
> > call prime_cache_tree in this case, but I'm not sure if that is a good
> > idea, because it primes it from scratch (and so it opens up all those
> > trees that we are trying to avoid touching). I'm not sure if there's an
> > easy way to update it incrementally; I don't know the cache-tree code
> > very well.
> 
> The cache-tree is designed to start in a well-populated state,
> allowing you to efficiently smudge the part you touched by
> invalidating while keeping the parts you haven't touched intact.

As far as I can tell, the cache-tree does not in fact ever get into a
well-populated state (that is, it does not exist at all) under ordinary
git operation except by git reset --hard.  Perhaps this was already
clear from the previous traffic on the thread, but I wanted to make sure
Junio was also aware of this.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: git reset for index restoration?
  2014-05-22 21:53               ` David Turner
@ 2014-05-22 21:58                 ` Junio C Hamano
  2014-05-22 22:01                   ` David Turner
  0 siblings, 1 reply; 22+ messages in thread
From: Junio C Hamano @ 2014-05-22 21:58 UTC (permalink / raw)
  To: David Turner; +Cc: Jeff King, Elijah Newren, git mailing list

David Turner <dturner@twopensource.com> writes:

> On Thu, 2014-05-22 at 14:34 -0700, Junio C Hamano wrote:
>> Jeff King <peff@peff.net> writes:
>> 
>> > [+cc Junio for cache-tree expertise]
>> > ...
>> > We never call reset_index now, because we handle it via diff.  We could
>> > call prime_cache_tree in this case, but I'm not sure if that is a good
>> > idea, because it primes it from scratch (and so it opens up all those
>> > trees that we are trying to avoid touching). I'm not sure if there's an
>> > easy way to update it incrementally; I don't know the cache-tree code
>> > very well.
>> 
>> The cache-tree is designed to start in a well-populated state,
>> allowing you to efficiently smudge the part you touched by
>> invalidating while keeping the parts you haven't touched intact.
>
> As far as I can tell, the cache-tree does not in fact ever get into a
> well-populated state (that is, it does not exist at all) under ordinary
> git operation except by git reset --hard.  Perhaps this was already
> clear from the previous traffic on the thread, but I wanted to make sure
> Junio was also aware of this.

Yes.  As I said, that should not usually be a problem for those who
do the real work (read: commit), at which time write-tree will fully
populate the cache-tree.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: git reset for index restoration?
  2014-05-22 21:58                 ` Junio C Hamano
@ 2014-05-22 22:01                   ` David Turner
  2014-05-22 22:12                     ` Junio C Hamano
  0 siblings, 1 reply; 22+ messages in thread
From: David Turner @ 2014-05-22 22:01 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, Elijah Newren, git mailing list

On Thu, 2014-05-22 at 14:58 -0700, Junio C Hamano wrote:
> David Turner <dturner@twopensource.com> writes:
> 
> > On Thu, 2014-05-22 at 14:34 -0700, Junio C Hamano wrote:
> >> Jeff King <peff@peff.net> writes:
> >> 
> >> > [+cc Junio for cache-tree expertise]
> >> > ...
> >> > We never call reset_index now, because we handle it via diff.  We could
> >> > call prime_cache_tree in this case, but I'm not sure if that is a good
> >> > idea, because it primes it from scratch (and so it opens up all those
> >> > trees that we are trying to avoid touching). I'm not sure if there's an
> >> > easy way to update it incrementally; I don't know the cache-tree code
> >> > very well.
> >> 
> >> The cache-tree is designed to start in a well-populated state,
> >> allowing you to efficiently smudge the part you touched by
> >> invalidating while keeping the parts you haven't touched intact.
> >
> > As far as I can tell, the cache-tree does not in fact ever get into a
> > well-populated state (that is, it does not exist at all) under ordinary
> > git operation except by git reset --hard.  Perhaps this was already
> > clear from the previous traffic on the thread, but I wanted to make sure
> > Junio was also aware of this.
> 
> Yes.  As I said, that should not usually be a problem for those who
> do the real work (read: commit), at which time write-tree will fully
> populate the cache-tree.

Git commit does not in fact populate the cache-tree.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: git reset for index restoration?
  2014-05-22 22:01                   ` David Turner
@ 2014-05-22 22:12                     ` Junio C Hamano
  2014-05-22 22:18                       ` Junio C Hamano
  2014-05-22 22:29                       ` Junio C Hamano
  0 siblings, 2 replies; 22+ messages in thread
From: Junio C Hamano @ 2014-05-22 22:12 UTC (permalink / raw)
  To: David Turner; +Cc: Jeff King, Elijah Newren, git mailing list

David Turner <dturner@twopensource.com> writes:

>> Yes.  As I said, that should not usually be a problem for those who
>> do the real work (read: commit), at which time write-tree will fully
>> populate the cache-tree.
>
> Git commit does not in fact populate the cache-tree.

If that is the case, we must have broken the write-tree codepath
over time.

Of course, "git commit foo" will need to prepare a temporary index
where only the entry "foo" is different from the HEAD version, write
the tree to create a commit, but we do not write out the main index
as a tree after updating the same entry "foo" in it (there may be
other changes relative to HEAD), so its cache-tree is only
invalidated at "foo" when we updating the entry and we do not spend
extra cycles to repopulate it.

But at least my understanding has been that "git commit" (no partial
commit, write the whole index as a commit) which uses the "git
write-tree" machinery knows which subtree has what tree object name
and populates the cache-tree fully.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: git reset for index restoration?
  2014-05-22 22:12                     ` Junio C Hamano
@ 2014-05-22 22:18                       ` Junio C Hamano
  2014-05-22 23:33                         ` Duy Nguyen
  2014-05-22 22:29                       ` Junio C Hamano
  1 sibling, 1 reply; 22+ messages in thread
From: Junio C Hamano @ 2014-05-22 22:18 UTC (permalink / raw)
  To: David Turner; +Cc: Jeff King, Elijah Newren, git mailing list

Junio C Hamano <gitster@pobox.com> writes:

> David Turner <dturner@twopensource.com> writes:
>
>>> Yes.  As I said, that should not usually be a problem for those who
>>> do the real work (read: commit), at which time write-tree will fully
>>> populate the cache-tree.
>>
>> Git commit does not in fact populate the cache-tree.
>
> If that is the case, we must have broken the write-tree codepath
> over time.
>
> Of course, "git commit foo" will need to prepare a temporary index
> where only the entry "foo" is different from the HEAD version, write
> the tree to create a commit, but we do not write out the main index
> as a tree after updating the same entry "foo" in it (there may be
> other changes relative to HEAD), so its cache-tree is only
> invalidated at "foo" when we updating the entry and we do not spend
> extra cycles to repopulate it.
>
> But at least my understanding has been that "git commit" (no partial
> commit, write the whole index as a commit) which uses the "git
> write-tree" machinery knows which subtree has what tree object name
> and populates the cache-tree fully.

... and the "incrementally repair" Peff talks about would be to
cover more cases where we may know (either because we have already
computed it to write out a subtree, or we have just read from a
known tree to populate a part of the index and we know the paths in
the index that correspond to that subtree are exactly the same ones
as found in the tree we read from) parts of the cache-tree can be
updated with tree object names for subtrees, but we don't do
anything right now.  "git commit foo" (where "foo" is a directory)
may be able to say "The cache-tree entry for 'foo' can be updated
with the tree object of the new HEAD:foo because we know they must
match in the main index", for example.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: git reset for index restoration?
  2014-05-22 22:12                     ` Junio C Hamano
  2014-05-22 22:18                       ` Junio C Hamano
@ 2014-05-22 22:29                       ` Junio C Hamano
  2014-05-22 23:02                         ` David Turner
  1 sibling, 1 reply; 22+ messages in thread
From: Junio C Hamano @ 2014-05-22 22:29 UTC (permalink / raw)
  To: Jeff King; +Cc: David Turner, Elijah Newren, git mailing list

Junio C Hamano <gitster@pobox.com> writes:

> But at least my understanding has been that "git commit" (no partial
> commit, write the whole index as a commit) which uses the "git
> write-tree" machinery knows which subtree has what tree object name
> and populates the cache-tree fully.

Here is what I tried just now.

    $ rm .git/index
    $ git read-tree HEAD HEAD

Note that a single-tree read-tree will populate the cache-tree and
that is why I am forcing "switch branches" 2-way read-tree here,
which I know will discard the cache-tree fully.

    $ ls -l .git/index
    -rw-r----- 1 jch eng 249440 May 22 15:20 .git/index
    $ git checkout HEAD^0
    $ ls -l .git/index
    -rw-r----- 1 jch eng 249440 May 22 15:21 .git/index

Still the same size, without cache-tree.

    $ git write-tree
    57361c4add61b638dad1c1c2542edf877f515c48
    $ ls -l .git/index
    -rw-r----- 1 jch eng 254383 May 22 15:21 .git/index

The size differences come from the recomputation of the cache tree.
The result is the same if we replace "git write-tree" with a
whole-index commit, e.g.

    $ git commit --allow-empty -m foo

and test-dump-cache-tree seem to see a fully populated cache-tree
after these steps.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: git reset for index restoration?
  2014-05-22 22:29                       ` Junio C Hamano
@ 2014-05-22 23:02                         ` David Turner
  2014-05-22 23:14                           ` Junio C Hamano
  0 siblings, 1 reply; 22+ messages in thread
From: David Turner @ 2014-05-22 23:02 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, Elijah Newren, git mailing list

On Thu, 2014-05-22 at 15:29 -0700, Junio C Hamano wrote:
> Junio C Hamano <gitster@pobox.com> writes:
> 
> > But at least my understanding has been that "git commit" (no partial
> > commit, write the whole index as a commit) which uses the "git
> > write-tree" machinery knows which subtree has what tree object name
> > and populates the cache-tree fully.
> 
> Here is what I tried just now.
> 
>     $ rm .git/index
>     $ git read-tree HEAD HEAD
> 
> Note that a single-tree read-tree will populate the cache-tree and
> that is why I am forcing "switch branches" 2-way read-tree here,
> which I know will discard the cache-tree fully.
> 
>     $ ls -l .git/index
>     -rw-r----- 1 jch eng 249440 May 22 15:20 .git/index
>     $ git checkout HEAD^0
>     $ ls -l .git/index
>     -rw-r----- 1 jch eng 249440 May 22 15:21 .git/index
> 
> Still the same size, without cache-tree.
> 
>     $ git write-tree
>     57361c4add61b638dad1c1c2542edf877f515c48
>     $ ls -l .git/index
>     -rw-r----- 1 jch eng 254383 May 22 15:21 .git/index
> 
> The size differences come from the recomputation of the cache tree.
> The result is the same if we replace "git write-tree" with a
> whole-index commit, e.g.
> 
>     $ git commit --allow-empty -m foo
> 
> and test-dump-cache-tree seem to see a fully populated cache-tree
> after these steps.

I get the same results as you with git write-tree.  But I do not get the
same results from a whole-index git commit (I tried your exact
command-line).  That is, when I do git commit with no cache-tree in
place, it does not create one. 

To expand: even if git commit did work for me the way it seems to work
for you, I still believe that the cache-tree behavior would be
suboptimal, because every time a user switches branches, they lose their
cache-tree, and thus all of their git status commands are slow until
their first commit. But I am willing to believe that my workflow is
atypical, and that most people commit enough soon after switching
branches.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: git reset for index restoration?
  2014-05-22 23:02                         ` David Turner
@ 2014-05-22 23:14                           ` Junio C Hamano
  0 siblings, 0 replies; 22+ messages in thread
From: Junio C Hamano @ 2014-05-22 23:14 UTC (permalink / raw)
  To: David Turner; +Cc: Jeff King, Elijah Newren, git mailing list

David Turner <dturner@twopensource.com> writes:

> ... I still believe that the cache-tree behavior would be
> suboptimal, ...

I do not think anybody doubts that "suboptimal"-ness in this thread.
As you saw the "incremental" thing from Peff and my responses to it,
there may be more things we could be doing.  It just has not been at
a ultra high priority, and if we can choose only one change from
possibilities, losing the entire cache-tree upon switching branches,
like in my two-way read-tree example, would be the thing that would
give us the most benefit if we somehow change it.

That however is unfortunately not a low-hanging fruit.  The two-way
merge machinery we use for switching branches wants to populate the
index one entry at a time, without having any point where you can
say "OK the result in this subdirectory will exactly match this
subtree" which would allow us to say "prime the cache tree for that
subdirectory with this tree object".

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: git reset for index restoration?
  2014-05-22 22:18                       ` Junio C Hamano
@ 2014-05-22 23:33                         ` Duy Nguyen
  2014-05-22 23:37                           ` David Turner
  0 siblings, 1 reply; 22+ messages in thread
From: Duy Nguyen @ 2014-05-22 23:33 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: David Turner, Jeff King, Elijah Newren, git mailing list

On Fri, May 23, 2014 at 5:18 AM, Junio C Hamano <gitster@pobox.com> wrote:
> ... and the "incrementally repair" Peff talks about would be to
> cover more cases where we may know (either because we have already
> computed it to write out a subtree, or we have just read from a
> known tree to populate a part of the index and we know the paths in
> the index that correspond to that subtree are exactly the same ones
> as found in the tree we read from) parts of the cache-tree can be
> updated with tree object names for subtrees, but we don't do
> anything right now.

I wanted to do this but it's hard. For "diff --cached" (which should
be where we find out and repair cache-tree), we flatten the trees in
traverse_trees() and let unpack-trees figure out the differences
against the index. The's no direct connection between a change and a
tree. Maybe I missed something. David if you are interested in "git
status" performance, this repairing thing could be important when the
worktree is large because the diff cost increases accordingly if
cache-tree is not fully populated. Empty cache-tree could cost us
nearly the same time we save with avoiding opendir/readdir..
-- 
Duy

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: git reset for index restoration?
  2014-05-22 23:33                         ` Duy Nguyen
@ 2014-05-22 23:37                           ` David Turner
  0 siblings, 0 replies; 22+ messages in thread
From: David Turner @ 2014-05-22 23:37 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Junio C Hamano, Jeff King, Elijah Newren, git mailing list

On Fri, 2014-05-23 at 06:33 +0700, Duy Nguyen wrote:
> On Fri, May 23, 2014 at 5:18 AM, Junio C Hamano <gitster@pobox.com> wrote:
> > ... and the "incrementally repair" Peff talks about would be to
> > cover more cases where we may know (either because we have already
> > computed it to write out a subtree, or we have just read from a
> > known tree to populate a part of the index and we know the paths in
> > the index that correspond to that subtree are exactly the same ones
> > as found in the tree we read from) parts of the cache-tree can be
> > updated with tree object names for subtrees, but we don't do
> > anything right now.
> 
> I wanted to do this but it's hard. For "diff --cached" (which should
> be where we find out and repair cache-tree), we flatten the trees in
> traverse_trees() and let unpack-trees figure out the differences
> against the index. The's no direct connection between a change and a
> tree. Maybe I missed something. David if you are interested in "git
> status" performance, this repairing thing could be important when the
> worktree is large because the diff cost increases accordingly if
> cache-tree is not fully populated. Empty cache-tree could cost us
> nearly the same time we save with avoiding opendir/readdir..

I am interested, and I believe I might be able to start looking into it
in a week or two.

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2014-05-22 23:37 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-05-22 16:22 git reset for index restoration? David Turner
2014-05-22 16:46 ` Jeff King
2014-05-22 18:08   ` David Turner
2014-05-22 18:23     ` Jeff King
2014-05-22 19:26       ` David Turner
2014-05-22 16:46 ` Elijah Newren
2014-05-22 18:17   ` David Turner
2014-05-22 18:39     ` Jeff King
2014-05-22 19:07       ` David Turner
2014-05-22 19:09         ` Jeff King
2014-05-22 19:30           ` Jeff King
2014-05-22 21:34             ` Junio C Hamano
2014-05-22 21:53               ` David Turner
2014-05-22 21:58                 ` Junio C Hamano
2014-05-22 22:01                   ` David Turner
2014-05-22 22:12                     ` Junio C Hamano
2014-05-22 22:18                       ` Junio C Hamano
2014-05-22 23:33                         ` Duy Nguyen
2014-05-22 23:37                           ` David Turner
2014-05-22 22:29                       ` Junio C Hamano
2014-05-22 23:02                         ` David Turner
2014-05-22 23:14                           ` Junio C Hamano

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.