git status: small difference between stating whole repository and small subdirectory

All of lore.kernel.org
 help / color / mirror / Atom feed

* git status: small difference between stating whole repository and small subdirectory
@ 2012-02-10  9:42 Piotr Krukowiecki
  2012-02-10 12:33 ` Nguyen Thai Ngoc Duy
  2012-02-10 16:18 ` Piotr Krukowiecki
  0 siblings, 2 replies; 52+ messages in thread
From: Piotr Krukowiecki @ 2012-02-10  9:42 UTC (permalink / raw)
  To: Git Mailing List

Hi,

I compared stating whole tree vs one small subdirectory, and I
expected that for the subdirectory status will be very very fast.
After all, it has only few files to stat. But it's not fast. Why?


With cold cache (echo 3 | sudo tee /proc/sys/vm/drop_caches):

$ time git status    > /dev/null
real	0m41.670s
user	0m0.980s
sys	0m2.908s

$ time git status -- src/.../somedir   > /dev/null
real	0m17.380s
user	0m0.748s
sys	0m0.328s


With warm cache:

$ time git status    > /dev/null
real	0m0.792s
user	0m0.404s
sys	0m0.384s

$ time git status -- src/.../somedir   > /dev/null
real	0m0.335s
user	0m0.288s
sys	0m0.048s


Repository/dir stats:

$ find * -type f | wc -l
127189
$ du -shc * | grep total
2.2G	total

$ find src/.../somedir -type f | wc -l
55
$ du -shc src/.../somedir | grep total
224K	total


$ git --version
git version 1.7.9.rc0.10.gbeecc

-- 
Piotr Krukowiecki

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: git status: small difference between stating whole repository and small subdirectory
  2012-02-10  9:42 git status: small difference between stating whole repository and small subdirectory Piotr Krukowiecki
@ 2012-02-10 12:33 ` Nguyen Thai Ngoc Duy
  2012-02-10 13:46   ` Piotr Krukowiecki
  2012-02-10 16:18 ` Piotr Krukowiecki
  1 sibling, 1 reply; 52+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-02-10 12:33 UTC (permalink / raw)
  To: Piotr Krukowiecki; +Cc: Git Mailing List

On Fri, Feb 10, 2012 at 4:42 PM, Piotr Krukowiecki
<piotr.krukowiecki@gmail.com> wrote:
> Hi,
>
> I compared stating whole tree vs one small subdirectory, and I
> expected that for the subdirectory status will be very very fast.
> After all, it has only few files to stat. But it's not fast. Why?

Because stat'ing is not the only thing git-status does? In order to
find out staged changes, unstaged changes and untracked files, it has
to do the equivalence of "git diff --cached", "git diff" and "git
ls-files -o". I think copy detection is also enabled, which uses more
cycles.

Profiling it should give you a good idea what parts cost most.
-- 
Duy

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: git status: small difference between stating whole repository and small subdirectory
  2012-02-10 12:33 ` Nguyen Thai Ngoc Duy
@ 2012-02-10 13:46   ` Piotr Krukowiecki
  2012-02-10 14:37     ` Nguyen Thai Ngoc Duy
  0 siblings, 1 reply; 52+ messages in thread
From: Piotr Krukowiecki @ 2012-02-10 13:46 UTC (permalink / raw)
  To: Git Mailing List; +Cc: Nguyen Thai Ngoc Duy

On Fri, Feb 10, 2012 at 1:33 PM, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:
> On Fri, Feb 10, 2012 at 4:42 PM, Piotr Krukowiecki
> <piotr.krukowiecki@gmail.com> wrote:
>> Hi,
>>
>> I compared stating whole tree vs one small subdirectory, and I
>> expected that for the subdirectory status will be very very fast.
>> After all, it has only few files to stat. But it's not fast. Why?
>
> Because stat'ing is not the only thing git-status does? In order to
> find out staged changes, unstaged changes and untracked files, it has
> to do the equivalence of "git diff --cached", "git diff" and "git
> ls-files -o". I think copy detection is also enabled, which uses more
> cycles.

I believe copy detection is not done, neither for tracked nor untracked files.
Rename detection is done for tracked files. In this case it should not
matter, as there were no changes to the files.

My point is, that for such small number of small files (55 files and
223KB), which had no changes at all, the status took a lot of time (17
seconds) and doing status on whole repository which has more than
2000x files and 10000x data took only 2x more time.

I don't think that the algorithm scales so well, so my guess is that
'status' is so inefficient for subdirectories (i.e. the "git status --
dir" filter does not filter as much as it could).

> Profiling it should give you a good idea what parts cost most.

I'll try that later.

BTW by stating I meant using "status", not that it uses stat() or whatever.

-- 
Piotr Krukowiecki

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: git status: small difference between stating whole repository and small subdirectory
  2012-02-10 13:46   ` Piotr Krukowiecki
@ 2012-02-10 14:37     ` Nguyen Thai Ngoc Duy
  2012-02-13 16:54       ` Piotr Krukowiecki
  0 siblings, 1 reply; 52+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-02-10 14:37 UTC (permalink / raw)
  To: Piotr Krukowiecki; +Cc: Git Mailing List

On Fri, Feb 10, 2012 at 8:46 PM, Piotr Krukowiecki
<piotr.krukowiecki@gmail.com> wrote:
> On Fri, Feb 10, 2012 at 1:33 PM, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:
>> On Fri, Feb 10, 2012 at 4:42 PM, Piotr Krukowiecki
>> <piotr.krukowiecki@gmail.com> wrote:
>>> Hi,
>>>
>>> I compared stating whole tree vs one small subdirectory, and I
>>> expected that for the subdirectory status will be very very fast.
>>> After all, it has only few files to stat. But it's not fast. Why?
>>
>> Because stat'ing is not the only thing git-status does? In order to
>> find out staged changes, unstaged changes and untracked files, it has
>> to do the equivalence of "git diff --cached", "git diff" and "git
>> ls-files -o". I think copy detection is also enabled, which uses more
>> cycles.
>
> I believe copy detection is not done, neither for tracked nor untracked files.
> Rename detection is done for tracked files. In this case it should not
> matter, as there were no changes to the files.
>
> My point is, that for such small number of small files (55 files and
> 223KB), which had no changes at all, the status took a lot of time (17
> seconds) and doing status on whole repository which has more than
> 2000x files and 10000x data took only 2x more time.
>
> I don't think that the algorithm scales so well, so my guess is that
> 'status' is so inefficient for subdirectories (i.e. the "git status --
> dir" filter does not filter as much as it could).

I think the cost is in $GIT_DIR, not the working directory.
-- 
Duy

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: git status: small difference between stating whole repository and small subdirectory
  2012-02-10  9:42 git status: small difference between stating whole repository and small subdirectory Piotr Krukowiecki
  2012-02-10 12:33 ` Nguyen Thai Ngoc Duy
@ 2012-02-10 16:18 ` Piotr Krukowiecki
  2012-02-14 11:34   ` Thomas Rast
  1 sibling, 1 reply; 52+ messages in thread
From: Piotr Krukowiecki @ 2012-02-10 16:18 UTC (permalink / raw)
  To: Git Mailing List; +Cc: Nguyen Thai Ngoc Duy

On Fri, Feb 10, 2012 at 10:42 AM, Piotr Krukowiecki
<piotr.krukowiecki@gmail.com> wrote:
> Hi,
>
> I compared stating whole tree vs one small subdirectory, and I
> expected that for the subdirectory status will be very very fast.
> After all, it has only few files to stat. But it's not fast. Why?
>
>
> With cold cache (echo 3 | sudo tee /proc/sys/vm/drop_caches):
>
> $ time git status    > /dev/null
> real    0m41.670s
> user    0m0.980s
> sys     0m2.908s
>
> $ time git status -- src/.../somedir   > /dev/null
> real    0m17.380s
> user    0m0.748s
> sys     0m0.328s
>
>
> With warm cache:
>
> $ time git status    > /dev/null
> real    0m0.792s
> user    0m0.404s
> sys     0m0.384s
>
> $ time git status -- src/.../somedir   > /dev/null
> real    0m0.335s
> user    0m0.288s
> sys     0m0.048s
>
>
> Repository/dir stats:
>
> $ find * -type f | wc -l
> 127189
> $ du -shc * | grep total
> 2.2G    total
>
> $ find src/.../somedir -type f | wc -l
> 55
> $ du -shc src/.../somedir | grep total
> 224K    total
>
>
> $ git --version
> git version 1.7.9.rc0.10.gbeecc

I can't reproduce this behavior at the moment. 'status' on the
directory takes about 1.5s instead of 17s. status on whole repository
takes 27s.
This is my work repository, so it was changed today.

I'll get back to you when I can reproduce the problem again...


-- 
Piotr Krukowiecki

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: git status: small difference between stating whole repository and small subdirectory
  2012-02-10 14:37     ` Nguyen Thai Ngoc Duy
@ 2012-02-13 16:54       ` Piotr Krukowiecki
  0 siblings, 0 replies; 52+ messages in thread
From: Piotr Krukowiecki @ 2012-02-13 16:54 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy; +Cc: Git Mailing List

On Fri, Feb 10, 2012 at 3:37 PM, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:
> I think the cost is in $GIT_DIR, not the working directory.

Could you explain?

I got the problem again today. This time I've made a copy of the
repository so hopefully I'll able to reproduce the problems.

This time it's a different repository but the 'status' on a small
subdirectory is even more than 2x slower than on the whole repository.

Whole repo:
$ find * -type f | wc -l
33021
$ du -shc * | grep total
2.1G	total

The subdir:
$ find * -type f | wc -l
17
$ du -shc * | grep total
84K	total

As previously, timing was done with cold cache (echo 3 | sudo tee
/proc/sys/vm/drop_caches) and executed several times.

This time I have used recent git (1.7.9.188.g12766) compiled with -pg.
git was executed in the subdirectory. Tracked files were not
changed/deleted, there was just a couple of small untracked files.

Timings:

$ time git status
real	0m16.595s
user	0m0.680s
sys	0m0.616s

$ time git status -- .
real	0m10.030s
user	0m0.464s
sys	0m0.184s

You can find gprof output here:
http://pastebin.com/mhddDUmv - from whole repo status
http://pastebin.com/1LdVn77A - from subdir status

-- 
Piotr Krukowiecki

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: git status: small difference between stating whole repository and small subdirectory
  2012-02-10 16:18 ` Piotr Krukowiecki
@ 2012-02-14 11:34   ` Thomas Rast
  2012-02-15  8:57     ` Piotr Krukowiecki
  0 siblings, 1 reply; 52+ messages in thread
From: Thomas Rast @ 2012-02-14 11:34 UTC (permalink / raw)
  To: Piotr Krukowiecki; +Cc: Git Mailing List, Nguyen Thai Ngoc Duy

Piotr Krukowiecki <piotr.krukowiecki@gmail.com> writes:

> On Fri, Feb 10, 2012 at 10:42 AM, Piotr Krukowiecki
> <piotr.krukowiecki@gmail.com> wrote:
>> I compared stating whole tree vs one small subdirectory, and I
>> expected that for the subdirectory status will be very very fast.
>> After all, it has only few files to stat. But it's not fast. Why?
>>
>>
>> With cold cache (echo 3 | sudo tee /proc/sys/vm/drop_caches):
>>
>> $ time git status    > /dev/null
>> real    0m41.670s
>> user    0m0.980s
>> sys     0m2.908s
>>
>> $ time git status -- src/.../somedir   > /dev/null
>> real    0m17.380s
>> user    0m0.748s
>> sys     0m0.328s
[...]
> I can't reproduce this behavior at the moment. 'status' on the
> directory takes about 1.5s instead of 17s. status on whole repository
> takes 27s.
> This is my work repository, so it was changed today.

To me these timings smell like a combination of either a network
filesystem or a slow/busy disk, and non-packed repositories.  Next time
this happens look at 'git count-objects', run 'git gc' and redo the
timings.

If you are indeed on a network filesystem, also look at the
core.preloadIndex setting.

-- 
Thomas Rast
trast@{inf,student}.ethz.ch

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: git status: small difference between stating whole repository and small subdirectory
  2012-02-14 11:34   ` Thomas Rast
@ 2012-02-15  8:57     ` Piotr Krukowiecki
  2012-02-15 11:01       ` Nguyen Thai Ngoc Duy
  2012-02-15 19:03       ` Jeff King
  0 siblings, 2 replies; 52+ messages in thread
From: Piotr Krukowiecki @ 2012-02-15  8:57 UTC (permalink / raw)
  To: Thomas Rast; +Cc: Git Mailing List, Nguyen Thai Ngoc Duy

On Tue, Feb 14, 2012 at 12:34 PM, Thomas Rast <trast@inf.ethz.ch> wrote:
> Piotr Krukowiecki <piotr.krukowiecki@gmail.com> writes:
>
>> On Fri, Feb 10, 2012 at 10:42 AM, Piotr Krukowiecki
>> <piotr.krukowiecki@gmail.com> wrote:
>>> I compared stating whole tree vs one small subdirectory, and I
>>> expected that for the subdirectory status will be very very fast.
>>> After all, it has only few files to stat. But it's not fast. Why?
>>>
>>>
>>> With cold cache (echo 3 | sudo tee /proc/sys/vm/drop_caches):
>>>
>>> $ time git status    > /dev/null
>>> real    0m41.670s
>>> user    0m0.980s
>>> sys     0m2.908s
>>>
>>> $ time git status -- src/.../somedir   > /dev/null
>>> real    0m17.380s
>>> user    0m0.748s
>>> sys     0m0.328s
> [...]
>> I can't reproduce this behavior at the moment. 'status' on the
>> directory takes about 1.5s instead of 17s. status on whole repository
>> takes 27s.
>> This is my work repository, so it was changed today.
>
> To me these timings smell like a combination of either a network
> filesystem or a slow/busy disk, and non-packed repositories.  Next time
> this happens look at 'git count-objects', run 'git gc' and redo the
> timings.
>
> If you are indeed on a network filesystem, also look at the
> core.preloadIndex setting.

All is on local disk and system is idle.

Indeed, after gc the times went down:
10s -> 2.3s (subdirectory)
17s -> 9.5s (whole repo)

2 seconds is much better and I'd say acceptable for me. But my questions are:
- why is it so slow with not packed repo?
- can it be faster without repacking?
- even with packed repo, the time on small subdirectory is much higher
than I'd expect given time on whole repo and subdirectory size - why?


$ git count-objects -v
count: 5095
size: 37084
in-pack: 755364
packs: 21
size-pack: 2398468
prune-packable: 0
garbage: 0

$ git gc
Counting objects: 760212, done.
Compressing objects: 100% (158651/158651), done.
Writing objects: 100% (760212/760212), done.
Total 760212 (delta 535848), reused 736108 (delta 513257)
Checking connectivity: 760212, done.

$ time git status  -- .
real	0m2.503s
user	0m0.160s
sys	0m0.096s

$ time git status
real	0m9.663s
user	0m0.232s
sys	0m0.556s


-- 
Piotr Krukowiecki

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: git status: small difference between stating whole repository and small subdirectory
  2012-02-15  8:57     ` Piotr Krukowiecki
@ 2012-02-15 11:01       ` Nguyen Thai Ngoc Duy
  2012-02-15 15:14         ` Piotr Krukowiecki
  2012-02-15 19:03       ` Jeff King
  1 sibling, 1 reply; 52+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-02-15 11:01 UTC (permalink / raw)
  To: Piotr Krukowiecki; +Cc: Thomas Rast, Git Mailing List

On Wed, Feb 15, 2012 at 3:57 PM, Piotr Krukowiecki
<piotr.krukowiecki@gmail.com> wrote:
> Indeed, after gc the times went down:
> 10s -> 2.3s (subdirectory)
> 17s -> 9.5s (whole repo)
>
> 2 seconds is much better and I'd say acceptable for me. But my questions are:
> - why is it so slow with not packed repo?
> - can it be faster without repacking?

gc does more than just repacking. If you still have the un-gc'd repo,
Try these commands one by one, and time "git status" after each:

 - git pack-refs --all --prune
 - git reflog expire --all
 - git repack -d -l
 - git prune --expire
 - git rerere gc

I'd be more interested in why auto-gc does not kick in (or whther it should).

> - even with packed repo, the time on small subdirectory is much higher
> than I'd expect given time on whole repo and subdirectory size - why?

Hard to say without measuring. I just notice that I missed your mail
with profiling results. I will have a look, but just in case, is the
repository publicly available?
-- 
Duy

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: git status: small difference between stating whole repository and small subdirectory
  2012-02-15 11:01       ` Nguyen Thai Ngoc Duy
@ 2012-02-15 15:14         ` Piotr Krukowiecki
  2012-02-16 13:22           ` Piotr Krukowiecki
  0 siblings, 1 reply; 52+ messages in thread
From: Piotr Krukowiecki @ 2012-02-15 15:14 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy; +Cc: Thomas Rast, Git Mailing List

On Wed, Feb 15, 2012 at 12:01 PM, Nguyen Thai Ngoc Duy
<pclouds@gmail.com> wrote:
> On Wed, Feb 15, 2012 at 3:57 PM, Piotr Krukowiecki
> <piotr.krukowiecki@gmail.com> wrote:
>> Indeed, after gc the times went down:
>> 10s -> 2.3s (subdirectory)
>> 17s -> 9.5s (whole repo)
>>
>> 2 seconds is much better and I'd say acceptable for me. But my questions are:
>> - why is it so slow with not packed repo?
>> - can it be faster without repacking?
>
> gc does more than just repacking. If you still have the un-gc'd repo,
> Try these commands one by one, and time "git status" after each:
>
>  - git pack-refs --all --prune
>  - git reflog expire --all
>  - git repack -d -l
>  - git prune --expire
>  - git rerere gc

It will take some time but hopefully I'll have the stats for tomorrow.


> I'd be more interested in why auto-gc does not kick in (or whther it should).

I don't have any specific options set, so default values should be used.

I'm using git-svn though, so my workflow looks like this:
   git svn fetch + git svn rebase
   ... git operations like commit, cherry-pick, rebase ...
   git svn dcommit

Not sure if that matters. I remember that I've seen auto-gc being run
several times in the past - I think after svn fetch/rebase.

I'm also using git-new-workdir and have 2 extra workdirs.


>> - even with packed repo, the time on small subdirectory is much higher
>> than I'd expect given time on whole repo and subdirectory size - why?
>
> Hard to say without measuring. I just notice that I missed your mail
> with profiling results. I will have a look, but just in case, is the
> repository publicly available?

Unfortunately it's not public. I can do some measuring if someone
tells me what to do.


-- 
Piotr Krukowiecki

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: git status: small difference between stating whole repository and small subdirectory
  2012-02-15  8:57     ` Piotr Krukowiecki
  2012-02-15 11:01       ` Nguyen Thai Ngoc Duy
@ 2012-02-15 19:03       ` Jeff King
  2012-02-16 13:37         ` Piotr Krukowiecki
  1 sibling, 1 reply; 52+ messages in thread
From: Jeff King @ 2012-02-15 19:03 UTC (permalink / raw)
  To: Piotr Krukowiecki; +Cc: Thomas Rast, Git Mailing List, Nguyen Thai Ngoc Duy

On Wed, Feb 15, 2012 at 09:57:29AM +0100, Piotr Krukowiecki wrote:

> All is on local disk and system is idle.
> 
> Indeed, after gc the times went down:
> 10s -> 2.3s (subdirectory)
> 17s -> 9.5s (whole repo)
> 
> 2 seconds is much better and I'd say acceptable for me. But my questions are:

Obviously these answers didn't come from any deep analysis, but are
educated guesses from me based on previous performance patterns we've
seen on the list:

> - why is it so slow with not packed repo?

Your numbers show that you're I/O-bound:

>>> $ time git status    > /dev/null
>>> real    0m41.670s
>>> user    0m0.980s
>>> sys     0m2.908s
>>>
>>> $ time git status -- src/.../somedir   > /dev/null
>>> real    0m17.380s
>>> user    0m0.748s
>>> sys     0m0.328s

which is not surprising, since you said you dropped caches before-hand.
Repacking probably reduced your disk footprint by a lot, which meant
less I/O.

I notice that you're still I/O bound even after the repack:

> $ time git status  -- .
> real    0m2.503s
> user    0m0.160s
> sys     0m0.096s
> 
> $ time git status
> real    0m9.663s
> user    0m0.232s
> sys     0m0.556s

Did you drop caches here, too?  Usually that would not be the case on a
warm cache. If it is, then it sounds like you are short on memory to
actually hold the directory tree and object db in cache. If not, what do
the warm cache numbers look like?

> - can it be faster without repacking?

Not really. You're showing an I/O problem, and repacking is git's way of
reducing I/O.

> - even with packed repo, the time on small subdirectory is much higher
> than I'd expect given time on whole repo and subdirectory size - why?

Hard to say without profiling.  It may be that we reduced the object db
lookups, saving some time, but still end up stat()ing the whole tree.
The optimization to stat only the directories of interest was in 688cd6d
(status: only touch path we may need to check, 2010-01-14), which went
into v1.7.0. What version of git are you using?

-Peff

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: git status: small difference between stating whole repository and small subdirectory
  2012-02-15 15:14         ` Piotr Krukowiecki
@ 2012-02-16 13:22           ` Piotr Krukowiecki
  0 siblings, 0 replies; 52+ messages in thread
From: Piotr Krukowiecki @ 2012-02-16 13:22 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy; +Cc: Thomas Rast, Git Mailing List, Jeff King

On Wed, Feb 15, 2012 at 4:14 PM, Piotr Krukowiecki
<piotr.krukowiecki@gmail.com> wrote:
> On Wed, Feb 15, 2012 at 12:01 PM, Nguyen Thai Ngoc Duy
> <pclouds@gmail.com> wrote:
>> On Wed, Feb 15, 2012 at 3:57 PM, Piotr Krukowiecki
>> <piotr.krukowiecki@gmail.com> wrote:
>>> Indeed, after gc the times went down:
>>> 10s -> 2.3s (subdirectory)
>>> 17s -> 9.5s (whole repo)
>>>
>>> 2 seconds is much better and I'd say acceptable for me. But my questions are:
>>> - why is it so slow with not packed repo?
>>> - can it be faster without repacking?
>>
>> gc does more than just repacking. If you still have the un-gc'd repo,
>> Try these commands one by one, and time "git status" after each:
>>
>>  - git pack-refs --all --prune
>>  - git reflog expire --all
>>  - git repack -d -l
>>  - git prune --expire
>>  - git rerere gc
>
> It will take some time but hopefully I'll have the stats for tomorrow.

Here they are. I did 'status' three times to get reliable results and
before each run have dropped caches. Backed up repository was copied
before each 'status'. Full log is at http://pastebin.com/VmB7J9CJ

git version 1.7.9.rc0.10.gbeecc

Results after each command:

status on whole repo:
18.5s - after git count-objects -v
16.0s - after git pack-refs --all --prune
20.2s - after git reflog expire --all
13.0s - after git repack -d -l
16.8s - after git prune --expire now
19.7s - after git rerere gc

status on subdir:
9.7s - after git count-objects -v
9.2s - after git pack-refs --all --prune
9.3s - after git reflog expire --all
4.4s - after git repack -d -l
9.2s - after git prune --expire now
9.0s - after git rerere gc


-- 
Piotr Krukowiecki

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: git status: small difference between stating whole repository and small subdirectory
  2012-02-15 19:03       ` Jeff King
@ 2012-02-16 13:37         ` Piotr Krukowiecki
  2012-02-16 14:05           ` Thomas Rast
  2012-02-16 19:20           ` Jeff King
  0 siblings, 2 replies; 52+ messages in thread
From: Piotr Krukowiecki @ 2012-02-16 13:37 UTC (permalink / raw)
  To: Jeff King; +Cc: Thomas Rast, Git Mailing List, Nguyen Thai Ngoc Duy

On Wed, Feb 15, 2012 at 8:03 PM, Jeff King <peff@peff.net> wrote:
> On Wed, Feb 15, 2012 at 09:57:29AM +0100, Piotr Krukowiecki wrote:
>>
> I notice that you're still I/O bound even after the repack:
>
>> $ time git status  -- .
>> real    0m2.503s
>> user    0m0.160s
>> sys     0m0.096s
>>
>> $ time git status
>> real    0m9.663s
>> user    0m0.232s
>> sys     0m0.556s
>
> Did you drop caches here, too?

Yes I did - with cache the status takes something like 0.1-0.3s on whole repo.


>  Usually that would not be the case on a
> warm cache. If it is, then it sounds like you are short on memory to
> actually hold the directory tree and object db in cache. If not, what do
> the warm cache numbers look like?

I've got 4GB of ram and I did not hit the swap when doing last
performance tests AFAIK.
Please see my previous posts for performance results with warm cache
and profile results:

http://article.gmane.org/gmane.comp.version-control.git/190397
http://article.gmane.org/gmane.comp.version-control.git/190638


>> - can it be faster without repacking?
>
> Not really. You're showing an I/O problem, and repacking is git's way of
> reducing I/O.

So if I understand correctly, the reason is because git must compare
workspace files with packed objects - and the problem is
reading/seeking/searching in the packs?

Is there a way to make packs better? I think most operations are on
workdir files - so maybe it'd be possible to tell gc/repack/whatever
to optimize access to files which I currently have in workdir?


>> - even with packed repo, the time on small subdirectory is much higher
>> than I'd expect given time on whole repo and subdirectory size - why?
>
> Hard to say without profiling.  It may be that we reduced the object db
> lookups, saving some time, but still end up stat()ing the whole tree.
> The optimization to stat only the directories of interest was in 688cd6d
> (status: only touch path we may need to check, 2010-01-14), which went
> into v1.7.0. What version of git are you using?

For latest tests I've used 1.7.9.rc0.10.gbeecc, for profiling - 1.7.9.188.g12766

Is there anything else I could do?


-- 
Piotr Krukowiecki

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: git status: small difference between stating whole repository and small subdirectory
  2012-02-16 13:37         ` Piotr Krukowiecki
@ 2012-02-16 14:05           ` Thomas Rast
  2012-02-16 20:15             ` Junio C Hamano
  2012-02-17 16:55             ` Piotr Krukowiecki
  2012-02-16 19:20           ` Jeff King
  1 sibling, 2 replies; 52+ messages in thread
From: Thomas Rast @ 2012-02-16 14:05 UTC (permalink / raw)
  To: Piotr Krukowiecki; +Cc: Jeff King, Git Mailing List, Nguyen Thai Ngoc Duy

Piotr Krukowiecki <piotr.krukowiecki@gmail.com> writes:

> On Wed, Feb 15, 2012 at 8:03 PM, Jeff King <peff@peff.net> wrote:
>> On Wed, Feb 15, 2012 at 09:57:29AM +0100, Piotr Krukowiecki wrote:
>>>
>> I notice that you're still I/O bound even after the repack:
>>
>>> $ time git status  -- .
>>> real    0m2.503s
>>> user    0m0.160s
>>> sys     0m0.096s
>>>
>>> $ time git status
>>> real    0m9.663s
>>> user    0m0.232s
>>> sys     0m0.556s
>>
>> Did you drop caches here, too?
>
> Yes I did - with cache the status takes something like 0.1-0.3s on whole repo.

So umm, I'm not sure that leaves anything to be improved.

I looked at some strace dumps, and limiting the status to a subdirectory
(in my case, '-- t' in git.git) does omit the lstat()s on uninteresting
parts of the index-listed files, as well as the getdents() (i.e.,
readdir()) for parts of the tree that are not interesting.

BTW, some other parts of git-status's display may be responsible for the
amount of data it pulls from disk.  In particular, the "Your branch is
ahead" display requires computing the merge-base between HEAD and
@{upstream}.  If your @{upstream} is way ahead/behind, or points at a
disjoint chunk of history, this may mean essentially pulling all of the
involved history from disk.  If my memory of pack organization serves
right, the commit objects involved would essentially be spread across
the whole pack (corresponding to "time") and thus this operation would
more or less load the entire pack from disk.

-- 
Thomas Rast
trast@{inf,student}.ethz.ch

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: git status: small difference between stating whole repository and small subdirectory
  2012-02-16 13:37         ` Piotr Krukowiecki
  2012-02-16 14:05           ` Thomas Rast
@ 2012-02-16 19:20           ` Jeff King
  2012-02-17 17:19             ` Piotr Krukowiecki
  1 sibling, 1 reply; 52+ messages in thread
From: Jeff King @ 2012-02-16 19:20 UTC (permalink / raw)
  To: Piotr Krukowiecki; +Cc: Thomas Rast, Git Mailing List, Nguyen Thai Ngoc Duy

On Thu, Feb 16, 2012 at 02:37:47PM +0100, Piotr Krukowiecki wrote:

> >> $ time git status  -- .
> >> real    0m2.503s
> >> user    0m0.160s
> >> sys     0m0.096s
> >>
> >> $ time git status
> >> real    0m9.663s
> >> user    0m0.232s
> >> sys     0m0.556s
> >
> > Did you drop caches here, too?
> 
> Yes I did - with cache the status takes something like 0.1-0.3s on whole repo.

OK, then that makes sense. It's pretty much just I/O on the filesystem
and on the object db.

You can break status down a little more to see which is which. Try "git
update-index --refresh" to see just how expensive the lstat and index
handling is.

And then try "git diff-index HEAD" for an idea of how expensive it is to
just read the objects and compare to the index.

> > Not really. You're showing an I/O problem, and repacking is git's way of
> > reducing I/O.
> 
> So if I understand correctly, the reason is because git must compare
> workspace files with packed objects - and the problem is
> reading/seeking/searching in the packs?

Mostly reading (we keep a sorted index and access the packfiles via
mmap, so we only touch the pages we need). But you're also paying to
lstat() the directory tree, too. And you're paying to load (probably)
the whole index into memory, although it's relatively compact compared
to the actual file data.

> Is there a way to make packs better? I think most operations are on
> workdir files - so maybe it'd be possible to tell gc/repack/whatever
> to optimize access to files which I currently have in workdir?

It already does optimize for that case. If you can make it even better,
I'm sure people would be happy to see the numbers.

Mostly I think it is just the case that disk I/O is slow, and the
operation you're asking for has to do a certain amount of it. What kind
of disk/filesystem are you pulling off of?

It's not a fuse filesystem by any chance, is it? I have a repo on an
encfs-mounted filesystem, and the lstat times are absolutely horrific.

-Peff

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: git status: small difference between stating whole repository and small subdirectory
  2012-02-16 14:05           ` Thomas Rast
@ 2012-02-16 20:15             ` Junio C Hamano
  2012-02-17 16:55             ` Piotr Krukowiecki
  1 sibling, 0 replies; 52+ messages in thread
From: Junio C Hamano @ 2012-02-16 20:15 UTC (permalink / raw)
  To: Thomas Rast
  Cc: Piotr Krukowiecki, Jeff King, Git Mailing List, Nguyen Thai Ngoc Duy

Thomas Rast <trast@inf.ethz.ch> writes:

> ...  If my memory of pack organization serves
> right, the commit objects involved would essentially be spread across
> the whole pack

No they are crumped together in a contiguous section in a packfile, so
that "git log" without any pathspec can go faster without consulting tree
objects.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: git status: small difference between stating whole repository and small subdirectory
  2012-02-16 14:05           ` Thomas Rast
  2012-02-16 20:15             ` Junio C Hamano
@ 2012-02-17 16:55             ` Piotr Krukowiecki
  1 sibling, 0 replies; 52+ messages in thread
From: Piotr Krukowiecki @ 2012-02-17 16:55 UTC (permalink / raw)
  To: Thomas Rast; +Cc: Jeff King, Git Mailing List, Nguyen Thai Ngoc Duy

On Thu, Feb 16, 2012 at 3:05 PM, Thomas Rast <trast@inf.ethz.ch> wrote:
> Piotr Krukowiecki <piotr.krukowiecki@gmail.com> writes:
>
>> On Wed, Feb 15, 2012 at 8:03 PM, Jeff King <peff@peff.net> wrote:
>>> On Wed, Feb 15, 2012 at 09:57:29AM +0100, Piotr Krukowiecki wrote:
>>>>
>>> I notice that you're still I/O bound even after the repack:
>>>
>>>> $ time git status  -- .
>>>> real    0m2.503s
>>>> user    0m0.160s
>>>> sys     0m0.096s
>>>>
>>>> $ time git status
>>>> real    0m9.663s
>>>> user    0m0.232s
>>>> sys     0m0.556s
>>>
>>> Did you drop caches here, too?
>>
>> Yes I did - with cache the status takes something like 0.1-0.3s on whole repo.
>
> So umm, I'm not sure that leaves anything to be improved.

But even with caches time on small directory is only half of time on whole repo:
0.15s vs 0.07s


> I looked at some strace dumps, and limiting the status to a subdirectory
> (in my case, '-- t' in git.git) does omit the lstat()s on uninteresting
> parts of the index-listed files, as well as the getdents() (i.e.,
> readdir()) for parts of the tree that are not interesting.

Same in my case.

I've run strace with -c which shows system calls times. As I
understand the results the time used by lstat() is very small. With
dropped caches:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 90.70    0.065108           3     25361        12 lstat
  6.78    0.004869           1      6534           getdents
  1.94    0.001392           0      6964      3238 open
  0.27    0.000194           0      3726           close


> BTW, some other parts of git-status's display may be responsible for the
> amount of data it pulls from disk.  In particular, the "Your branch is
> ahead" display requires computing the merge-base between HEAD and
> @{upstream}.  If your @{upstream} is way ahead/behind, or points at a
> disjoint chunk of history, this may mean essentially pulling all of the
> involved history from disk.  If my memory of pack organization serves
> right, the commit objects involved would essentially be spread across
> the whole pack (corresponding to "time") and thus this operation would
> more or less load the entire pack from disk.

I don't think this is the case - I'm using git-svn and thus have no
upstream in git meaning.


-- 
Piotr Krukowiecki

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: git status: small difference between stating whole repository and small subdirectory
  2012-02-16 19:20           ` Jeff King
@ 2012-02-17 17:19             ` Piotr Krukowiecki
  2012-02-17 20:37               ` Jeff King
  0 siblings, 1 reply; 52+ messages in thread
From: Piotr Krukowiecki @ 2012-02-17 17:19 UTC (permalink / raw)
  To: Jeff King; +Cc: Thomas Rast, Git Mailing List, Nguyen Thai Ngoc Duy

On Thu, Feb 16, 2012 at 8:20 PM, Jeff King <peff@peff.net> wrote:
> On Thu, Feb 16, 2012 at 02:37:47PM +0100, Piotr Krukowiecki wrote:
>
>> >> $ time git status  -- .
>> >> real    0m2.503s
>> >> user    0m0.160s
>> >> sys     0m0.096s
>> >>
>> >> $ time git status
>> >> real    0m9.663s
>> >> user    0m0.232s
>> >> sys     0m0.556s
>> >
>> > Did you drop caches here, too?
>>
>> Yes I did - with cache the status takes something like 0.1-0.3s on whole repo.
>
> OK, then that makes sense. It's pretty much just I/O on the filesystem
> and on the object db.
>
> You can break status down a little more to see which is which. Try "git
> update-index --refresh" to see just how expensive the lstat and index
> handling is.

"git update-index --refresh" with dropped cache took
real	0m3.726s
user	0m0.024s
sys	0m0.404s

while "git status" with dropped cache takes
real	0m13.578s
user	0m0.240s
sys	0m0.600s

I'm not sure why it takes more than the 9s reported before - IIRC I
did the previous test in single mode under bare shell and this time
I'm testing under gnome. This or it's the effect of running
update-index :/
Now status on subdir takes 9.5s. But still the
not-much-faster-status-on-subdir rule is true.


> And then try "git diff-index HEAD" for an idea of how expensive it is to
> just read the objects and compare to the index.

The diff-index after dropping cache takes
real	0m14.095s
user	0m0.268s
sys	0m0.564s


>> > Not really. You're showing an I/O problem, and repacking is git's way of
>> > reducing I/O.
>>
>> So if I understand correctly, the reason is because git must compare
>> workspace files with packed objects - and the problem is
>> reading/seeking/searching in the packs?
>
> Mostly reading (we keep a sorted index and access the packfiles via
> mmap, so we only touch the pages we need). But you're also paying to
> lstat() the directory tree, too. And you're paying to load (probably)
> the whole index into memory, although it's relatively compact compared
> to the actual file data.

If the index is the objects/pack/*.idx files than it's 21MB


>> Is there a way to make packs better? I think most operations are on
>> workdir files - so maybe it'd be possible to tell gc/repack/whatever
>> to optimize access to files which I currently have in workdir?
>
> It already does optimize for that case. If you can make it even better,
> I'm sure people would be happy to see the numbers.

If I understand correctly, you only need to compute sha1 on the
workdir files and compare it with sha1 files recorded in index/gitdir.
It seems that to get the sha1 from index/gitdir I need to read the
packfiles? Maybe it'd be possible to cache/index it somehow, for
example in separate and smaller file?


> Mostly I think it is just the case that disk I/O is slow, and the
> operation you're asking for has to do a certain amount of it. What kind
> of disk/filesystem are you pulling off of?
>
> It's not a fuse filesystem by any chance, is it? I have a repo on an
> encfs-mounted filesystem, and the lstat times are absolutely horrific.

No, it's ext4 and the disk Seagate Barracuda 7200.12 500GB, as it
reads on the cover :)

But IMO faster disk won't help with this - times will be smaller, but
you'll still have to read the same data, so the subdir times will be
just 2x faster than whole repo, won't it? So maybe in my case it will
go down to e.g. 2s on subdir, but for someone with larger repository
it will still be 10s...


-- 
Piotr Krukowiecki

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: git status: small difference between stating whole repository and small subdirectory
  2012-02-17 17:19             ` Piotr Krukowiecki
@ 2012-02-17 20:37               ` Jeff King
  2012-02-17 22:25                 ` Junio C Hamano
  0 siblings, 1 reply; 52+ messages in thread
From: Jeff King @ 2012-02-17 20:37 UTC (permalink / raw)
  To: Piotr Krukowiecki; +Cc: Thomas Rast, Git Mailing List, Nguyen Thai Ngoc Duy

On Fri, Feb 17, 2012 at 06:19:06PM +0100, Piotr Krukowiecki wrote:

> "git update-index --refresh" with dropped cache took
> real	0m3.726s
> user	0m0.024s
> sys	0m0.404s
> [...]
> The diff-index after dropping cache takes
> real	0m14.095s
> user	0m0.268s
> sys	0m0.564s

OK, that suggests to me that the real culprit is the I/O we spend in
accessing the object db, since that is the main I/O that happens in the
second command but not the first.

> > Mostly reading (we keep a sorted index and access the packfiles via
> > mmap, so we only touch the pages we need). But you're also paying to
> > lstat() the directory tree, too. And you're paying to load (probably)
> > the whole index into memory, although it's relatively compact compared
> > to the actual file data.
> 
> If the index is the objects/pack/*.idx files than it's 21MB

Yes, that's it. Though we don't necessarily read the whole thing. The
sorted list of sha1s is only a part of that. And we mmap and
binary-search that, so we only have to fault in pages that are actually
used in our binary search.

However, we're faulting in random pages of the index in series, so it
may actually have a lot of latency. You can see how expensive the I/O on
the index is with something like this:

  [whole operation, for reference]
  $ sudo sh -c 'echo 3 >/proc/sys/vm/drop_caches'
  $ time git diff-index HEAD
  real    0m2.636s
  user    0m0.248s
  sys     0m0.392s

  [prime the cache with just the index]
  $ sudo sh -c 'echo 3 >/proc/sys/vm/drop_caches'
  $ time cat .git/objects/pack/*.idx >/dev/null
  real    0m0.288s
  user    0m0.000s
  sys     0m0.028s
  $ time git diff-index HEAD
  real    0m2.175s
  user    0m0.272s
  sys     0m0.320s

So roughly 20% of the I/O time in my case went to faulting in the index.
You could pre-fault in the index, which would give the OS a chance to do
read-ahead caching. You can see that the combined cat and diff-index
times are still lower than the raw diff-index time. You could also do
them in parallel, but that will create some additional seeks as the
threads fight for the disk, but may be a win in the long run because we
can read bigger chunks. You can roughly simulate it by running the "cat"
and the "diff-index" above in parallel.  I get:

  real    0m2.464s
  user    0m0.284s
  sys     0m0.372s

which is almost exactly the same as doing them separately (though note
that this is on an SSD, so seeking is very cheap).

But the bulk of the time still goes to actually retrieving the object
data, so that's probably a more interesting area to focus, anyway (and
if we can reduce object accesses, we reduce their lookup, too :) ).

> If I understand correctly, you only need to compute sha1 on the
> workdir files and compare it with sha1 files recorded in index/gitdir.
> It seems that to get the sha1 from index/gitdir I need to read the
> packfiles? Maybe it'd be possible to cache/index it somehow, for
> example in separate and smaller file?

There are two diffs going on in "git status". One is a comparison
between index and worktree. In that one, you need to lstat each file to
make sure the cached sha1 we have in the index is up to date. Assuming
it is, you don't need to touch the file data at all. Then you compare
that sha1 to the stage 0 sha1 (i.e., what we typically think of as
"staged for commit"). If they match, you don't need to do more work.

But the expensive diff-index we've been doing above is comparing the
index to the HEAD tree. And doing that is a little trickier. The index
is a flat list of files with their sha1s. But the HEAD tree is stored
hierarchically. So to get the sha1 of foo/bar/baz, we have to access the
root tree object, find the "foo" entry, access its tree object, find the
"bar" entry, access its tree object, and then find the "baz" entry. Then
we compare the sha1 of the "baz" entry to what's in the index.

So what's where your I/O comes from: accessing each of the tree objects.
And that fact that it isn't just "compare the HEAD and index sha1s" is
that the index is stored as a list of flat files.

That being said, we do have an index extension to store the tree sha1 of
whole directories (i.e., we populate it when we write a whole tree or
subtree into the index from the object db, and it becomes invalidated
when a file becomes modified). This optimization is used by things like
"git commit" to avoid having to recreate the same sub-trees over and
over when creating tree objects from the index. But we could also use it
here to avoid having to even read the sub-tree objects from the object
db.

> No, it's ext4 and the disk Seagate Barracuda 7200.12 500GB, as it
> reads on the cover :)
> 
> But IMO faster disk won't help with this - times will be smaller, but
> you'll still have to read the same data, so the subdir times will be
> just 2x faster than whole repo, won't it? So maybe in my case it will
> go down to e.g. 2s on subdir, but for someone with larger repository
> it will still be 10s...

Sure. But a certain amount of I/O is going to be unavoidable to get the
answer to your question. So you will never be able to achieve the
warm-cache case. I'm not saying we can't improve (e.g., I think the
index extension thing I mentioned above is a promising approach). But we
have to be realistic about what will make things faster; if I/O is your
problem, faster disk is one possible solution (especially because some
of this is related to seeking and latency, an SSD is a nice improvement
for cold-cache times).

-Peff

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: git status: small difference between stating whole repository and small subdirectory
  2012-02-17 20:37               ` Jeff King
@ 2012-02-17 22:25                 ` Junio C Hamano
  2012-02-17 22:29                   ` Jeff King
  0 siblings, 1 reply; 52+ messages in thread
From: Junio C Hamano @ 2012-02-17 22:25 UTC (permalink / raw)
  To: Jeff King
  Cc: Piotr Krukowiecki, Thomas Rast, Git Mailing List, Nguyen Thai Ngoc Duy

Jeff King <peff@peff.net> writes:

> That being said, we do have an index extension to store the tree sha1 of
> whole directories (i.e., we populate it when we write a whole tree or
> subtree into the index from the object db, and it becomes invalidated
> when a file becomes modified). This optimization is used by things like
> "git commit" to avoid having to recreate the same sub-trees over and
> over when creating tree objects from the index. But we could also use it
> here to avoid having to even read the sub-tree objects from the object
> db.

Like b65982b (Optimize "diff-index --cached" using cache-tree, 2009-05-20)
perhaps?

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: git status: small difference between stating whole repository and small subdirectory
  2012-02-17 22:25                 ` Junio C Hamano
@ 2012-02-17 22:29                   ` Jeff King
  2012-02-20  8:25                     ` Piotr Krukowiecki
  0 siblings, 1 reply; 52+ messages in thread
From: Jeff King @ 2012-02-17 22:29 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Piotr Krukowiecki, Thomas Rast, Git Mailing List, Nguyen Thai Ngoc Duy

On Fri, Feb 17, 2012 at 02:25:25PM -0800, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > That being said, we do have an index extension to store the tree sha1 of
> > whole directories (i.e., we populate it when we write a whole tree or
> > subtree into the index from the object db, and it becomes invalidated
> > when a file becomes modified). This optimization is used by things like
> > "git commit" to avoid having to recreate the same sub-trees over and
> > over when creating tree objects from the index. But we could also use it
> > here to avoid having to even read the sub-tree objects from the object
> > db.
> 
> Like b65982b (Optimize "diff-index --cached" using cache-tree, 2009-05-20)
> perhaps?

That's what I get for speaking before running "git log".

So yeah, we may be about as reasonably fast as we can go. Or maybe that
optimization isn't kicking in for some reason. I think going further
would require Piotr to do more profiling.

-Peff

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: git status: small difference between stating whole repository and small subdirectory
  2012-02-17 22:29                   ` Jeff King
@ 2012-02-20  8:25                     ` Piotr Krukowiecki
  2012-02-20 14:06                       ` Jeff King
  0 siblings, 1 reply; 52+ messages in thread
From: Piotr Krukowiecki @ 2012-02-20  8:25 UTC (permalink / raw)
  To: Jeff King
  Cc: Junio C Hamano, Thomas Rast, Git Mailing List, Nguyen Thai Ngoc Duy

On Fri, Feb 17, 2012 at 11:29 PM, Jeff King <peff@peff.net> wrote:
> On Fri, Feb 17, 2012 at 02:25:25PM -0800, Junio C Hamano wrote:
>
>> Jeff King <peff@peff.net> writes:
>>
>> > That being said, we do have an index extension to store the tree sha1 of
>> > whole directories (i.e., we populate it when we write a whole tree or
>> > subtree into the index from the object db, and it becomes invalidated
>> > when a file becomes modified). This optimization is used by things like
>> > "git commit" to avoid having to recreate the same sub-trees over and
>> > over when creating tree objects from the index. But we could also use it
>> > here to avoid having to even read the sub-tree objects from the object
>> > db.
>>
>> Like b65982b (Optimize "diff-index --cached" using cache-tree, 2009-05-20)
>> perhaps?
>
> That's what I get for speaking before running "git log".
>
> So yeah, we may be about as reasonably fast as we can go. Or maybe that
> optimization isn't kicking in for some reason. I think going further
> would require Piotr to do more profiling.

Is the cache set? Not sure how to check it. t0090-cache-tree.sh uses
test-dump-cache-tree and executes "read-tree HEAD" to establish the
cache, but in my case read-tree does not make the cache dumpable (but
it improves status performance).

$ test-dump-cache-tree  | wc -l
0
$ git read-tree HEAD
$ test-dump-cache-tree  | wc -l
0
$ echo 3 | sudo tee /proc/sys/vm/drop_caches && time git status -- .
[...]
real	0m1.085s


git version 1.7.9.188.g12766

-- 
Piotr Krukowiecki

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: git status: small difference between stating whole repository and small subdirectory
  2012-02-20  8:25                     ` Piotr Krukowiecki
@ 2012-02-20 14:06                       ` Jeff King
  2012-02-20 14:09                         ` Thomas Rast
                                           ` (2 more replies)
  0 siblings, 3 replies; 52+ messages in thread
From: Jeff King @ 2012-02-20 14:06 UTC (permalink / raw)
  To: Piotr Krukowiecki
  Cc: Junio C Hamano, Thomas Rast, Git Mailing List, Nguyen Thai Ngoc Duy

On Mon, Feb 20, 2012 at 09:25:00AM +0100, Piotr Krukowiecki wrote:

> Is the cache set? Not sure how to check it. t0090-cache-tree.sh uses
> test-dump-cache-tree and executes "read-tree HEAD" to establish the
> cache, but in my case read-tree does not make the cache dumpable (but
> it improves status performance).
> 
> $ test-dump-cache-tree  | wc -l
> 0
> $ git read-tree HEAD
> $ test-dump-cache-tree  | wc -l
> 0
> $ echo 3 | sudo tee /proc/sys/vm/drop_caches && time git status -- .
> [...]
> real	0m1.085s

Hmm. I would think test-dump-cache-tree would do it. I don't know why
read-tree wouldn't fill it in, though.

Interestingly, on my git.git repo, I had an empty cache. Running "git
read-tree HEAD" filled it (according to test-dump-cache-tree). It seems
that running "git checkout" empties the cache.  So perhaps git could do
better about keeping the cache valid over time.

-Peff

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: git status: small difference between stating whole repository and small subdirectory
  2012-02-20 14:06                       ` Jeff King
@ 2012-02-20 14:09                         ` Thomas Rast
  2012-02-20 14:36                           ` Nguyen Thai Ngoc Duy
  2012-02-20 19:57                           ` Junio C Hamano
  2012-02-20 14:16                         ` Nguyen Thai Ngoc Duy
  2012-02-20 19:56                         ` Junio C Hamano
  2 siblings, 2 replies; 52+ messages in thread
From: Thomas Rast @ 2012-02-20 14:09 UTC (permalink / raw)
  To: Jeff King
  Cc: Piotr Krukowiecki, Junio C Hamano, Git Mailing List,
	Nguyen Thai Ngoc Duy

Jeff King <peff@peff.net> writes:

> On Mon, Feb 20, 2012 at 09:25:00AM +0100, Piotr Krukowiecki wrote:
>
>> Is the cache set? Not sure how to check it. t0090-cache-tree.sh uses
>> test-dump-cache-tree and executes "read-tree HEAD" to establish the
>> cache, but in my case read-tree does not make the cache dumpable (but
>> it improves status performance).
>> 
>> $ test-dump-cache-tree  | wc -l
>> 0
>> $ git read-tree HEAD
>> $ test-dump-cache-tree  | wc -l
>> 0
>> $ echo 3 | sudo tee /proc/sys/vm/drop_caches && time git status -- .
>> [...]
>> real	0m1.085s
>
> Hmm. I would think test-dump-cache-tree would do it. I don't know why
> read-tree wouldn't fill it in, though.
>
> Interestingly, on my git.git repo, I had an empty cache. Running "git
> read-tree HEAD" filled it (according to test-dump-cache-tree). It seems
> that running "git checkout" empties the cache.  So perhaps git could do
> better about keeping the cache valid over time.

test_expect_failure 'checkout gives cache-tree' '
	git checkout HEAD^ &&
	test_shallow_cache_tree
'

;-)

-- 
Thomas Rast
trast@{inf,student}.ethz.ch

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: git status: small difference between stating whole repository and small subdirectory
  2012-02-20 14:06                       ` Jeff King
  2012-02-20 14:09                         ` Thomas Rast
@ 2012-02-20 14:16                         ` Nguyen Thai Ngoc Duy
  2012-02-20 14:22                           ` Jeff King
  2012-02-20 19:56                         ` Junio C Hamano
  2 siblings, 1 reply; 52+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-02-20 14:16 UTC (permalink / raw)
  To: Jeff King
  Cc: Piotr Krukowiecki, Junio C Hamano, Thomas Rast, Git Mailing List

On Mon, Feb 20, 2012 at 9:06 PM, Jeff King <peff@peff.net> wrote:
> Interestingly, on my git.git repo, I had an empty cache. Running "git
> read-tree HEAD" filled it (according to test-dump-cache-tree). It seems
> that running "git checkout" empties the cache.  So perhaps git could do
> better about keeping the cache valid over time.

For fast forward case when result index matches 100% destination tree,
yeah we should repopulate cache-tree. "git reset" does that. Not sure
about other cases though. I don't think we can keep track what
subtrees are unchanged after unpack_trees() in order to keep them.
-- 
Duy

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: git status: small difference between stating whole repository and small subdirectory
  2012-02-20 14:16                         ` Nguyen Thai Ngoc Duy
@ 2012-02-20 14:22                           ` Jeff King
  0 siblings, 0 replies; 52+ messages in thread
From: Jeff King @ 2012-02-20 14:22 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy
  Cc: Piotr Krukowiecki, Junio C Hamano, Thomas Rast, Git Mailing List

On Mon, Feb 20, 2012 at 09:16:43PM +0700, Nguyen Thai Ngoc Duy wrote:

> On Mon, Feb 20, 2012 at 9:06 PM, Jeff King <peff@peff.net> wrote:
> > Interestingly, on my git.git repo, I had an empty cache. Running "git
> > read-tree HEAD" filled it (according to test-dump-cache-tree). It seems
> > that running "git checkout" empties the cache.  So perhaps git could do
> > better about keeping the cache valid over time.
> 
> For fast forward case when result index matches 100% destination tree,
> yeah we should repopulate cache-tree. "git reset" does that. Not sure
> about other cases though. I don't think we can keep track what
> subtrees are unchanged after unpack_trees() in order to keep them.

Yeah, doing it after unpack_trees seems crazy. But I really feel like
unpack_trees should be able to handle cache updates as it unpacks. It
knows what is being updated and what is being merged. But maybe it is
more complicated than that; I haven't looked at the code yet.

-Peff

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: git status: small difference between stating whole repository and small subdirectory
  2012-02-20 14:09                         ` Thomas Rast
@ 2012-02-20 14:36                           ` Nguyen Thai Ngoc Duy
  2012-02-20 14:39                             ` Jeff King
  2012-02-21 14:45                             ` Nguyen Thai Ngoc Duy
  2012-02-20 19:57                           ` Junio C Hamano
  1 sibling, 2 replies; 52+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-02-20 14:36 UTC (permalink / raw)
  To: Thomas Rast
  Cc: Jeff King, Piotr Krukowiecki, Junio C Hamano, Git Mailing List

On Mon, Feb 20, 2012 at 03:09:57PM +0100, Thomas Rast wrote:
> > Interestingly, on my git.git repo, I had an empty cache. Running "git
> > read-tree HEAD" filled it (according to test-dump-cache-tree). It seems
> > that running "git checkout" empties the cache.  So perhaps git could do
> > better about keeping the cache valid over time.
> 
> test_expect_failure 'checkout gives cache-tree' '
> 	git checkout HEAD^ &&
> 	test_shallow_cache_tree
> '
> 
> ;-)

Quick and dirty that passes that test. I think we could do better if
we analyse two way merge rules carefully and avoid this diff, but
that's too much for me right now.

-- 8< --
diff --git a/builtin/checkout.c b/builtin/checkout.c
index 5bf96ba..c06287a 100644
--- a/builtin/checkout.c
+++ b/builtin/checkout.c
@@ -319,6 +319,10 @@ static void show_local_changes(struct object *head, struct diff_options *opts)
 		die(_("diff_setup_done failed"));
 	add_pending_object(&rev, head, NULL);
 	run_diff_index(&rev, 0);
+	if (!DIFF_OPT_TST(&rev.diffopt, HAS_CHANGES)) {
+		struct tree *tree = parse_tree_indirect(head->sha1);
+		prime_cache_tree(&active_cache_tree, tree);
+	}
 }
 
 static void describe_detached_head(const char *msg, struct commit *commit)
@@ -493,13 +497,13 @@ static int merge_working_tree(struct checkout_opts *opts,
 		}
 	}
 
+	if (!opts->force && !opts->quiet)
+		show_local_changes(&new->commit->object, &opts->diff_options);
+
 	if (write_cache(newfd, active_cache, active_nr) ||
 	    commit_locked_index(lock_file))
 		die(_("unable to write new index file"));
 
-	if (!opts->force && !opts->quiet)
-		show_local_changes(&new->commit->object, &opts->diff_options);
-
 	return 0;
 }
 
-- 8< --
-- 
Duy

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: git status: small difference between stating whole repository and small subdirectory
  2012-02-20 14:36                           ` Nguyen Thai Ngoc Duy
@ 2012-02-20 14:39                             ` Jeff King
  2012-02-20 15:11                               ` Jeff King
  2012-02-21 14:45                             ` Nguyen Thai Ngoc Duy
  1 sibling, 1 reply; 52+ messages in thread
From: Jeff King @ 2012-02-20 14:39 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy
  Cc: Thomas Rast, Piotr Krukowiecki, Junio C Hamano, Git Mailing List

On Mon, Feb 20, 2012 at 09:36:44PM +0700, Nguyen Thai Ngoc Duy wrote:

> > test_expect_failure 'checkout gives cache-tree' '
> > 	git checkout HEAD^ &&
> > 	test_shallow_cache_tree
> > '
> > 
> > ;-)
> 
> Quick and dirty that passes that test. I think we could do better if
> we analyse two way merge rules carefully and avoid this diff, but
> that's too much for me right now.

Unpack trees is already sprinkled with cache_tree_invalidate_path. But
something seems to throw away the cache tree entirely (I think it may be
that the extension simply isn't copied over to the destination index).
I'm walking through it right now.

-Peff

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: git status: small difference between stating whole repository and small subdirectory
  2012-02-20 14:39                             ` Jeff King
@ 2012-02-20 15:11                               ` Jeff King
  2012-02-20 18:45                                 ` Thomas Rast
  2012-02-20 20:08                                 ` Junio C Hamano
  0 siblings, 2 replies; 52+ messages in thread
From: Jeff King @ 2012-02-20 15:11 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy
  Cc: Thomas Rast, Piotr Krukowiecki, Junio C Hamano, Git Mailing List

On Mon, Feb 20, 2012 at 09:39:52AM -0500, Jeff King wrote:

> On Mon, Feb 20, 2012 at 09:36:44PM +0700, Nguyen Thai Ngoc Duy wrote:
> 
> > > test_expect_failure 'checkout gives cache-tree' '
> > > 	git checkout HEAD^ &&
> > > 	test_shallow_cache_tree
> > > '
> > > 
> > > ;-)
> > 
> > Quick and dirty that passes that test. I think we could do better if
> > we analyse two way merge rules carefully and avoid this diff, but
> > that's too much for me right now.
> 
> Unpack trees is already sprinkled with cache_tree_invalidate_path. But
> something seems to throw away the cache tree entirely (I think it may be
> that the extension simply isn't copied over to the destination index).
> I'm walking through it right now.

Hmm. OK, this doesn't pass the test, but I think it is better than the
current behavior.

Basically, what happens now with "git checkout" is this:

  1. read_cache pulls the cache_tree from disk into the_index

  2. we call unpack_trees with o->src_index == o->dst_index ==
     &the_index.

  3. during tree traversal, unpack_trees callbacks properly calls
     cache_tree_invalidate_path whenever it updates a path. We write the
     results into o->result.

  4. At the end of unpack_trees, we forget about src_index, and copy
     o->result into *o->dst_index byte for byte. I.e., we overwrite
     the_index.cache_tree, which has been properly updated the whole
     time, with NULL, dropping it entirely (in fact, I believe it even
     creates a memory leak of the old cache_tree).

This one-liner improves that a bit:

diff --git a/unpack-trees.c b/unpack-trees.c
index 8be3f6c..e8aedea 100644
--- a/unpack-trees.c
+++ b/unpack-trees.c
@@ -1135,6 +1135,7 @@ int unpack_trees(unsigned len, struct tree_desc *t, struct unpack_trees_options
 		}
 	}

+	o->result.cache_tree = o->src_index->cache_tree;
 	o->src_index = NULL;
 	ret = check_updates(o) ? (-2) : 0;
 	if (o->dst_index)

by copying the cache_tree (which has been updated all along) from
src_index into the result (which will then make it into the
destination index, which of course in this case is the same as the
source index, anyway).

It makes "git checkout" with no changes just work (since we preserve the
cache tree, and it doesn't need updated). It makes something like "git
checkout HEAD^" work, keeping most of the cache-tree intact, but
invalidating trees containing paths that were modified.

But it does not actually insert the _destination_ tree into the cache
tree. Which we can do in certain situations, but only if there were no
paths in the tree that were left unchanged (e.g., you modify "foo", then
"git checkout HEAD^", which updates "bar". Your tree does not match
HEAD^, and must be invalidated).  While it would be cool to be able to
handle those complex cases, making this one simple change covers most of
the cases people care about (i.e., leaving the cache-tree intact for all
of the directories that weren't touched at all).

I think this implementation matches the intent of the original calls to
cache_tree_invalidate_path sprinkled throughout unpack-trees.c. But I
have to say that it seems a little odd for us to be modifying the
o->src_index throughout the whole thing. I would think the right thing
would be to make a deep copy of o->src_index->cache_tree into
o->result.cache_tree as the very first thing, and then update
o->result.cache_tree throughout the tree traversal. There is no point in
invalidating src_index's cache_tree, since it is not receiving the
updates.

In practice, this doesn't tend to matter because everybody just sets src
and dst to &the_index anyway. The one exception seems to be diff_cache,
which sets dst_index to NULL. But it doesn't matter there, because we
are only using oneway_diff as our callback, which does not actually
write or invalidate anything in the cache.

-Peff

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: git status: small difference between stating whole repository and small subdirectory
  2012-02-20 15:11                               ` Jeff King
@ 2012-02-20 18:45                                 ` Thomas Rast
  2012-02-20 20:35                                   ` Jeff King
  2012-02-20 20:08                                 ` Junio C Hamano
  1 sibling, 1 reply; 52+ messages in thread
From: Thomas Rast @ 2012-02-20 18:45 UTC (permalink / raw)
  To: Jeff King
  Cc: Nguyen Thai Ngoc Duy, Piotr Krukowiecki, Junio C Hamano,
	Git Mailing List

Jeff King <peff@peff.net> writes:

> diff --git a/unpack-trees.c b/unpack-trees.c
> index 8be3f6c..e8aedea 100644
> --- a/unpack-trees.c
> +++ b/unpack-trees.c
> @@ -1135,6 +1135,7 @@ int unpack_trees(unsigned len, struct tree_desc *t, struct unpack_trees_options
>  		}
>  	}
>  
> +	o->result.cache_tree = o->src_index->cache_tree;
>  	o->src_index = NULL;
>  	ret = check_updates(o) ? (-2) : 0;
>  	if (o->dst_index)

Brilliant.  I know I'm stealing Junio's punchline, but please make it so
:-)

Browsing around in history, it seems that this was silently broken by
34110cd (Make 'unpack_trees()' have a separate source and destination
index, 2008-03-06), which introduced the distinction between source and
destination index.  Before that they were the same, so the cache tree
would have been updated correctly.

> It makes "git checkout" with no changes just work (since we preserve the
> cache tree, and it doesn't need updated). It makes something like "git
> checkout HEAD^" work, keeping most of the cache-tree intact, but
> invalidating trees containing paths that were modified.

Great.  Here's a test you could use.  It's a bit noisy because the
shallow in test_shallow_cache_tree no longer made any sense, but I think
it tests what we want to see.

diff --git i/t/t0090-cache-tree.sh w/t/t0090-cache-tree.sh
index 6c33e28..5706305 100755
--- i/t/t0090-cache-tree.sh
+++ w/t/t0090-cache-tree.sh
@@ -16,14 +16,16 @@ cmp_cache_tree () {
 # We don't bother with actually checking the SHA1:
 # test-dump-cache-tree already verifies that all existing data is
 # correct.
-test_shallow_cache_tree () {
-	printf "SHA  (%d entries, 0 subtrees)\n" $(git ls-files|wc -l) >expect &&
+test_cache_tree () {
+	printf "SHA  (%d entries, 1 subtrees)\n" $(git ls-files|wc -l) >expect &&
+	printf "SHA sub/ (%d entries, 0 subtrees)\n" $(git ls-files sub|wc -l) >>expect &&
 	cmp_cache_tree expect
 }
 
 test_invalid_cache_tree () {
-	echo "invalid                                   (0 subtrees)" >expect &&
-	printf "SHA #(ref)  (%d entries, 0 subtrees)\n" $(git ls-files|wc -l) >>expect &&
+	echo "invalid                                   (1 subtrees)" >expect &&
+	printf "SHA #(ref)  (%d entries, 1 subtrees)\n" $(git ls-files|wc -l) >>expect &&
+	printf "SHA sub/ (%d entries, 0 subtrees)\n" $(git ls-files sub|wc -l) >>expect &&
 	cmp_cache_tree expect
 }
 
@@ -33,13 +35,16 @@ test_no_cache_tree () {
 }
 
 test_expect_failure 'initial commit has cache-tree' '
+	mkdir sub &&
+	echo bar > sub/bar &&
+	git add sub/bar &&
 	test_commit foo &&
-	test_shallow_cache_tree
+	test_cache_tree
 '
 
 test_expect_success 'read-tree HEAD establishes cache-tree' '
 	git read-tree HEAD &&
-	test_shallow_cache_tree
+	test_cache_tree
 '
 
 test_expect_success 'git-add invalidates cache-tree' '
@@ -59,7 +64,7 @@ test_expect_success 'update-index invalidates cache-tree' '
 test_expect_success 'write-tree establishes cache-tree' '
 	test-scrap-cache-tree &&
 	git write-tree &&
-	test_shallow_cache_tree
+	test_cache_tree
 '
 
 test_expect_success 'test-scrap-cache-tree works' '
@@ -70,24 +75,39 @@ test_expect_success 'test-scrap-cache-tree works' '
 
 test_expect_success 'second commit has cache-tree' '
 	test_commit bar &&
-	test_shallow_cache_tree
+	test_cache_tree
 '
 
 test_expect_success 'reset --hard gives cache-tree' '
 	test-scrap-cache-tree &&
 	git reset --hard &&
-	test_shallow_cache_tree
+	test_cache_tree
 '
 
 test_expect_success 'reset --hard without index gives cache-tree' '
 	rm -f .git/index &&
 	git reset --hard &&
-	test_shallow_cache_tree
+	test_cache_tree
 '
 
-test_expect_failure 'checkout gives cache-tree' '
+test_expect_success 'checkout HEAD leaves cache-tree intact' '
+	git read-tree HEAD &&
+	git checkout HEAD &&
+	test_cache_tree
+'
+
+# NEEDSWORK: only one of these two can succeed.  The second is there
+# because it would be the better result.
+test_expect_success 'checkout HEAD^ correctly invalidates cache-tree' '
+	git checkout HEAD^ &&
+	test_invalid_cache_tree
+'
+
+test_expect_failure 'checkout HEAD^ gives full cache-tree' '
+	git checkout master &&
+	git read-tree HEAD &&
 	git checkout HEAD^ &&
-	test_shallow_cache_tree
+	test_cache_tree
 '
 
 test_done

-- 
Thomas Rast
trast@{inf,student}.ethz.ch

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: git status: small difference between stating whole repository and small subdirectory
  2012-02-20 14:06                       ` Jeff King
  2012-02-20 14:09                         ` Thomas Rast
  2012-02-20 14:16                         ` Nguyen Thai Ngoc Duy
@ 2012-02-20 19:56                         ` Junio C Hamano
  2012-02-20 20:09                           ` Jeff King
  2 siblings, 1 reply; 52+ messages in thread
From: Junio C Hamano @ 2012-02-20 19:56 UTC (permalink / raw)
  To: Jeff King
  Cc: Piotr Krukowiecki, Thomas Rast, Git Mailing List, Nguyen Thai Ngoc Duy

Jeff King <peff@peff.net> writes:

> Interestingly, on my git.git repo, I had an empty cache. Running "git
> read-tree HEAD" filled it (according to test-dump-cache-tree). It seems
> that running "git checkout" empties the cache.  So perhaps git could do
> better about keeping the cache valid over time.

At least in the early days unpack-trees built the result by manually
adding an entry without calling the add_index_entry() all over the place,
which meant that it was futile to pretend that there is even a slight
chance that complex beast would correctly invalidate cached tree
information at all the necessary places. I recall that I added a code to
nuke the cache tree at the very beginning of "merging" codepaths to avoid
any bogus cache tree result to be stored in the resulting index.

These days, we have src_index and dst_index, and dst_index IIRC can start
as empty in which case "start from kept information and selectively
invalidate" would not work at all.  When src_index and dst_index are the
same, however, you should be able to keep the cached tree valid, at least
in theory.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: git status: small difference between stating whole repository and small subdirectory
  2012-02-20 14:09                         ` Thomas Rast
  2012-02-20 14:36                           ` Nguyen Thai Ngoc Duy
@ 2012-02-20 19:57                           ` Junio C Hamano
  2012-02-20 19:59                             ` Thomas Rast
  1 sibling, 1 reply; 52+ messages in thread
From: Junio C Hamano @ 2012-02-20 19:57 UTC (permalink / raw)
  To: Thomas Rast
  Cc: Jeff King, Piotr Krukowiecki, Junio C Hamano, Git Mailing List,
	Nguyen Thai Ngoc Duy

Thomas Rast <trast@inf.ethz.ch> writes:

> test_expect_failure 'checkout gives cache-tree' '
> 	git checkout HEAD^ &&
> 	test_shallow_cache_tree
> '

Depending on what state you start the checkout from, that is not a valid
test.  Some form of "git reset" before the checkout to ensure the initial
state is needed.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: git status: small difference between stating whole repository and small subdirectory
  2012-02-20 19:57                           ` Junio C Hamano
@ 2012-02-20 19:59                             ` Thomas Rast
  0 siblings, 0 replies; 52+ messages in thread
From: Thomas Rast @ 2012-02-20 19:59 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jeff King, Piotr Krukowiecki, Git Mailing List, Nguyen Thai Ngoc Duy

Junio C Hamano <gitster@pobox.com> writes:

> Thomas Rast <trast@inf.ethz.ch> writes:
>
>> test_expect_failure 'checkout gives cache-tree' '
>> 	git checkout HEAD^ &&
>> 	test_shallow_cache_tree
>> '
>
> Depending on what state you start the checkout from, that is not a valid
> test.  Some form of "git reset" before the checkout to ensure the initial
> state is needed.

Oh, I was just quoting what we already had at the end of t0090 since
4eb0346f.  The test preceding it runs 'git reset --hard'.

-- 
Thomas Rast
trast@{inf,student}.ethz.ch

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: git status: small difference between stating whole repository and small subdirectory
  2012-02-20 15:11                               ` Jeff King
  2012-02-20 18:45                                 ` Thomas Rast
@ 2012-02-20 20:08                                 ` Junio C Hamano
  2012-02-20 20:17                                   ` Jeff King
  1 sibling, 1 reply; 52+ messages in thread
From: Junio C Hamano @ 2012-02-20 20:08 UTC (permalink / raw)
  To: Jeff King
  Cc: Nguyen Thai Ngoc Duy, Thomas Rast, Piotr Krukowiecki,
	Junio C Hamano, Git Mailing List

Jeff King <peff@peff.net> writes:

>   4. At the end of unpack_trees, we forget about src_index, and copy
>      o->result into *o->dst_index byte for byte. I.e., we overwrite
>      the_index.cache_tree, which has been properly updated the whole
>      time,

I strongly suspect that "properly updated" part needs to be thoroughly
audited.  I wouldn't be surprised that this behaviour is what we did when
we split src_index vs dst_index when he rewrote unpack_trees() in order to
emulate the original "unpack-trees is beyond salvation because it does not
maintain cache tree correctly, just nuke it" behaviour.

> But it does not actually insert the _destination_ tree into the cache
> tree. Which we can do in certain situations, but only if there were no
> paths in the tree that were left unchanged (e.g., you modify "foo", then
> "git checkout HEAD^", which updates "bar". Your tree does not match
> HEAD^, and must be invalidated).  While it would be cool to be able to
> handle those complex cases,...

It may look cool but it may not be a good change. You are spending extra
cycles to optimize for the next write-tree that may not happen before the
index is further updated.

> I think this implementation matches the intent of the original calls to
> cache_tree_invalidate_path sprinkled throughout unpack-trees.c.

Yes, and as long as we invalidate all the directories that need to be
invalidated during the unpack-tree operation, I think it is a correct
thing to do.

> But I
> have to say that it seems a little odd for us to be modifying the
> o->src_index throughout the whole thing.

Yes, that part is logically *wrong*.  I think it is a remnant from the
days when there was no distinction between src_index and dst_index.

> I would think the right thing
> would be to make a deep copy of o->src_index->cache_tree into
> o->result.cache_tree as the very first thing, and then update
> o->result.cache_tree throughout the tree traversal.

Yes.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: git status: small difference between stating whole repository and small subdirectory
  2012-02-20 19:56                         ` Junio C Hamano
@ 2012-02-20 20:09                           ` Jeff King
  0 siblings, 0 replies; 52+ messages in thread
From: Jeff King @ 2012-02-20 20:09 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Piotr Krukowiecki, Thomas Rast, Git Mailing List, Nguyen Thai Ngoc Duy

On Mon, Feb 20, 2012 at 11:56:13AM -0800, Junio C Hamano wrote:

> These days, we have src_index and dst_index, and dst_index IIRC can start
> as empty in which case "start from kept information and selectively
> invalidate" would not work at all.  When src_index and dst_index are the
> same, however, you should be able to keep the cached tree valid, at least
> in theory.

Yeah, I was worried that the cache invalidations sprinkled throughout
unpack-trees.c would not be sufficient (and because we are invalidating,
a missing invalidation would give us bogus cache info, which is Very
Bad).

So I think the one-liner I posted before is not sufficient in the
general case, because it definitely doesn't consider where the
destination is starting from. It should at least be more like:

  if (src_index == dst_index) {
          /* We would ordinarily want to do a deep copy here, but since
           * we know that we will be overwriting src_index in the long
           * run, it's OK to just take ownership of its cache_tree. */
          o->result.cache_tree = o->src_index->cache_tree;
          o->src_index->cache_tree = NULL;
  }

  [... do the usual tree traversal here, except invalidate entries in
       o->result.call_tree instead of o->src_index. That makes it a
       no-op when src_index != dst_index (because we have no cache tree
       defined in result, then), and otherwise we are invalidating what
       will go into the result...]

  [then as before, we copy the result to dst_index; except now the
   result may have src_index's cache_tree plus any invalidations]
  o->result = *o->dst_index;

And fortunately that does exactly what we want in all cases, because we
always either read from and write to the_index, or we write to NULL (in
which case we will not bother with a cache_tree for the result, and it
is fixing a minor bug that we might be invalidating src_index's tree in the
first place).

I'm still slightly worried that we are missing some invalidation
somewhere deep in unpack_tree's callbacks (especially because they _are_
callbacks, and invalidating the cache_tree properly is now a promise
that the callbacks have to make).

-Peff

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: git status: small difference between stating whole repository and small subdirectory
  2012-02-20 20:08                                 ` Junio C Hamano
@ 2012-02-20 20:17                                   ` Jeff King
  0 siblings, 0 replies; 52+ messages in thread
From: Jeff King @ 2012-02-20 20:17 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Nguyen Thai Ngoc Duy, Thomas Rast, Piotr Krukowiecki, Git Mailing List

On Mon, Feb 20, 2012 at 12:08:03PM -0800, Junio C Hamano wrote:

> >   4. At the end of unpack_trees, we forget about src_index, and copy
> >      o->result into *o->dst_index byte for byte. I.e., we overwrite
> >      the_index.cache_tree, which has been properly updated the whole
> >      time,
> 
> I strongly suspect that "properly updated" part needs to be thoroughly
> audited.  I wouldn't be surprised that this behaviour is what we did when
> we split src_index vs dst_index when he rewrote unpack_trees() in order to
> emulate the original "unpack-trees is beyond salvation because it does not
> maintain cache tree correctly, just nuke it" behaviour.

Yep, I am also concerned about that.

> > But it does not actually insert the _destination_ tree into the cache
> > tree. Which we can do in certain situations, but only if there were no
> > paths in the tree that were left unchanged (e.g., you modify "foo", then
> > "git checkout HEAD^", which updates "bar". Your tree does not match
> > HEAD^, and must be invalidated).  While it would be cool to be able to
> > handle those complex cases,...
> 
> It may look cool but it may not be a good change. You are spending extra
> cycles to optimize for the next write-tree that may not happen before the
> index is further updated.

I don't think it would be too many cycles; you would have to mark each
tree you enter as having items from the left-hand tree or the right-hand
tree. If only one, you can reuse the cache-tree entry (or tree sha1, if
coming from a tree). Otherwise, you must invalidate.  And it doesn't
just help the next write-tree, but any intermediate index diffs.

Of course any such change would need timings to justify it, though.

That being said, I think just invalidating really covers 99% of the
cases. What we really care about is that when I modify kernel/foo.c, the
~2300 other directories (besides "" and "kernel") don't need rebuilt,
and that is relatively simple to do. Even if doing it the other way
produced a tiny speedup, I would be concerned with the increase in code
complexity.

> > I think this implementation matches the intent of the original calls to
> > cache_tree_invalidate_path sprinkled throughout unpack-trees.c.
> 
> Yes, and as long as we invalidate all the directories that need to be
> invalidated during the unpack-tree operation, I think it is a correct
> thing to do.

OK. I'll do some reading of the code to convince myself that the
unpack_trees callbacks are doing the right thing. I'm not sure of a good
automatic test that would detect a failure there. Just making test cases
is going to end up too contrived, unless we are missing something really
obvious.

I'm thinking maybe something like replaying the commit history of
linux-2.6 and making sure that each the tree generated by the cache-tree
in each case matches the actual committed tree.

> > But I
> > have to say that it seems a little odd for us to be modifying the
> > o->src_index throughout the whole thing.
> 
> Yes, that part is logically *wrong*.  I think it is a remnant from the
> days when there was no distinction between src_index and dst_index.

OK. I'll include a fix for that in the series I prepare.

-Peff

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: git status: small difference between stating whole repository and small subdirectory
  2012-02-20 18:45                                 ` Thomas Rast
@ 2012-02-20 20:35                                   ` Jeff King
  2012-02-20 22:04                                     ` Junio C Hamano
  0 siblings, 1 reply; 52+ messages in thread
From: Jeff King @ 2012-02-20 20:35 UTC (permalink / raw)
  To: Thomas Rast
  Cc: Nguyen Thai Ngoc Duy, Piotr Krukowiecki, Junio C Hamano,
	Git Mailing List

On Mon, Feb 20, 2012 at 07:45:59PM +0100, Thomas Rast wrote:

> > +	o->result.cache_tree = o->src_index->cache_tree;
> >  	o->src_index = NULL;
> >  	ret = check_updates(o) ? (-2) : 0;
> >  	if (o->dst_index)
> 
> Brilliant.  I know I'm stealing Junio's punchline, but please make it so
> :-)
> 
> Browsing around in history, it seems that this was silently broken by
> 34110cd (Make 'unpack_trees()' have a separate source and destination
> index, 2008-03-06), which introduced the distinction between source and
> destination index.  Before that they were the same, so the cache tree
> would have been updated correctly.

OK, good. When you write a one-liner that makes a huge change in
performance, it is usually a good idea to think to yourself "no, it
couldn't be this easy, could it?".

But after more discussion from people more clueful than I (this is the
first time I've even looked at cache-tree code), I'm feeling like this
is the right direction, at least, if not exactly the right patch.
And seeing that it is in fact a regression in 34110cd, and that the
existing cache-tree invalidations predate that makes me feel better. At
one point, at least, they were complete and we were depending on them to
be accurate. Things may have changed since then, of course, but I at
least know that they were sufficient in 34110cd^.

> +# NEEDSWORK: only one of these two can succeed.  The second is there
> +# because it would be the better result.
> +test_expect_success 'checkout HEAD^ correctly invalidates cache-tree' '
> +	git checkout HEAD^ &&
> +	test_invalid_cache_tree
> +'
> +
> +test_expect_failure 'checkout HEAD^ gives full cache-tree' '
> +	git checkout master &&
> +	git read-tree HEAD &&
>  	git checkout HEAD^ &&
> -	test_shallow_cache_tree
> +	test_cache_tree
>  '

I think you can construct two tests that will both work in the "ideal"
case. In the first one, you move to a tree that updates "foo", and
therefore the root cache-tree is invalidated.

In the second, you update "subdir1/foo" in the index, then move to a
commit that differs in "subdir1/bar" and "subdir2/bar". You should see
that subdir2 has the cache-tree of the destination commit, but that
subdir1 is invalidated (and therefore the root is also invalidated).
That will fail with my patch, of course, as it would invalidate subdir2,
also; so it would just be an expect_failure for somebody in the future.

In general, t0090 could benefit from using a larger tree. For example,
the add test does "git add foo" and checks that the root cache-tree was
invalidated. But it should _also_ check that the cache-tree for a
subdirectory is _not_ invalidated (and it isn't; git-add does the right
thing).

I'll see if I can work up some fancier tests, too.

-Peff

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: git status: small difference between stating whole repository and small subdirectory
  2012-02-20 20:35                                   ` Jeff King
@ 2012-02-20 22:04                                     ` Junio C Hamano
  2012-02-20 22:41                                       ` Jeff King
  0 siblings, 1 reply; 52+ messages in thread
From: Junio C Hamano @ 2012-02-20 22:04 UTC (permalink / raw)
  To: Jeff King
  Cc: Thomas Rast, Nguyen Thai Ngoc Duy, Piotr Krukowiecki, Git Mailing List

Jeff King <peff@peff.net> writes:

> ... Things may have changed since then, of course, but I at
> least know that they were sufficient in 34110cd^.

Looking at where cache_tree_free() is called, I think back then the
two-way merge was deemed OK, but we did not trust three-way merge or
merge-recursive at all.

> I think you can construct two tests that will both work in the "ideal"
> case. In the first one, you move to a tree that updates "foo", and
> therefore the root cache-tree is invalidated.

I have to warn you that under-invalidation of cache-tree is extremely hard
to find.  The only way I know, which I had to resort to when dealing with
a handful of instances of under-invalidation bugs, is to run write-tree
with potentially corrupt cache-tree and then using the same index but with
the cache-tree purged, run another write-tree and check to see if two
trees match.

> In general, t0090 could benefit from using a larger tree. For example,
> the add test does "git add foo" and checks that the root cache-tree was
> invalidated. But it should _also_ check that the cache-tree for a
> subdirectory is _not_ invalidated (and it isn't; git-add does the right
> thing).

It is OK to check that we do not over-invalidate for performance, but it
is a lot more important to make sure we do not under-invalidate for
correctness.  I am a bit worried that you seem to be putting more stress
on the former.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: git status: small difference between stating whole repository and small subdirectory
  2012-02-20 22:04                                     ` Junio C Hamano
@ 2012-02-20 22:41                                       ` Jeff King
  2012-02-20 23:31                                         ` Junio C Hamano
  0 siblings, 1 reply; 52+ messages in thread
From: Jeff King @ 2012-02-20 22:41 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Thomas Rast, Nguyen Thai Ngoc Duy, Piotr Krukowiecki, Git Mailing List

On Mon, Feb 20, 2012 at 02:04:26PM -0800, Junio C Hamano wrote:

> > ... Things may have changed since then, of course, but I at
> > least know that they were sufficient in 34110cd^.
> 
> Looking at where cache_tree_free() is called, I think back then the
> two-way merge was deemed OK, but we did not trust three-way merge or
> merge-recursive at all.

Thanks, I'll take a look more closely at those cases.

> It is OK to check that we do not over-invalidate for performance, but it
> is a lot more important to make sure we do not under-invalidate for
> correctness.  I am a bit worried that you seem to be putting more stress
> on the former.

I think it is just selection bias of the specific parts of his tests
that I was responding to. I completely agree that correctness is way
more important, and I'm also trying to come up with tests to validate
correctness. I just wasn't talking about them there.

I still think replaying real-world test cases is going to be more likely
to find issues in invalidation. I can come up with lots of simple
test-cases, but they're not likely to find anything we wouldn't find in
the code with trivial inspection. I think a combination of careful
analysis and real-world validation is going to be more helpful in the
long run than the kind of simplistic tests that are in t0090.

-Peff

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: git status: small difference between stating whole repository and small subdirectory
  2012-02-20 22:41                                       ` Jeff King
@ 2012-02-20 23:31                                         ` Junio C Hamano
  2012-02-21  7:21                                           ` Piotr Krukowiecki
  0 siblings, 1 reply; 52+ messages in thread
From: Junio C Hamano @ 2012-02-20 23:31 UTC (permalink / raw)
  To: Jeff King
  Cc: Thomas Rast, Nguyen Thai Ngoc Duy, Piotr Krukowiecki, Git Mailing List

Jeff King <peff@peff.net> writes:

> I still think replaying real-world test cases is going to be more likely
> to find issues in invalidation.

Yes, but it depends on what kind of replaying you have in mind.  It is
very hard to come up with "replaying real-world test case".

For example, randomly picking commit pair <$A,$B> from the kernel
repository and running

	git reset --hard $A
        git checkout $B
        T0=$(git write-tree)
        drop-cache-tree
        T1=$(git write-tree)
        test "$T0" = "$T1" && test "$T0" = $(git rev-parse $B^{tree})

is necessary but I do not think that is sufficient.  We also want to do
something like:

	git reset --hard $A
        ... modify paths that do not change between $A and $B
        ... git add these paths
        git write-tree
        git checkout $B

with and without the "git write-tree" to see the part of the cache-tree
smudged by the modification behaves sanely.  The codepath that is used
to deal with the case where the index does not match $A but matches $B is
also different, so the "modify path and git add" step would have to be
crafted carefully to touch all bases.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: git status: small difference between stating whole repository and small subdirectory
  2012-02-20 23:31                                         ` Junio C Hamano
@ 2012-02-21  7:21                                           ` Piotr Krukowiecki
  0 siblings, 0 replies; 52+ messages in thread
From: Piotr Krukowiecki @ 2012-02-21  7:21 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jeff King, Thomas Rast, Nguyen Thai Ngoc Duy, Git Mailing List

On Tue, Feb 21, 2012 at 12:31 AM, Junio C Hamano <gitster@pobox.com> wrote:
> Jeff King <peff@peff.net> writes:
[...]

Hi,

I hope the fixes will also help git-svn users? I.e. there's a lot of
rebasing (and cherry-picking - at least in my case) and probably some
other stuff going on under hood. I.e. I hope that if I git-svn rebase
or cherry-pick, it won't invalidate the cache...

Thanks,
-- 
Piotr Krukowiecki

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: git status: small difference between stating whole repository and small subdirectory
  2012-02-20 14:36                           ` Nguyen Thai Ngoc Duy
  2012-02-20 14:39                             ` Jeff King
@ 2012-02-21 14:45                             ` Nguyen Thai Ngoc Duy
  2012-02-21 19:16                               ` Junio C Hamano
  2012-02-22  3:32                               ` Junio C Hamano
  1 sibling, 2 replies; 52+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-02-21 14:45 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Thomas Rast, Jeff King, Piotr Krukowiecki, Git Mailing List

On Mon, Feb 20, 2012 at 9:36 PM, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:
> On Mon, Feb 20, 2012 at 03:09:57PM +0100, Thomas Rast wrote:
>> > Interestingly, on my git.git repo, I had an empty cache. Running "git
>> > read-tree HEAD" filled it (according to test-dump-cache-tree). It seems
>> > that running "git checkout" empties the cache.  So perhaps git could do
>> > better about keeping the cache valid over time.
>>
>> test_expect_failure 'checkout gives cache-tree' '
>>       git checkout HEAD^ &&
>>       test_shallow_cache_tree
>> '
>>
>> ;-)
>
> Quick and dirty that passes that test.

I'm aware that Jeff's tackling at lower level, which retains
cache-tree for many more cases. But this patch seems simple and safe
to me, and in my experience this case happens quite often (or maybe I
tend to keep my index clean). Junio, any chance this patch may get in?

> -- 8< --
> diff --git a/builtin/checkout.c b/builtin/checkout.c
> index 5bf96ba..c06287a 100644
> --- a/builtin/checkout.c
> +++ b/builtin/checkout.c
> @@ -319,6 +319,10 @@ static void show_local_changes(struct object *head, struct diff_options *opts)
>                die(_("diff_setup_done failed"));
>        add_pending_object(&rev, head, NULL);
>        run_diff_index(&rev, 0);
> +       if (!DIFF_OPT_TST(&rev.diffopt, HAS_CHANGES)) {
> +               struct tree *tree = parse_tree_indirect(head->sha1);
> +               prime_cache_tree(&active_cache_tree, tree);
> +       }
>  }
>
>  static void describe_detached_head(const char *msg, struct commit *commit)
> @@ -493,13 +497,13 @@ static int merge_working_tree(struct checkout_opts *opts,
>                }
>        }
>
> +       if (!opts->force && !opts->quiet)
> +               show_local_changes(&new->commit->object, &opts->diff_options);
> +
>        if (write_cache(newfd, active_cache, active_nr) ||
>            commit_locked_index(lock_file))
>                die(_("unable to write new index file"));
>
> -       if (!opts->force && !opts->quiet)
> -               show_local_changes(&new->commit->object, &opts->diff_options);
> -
>        return 0;
>  }
>
> -- 8< --
> --
> Duy
-- 
Duy

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: git status: small difference between stating whole repository and small subdirectory
  2012-02-21 14:45                             ` Nguyen Thai Ngoc Duy
@ 2012-02-21 19:16                               ` Junio C Hamano
  2012-02-22  2:12                                 ` Nguyen Thai Ngoc Duy
  2012-02-22 10:34                                 ` Nguyen Thai Ngoc Duy
  2012-02-22  3:32                               ` Junio C Hamano
  1 sibling, 2 replies; 52+ messages in thread
From: Junio C Hamano @ 2012-02-21 19:16 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy
  Cc: Thomas Rast, Jeff King, Piotr Krukowiecki, Git Mailing List

Nguyen Thai Ngoc Duy <pclouds@gmail.com> writes:

> I'm aware that Jeff's tackling at lower level, which retains
> cache-tree for many more cases.
>
> But this patch seems simple and safe
> to me, and in my experience this case happens quite often (or maybe I
> tend to keep my index clean). Junio, any chance this patch may get in?

I do not think we are talking about a duplicated effort here.

By definition, the change to hook into unpack_trees() and making sure we
invalidate all the necessary subtrees in the cache cannot give you a cache
tree that is more populated than what you started with.  And the train of
thought in Peff's message is to improve this invalidation---we currently
invalidate everything ;-)

Somebody has to populate the cache tree fully when we _know_ the index
matches a certain tree, and adding a call to prime_cache_tree() in
strategic places is a way to do so.  The most obvious is write-tree, but
there are a few other existing codepaths that do so.

Because prime_cache_tree() by itself is a fairly expensive operation that
reads all the trees recursively, its benefits need to be evaluated. It
should to happen only in an operation that is already heavy-weight, is
likely to have read all the trees and have many of them in-core cache, and
also relatively rarely happens compared to "git add" so that the cost can
be amortised over time, such as "reset --(hard|mixed)".

Switching branches is likely to fall into that category, but that is just
my gut feeling.  I would feel better at night if somebody did a benchmark
;-)

One thing we do not currently do anywhere that _might_ be of merit is to
make a call to cache_tree_update() instead of prime_cache_tree() when we
already know that only a very small subpart of the cache-tree is invalid
and it is cheaper to repair it by rehashing only a small portion of the
index than to re-prime the entire cache tree with prime_cache_tree().

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: git status: small difference between stating whole repository and small subdirectory
  2012-02-21 19:16                               ` Junio C Hamano
@ 2012-02-22  2:12                                 ` Nguyen Thai Ngoc Duy
  2012-02-22  2:55                                   ` Junio C Hamano
  2012-02-22 10:34                                 ` Nguyen Thai Ngoc Duy
  1 sibling, 1 reply; 52+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-02-22  2:12 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Thomas Rast, Jeff King, Piotr Krukowiecki, Git Mailing List

On Wed, Feb 22, 2012 at 2:16 AM, Junio C Hamano <gitster@pobox.com> wrote:
> Because prime_cache_tree() by itself is a fairly expensive operation that
> reads all the trees recursively, its benefits need to be evaluated. It
> should to happen only in an operation that is already heavy-weight, is
> likely to have read all the trees and have many of them in-core cache, and
> also relatively rarely happens compared to "git add" so that the cost can
> be amortised over time, such as "reset --(hard|mixed)".
>
> Switching branches is likely to fall into that category, but that is just
> my gut feeling.  I would feel better at night if somebody did a benchmark
> ;-)

In this particular case, "git diff --cached" is run internally, so I
say all trees are read once and hopefully most of them still in OS
cache. Will run some benchmark, maybe with the coming perf test suite.

> One thing we do not currently do anywhere that _might_ be of merit is to
> make a call to cache_tree_update() instead of prime_cache_tree() when we
> already know that only a very small subpart of the cache-tree is invalid
> and it is cheaper to repair it by rehashing only a small portion of the
> index than to re-prime the entire cache tree with prime_cache_tree().

That makes me think if "diff --cached" can take advantage of
cache-tree to avoid walking down valid cached trees and do tree-tree
diff in those cases instead. Not sure if it gains us anything but code
complexity.
-- 
Duy

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: git status: small difference between stating whole repository and small subdirectory
  2012-02-22  2:12                                 ` Nguyen Thai Ngoc Duy
@ 2012-02-22  2:55                                   ` Junio C Hamano
  2012-02-22 12:54                                     ` Nguyen Thai Ngoc Duy
  0 siblings, 1 reply; 52+ messages in thread
From: Junio C Hamano @ 2012-02-22  2:55 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy
  Cc: Thomas Rast, Jeff King, Piotr Krukowiecki, Git Mailing List

Nguyen Thai Ngoc Duy <pclouds@gmail.com> writes:

> That makes me think if "diff --cached" can take advantage of
> cache-tree to avoid walking down valid cached trees and do tree-tree
> diff in those cases instead. Not sure if it gains us anything but code
> complexity.

Why do I have this funny feeling that we saw that comment in this thread
already?

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: git status: small difference between stating whole repository and small subdirectory
  2012-02-21 14:45                             ` Nguyen Thai Ngoc Duy
  2012-02-21 19:16                               ` Junio C Hamano
@ 2012-02-22  3:32                               ` Junio C Hamano
  2012-04-10 15:16                                 ` Piotr Krukowiecki
  1 sibling, 1 reply; 52+ messages in thread
From: Junio C Hamano @ 2012-02-22  3:32 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy
  Cc: Thomas Rast, Jeff King, Piotr Krukowiecki, Git Mailing List

Nguyen Thai Ngoc Duy <pclouds@gmail.com> writes:

>> diff --git a/builtin/checkout.c b/builtin/checkout.c
>> index 5bf96ba..c06287a 100644
>> --- a/builtin/checkout.c
>> +++ b/builtin/checkout.c
>> @@ -319,6 +319,10 @@ static void show_local_changes(struct object *head, struct diff_options *opts)
>>                die(_("diff_setup_done failed"));
>>        add_pending_object(&rev, head, NULL);
>>        run_diff_index(&rev, 0);
>> +       if (!DIFF_OPT_TST(&rev.diffopt, HAS_CHANGES)) {
>> +               struct tree *tree = parse_tree_indirect(head->sha1);
>> +               prime_cache_tree(&active_cache_tree, tree);
>> +       }
>>  }

I think this patch is wrong on at least two counts.

 * The run_diff_index(&rev, 0) you reused is doing "diff HEAD" and not
   "diff --cached HEAD".  The added check does not say anything about the
   comparison between the index and the tree at the HEAD.

 * Even if we added an extra run_diff_index(&rev, 1) there, or added a
   call to index_differs_from() to run "diff --cached HEAD" to check what
   needs to be checked, it is still not quite right.

On the latter point, imagine what happens in the two invocations of
checkout in the following sequence:

   $ git reset --hard master
   $ git checkout master
   $ git checkout master

The second one should notice that the cache tree is fully valid, so the
internal "diff --cached" it runs should only open the top-level tree
and scan entries in it, without recursing into any of the subtrees, and
realize that the index is in sync with "HEAD", which should be a very
cheap operation (that is the whole point of the current topic of our
discussion looking at the cache-tree).  Then the new code calls
prime_cache_tree() to read _everything_?

Probably cache_tree_fully_valid() should be called before deciding that we
need to re-populate the cache tree from "HEAD".

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: git status: small difference between stating whole repository and small subdirectory
  2012-02-21 19:16                               ` Junio C Hamano
  2012-02-22  2:12                                 ` Nguyen Thai Ngoc Duy
@ 2012-02-22 10:34                                 ` Nguyen Thai Ngoc Duy
  1 sibling, 0 replies; 52+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-02-22 10:34 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Thomas Rast, Jeff King, Piotr Krukowiecki, Git Mailing List

On Tue, Feb 21, 2012 at 11:16:37AM -0800, Junio C Hamano wrote:
> Nguyen Thai Ngoc Duy <pclouds@gmail.com> writes:
> 
> > I'm aware that Jeff's tackling at lower level, which retains
> > cache-tree for many more cases.
> >
> > But this patch seems simple and safe
> > to me, and in my experience this case happens quite often (or maybe I
> > tend to keep my index clean). Junio, any chance this patch may get in?
> 
> I do not think we are talking about a duplicated effort here.
> 
> By definition, the change to hook into unpack_trees() and making sure we
> invalidate all the necessary subtrees in the cache cannot give you a cache
> tree that is more populated than what you started with.  And the train of
> thought in Peff's message is to improve this invalidation---we currently
> invalidate everything ;-)
> 
> Somebody has to populate the cache tree fully when we _know_ the index
> matches a certain tree, and adding a call to prime_cache_tree() in
> strategic places is a way to do so.  The most obvious is write-tree, but
> there are a few other existing codepaths that do so.
> 
> Because prime_cache_tree() by itself is a fairly expensive operation that
> reads all the trees recursively, its benefits need to be evaluated. It
> should to happen only in an operation that is already heavy-weight, is
> likely to have read all the trees and have many of them in-core cache, and
> also relatively rarely happens compared to "git add" so that the cost can
> be amortised over time, such as "reset --(hard|mixed)".

It's tradeoff. As you said prime_cache_tree() is expensive.
cache_tree_update is supposed to be cheap. But cache_tree_update() when
empty is even more expensive than prime_cache_tree(). So it boils down
how much cache-tree we have after unpack_trees() and whether it's worth
repopulate cache-tree again.

> Switching branches is likely to fall into that category, but that is just
> my gut feeling.  I would feel better at night if somebody did a benchmark
> ;-)

I timed prime_cache_tree() and cache_tree_update() while switching branch
on linux-2.6, between v2.6.32 and a quite recent version. prime_cache_tree()
took ~55ms while cache_tree_update() 150ms or 90ms (depending on final tree).
It depends on your view, whether 55ms is expensive on such a reasonably large
repository. I took several seconds for me to complete the checkout though.

Checking out with "-q" prime_cache_tree() took 145ms and 80ms respectively,
as expensive as cache_tree_update()

If cache-tree is only used at commit time, I think we could delay
prime_cache_tree() until absolutely needed. We could add an optional index
extension recording the fact that index matches certain tree.
On the first cache_tree_invalidate_path(), if cache-tree is still
empty, we prime cache-tree, then invalidate the requested path.
It would then add no cost to a quick branch switch.

But if diff-cached also takes advantage of cache-tree, it's a different story.

Anyway, I think this patch does better than my last one

-- 8< --
diff --git a/builtin/checkout.c b/builtin/checkout.c
index 6b9061f..e7eaeec 100644
--- a/builtin/checkout.c
+++ b/builtin/checkout.c
@@ -387,6 +387,7 @@ static int merge_working_tree(struct checkout_opts *opts,
 	int ret;
 	struct lock_file *lock_file = xcalloc(1, sizeof(struct lock_file));
 	int newfd = hold_locked_index(lock_file, 1);
+	int head_index_mismatch = 1;
 
 	if (read_cache_preload(NULL) < 0)
 		return error(_("corrupt index file"));
@@ -396,6 +397,7 @@ static int merge_working_tree(struct checkout_opts *opts,
 		ret = reset_tree(new->commit->tree, opts, 1);
 		if (ret)
 			return ret;
+		head_index_mismatch = 0;
 	} else {
 		struct tree_desc trees[2];
 		struct tree *tree;
@@ -490,7 +492,27 @@ static int merge_working_tree(struct checkout_opts *opts,
 			ret = reset_tree(new->commit->tree, opts, 0);
 			if (ret)
 				return ret;
-		}
+		} else
+			head_index_mismatch = topts.head_index_mismatch;
+	}
+
+	/*
+	 * Currently cache-tree is always destroyed after
+	 * unpack_trees(). It's probably a good idea to repopulate
+	 * cache-tree. If the user makes a few modifications and
+	 * commits, tree generation woulda be cheap. If they switch
+	 * away again, not so cheap.
+	 *
+	 * When unpack_trees() learns to retains as much cache-tree as
+	 * possible, this code probably does not help much on tree
+	 * generation, unless the tree difference between to heads are
+	 * too far, little cache-tree can be kept.
+	 */
+	if (!head_index_mismatch &&
+	    !cache_tree_fully_valid(active_cache_tree)) {
+		if (!new->commit->tree->object.parsed)
+			parse_tree(new->commit->tree);
+		prime_cache_tree(&active_cache_tree, new->commit->tree);
 	}
 
 	if (write_cache(newfd, active_cache, active_nr) ||
diff --git a/unpack-trees.c b/unpack-trees.c
index 7c9ecf6..f2c518f 100644
--- a/unpack-trees.c
+++ b/unpack-trees.c
@@ -1022,6 +1022,8 @@ int unpack_trees(unsigned len, struct tree_desc *t, struct unpack_trees_options
 	o->result.timestamp.nsec = o->src_index->timestamp.nsec;
 	o->merge_size = len;
 	mark_all_ce_unused(o->src_index);
+	if (o->fn != twoway_merge)
+		o->head_index_mismatch = 1;
 
 	/*
 	 * Sparse checkout loop #1: set NEW_SKIP_WORKTREE on existing entries
@@ -1736,6 +1738,8 @@ int twoway_merge(struct cache_entry **src, struct unpack_trees_options *o)
 		    (oldtree && newtree &&
 		     !same(oldtree, newtree) && /* 18 and 19 */
 		     same(current, newtree))) {
+			if (!newtree || (newtree && !same(current, newtree)))
+				o->head_index_mismatch = 1;
 			return keep_entry(current, o);
 		}
 		else if (oldtree && !newtree && same(current, oldtree)) {
diff --git a/unpack-trees.h b/unpack-trees.h
index 5e432f5..b75b64e 100644
--- a/unpack-trees.h
+++ b/unpack-trees.h
@@ -48,7 +48,8 @@ struct unpack_trees_options {
 		     gently,
 		     exiting_early,
 		     show_all_errors,
-		     dry_run;
+		     dry_run,
+		     head_index_mismatch;
 	const char *prefix;
 	int cache_bottom;
 	struct dir_struct *dir;
-- 8< --

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: git status: small difference between stating whole repository and small subdirectory
  2012-02-22  2:55                                   ` Junio C Hamano
@ 2012-02-22 12:54                                     ` Nguyen Thai Ngoc Duy
  2012-02-22 13:17                                       ` Thomas Rast
  0 siblings, 1 reply; 52+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-02-22 12:54 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Thomas Rast, Jeff King, Piotr Krukowiecki, Git Mailing List

On Wed, Feb 22, 2012 at 9:55 AM, Junio C Hamano <gitster@pobox.com> wrote:
> Nguyen Thai Ngoc Duy <pclouds@gmail.com> writes:
>
>> That makes me think if "diff --cached" can take advantage of
>> cache-tree to avoid walking down valid cached trees and do tree-tree
>> diff in those cases instead. Not sure if it gains us anything but code
>> complexity.
>
> Why do I have this funny feeling that we saw that comment in this thread
> already?

Simple. You wrote it and I missed it.

On Sat, Feb 18, 2012 at 5:25 AM, Junio C Hamano <gitster@pobox.com> wrote:
> Jeff King <peff@peff.net> writes:
>
>> That being said, we do have an index extension to store the tree sha1 of
>> whole directories (i.e., we populate it when we write a whole tree or
>> subtree into the index from the object db, and it becomes invalidated
>> when a file becomes modified). This optimization is used by things like
>> "git commit" to avoid having to recreate the same sub-trees over and
>> over when creating tree objects from the index. But we could also use it
>> here to avoid having to even read the sub-tree objects from the object
>> db.
>
> Like b65982b (Optimize "diff-index --cached" using cache-tree, 2009-05-20)
> perhaps?

This optimizes the case when a cached tree matches entirely.I wonder
whether it's faster if we switch to tree-tree diff whenever we find
valid cached trees. If cache-tree is fully valid, "git diff --cached
foo" would be equivalent to "git diff HEAD foo".

I tried "git diff --raw HEAD HEAD~100" (where HEAD was
v3.1-rc1-272-g73e0881 on linux-2.6) and "git diff --cached --raw
HEAD~100" with no cache-tree. The former is a little bit faster than
the latter (177ms vs 275ms). On gentoo-x86, 70k worktree files, it's
4.33s vs 4.45s. But in tree-tree diff we pay high in cold cache case
for loading trees from "HEAD". So no, probably not worth more code
changes. Your optimization is good enough.
-- 
Duy

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: git status: small difference between stating whole repository and small subdirectory
  2012-02-22 12:54                                     ` Nguyen Thai Ngoc Duy
@ 2012-02-22 13:17                                       ` Thomas Rast
  0 siblings, 0 replies; 52+ messages in thread
From: Thomas Rast @ 2012-02-22 13:17 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy
  Cc: Junio C Hamano, Jeff King, Piotr Krukowiecki, Git Mailing List

Nguyen Thai Ngoc Duy <pclouds@gmail.com> writes:

> On Sat, Feb 18, 2012 at 5:25 AM, Junio C Hamano <gitster@pobox.com> wrote:
>> Jeff King <peff@peff.net> writes:
>>
>>> That being said, we do have an index extension to store the tree sha1 of
>>> whole directories (i.e., we populate it when we write a whole tree or
>>> subtree into the index from the object db, and it becomes invalidated
>>> when a file becomes modified). This optimization is used by things like
>>> "git commit" to avoid having to recreate the same sub-trees over and
>>> over when creating tree objects from the index. But we could also use it
>>> here to avoid having to even read the sub-tree objects from the object
>>> db.
>>
>> Like b65982b (Optimize "diff-index --cached" using cache-tree, 2009-05-20)
>> perhaps?
>
> This optimizes the case when a cached tree matches entirely.I wonder
> whether it's faster if we switch to tree-tree diff whenever we find
> valid cached trees. If cache-tree is fully valid, "git diff --cached
> foo" would be equivalent to "git diff HEAD foo".

Not necessarily; the cache-tree is valid if it faithfully represents
what is in the index.  It does not have any direct relation to HEAD.

> I tried "git diff --raw HEAD HEAD~100" (where HEAD was
> v3.1-rc1-272-g73e0881 on linux-2.6) and "git diff --cached --raw
> HEAD~100" with no cache-tree. The former is a little bit faster than
> the latter (177ms vs 275ms). On gentoo-x86, 70k worktree files, it's
> 4.33s vs 4.45s. But in tree-tree diff we pay high in cold cache case
> for loading trees from "HEAD". So no, probably not worth more code
> changes. Your optimization is good enough.

I'm still wondering about using mincore() to good effect.  I tried it
for git-grep, but it ended up slowing things down.  However, it irks me
that in some cases a clueful use of one form over the other can really
make a huge performance difference, e.g.,

  git grep stuff
  git grep HEAD stuff

If I am in a big repository that I haven't used in a while, the HEAD
form will be much faster as the worktree search would fault many files.
OTOH if I am in a heavily-used repository (and perhaps just said 'make'
minutes ago) the worktree version will avoid the pack decompression
effort.

Sadly this also has the problem that we must first determine whether
substituting HEAD for the worktree (or vice versa) is valid at all.  For
grep perhaps there could be a "just do a fast search somewhere" option
since usually you are looking for something that hasn't changed in ages.

Ok, that was almost completely beside the point of this thread.

-- 
Thomas Rast
trast@{inf,student}.ethz.ch

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: git status: small difference between stating whole repository and small subdirectory
  2012-02-22  3:32                               ` Junio C Hamano
@ 2012-04-10 15:16                                 ` Piotr Krukowiecki
  2012-04-10 16:23                                   ` Junio C Hamano
  0 siblings, 1 reply; 52+ messages in thread
From: Piotr Krukowiecki @ 2012-04-10 15:16 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Nguyen Thai Ngoc Duy, Thomas Rast, Jeff King, Git Mailing List

On Wed, Feb 22, 2012 at 4:32 AM, Junio C Hamano <gitster@pobox.com> wrote:
> Nguyen Thai Ngoc Duy <pclouds@gmail.com> writes:
>
>>> diff --git a/builtin/checkout.c b/builtin/checkout.c
>>> index 5bf96ba..c06287a 100644
>>> --- a/builtin/checkout.c
>>> +++ b/builtin/checkout.c
>>> @@ -319,6 +319,10 @@ static void show_local_changes(struct object *head, struct diff_options *opts)
>>>                die(_("diff_setup_done failed"));
>>>        add_pending_object(&rev, head, NULL);
>>>        run_diff_index(&rev, 0);
>>> +       if (!DIFF_OPT_TST(&rev.diffopt, HAS_CHANGES)) {
>>> +               struct tree *tree = parse_tree_indirect(head->sha1);
>>> +               prime_cache_tree(&active_cache_tree, tree);
>>> +       }
>>>  }
>
> I think this patch is wrong on at least two counts.
>
>  * The run_diff_index(&rev, 0) you reused is doing "diff HEAD" and not
>   "diff --cached HEAD".  The added check does not say anything about the
>   comparison between the index and the tree at the HEAD.
>
>  * Even if we added an extra run_diff_index(&rev, 1) there, or added a
>   call to index_differs_from() to run "diff --cached HEAD" to check what
>   needs to be checked, it is still not quite right.
>
> On the latter point, imagine what happens in the two invocations of
> checkout in the following sequence:
>
>   $ git reset --hard master
>   $ git checkout master
>   $ git checkout master
>
> The second one should notice that the cache tree is fully valid, so the
> internal "diff --cached" it runs should only open the top-level tree
> and scan entries in it, without recursing into any of the subtrees, and
> realize that the index is in sync with "HEAD", which should be a very
> cheap operation (that is the whole point of the current topic of our
> discussion looking at the cache-tree).  Then the new code calls
> prime_cache_tree() to read _everything_?
>
> Probably cache_tree_fully_valid() should be called before deciding that we
> need to re-populate the cache tree from "HEAD".

Hi,

could I ask what is the status of this? There were some patches
posted, but I think nothing final?

Thanks,

-- 
Piotr Krukowiecki

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: git status: small difference between stating whole repository and small subdirectory
  2012-04-10 15:16                                 ` Piotr Krukowiecki
@ 2012-04-10 16:23                                   ` Junio C Hamano
  2012-04-10 18:00                                     ` Jeff King
  0 siblings, 1 reply; 52+ messages in thread
From: Junio C Hamano @ 2012-04-10 16:23 UTC (permalink / raw)
  To: Piotr Krukowiecki
  Cc: Nguyen Thai Ngoc Duy, Thomas Rast, Jeff King, Git Mailing List

Piotr Krukowiecki <piotr.krukowiecki@gmail.com> writes:

> could I ask what is the status of this? There were some patches
> posted, but I think nothing final?

I do not think you meant to address your inquiry to me, but I think these
patches tried out some ideas, got issues discovered in them and then got
abandoned before resulting in a working code that is ready for testing.

I wish there were fewer such series, but it happens (see "Stalled" section
in What's cooking).

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: git status: small difference between stating whole repository and small subdirectory
  2012-04-10 16:23                                   ` Junio C Hamano
@ 2012-04-10 18:00                                     ` Jeff King
  0 siblings, 0 replies; 52+ messages in thread
From: Jeff King @ 2012-04-10 18:00 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Piotr Krukowiecki, Nguyen Thai Ngoc Duy, Thomas Rast, Git Mailing List

On Tue, Apr 10, 2012 at 09:23:59AM -0700, Junio C Hamano wrote:

> > could I ask what is the status of this? There were some patches
> > posted, but I think nothing final?
> 
> I do not think you meant to address your inquiry to me, but I think these
> patches tried out some ideas, got issues discovered in them and then got
> abandoned before resulting in a working code that is ready for testing.

Yes. I think we decided that we needed some pretty good testing to add
the cache_tree handling back into unpack_trees. I'd still like to do
that testing, but haven't done it yet.

-Peff

^ permalink raw reply	[flat|nested] 52+ messages in thread

end of thread, other threads:[~2012-04-10 18:00 UTC | newest]

Thread overview: 52+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-02-10  9:42 git status: small difference between stating whole repository and small subdirectory Piotr Krukowiecki
2012-02-10 12:33 ` Nguyen Thai Ngoc Duy
2012-02-10 13:46   ` Piotr Krukowiecki
2012-02-10 14:37     ` Nguyen Thai Ngoc Duy
2012-02-13 16:54       ` Piotr Krukowiecki
2012-02-10 16:18 ` Piotr Krukowiecki
2012-02-14 11:34   ` Thomas Rast
2012-02-15  8:57     ` Piotr Krukowiecki
2012-02-15 11:01       ` Nguyen Thai Ngoc Duy
2012-02-15 15:14         ` Piotr Krukowiecki
2012-02-16 13:22           ` Piotr Krukowiecki
2012-02-15 19:03       ` Jeff King
2012-02-16 13:37         ` Piotr Krukowiecki
2012-02-16 14:05           ` Thomas Rast
2012-02-16 20:15             ` Junio C Hamano
2012-02-17 16:55             ` Piotr Krukowiecki
2012-02-16 19:20           ` Jeff King
2012-02-17 17:19             ` Piotr Krukowiecki
2012-02-17 20:37               ` Jeff King
2012-02-17 22:25                 ` Junio C Hamano
2012-02-17 22:29                   ` Jeff King
2012-02-20  8:25                     ` Piotr Krukowiecki
2012-02-20 14:06                       ` Jeff King
2012-02-20 14:09                         ` Thomas Rast
2012-02-20 14:36                           ` Nguyen Thai Ngoc Duy
2012-02-20 14:39                             ` Jeff King
2012-02-20 15:11                               ` Jeff King
2012-02-20 18:45                                 ` Thomas Rast
2012-02-20 20:35                                   ` Jeff King
2012-02-20 22:04                                     ` Junio C Hamano
2012-02-20 22:41                                       ` Jeff King
2012-02-20 23:31                                         ` Junio C Hamano
2012-02-21  7:21                                           ` Piotr Krukowiecki
2012-02-20 20:08                                 ` Junio C Hamano
2012-02-20 20:17                                   ` Jeff King
2012-02-21 14:45                             ` Nguyen Thai Ngoc Duy
2012-02-21 19:16                               ` Junio C Hamano
2012-02-22  2:12                                 ` Nguyen Thai Ngoc Duy
2012-02-22  2:55                                   ` Junio C Hamano
2012-02-22 12:54                                     ` Nguyen Thai Ngoc Duy
2012-02-22 13:17                                       ` Thomas Rast
2012-02-22 10:34                                 ` Nguyen Thai Ngoc Duy
2012-02-22  3:32                               ` Junio C Hamano
2012-04-10 15:16                                 ` Piotr Krukowiecki
2012-04-10 16:23                                   ` Junio C Hamano
2012-04-10 18:00                                     ` Jeff King
2012-02-20 19:57                           ` Junio C Hamano
2012-02-20 19:59                             ` Thomas Rast
2012-02-20 14:16                         ` Nguyen Thai Ngoc Duy
2012-02-20 14:22                           ` Jeff King
2012-02-20 19:56                         ` Junio C Hamano
2012-02-20 20:09                           ` Jeff King

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.