"git notes show" is orders of magnitude slower than doing it manually with ls-tree and cat-file

All of lore.kernel.org
 help / color / mirror / Atom feed

* "git notes show" is orders of magnitude slower than doing it manually with ls-tree and cat-file
@ 2014-11-26  0:42 Mike Hommey
  2014-11-26  1:00 ` Jeff King
  0 siblings, 1 reply; 10+ messages in thread
From: Mike Hommey @ 2014-11-26  0:42 UTC (permalink / raw)
  To: git

Hi,

I have a note tree with a bit more than 200k notes.

$ time git notes --ref foo show $sha1 > /dev/null
real    0m0.147s
user    0m0.136s
sys     0m0.008s

That's a lot of time, especially when you have a script that does that
on a fair amount of sha1s.

Now, the interesting thing is this:

$ time git ls-tree -r refs/notes/foo $sha1 ${sha1:0:2}/${sha1:2:38} ${sha1:0:2}/${sha1:2:2}/${sha1:4:36} > /dev/null
real    0m0.006s
user    0m0.008s
sys     0m0.000s

$ time git cat-file blob $objsha1 > /dev/null
real    0m0.002s
user    0m0.000s
sys     0m0.000s

And even better:

$ wc -l /tmp/note
39 /tmp/note
$ time git ls-tree refs/notes/foo $(awk '{print $1, substr($1,1,2) "/" substr($1,3), substr($1,1,2) "/" substr($1,3,2) "/" substr($1,5)}' /tmp/note) | awk '{print $3}' | git cat-file --batch > /dev/null
real    0m0.035s
user    0m0.028s
sys     0m0.004s

Reading 39 notes with ls-tree and cat-file --batch takes less time than
using git notes show for only one of them...

(and reading all 39 of them with git notes show takes 5.5s)

Mike

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: "git notes show" is orders of magnitude slower than doing it manually with ls-tree and cat-file
  2014-11-26  0:42 "git notes show" is orders of magnitude slower than doing it manually with ls-tree and cat-file Mike Hommey
@ 2014-11-26  1:00 ` Jeff King
  2014-11-26  1:24   ` Jeff King
  0 siblings, 1 reply; 10+ messages in thread
From: Jeff King @ 2014-11-26  1:00 UTC (permalink / raw)
  To: Mike Hommey; +Cc: Johan Herland, git

On Wed, Nov 26, 2014 at 09:42:42AM +0900, Mike Hommey wrote:

> I have a note tree with a bit more than 200k notes.
>
> $ time git notes --ref foo show $sha1 > /dev/null
> real    0m0.147s
> user    0m0.136s
> sys     0m0.008s
> 
> That's a lot of time, especially when you have a script that does that
> on a fair amount of sha1s.

IIRC, the notes code populates an in-memory data structure, which gives
faster per-commit lookup at the cost of some setup time. Obviously for a
single lookup, that's going to be a bad tradeoff (but it does make sense
for "git log --notes"). I don't know offhand how difficult it would be
to tune the data structure differently (or avoid it altogether) if we
know ahead of time we are only going to do a small number of lookups.
But Johan (cc'd) might.

-Peff

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: "git notes show" is orders of magnitude slower than doing it manually with ls-tree and cat-file
  2014-11-26  1:00 ` Jeff King
@ 2014-11-26  1:24   ` Jeff King
  2014-11-26  1:34     ` Jeff King
  2014-11-26  2:25     ` Mike Hommey
  0 siblings, 2 replies; 10+ messages in thread
From: Jeff King @ 2014-11-26  1:24 UTC (permalink / raw)
  To: Mike Hommey; +Cc: Johan Herland, git

On Tue, Nov 25, 2014 at 08:00:51PM -0500, Jeff King wrote:

> On Wed, Nov 26, 2014 at 09:42:42AM +0900, Mike Hommey wrote:
> 
> > I have a note tree with a bit more than 200k notes.
> >
> > $ time git notes --ref foo show $sha1 > /dev/null
> > real    0m0.147s
> > user    0m0.136s
> > sys     0m0.008s
> > 
> > That's a lot of time, especially when you have a script that does that
> > on a fair amount of sha1s.
> 
> IIRC, the notes code populates an in-memory data structure, which gives
> faster per-commit lookup at the cost of some setup time. Obviously for a
> single lookup, that's going to be a bad tradeoff (but it does make sense
> for "git log --notes"). I don't know offhand how difficult it would be
> to tune the data structure differently (or avoid it altogether) if we
> know ahead of time we are only going to do a small number of lookups.
> But Johan (cc'd) might.

One other question: how were your notes created?

I tried to replicate your setup by creating one note per commit in
linux.git (over 400k notes total). I did it with one big mktree,
creating a single top-level notes tree. Doing a single "git notes show"
lookup on the tree was something like 800ms.

However, this is not what trees created by git-notes look like. It
shards the object sha1s into subtrees (1a/2b/{36}), and I think does so
dynamically in a way that keeps each individual tree size low. The
in-memory data structure then only "faults in" tree objects as they are
needed. So a single lookup should only hit a small part of the total
tree.

Doing a single "git notes edit HEAD" in my case caused the notes code to
write the result using its sharding algorithm. Subsequent "git notes
show" invocations were only 14ms.

Did you use something besides git-notes to create the tree? From your
examples, it looks like you were accounting for the sharding during
lookup, so maybe this is leading in the wrong direction (but if so, I
could not reproduce your times at all even with a much larger case).

-Peff

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: "git notes show" is orders of magnitude slower than doing it manually with ls-tree and cat-file
  2014-11-26  1:24   ` Jeff King
@ 2014-11-26  1:34     ` Jeff King
  2014-11-26  2:30       ` Mike Hommey
  2014-11-26  2:25     ` Mike Hommey
  1 sibling, 1 reply; 10+ messages in thread
From: Jeff King @ 2014-11-26  1:34 UTC (permalink / raw)
  To: Mike Hommey; +Cc: Johan Herland, git

On Tue, Nov 25, 2014 at 08:24:48PM -0500, Jeff King wrote:

> However, this is not what trees created by git-notes look like. It
> shards the object sha1s into subtrees (1a/2b/{36}), and I think does so
> dynamically in a way that keeps each individual tree size low. The
> in-memory data structure then only "faults in" tree objects as they are
> needed. So a single lookup should only hit a small part of the total
> tree.
> 
> Doing a single "git notes edit HEAD" in my case caused the notes code to
> write the result using its sharding algorithm. Subsequent "git notes
> show" invocations were only 14ms.
> 
> Did you use something besides git-notes to create the tree? From your
> examples, it looks like you were accounting for the sharding during
> lookup, so maybe this is leading in the wrong direction (but if so, I
> could not reproduce your times at all even with a much larger case).

Hmph. Having just written all that, I looked at your example again, and
you are running "git ls-tree -r", which would read the whole tree
anyway. So "git notes" should be _faster_ for a single lookup.

Something weird is definitely going on. Can you use "strace" or "perf"
to get a sense of where the time is going? Has your repository been
packed recently?

-Peff

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: "git notes show" is orders of magnitude slower than doing it manually with ls-tree and cat-file
  2014-11-26  1:24   ` Jeff King
  2014-11-26  1:34     ` Jeff King
@ 2014-11-26  2:25     ` Mike Hommey
  2014-11-26  4:46       ` Jeff King
  1 sibling, 1 reply; 10+ messages in thread
From: Mike Hommey @ 2014-11-26  2:25 UTC (permalink / raw)
  To: Jeff King; +Cc: Johan Herland, git

On Tue, Nov 25, 2014 at 08:24:49PM -0500, Jeff King wrote:
> On Tue, Nov 25, 2014 at 08:00:51PM -0500, Jeff King wrote:
> 
> > On Wed, Nov 26, 2014 at 09:42:42AM +0900, Mike Hommey wrote:
> > 
> > > I have a note tree with a bit more than 200k notes.
> > >
> > > $ time git notes --ref foo show $sha1 > /dev/null
> > > real    0m0.147s
> > > user    0m0.136s
> > > sys     0m0.008s
> > > 
> > > That's a lot of time, especially when you have a script that does that
> > > on a fair amount of sha1s.
> > 
> > IIRC, the notes code populates an in-memory data structure, which gives
> > faster per-commit lookup at the cost of some setup time. Obviously for a
> > single lookup, that's going to be a bad tradeoff (but it does make sense
> > for "git log --notes"). I don't know offhand how difficult it would be
> > to tune the data structure differently (or avoid it altogether) if we
> > know ahead of time we are only going to do a small number of lookups.
> > But Johan (cc'd) might.
> 
> One other question: how were your notes created?
> 
> I tried to replicate your setup by creating one note per commit in
> linux.git (over 400k notes total). I did it with one big mktree,
> creating a single top-level notes tree. Doing a single "git notes show"
> lookup on the tree was something like 800ms.
> 
> However, this is not what trees created by git-notes look like. It
> shards the object sha1s into subtrees (1a/2b/{36}), and I think does so
> dynamically in a way that keeps each individual tree size low. The
> in-memory data structure then only "faults in" tree objects as they are
> needed. So a single lookup should only hit a small part of the total
> tree.
> 
> Doing a single "git notes edit HEAD" in my case caused the notes code to
> write the result using its sharding algorithm. Subsequent "git notes
> show" invocations were only 14ms.
> 
> Did you use something besides git-notes to create the tree? From your
> examples, it looks like you were accounting for the sharding during
> lookup, so maybe this is leading in the wrong direction (but if so, I
> could not reproduce your times at all even with a much larger case).

So... this is interesting. I happen to have recreated the notes tree
"manually", and now each git notes show takes under 10ms.

Now, looking at the notes tree reflog, I see that at some point, some
notes were added at the top-level of the tree, without being nested,
which is strange.

And it looks like it's related to how I've been adding them, through
git-fast-import. I was using notemodify commands, and was using the
filemodify command to load the previous notes tree instead of using the
from command because I don't care about keeping the notes history.
So fast-import was actually filling the notes tree as if it were
starting over with whatever new notes were added with notemodify (which,
in a case where there were many, it filled with one level of
indirection)

I'm not sure this is a case worth fixing in fast-import. I can easily
work around it.

Mike

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: "git notes show" is orders of magnitude slower than doing it manually with ls-tree and cat-file
  2014-11-26  1:34     ` Jeff King
@ 2014-11-26  2:30       ` Mike Hommey
  2014-11-26  4:49         ` Jeff King
  0 siblings, 1 reply; 10+ messages in thread
From: Mike Hommey @ 2014-11-26  2:30 UTC (permalink / raw)
  To: Jeff King; +Cc: Johan Herland, git

On Tue, Nov 25, 2014 at 08:34:57PM -0500, Jeff King wrote:
> On Tue, Nov 25, 2014 at 08:24:48PM -0500, Jeff King wrote:
> 
> > However, this is not what trees created by git-notes look like. It
> > shards the object sha1s into subtrees (1a/2b/{36}), and I think does so
> > dynamically in a way that keeps each individual tree size low. The
> > in-memory data structure then only "faults in" tree objects as they are
> > needed. So a single lookup should only hit a small part of the total
> > tree.
> > 
> > Doing a single "git notes edit HEAD" in my case caused the notes code to
> > write the result using its sharding algorithm. Subsequent "git notes
> > show" invocations were only 14ms.
> > 
> > Did you use something besides git-notes to create the tree? From your
> > examples, it looks like you were accounting for the sharding during
> > lookup, so maybe this is leading in the wrong direction (but if so, I
> > could not reproduce your times at all even with a much larger case).
> 
> Hmph. Having just written all that, I looked at your example again, and
> you are running "git ls-tree -r", which would read the whole tree
> anyway. So "git notes" should be _faster_ for a single lookup.

The -r actually doesn't matter, since what's being listed is a blob, not
a tree, so there is no recursion.

Mike

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: "git notes show" is orders of magnitude slower than doing it manually with ls-tree and cat-file
  2014-11-26  2:25     ` Mike Hommey
@ 2014-11-26  4:46       ` Jeff King
  2014-11-26 11:46         ` Johan Herland
  0 siblings, 1 reply; 10+ messages in thread
From: Jeff King @ 2014-11-26  4:46 UTC (permalink / raw)
  To: Mike Hommey; +Cc: Johan Herland, git

On Wed, Nov 26, 2014 at 11:25:53AM +0900, Mike Hommey wrote:

> Now, looking at the notes tree reflog, I see that at some point, some
> notes were added at the top-level of the tree, without being nested,
> which is strange.

That's somewhat expected. The fanout is dynamic based on the number of
notes, so early on I think some notes may be found at the top of the tree.

> And it looks like it's related to how I've been adding them, through
> git-fast-import. I was using notemodify commands, and was using the
> filemodify command to load the previous notes tree instead of using the
> from command because I don't care about keeping the notes history.
> So fast-import was actually filling the notes tree as if it were
> starting over with whatever new notes were added with notemodify (which,
> in a case where there were many, it filled with one level of
> indirection)

Ah, that sort of makes sense. This confused the code to adjust the
fanout, because we track "number of notes" independently of "number of
files" (even though they are really the same thing in a notes tree).

> I'm not sure this is a case worth fixing in fast-import. I can easily
> work around it.

Yeah. Probably fast-import could be smarter here, but I think ultimately
it makes sense to stick to using the note commands. I think what you
want is a version of "from" that takes an existing tree (and number of
notes!) from a commit, but does not add it as a parent. AFAIK,
fast-import doesn't have a way to do that.

Probably the simplest thing is to build it with history via fast-import,
and then just truncate the history at the end with:

  commit=$(echo "final notes tree" | git commit-tree refs/notes/foo^{tree})
  git update-ref refs/notes/foo $commit

-Peff

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: "git notes show" is orders of magnitude slower than doing it manually with ls-tree and cat-file
  2014-11-26  2:30       ` Mike Hommey
@ 2014-11-26  4:49         ` Jeff King
  0 siblings, 0 replies; 10+ messages in thread
From: Jeff King @ 2014-11-26  4:49 UTC (permalink / raw)
  To: Mike Hommey; +Cc: Johan Herland, git

On Wed, Nov 26, 2014 at 11:30:39AM +0900, Mike Hommey wrote:

> > Hmph. Having just written all that, I looked at your example again, and
> > you are running "git ls-tree -r", which would read the whole tree
> > anyway. So "git notes" should be _faster_ for a single lookup.
> 
> The -r actually doesn't matter, since what's being listed is a blob, not
> a tree, so there is no recursion.

Ah, right. I should have looked more carefully. I took the "-r" and the
patterns to mean "recursively list the tree, and I will grep for these
elements". But you were actually generating a set of pathspecs, which
git could then use to limit some of the walk.

-Peff

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: "git notes show" is orders of magnitude slower than doing it manually with ls-tree and cat-file
  2014-11-26  4:46       ` Jeff King
@ 2014-11-26 11:46         ` Johan Herland
  2014-11-26 12:30           ` Mike Hommey
  0 siblings, 1 reply; 10+ messages in thread
From: Johan Herland @ 2014-11-26 11:46 UTC (permalink / raw)
  To: Jeff King; +Cc: Mike Hommey, Git mailing list

(First of all, thanks to both for great investigation and analysis)

On Wed, Nov 26, 2014 at 5:46 AM, Jeff King <peff@peff.net> wrote:
> On Wed, Nov 26, 2014 at 11:25:53AM +0900, Mike Hommey wrote:
>
>> Now, looking at the notes tree reflog, I see that at some point, some
>> notes were added at the top-level of the tree, without being nested,
>> which is strange.
>
> That's somewhat expected. The fanout is dynamic based on the number of
> notes, so early on I think some notes may be found at the top of the tree.
>
>> And it looks like it's related to how I've been adding them, through
>> git-fast-import. I was using notemodify commands, and was using the
>> filemodify command to load the previous notes tree instead of using the
>> from command because I don't care about keeping the notes history.

I'd very much like to see this fast-import stream (or a script
generating it). I'm assuming that it roughly follows along these lines:

 - Start a new commit from a clean slate (no 'from')

 - Do a single filemodify to "load" the previous notes tree.
   Exactly what does this filemodify command look like? I'm
   guessing you're using the root tree object from the previous
   notes tree as <dataref>, and an empty <path>, i.e.:

     M 040000 $previous_notes_tree_root_sha1 \n

 - Do a series of notemodify commands for the notes being added in
   this commit.

 - End of commit.

Browsing the fast-import code, I believe I can confirm some of the
behavior you're seeing. Your commit starts from a clean slate
(with branch->num_notes == 0). The initial filemodify command loads
the previous notes tree, but does not adjust branch->num_notes.
The subsequent notemodify commands each increment branch->num_notes.
Then, when wrapping up the commit, branch->num_notes is used to
calculate the new fanout (the fast-import code does a simple
divide-by-256 heuristic). If you happened to have added more than
256 notes in this commit, you get a 2/38 fanout, otherwise you get a
flat notes tree. This is obviously incorrect, because the notes that
were "preloaded" by the filemodify command are not being accounted in
the fanout calculation. Thus, you end up with a seriously suboptimal
(notes) tree structure, which kills your lookup times.

However, at the start of note_change_n() (which handles the "notemodify"
commands), there is some logic to deal with this. The initial comment
there suggests that it was written for the case when the commit uses
"from refs/notes/foo" to add notes to an existing notes tree, but I don't
really see why it should not also apply to the case where the notemodify
commands are preceded by a filemodify to "preload" the notes tree.

Furthermore, It is not obvious why this logic (that counts the number of
exiting notes before we add the first note) does not ALREADY trigger in
your case (both b->num_notes and *old_fanout should be == 0). However, I
wonder if the "counting mode" is in fact invoked, but that there is some
non-obvious interplay with how "struct tree_entry"s are prepared by your
use of filemodify:

When using 'from ...' to preload b->branch_tree, the code in parse_from()
prepares a b->branch_tree tree_entry looking like this:

  .versions[1].sha1 = $some_sha1
  .tree == NULL

Following this, parse_new_commit() goes on to call load_branch(b), which
AFAICS, populates the .tree:

  .versions[1].sha1 = $some_sha1
  .tree == <tree structure loaded from $some_sha1>

However, when you use the filemodify command to replace the root tree
object, we end up calling

  tree_content_replace(&b->branch_tree, sha1, mode, NULL);

near the end of file_change_m(), which causes the b->branch_tree tree_entry
to look like this:

  .versions[1].sha1 = $some_sha1
  .tree == NULL

(i.e. same as parse_from()), BUT there is never a follow-up call to
load_branch(b).

So, at the time we handle the first notemodify command, the
b->branch_tree.tree entry is either non-NULL (in the case of 'from ...'),
or NULL (in the case of your "preloading" filemodify). And when we start
the "counting mode" of change_note_fanout(), the for loop in
do_change_note_fanout() depends on the .tree entry being already loaded.

A possible fix would be to ensure .tree is loaded before starting the for
loop in do_change_note_fanout(). But it's been too long since I thoroughly
grokked the fast-import code to say if simply inserting a

  load_tree(&b->branch_tree);

before the for loop is Safe(tm) and Correct(tm)...

>> So fast-import was actually filling the notes tree as if it were
>> starting over with whatever new notes were added with notemodify (which,
>> in a case where there were many, it filled with one level of
>> indirection)
>
> Ah, that sort of makes sense. This confused the code to adjust the
> fanout, because we track "number of notes" independently of "number of
> files" (even though they are really the same thing in a notes tree).

Not always. There is no way to summarily prevent someone from adding
objects at arbitrary locations in a notes tree. There "non-notes" must be
(and are) handled by the notes code (albeit less than gracefully).

>> I'm not sure this is a case worth fixing in fast-import. I can easily
>> work around it.
>
> Yeah. Probably fast-import could be smarter here, but I think ultimately
> it makes sense to stick to using the note commands. I think what you
> want is a version of "from" that takes an existing tree (and number of
> notes!) from a commit, but does not add it as a parent. AFAIK,
> fast-import doesn't have a way to do that.
>
> Probably the simplest thing is to build it with history via fast-import,
> and then just truncate the history at the end with:
>
>   commit=$(echo "final notes tree" | git commit-tree refs/notes/foo^{tree})
>   git update-ref refs/notes/foo $commit

I agree that this is probably the best workaround for now.

...Johan

-- 
Johan Herland, <johan@herland.net>
www.herland.net

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: "git notes show" is orders of magnitude slower than doing it manually with ls-tree and cat-file
  2014-11-26 11:46         ` Johan Herland
@ 2014-11-26 12:30           ` Mike Hommey
  0 siblings, 0 replies; 10+ messages in thread
From: Mike Hommey @ 2014-11-26 12:30 UTC (permalink / raw)
  To: Johan Herland; +Cc: Jeff King, Git mailing list

On Wed, Nov 26, 2014 at 12:46:20PM +0100, Johan Herland wrote:
> (First of all, thanks to both for great investigation and analysis)
> 
> On Wed, Nov 26, 2014 at 5:46 AM, Jeff King <peff@peff.net> wrote:
> > On Wed, Nov 26, 2014 at 11:25:53AM +0900, Mike Hommey wrote:
> >
> >> Now, looking at the notes tree reflog, I see that at some point, some
> >> notes were added at the top-level of the tree, without being nested,
> >> which is strange.
> >
> > That's somewhat expected. The fanout is dynamic based on the number of
> > notes, so early on I think some notes may be found at the top of the tree.
> >
> >> And it looks like it's related to how I've been adding them, through
> >> git-fast-import. I was using notemodify commands, and was using the
> >> filemodify command to load the previous notes tree instead of using the
> >> from command because I don't care about keeping the notes history.
> 
> I'd very much like to see this fast-import stream (or a script
> generating it). I'm assuming that it roughly follows along these lines:
> 
>  - Start a new commit from a clean slate (no 'from')
> 
>  - Do a single filemodify to "load" the previous notes tree.
>    Exactly what does this filemodify command look like? I'm
>    guessing you're using the root tree object from the previous
>    notes tree as <dataref>, and an empty <path>, i.e.:
> 
>      M 040000 $previous_notes_tree_root_sha1 \n
> 
>  - Do a series of notemodify commands for the notes being added in
>    this commit.
> 
>  - End of commit.

That's exactly the scenario.

<snip>
> >   commit=$(echo "final notes tree" | git commit-tree refs/notes/foo^{tree})
> >   git update-ref refs/notes/foo $commit
> 
> I agree that this is probably the best workaround for now.

Indeed, that's about what I had in mind when I said I could easily work
around (except I use the ls command in fast-import), and what I
implemented.

Mike

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2014-11-26 12:31 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-11-26  0:42 "git notes show" is orders of magnitude slower than doing it manually with ls-tree and cat-file Mike Hommey
2014-11-26  1:00 ` Jeff King
2014-11-26  1:24   ` Jeff King
2014-11-26  1:34     ` Jeff King
2014-11-26  2:30       ` Mike Hommey
2014-11-26  4:49         ` Jeff King
2014-11-26  2:25     ` Mike Hommey
2014-11-26  4:46       ` Jeff King
2014-11-26 11:46         ` Johan Herland
2014-11-26 12:30           ` Mike Hommey

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.