All of lore.kernel.org
 help / color / mirror / Atom feed
* Why does send-pack call pack-objects for all remote refs?
@ 2015-12-07 21:02 Daniel Koverman
  2015-12-07 22:41 ` Junio C Hamano
  0 siblings, 1 reply; 10+ messages in thread
From: Daniel Koverman @ 2015-12-07 21:02 UTC (permalink / raw)
  To: git

I have a repository which has ~2000 branches on the remote, and it takes ~8 seconds to push a change to one ref. The majority of this time is spent in pack-object. I wrote a hack so that only the ref being updated would be packed (the normal behavior is to pack for every ref on the remote).  The push time dropped to <1 second with (seemingly) no consequences. This raised a couple of questions:

1) Are there consequences for not packing refs that are not being updated? Can all operations in send-pack which operate on other refs be skipped?
2) Why isn't git more selective about what it packs? Is it simply a performance optimization not worth the implementation complexity?
3) Is something about my repository strange? ex: 2000 is too many branches, packing for each ref takes me an unusually long time, etc.

Thanks in advance to anyone who takes the time to clarify this for me.

Daniel

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Why does send-pack call pack-objects for all remote refs?
  2015-12-07 21:02 Why does send-pack call pack-objects for all remote refs? Daniel Koverman
@ 2015-12-07 22:41 ` Junio C Hamano
  2015-12-07 22:57   ` Jeff King
  0 siblings, 1 reply; 10+ messages in thread
From: Junio C Hamano @ 2015-12-07 22:41 UTC (permalink / raw)
  To: Daniel Koverman; +Cc: git

Daniel Koverman <dkoverman@predictiveTechnologies.com> writes:

> I have a repository which has ~2000 branches on the remote, and it
> takes ~8 seconds to push a change to one ref. The majority of this
> time is spent in pack-object. I wrote a hack so that only the ref
> being updated would be packed (the normal behavior is to pack for
> every ref on the remote).

I am having a hard time understanding what you are trying to say, as
nobody's pack-objects "packs for a ref" or "packs a ref", so my
response has to be based on my best guess---I think you are talking
about feeding the object names of the tips of all remote refs as
the bottoms of the revision range to pack-objects.

When you are pushing your 'topic' branch to update the 'topic'
branch at the remote, it is true that we compute

	git rev-list --objects $your_topic --not $all_of_the_remote_refs

to produce a packfile.  And by tweaking this to

	git rev-list --objects $your_topic --not $their_topic

you will cut down the processing time of 'rev-list', especially if
you have insane number of refs at the remote end.

There is a price you would pay for doing so, though.  An obvious one
is what if the 'topic' branch does not exist yet at the remote.
Without the "--not ..." part, you would end up sending the entire
history behind $your_topic, and the way you prevent that from
happening is to give what are known to exist at the remote end.
Even when there already is 'topic' at the remote, the contents at
the paths that are different between your 'topic' and the 'topic' as
exists at the remote may already exist on some other branches that
are already at the remote (e.g. you may have merged some branches
that are common between your repository and the remote, and the only
object missing from the remote that your repository has to send may
be a merge commit and the top-level tree object), but limiting the
bottoms of the revision range only to "--not $their_topic" would rob
this obvious optimization opportunity from you.

There has to be some way to limit the list of remote-refs that are
used as bottoms of the revision range.  For example, if you know
that the remote has all the tags, and that everything in the v1.0
tag is contained in the v2.0 tag, then a single "--not v2.0" should
give the same result as "--not v1.0 v2.0" that lists both.  But the
computation that is needed to figure out which tags and branches are
not worth listing as bottoms would need to look at all of them at
least once anyway, so a naive implementation of such would end up
spending the same cycles, I would suspect.

Also it was unclear if you are working with a shallow repository.
The performance trade-off made between the packsize and the cycles
is somewhat different between a normal and a shallow repository,
e.g. 2dacf26d (pack-objects: use --objects-edge-aggressive for
shallow repos, 2014-12-24) might be a good starting point to think
about this issue.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Why does send-pack call pack-objects for all remote refs?
  2015-12-07 22:41 ` Junio C Hamano
@ 2015-12-07 22:57   ` Jeff King
  2015-12-08 17:34     ` Daniel Koverman
  0 siblings, 1 reply; 10+ messages in thread
From: Jeff King @ 2015-12-07 22:57 UTC (permalink / raw)
  To: Daniel Koverman; +Cc: Junio C Hamano, git

On Mon, Dec 07, 2015 at 02:41:00PM -0800, Junio C Hamano wrote:

> Also it was unclear if you are working with a shallow repository.
> The performance trade-off made between the packsize and the cycles
> is somewhat different between a normal and a shallow repository,
> e.g. 2dacf26d (pack-objects: use --objects-edge-aggressive for
> shallow repos, 2014-12-24) might be a good starting point to think
> about this issue.

Also note that for a while the "aggressive" form was used everywhere. I
think that started in fbd4a70 (list-objects: mark more commits as edges
in mark_edges_uninteresting - 2013-08-16), and was fixed in 1684c1b
(rev-list: add an option to mark fewer edges as uninteresting,
2014-12-24).

So it was present from v1.8.4.2 up to v2.3.0.

-Peff

^ permalink raw reply	[flat|nested] 10+ messages in thread

* RE: Why does send-pack call pack-objects for all remote refs?
  2015-12-07 22:57   ` Jeff King
@ 2015-12-08 17:34     ` Daniel Koverman
  2015-12-10  4:19       ` Jeff King
  0 siblings, 1 reply; 10+ messages in thread
From: Daniel Koverman @ 2015-12-08 17:34 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, git

Your interpretation of my email was correct. As you picked up on, I
had a fundamental misunderstanding of what pack-objects was doing.
Thanks for the explanation, I have a much better idea of what is
going on now.

Given my use pattern, it may be reasonable for me to patch in an
option to compute

    git rev-list --objects $my_topic --not $subset_of_remote_refs

capitalizing on my knowledge of this particular repository to come
up with heuristics for picking a reasonable subset. This will
come with the risk of sometimes producing an unnecessarily large
(potentially an obscenely large) packfile. You have thoroughly
convinced me that an option like that will not generalize and would
be unsuitable for main line git.

It is also good to know that 2000 remote refs is insane. The lower
hanging fruit here sounds like trimming that to a reasonable
number, so I'll try that approach first.

Thanks again, Junio and Peff.

Daniel

-----Original Message-----
From: Jeff King [mailto:peff@peff.net] 
Sent: Monday, December 07, 2015 5:57 PM
To: Daniel Koverman
Cc: Junio C Hamano; git@vger.kernel.org
Subject: Re: Why does send-pack call pack-objects for all remote refs?

On Mon, Dec 07, 2015 at 02:41:00PM -0800, Junio C Hamano wrote:

> Also it was unclear if you are working with a shallow repository.
> The performance trade-off made between the packsize and the cycles
> is somewhat different between a normal and a shallow repository,
> e.g. 2dacf26d (pack-objects: use --objects-edge-aggressive for
> shallow repos, 2014-12-24) might be a good starting point to think
> about this issue.

Also note that for a while the "aggressive" form was used everywhere. I
think that started in fbd4a70 (list-objects: mark more commits as edges
in mark_edges_uninteresting - 2013-08-16), and was fixed in 1684c1b
(rev-list: add an option to mark fewer edges as uninteresting,
2014-12-24).

So it was present from v1.8.4.2 up to v2.3.0.

-Peff

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Why does send-pack call pack-objects for all remote refs?
  2015-12-08 17:34     ` Daniel Koverman
@ 2015-12-10  4:19       ` Jeff King
  2015-12-12  4:15         ` Nasser Grainawi
  2015-12-14 13:47         ` Daniel Koverman
  0 siblings, 2 replies; 10+ messages in thread
From: Jeff King @ 2015-12-10  4:19 UTC (permalink / raw)
  To: Daniel Koverman; +Cc: Junio C Hamano, git

On Tue, Dec 08, 2015 at 05:34:43PM +0000, Daniel Koverman wrote:

> Your interpretation of my email was correct. As you picked up on, I
> had a fundamental misunderstanding of what pack-objects was doing.
> Thanks for the explanation, I have a much better idea of what is
> going on now.
> 
> Given my use pattern, it may be reasonable for me to patch in an
> option to compute
> 
>     git rev-list --objects $my_topic --not $subset_of_remote_refs

You might also try repacking with "git repack -adb", which will build
reachability bitmaps. Pack-objects can use them to compute the set of
required objects much faster.

> It is also good to know that 2000 remote refs is insane. The lower
> hanging fruit here sounds like trimming that to a reasonable
> number, so I'll try that approach first.

It's definitely a lot, but it's not unheard of. The git project has over
500 tags. That's not 2000, but you're within an order of magnitude.

I have seen repositories with 20,000+ tags. I consider that a bit more
ridiculous, but it does work in practice.

-Peff

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Why does send-pack call pack-objects for all remote refs?
  2015-12-10  4:19       ` Jeff King
@ 2015-12-12  4:15         ` Nasser Grainawi
  2015-12-14 13:47         ` Daniel Koverman
  1 sibling, 0 replies; 10+ messages in thread
From: Nasser Grainawi @ 2015-12-12  4:15 UTC (permalink / raw)
  To: Jeff King; +Cc: Daniel Koverman, Junio C Hamano, git


> On Dec 9, 2015, at 9:19 PM, Jeff King <peff@peff.net> wrote:
> 
> On Tue, Dec 08, 2015 at 05:34:43PM +0000, Daniel Koverman wrote:
> 
>> It is also good to know that 2000 remote refs is insane. The lower
>> hanging fruit here sounds like trimming that to a reasonable
>> number, so I'll try that approach first.
> 
> It's definitely a lot, but it's not unheard of. The git project has over
> 500 tags. That's not 2000, but you're within an order of magnitude.
> 
> I have seen repositories with 20,000+ tags. I consider that a bit more
> ridiculous, but it does work in practice.
> 

We have one at $DAY_JOB with 400,000+ refs. It presents some issues, but
Martin has raised those with the community and it works pretty well now.

Ref advertisement is still a pain...

> --
> To unsubscribe from this list: send the line "unsubscribe git" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, 
a Linux Foundation Collaborative Project

^ permalink raw reply	[flat|nested] 10+ messages in thread

* RE: Why does send-pack call pack-objects for all remote refs?
  2015-12-10  4:19       ` Jeff King
  2015-12-12  4:15         ` Nasser Grainawi
@ 2015-12-14 13:47         ` Daniel Koverman
  2015-12-14 21:04           ` Jeff King
  1 sibling, 1 reply; 10+ messages in thread
From: Daniel Koverman @ 2015-12-14 13:47 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, git, Nasser Grainawi

> You might also try repacking with "git repack -adb", which will
> build reachability bitmaps. Pack-objects can use them to compute
> the set of required objects much faster.

Running "git repack -adb" caused my push time to incease by about 5x.
I made some fresh clones and tried other options with repack, and
consistently anything I tried with -b caused the push time to
increase about 5x.

I don't know much about reachability bitmaps, but perhaps it is
important to note that I timed the pushes after repacking on Git for
Windows. My earlier timings were done on both Linux and Windows and I
did not see a significant difference.

> It's definitely a lot, but it's not unheard of. The git project has
> over 500 tags. That's not 2000, but you're within an order of
> magnitude.
>
> I have seen repositories with 20,000+ tags. I consider that a bit
> more ridiculous, but it does work in practice.

Thanks for the extra context. I'll keep that in mind while I decide
how to approach this.

Daniel

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Why does send-pack call pack-objects for all remote refs?
  2015-12-14 13:47         ` Daniel Koverman
@ 2015-12-14 21:04           ` Jeff King
  2015-12-14 22:31             ` Jonathan Nieder
  0 siblings, 1 reply; 10+ messages in thread
From: Jeff King @ 2015-12-14 21:04 UTC (permalink / raw)
  To: Daniel Koverman; +Cc: Junio C Hamano, git, Nasser Grainawi

On Mon, Dec 14, 2015 at 01:47:39PM +0000, Daniel Koverman wrote:

> > You might also try repacking with "git repack -adb", which will
> > build reachability bitmaps. Pack-objects can use them to compute
> > the set of required objects much faster.
> 
> Running "git repack -adb" caused my push time to incease by about 5x.
> I made some fresh clones and tried other options with repack, and
> consistently anything I tried with -b caused the push time to
> increase about 5x.
> 
> I don't know much about reachability bitmaps, but perhaps it is
> important to note that I timed the pushes after repacking on Git for
> Windows. My earlier timings were done on both Linux and Windows and I
> did not see a significant difference.

Hmm. I guess that makes sense. The bitmap we want is the set difference
between the objects we are sending, and the tips the other side has. If
we have a bitmap at each ref tip, that's very fast. But if you have a
very large number of refs, we don't make one for each ref, and it has to
fallback to walking to the nearest one (and it ends up worse than a
regular walk, because it's filling in the bitmap for each tree, rather
than just doing the "good enough" commit walk that we usually do).

I suspect there's room for improvement in the way we select commits to
store bitmaps for (so that the average walk is smaller). But it's rather
tricky; there's not a single constant to change to make it work better.

Thanks for trying out my suggestion.

-Peff

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Why does send-pack call pack-objects for all remote refs?
  2015-12-14 21:04           ` Jeff King
@ 2015-12-14 22:31             ` Jonathan Nieder
  2015-12-14 22:37               ` Jeff King
  0 siblings, 1 reply; 10+ messages in thread
From: Jonathan Nieder @ 2015-12-14 22:31 UTC (permalink / raw)
  To: Jeff King; +Cc: Daniel Koverman, Junio C Hamano, git, Nasser Grainawi

Jeff King wrote:

> Hmm. I guess that makes sense. The bitmap we want is the set difference
> between the objects we are sending, and the tips the other side has. If
> we have a bitmap at each ref tip, that's very fast. But if you have a
> very large number of refs, we don't make one for each ref, and it has to
> fallback to walking to the nearest one (and it ends up worse than a
> regular walk, because it's filling in the bitmap for each tree, rather
> than just doing the "good enough" commit walk that we usually do).
>
> I suspect there's room for improvement in the way we select commits to
> store bitmaps for (so that the average walk is smaller). But it's rather
> tricky; there's not a single constant to change to make it work better.

Git gc and JGit GC differ here.  JGit partitions the commits being
packed by branch and then runs a selection algorithm on each part.
Git runs a selection once on a list of all commits.

Some effects:
- JGit selects more bitmaps, so the gc takes longer and the resulting
  bitmap file is larger (bad)
- JGit is more likely to have bitmaps for the commits involved in
  pushes and fetches (good)

The commit selection code, for reference:

https://eclipse.googlesource.com/jgit/jgit/+/86af34e1/org.eclipse.jgit/src/org/eclipse/jgit/internal/storage/pack/PackWriterBitmapPreparer.java#151
https://kernel.googlesource.com/pub/scm/git/git/+/ed1c9977/pack-bitmap-write.c#383

Thoughts?
Jonathan

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Why does send-pack call pack-objects for all remote refs?
  2015-12-14 22:31             ` Jonathan Nieder
@ 2015-12-14 22:37               ` Jeff King
  0 siblings, 0 replies; 10+ messages in thread
From: Jeff King @ 2015-12-14 22:37 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: Daniel Koverman, Junio C Hamano, git, Nasser Grainawi

On Mon, Dec 14, 2015 at 02:31:55PM -0800, Jonathan Nieder wrote:

> > I suspect there's room for improvement in the way we select commits to
> > store bitmaps for (so that the average walk is smaller). But it's rather
> > tricky; there's not a single constant to change to make it work better.
> 
> Git gc and JGit GC differ here.  JGit partitions the commits being
> packed by branch and then runs a selection algorithm on each part.
> Git runs a selection once on a list of all commits.
> 
> Some effects:
> - JGit selects more bitmaps, so the gc takes longer and the resulting
>   bitmap file is larger (bad)
> - JGit is more likely to have bitmaps for the commits involved in
>   pushes and fetches (good)
> 
> The commit selection code, for reference:
> 
> https://eclipse.googlesource.com/jgit/jgit/+/86af34e1/org.eclipse.jgit/src/org/eclipse/jgit/internal/storage/pack/PackWriterBitmapPreparer.java#151
> https://kernel.googlesource.com/pub/scm/git/git/+/ed1c9977/pack-bitmap-write.c#383
> 
> Thoughts?

My thought is it would be great if somebody wanted to work on this. :)

My understanding is that JGit's approach has some problems, too. Terry's
message doesn't seem to have made it to the list, but you can see in the
quoted bits he mentions some OOM problems during the bitmap write:

  http://article.gmane.org/gmane.comp.version-control.git/281476

That may not be a big deal to work around. I really just haven't looked
at it at all. Vicent did the original bitmap selection code for C Git,
and I don't think it has been touched since then.

-Peff

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2015-12-14 22:37 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-12-07 21:02 Why does send-pack call pack-objects for all remote refs? Daniel Koverman
2015-12-07 22:41 ` Junio C Hamano
2015-12-07 22:57   ` Jeff King
2015-12-08 17:34     ` Daniel Koverman
2015-12-10  4:19       ` Jeff King
2015-12-12  4:15         ` Nasser Grainawi
2015-12-14 13:47         ` Daniel Koverman
2015-12-14 21:04           ` Jeff King
2015-12-14 22:31             ` Jonathan Nieder
2015-12-14 22:37               ` Jeff King

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.