git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Huge push upload despite only having a tiny change
@ 2020-06-02 19:21 Elijah Newren
  2020-06-02 19:40 ` Derrick Stolee
  2020-06-03  1:53 ` Jonathan Nieder
  0 siblings, 2 replies; 6+ messages in thread
From: Elijah Newren @ 2020-06-02 19:21 UTC (permalink / raw)
  To: Git Mailing List; +Cc: Jonathan Nieder

Hi,

I had a user report that two nearly identical pushes (the second being
an amended commit of the first) took dramatically differing amounts of
time and amount of data uploaded (from 4.5 seconds and about 21k
uploaded, to 223 seconds and over 100 MB uploaded).

I'm curious if this might be a known issue; it sounds similar to some
push protocol discussion I remember from the contributor's summit (but
I don't know anything on the protocol side and tend to work on other
things during protocol discussion).  If this does sound like a known
issue, does anyone have links to some relevant discussion I can read
up on (and perhaps pass long to this user)?  If it doesn't sound like
a known issue, what other things would be useful for me to dig up?

Additional details:
* Both pushes involved cases where the user had a single commit that
the server didn't.
* The parent of the commit that needed to be pushed was the same in both cases.
* The commit in question was small; modifying either 13 or 15 lines of
two files that were each less than about 8k in size.
* The user was pushing up a new branch each time, but the new branch
was closely related to an existing branch (i.e. had all but a few
commits of history in common)
* The user was two commits behind the closely-related branch at the
time of the first push, and 10 commits behind at the time of the
second push.  Running format-patch on these 10 commits that were on
the server at the time shows their size is at most about ~55 k.
* The server has a huge number of refs; about 470k of them (most of
them related to code reviews).
* The server was running Gerrit 3.1.4 (i.e. jgit).
* The user was using a version of git based off master from a few
weeks ago, in particular, with a few changes on top of commit
b994622632 ("The eighth batch", 2020-05-08).  I don't think the few
internal changes could affect anything here, but those changes were:
(1) making features.experimental default to true (which only turns on
two fetch settings as far as I can tell), (2) making
merge.directoryRenames default to true, and (3) setting a few trace2.*
config settings by default.

Thanks,
Elijah

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Huge push upload despite only having a tiny change
  2020-06-02 19:21 Huge push upload despite only having a tiny change Elijah Newren
@ 2020-06-02 19:40 ` Derrick Stolee
  2020-06-03  1:35   ` Elijah Newren
  2020-06-03  1:53 ` Jonathan Nieder
  1 sibling, 1 reply; 6+ messages in thread
From: Derrick Stolee @ 2020-06-02 19:40 UTC (permalink / raw)
  To: Elijah Newren, Git Mailing List; +Cc: Jonathan Nieder

On 6/2/2020 3:21 PM, Elijah Newren wrote:
> * The user was two commits behind the closely-related branch at the
> time of the first push, and 10 commits behind at the time of the
> second push.  Running format-patch on these 10 commits that were on
> the server at the time shows their size is at most about ~55 k.

This is most-likely the difference, since the pack-objects algorithm
only looks at the _boundary_ between the server's commits and the
commits-to-push. This also could have dramatically changed the delta-base
matches.

Do you have exact object counts? It would help to know if somehow the
object discovery algorithm is at fault or the delta-base algorithm
is to blame.

For instance, `pack.useSparse` was enabled by default this release,
and has some opportunity to push extra objects. See [1] for more
details on both the "boundary" description (the "commit frontier")
but also how that option changes the algorithm.

The only case I know of that could lead to sending extra objects
(that was not the case before) is described in t5322-pack-objects-sparse.sh
and 4f6d26b1 (list-objects: consume sparse tree walk, 2019-01-16).
It involves doing a full _copy_ of a tree from one position to
another, without "disturbing" the parent of the original tree.

(I mean: copy directory A/B to C/D and make sure nothing is
different in directory A.)

However, if these two pushes were with the same config setting,
I'm not sure what could have changed between the pushes to hit
this very narrow case.

Thanks,
-Stolee


[1] https://devblogs.microsoft.com/devops/exploring-new-frontiers-for-git-push-performance/

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Huge push upload despite only having a tiny change
  2020-06-02 19:40 ` Derrick Stolee
@ 2020-06-03  1:35   ` Elijah Newren
  0 siblings, 0 replies; 6+ messages in thread
From: Elijah Newren @ 2020-06-03  1:35 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Git Mailing List, Jonathan Nieder

On Tue, Jun 2, 2020 at 12:40 PM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 6/2/2020 3:21 PM, Elijah Newren wrote:
> > * The user was two commits behind the closely-related branch at the
> > time of the first push, and 10 commits behind at the time of the
> > second push.  Running format-patch on these 10 commits that were on
> > the server at the time shows their size is at most about ~55 k.
>
> This is most-likely the difference, since the pack-objects algorithm
> only looks at the _boundary_ between the server's commits and the
> commits-to-push. This also could have dramatically changed the delta-base
> matches.
>
> Do you have exact object counts? It would help to know if somehow the
> object discovery algorithm is at fault or the delta-base algorithm
> is to blame.

Output from the users' terminal for the two different runs, just
redacting URLs.  The first push was:

Enumerating objects: 23, done.
Counting objects: 100% (23/23), done.
Delta compression using up to 16 threads
Compressing objects: 100% (7/7), done.
Writing objects: 100% (12/12), 952 bytes | 952.00 KiB/s, done.
Total 12 (delta 5), reused 0 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (5/5)
remote: Processing changes: refs: 1, new: 1, done
remote: commit 5efcf04: warning: subject >50 characters; use shorter
first paragraph
remote:
remote: SUCCESS
remote:
remote:   https://gerrit.internal.site/c/path/to/repo/+/341489 Remove
ambiguous method from ArtifactProducerPeeringService [NEW]
remote:
To gerrit.internal.site:path/to/repo.git
 * [new branch]                HEAD -> refs/for/develop

The second was:

Enumerating objects: 325816, done.
Counting objects: 100% (266559/266559), done.
Delta compression using up to 16 threads
Compressing objects: 100% (102457/102457), done.
Writing objects: 100% (257630/257630), 102.88 MiB | 619.00 KiB/s, done.
Total 257630 (delta 122169), reused 218776 (delta 87259), pack-reused 0
remote: Resolving deltas: 100% (122169/122169)
remote: Processing changes: refs: 1, updated: 1, done
remote: commit 360a266: warning: subject >50 characters; use shorter
first paragraph
remote:
remote: SUCCESS
remote:
remote:   https://gerrit.internal.site/c/path/to/repo/+/341489 Remove
ambiguous method from ArtifactProducerPeeringService
remote:
To gerrit.internal.site:path/to/repo.git
 * [new branch]                HEAD -> refs/for/develop

> For instance, `pack.useSparse` was enabled by default this release,
> and has some opportunity to push extra objects. See [1] for more
> details on both the "boundary" description (the "commit frontier")
> but also how that option changes the algorithm.
>
> The only case I know of that could lead to sending extra objects
> (that was not the case before) is described in t5322-pack-objects-sparse.sh
> and 4f6d26b1 (list-objects: consume sparse tree walk, 2019-01-16).
> It involves doing a full _copy_ of a tree from one position to
> another, without "disturbing" the parent of the original tree.
>
> (I mean: copy directory A/B to C/D and make sure nothing is
> different in directory A.)
>
> However, if these two pushes were with the same config setting,
> I'm not sure what could have changed between the pushes to hit
> this very narrow case.

So, after a day of attempting to figure out how to debug, I found out
we had a backup of the server between the first and second pushes and
fairly close to the second push.  I was able to launch a separate VM
of that backup, and then attempt to make a local repo that I hoped
mimic what the user had.  I can't duplicate the 100MB push (which is
about 25000x bigger than expected) as the user did, but I can with
some tweaking state that the push size should be closer to ~2.5 KB and
I can readly duplicate pushes in the 4-6 MB range -- i.e. about 1000x
bigger than expected.  pack.useSparse affects things, but not much.
So, some output of my own duplication attempts:

First, if I do a git fetch of the 'develop' branch (I can even nuke
.git/FETCH_HEAD afterwards; it only matters that I have the history
locally) from the place I'm pushing to, then the push size is tiny as
expected:

<Go to server, use rsync to make a pristine copy of repo>
<Go to server, rsync repo from pristine state, then on my laptop:>

$ git fetch git_over_ssh_url develop
$ rm .git/FETCH_HEAD
$ time git push git_over_ssh_url mike-push:refs/for/develop
Enumerating objects: 37, done.
Counting objects: 100% (37/37), done.
Delta compression using up to 8 threads
Compressing objects: 100% (11/11), done.
Writing objects: 100% (22/22), 2.54 KiB | 2.54 MiB/s, done.
Total 22 (delta 7), reused 14 (delta 1), pack-reused 0
remote: Checking connectivity: 22, done.
To git_over_ssh_url
 * [new branch]                mike-push -> refs/for/develop

real 0m13.044s
user 0m0.202s
sys 0m0.146s

<Remove the pack downloaded from the earlier fetch of develop>
<Without this step, I always get small pushes:>
$ ls -rt .git/objects/pack/ | tail -n 2 | xargs -n 1 -IPATH rm
.git/objects/pack/PATH

<Go to server, rsync repo from pristine state, then on my laptop:>
$ time git push git_over_ssh_url mike-push:refs/for/develop
Enumerating objects: 40785, done.
Counting objects: 100% (20543/20543), done.
Delta compression using up to 8 threads
Compressing objects: 100% (9032/9032), done.
Writing objects: 100% (16685/16685), 6.27 MiB | 208.00 KiB/s, done.
Total 16685 (delta 7103), reused 12412 (delta 3389), pack-reused 0
remote: Resolving deltas: 100% (7103/7103), completed with 1864 local objects.
remote: Checking connectivity: 22, done.
To git_over_ssh_url
 * [new branch]                mike-push -> refs/for/develop

real 0m59.703s
user 0m3.139s
sys 0m0.735s

<Also, this is only slightly affected by pack.useSparse...>
<Go to server, rsync repo from pristine state, then on my laptop:>
$ time git -c pack.useSparse=false push git_over_ssh_url
mike-push:refs/for/develop
Enumerating objects: 39891, done.
Counting objects: 100% (18953/18953), done.
Delta compression using up to 8 threads
Compressing objects: 100% (7687/7687), done.
Writing objects: 100% (14991/14991), 4.85 MiB | 228.00 KiB/s, done.
Total 14991 (delta 6665), reused 11301 (delta 3141), pack-reused 0
remote: Resolving deltas: 100% (6665/6665), completed with 1939 local objects.
remote: Checking connectivity: 22, done.
To git_over_ssh_url
 * [new branch]                mike-push -> refs/for/develop

real 0m49.295s
user 0m2.362s
sys 0m0.592s


Finally, you'll note that when I was reproducing, things were a bit
different than what Mike (the end user) was dealing with.  I was using
vanilla git-2.27.0.  Also, on the server, I was using plain old
git-over-ssh, with the server running git-2.19.0.  However, I also
tried it with gerrit (i.e. jgit) as the server and got identical
numbers for enumerating objects, counting objects, compressing
objects, and the size of the pushed data and the number of resolved
deltas was within 1% of the git-over-ssh case.  And that was also true
both with and without pack.useSparse.

Any ideas?  Anything else I should try or provide data on?

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Huge push upload despite only having a tiny change
  2020-06-02 19:21 Huge push upload despite only having a tiny change Elijah Newren
  2020-06-02 19:40 ` Derrick Stolee
@ 2020-06-03  1:53 ` Jonathan Nieder
  2020-06-03  2:36   ` Elijah Newren
  2020-06-03 20:39   ` Junio C Hamano
  1 sibling, 2 replies; 6+ messages in thread
From: Jonathan Nieder @ 2020-06-03  1:53 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Git Mailing List, Derrick Stolee, Jonathan Tan, Minh Thai

(cc: Jonathan Tan for "git push" discussion; Minh Thai for negotiate
 hook discussion)
Hi,

Elijah Newren wrote:

> I had a user report that two nearly identical pushes (the second being
> an amended commit of the first) took dramatically differing amounts of
> time and amount of data uploaded (from 4.5 seconds and about 21k
> uploaded, to 223 seconds and over 100 MB uploaded).

Yes, this is why I want push negotiation.  (It's been something we've
been discussing for protocol v2 for push.)

If they fetch before they push, does that help?

[...]
> * The server was running Gerrit 3.1.4 (i.e. jgit).

Gerrit servers have the interesting property that many people are
pushing to the same Git repo.  (This is common in some other hosting
scenarios such as Gitlab, but the most common case among Git users
still seems to be pushing to a repo you own.)

When you push, because there's no negotiation phase, the only
information we have about what is present on the server is what is in
the ref advertisement.  (We have remote-tracking branches which seem
potentially useful, but we don't have a way to ask the server "are
these objects you still have?")  The ref advertisement describes the
*current* state of all refs.  If I am pushing a new topic branch (in
Gerrit jargon, a new change for review) based on the *old* state of a
branch that has moved on, then we can only hope that some other ref
(for example a tag) points to a recent enough state to give us a base
for what to upload.

There is one trick a server can use to mitigate this: advertise some
refs that don't exist!  If you advertise a ref ".have", then Git
will understand that the server has that object but it is not an
actual ref.  Gerrit uses this trick in its HackPushNegotiateHook[1]
to advertise a few recent commits.

At $DAYJOB we ran into some clients where "a few recent commits" was
not sufficient to get to history that the client is aware of.  We
tried changing it to do some exponential deepening, and that helped.
We should probably upstream that change for other Gerrit users.

Gerrit also advertises some other ".have"s, for example for recent
changes by the same author in case you're uploading an amended
version.  That's less relevant here.

But fundamentally, this is something that cannot be addressed properly
without improving the "git push" protocol (adding a negotiation
phase).

Summary: (1) try fetching first (2) let's improve
HackPushNegotiateHook#advertiseRefs (3) let's improve "git push"
protocol to make this a problem of the past.

Thanks and hope that helps,
Jonathan

[1] https://gerrit.googlesource.com/gerrit/+/e1f4fee1f3ce674f44cb9788e6798ff8522bb876/java/com/google/gerrit/server/git/receive/HackPushNegotiateHook.java#111

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Huge push upload despite only having a tiny change
  2020-06-03  1:53 ` Jonathan Nieder
@ 2020-06-03  2:36   ` Elijah Newren
  2020-06-03 20:39   ` Junio C Hamano
  1 sibling, 0 replies; 6+ messages in thread
From: Elijah Newren @ 2020-06-03  2:36 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: Git Mailing List, Derrick Stolee, Jonathan Tan, Minh Thai

Hi,

On Tue, Jun 2, 2020 at 6:53 PM Jonathan Nieder <jrnieder@gmail.com> wrote:
>
> (cc: Jonathan Tan for "git push" discussion; Minh Thai for negotiate
>  hook discussion)
> Hi,
>
> Elijah Newren wrote:
>
> > I had a user report that two nearly identical pushes (the second being
> > an amended commit of the first) took dramatically differing amounts of
> > time and amount of data uploaded (from 4.5 seconds and about 21k
> > uploaded, to 223 seconds and over 100 MB uploaded).
>
> Yes, this is why I want push negotiation.  (It's been something we've
> been discussing for protocol v2 for push.)
>
> If they fetch before they push, does that help?

In my attempts to reproduce which I assume are similar, yes it helps immensely.

> [...]
> > * The server was running Gerrit 3.1.4 (i.e. jgit).
>
> Gerrit servers have the interesting property that many people are
> pushing to the same Git repo.  (This is common in some other hosting
> scenarios such as Gitlab, but the most common case among Git users
> still seems to be pushing to a repo you own.)
>
> When you push, because there's no negotiation phase, the only
> information we have about what is present on the server is what is in
> the ref advertisement.  (We have remote-tracking branches which seem
> potentially useful, but we don't have a way to ask the server "are
> these objects you still have?")  The ref advertisement describes the
> *current* state of all refs.  If I am pushing a new topic branch (in
> Gerrit jargon, a new change for review) based on the *old* state of a
> branch that has moved on, then we can only hope that some other ref
> (for example a tag) points to a recent enough state to give us a base
> for what to upload.
>
> There is one trick a server can use to mitigate this: advertise some
> refs that don't exist!  If you advertise a ref ".have", then Git
> will understand that the server has that object but it is not an
> actual ref.  Gerrit uses this trick in its HackPushNegotiateHook[1]
> to advertise a few recent commits.
>
> At $DAYJOB we ran into some clients where "a few recent commits" was
> not sufficient to get to history that the client is aware of.  We
> tried changing it to do some exponential deepening, and that helped.
> We should probably upstream that change for other Gerrit users.
>
> Gerrit also advertises some other ".have"s, for example for recent
> changes by the same author in case you're uploading an amended
> version.  That's less relevant here.
>
> But fundamentally, this is something that cannot be addressed properly
> without improving the "git push" protocol (adding a negotiation
> phase).
>
> Summary: (1) try fetching first (2) let's improve
> HackPushNegotiateHook#advertiseRefs (3) let's improve "git push"
> protocol to make this a problem of the past.
>
> Thanks and hope that helps,
> Jonathan
>
> [1] https://gerrit.googlesource.com/gerrit/+/e1f4fee1f3ce674f44cb9788e6798ff8522bb876/java/com/google/gerrit/server/git/receive/HackPushNegotiateHook.java#111

Yes, this helps a lot.  Thanks for the pointers and explanations!

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Huge push upload despite only having a tiny change
  2020-06-03  1:53 ` Jonathan Nieder
  2020-06-03  2:36   ` Elijah Newren
@ 2020-06-03 20:39   ` Junio C Hamano
  1 sibling, 0 replies; 6+ messages in thread
From: Junio C Hamano @ 2020-06-03 20:39 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Elijah Newren, Git Mailing List, Derrick Stolee, Jonathan Tan, Minh Thai

Jonathan Nieder <jrnieder@gmail.com> writes:

> There is one trick a server can use to mitigate this: advertise some
> refs that don't exist!  If you advertise a ref ".have", then Git
> will understand that the server has that object but it is not an
> actual ref.  Gerrit uses this trick in its HackPushNegotiateHook[1]
> to advertise a few recent commits.
> ...
> Summary: (1) try fetching first (2) let's improve
> HackPushNegotiateHook#advertiseRefs (3) let's improve "git push"
> protocol to make this a problem of the past.

In the meantime, perhaps we can teach receive-pack the same "let's
advertise a few recent commits as if they exist as tips of refs" as
well?

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2020-06-03 20:39 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-02 19:21 Huge push upload despite only having a tiny change Elijah Newren
2020-06-02 19:40 ` Derrick Stolee
2020-06-03  1:35   ` Elijah Newren
2020-06-03  1:53 ` Jonathan Nieder
2020-06-03  2:36   ` Elijah Newren
2020-06-03 20:39   ` Junio C Hamano

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).