From: Elijah Newren <newren@gmail.com>
To: Derrick Stolee <stolee@gmail.com>
Cc: Git Mailing List <git@vger.kernel.org>,
	Jonathan Nieder <jrnieder@gmail.com>
Subject: Re: Huge push upload despite only having a tiny change
Date: Tue, 2 Jun 2020 18:35:13 -0700
Message-ID: <CABPp-BGH=uqOP2x5w4ghLBv1sUiyKwdj1ox8kJKrELOp_OhudQ@mail.gmail.com>
In-Reply-To: <29e6c05e-2d79-d2ac-a033-dab6342ebcaa@gmail.com>

On Tue, Jun 2, 2020 at 12:40 PM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 6/2/2020 3:21 PM, Elijah Newren wrote:
> > * The user was two commits behind the closely-related branch at the
> > time of the first push, and 10 commits behind at the time of the
> > second push.  Running format-patch on these 10 commits that were on
> > the server at the time shows their size is at most about ~55 k.
>
> This is most-likely the difference, since the pack-objects algorithm
> only looks at the _boundary_ between the server's commits and the
> commits-to-push. This also could have dramatically changed the delta-base
> matches.
>
> Do you have exact object counts? It would help to know if somehow the
> object discovery algorithm is at fault or the delta-base algorithm
> is to blame.

Here is the output from the user's terminal for the two pushes, with
only the URLs redacted.  The first push was:

Enumerating objects: 23, done.
Counting objects: 100% (23/23), done.
Delta compression using up to 16 threads
Compressing objects: 100% (7/7), done.
Writing objects: 100% (12/12), 952 bytes | 952.00 KiB/s, done.
Total 12 (delta 5), reused 0 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (5/5)
remote: Processing changes: refs: 1, new: 1, done
remote: commit 5efcf04: warning: subject >50 characters; use shorter
first paragraph
remote:
remote: SUCCESS
remote:
remote:   https://gerrit.internal.site/c/path/to/repo/+/341489 Remove
ambiguous method from ArtifactProducerPeeringService [NEW]
remote:
To gerrit.internal.site:path/to/repo.git
 * [new branch]                HEAD -> refs/for/develop

The second was:

Enumerating objects: 325816, done.
Counting objects: 100% (266559/266559), done.
Delta compression using up to 16 threads
Compressing objects: 100% (102457/102457), done.
Writing objects: 100% (257630/257630), 102.88 MiB | 619.00 KiB/s, done.
Total 257630 (delta 122169), reused 218776 (delta 87259), pack-reused 0
remote: Resolving deltas: 100% (122169/122169)
remote: Processing changes: refs: 1, updated: 1, done
remote: commit 360a266: warning: subject >50 characters; use shorter
first paragraph
remote:
remote: SUCCESS
remote:
remote:   https://gerrit.internal.site/c/path/to/repo/+/341489 Remove
ambiguous method from ArtifactProducerPeeringService
remote:
To gerrit.internal.site:path/to/repo.git
 * [new branch]                HEAD -> refs/for/develop

> For instance, `pack.useSparse` was enabled by default this release,
> and has some opportunity to push extra objects. See [1] for more
> details on both the "boundary" description (the "commit frontier")
> but also how that option changes the algorithm.
>
> The only case I know of that could lead to sending extra objects
> (that was not the case before) is described in t5322-pack-objects-sparse.sh
> and 4f6d26b1 (list-objects: consume sparse tree walk, 2019-01-16).
> It involves doing a full _copy_ of a tree from one position to
> another, without "disturbing" the parent of the original tree.
>
> (I mean: copy directory A/B to C/D and make sure nothing is
> different in directory A.)
>
> However, if these two pushes were with the same config setting,
> I'm not sure what could have changed between the pushes to hit
> this very narrow case.

So, after a day of trying to figure out how to debug this, I found out
we had a backup of the server taken between the first and second pushes,
fairly close to the second push.  I was able to launch a separate VM
from that backup, and then attempted to make a local repo that I hoped
would mimic what the user had.  I can't duplicate the 100MB push (which
is about 25000x bigger than expected) that the user saw, but with some
tweaking I can state that the push size should be closer to ~2.5 KB, and
I can readily duplicate pushes in the 4-6 MB range -- i.e. about 1000x
bigger than expected.  pack.useSparse affects things, but not much.
So, here is some output from my own duplication attempts:

First, if I do a git fetch of the 'develop' branch (I can even nuke
.git/FETCH_HEAD afterwards; it only matters that I have the history
locally) from the place I'm pushing to, then the push size is tiny as
expected:

<Go to server, use rsync to make a pristine copy of repo>
<Go to server, rsync repo from pristine state, then on my laptop:>

$ git fetch git_over_ssh_url develop
$ rm .git/FETCH_HEAD
$ time git push git_over_ssh_url mike-push:refs/for/develop
Enumerating objects: 37, done.
Counting objects: 100% (37/37), done.
Delta compression using up to 8 threads
Compressing objects: 100% (11/11), done.
Writing objects: 100% (22/22), 2.54 KiB | 2.54 MiB/s, done.
Total 22 (delta 7), reused 14 (delta 1), pack-reused 0
remote: Checking connectivity: 22, done.
To git_over_ssh_url
 * [new branch]                mike-push -> refs/for/develop

real 0m13.044s
user 0m0.202s
sys 0m0.146s
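
The point above that "it only matters that I have the history locally"
can be checked directly: list the tips the server advertises and test
whether each one exists in the local object store.  What follows is a
minimal sketch on a throwaway toy setup; the paths, the g() wrapper and
the missing_tips() helper are all hypothetical and not part of the
original reproduction.

```shell
#!/bin/sh
# Toy setup: every path and name below is made up for illustration.
set -e
work=$(mktemp -d)
g() { git -c user.name=demo -c user.email=demo@example.com "$@"; }

# A bare "server" repository whose default branch is develop.
git init -q --bare "$work/server.git"
git --git-dir="$work/server.git" symbolic-ref HEAD refs/heads/develop

# A seed repository that populates the server with one commit.
git init -q "$work/seed"
cd "$work/seed"
git symbolic-ref HEAD refs/heads/develop
echo one >file.txt
g add file.txt
g commit -q -m "first"
git push -q "$work/server.git" develop

# The "user" clones while the server is at the first commit.
git clone -q "$work/server.git" "$work/client"

# The server then moves ahead by one commit.
echo two >>file.txt
g commit -q -a -m "second"
git push -q "$work/server.git" develop

# Count advertised tips whose objects are absent from the local repo.
missing_tips() {
    git ls-remote "$work/server.git" | cut -f1 | sort -u |
    while read -r oid; do
        git cat-file -e "$oid" 2>/dev/null || echo "$oid"
    done | wc -l | tr -d ' '
}

cd "$work/client"
before=$(missing_tips)
git fetch -q "$work/server.git" develop
after=$(missing_tips)
echo "missing before fetch: $before, after fetch: $after"
```

In this toy case the only advertised tip is missing until the fetch
brings it in.  Running the same ls-remote / cat-file loop against a real
server would show how many of its tips the local repository can use as
exclusion points.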

<Remove the pack downloaded from the earlier fetch of develop>
<Without this step, I always get small pushes:>
$ ls -rt .git/objects/pack/ | tail -n 2 | xargs -n 1 -IPATH rm .git/objects/pack/PATH

<Go to server, rsync repo from pristine state, then on my laptop:>
$ time git push git_over_ssh_url mike-push:refs/for/develop
Enumerating objects: 40785, done.
Counting objects: 100% (20543/20543), done.
Delta compression using up to 8 threads
Compressing objects: 100% (9032/9032), done.
Writing objects: 100% (16685/16685), 6.27 MiB | 208.00 KiB/s, done.
Total 16685 (delta 7103), reused 12412 (delta 3389), pack-reused 0
remote: Resolving deltas: 100% (7103/7103), completed with 1864 local objects.
remote: Checking connectivity: 22, done.
To git_over_ssh_url
 * [new branch]                mike-push -> refs/for/develop

real 0m59.703s
user 0m3.139s
sys 0m0.735s
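
To illustrate why the location of the "boundary" Derrick mentioned
matters so much, here is a small sketch on a toy repository.  It is an
illustration only, not a diagnosis of the push above: the repository
layout, the old-shared tag, and the count() helper are all assumptions
made up for the example.  It counts the objects that rev-list would
enumerate for the same topic branch when the excluded commit is close
to the branch point versus far behind it.

```shell
#!/bin/sh
# Toy repository: names and layout are hypothetical.
set -e
work=$(mktemp -d)
cd "$work"
git init -q repo
cd repo
g() { git -c user.name=demo -c user.email=demo@example.com "$@"; }
git symbolic-ref HEAD refs/heads/develop

# Twenty commits on develop, each adding one new file.
i=1
while [ "$i" -le 20 ]; do
    echo "content $i" >"file$i.txt"
    g add "file$i.txt"
    g commit -q -m "develop commit $i"
    i=$((i + 1))
done

# An old commit assumed to be known to both sides (the 5th commit).
git tag old-shared develop~15

# A topic branch forked two commits behind develop, with one small change.
git checkout -q -b topic develop~2
echo tweak >>file1.txt
g commit -q -a -m "tiny change"

# Number of objects reachable from the positive revs but not the negative.
count() { git rev-list --objects "$@" | wc -l | tr -d ' '; }

near=$(count topic --not develop~2)
far=$(count topic --not old-shared)
echo "near boundary: $near objects; far boundary: $far objects"
```

The change on topic is identical in both cases; only the commit used as
the exclusion point differs, and the far boundary pulls in every commit,
tree and blob added in between.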

<Also, this is only slightly affected by pack.useSparse...>
<Go to server, rsync repo from pristine state, then on my laptop:>
$ time git -c pack.useSparse=false push git_over_ssh_url mike-push:refs/for/develop
Enumerating objects: 39891, done.
Counting objects: 100% (18953/18953), done.
Delta compression using up to 8 threads
Compressing objects: 100% (7687/7687), done.
Writing objects: 100% (14991/14991), 4.85 MiB | 228.00 KiB/s, done.
Total 14991 (delta 6665), reused 11301 (delta 3141), pack-reused 0
remote: Resolving deltas: 100% (6665/6665), completed with 1939 local objects.
remote: Checking connectivity: 22, done.
To git_over_ssh_url
 * [new branch]                mike-push -> refs/for/develop

real 0m49.295s
user 0m2.362s
sys 0m0.592s
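
The effect of pack.useSparse can also be measured without pushing, by
running pack-objects by hand with --sparse and --no-sparse on the same
revision range.  This is a rough sketch on a toy repository; the layout
and the pack_bytes() helper are hypothetical, and the option set is only
an approximation of what a push runs internally.

```shell
#!/bin/sh
# Toy repository: names and layout are hypothetical.
set -e
work=$(mktemp -d)
cd "$work"
git init -q repo
cd repo
g() { git -c user.name=demo -c user.email=demo@example.com "$@"; }
git symbolic-ref HEAD refs/heads/develop

mkdir -p a/b c
echo base >a/b/data.txt
echo other >c/data.txt
g add .
g commit -q -m "base"

git checkout -q -b topic
echo change >>c/data.txt
g commit -q -a -m "small change"

# $1 is --sparse or --no-sparse; the revision list is fed on stdin.
# Prints the size in bytes of the thin pack that would be generated.
pack_bytes() {
    printf '%s\n' topic '^develop' |
    git pack-objects "$1" --revs --thin --stdout -q | wc -c | tr -d ' '
}

sparse=$(pack_bytes --sparse)
dense=$(pack_bytes --no-sparse)
echo "sparse: $sparse bytes; no-sparse: $dense bytes"
```

On a real repository, replacing topic and ^develop with the branch being
pushed and the commits known to be on the server gives a byte count for
each algorithm that can be compared against the "Writing objects" line.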


Finally, you'll note that my reproduction setup was a bit different
from what Mike (the end user) was dealing with.  I was using vanilla
git-2.27.0.  Also, on the server, I was using plain old git-over-ssh,
with the server running git-2.19.0.  However, I also tried it with
gerrit (i.e. jgit) as the server and got identical numbers for
enumerating objects, counting objects, and compressing objects, while
the size of the pushed data and the number of resolved deltas were
within 1% of the git-over-ssh case.  That was true both with and
without pack.useSparse.

Any ideas?  Anything else I should try or provide data on?

