git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jeff King <peff@peff.net>
To: Taylor Blau <me@ttaylorr.com>
Cc: "Patrick Steinhardt" <ps@pks.im>,
	git@vger.kernel.org,
	"Felipe Contreras" <felipe.contreras@gmail.com>,
	"SZEDER Gábor" <szeder.dev@gmail.com>,
	"Chris Torek" <chris.torek@gmail.com>,
	"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
	"Junio C Hamano" <gitster@pobox.com>
Subject: Re: [PATCH v2 3/3] connected: implement connectivity check using bitmaps
Date: Wed, 30 Jun 2021 01:45:03 -0400	[thread overview]
Message-ID: <YNwE3wES3iv+Xynp@coredump.intra.peff.net> (raw)
In-Reply-To: <YNvgA6RLIMdD77Hk@nand.local>

On Tue, Jun 29, 2021 at 11:07:47PM -0400, Taylor Blau wrote:

> On Tue, Jun 29, 2021 at 10:04:56PM -0400, Jeff King wrote:
> > In the warm-cache case, the improvement seems to go away (or maybe I'm
> > holding it wrong; even in the cold-cache case I don't get anywhere near
> > as impressive a speedup as you showed above). Which isn't to say that
> > cold-cache isn't sometimes important, and this may still be worth doing.
> > But it really seems like the CPU involve in walking over the file isn't
> > actually that much.
> 
> Hmm. I think that you might be holding it wrong, or at least I'm able to
> reproduce some substantial improvements in the warm cache case with
> limited traversals.

OK, I definitely was holding it wrong. It turns out that it helps to run
the custom version of Git when passing in the pack.writebitmapcomittable
option. (I regret there is no portable way to communicate a facepalm
image over plain text).

So that helped, but I did seem some other interesting bits.

Here's my cold-cache time for the commit you selected:

  $ export commit=2ab38c17aac10bf55ab3efde4c4db3893d8691d2
  $ hyperfine \
      -L table 0,1 \
      'GIT_READ_COMMIT_TABLE={table} git.compile rev-list --count --objects --use-bitmap-index $commit' \
      --prepare='sync; echo 3 | sudo tee /proc/sys/vm/drop_caches'

  Benchmark #1: GIT_READ_COMMIT_TABLE=0 git.compile rev-list --count --objects --use-bitmap-index $commit
    Time (mean ± σ):      1.420 s ±  0.162 s    [User: 196.1 ms, System: 293.7 ms]
    Range (min … max):    1.083 s …  1.540 s    10 runs
   
  Benchmark #2: GIT_READ_COMMIT_TABLE=1 git.compile rev-list --count --objects --use-bitmap-index $commit
    Time (mean ± σ):      1.319 s ±  0.256 s    [User: 171.1 ms, System: 237.1 ms]
    Range (min … max):    0.773 s …  1.588 s    10 runs
   
  Summary
    'GIT_READ_COMMIT_TABLE=1 git.compile rev-list --count --objects --use-bitmap-index $commit' ran
      1.08 ± 0.24 times faster than 'GIT_READ_COMMIT_TABLE=0 git.compile rev-list --count --objects --use-bitmap-index $commit'

So better, but well within the noise (and I had a few runs where it
actually performed worse). But you picked that commit because you knew
it was bitmapped, and it's not in my repo. If I switch to a commit that
is covered in my repo, then I get similar results to yours:

  $ export commit=9b1ea29bc0d7b94d420f96a0f4121403efc3dd85
  $ hyperfine \
        -L table 0,1 \
        'GIT_READ_COMMIT_TABLE={table} git.compile rev-list --count --objects --use-bitmap-index $commit' \
        --prepare='sync; echo 3 | sudo tee /proc/sys/vm/drop_caches'

  Benchmark #1: GIT_READ_COMMIT_TABLE=0 git.compile rev-list --count --objects --use-bitmap-index $commit
    Time (mean ± σ):     309.3 ms ±  61.0 ms    [User: 19.4 ms, System: 79.0 ms]
    Range (min … max):   154.4 ms … 386.7 ms    10 runs
   
  Benchmark #2: GIT_READ_COMMIT_TABLE=1 git.compile rev-list --count --objects --use-bitmap-index $commit
    Time (mean ± σ):      33.7 ms ±   2.5 ms    [User: 3.3 ms, System: 3.6 ms]
    Range (min … max):    31.5 ms …  46.5 ms    33 runs

  Summary
  'GIT_READ_COMMIT_TABLE=1 git.compile rev-list --count --objects --use-bitmap-index $commit' ran
    9.19 ± 1.94 times faster than 'GIT_READ_COMMIT_TABLE=0 git.compile rev-list --count --objects --use-bitmap-index $commit'

And the effect continues in the warm cache case, though the absolute
numbers are much tinier:

  Benchmark #1: GIT_READ_COMMIT_TABLE=0 git.compile rev-list --count --objects --use-bitmap-index $commit
    Time (mean ± σ):      12.0 ms ±   0.3 ms    [User: 4.6 ms, System: 7.4 ms]
    Range (min … max):    11.4 ms …  13.2 ms    219 runs

  Benchmark #2: GIT_READ_COMMIT_TABLE=1 git.compile rev-list --count --objects --use-bitmap-index $commit
    Time (mean ± σ):       3.0 ms ±   0.4 ms    [User: 2.3 ms, System: 0.8 ms]
    Range (min … max):     2.6 ms …   5.5 ms    851 runs

  Summary
    'GIT_READ_COMMIT_TABLE=1 git.compile rev-list --count --objects --use-bitmap-index $commit' ran
      3.95 ± 0.55 times faster than 'GIT_READ_COMMIT_TABLE=0 git.compile rev-list --count --objects --use-bitmap-index $commit'

That implies to me that yes, it really is saving time, especially in the
cold-cache case. But if you have to do any actual fill-in traversal, the
benefits get totally lost in the noise. _Especially_ in the cold-cache
case, because then we're faulting in the actual object data, etc.

By the way, one other thing I noticed is that having a fully-build
commit-graph also made a big difference (much bigger than this patch),
especially for the non-bitmapped commit. Which makes sense, since it is
saving us from loading commit objects from disk during fill-in
traversal.

So I dunno. There's absolutely savings for some cases, but I suspect in
practice it's not going to really be noticeable. Part of me says "well,
if it ever provides a benefit and there isn't a downside, why not?". So
just devil's advocating on downsides for a moment:

  - there's some extra complexity in the file format and code to read
    and write these (and still fall back to the old system when they're
    absent). I don't think it's a deal-breaker, as it's really not that
    complicated a feature.

  - we're using extra bytes on disk (and the associated cold block cache
    effects there). It's not very many bytes, though (I guess 20 for the
    hash, plus a few offset integers; if we wanted to really
    penny-pinch, we could probably store 32-bit pointers to the hashes
    in the associated .idx file, at the cost of an extra level of
    indirection while binary searching). But that is only for a few
    hundred commits that are bitmapped, not all of them. And it's
    balanced by not needing to allocate a similar in-memory lookup table
    in each command. So it's probably a net win.

> > I got an even more curious result when adding in "--not --all" (which
> > the connectivity check under discussion would do). There the improvement
> > from your patch should be even less, because we'd end up reading most of
> > the bitmaps anyway. But I got:
> 
> Interesting. Discussion about that series aside, I go from 28.6ms
> without reading the table to 35.1ms reading it, which is much better in
> absolute timings, but much worse relatively speaking.

I suspect that's mostly noise. With "--not --all" in a regular linux.git
repo, I don't find any statistical difference.

In a fake repo with one ref per commit, everything is even more lost in
the noise (because we spend like 2000ms loading up all the tip commits).
I suspect it's worse on a real repo with lots of refs (the "spiky
branches" thing I mentioned earlier in the thread), since there we'd
have to do even more fill-in traversal.

> I can't quite seem to make sense of the perf report on that command.
> Most of the time is spent faulting pages in, but most of the time
> measured in the "git" object is in ubc_check. I don't really know how to
> interpret that, but I'd be curious if you had any thoughts.

ubc_check() is basically "computing sha1" (we check the sha1 on every
object we call parse_object() for). I'd guess it's just time spent
loading the negative tips (commits if you don't have a commit graph,
plus tags to peel, though I guess we should be using the packed-refs
peel optimization here).

-Peff

  reply	other threads:[~2021-06-30  5:45 UTC|newest]

Thread overview: 64+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-06-28  5:33 [PATCH v2 0/3] Speed up connectivity checks via bitmaps Patrick Steinhardt
2021-06-28  5:33 ` [PATCH v2 1/3] p5400: add perf tests for git-receive-pack(1) Patrick Steinhardt
2021-06-28  7:49   ` Ævar Arnfjörð Bjarmason
2021-06-29  6:18     ` Patrick Steinhardt
2021-06-29 12:09       ` Ævar Arnfjörð Bjarmason
2021-06-28  5:33 ` [PATCH v2 2/3] receive-pack: skip connectivity checks on delete-only commands Patrick Steinhardt
2021-06-28  8:00   ` Ævar Arnfjörð Bjarmason
2021-06-28  8:06     ` Ævar Arnfjörð Bjarmason
2021-06-29  6:26     ` Patrick Steinhardt
2021-06-30  1:31   ` Jeff King
2021-06-30  1:35     ` Jeff King
2021-06-30 13:52     ` Patrick Steinhardt
2021-06-28  5:33 ` [PATCH v2 3/3] connected: implement connectivity check using bitmaps Patrick Steinhardt
2021-06-28 20:23   ` Taylor Blau
2021-06-29 22:44     ` Taylor Blau
2021-06-30  2:04       ` Jeff King
2021-06-30  3:07         ` Taylor Blau
2021-06-30  5:45           ` Jeff King [this message]
2021-07-02 17:44             ` Taylor Blau
2021-07-02 21:21               ` Jeff King
2021-06-30  1:51   ` Jeff King
2021-07-20 14:26     ` Patrick Steinhardt
2021-08-02  9:37 ` [PATCH v3 0/4] Speed up connectivity checks Patrick Steinhardt
2021-08-02  9:38   ` [PATCH v3 1/4] connected: do not sort input revisions Patrick Steinhardt
2021-08-02 12:49     ` Ævar Arnfjörð Bjarmason
2021-08-03  8:50       ` Patrick Steinhardt
2021-08-04 11:01         ` Ævar Arnfjörð Bjarmason
2021-08-02 19:00     ` Junio C Hamano
2021-08-03  8:55       ` Patrick Steinhardt
2021-08-03 21:47         ` Junio C Hamano
2021-08-02  9:38   ` [PATCH v3 2/4] revision: stop retrieving reference twice Patrick Steinhardt
2021-08-02 12:53     ` Ævar Arnfjörð Bjarmason
2021-08-02  9:38   ` [PATCH v3 3/4] revision: avoid loading object headers multiple times Patrick Steinhardt
2021-08-02 12:55     ` Ævar Arnfjörð Bjarmason
2021-08-05 10:12       ` Patrick Steinhardt
2021-08-02 19:40     ` Junio C Hamano
2021-08-03  9:07       ` Patrick Steinhardt
2021-08-06 14:17         ` Patrick Steinhardt
2021-08-02  9:38   ` [PATCH v3 4/4] revision: avoid hitting packfiles when commits are in commit-graph Patrick Steinhardt
2021-08-02 20:01     ` Junio C Hamano
2021-08-03  9:16       ` Patrick Steinhardt
2021-08-03 21:56         ` Junio C Hamano
2021-08-05 11:01           ` Patrick Steinhardt
2021-08-05 16:16             ` Junio C Hamano
2021-08-04 10:51         ` Ævar Arnfjörð Bjarmason
2021-08-05 11:25   ` [PATCH v4 0/6] Speed up connectivity checks Patrick Steinhardt
2021-08-05 11:25     ` [PATCH v4 1/6] revision: separate walk and unsorted flags Patrick Steinhardt
2021-08-05 18:47       ` Junio C Hamano
2021-08-05 11:25     ` [PATCH v4 2/6] connected: do not sort input revisions Patrick Steinhardt
2021-08-05 18:44       ` Junio C Hamano
2021-08-06  6:00         ` Patrick Steinhardt
2021-08-06 16:50           ` Junio C Hamano
2021-08-05 11:25     ` [PATCH v4 3/6] revision: stop retrieving reference twice Patrick Steinhardt
2021-08-05 11:25     ` [PATCH v4 4/6] revision: avoid loading object headers multiple times Patrick Steinhardt
2021-08-05 11:25     ` [PATCH v4 5/6] commit-graph: split out function to search commit position Patrick Steinhardt
2021-08-05 11:25     ` [PATCH v4 6/6] revision: avoid hitting packfiles when commits are in commit-graph Patrick Steinhardt
2021-08-09  8:00 ` [PATCH v5 0/5] Speed up connectivity checks Patrick Steinhardt
2021-08-09  8:02   ` Patrick Steinhardt
2021-08-09  8:11 ` Patrick Steinhardt
2021-08-09  8:11   ` [PATCH v5 1/5] revision: separate walk and unsorted flags Patrick Steinhardt
2021-08-09  8:11   ` [PATCH v5 2/5] connected: do not sort input revisions Patrick Steinhardt
2021-08-09  8:11   ` [PATCH v5 3/5] revision: stop retrieving reference twice Patrick Steinhardt
2021-08-09  8:11   ` [PATCH v5 4/5] commit-graph: split out function to search commit position Patrick Steinhardt
2021-08-09  8:12   ` [PATCH v5 5/5] revision: avoid hitting packfiles when commits are in commit-graph Patrick Steinhardt

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YNwE3wES3iv+Xynp@coredump.intra.peff.net \
    --to=peff@peff.net \
    --cc=avarab@gmail.com \
    --cc=chris.torek@gmail.com \
    --cc=felipe.contreras@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=me@ttaylorr.com \
    --cc=ps@pks.im \
    --cc=szeder.dev@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).