* `git index-pack --strict` is *very* slow during pushes to large repos
From: Craig de Stigter @ 2021-05-09 20:52 UTC (permalink / raw)
  To: git

Hey folks

(apologies if repost; my first post seemed to disappear entirely)

We're hosting a service with some fairly large repos (created by
Kart[1]), and I've been looking into some poor performance of
`git push` on our service.

Background: We host repositories with a specific layout. I'll try and avoid
most of the technical details but a brief description of the repo layout
might be helpful:

- At each revision we have 256 trees
  - each containing 256 trees (so 65536 trees at this level)
  - each subtree contains a number of objects (distributed evenly
    across the subtrees via a hash scheme)
- Some repos have up to 100 million blobs active in a given revision.
  In that case each of the 65536 subtrees would contain ~1500 blobs.
- Blobs are usually a few bytes to a few KB in size.
- For various reasons we have disabled deltas entirely.
- Most repos have a few hundred commits, and a typical commit might
  modify 100,000 features (blobs), again spread evenly across the
  65536 trees, thus modifying most of the trees also.
- Our largest repos are currently a few hundred GB on disk.
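
The figures in the list above can be sanity-checked with a little
arithmetic. The sketch below assumes blobs land in the 65536 leaf trees
uniformly at random (the post only says "evenly"); the function names are
made up for illustration and are not part of Kart or git.

```python
# Back-of-the-envelope check of the repo layout numbers.
# Assumption: each blob is assigned to a leaf tree uniformly at random.

TOP_LEVEL_TREES = 256
SUBTREES_PER_TREE = 256
LEAF_TREES = TOP_LEVEL_TREES * SUBTREES_PER_TREE  # 65536


def blobs_per_leaf_tree(total_blobs):
    """Average number of blobs in each leaf tree."""
    return total_blobs / LEAF_TREES


def expected_leaf_trees_touched(modified_blobs):
    """Expected number of leaf trees containing at least one modified blob.

    A given leaf tree is missed by one modification with probability
    (1 - 1/N), so it is missed by all k of them with probability (1 - 1/N)^k.
    """
    untouched = (1 - 1 / LEAF_TREES) ** modified_blobs
    return LEAF_TREES * (1 - untouched)


if __name__ == "__main__":
    print(f"blobs per leaf tree: {blobs_per_leaf_tree(100_000_000):.0f}")
    touched = expected_leaf_trees_touched(100_000)
    print(f"leaf trees touched by 100,000 edits: {touched:.0f} "
          f"({touched / LEAF_TREES:.0%} of {LEAF_TREES})")
```

Under that assumption a 100,000-blob commit rewrites a little over three
quarters of the leaf trees, which matches "most of the trees" above.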

We've come across a curious performance issue with `git index-pack` when
invoked by `receive-pack` during a push operation. We have
`transfer.fsckObjects=true` in the server config, so the index-pack
invocation looks like:

```
git --shallow-file shallow_filename index-pack \
   --stdin --keep='receive-pack 1234 on <servername>' \
   --show-resolving-progress --report-end-of-input --fix-thin \
   --strict
```

For our largest repos, when pushing ~100K blobs and associated trees, this
takes a *long* time - sometimes over 12 hours. The process uses enormous
amounts of disk IO (all reads; I haven't measured how much per process, but
the server was doing many terabytes of IO in total).

Here is one that "only" took about 40 minutes (2353 seconds according to
the trace), with a few tracing environment vars enabled:

```
$ cat craig.pack | /opt/sno/libexec/git-core/git --shallow-file myfilename \
    index-pack --stdin --keep='receive-pack 159567 on servername' \
    --show-resolving-progress --report-end-of-input --fix-thin --strict
07:48:20.781099 common-main.c:48                  version 2.29.2
07:48:20.781111 common-main.c:48             | d0 | main                     | version      |     |           |           |              | 2.29.2
07:48:20.781127 common-main.c:49                  start /opt/sno/libexec/git-core/git --shallow-file indexed.pack index-pack --stdin '--keep=receive-pack 159567 on cave-7dc7798cc9-qcvxd' --show-resolving-progress --report-end-of-input --fix-thin --strict
07:48:20.781133 common-main.c:49             | d0 | main                     | start        |     |  0.000264 |           |              | /opt/sno/libexec/git-core/git --shallow-file indexed.pack index-pack --stdin '--keep=receive-pack 159567 on cave-7dc7798cc9-qcvxd' --show-resolving-progress --report-end-of-input --fix-thin --strict
07:48:20.781296 git.c:444               trace: built-in: git index-pack --stdin '--keep=receive-pack 159567 on cave-7dc7798cc9-qcvxd' --show-resolving-progress --report-end-of-input --fix-thin --strict
07:48:20.781306 git.c:445                         cmd_name index-pack (index-pack)
07:48:20.781312 git.c:445                    | d0 | main                     | cmd_name     |     |           |           |              | index-pack (index-pack)
07:48:20.781530 midx.c:184                   | d0 | main                     | data         | r0  |  0.000670 |  0.000670 | midx         | load/num_packs:1
07:48:20.781542 midx.c:185                   | d0 | main                     | data         | r0  |  0.000683 |  0.000683 | midx         | load/num_objects:42658742
pack    5aa14bbb43187b7dfd5f996514854c3dcdc66d71
08:27:33.724306 git.c:700                         exit elapsed:2352.943441 code:0
08:27:33.724321 git.c:700                    | d0 | main                     | exit         |     | 2352.943441 |           |              | code:0
08:27:33.724336 trace2/tr2_tgt_normal.c:123       atexit elapsed:2352.943475 code:0
08:27:33.724341 trace2/tr2_tgt_perf.c:213    | d0 | main                     | atexit       |     | 2352.943475 |           |              | code:0
```
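
For anyone wanting to pull the run time out of a trace like this, here is
a minimal sketch. It assumes the normal-format `exit` record looks exactly
like the one in the trace above, and that the run does not cross midnight;
the helper names are hypothetical.

```python
import re
from datetime import datetime

# Three records copied from the trace above (normal-format lines only).
TRACE_EXCERPT = """\
07:48:20.781127 common-main.c:49                  start
08:27:33.724306 git.c:700                         exit elapsed:2352.943441 code:0
08:27:33.724336 trace2/tr2_tgt_normal.c:123       atexit elapsed:2352.943475 code:0
"""


def exit_elapsed_seconds(trace_text):
    """Return the elapsed seconds from the 'exit' record, or None if absent.

    \\s+ tolerates a record being wrapped across lines, and \\b keeps the
    pattern from matching the separate 'atexit' record.
    """
    match = re.search(r"\bexit\s+elapsed:(\d+(?:\.\d+)?)\s+code:\d+", trace_text)
    return float(match.group(1)) if match else None


def wall_clock_seconds(trace_text):
    """Seconds between the first and last timestamped lines (same day only)."""
    stamps = re.findall(r"^(\d{2}:\d{2}:\d{2}\.\d{6})\b", trace_text, re.MULTILINE)
    first = datetime.strptime(stamps[0], "%H:%M:%S.%f")
    last = datetime.strptime(stamps[-1], "%H:%M:%S.%f")
    return (last - first).total_seconds()


if __name__ == "__main__":
    elapsed = exit_elapsed_seconds(TRACE_EXCERPT)
    print(f"elapsed: {elapsed:.1f} s = {elapsed / 60:.1f} min")
    print(f"wall clock from timestamps: {wall_clock_seconds(TRACE_EXCERPT):.1f} s")
```

Both figures come out at roughly 39 minutes for this run.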

Removing the `--strict` from the invocation by disabling
`transfer.fsckObjects` solves the problem - the process completes in less
than a minute, and uses less than a GB of read IO.
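
The comparison above can be reproduced outside of a push with a small
timing harness like the sketch below. It reuses only the flags from the
invocation shown earlier (leaving out `--shallow-file`, which
`receive-pack` supplies). The function names and the "timing test" keep
message are hypothetical. Since `index-pack --stdin` writes the pack into
the repository it runs in, this should only be pointed at a throwaway copy.

```python
import subprocess
import time


def index_pack_argv(keep_message, strict):
    """Build the index-pack command line, with or without --strict."""
    argv = [
        "git", "index-pack", "--stdin",
        f"--keep={keep_message}",
        "--show-resolving-progress",
        "--report-end-of-input",
        "--fix-thin",
    ]
    if strict:
        argv.append("--strict")
    return argv


def time_index_pack(repo_dir, pack_path, strict):
    """Feed a saved pack to index-pack and return the wall-clock seconds.

    repo_dir must be a scratch copy of the repository: the pack is stored
    in its object directory and a .keep file is left next to it.
    """
    with open(pack_path, "rb") as pack:
        started = time.monotonic()
        subprocess.run(
            index_pack_argv("timing test", strict),
            cwd=repo_dir,
            stdin=pack,
            check=True,
        )
        return time.monotonic() - started
```

Calling `time_index_pack(scratch_repo, "craig.pack", strict=True)` and
then again with `strict=False` on a second scratch copy gives the two
numbers to compare.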

I can theorise why this operation is slightly expensive:

   - `--strict` causes `index-pack` to call `fsck_object()` on each object
   pushed
   - these large pushes that push 100K+ blobs actually touch almost every
   *tree* as well - so most/all of the 65K trees are pushed too.
   - calling `fsck_object` on a tree looks up all its children (blobs and
   trees) to ensure they're reachable [2]
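
To put a rough number on the theory above: if every entry of every pushed
tree triggers one object lookup, the work scales with the size of the
trees rather than with the number of new blobs. The toy model below is
only an illustration of that theory, not git's actual code path; the
store class and helper names are invented for the example.

```python
class CountingStore:
    """Hypothetical object store that counts existence checks."""

    def __init__(self, object_ids):
        self.object_ids = set(object_ids)
        self.lookups = 0

    def has_object(self, object_id):
        self.lookups += 1
        return object_id in self.object_ids


def check_pushed_trees(pushed_trees, store):
    """Look up every child of every pushed tree; return the missing ones."""
    missing = []
    for children in pushed_trees.values():
        for child in children:
            if not store.has_object(child):
                missing.append(child)
    return missing


def build_demo_push(leaf_trees, entries_per_tree):
    """Each pushed tree has one new blob; the other entries are unchanged."""
    pushed = {}
    for t in range(leaf_trees):
        children = [f"old-blob-{t}-{i}" for i in range(entries_per_tree)]
        children[0] = f"new-blob-{t}"
        pushed[f"tree-{t}"] = children
    return pushed


if __name__ == "__main__":
    # Scaled-down demo: 16 trees of 100 entries, 16 new blobs in total.
    pushed = build_demo_push(leaf_trees=16, entries_per_tree=100)
    store = CountingStore(c for cs in pushed.values() for c in cs)
    check_pushed_trees(pushed, store)
    print(f"{store.lookups} lookups for 16 new blobs")

    # Same arithmetic at the scale described in this post.
    print(f"full scale: about {65536 * 1500:,} lookups for ~100K new blobs")
```

At full scale that is on the order of 100 million lookups for a push of
~100K blobs, so even a small per-lookup cost would add up.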

What I can't understand is why that makes it take quite *so* much longer
and use so much IO. I think it *should* probably not be checking much about
objects that are already in the repo, other than that they exist. We have
multi-pack indexes enabled, so my assumption is that a "does object xyz
exist?" check should be very inexpensive. What could I be missing here?

As a start of a possible theory, we found when using libgit2 that our
peculiar repo structure with so many trees requires that we expand the size
of the tree cache[3] - otherwise repeated operations on blobs would cause
tree cache misses every time their path was traversed. I wonder if there is
a similar tree cache structure in git itself, and if so could it be
relevant here?
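
The cache effect described above is easy to demonstrate in isolation.
The sketch below uses a plain LRU cache and a cyclic access pattern as
stand-ins. Both are assumptions made for illustration and do not model
libgit2's or git's real caching policy.

```python
from collections import OrderedDict


class LRUCache:
    """Minimal least-recently-used cache keyed by tree id."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def access(self, key):
        """Return True on a hit; on a miss, load the key and evict the oldest."""
        if key in self.entries:
            self.entries.move_to_end(key)
            return True
        self.entries[key] = True
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)
        return False


def miss_rate(cache_capacity, working_set, passes):
    """Visit trees 0..working_set-1 in order, `passes` times over."""
    cache = LRUCache(cache_capacity)
    misses = 0
    for _ in range(passes):
        for tree_id in range(working_set):
            if not cache.access(tree_id):
                misses += 1
    return misses / (working_set * passes)


if __name__ == "__main__":
    # Cache smaller than the working set: every access misses.
    print(miss_rate(cache_capacity=1024, working_set=4096, passes=3))
    # Cache as large as the working set: only the first pass misses.
    print(miss_rate(cache_capacity=4096, working_set=4096, passes=3))
```

With the cache even slightly smaller than the set of trees being cycled
through, an LRU policy evicts each tree just before it is needed again,
which is the "miss every time" behaviour described above.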

Many thanks and sorry about the long-winded post :)

Craig de Stigter
Platform Engineer
Koordinates


references:
[1]: https://kartproject.org
[2]: fsck_walk_tree:
https://github.com/git/git/blob/a0dda6023ed82b927fa205c474654699a5b07a82/fsck.c#L300
[3]: GIT_OPT_SET_CACHE_OBJECT_LIMIT:
https://github.com/libgit2/libgit2/blob/508361401fbb5d87118045eaeae3356a729131aa/include/git2/common.h#L266-L272
