git.vger.kernel.org archive mirror
* Strategy to deal with slow cloners
@ 2021-04-19 12:46 Konstantin Ryabitsev
  2021-04-19 18:08 ` Eric Wong
                   ` (3 more replies)
  0 siblings, 4 replies; 6+ messages in thread
From: Konstantin Ryabitsev @ 2021-04-19 12:46 UTC
  To: git

Hello:

I try to keep repositories routinely repacked and optimized for clones, in
hopes that most operations needing lots of objects would be sending packs
straight from disk. However, every now and again a client from a slow
connection requests a large clone and then takes half a day downloading it,
resulting in gigabytes of RAM being occupied by a temporary pack.

Are there any strategies to reduce RAM usage in such cases, other than
vm.swappiness (which I'm not sure would work, since it's not a sleeping
process)? Is there a way to write large temporary packs somewhere to disk
before sendfile'ing them?

-K


* Re: Strategy to deal with slow cloners
  2021-04-19 12:46 Strategy to deal with slow cloners Konstantin Ryabitsev
@ 2021-04-19 18:08 ` Eric Wong
  2021-04-21 20:08   ` Eric Wong
  2021-04-20 14:52 ` Thomas Braun
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 6+ messages in thread
From: Eric Wong @ 2021-04-19 18:08 UTC
  To: git

Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> Hello:
> 
> I try to keep repositories routinely repacked and optimized for clones, in
> hopes that most operations needing lots of objects would be sending packs
> straight from disk. However, every now and again a client from a slow
> connection requests a large clone and then takes half a day downloading it,
> resulting in gigabytes of RAM being occupied by a temporary pack.

Yeah, I'm familiar with the problem.

> Are there any strategies to reduce RAM usage in such cases, other than
> vm.swappiness (which I'm not sure would work, since it's not a sleeping
> process)? Is there a way to write large temporary packs somewhere to disk
> before sendfile'ing them?

public-inbox-httpd actually switched buffering strategies in
2019 to favor hitting ENOSPC instead of ENOMEM :)

  https://public-inbox.org/meta/20190629195951.32160-11-e@80x24.org/

It doesn't support sendfile, currently (I didn't want separate
HTTPS vs HTTP code paths), but that's probably not too big of a
deal, especially with slow clients.

It's capable of serving non-public-inbox coderepos (and running
cgit).  Instead of configuring every [coderepo "..."] manually,
publicinbox.cgitrc can be set in ~/.public-inbox/config to
mass-configure [coderepo] sections.  It's only lightly-tested
for my setup atm, though.

Mapping publicinbox.<name>.coderepo to [coderepo "..."]
entries for solver (blob reconstruction) isn't required;
it's a bit of a pain at a large scale and I haven't figured
out how to make it easier.


* Re: Strategy to deal with slow cloners
  2021-04-19 12:46 Strategy to deal with slow cloners Konstantin Ryabitsev
  2021-04-19 18:08 ` Eric Wong
@ 2021-04-20 14:52 ` Thomas Braun
  2021-04-22  9:16 ` Ævar Arnfjörð Bjarmason
  2021-04-23 10:02 ` Jeff King
  3 siblings, 0 replies; 6+ messages in thread
From: Thomas Braun @ 2021-04-20 14:52 UTC
  To: git; +Cc: Konstantin Ryabitsev

On 19.04.2021 14:46, Konstantin Ryabitsev wrote:

> I try to keep repositories routinely repacked and optimized for clones, in
> hopes that most operations needing lots of objects would be sending packs
> straight from disk. However, every now and again a client from a slow
> connection requests a large clone and then takes half a day downloading it,
> resulting in gigabytes of RAM being occupied by a temporary pack.
> 
> Are there any strategies to reduce RAM usage in such cases, other than
> vm.swappiness (which I'm not sure would work, since it's not a sleeping
> process)? Is there a way to write large temporary packs somewhere to disk
> before sendfile'ing them?

There is the packfile-uris feature, which allows protocol v2 servers to
advertise static packfiles via http/https. But clients must explicitly
enable it via fetch.uriprotocols, so this only works for newish clients
that ask for it. See Documentation/technical/packfile-uri.txt.

From my limited understanding, the server can send at most one packfile
per clone/fetch.

What is the advertised git clone command on the website? Maybe something
like git clone --depth=$num would help reduce the load? Usually not
everyone needs the whole history.
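
A minimal sketch of both client-side options mentioned above, assuming a
server that speaks protocol v2. The function names are made up for
illustration. Setting fetch.uriprotocols only has an effect when the server
actually advertises packfile URIs; otherwise the clone proceeds as usual.

```shell
# Clone with packfile URIs enabled for this one command only.
# All arguments are passed through to "git clone".
clone_with_packfile_uris() {
    git -c protocol.version=2 -c fetch.uriprotocols=https clone "$@"
}

# Shallow clone: fetch only the most recent $1 commits of history.
# Remaining arguments are passed through to "git clone".
shallow_clone() {
    depth=$1
    shift
    git clone --depth="$depth" "$@"
}
```

For example, "shallow_clone 1 <url>" would fetch a single commit of history
instead of the whole repository.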



* Re: Strategy to deal with slow cloners
  2021-04-19 18:08 ` Eric Wong
@ 2021-04-21 20:08   ` Eric Wong
  0 siblings, 0 replies; 6+ messages in thread
From: Eric Wong @ 2021-04-21 20:08 UTC
  To: git

Eric Wong <e@80x24.org> wrote:
> Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> > Hello:
> > 
> > I try to keep repositories routinely repacked and optimized for clones, in
> > hopes that most operations needing lots of objects would be sending packs
> > straight from disk. However, every now and again a client from a slow
> > connection requests a large clone and then takes half a day downloading it,
> > resulting in gigabytes of RAM being occupied by a temporary pack.
> 
> Yeah, I'm familiar with the problem.

Also, AFAIK nginx has "proxy_buffering on" by default.  However,
I seem to recall that prevents clients from seeing a single byte
until the pack is completely generated.  It's been many years
since I've used nginx myself, so my knowledge about it could be
out-of-date.


* Re: Strategy to deal with slow cloners
  2021-04-19 12:46 Strategy to deal with slow cloners Konstantin Ryabitsev
  2021-04-19 18:08 ` Eric Wong
  2021-04-20 14:52 ` Thomas Braun
@ 2021-04-22  9:16 ` Ævar Arnfjörð Bjarmason
  2021-04-23 10:02 ` Jeff King
  3 siblings, 0 replies; 6+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-04-22  9:16 UTC
  To: Konstantin Ryabitsev; +Cc: git


On Mon, Apr 19 2021, Konstantin Ryabitsev wrote:

> Hello:
>
> I try to keep repositories routinely repacked and optimized for clones, in
> hopes that most operations needing lots of objects would be sending packs
> straight from disk. However, every now and again a client from a slow
> connection requests a large clone and then takes half a day downloading it,
> resulting in gigabytes of RAM being occupied by a temporary pack.
>
> Are there any strategies to reduce RAM usage in such cases, other than
> vm.swappiness (which I'm not sure would work, since it's not a sleeping
> process)? Is there a way to write large temporary packs somewhere to disk
> before sendfile'ing them?

Aside from any Git-specific solutions, perhaps the right kernel settings
+ a cron script re-nicing such processes that have been active for more
than X amount of time will help?

I'm not familiar with the guts of Linux's swapping algorithm, but some
results online seem to suggest that it takes the nice level into account
when deciding what to swap out, i.e. with the right level it might give
preference to swapping out this mostly idle process.
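
A rough sketch of such a cron job, under several assumptions: a Linux
procps "ps" that supports the "etimes" column, GNU xargs (for -r), and "git"
as the command name of the processes in question. The function names and the
one-hour default are made up. Whether the nice level actually influences what
gets swapped out is not confirmed anywhere in this thread.

```shell
# Read "PID ELAPSED_SECONDS COMMAND" lines on stdin and print the PIDs of
# processes named $1 that have been running for more than $2 seconds.
long_running_pids() {
    awk -v name="$1" -v limit="$2" '$3 == name && $2 > limit { print $1 }'
}

# Hypothetical cron entry point: give the lowest priority to git processes
# older than $1 seconds (default: one hour).
renice_slow_git() {
    ps -eo pid=,etimes=,comm= |
        long_running_pids git "${1:-3600}" |
        xargs -r renice -n 19 -p
}
```

Keeping the filter separate from the "ps" call makes it possible to check the
selection logic on canned input before letting it touch real processes.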


* Re: Strategy to deal with slow cloners
  2021-04-19 12:46 Strategy to deal with slow cloners Konstantin Ryabitsev
                   ` (2 preceding siblings ...)
  2021-04-22  9:16 ` Ævar Arnfjörð Bjarmason
@ 2021-04-23 10:02 ` Jeff King
  3 siblings, 0 replies; 6+ messages in thread
From: Jeff King @ 2021-04-23 10:02 UTC
  To: Konstantin Ryabitsev; +Cc: git

On Mon, Apr 19, 2021 at 08:46:23AM -0400, Konstantin Ryabitsev wrote:

> I try to keep repositories routinely repacked and optimized for clones, in
> hopes that most operations needing lots of objects would be sending packs
> straight from disk. However, every now and again a client from a slow
> connection requests a large clone and then takes half a day downloading it,
> resulting in gigabytes of RAM being occupied by a temporary pack.
> 
> Are there any strategies to reduce RAM usage in such cases, other than
> vm.swappiness (which I'm not sure would work, since it's not a sleeping
> process)?


Do you know where the RAM is going? I.e., heap or mmap'd files in block
cache? Do you have recent reachability bitmaps built?

Traditionally, most of the heap usage in pack-objects went to:

  - the set of object structs used for traversal; likewise, internal
    caches like the delta-base cache that get filled during the
    traversal

  - the big book-keeping array of all of the objects we are planning to
    send (and all their metadata)

  - the reverse index we load in memory to find object offsets and
    sizes within the packfile

But with bitmaps, we can skip most of the traversal entirely. And
there's a "pack reuse" mechanism that tries to avoid even adding objects
to the book-keeping array when we are just sending the first chunk of
the pack verbatim anyway.

E.g., on a clone of torvalds/linux, running:

  git for-each-ref --format='%(objectname)' refs/heads/ refs/tags/ |
  valgrind --tool=massif git pack-objects --revs --delta-base-offset --stdout |
  wc -c

hits a peak heap of 1.9GB without bitmaps enabled but only 326MB with.

On top of that, if you have Git v2.31, try enabling pack.writeReverseIndex
and repacking. That drops the heap to just 23MB! (though note there's
some cheating here; we're mmap-ing 31MB of .rev file plus 47MB of
.bitmap file).
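
As a sketch, enabling the on-disk reverse index for one repository could look
like the following. The function name is made up, and the .rev files are only
written by Git v2.31 or later; older versions ignore the setting.

```shell
# Turn on on-disk reverse indexes for the repository at $1, then repack
# everything into a single pack so the new setting takes effect.
enable_ondisk_revindex() {
    git -C "$1" config pack.writeReverseIndex true
    git -C "$1" repack -a -d -q
}
```

Afterwards, a .rev file next to each .pack and .idx in objects/pack/ would
indicate that the reverse index was written.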

From previous conversations, I expect you're already using bitmaps, but
you might double-check that things are kicking in as you'd expect (you
can get a rough read on heap of running processes by subtracting shared
memory from rss). And probably you aren't using on-disk revindexes yet,
because they're not enabled by default.
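
The "rss minus shared" estimate can be sketched on Linux by reading
/proc/PID/statm, whose second and third fields are resident and shared pages.
The function name is made up, and the result is only a rough figure (roughly
the anonymous part of the resident set), not an exact heap size.

```shell
# Print (resident - shared) in KiB for a file in /proc/PID/statm format.
# statm fields: size resident shared text lib data dt, all in pages.
approx_heap_kib() {
    page_kib=$(( $(getconf PAGESIZE) / 1024 ))
    awk -v p="$page_kib" '{ print ($2 - $3) * p }' "$1"
}
```

For a running pack-objects process with a known PID, this would be called as
"approx_heap_kib /proc/$pid/statm".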

If your problem is block cache (i.e., it's the total rss that's the
problem, not just the heap parts), that's harder. If you have a lot of
related repositories (say, forks of the kernel), your best bet is to use
alternates to share the storage. That opens up a whole other can of
complexity worms that I won't get into here.

Getting back to your other question:

> Is there a way to write large temporary packs somewhere to disk
> before sendfile'ing them?

The uploadpack.packObjectsHook config would let you wrap pack-objects
with a script that writes to a temporary file, and then just uses
something simple like "cat" to feed it back to upload-pack. You can't
use sendfile(), because there's some protocol framing that happens in
upload-pack, but its memory use is relatively low.
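
A minimal sketch of such a wrapper, assuming the hook is invoked with the
pack-objects command line as its arguments and with the same stdin and stdout
that pack-objects would have had. The function name and temporary file prefix
are made up, and cleanup on signals is left out for brevity.

```shell
# Run the given command (normally "git pack-objects ...") to completion,
# spooling its output to a temporary file, then stream that file to stdout.
spool_pack_objects() {
    tmp=$(mktemp "${TMPDIR:-/tmp}/pack-spool.XXXXXX") || return 1
    if "$@" >"$tmp"; then
        # pack-objects has exited by now; only cat stays alive while the
        # slow client drains the data.
        cat "$tmp"
        status=$?
    else
        status=$?
    fi
    rm -f "$tmp"
    return "$status"
}
```

Saved as an executable script that ends with a call to
spool_pack_objects "$@", this could be wired up by pointing
uploadpack.packObjectsHook at the script in the system or global config; as
far as I know the setting is ignored in repository-level config. One
trade-off is that the client receives no pack data until the whole pack has
been written to disk.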

If block cache is your problem (and not heap), this _could_ make things
slightly worse, as now you're writing the same pack data out to an extra
file which isn't shared by multiple processes. So if you have multiple
clients hitting the same repository, you may increase your working set
size. The OS may be able to handle it better, though (e.g., the linear
read through the file by cat makes it obvious that it can drop pages
from earlier parts of the file under memory pressure).

Your wrapper of course can get more clever about such things, too. E.g.,
you can coalesce identical requests to use the same cached copy,
skipping even the extra call to pack-objects in the first place. We do
something like that at GitHub (unfortunately not with an open tool that
I can share at this point).
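
To make the idea concrete, here is a toy sketch of request caching. It is
not the GitHub tool, whose design is not described here. It assumes that
identical arguments plus identical stdin in the same repository directory can
be served the same pack, that sha256sum is available, and that cache expiry
is handled elsewhere. The function name, the PACK_CACHE_DIR variable, and the
default path are made up.

```shell
# Serve the output of "$@" from a cache keyed on the current directory, the
# arguments, and the request read from stdin.
cached_pack_objects() {
    dir=${PACK_CACHE_DIR:-/var/cache/git-pack-cache}   # hypothetical path
    mkdir -p "$dir" || return 1
    req=$(mktemp) || return 1
    cat >"$req"    # the request upload-pack feeds to pack-objects
    key=$( { pwd; printf '%s\n' "$@"; cat "$req"; } | sha256sum | cut -d' ' -f1 )
    pack="$dir/$key.pack"
    if [ ! -f "$pack" ]; then
        part=$(mktemp "$dir/tmp.XXXXXX") || { rm -f "$req"; return 1; }
        if "$@" <"$req" >"$part"; then
            mv "$part" "$pack"    # publish only complete packs
        else
            rm -f "$part" "$req"
            return 1
        fi
    fi
    rm -f "$req"
    cat "$pack"
}
```

Two identical requests that arrive at the same moment would still both run
pack-objects here; real coalescing would need some form of locking on the
cache key.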

-Peff
