* Race condition between repack and loose-objects maintenance task
@ 2022-06-29 16:55 Gregory Szorc
  2022-06-29 17:03 ` Taylor Blau
  0 siblings, 1 reply; 16+ messages in thread
From: Gregory Szorc @ 2022-06-29 16:55 UTC (permalink / raw)
  To: git, stolee

`git repack -A` (as invoked by `git gc`, for example) loosens unreachable
objects from packfiles.

`git maintenance`'s `loose-objects` task consolidates loose objects
into a packfile.

As neither process takes a lock or is otherwise aware of the existence
of the other, there is a race condition:

1. `git repack -A` creates loose objects
2. `git maintenance`'s `loose-objects` task deletes those loose objects
3. `git repack -A` fails to find the loose objects it just created and
aborts with `fatal: unable to add recent objects`
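
For illustration, the two operations that race look roughly like this (a
minimal sketch; in our environment both are wrapped in other tooling):

    # long-running housekeeping; internally runs `git repack -A ...`,
    # which loosens unreachable objects mid-run
    git gc

    # triggered independently while the above is still running
    git maintenance run --task=loose-objects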

I see this failure regularly in a custom environment where
high-mutation-volume repositories invoke `git maintenance run` frequently
(effectively in reaction to repo mutations). Our Git housekeeping
strategy frequently invokes `git maintenance` with all but the `gc` task.
Roughly once a day, we invoke `git gc` to purge accumulated garbage. This
`git gc` frequently fails due to the aforementioned race condition: the
long wall time of `git gc` practically guarantees that a `git
maintenance` run will be triggered while it is in progress, since the
repo is mutated so frequently.

In my scenario, I believe we have a workaround by omitting
`--task=loose-objects` from the frequent `git maintenance` invocation
if there exists a `gc.pid` file. However, this is only a partial
solution, as `git repack` and the `loose-objects` task aren't aware of
each other. (I suppose running the `gc` maintenance task and relying
on the maintenance lock file could also work. However, we're invoking
`git gc` explicitly since `git maintenance` doesn't allow passing
custom arguments through to `git gc`.)
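
A rough sketch of that workaround (the exact task list here is an
assumption for illustration; we actually run everything except `gc`):

    # frequent housekeeping wrapper
    tasks="--task=commit-graph --task=prefetch"
    tasks="$tasks --task=incremental-repack --task=pack-refs"
    # skip loose-objects while a `git gc` appears to be in flight
    if [ ! -e "$(git rev-parse --git-dir)/gc.pid" ]; then
        tasks="$tasks --task=loose-objects"
    fi
    git maintenance run $tasks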

Gregory

* Re: Race condition between repack and loose-objects maintenance task
  2022-06-29 16:55 Race condition between repack and loose-objects maintenance task Gregory Szorc
@ 2022-06-29 17:03 ` Taylor Blau
  2022-06-29 17:10   ` Taylor Blau
  2022-06-29 17:16   ` Gregory Szorc
  0 siblings, 2 replies; 16+ messages in thread
From: Taylor Blau @ 2022-06-29 17:03 UTC (permalink / raw)
  To: Gregory Szorc; +Cc: git, stolee

Hi Greg,

On Wed, Jun 29, 2022 at 09:55:54AM -0700, Gregory Szorc wrote:
> 1. `git repack -A` creates loose objects
> 2. `git maintenance`'s `loose-objects` task deletes those loose objects
> 3. `git repack -A` fails to find the loose objects it just created and
> aborts with `fatal: unable to add recent objects`

This is a somewhat well-known race that occurs when one Git process
decides unreachable objects are safe to be deleted, but an incoming push
or reference update makes those to-be-deleted objects reachable before
they are actually removed, leaving the repository in a corrupt state.

I'm surprised that the loose-objects maintenance task deletes those
objects, though, since it just runs `prune-packed` or (the equivalent
of) `repack -d`, neither of which will actually delete objects from the
repository.

I see that Stolee is already on the CC list, so perhaps he could chime
in on the above.

In either case, my recommendation would be to keep those unreachable
objects which haven't yet aged out of the repository around for longer,
which will decrease the likelihood of seeing the race. If your
repository has a lot of unreachable objects (such that storing each
one of them individually as loose is impractical), I would recommend
using cruft packs (e.g., by running either `git repack --cruft -d` or
`git gc --cruft`) to collect those unreachable objects together into a
single pack.

See Documentation/technical/cruft-packs.txt for more information about
cruft packs.
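
Concretely, that would be something like (assuming Git 2.37 or newer):

    # one-off: repack, collecting unreachable objects into a cruft pack
    git repack --cruft -d

    # or configure `git gc` to write cruft packs going forward
    git config gc.cruftPacks true
    git gc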

Thanks,
Taylor

* Re: Race condition between repack and loose-objects maintenance task
  2022-06-29 17:03 ` Taylor Blau
@ 2022-06-29 17:10   ` Taylor Blau
  2022-06-29 17:16   ` Gregory Szorc
  1 sibling, 0 replies; 16+ messages in thread
From: Taylor Blau @ 2022-06-29 17:10 UTC (permalink / raw)
  To: stolee; +Cc: git, Gregory Szorc

On Wed, Jun 29, 2022 at 01:03:54PM -0400, Taylor Blau wrote:
> I see that Stolee is already on the CC list, so perhaps he could chime
> in on the above.

I haven't looked at the maintenance code too closely as of yet, but I
have a plausible explanation for why maintenance is removing loose
*unreachable* objects.

Ordinarily the loose-objects task alone will not remove any loose
objects which don't already appear in a pack. That's because we first
try to run prune-packed, and then the equivalent of `git repack -d` to
consolidate loose objects together into a single pack.
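
In other words, the task is roughly equivalent to the following (a
simplification, not the exact code path):

    # drop loose objects that already exist in some pack
    git prune-packed --quiet

    # then consolidate the remaining loose objects into a pack and delete
    # the now-redundant loose copies
    git repack -d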

But, we only do that for the first 50,000 loose objects. Anything else
gets left around and then is likely removed by the `git gc` task which
is run separately, and after the loose-objects task.

Using cruft packs won't change the race fundamentally, but storing the
unreachable objects in a cruft pack means that you can keep more objects
around, stored together in a single pack. In other words, it becomes
safe to increase the grace period beyond what you would ordinarily be
able to do with the classic pruning mechanism.

Thanks,
Taylor

* Re: Race condition between repack and loose-objects maintenance task
  2022-06-29 17:03 ` Taylor Blau
  2022-06-29 17:10   ` Taylor Blau
@ 2022-06-29 17:16   ` Gregory Szorc
  2022-06-29 17:21     ` Taylor Blau
  1 sibling, 1 reply; 16+ messages in thread
From: Gregory Szorc @ 2022-06-29 17:16 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, stolee

On Wed, Jun 29, 2022 at 10:03 AM Taylor Blau <me@ttaylorr.com> wrote:
>
> Hi Greg,
>
> On Wed, Jun 29, 2022 at 09:55:54AM -0700, Gregory Szorc wrote:
> > 1. `git repack -A` creates loose objects
> > 2. `git maintenance`'s `loose-objects` task deletes those loose objects
> > 3. `git repack -A` fails to find the loose objects it just created and
> > aborts with `fatal: unable to add recent objects`
>
> This is a somewhat well-known race that occurs when one Git process
> decides unreachable objects are safe to be deleted, but an incoming push
> or reference update makes those to-be-deleted objects reachable before
> they are actually removed, leaving the repository in a corrupt state.
>
> I'm surprised that the loose-objects maintenance task deletes those
> objects, though, since it just runs `prune-packed` or (the equivalent
> of) `repack -d`, neither of which will actually delete objects from the
> repository.
>
> I see that Stolee is already on the CC list, so perhaps he could chime
> in on the above.
>
> In either case, my recommendation would be to keep those unreachable
> objects which haven't yet aged out of the repository around for longer,
> which will decrease the likelihood of seeing the race.

We had to lower gc.pruneExpire from its default of 2 weeks because
otherwise this would result in the creation of thousands of loose
objects. This is normally acceptable. However, on NFS, the churn from
lots of file creation and deletion resulted in unacceptable performance
degradation, so we lowered gc.pruneExpire to minimize the damage.
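
Something like the following, though the exact value here is
illustrative rather than our production setting:

    git config gc.pruneExpire "30 minutes ago"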

> If your
> repository has a lot of unreachable objects (such that storing each
> one of them individually as loose is impractical), I would recommend
> using cruft packs (e.g., by running either `git repack --cruft -d` or
> `git gc --cruft`) to collect those unreachable objects together into a
> single pack.
>
> See Documentation/technical/cruft-packs.txt for more information about
> cruft packs.

Yes, I saw this new feature in the Git release this week and am hoping
to roll it out to better mitigate this general problem! With cruft
packs, I anticipate being able to restore a longer gc.pruneExpire
value as they should mitigate the NFS performance badness we're
working around. Thank you for confirming it should help with the
problem.

* Re: Race condition between repack and loose-objects maintenance task
  2022-06-29 17:16   ` Gregory Szorc
@ 2022-06-29 17:21     ` Taylor Blau
  2022-06-30  0:44       ` Gregory Szorc
  0 siblings, 1 reply; 16+ messages in thread
From: Taylor Blau @ 2022-06-29 17:21 UTC (permalink / raw)
  To: Gregory Szorc; +Cc: git, stolee

On Wed, Jun 29, 2022 at 10:16:38AM -0700, Gregory Szorc wrote:
> > In either case, my recommendation would be to keep those unreachable
> > objects which haven't yet aged out of the repository around for longer,
> > which will decrease the likelihood of seeing the race.
>
> We had to lower gc.pruneExpire from its default of 2 weeks because
> otherwise this would result in the creation of thousands of loose
> objects. This is normally acceptable. However, on NFS, the churn from
> lots of file creation and deletion resulted in unacceptable performance
> degradation, so we lowered gc.pruneExpire to minimize the damage.

Makes sense. Having a shorter gc.pruneExpire makes the race much more
likely to occur in practice, so I'm glad that we have a plausible
confirmation that that's what's going on in your repository.

> > See Documentation/technical/cruft-packs.txt for more information about
> > cruft packs.
>
> Yes, I saw this new feature in the Git release this week and am hoping
> to roll it out to better mitigate this general problem! With cruft
> packs, I anticipate being able to restore a longer gc.pruneExpire
> value as they should mitigate the NFS performance badness we're
> working around. Thank you for confirming it should help with the
> problem.

That should definitely help. Let me know if you have any questions, and
thanks for trying it out!

Thanks,
Taylor

* Re: Race condition between repack and loose-objects maintenance task
  2022-06-29 17:21     ` Taylor Blau
@ 2022-06-30  0:44       ` Gregory Szorc
  2022-06-30  3:19         ` Taylor Blau
  0 siblings, 1 reply; 16+ messages in thread
From: Gregory Szorc @ 2022-06-30  0:44 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, stolee

On Wed, Jun 29, 2022 at 10:21 AM Taylor Blau <me@ttaylorr.com> wrote:
>
> On Wed, Jun 29, 2022 at 10:16:38AM -0700, Gregory Szorc wrote:
> > > In either case, my recommendation would be to keep those unreachable
> > > objects which haven't yet aged out of the repository around for longer,
> > > which will decrease the likelihood of seeing the race.
> >
> > We had to lower gc.pruneExpire from its default of 2 weeks because
> > otherwise this would result in the creation of thousands of loose
> > objects. This is normally acceptable. However, on NFS, the churn from
> > lots of file creation and deletion resulted in unacceptable performance
> > degradation, so we lowered gc.pruneExpire to minimize the damage.
>
> Makes sense. Having a shorter gc.pruneExpire makes the race much more
> likely to occur in practice, so I'm glad that we have a plausible
> confirmation that that's what's going on in your repository.
>
> > > See Documentation/technical/cruft-packs.txt for more information about
> > > cruft packs.
> >
> > Yes, I saw this new feature in the Git release this week and am hoping
> > to roll it out to better mitigate this general problem! With cruft
> > packs, I anticipate being able to restore a longer gc.pruneExpire
> > value as they should mitigate the NFS performance badness we're
> > working around. Thank you for confirming it should help with the
> > problem.
>
> That should definitely help. Let me know if you have any questions, and
> thanks for trying it out!

Revising my initial post, not running `loose-objects` is insufficient:
we also need to prevent `incremental-repack` from running while a `git
gc` / `git repack` is in progress. If an `incremental-repack` runs
concurrently with `git repack`, `repack` can error after bitmap
generation but before the temp packfile is renamed with "could not
find 'pack-*.pack'." I suspect this has to do with
`incremental-repack` deleting packs that were initially in place when
`git repack` started running. But I haven't looked into where exactly
repack is failing.

I deployed Git 2.37 and enabled cruft packs via `gc.cruftPacks=true`
in the global gitconfig. I simultaneously increased `gc.pruneExpire`
to 1 day (as we're no longer concerned about each unreferenced object
turning into a single file). And I changed the frequently-invoked `git
maintenance` code to not run `loose-objects` or `incremental-repack`
if a gc.pid file is present (this is still racy but should hopefully
work enough of the time). This appeared to work: `git gc` created a
cruft pack and didn't complain about anything disappearing out from
under it, despite `git maintenance` running `commit-graph` and
`pack-refs` tasks during the long-running `git gc` / `git repack`.
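
In config terms, that change amounted to roughly the following (the
gc.pid check lives in our own wrapper tooling and is only described
above, not shown here):

    # global configuration, deployed with Git 2.37
    git config --global gc.cruftPacks true
    git config --global gc.pruneExpire "1 day ago"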

However, I think there is yet another bug at play: running
`incremental-repack` appears to be able to repack the cruft packfile.
In doing so, it deletes its .mtimes file and the metadata inside. This
behavior seems wrong to me: I was expecting the cruft packfile to
linger until the next `git gc` and/or for its mtimes metadata to be
preserved across future repacks. Otherwise, what is the point in
retaining granular mtime metadata? Is this the expected behavior? Or
do cruft packfiles just not work as intended alongside the
`incremental-repack` maintenance task / multi-pack index in Git 2.37?

* Re: Race condition between repack and loose-objects maintenance task
  2022-06-30  0:44       ` Gregory Szorc
@ 2022-06-30  3:19         ` Taylor Blau
  2022-06-30  3:23           ` Taylor Blau
  0 siblings, 1 reply; 16+ messages in thread
From: Taylor Blau @ 2022-06-30  3:19 UTC (permalink / raw)
  To: Gregory Szorc; +Cc: Taylor Blau, git, stolee

On Wed, Jun 29, 2022 at 05:44:25PM -0700, Gregory Szorc wrote:
> Revising my initial post, not running `loose-objects` is insufficient:
> we also need to prevent `incremental-repack` from running while a `git
> gc` / `git repack` is in progress. If an `incremental-repack` runs
> concurrently with `git repack`, `repack` can error after bitmap
> generation but before the temp packfile is renamed with "could not
> find 'pack-*.pack'." I suspect this has to do with
> `incremental-repack` deleting packs that were initially in place when
> `git repack` started running. But I haven't looked into where exactly
> repack is failing.

Yeah, you would need to prevent other writers from removing packs while
writing a cruft pack.

I think the canonical way to do this would be to let `git
maintenance` use its own locking to ensure that it wasn't trying to
remove packs while simultaneously generating a cruft pack.

But assuming that the cruft pack generation is running independently and
concurrently with the maintenance incremental-repack task (which will
delete packs in the background), then there is definitely a TOCTOU race
there.

The race is that `repack` will signal to `pack-objects` which packs will
stay and which packs will be deleted during the repack. To generate a
cruft pack, the packs that stay are formed by doing a reachability
traversal and writing a pack that contains all reachable objects. That
way anything that is in `all-packs \ reachable` (which forms the contents
of the cruft pack) is going to be just the unreachable objects.

But if a pack that was going to be deleted by repack *after* generating
the cruft pack disappears while `repack` is running (particularly after
it starts, but before it generates the cruft pack), then the cruft pack
generation can't proceed, since it has no idea what objects it needs to
add.

Namely, if it is told that pack P is going to be deleted, any objects in
P which don't appear in a pack that *isn't* going to be deleted need to
be saved. If we don't have a copy of P anymore, then there is no way to
come up with what that set of objects is, meaning that it's impossible
to generate the cruft pack.

I should mention, BTW, that this isn't a problem unique to cruft packs.
Geometric repacking, which uses `pack-objects --stdin-packs` has a
similar issue where if a pack being combined is removed from underneath
while `repack` is running, `pack-objects` cannot successfully generate
the pack.

> However, I think there is yet another bug at play: running
> `incremental-repack` appears to be able to repack the cruft packfile.
> In doing so, it deletes its .mtimes file and the metadata inside.

That sounds like a bug to me. I'll take a look into it and see what I
can find.

Thanks,
Taylor

* Re: Race condition between repack and loose-objects maintenance task
  2022-06-30  3:19         ` Taylor Blau
@ 2022-06-30  3:23           ` Taylor Blau
  2022-06-30 20:12             ` Junio C Hamano
                               ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Taylor Blau @ 2022-06-30  3:23 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Gregory Szorc, git, stolee

On Wed, Jun 29, 2022 at 11:19:09PM -0400, Taylor Blau wrote:
> > However, I think there is yet another bug at play: running
> > `incremental-repack` appears to be able to repack the cruft packfile.
> > In doing so, it deletes its .mtimes file and the metadata inside.
>
> That sounds like a bug to me. I'll take a look into it and see what I
> can find.

I actually think that there are two bugs here.

One is that we run a MIDX repack and expire, which could lead to us
repacking the entire contents of the cruft pack and then expiring the
metadata file. This is a bug, and we should exclude cruft packs from
this computation.

Another bug can happen when the cruft pack gets written into the MIDX
and is MIDX-expireable (meaning that no objects are selected from the
cruft pack). In that case, the `git multi-pack-index expire` step would
remove the cruft pack entirely, which is also incorrect.

I'll take a look at fixing both of these, and thanks for pointing them
out!

Thanks,
Taylor

* Re: Race condition between repack and loose-objects maintenance task
  2022-06-30  3:23           ` Taylor Blau
@ 2022-06-30 20:12             ` Junio C Hamano
  2022-07-05 18:43               ` Gregory Szorc
  2022-07-20  1:40             ` Gregory Szorc
  2022-07-20  1:41             ` Gregory Szorc
  2 siblings, 1 reply; 16+ messages in thread
From: Junio C Hamano @ 2022-06-30 20:12 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Gregory Szorc, git, stolee

Taylor Blau <me@ttaylorr.com> writes:

> On Wed, Jun 29, 2022 at 11:19:09PM -0400, Taylor Blau wrote:
>> > However, I think there is yet another bug at play: running
>> > `incremental-repack` appears to be able to repack the cruft packfile.
>> > In doing so, it deletes its .mtimes file and the metadata inside.
>>
>> That sounds like a bug to me. I'll take a look into it and see what I
>> can find.
>
> I actually think that there are two bugs here.
>
> One is that we run a MIDX repack and expire, which could lead to us
> repacking the entire contents of the cruft pack and then expiring the
> metadata file. This is a bug, and we should exclude cruft packs from
> this computation.
>
> Another bug can happen when the cruft pack gets written into the MIDX
> and is MIDX-expireable (meaning that no objects are selected from the
> cruft pack). In that case, the `git multi-pack-index expire` step would
> remove the cruft pack entirely, which is also incorrect.
>
> I'll take a look at fixing both of these, and thanks for pointing them
> out!

Thanks, both.

The fact that the semantics of the .mtimes file are not equivalent
to the mtime on individual loose objects does not help when thinking
about the possible ways the "cruft" pack can break, and both of the
possible issues you discuss above are indeed tricky ones.



* Re: Race condition between repack and loose-objects maintenance task
  2022-06-30 20:12             ` Junio C Hamano
@ 2022-07-05 18:43               ` Gregory Szorc
  2022-07-06  8:50                 ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 16+ messages in thread
From: Gregory Szorc @ 2022-07-05 18:43 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Taylor Blau, git, stolee

Thinking about this some more, it is clear that running `git gc` /
`git repack` *and* `git maintenance` simultaneously is prone to
badness, since their locking mechanisms aren't aware of each other.

I think it makes sense to consolidate onto `git maintenance` going
forward as this seems to be where the inertia is. However, a barrier
to that for me is the objects/maintenance.lock file has no acquisition
timeout and will wait indefinitely. So in scenarios like mine where
you have multiple processes looping over repos invoking `git
maintenance run`, you can have extended execution stalls during
long-running operations like `gc`.

There should probably be a configurable timeout for the maintenance
lock like there is for other locks. This shouldn't be too much work
and I may throw up a patch if others feel this is a good idea.

On Thu, Jun 30, 2022 at 1:12 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Taylor Blau <me@ttaylorr.com> writes:
>
> > On Wed, Jun 29, 2022 at 11:19:09PM -0400, Taylor Blau wrote:
> >> > However, I think there is yet another bug at play: running
> >> > `incremental-repack` appears to be able to repack the cruft packfile.
> >> > In doing so, it deletes its .mtimes file and the metadata inside.
> >>
> >> That sounds like a bug to me. I'll take a look into it and see what I
> >> can find.
> >
> > I actually think that there are two bugs here.
> >
> > One is that we run a MIDX repack and expire, which could lead to us
> > repacking the entire contents of the cruft pack and then expiring the
> > metadata file. This is a bug, and we should exclude cruft packs from
> > this computation.
> >
> > Another bug can happen when the cruft pack gets written into the MIDX
> > and is MIDX-expireable (meaning that no objects are selected from the
> > cruft pack). In that case, the `git multi-pack-index expire` step would
> > remove the cruft pack entirely, which is also incorrect.
> >
> > I'll take a look at fixing both of these, and thanks for pointing them
> > out!
>
> Thanks, both.
>
> The fact that the semantics of the .mtimes file being not equivalent
> to the mtime on individual loose objects does not help thinking
> about the possible ways the "cruft" pack can break, and both of the
> possible issues you discuss above are indeed tricky ones.
>
>

* Re: Race condition between repack and loose-objects maintenance task
  2022-07-05 18:43               ` Gregory Szorc
@ 2022-07-06  8:50                 ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 16+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-07-06  8:50 UTC (permalink / raw)
  To: Gregory Szorc; +Cc: Junio C Hamano, Taylor Blau, git, stolee


On Tue, Jul 05 2022, Gregory Szorc wrote:

> Thinking about this some more, it is clear that running `git gc` /
> `git repack` *and* `git maintenance` simultaneously is prone to
> badness, since their locking mechanisms aren't aware of each other.

Yes, there are outstanding issues with the "maintenance" and "gc" locks,
they should be unified, and the gc.lock has bugs & race conditions that
need to be fixed. See these past threads:

    https://lore.kernel.org/git/87d02fi75p.fsf@evledraar.gmail.com/
    https://lore.kernel.org/git/87ef3o7ws1.fsf@evledraar.gmail.com/

I also have some old WIP patches to add a "gc testing" mode to our test
suite, similar to the "commit graph" instrumentation if you're
interested in working on this.

I.e. almost every command will fork off a "git gc --auto", and we'll see
based on the errors whether our locking is still buggy/racy.

> I think it makes sense to consolidate onto `git maintenance` going
> forward as this seems to be where the inertia is. However, a barrier
> to that for me is the objects/maintenance.lock file has no acquisition
> timeout and will wait indefinitely. So in scenarios like mine where
> you have multiple processes looping over repos invoking `git
> maintenance run`, you can have extended execution stalls during
> long-running operations like `gc`.
> [...]
> There should probably be a configurable timeout for the maintenance
> lock like there is for other locks. This shouldn't be too much work
> and I may throw up a patch if others feel this is a good idea.

...but while I think this is all worthwhile I think you're on the wrong
path if you think this will help much or at all with the issue being
reported here.

Fixing the gc.lock (and maintenance lock...) would be nice because we'd
have "gc" be less dumb, and it wouldn't get itself into lock races etc.

But you cannot hope to fix the underlying "object expiry as things
become reachable" race condition that way, because that's those tasks
racing with *other* object/ref creation.

Such a fix would either need some repository-global lock (which would
entail all sorts of nastiness), or other workarounds around the inherent
race between different object store & ref operations, see Taylor's
recent "cruft pack write-out" series & my links to some past discussions
of the race here:

    https://lore.kernel.org/git/220630.86y1xeeeik.gmgdl@evledraar.gmail.com/

* Re: Race condition between repack and loose-objects maintenance task
  2022-06-30  3:23           ` Taylor Blau
  2022-06-30 20:12             ` Junio C Hamano
@ 2022-07-20  1:40             ` Gregory Szorc
  2022-07-20  9:52               ` Ævar Arnfjörð Bjarmason
  2022-07-20  1:41             ` Gregory Szorc
  2 siblings, 1 reply; 16+ messages in thread
From: Gregory Szorc @ 2022-07-20  1:40 UTC (permalink / raw)
  To: git; +Cc: git, stolee

On 6/29/2022 8:23 PM, Taylor Blau wrote:
> On Wed, Jun 29, 2022 at 11:19:09PM -0400, Taylor Blau wrote:
>>> However, I think there is yet another bug at play: running
>>> `incremental-repack` appears to be able to repack the cruft packfile.
>>> In doing so, it deletes its .mtimes file and the metadata inside.
>>
>> That sounds like a bug to me. I'll take a look into it and see what I
>> can find.
> 
> I actually think that there are two bugs here.
> 
> One is that we run a MIDX repack and expire, which could lead to us
> repacking the entire contents of the cruft pack and then expiring the
> metadata file. This is a bug, and we should exclude cruft packs from
> this computation.
> 
> Another bug can happen when the cruft pack gets written into the MIDX
> and is MIDX-expireable (meaning that no objects are selected from the
> cruft pack). In that case, the `git multi-pack-index expire` step would
> remove the cruft pack entirely, which is also incorrect.
> 
> I'll take a look at fixing both of these, and thanks for pointing them
> out!

For posterity, when I disabled cruft packfiles after having it enabled 
for a few weeks, the next `git gc` invocation on a high traffic repo 
resulted in >100k loose objects/files being created before they were 
summarily deleted by the GC's prune. This is significantly greater than 
the unreferenced object creation rate of the underlying repo. So it 
appears as if the MIDX stripping of the cruft packfile mtimes
effectively disabled pruning, leading to a build-up of unreferenced objects.

Fortunately I hadn't deployed cruft packfiles to production. If I had, 
the excessive filesystem churn on NFS would have caused an incident due 
to degraded performance.

Since the interaction between cruft packfiles and MIDX appears to be 
buggy, I think I'm going to leave cruft packfiles disabled until these 
features work better together.



* Re: Race condition between repack and loose-objects maintenance task
  2022-06-30  3:23           ` Taylor Blau
  2022-06-30 20:12             ` Junio C Hamano
  2022-07-20  1:40             ` Gregory Szorc
@ 2022-07-20  1:41             ` Gregory Szorc
  2 siblings, 0 replies; 16+ messages in thread
From: Gregory Szorc @ 2022-07-20  1:41 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, stolee

On Wed, Jun 29, 2022 at 8:23 PM Taylor Blau <me@ttaylorr.com> wrote:
>
> On Wed, Jun 29, 2022 at 11:19:09PM -0400, Taylor Blau wrote:
> > > However, I think there is yet another bug at play: running
> > > `incremental-repack` appears to be able to repack the cruft packfile.
> > > In doing so, it deletes its .mtimes file and the metadata inside.
> >
> > That sounds like a bug to me. I'll take a look into it and see what I
> > can find.
>
> I actually think that there are two bugs here.
>
> One is that we run a MIDX repack and expire, which could lead to us
> repacking the entire contents of the cruft pack and then expiring the
> metadata file. This is a bug, and we should exclude cruft packs from
> this computation.
>
> Another bug can happen when the cruft pack gets written into the MIDX
> and is MIDX-expireable (meaning that no objects are selected from the
> cruft pack). In that case, the `git multi-pack-index expire` step would
> remove the cruft pack entirely, which is also incorrect.
>
> I'll take a look at fixing both of these, and thanks for pointing them
> out!

For posterity, when I disabled cruft packfiles after having it enabled
for a few weeks, the next `git gc` invocation on a high traffic repo
resulted in >100k loose objects/files being created before they were
summarily deleted by the GC's prune. This is significantly greater
than the unreferenced object creation rate of the underlying repo. So
it appears as if the MIDX stripping of the cruft packfile mtimes
effectively disabled pruning, leading to a build-up of unreferenced objects.

Fortunately I hadn't deployed cruft packfiles to production. If I had,
the excessive filesystem churn on NFS would have caused an incident
due to degraded performance.

Since the interaction between cruft packfiles and MIDX appears to be
buggy, I think I'm going to leave cruft packfiles disabled until these
features work better together.

* Re: Race condition between repack and loose-objects maintenance task
  2022-07-20  1:40             ` Gregory Szorc
@ 2022-07-20  9:52               ` Ævar Arnfjörð Bjarmason
  2022-07-26 16:22                 ` Gregory Szorc
  0 siblings, 1 reply; 16+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-07-20  9:52 UTC (permalink / raw)
  To: Gregory Szorc; +Cc: git, stolee, Taylor Blau


On Tue, Jul 19 2022, Gregory Szorc wrote:

> On 6/29/2022 8:23 PM, Taylor Blau wrote:
>> On Wed, Jun 29, 2022 at 11:19:09PM -0400, Taylor Blau wrote:
>>>> However, I think there is yet another bug at play: running
>>>> `incremental-repack` appears to be able to repack the cruft packfile.
>>>> In doing so, it deletes its .mtimes file and the metadata inside.
>>>
>>> That sounds like a bug to me. I'll take a look into it and see what I
>>> can find.
>> I actually think that there are two bugs here.
>> One is that we run a MIDX repack and expire, which could lead to us
>> repacking the entire contents of the cruft pack and then expiring the
>> metadata file. This is a bug, and we should exclude cruft packs from
>> this computation.
>> Another bug can happen when the cruft pack gets written into the
>> MIDX
>> and is MIDX-expireable (meaning that no objects are selected from the
>> cruft pack). In that case, the `git multi-pack-index expire` step would
>> remove the cruft pack entirely, which is also incorrect.
>> I'll take a look at fixing both of these, and thanks for pointing
>> them
>> out!
>
> For posterity, when I disabled cruft packfiles after having it enabled
> for a few weeks, the next `git gc` invocation on a high traffic repo 
> resulted in >100k loose objects/files being created before they were
> summarily deleted by the GC's prune. This is significantly greater
> than the unreferenced object creation rate of the underlying repo. So
> it appears as if the MIDX stripping of the cruft packfile mtimes
> effectively disabled pruning, leading to a build-up of unreferenced objects.
>
> Fortunately I hadn't deployed cruft packfiles to production. If I had,
> the excessive filesystem churn on NFS would have caused an incident
> due to degraded performance.
>
> Since the interaction between cruft packfiles and MIDX appears to be
> buggy, I think I'm going to leave cruft packfiles disabled until these 
> features work better together.

I haven't looked deeply into whether there's any cruft packfile & MIDX
interaction at play here, but I suspect based on past experience that
this has nothing whatsoever to do with the MIDX.

It's because git suddenly found a bunch of objects that should be
evicted from the packs, as you disabled the cruft pack feature.

Here's a past test-case & report of mine where I ran into that:

    https://lore.kernel.org/git/87fu6bmr0j.fsf@evledraar.gmail.com/

That was way before cruft packfiles were integrated, but from my
understanding of how they work the mechanism is exactly the same.

I.e. in that case I deleted a bunch of refs (tag objects), which caused
those to become unreferenced, so on the next "gc/repack" they were all
evicted from their respective packs, at which point the "gc.pruneExpire"
clock started ticking for them.

Whereas in your case you disabled "cruft packs", which are basically a
mechanism where git keeps such "unused" objects in a pack. Once you
disabled it git stopped considering the new *.mtimes file, and started
extracting these loose objects en-masse.

Which b.t.w. is something I think we might want to reconsider. I.e. we
just pass "--cruft" to pack-objects (see b757353676d
(builtin/pack-objects.c: --cruft without expiration, 2022-05-20)), but
don't have a "write cruft" v.s. "read cruft" setting (but see below).

The one thing that gives me pause here is your mention of "summarily
deleted by the GC's prune". Did you run "git prune --expire=now"
manually, or do you have a really aggressive "gc.pruneExpire" setting?
It's also possible that I'm entirely misunderstanding this whole thing...

Anyway.

Since that 2018-era post of mine I've thought a bit about how to best
handle this particular edge case.

I think the simplest implementation that would work well enough would be
to teach "git repack" to limit how many objects it's willing to extract
from a pack. I.e. we'd have the amount of "unreachable" objects we'd be
willing to eject from the pack limited, with check similar to how
"gc.auto" works (i.e. once we reach the limit we'd implicitly turn on
"--keep-unreachable").

It's far from perfect, and e.g. if the N objects you're extracting
happen to delta-compress particularly well you could still have cases
where e.g. a 1GB repo becomes a 10GB one with the new loose objects.

But it would allow us to limit the common case where we'd e.g. create 1k
new loose objects, not 100k or whatever.

It does have the downside that unless you'd disable that limit the "gc"
-> "gc.pruneExpire" window would become even more fuzzy. I.e. now if you
run "gc" the clock starts ticking right away, after this the clock would
start ticking whenever we happened to extract those objects.

So there's certainly less dumb ways to do this, and we'd need to make
sure that whatever limit we set is aggressive enough that we'd always
"stay ahead". I.e. if a repo just has a very high creation rate of
objects that become unreachable we'd need to ensure that we wouldn't
have an ever growing .git as our eviction rate wouldn't keep up with our
additions.

For cruft packs (this is a continuation of the "see below" above) we'd
do the same thing, but could benefit from knowing that such objects are
"already expired", so we could read the *.mtimes, and possibly expire
some right away.

Aside from some very obscure use-cases I think these sorts of
"loose explosions" are one-off events, so it would be good if we can
smooth those out. The various gc.* settings can be hard to reason about,
but we really do try to be gentle by default, yet notably fail very badly
at that in some known edge cases, such as this one.




* Re: Race condition between repack and loose-objects maintenance task
  2022-07-20  9:52               ` Ævar Arnfjörð Bjarmason
@ 2022-07-26 16:22                 ` Gregory Szorc
  2022-07-26 18:11                   ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 16+ messages in thread
From: Gregory Szorc @ 2022-07-26 16:22 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: git, stolee, Taylor Blau

On Wed, Jul 20, 2022 at 3:12 AM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
>
> On Tue, Jul 19 2022, Gregory Szorc wrote:
>
> > On 6/29/2022 8:23 PM, Taylor Blau wrote:
> >> On Wed, Jun 29, 2022 at 11:19:09PM -0400, Taylor Blau wrote:
> >>>> However, I think there is yet another bug at play: running
> >>>> `incremental-repack` appears to be able to repack the cruft packfile.
> >>>> In doing so, it deletes its .mtimes file and the metadata inside.
> >>>
> >>> That sounds like a bug to me. I'll take a look into it and see what I
> >>> can find.
> >> I actually think that there are two bugs here.
> >> One is that we run a MIDX repack and expire, which could lead to us
> >> repacking the entire contents of the cruft pack and then expiring the
> >> metadata file. This is a bug, and we should exclude cruft packs from
> >> this computation.
> >> Another bug can happen when the cruft pack gets written into the
> >> MIDX
> >> and is MIDX-expireable (meaning that no objects are selected from the
> >> cruft pack). In that case, the `git multi-pack-index expire` step would
> >> remove the cruft pack entirely, which is also incorrect.
> >> I'll take a look at fixing both of these, and thanks for pointing
> >> them
> >> out!
> >
> > For posterity, when I disabled cruft packfiles after having it enabled
> > for a few weeks, the next `git gc` invocation on a high traffic repo
> > resulted in >100k loose objects/files being created before they were
> > summarily deleted by the GC's prune. This is significantly greater
> > than the unreferenced object creation rate of the underlying repo. So
> > it appears as if the MIDX stripping of the cruft packfile mtimes
> > effectively disabled pruning, leading to a build-up of unreferenced objects.
> >
> > Fortunately I hadn't deployed cruft packfiles to production. If I had,
> > the excessive filesystem churn on NFS would have caused an incident
> > due to degraded performance.
> >
> > Since the interaction between cruft packfiles and MIDX appears to be
> > buggy, I think I'm going to leave cruft packfiles disabled until these
> > features work better together.
>
> I haven't looked deeply into whether there's any cruft packfile & MIDX
> interaction at play here, but I suspect based on past experience that
> this has nothing whatsoever to do with the MIDX.

Taylor seemed to acknowledge a bug here a few posts back. Essentially:

1. Cruft packfile created
2. MIDX repack and expire treats the cruft packfile as a regular pack
and folds its content into a new, regular packfile, deleting the
mtimes file/metadata in the process.

I don't fully grok the mechanism, but I believe the loss of the mtimes
metadata effectively results in indefinite retention of unreferenced
objects.

>
> It's because git suddenly found a bunch of objects that should be
> evicted from the packs, as you disabled the cruft pack feature.
>
> Here's a past test-case & report of mine where I ran into that:
>
>     https://lore.kernel.org/git/87fu6bmr0j.fsf@evledraar.gmail.com/
>
> That was way before cruft packfiles were integrated, but from my
> understanding of how they work the mechanism is exactly the same.
>
> I.e. in that case I deleted a bunch of refs (tag objects), which caused
> those to become unreferenced, so on the next "gc/repack" they were all
> evicted from their respective packs, at which point the "gc.pruneExpire"
> clock started ticking for them.
>
> Whereas in your case you disabled "cruft packs", which are basically a
> mechanism where git keeps such "unused" objects in a pack. Once you
> disabled it git stopped considering the new *.mtimes file, and started
> extracting these loose objects en-masse.
>
> Which b.t.w. is something I think we might want to reconsider. I.e. we
> just pass "--cruft" to pack-objects (see b757353676d
> (builtin/pack-objects.c: --cruft without expiration, 2022-05-20)), but
> don't have a "write cruft" v.s. "read cruft" setting (but see below).
>
> The one thing that gives me pause here is your mention of "summarily
> deleted by the GC's prune". Did you run "git prune --expire=now"
> manually, or do you have a really aggressive "gc.pruneExpire" setting?
> It's also possible that I'm entirely misunderstanding this whole thing...
>
> Anyway.
>
> Since that 2018-era post of mine I've thought a bit about how to best
> handle this particular edge case.
>
> I think the simplest implementation that would work well enough would be
> to teach "git repack" to limit how many objects it's willing to extract
> from a pack. I.e. we'd have the amount of "unreachable" objects we'd be
> willing to eject from the pack limited, with check similar to how
> "gc.auto" works (i.e. once we reach the limit we'd implicitly turn on
> "--keep-unreachable").
>
> It's far from perfect, and e.g. if the N objects you're extracting
> happen to delta-compress particularly well you could still have cases
> where e.g. a 1GB repo becomes a 10GB one with the new loose objects.
>
> But it would allow us to limit the common case where we'd e.g. create 1k
> new loose objects, not 100k or whatever.
>
> It does have the downside that unless you'd disable that limit the "gc"
> -> "gc.pruneExpire" window would become even more fuzzy. I.e. now if you
> run "gc" the clock starts ticking right away, after this the clock would
> start ticking whenever we happened to extract those objects.
>
> So there's certainly less dumb ways to do this, and we'd need to make
> sure that whatever limit we set is aggressive enough that we'd always
> "stay ahead". I.e. if a repo just has a very high creation rate of
> objects that become unreachable we'd need to ensure that we wouldn't
> have an ever growing .git as our eviction rate wouldn't keep up with our
> additions.

I agree this is a hard problem! In my case, the underlying repo sees a
few thousand pushes daily. This results in ~30k new unreferenced
objects on a typical workday. With NFS/EFS, we've observed that the
entire filesystem can get slow once object loosening has written as
few as a few thousand files. Given our high volume, I worry that
imposing a loose object creation ceiling will perpetually throttle and
we'll accumulate unreferenced objects indefinitely.

> For cruft packs (this is a continuation of the "see below" above) we'd
> do the same thing, but could benefit from knowing that such objects are
> "already expired", so we could read the *.mtimes, and possibly expire
> some right away.

I don't fully grok how pruning works in a cruft packfile world. My
naive hope is that the process creating the cruft packfile is able to
elide expired objects so we don't even bother with them during cruft
packfile generation. (It seems inefficient/avoidable that we write an
expired object - loose or cruft packed - from the context of a gc only
to delete it moments later.)

> Aside from some very obscure use-cases I think these sorts of
> "loose explosions" are one-off events, so it would be good if we can
> smooth those out. The various gc.* settings can be hard to reason about,
> but we really do try to be gentle by default, yet notably fail very badly
> at that in some known edge cases, such as this one.

I agree that my scenario of creating an explosion of loose objects is
a one-off. IMO not worth spending too much time thinking about. That
being said, ~daily pruning on a large enough repo starts to resemble
these one-off events. And I think you need something like cruft
packfiles to mitigate it. If cruft packfiles are ever enabled by
default, it may have a nice side-effect of preventing these one-offs!
(But again, I don't grok how pruning works in a cruft packfile world.)

* Re: Race condition between repack and loose-objects maintenance task
  2022-07-26 16:22                 ` Gregory Szorc
@ 2022-07-26 18:11                   ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 16+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-07-26 18:11 UTC (permalink / raw)
  To: Gregory Szorc; +Cc: git, stolee, Taylor Blau


On Tue, Jul 26 2022, Gregory Szorc wrote:

> On Wed, Jul 20, 2022 at 3:12 AM Ævar Arnfjörð Bjarmason
> <avarab@gmail.com> wrote:
>>
>>
>> On Tue, Jul 19 2022, Gregory Szorc wrote:
>>
>> > On 6/29/2022 8:23 PM, Taylor Blau wrote:
>> >> On Wed, Jun 29, 2022 at 11:19:09PM -0400, Taylor Blau wrote:
>>> [...]
>> > Since the interaction between cruft packfiles and MIDX appears to be
>> > buggy, I think I'm going to leave cruft packfiles disabled until these
>> > features work better together.
>>
>> I haven't looked deeply into whether there's any cruft packfile & MIDX
>> interaction at play here, but I suspect based on past experience that
>> this has nothing whatsoever to do with the MIDX.
>
> Taylor seemed to acknowledge a bug here a few posts back. Essentially:
>
> 1. Cruft packfile created
> 2. MIDX repack and expire treats the cruft packfile as a regular pack
> and folds its content into a new, regular packfile, deleting the
> mtimes file/metadata in the process.
>
> I don't fully grok the mechanism, but I believe the loss of the mtimes
> metadata effectively results in indefinite retention of unreferenced
> objects.

Maybe, but I don't see how indefinite retention would happen unless
you're flip-flopping back & forth.

If we're deleting the mtimes file but keeping the pack they're in we'd
notice that those packed objects are unreferenced on the next gc, so
we'd explode them into loose objects, and then the timer starts ticking
on their expiry.

It's a lot of churn, but once that timer expires we'll then expire those
objects.

>>
>> It's because git suddenly found a bunch of objects that should be
>> evicted from the packs, as you disabled the cruft pack feature.
>>
>> Here's a past test-case & report of mine where I ran into that:
>>
>>     https://lore.kernel.org/git/87fu6bmr0j.fsf@evledraar.gmail.com/
>>
>> That was way before cruft packfiles were integrated, but from my
>> understanding of how they work the mechanism is exactly the same.
>>
>> I.e. in that case I deleted a bunch of refs (tag objects), which caused
>> those to become unreferenced, so on the next "gc/repack" they were all
>> evicted from their respective packs, at which point the "gc.pruneExpire"
>> clock started ticking for them.
>>
>> Whereas in your case you disabled "cruft packs", which are basically a
>> mechanism where git keeps such "unused" objects in a pack. Once you
>> disabled it git stopped considering the new *.mtimes file, and started
>> extracting these loose objects en-masse.
>>
>> Which b.t.w. is something I think we might want to reconsider. I.e. we
>> just pass "--cruft" to pack-objects (see b757353676d
>> (builtin/pack-objects.c: --cruft without expiration, 2022-05-20)), but
>> don't have a "write cruft" v.s. "read cruft" setting (but see below).
>>
>> The one thing that gives me pause here is your mention of "summarily
>> deleted by the GC's prune". Did you run "git prune --expire=now"
>> manually, or do you have a really aggressive "gc.pruneExpire" setting?
>> It's also possible that I'm entirely misunderstanding this whole thing...
>>
>> Anyway.
>>
>> Since that 2018-era post of mine I've thought a bit about how to best
>> handle this particular edge case.
>>
>> I think the simplest implementation that would work well enough would be
>> to teach "git repack" to limit how many objects it's willing to extract
>> from a pack. I.e. we'd have the amount of "unreachable" objects we'd be
>> willing to eject from the pack limited, with check similar to how
>> "gc.auto" works (i.e. once we reach the limit we'd implicitly turn on
>> "--keep-unreachable").
>>
>> It's far from perfect, and e.g. if the N objects you're extracting
>> happen to delta-compress particularly well you could still have cases
>> where e.g. a 1GB repo becomes a 10GB one with the new loose objects.
>>
>> But it would allow us to limit the common case where we'd e.g. create 1k
>> new loose objects, not 100k or whatever.
>>
>> It does have the downside that unless you'd disable that limit the "gc"
>> -> "gc.pruneExpire" window would become even more fuzzy. I.e. now if you
>> run "gc" the clock starts ticking right away, after this the clock would
>> start ticking whenever we happened to extract those objects.
>>
>> So there's certainly less dumb ways to do this, and we'd need to make
>> sure that whatever limit we set is aggressive enough that we'd always
>> "stay ahead". I.e. if a repo just has a very high creation rate of
>> objects that become unreachable we'd need to ensure that we wouldn't
>> have an ever growing .git as our eviction rate wouldn't keep up with our
>> additions.
>
> I agree this is a hard problem! In my case, the underlying repo sees a
> few thousand pushes daily. This results in ~30k new unreferenced
> objects on a typical workday. With NFS/EFS, we've observed that the
> entire filesystem can get slow once object loosening has written as
> few as a few thousand files. Given our high volume, I worry that
> imposing a loose object creation ceiling will perpetually throttle and
> we'll accumulate unreferenced objects indefinitely.

If you're feeling adventurous you may want to try to rebase or otherwise
experiment with this old series of mine, that never made it in:
https://lore.kernel.org/git/20181028225023.26427-1-avarab@gmail.com/

I don't run git on NFS anymore, but when I did I found that the main
contributor to loose object-related slowness on pushes was the collision
checking.

Whereas if you're willing to simply write duplicate objects ....
