* Race condition between repack and loose-objects maintenance task @ 2022-06-29 16:55 Gregory Szorc 2022-06-29 17:03 ` Taylor Blau 0 siblings, 1 reply; 16+ messages in thread From: Gregory Szorc @ 2022-06-29 16:55 UTC (permalink / raw) To: git, stolee

`git repack -A` (such as invoked by `git gc`) loosens unreachable objects from packfiles. `git maintenance`'s `loose-objects` task consolidates loose objects into a packfile. As neither process takes a lock or is otherwise aware of the existence of the other, there is a race condition:

1. `git repack -A` creates loose objects
2. `git maintenance`'s `loose-objects` task deletes those loose objects
3. `git repack -A` fails to find the loose objects it just created and aborts with `fatal: unable to add recent objects`

I see this failure regularly in a custom environment where high-mutation-volume repositories invoke `git maintenance run` frequently (effectively in reaction to repo mutations). We have a Git housekeeping strategy that frequently invokes `git maintenance` with all but the `gc` task. Roughly daily, we invoke `git gc` to purge accumulated garbage. This `git gc` frequently fails due to the aforementioned race condition, since the long wall time of `git gc` practically guarantees that a `git maintenance` run will be triggered while it is running, given how frequently the repo is mutated.

In my scenario, I believe we have a workaround: omit `--task=loose-objects` from the frequent `git maintenance` invocation if a `gc.pid` file exists. However, this is only a partial solution, as `git repack` and the `loose-objects` task remain unaware of each other. (I suppose running the `gc` maintenance task and relying on the maintenance lock file could also work. However, we invoke `git gc` explicitly since `git maintenance` doesn't allow passing custom arguments through to `git gc`.)

Gregory

^ permalink raw reply [flat|nested] 16+ messages in thread
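The failing sequence above starts with `git repack -A` ejecting newly unreachable objects from packs back into loose storage. That first step can be reproduced in a throwaway repository; the repo contents and identities below are purely illustrative:

```shell
set -e
tmp=$(mktemp -d)
git -c init.defaultBranch=main init -q "$tmp/repo"
cd "$tmp/repo"
git -c user.name=t -c user.email=t@example.com commit -q --allow-empty -m one
blob=$(echo scratch | git hash-object -w --stdin)
git tag keep "$blob"           # make the blob reachable for now
git repack -a -d -q            # everything migrates into a single pack
git tag -d keep >/dev/null     # the blob is now unreachable
git repack -A -d -q            # -A loosens it instead of deleting it
find .git/objects -type f -path '*/??/*'   # the loosened object reappears
```

Between the final `repack -A -d` writing that loose object and a later pass deleting it, a concurrent `loose-objects` task can pull the object out from under the repack, which is the race described above.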
* Re: Race condition between repack and loose-objects maintenance task 2022-06-29 16:55 Race condition between repack and loose-objects maintenance task Gregory Szorc @ 2022-06-29 17:03 ` Taylor Blau 2022-06-29 17:10 ` Taylor Blau 2022-06-29 17:16 ` Gregory Szorc 0 siblings, 2 replies; 16+ messages in thread From: Taylor Blau @ 2022-06-29 17:03 UTC (permalink / raw) To: Gregory Szorc; +Cc: git, stolee Hi Greg, On Wed, Jun 29, 2022 at 09:55:54AM -0700, Gregory Szorc wrote: > 1. `git repack -A` creates loose objects > 2. `git maintenance`'s `loose-objects` task deletes those loose objects > 3. `git repack -A` fails to find the loose objects it just created and > aborts with `fatal: unable to add recent objects` This is a somewhat well-known race that occurs when one Git process decides unreachable objects are safe to be deleted, but an incoming push or reference update makes those to-be-deleted objects reachable before they are actually removed, leaving the repository in a corrupt state. I'm surprised that the loose-objects maintenance task deletes those objects, though, since it just runs `prune-packed` or (the equivalent of) `repack -d`, neither of which will actually delete objects from the repository. I see that Stolee is already on the CC list, so perhaps he could chime in on the above. In either case, my recommendation would be to keep those unreachable objects which haven't yet aged out of the repository around for longer, which will decrease the likelihood of seeing the race. If your repository has a lot of unreachable objects (such that storing each one of them individually as loose is impractical), I would recommend using cruft packs (e.g., by running either `git repack --cruft -d` or `git gc --cruft`) to collect those unreachable objects together into a single pack. See Documentation/technical/cruft-packs.txt for more information about cruft packs. Thanks, Taylor ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Race condition between repack and loose-objects maintenance task 2022-06-29 17:03 ` Taylor Blau @ 2022-06-29 17:10 ` Taylor Blau 2022-06-29 17:16 ` Gregory Szorc 1 sibling, 0 replies; 16+ messages in thread From: Taylor Blau @ 2022-06-29 17:10 UTC (permalink / raw) To: stolee; +Cc: git, Gregory Szorc On Wed, Jun 29, 2022 at 01:03:54PM -0400, Taylor Blau wrote: > I see that Stolee is already on the CC list, so perhaps he could chime > in on the above. I haven't looked at the maintenance code too closely as of yet, but I have a plausible explanation for why maintenance is removing loose *unreachable* objects. Ordinarily the loose-objects task alone will not remove any loose objects which don't already appear in a pack. That's because we first try to run prune-packed, and then the equivalent of `git repack -d` to consolidate loose objects together into a single pack. But, we only do that for the first 50,000 loose objects. Anything else gets left around and then is likely removed by the `git gc` task which is run separately, and after the loose-objects task. Using cruft packs won't change the race fundamentally, but storing the unreachable objects in a cruft pack will mean that you can keep more objects around stored together in a single pack. In other words, making it safe to increase the grace period beyond what you would ordinarily be able to do with the classic pruning mechanism. Thanks, Taylor ^ permalink raw reply [flat|nested] 16+ messages in thread
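The two-phase behavior Taylor describes (prune-packed first, then packing up to 50,000 loose objects) means a loose object is packed on one run and its loose copy removed on the next. A sketch in a scratch repository, with illustrative contents:

```shell
set -e
tmp=$(mktemp -d)
git -c init.defaultBranch=main init -q "$tmp/repo"
cd "$tmp/repo"
git -c user.name=t -c user.email=t@example.com commit -q --allow-empty -m one
echo stray | git hash-object -w --stdin >/dev/null
git maintenance run --quiet --task=loose-objects  # batches loose objects into a pack
git maintenance run --quiet --task=loose-objects  # prune-packed now drops the loose copies
find .git/objects -type f -path '*/??/*'          # prints nothing: no loose copies remain
```

It is that second-phase deletion, landing between a concurrent `repack -A` writing a loose object and reading it back, that produces the failure in the original report.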
* Re: Race condition between repack and loose-objects maintenance task 2022-06-29 17:03 ` Taylor Blau 2022-06-29 17:10 ` Taylor Blau @ 2022-06-29 17:16 ` Gregory Szorc 2022-06-29 17:21 ` Taylor Blau 1 sibling, 1 reply; 16+ messages in thread From: Gregory Szorc @ 2022-06-29 17:16 UTC (permalink / raw) To: Taylor Blau; +Cc: git, stolee

On Wed, Jun 29, 2022 at 10:03 AM Taylor Blau <me@ttaylorr.com> wrote:
>
> Hi Greg,
>
> On Wed, Jun 29, 2022 at 09:55:54AM -0700, Gregory Szorc wrote:
> > 1. `git repack -A` creates loose objects
> > 2. `git maintenance`'s `loose-objects` task deletes those loose objects
> > 3. `git repack -A` fails to find the loose objects it just created and
> > aborts with `fatal: unable to add recent objects`
>
> This is a somewhat well-known race that occurs when one Git process
> decides unreachable objects are safe to be deleted, but an incoming push
> or reference update makes those to-be-deleted objects reachable before
> they are actually removed, leaving the repository in a corrupt state.
>
> I'm surprised that the loose-objects maintenance task deletes those
> objects, though, since it just runs `prune-packed` or (the equivalent
> of) `repack -d`, neither of which will actually delete objects from the
> repository.
>
> I see that Stolee is already on the CC list, so perhaps he could chime
> in on the above.
>
> In either case, my recommendation would be to keep those unreachable
> objects which haven't yet aged out of the repository around for longer,
> which will decrease the likelihood of seeing the race.

We had to lower gc.pruneExpire from its default of 1 hour because otherwise this would result in the creation of thousands of loose objects. This is normally acceptable. However, on NFS, the churn from lots of file creation and deletion resulted in unacceptable performance degradation. We had to lower gc.pruneExpire to minimize the damage.
> If your > repository has a lot of unreachable objects (such that storing each > one of them individually as loose is impractical), I would recommend > using cruft packs (e.g., by running either `git repack --cruft -d` or > `git gc --cruft`) to collect those unreachable objects together into a > single pack. > > See Documentation/technical/cruft-packs.txt for more information about > cruft packs. Yes, I saw this new feature in the Git release this week and am hoping to roll it out to better mitigate this general problem! With cruft packs, I anticipate being able to restore a longer gc.pruneExpire value as they should mitigate the NFS performance badness we're working around. Thank you for confirming it should help with the problem. ^ permalink raw reply [flat|nested] 16+ messages in thread
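The rollout Gregory describes maps to configuration along these lines. A sketch: the keys are read by `git gc` in Git 2.37 and newer, and the one-day expiry value is illustrative:

```shell
set -e
tmp=$(mktemp -d)
git -c init.defaultBranch=main init -q "$tmp/repo"
# Collect unreachable objects into a cruft pack instead of exploding
# them into individual loose files on each gc...
git -C "$tmp/repo" config gc.cruftPacks true
# ...which makes a longer grace period affordable again.
git -C "$tmp/repo" config gc.pruneExpire 1.day.ago
git -C "$tmp/repo" config --get gc.cruftPacks   # prints "true"
```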
* Re: Race condition between repack and loose-objects maintenance task 2022-06-29 17:16 ` Gregory Szorc @ 2022-06-29 17:21 ` Taylor Blau 2022-06-30 0:44 ` Gregory Szorc 0 siblings, 1 reply; 16+ messages in thread From: Taylor Blau @ 2022-06-29 17:21 UTC (permalink / raw) To: Gregory Szorc; +Cc: git, stolee On Wed, Jun 29, 2022 at 10:16:38AM -0700, Gregory Szorc wrote: > > In either case, my recommendation would be to keep those unreachable > > objects which haven't yet aged out of the repository around for longer, > > which will decrease the likelihood of seeing the race. > > We had to lower gc.pruneExpire from its default of 1 hour because > otherwise this would result in the creation of thousands of loose > objects. This is normally acceptable. However, on NFS, the churn from > lots of file creation and deletion resulted in acceptable performance > degradation. We had to lower gc.pruneExpire to minimize the damage. Makes sense. Having a shorter gc.pruneExpire makes the race much more likely to occur in practice, so I'm glad that we have a plausible confirmation that that's what's going on in your repository. > > See Documentation/technical/cruft-packs.txt for more information about > > cruft packs. > > Yes, I saw this new feature in the Git release this week and am hoping > to roll it out to better mitigate this general problem! With cruft > packs, I anticipate being able to restore a longer gc.pruneExpire > value as they should mitigate the NFS performance badness we're > working around. Thank you for confirming it should help with the > problem. That should definitely help. Let me know if you have any questions, and thanks for trying it out! Thanks, Taylor ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Race condition between repack and loose-objects maintenance task 2022-06-29 17:21 ` Taylor Blau @ 2022-06-30 0:44 ` Gregory Szorc 2022-06-30 3:19 ` Taylor Blau 0 siblings, 1 reply; 16+ messages in thread From: Gregory Szorc @ 2022-06-30 0:44 UTC (permalink / raw) To: Taylor Blau; +Cc: git, stolee On Wed, Jun 29, 2022 at 10:21 AM Taylor Blau <me@ttaylorr.com> wrote: > > On Wed, Jun 29, 2022 at 10:16:38AM -0700, Gregory Szorc wrote: > > > In either case, my recommendation would be to keep those unreachable > > > objects which haven't yet aged out of the repository around for longer, > > > which will decrease the likelihood of seeing the race. > > > > We had to lower gc.pruneExpire from its default of 1 hour because > > otherwise this would result in the creation of thousands of loose > > objects. This is normally acceptable. However, on NFS, the churn from > > lots of file creation and deletion resulted in acceptable performance > > degradation. We had to lower gc.pruneExpire to minimize the damage. > > Makes sense. Having a shorter gc.pruneExpire makes the race much more > likely to occur in practice, so I'm glad that we have a plausible > confirmation that that's what's going on in your repository. > > > > See Documentation/technical/cruft-packs.txt for more information about > > > cruft packs. > > > > Yes, I saw this new feature in the Git release this week and am hoping > > to roll it out to better mitigate this general problem! With cruft > > packs, I anticipate being able to restore a longer gc.pruneExpire > > value as they should mitigate the NFS performance badness we're > > working around. Thank you for confirming it should help with the > > problem. > > That should definitely help. Let me know if you have any questions, and > thanks for trying it out! Revising my initial post, not running `loose-objects` is insufficient: we also need to prevent `incremental-repack` from running while a `git gc` / `git repack` is in progress. 
If an `incremental-repack` runs concurrently with `git repack`, `repack` can error after bitmap generation but before the temp packfile is renamed, with "could not find 'pack-*.pack'". I suspect this has to do with `incremental-repack` deleting packs that were in place when `git repack` started running. But I haven't looked into where exactly repack is failing.

I deployed Git 2.37 and enabled cruft packs via `gc.cruftPacks=true` in the global gitconfig. I simultaneously increased `gc.pruneExpire` to 1 day (as we're no longer concerned about each unreferenced object turning into a single file). And I changed the frequently-invoked `git maintenance` code to not run `loose-objects` or `incremental-repack` if a gc.pid file is present (this is still racy but should hopefully work enough of the time).

This appeared to work: `git gc` created a cruft pack and didn't complain about anything disappearing out from under it, despite `git maintenance` running the `commit-graph` and `pack-refs` tasks during the long-running `git gc` / `git repack`.

However, I think there is yet another bug at play: running `incremental-repack` appears to be able to repack the cruft packfile. In doing so, it deletes its .mtimes file and the metadata inside. This behavior seems wrong to me: I was expecting the cruft packfile to linger until the next `git gc` and/or for its mtimes metadata to be preserved across future repacks. Otherwise, what is the point in retaining granular mtime metadata? Is this behavior expected? Or do cruft packfiles just not work as intended alongside the `incremental-repack` maintenance task / multi-pack index in Git 2.37?

^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Race condition between repack and loose-objects maintenance task 2022-06-30 0:44 ` Gregory Szorc @ 2022-06-30 3:19 ` Taylor Blau 2022-06-30 3:23 ` Taylor Blau 0 siblings, 1 reply; 16+ messages in thread From: Taylor Blau @ 2022-06-30 3:19 UTC (permalink / raw) To: Gregory Szorc; +Cc: Taylor Blau, git, stolee

On Wed, Jun 29, 2022 at 05:44:25PM -0700, Gregory Szorc wrote:
> Revising my initial post, not running `loose-objects` is insufficient:
> we also need to prevent `incremental-repack` from running while a `git
> gc` / `git repack` is in progress. If an `incremental-repack` runs
> concurrently with `git repack`, `repack` can error after bitmap
> generation but before the temp packfile is renamed with "could not
> find 'pack-*.pack'." I suspect this has to do with
> `incremental-repack` deleting packs that were initially in place when
> `git repack` started running. But I haven't looked into where exactly
> repack is failing.

Yeah, you would need to prevent other writers from removing packs while writing a cruft pack. I think the canonical way to do this would be to let `git maintenance` use its own locking to ensure that it wasn't trying to remove packs while simultaneously generating a cruft pack. But assuming that the cruft pack generation is running independently and concurrently with the maintenance incremental-repack task (which will delete packs in the background), then there is definitely a TOCTOU race there.

The race is that `repack` will signal to `pack-objects` which packs will stay and which packs will be deleted during the repack. To generate a cruft pack, the packs that stay are formed by doing a reachability traversal and writing a pack that contains all reachable objects. That way, anything in `all-packs \ reachable` (which forms the contents of the cruft pack) is going to be just the unreachable objects.
But if a pack that was going to be deleted by repack *after* generating the cruft pack disappears while `repack` is running (particularly after it starts, but before it generates the cruft pack), then the cruft pack generation can't proceed, since it has no idea what objects it needs to add. Namely, if it is told that pack P is going to be deleted, any objects in P which don't appear in a pack that *isn't* going to be deleted need to be saved. If we don't have a copy of P anymore, then there is no way to come up with what that set of objects is, meaning that it's impossible to generate the cruft pack. I should mention, BTW, that this isn't a problem unique to cruft packs. Geometric repacking, which uses `pack-objects --stdin-packs` has a similar issue where if a pack being combined is removed from underneath while `repack` is running, `pack-objects` cannot successfully generate the pack. > However, I think there is yet another bug at play: running > `incremental-repack` appears to be able to repack the cruft packfile. > In doing so, it deletes its .mtimes file and the metadata inside. That sounds like a bug to me. I'll take a look into it and see what I can find. Thanks, Taylor ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Race condition between repack and loose-objects maintenance task 2022-06-30 3:19 ` Taylor Blau @ 2022-06-30 3:23 ` Taylor Blau 2022-06-30 20:12 ` Junio C Hamano ` (2 more replies) 0 siblings, 3 replies; 16+ messages in thread From: Taylor Blau @ 2022-06-30 3:23 UTC (permalink / raw) To: Taylor Blau; +Cc: Gregory Szorc, git, stolee On Wed, Jun 29, 2022 at 11:19:09PM -0400, Taylor Blau wrote: > > However, I think there is yet another bug at play: running > > `incremental-repack` appears to be able to repack the cruft packfile. > > In doing so, it deletes its .mtimes file and the metadata inside. > > That sounds like a bug to me. I'll take a look into it and see what I > can find. I actually think that there are two bugs here. One is that we run a MIDX repack and expire, which could lead to us repacking the entire contents of the cruft pack and then expiring the metadata file. This is a bug, and we should exclude cruft packs from this computation. Another bug can happen when the cruft pack gets written into the MIDX and is MIDX-expireable (meaning that no objects are selected from the cruft pack). In that case, the `git multi-pack-index expire` step would remove the cruft pack entirely, which is also incorrect. I'll take a look at fixing both of these, and thanks for pointing them out! Thanks, Taylor ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Race condition between repack and loose-objects maintenance task 2022-06-30 3:23 ` Taylor Blau @ 2022-06-30 20:12 ` Junio C Hamano 2022-07-05 18:43 ` Gregory Szorc 2022-07-20 1:40 ` Gregory Szorc 2022-07-20 1:41 ` Gregory Szorc 2 siblings, 1 reply; 16+ messages in thread From: Junio C Hamano @ 2022-06-30 20:12 UTC (permalink / raw) To: Taylor Blau; +Cc: Gregory Szorc, git, stolee

Taylor Blau <me@ttaylorr.com> writes:

> On Wed, Jun 29, 2022 at 11:19:09PM -0400, Taylor Blau wrote:
>> > However, I think there is yet another bug at play: running
>> > `incremental-repack` appears to be able to repack the cruft packfile.
>> > In doing so, it deletes its .mtimes file and the metadata inside.
>>
>> That sounds like a bug to me. I'll take a look into it and see what I
>> can find.
>
> I actually think that there are two bugs here.
>
> One is that we run a MIDX repack and expire, which could lead to us
> repacking the entire contents of the cruft pack and then expiring the
> metadata file. This is a bug, and we should exclude cruft packs from
> this computation.
>
> Another bug can happen when the cruft pack gets written into the MIDX
> and is MIDX-expireable (meaning that no objects are selected from the
> cruft pack). In that case, the `git multi-pack-index expire` step would
> remove the cruft pack entirely, which is also incorrect.
>
> I'll take a look at fixing both of these, and thanks for pointing them
> out!

Thanks, both.

The fact that the semantics of the .mtimes file are not equivalent to the mtimes on individual loose objects does not make it any easier to think about the possible ways the "cruft" pack can break, and both of the possible issues you discuss above are indeed tricky ones.

^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Race condition between repack and loose-objects maintenance task 2022-06-30 20:12 ` Junio C Hamano @ 2022-07-05 18:43 ` Gregory Szorc 2022-07-06 8:50 ` Ævar Arnfjörð Bjarmason 0 siblings, 1 reply; 16+ messages in thread From: Gregory Szorc @ 2022-07-05 18:43 UTC (permalink / raw) To: Junio C Hamano; +Cc: Taylor Blau, git, stolee

Thinking about this some more, it is clear that running `git gc` / `git repack` *and* `git maintenance` simultaneously is prone to badness, since their locking mechanisms aren't aware of each other. I think it makes sense to consolidate onto `git maintenance` going forward, as this seems to be where the inertia is. However, a barrier to that for me is that the objects/maintenance.lock file has no acquisition timeout and will wait indefinitely. So in scenarios like mine, where you have multiple processes looping over repos invoking `git maintenance run`, you can have extended execution stalls during long-running operations like `gc`.

There should probably be a configurable timeout for the maintenance lock like there is for other locks. This shouldn't be too much work and I may throw up a patch if others feel this is a good idea.

On Thu, Jun 30, 2022 at 1:12 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Taylor Blau <me@ttaylorr.com> writes:
>
> > On Wed, Jun 29, 2022 at 11:19:09PM -0400, Taylor Blau wrote:
> >> > However, I think there is yet another bug at play: running
> >> > `incremental-repack` appears to be able to repack the cruft packfile.
> >> > In doing so, it deletes its .mtimes file and the metadata inside.
> >>
> >> That sounds like a bug to me. I'll take a look into it and see what I
> >> can find.
> >
> > I actually think that there are two bugs here.
> >
> > One is that we run a MIDX repack and expire, which could lead to us
> > repacking the entire contents of the cruft pack and then expiring the
> > metadata file. This is a bug, and we should exclude cruft packs from
> > this computation.
> > > > Another bug can happen when the cruft pack gets written into the MIDX > > and is MIDX-expireable (meaning that no objects are selected from the > > cruft pack). In that case, the `git multi-pack-index expire` step would > > remove the cruft pack entirely, which is also incorrect. > > > > I'll take a look at fixing both of these, and thanks for pointing them > > out! > > Thanks, both. > > The fact that the semantics of the .mtimes file being not equivalent > to the mtime on individual loose objects does not help thinking > about the possible ways the "cruft" pack can break, and both of the > possible issues you discuss above are indeed tricky ones. > > ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Race condition between repack and loose-objects maintenance task 2022-07-05 18:43 ` Gregory Szorc @ 2022-07-06 8:50 ` Ævar Arnfjörð Bjarmason 0 siblings, 0 replies; 16+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2022-07-06 8:50 UTC (permalink / raw) To: Gregory Szorc; +Cc: Junio C Hamano, Taylor Blau, git, stolee

On Tue, Jul 05 2022, Gregory Szorc wrote:

> Thinking about this some more, it is clear that running `git gc` /
> `git repack` *and* `git maintenance` simultaneously is prone to
> badness since their locking mechanisms aren't aware of the other.

Yes, there are outstanding issues with the "maintenance" and "gc" locks, they should be unified, and the gc.lock has bugs & race conditions that need to be fixed. See these past threads:

https://lore.kernel.org/git/87d02fi75p.fsf@evledraar.gmail.com/
https://lore.kernel.org/git/87ef3o7ws1.fsf@evledraar.gmail.com/

I also have some old WIP patches to add a "gc testing" mode to our test suite, similar to the "commit graph" instrumentation, if you're interested in working on this. I.e. almost every command will fork off a "git gc --auto", and we'll see based on the errors whether our locking is still buggy/racy.

> I think it makes sense to consolidate onto `git maintenance` going
> forward as this seems to be where the inertia is. However, a barrier
> to that for me is the objects/maintenance.lock file has no acquisition
> timeout and will wait indefinitely. So in scenarios like mine where
> you have multiple processes looping over repos invoking `git
> maintenance run`, you can have extended execution stalls during
> long-running operations like `gc`.
> [...]
> There should probably be a configurable timeout for the maintenance
> lock like there is for other locks. This shouldn't be too much work
> and I may throw up a patch if others feel this is a good idea.
...but while I think this is all worthwhile, you're on the wrong path if you think this will help much, or at all, with the issue being reported here.

Fixing the gc.lock (and maintenance lock...) would be nice because we'd have "gc" be less dumb, and it wouldn't get itself into lock races etc.

But you cannot hope to fix the underlying "object expiry as things become reachable" race condition that way, because that's those tasks racing with *other* object/ref creation. Such a fix would either need some repository-global lock (which would entail all sorts of nastiness), or other workarounds for the inherent race between different object store & ref operations; see Taylor's recent "cruft pack write-out" series & my links to some past discussions of the race here:

https://lore.kernel.org/git/220630.86y1xeeeik.gmgdl@evledraar.gmail.com/

^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Race condition between repack and loose-objects maintenance task 2022-06-30 3:23 ` Taylor Blau 2022-06-30 20:12 ` Junio C Hamano @ 2022-07-20 1:40 ` Gregory Szorc 2022-07-20 9:52 ` Ævar Arnfjörð Bjarmason 2022-07-20 1:41 ` Gregory Szorc 2 siblings, 1 reply; 16+ messages in thread From: Gregory Szorc @ 2022-07-20 1:40 UTC (permalink / raw) To: git; +Cc: git, stolee

On 6/29/2022 8:23 PM, Taylor Blau wrote:
> On Wed, Jun 29, 2022 at 11:19:09PM -0400, Taylor Blau wrote:
>>> However, I think there is yet another bug at play: running
>>> `incremental-repack` appears to be able to repack the cruft packfile.
>>> In doing so, it deletes its .mtimes file and the metadata inside.
>>
>> That sounds like a bug to me. I'll take a look into it and see what I
>> can find.
>
> I actually think that there are two bugs here.
>
> One is that we run a MIDX repack and expire, which could lead to us
> repacking the entire contents of the cruft pack and then expiring the
> metadata file. This is a bug, and we should exclude cruft packs from
> this computation.
>
> Another bug can happen when the cruft pack gets written into the MIDX
> and is MIDX-expireable (meaning that no objects are selected from the
> cruft pack). In that case, the `git multi-pack-index expire` step would
> remove the cruft pack entirely, which is also incorrect.
>
> I'll take a look at fixing both of these, and thanks for pointing them
> out!

For posterity, when I disabled cruft packfiles after having them enabled for a few weeks, the next `git gc` invocation on a high-traffic repo resulted in >100k loose objects/files being created before they were summarily deleted by the GC's prune. This is significantly greater than the unreferenced object creation rate of the underlying repo. So it appears as if the MIDX stripping of the cruft packfile mtimes effectively disabled pruning, leading to a build-up of unreferenced objects.

Fortunately I hadn't deployed cruft packfiles to production.
If I had, the excessive filesystem churn on NFS would have caused an incident due to degraded performance. Since the interaction between cruft packfiles and MIDX appears to be buggy, I think I'm going to leave cruft packfiles disabled until these features work better together. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Race condition between repack and loose-objects maintenance task 2022-07-20 1:40 ` Gregory Szorc @ 2022-07-20 9:52 ` Ævar Arnfjörð Bjarmason 2022-07-26 16:22 ` Gregory Szorc 0 siblings, 1 reply; 16+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2022-07-20 9:52 UTC (permalink / raw) To: Gregory Szorc; +Cc: git, stolee, Taylor Blau On Tue, Jul 19 2022, Gregory Szorc wrote: > On 6/29/2022 8:23 PM, Taylor Blau wrote: >> On Wed, Jun 29, 2022 at 11:19:09PM -0400, Taylor Blau wrote: >>>> However, I think there is yet another bug at play: running >>>> `incremental-repack` appears to be able to repack the cruft packfile. >>>> In doing so, it deletes its .mtimes file and the metadata inside. >>> >>> That sounds like a bug to me. I'll take a look into it and see what I >>> can find. >> I actually think that there are two bugs here. >> One is that we run a MIDX repack and expire, which could lead to us >> repacking the entire contents of the cruft pack and then expiring the >> metadata file. This is a bug, and we should exclude cruft packs from >> this computation. >> Another bug can happen when the cruft pack gets written into the >> MIDX >> and is MIDX-expireable (meaning that no objects are selected from the >> cruft pack). In that case, the `git multi-pack-index expire` step would >> remove the cruft pack entirely, which is also incorrect. >> I'll take a look at fixing both of these, and thanks for pointing >> them >> out! > > For posterity, when I disabled cruft packfiles after having it enabled > for a few weeks, the next `git gc` invocation on a high traffic repo > resulted in >100k loose objects/files being created before they were > summarily deleted by the GC's prune. This is significantly greater > than the unreferenced object creation rate of the underlying repo. So > it appears as if the MIDX stripping of the cruft packfile mtimes > effectively disabled pruning, leading to a build-up of unreferenced objects. 
> Fortunately I hadn't deployed cruft packfiles to production. If I had,
> the excessive filesystem churn on NFS would have caused an incident
> due to degraded performance.
>
> Since the interaction between cruft packfiles and MIDX appears to be
> buggy, I think I'm going to leave cruft packfiles disabled until these
> features work better together.

I haven't looked deeply into whether there's any cruft packfile & MIDX interaction at play here, but I suspect, based on past experience, that this has nothing whatsoever to do with the MIDX. It's because git suddenly found a bunch of objects that should be evicted from the packs, as you disabled the cruft pack feature.

Here's a past test-case & report of mine where I ran into that:

https://lore.kernel.org/git/87fu6bmr0j.fsf@evledraar.gmail.com/

That was way before cruft packfiles were integrated, but from my understanding of how they work the mechanism is exactly the same. I.e. in that case I deleted a bunch of refs (tag objects), which caused those to become unreferenced, so on the next "gc/repack" they were all evicted from their respective packs, at which point the "gc.pruneExpire" clock started ticking for them.

Whereas in your case you disabled "cruft packs", which are basically a mechanism where git keeps such "unused" objects in a pack. Once you disabled it, git stopped considering the new *.mtimes file, and started extracting these loose objects en masse.

Which, b.t.w., is something I think we might want to reconsider. I.e. we just pass "--cruft" to pack-objects (see b757353676d (builtin/pack-objects.c: --cruft without expiration, 2022-05-20)), but don't have a "write cruft" v.s. "read cruft" setting (but see below).

The one thing that gives me pause here is your mention of "summarily deleted by the GC's prune". Did you run "git prune --expire=now" manually, or do you have a really aggressive "gc.pruneExpire" setting? It's also possible that I'm entirely misunderstanding this whole thing...

Anyway.
Since that 2018-era post of mine I've thought a bit about how to best handle this particular edge case. I think the simplest implementation that would work well enough would be to teach "git repack" to limit how many objects it's willing to extract from a pack. I.e. we'd cap the number of "unreachable" objects we're willing to eject from the pack, with a check similar to how "gc.auto" works (i.e. once we reach the limit we'd implicitly turn on "--keep-unreachable").

It's far from perfect, and e.g. if the N objects you're extracting happen to delta-compress particularly well, you could still have cases where e.g. a 1GB repo becomes a 10GB one with the new loose objects. But it would allow us to limit the common case where we'd e.g. create 1k new loose objects, not 100k or whatever.

It does have the downside that unless you'd disable that limit, the "gc" -> "gc.pruneExpire" window would become even more fuzzy. I.e. now if you run "gc" the clock starts ticking right away; after this, the clock would start ticking whenever we happened to extract those objects.

So there are certainly less dumb ways to do this, and we'd need to make sure that whatever limit we set is aggressive enough that we'd always "stay ahead". I.e. if a repo just has a very high creation rate of objects that become unreachable, we'd need to ensure that we wouldn't have an ever-growing .git, as our eviction rate wouldn't keep up with our additions.

For cruft packs (this is a continuation of the "see below" above) we'd do the same thing, but could benefit from knowing that such objects are "already expired", so we could read the *.mtimes, and possibly expire some right away.

Aside from some very obscure use-cases, I think these sorts of "loose explosions" are one-off events, so it would be a clear win if we can smooth them out. The various gc.* settings can be hard to reason about, but we really do try to be gentle by default, yet notably fail very badly at that in some known edge cases, such as this one.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Race condition between repack and loose-objects maintenance task 2022-07-20 9:52 ` Ævar Arnfjörð Bjarmason @ 2022-07-26 16:22 ` Gregory Szorc 2022-07-26 18:11 ` Ævar Arnfjörð Bjarmason 0 siblings, 1 reply; 16+ messages in thread From: Gregory Szorc @ 2022-07-26 16:22 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason; +Cc: git, stolee, Taylor Blau On Wed, Jul 20, 2022 at 3:12 AM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote: > > > On Tue, Jul 19 2022, Gregory Szorc wrote: > > > On 6/29/2022 8:23 PM, Taylor Blau wrote: > >> On Wed, Jun 29, 2022 at 11:19:09PM -0400, Taylor Blau wrote: > >>>> However, I think there is yet another bug at play: running > >>>> `incremental-repack` appears to be able to repack the cruft packfile. > >>>> In doing so, it deletes its .mtimes file and the metadata inside. > >>> > >>> That sounds like a bug to me. I'll take a look into it and see what I > >>> can find. > >> I actually think that there are two bugs here. > >> One is that we run a MIDX repack and expire, which could lead to us > >> repacking the entire contents of the cruft pack and then expiring the > >> metadata file. This is a bug, and we should exclude cruft packs from > >> this computation. > >> Another bug can happen when the cruft pack gets written into the > >> MIDX > >> and is MIDX-expireable (meaning that no objects are selected from the > >> cruft pack). In that case, the `git multi-pack-index expire` step would > >> remove the cruft pack entirely, which is also incorrect. > >> I'll take a look at fixing both of these, and thanks for pointing > >> them > >> out! > > > > For posterity, when I disabled cruft packfiles after having it enabled > > for a few weeks, the next `git gc` invocation on a high traffic repo > > resulted in >100k loose objects/files being created before they were > > summarily deleted by the GC's prune. This is significantly greater > > than the unreferenced object creation rate of the underlying repo. 
So > > it appears as if the MIDX stripping of the cruft packfile mtimes > > effectively disabled pruning, leading to a build-up of unreferenced objects. > > > > Fortunately I hadn't deployed cruft packfiles to production. If I had, > > the excessive filesystem churn on NFS would have caused an incident > > due to degraded performance. > > > > Since the interaction between cruft packfiles and MIDX appears to be > > buggy, I think I'm going to leave cruft packfiles disabled until these > > features work better together. > > I haven't looked deeply into whether there's any cruft packfile & MIDX > interaction at play here, but I suspect based on past experience that > this has nothing whatsoever to do with the MIDX. Taylor seemed to acknowledge a bug here a few posts back. Essentially: 1. Cruft packfile created 2. MIDX repack and expire treats the cruft packfile as a regular pack and folds its content into a new, regular packfile, deleting the mtimes file/metadata in the process. I don't fully grok the mechanism, but I believe the loss of the mtimes metadata effectively results in indefinite retention of unreferenced objects. > > It's because git suddenly found a bunch of objects that should be > evicted from the packs, as you disabled the cruft pack feature. > > Here's a past test-case & report of mine where I ran into that: > > https://lore.kernel.org/git/87fu6bmr0j.fsf@evledraar.gmail.com/ > > That was way before cruft packfiles were integrated, but from my > understanding of how they work the mechanism is exactly the same. > > I.e. in that case I deleted a bunch of refs (tag objects), which caused > those to become unreferenced, so on the next "gc/repack" they were all > evicted from their respective packs, at which point the "gc.pruneExpire" > clock started ticking for them. > > Whereas in your case you disabled "cruft packs", which are basically a > mechanism where git keeps such "unused" objects in a pack. 
Once you > disabled it git stopped considering the new *.mtimes file, and started > extracting these loose objects en-masse. > > Which b.t.w. is something I think we might want to reconsider. I.e. we > just pass "--cruft" to pack-objects (see b757353676d > (builtin/pack-objects.c: --cruft without expiration, 2022-05-20)), but > don't have a "write cruft" v.s. "read cruft" setting (but see below). > > The one thing that gives me pause here is your mention of "summarily > deleted by the GC's prune". Did you run "git prune --expire=now" > manually, or do you have a really aggressive "gc.pruneExpire" setting? > It's also posible that I'm entirely misunderstanding this whole thing... > > Anyway. > > Since that 2018-era post of mine I've thought a bit about how to best > handle this particular edge case. > > I think the simplest implementation that would work well enough would be > to teach "git repack" to limit how many objects it's willing to extract > from a pack. I.e. we'd have the amount of "unreachable" objects we'd be > willing to eject from the pack limited, with check similar to how > "gc.auto" works (i.e. once we reach the limit we'd implicitly turn on > "--keep-unreachable"). > > It's far from perfect, and e.g. if the N objects you're extracting > happen to delta-compress particularly well you could still have cases > where e.g. a 1GB repo becomes a 10GB one with the new loose objects. > > But it would allow us to limit the common case where we'd e.g. create 1k > new loose objects, not 100k or whatever. > > It does have the downside that unless you'd disable that limit the "gc" > -> "gc.pruneExpire" window would become even more fuzzy. I.e. now if you > run "gc" the clock starts ticking right away, after this the clock would > start ticking whenever we happened to extract those objects. > > So there's certainly less dumb ways to do this, and we'd need to make > sure that whatever limit we set is aggressive enough that we'd always > "stay ahead". I.e. 
if a repo just has a very high creation rate of > objects that become unreachable we'd need to ensure that we wouldn't > have an ever growing .git as our eviction rate wouldn't keep up with our > additions. I agree this is a hard problem! In my case, the underlying repo sees a few thousand pushes daily. This results in ~30k new unreferenced objects on a typical workday. With NFS/EFS, we've observed that the entire filesystem can get slow once object loosening has written as few as a few thousand files. Given our high volume, I worry that imposing a loose object creation ceiling will perpetually throttle and we'll accumulate unreferenced objects indefinitely. > For cruft packs (this is a continuation of the "see below" above) we'd > do the same thing, but could benefit from knowing that such objects are > "already expired", so we could read the *.mtimes, and possibly expire > some right away. I don't fully grok how pruning works in a cruft packfile world. My naive hope is that the process creating the cruft packfile is able to elide expired objects so we don't even bother with them during cruft packfile generation. (It seems inefficient/avoidable that we write an expired object - loose or cruft packed - from the context of a gc only to delete it moments later.) > Aside from some very obscure use-cases I think these these sort of > "loose explosion" are one-off events, so if we can smooth those out. The > various gc.* settings can be hard to reason about, but we really do try > to be gently by default, but notably fail very badly at that in some > known edge cases, such as this one. I agree that my scenario of creating an explosion of loose objects is a one-off. IMO not worth spending too much time thinking about. That being said, ~daily pruning on a large enough repo starts to resemble these one-off events. And I think you need something like cruft packfiles to mitigate it. 
If cruft packfiles are ever enabled by default, it may have a nice side effect of preventing these one-offs! (But again, I don't grok how pruning works in a cruft packfile world.) ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Race condition between repack and loose-objects maintenance task 2022-07-26 16:22 ` Gregory Szorc @ 2022-07-26 18:11 ` Ævar Arnfjörð Bjarmason 0 siblings, 0 replies; 16+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2022-07-26 18:11 UTC (permalink / raw) To: Gregory Szorc; +Cc: git, stolee, Taylor Blau On Tue, Jul 26 2022, Gregory Szorc wrote: > On Wed, Jul 20, 2022 at 3:12 AM Ævar Arnfjörð Bjarmason > <avarab@gmail.com> wrote: >> >> >> On Tue, Jul 19 2022, Gregory Szorc wrote: >> >> > On 6/29/2022 8:23 PM, Taylor Blau wrote: >> >> On Wed, Jun 29, 2022 at 11:19:09PM -0400, Taylor Blau wrote: >>> [...] >> > Since the interaction between cruft packfiles and MIDX appears to be >> > buggy, I think I'm going to leave cruft packfiles disabled until these >> > features work better together. >> >> I haven't looked deeply into whether there's any cruft packfile & MIDX >> interaction at play here, but I suspect based on past experience that >> this has nothing whatsoever to do with the MIDX. > > Taylor seemed to acknowledge a bug here a few posts back. Essentially: > > 1. Cruft packfile created > 2. MIDX repack and expire treats the cruft packfile as a regular pack > and folds its content into a new, regular packfile, deleting the > mtimes file/metadata in the process. > > I don't fully grok the mechanism, but I believe the loss of the mtimes > metadata effectively results in indefinite retention of unreferenced > objects. Maybe, but I don't see how indefinite retention would happen unless you're flip-flopping back & forth. If we're deleting the mtimes file but keeping the pack they're in we'd notice that those packed objects are unreferenced on the next gc, so we'd explode them into loose objects, and then the timer starts ticking on their expiry. It's a lot of churn, but once that timer expires we'll then expire those objects. 
>> >> It's because git suddenly found a bunch of objects that should be >> evicted from the packs, as you disabled the cruft pack feature. >> >> Here's a past test-case & report of mine where I ran into that: >> >> https://lore.kernel.org/git/87fu6bmr0j.fsf@evledraar.gmail.com/ >> >> That was way before cruft packfiles were integrated, but from my >> understanding of how they work the mechanism is exactly the same. >> >> I.e. in that case I deleted a bunch of refs (tag objects), which caused >> those to become unreferenced, so on the next "gc/repack" they were all >> evicted from their respective packs, at which point the "gc.pruneExpire" >> clock started ticking for them. >> >> Whereas in your case you disabled "cruft packs", which are basically a >> mechanism where git keeps such "unused" objects in a pack. Once you >> disabled it git stopped considering the new *.mtimes file, and started >> extracting these loose objects en-masse. >> >> Which b.t.w. is something I think we might want to reconsider. I.e. we >> just pass "--cruft" to pack-objects (see b757353676d >> (builtin/pack-objects.c: --cruft without expiration, 2022-05-20)), but >> don't have a "write cruft" v.s. "read cruft" setting (but see below). >> >> The one thing that gives me pause here is your mention of "summarily >> deleted by the GC's prune". Did you run "git prune --expire=now" >> manually, or do you have a really aggressive "gc.pruneExpire" setting? >> It's also posible that I'm entirely misunderstanding this whole thing... >> >> Anyway. >> >> Since that 2018-era post of mine I've thought a bit about how to best >> handle this particular edge case. >> >> I think the simplest implementation that would work well enough would be >> to teach "git repack" to limit how many objects it's willing to extract >> from a pack. I.e. we'd have the amount of "unreachable" objects we'd be >> willing to eject from the pack limited, with check similar to how >> "gc.auto" works (i.e. 
once we reach the limit we'd implicitly turn on >> "--keep-unreachable"). >> >> It's far from perfect, and e.g. if the N objects you're extracting >> happen to delta-compress particularly well you could still have cases >> where e.g. a 1GB repo becomes a 10GB one with the new loose objects. >> >> But it would allow us to limit the common case where we'd e.g. create 1k >> new loose objects, not 100k or whatever. >> >> It does have the downside that unless you'd disable that limit the "gc" >> -> "gc.pruneExpire" window would become even more fuzzy. I.e. now if you >> run "gc" the clock starts ticking right away, after this the clock would >> start ticking whenever we happened to extract those objects. >> >> So there's certainly less dumb ways to do this, and we'd need to make >> sure that whatever limit we set is aggressive enough that we'd always >> "stay ahead". I.e. if a repo just has a very high creation rate of >> objects that become unreachable we'd need to ensure that we wouldn't >> have an ever growing .git as our eviction rate wouldn't keep up with our >> additions. > > I agree this is a hard problem! In my case, the underlying repo sees a > few thousand pushes daily. This results in ~30k new unreferenced > objects on a typical workday. With NFS/EFS, we've observed that the > entire filesystem can get slow once object loosening has written as > few as a few thousand files. Given our high volume, I worry that > imposing a loose object creation ceiling will perpetually throttle and > we'll accumulate unreferenced objects indefinitely. If you're feeling adventurous you may want to try to rebase or otherwise experiment with this old series of mine, that never made it in: https://lore.kernel.org/git/20181028225023.26427-1-avarab@gmail.com/ I don't run git on NFS anymore, but when I did I found that the main contributor to loose object-related slowness on pushes was the collision checking. Whereas if you're willing to simply write duplicate objects .... 
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Race condition between repack and loose-objects maintenance task 2022-06-30 3:23 ` Taylor Blau 2022-06-30 20:12 ` Junio C Hamano 2022-07-20 1:40 ` Gregory Szorc @ 2022-07-20 1:41 ` Gregory Szorc 2 siblings, 0 replies; 16+ messages in thread From: Gregory Szorc @ 2022-07-20 1:41 UTC (permalink / raw) To: Taylor Blau; +Cc: git, stolee On Wed, Jun 29, 2022 at 8:23 PM Taylor Blau <me@ttaylorr.com> wrote: > > On Wed, Jun 29, 2022 at 11:19:09PM -0400, Taylor Blau wrote: > > > However, I think there is yet another bug at play: running > > > `incremental-repack` appears to be able to repack the cruft packfile. > > > In doing so, it deletes its .mtimes file and the metadata inside. > > > > That sounds like a bug to me. I'll take a look into it and see what I > > can find. > > I actually think that there are two bugs here. > > One is that we run a MIDX repack and expire, which could lead to us > repacking the entire contents of the cruft pack and then expiring the > metadata file. This is a bug, and we should exclude cruft packs from > this computation. > > Another bug can happen when the cruft pack gets written into the MIDX > and is MIDX-expireable (meaning that no objects are selected from the > cruft pack). In that case, the `git multi-pack-index expire` step would > remove the cruft pack entirely, which is also incorrect. > > I'll take a look at fixing both of these, and thanks for pointing them > out! For posterity, when I disabled cruft packfiles after having it enabled for a few weeks, the next `git gc` invocation on a high traffic repo resulted in >100k loose objects/files being created before they were summarily deleted by the GC's prune. This is significantly greater than the unreferenced object creation rate of the underlying repo. So it appears as if the MIDX stripping of the cruft packfile mtimes effectively disabled pruning, leading to a build-up of unreferenced objects. Fortunately I hadn't deployed cruft packfiles to production. 
If I had, the excessive filesystem churn on NFS would have caused an incident due to degraded performance. Since the interaction between cruft packfiles and MIDX appears to be buggy, I think I'm going to leave cruft packfiles disabled until these features work better together. ^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2022-07-26 18:18 UTC | newest]

Thread overview: 16+ messages:
2022-06-29 16:55 Race condition between repack and loose-objects maintenance task Gregory Szorc
2022-06-29 17:03 ` Taylor Blau
2022-06-29 17:10   ` Taylor Blau
2022-06-29 17:16   ` Gregory Szorc
2022-06-29 17:21     ` Taylor Blau
2022-06-30  0:44       ` Gregory Szorc
2022-06-30  3:19         ` Taylor Blau
2022-06-30  3:23           ` Taylor Blau
2022-06-30 20:12             ` Junio C Hamano
2022-07-05 18:43             ` Gregory Szorc
2022-07-06  8:50               ` Ævar Arnfjörð Bjarmason
2022-07-20  1:40             ` Gregory Szorc
2022-07-20  9:52               ` Ævar Arnfjörð Bjarmason
2022-07-26 16:22                 ` Gregory Szorc
2022-07-26 18:11                   ` Ævar Arnfjörð Bjarmason
2022-07-20  1:41             ` Gregory Szorc