From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on dcvr.yhbt.net X-Spam-Level: X-Spam-ASN: AS31976 209.132.180.0/23 X-Spam-Status: No, score=-3.0 required=3.0 tests=AWL,BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, FROM_EXCESS_BASE64,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI shortcircuit=no autolearn=ham autolearn_force=no version=3.4.2 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by dcvr.yhbt.net (Postfix) with ESMTP id 1C57E1F45F for ; Mon, 6 May 2019 09:44:15 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726085AbfEFJoN (ORCPT ); Mon, 6 May 2019 05:44:13 -0400 Received: from mail-ed1-f66.google.com ([209.85.208.66]:34290 "EHLO mail-ed1-f66.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1725883AbfEFJoN (ORCPT ); Mon, 6 May 2019 05:44:13 -0400 Received: by mail-ed1-f66.google.com with SMTP id w35so12961981edd.1 for ; Mon, 06 May 2019 02:44:11 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:references:user-agent:in-reply-to:date :message-id:mime-version:content-transfer-encoding; bh=lWFycCqpq98sPeSI10ZuBVjIRVJ+St9t+0DTR03Atv0=; b=UnOZLoYGEhah35AhhmfmDLKM6o65LFuWpM8XFDmwSXobpnz8OR12Xlc6uMw9H06/Yd Yl6EjrTbOzLODvpeqD7aKGILIXOwp7ubc0dPyxEmIHmVemYd6MoxvW+XT9PGuY6GF+2e JPvsNoQP44kgDIREdDVVdGuQHPEMC5teMYehmdA3JctRb6yXvBVZ1j/uHAX/2kgKnZZC RHIptj1afS2IMqhHKax9nFXjR4ct0H9ACf58/qtKPfnNYtcWYVLvYlbmqCinJmVPbcsH klAXrL87nqmBIpF1AITQ7x2gJVMkVGxdkq/Mpgc6bIRgfLWAwb2d6zijnKy1QrRZy+XN Yprw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:references:user-agent :in-reply-to:date:message-id:mime-version:content-transfer-encoding; bh=lWFycCqpq98sPeSI10ZuBVjIRVJ+St9t+0DTR03Atv0=; b=hgcWnjWpvizHnt6af3Af6RuCEnyh2tRvCoR2FnqKmKMynADJqo2huPkSC0ZJLhx6hX lZMTawXpAiDDmsuv7IkVp5xFTTvU1T4Hde7ZH9XnrR9oeHhetfZ2jJfvx6qXLMU9ATof kzt+p4Eb5UW3zXcOZ5UwKzwK/LXhMHPnOBCAgXIaAhieznqXZ2z7jHN408pS1JkuGM0o LbHXGv0H+VuAYpg/1cIdjcIwyERQMo9yf7HNxKor7WFAK8x3/xlWZfjz/k2dkRXFct9I So5p1DScTqEMo2zRCDiL3pK76fMAQ+pNXA9SfeMhaX2Emx6bywBNLGdgGCMPC4w6nZUW lhcg== X-Gm-Message-State: APjAAAXkpJwErCeGMq5MfOhyPeWX354RZGlre/jX1jHO5nC7HiSsJisl k0c83RQJC8wMQ33TUaNWcEbjoV0A X-Google-Smtp-Source: APXvYqwqefdFAwYt/jJrK1n62QltnvwuhLwIdM4fT7f/dYL+bzHgc1FMU1cvcIHybomUhGlG9zg15Q== X-Received: by 2002:a50:991d:: with SMTP id k29mr23980026edb.29.1557135851162; Mon, 06 May 2019 02:44:11 -0700 (PDT) Received: from evledraar ([5.57.21.49]) by smtp.gmail.com with ESMTPSA id l6sm1512560eja.91.2019.05.06.02.44.09 (version=TLS1_2 cipher=ECDHE-RSA-CHACHA20-POLY1305 bits=256/256); Mon, 06 May 2019 02:44:09 -0700 (PDT) From: =?utf-8?B?w4Z2YXIgQXJuZmrDtnLDsA==?= Bjarmason To: Jeff King Cc: git@vger.kernel.org, Junio C Hamano , =?utf-8?B?Tmd1eeG7hW4gVGjDoWkgTmfhu41j?= Duy , Michael Haggerty , Stefan Beller , Jonathan Nieder , Matt McCutchen Subject: Re: [PATCH 0/4] gc docs: modernize and fix the documentation References: <20190318161502.7979-1-avarab@gmail.com> <20190318215107.GH29661@sigill.intra.peff.net> <87ftrjer8s.fsf@evledraar.gmail.com> <20190319001829.GL29661@sigill.intra.peff.net> User-agent: Debian GNU/Linux buster/sid; Emacs 26.1; mu4e 1.1.0 In-reply-to: <20190319001829.GL29661@sigill.intra.peff.net> Date: Mon, 06 May 2019 11:44:06 +0200 Message-ID: <878svjj4t5.fsf@evledraar.gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org On Tue, Mar 19 2019, Jeff King wrote: > On Mon, Mar 18, 2019 at 11:45:39PM +0100, =C3=86var Arnfj=C3=B6r=C3=B0 Bj= armason wrote: > >> > I don't think the quarantine stuff should impact contention at all. It= 's >> > only quarantining the objects, which are the least contentious part of >> > Git (because object content is idempotent, so we don't do any locking >> > there, and with two racing processes, one will just "win"). >> >> Without the quarantine, isn't there the race that the NOTES section >> talks about (unless I've misread it). > > Ah, OK, I wasn't quite sure which documentation you were talking about. > I see the discussion now in the "NOTES" section of git-gc(1). > >> I.e. we have some loose object "ABCD" not referrred to by anything for >> the last 2 weeks, as we're gc-ing a ref update comes in that makes it >> referenced again. We then delete "ABCD" (not used!) at the same time the >> ref update happens, and get corruption. >> >> Whereas the quarantine might work around since the client will have sent >> ABCD with no reference pointing to it to the server in the temp pack, >> which we then rename in-place and then update the ref, so we don't care >> if "ABCD" goes away. > > tl;dr I don't think quarantine impacts this, but if you really want gory > details, read on. > > This is a problem with or without the quarantine. It's fundamentally a > race because we do not atomically say "is anybody using X? If not, we > can delete it" and some other process saying "I'd like to use X". > > Pushes are actually better off than most operations, because we only > advertise what's reachable, and the client is expected to send > everything else. So with just a normal update-ref call, we could race > like this: > > 1. ABCD is ancient. > > 2. Process 1 (update-ref) wants to reference ABCD. It sees that we > have it. > > 3. Process 2 (gc/prune) sees that nobody references it. It deletes > ABCD. > > 4. Process 1 writes out the reference. > > That doesn't happen with a push, because the server never would have > told the client that it has ABCD in the first place (so process 1 here > is the client). That is true with or without quarantine. > > But pushes aren't foolproof either. You said "loose object ABCD not > referred t oby anything for the last 2 weeks". But that's not exactly > how it works. It's "object with an mtime of more than 2 weeks which is > not currently referenced". So imagine a sequence like: > > 1. ABCD is ancient. > > 2. refs/heads/foo points to ABCD. > > 3. Server receive-pack advertises foo pointing to ABCD. > > 4. Simultaneous process on the server deletes refs/heads/foo (or > perhaps somebody force-pushes over it). > > 5. Client prepares and sends pack without ABCD. > > 6. Server receive-pack checks that yes, we still have ABCD (i.e., the > usual connectivity check). > > 7. Server gc drops ABCD, which is now unreachable (reflogs can help > here, if you've enabled them; but we do delete reflogs when the > branch is deleted). > > 8. Server receive-pack writes corrupt repo mentioning ABCD. > > That's a lot more steps, though they might not be as implausible as you > think (e.g., consider moving "refs/heads/foo" to "refs/heads/bar" in a > single push; that's actually a delete and an update, which is all you > need to race with a simultaneous gc). > > I have no idea how often this happens in practice. My subjective > recollection is that most of the race corruptions I've seen were from > local operations on the server. E.g., we compute a tentative merge for > somebody's pull request which shares objects with an older tentative > merge. They click the "merge" button and we reference that commit, which > is recent, but unbeknownst to us, while we were creating our new > tentative merge, a "gc" was deleting the old one. > > We're sometimes saved by the "transitive freshness" rules in d3038d22f9 > (prune: keep objects reachable from recent objects, 2014-10-15). But > they're far from perfect: > > - some operations (like the push rename example) aren't writing new > objects, so the ref write _is_ the moment that gc would find out > something is reachable > > - the "is it reachable?" and "no, then delete it" steps aren't atomic. > Unless you want a whole-repo stop-the-world lock, somebody can > reference the object in between. And since it may take many seconds > to compute reachability, stop-the-world is not great. > > I think there are probably ways to make it better. Perhaps some kind of > lockless delete-but-be-able-to-rollback scheme (but keep in mind this > has to be implemented no top of POSIX filesystem semantics). Or even > just a "compute reachability, mark for deletion, and then hold a > stop-the-world lock briefly to double-check that our reachability is > still up to date". > > At least those seem plausible to me. I've never worked out the details, > and our solution was to just stop deleting objects during routine > maintenance (using "repack -adk"). We do still occasionally prune > manually (e.g., when somebody writes to support to remove a confidential > mistake). > > Anyway, that was more than you probably wanted to know. The short of it > is that I don't think quarantines help (they may even make things worse > by slightly increasing the length of the race window, though in practice > I doubt it). > >> Unless that interacts racily with the receive.unpackLimit, but then I >> have no idea that section is trying to say... > > No, I don't think unpackLimit really affects it much either way. > >> Also, surely the part where "NOTES" says something to the effect of "you >> are subject to races unless gc.auto=3D0" is wrong. To the extent that >> there's races it won't matter that you invoke "git gc" or "git gc >> --auto", it's the concurrency that matters. So if there's still races we >> should be saying the repo needs to be locked for writes for the duration >> of the "gc". > > Correct. It's the very act of pruning that is problematic. I think the > point is that if you are manually running "git gc", you'd presumably do > it at a time when the repository is not otherwise active. Thanks for that E-Mail. I'm hoping to get around to another set of "gc doc" updates and will hopefully be able to steal liberally from it. Maybe there's some case I haven't thought of that makes this stupid, but I wonder if something like a "gc quarantine" might be a fix fo both of the the issues you noted above. I.e. it seems to me that the main issue is that we conflate "mtime 2 weeks old because it's unreferenced for 2 weeks" v.s. "mtime 2 weeks old because we haven't gotten around to a 'gc'". So in such a "gc quarantine" mode when we discover an object/pack that's unreachable/purely made up of unreachable objects we'd move the relevant loose object/"loose" pack to such a quarantine, which would just be .git/unreferenced-objects/{??,pack}/ or whatever. AFAICT both cases you mentioned above would be mitigated by this because we'd no longer conflate "haven't gc'd this yet and it's 2 weeks old" v.s. "hasn't been referenced in 2 weeks". I started looking at this initially because I was wondering if the --keep-unreachable mode you modified in e26a8c4721 ("repack: extend --keep-unreachable to loose objects", 2016-06-13) could be made to write out such "unreferenced" objects into their *own* pack, so we could delete them all at once as a batch, and wouldn't create the "ref explosions" mentioned in [1]. But of course without an accompanying quarantine described above doing that would just make this race condition worse. 1. https://public-inbox.org/git/87fu6bmr0j.fsf@evledraar.gmail.com/