From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 0B721C433F5 for ; Thu, 30 Sep 2021 07:26:49 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id E36AE611C0 for ; Thu, 30 Sep 2021 07:26:48 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1348733AbhI3H23 (ORCPT ); Thu, 30 Sep 2021 03:28:29 -0400 Received: from cloud.peff.net ([104.130.231.41]:58106 "EHLO cloud.peff.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1348577AbhI3H20 (ORCPT ); Thu, 30 Sep 2021 03:28:26 -0400 Received: (qmail 22261 invoked by uid 109); 30 Sep 2021 07:26:43 -0000 Received: from Unknown (HELO peff.net) (10.0.1.2) by cloud.peff.net (qpsmtpd/0.94) with ESMTP; Thu, 30 Sep 2021 07:26:43 +0000 Authentication-Results: cloud.peff.net; auth=none Received: (qmail 24204 invoked by uid 111); 30 Sep 2021 07:26:43 -0000 Received: from coredump.intra.peff.net (HELO sigill.intra.peff.net) (10.0.0.2) by peff.net (qpsmtpd/0.94) with (TLS_AES_256_GCM_SHA384 encrypted) ESMTPS; Thu, 30 Sep 2021 03:26:43 -0400 Authentication-Results: peff.net; auth=none Date: Thu, 30 Sep 2021 03:26:42 -0400 From: Jeff King To: Elijah Newren Cc: Elijah Newren via GitGitGadget , Git Mailing List , Jonathan Nieder , Sergey Organov Subject: Re: [PATCH 5/7] tmp-objdir: new API for creating and removing primary object dirs Message-ID: References: <67d3b2b09f9ddda616cdd0d1b12ab7afc73670ed.1630376800.git.gitgitgadget@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org On Tue, Sep 28, 2021 at 10:05:31PM -0700, Elijah Newren wrote: > So there are other parts of running `git log` that are not read-only, > besides what I'm doing. That's unfortunate. Can we just disable > those things in this section of the code? It might be possible for the cache (we'd still generate entries, but just not write them). It's "just" an optimization, though it's possible there are cases where it's really annoying to do so. I'd imagine those are a stretch. I do use textconv to decrypt gpg blobs with a key on a hardware token, for example. I don't require a touch for each one, though, nor would I use textconv caching anyway (since the point is not to store the unencrypted contents). It wouldn't necessarily be OK to disable object writes in general. Though I'd think since a diff is _conceptually_ a read-only operation that any writes it wants to do would be in the same caching / optimization boat. I'd also worry about the maintenance cost of having to annotate any such case. > Also, I think it makes sense that textconv caching causes a problem > via writing new blobs and trees *and* refs that reference these blobs > and trees. However, I'm not sure I'm really grasping it, because I > don't see how your solutions are safe either. I'll mention more after > each one. Heh. So I originally wrote more on this in my earlier email, but worried I was getting too into the weeds and would confuse you, so I pared it down. But I managed to be confusing anyway. :) Here's one mental model. If you have a temporary object store, then these are the bad things that can happen: 1. Some code may write an object to it, thinking the data is saved, but we later throw it away. 2. Some code may _skip_ writing an object, thinking that we already have that object available (when in fact we'll later throw it away). 3. Some code may record the fact that we have an object in memory, and then later (after the temporary object store is not in use), may point to it when writing an object or reference. Pointing the primary object-dir at a temporary directory (like your patch does) means we're subject to all three. Problem (1) happens because unrelated code may not realize we've swapped out the primary object writing code for this throw-away directory. But we could require writers to explicitly indicate that they want to write to the temporary area, while allowing other writes to go to the regular object store. This could be done with a flag or variant of write_object_file(). Or if the temporary area isn't a regular object directory at all (like the whole tempfile / pretend thing), then this happens automatically, because the regular object-write paths don't even know about our temporary area. Problem (2) is mostly handled inside write_object_file(). It will optimize out writes of objects we already have. So it needs to know when we have an object "for real", and when we just have it in our temporary store. If we make the tmp_objdir the primary, then it doesn't know this (though we could teach it to check. In the pretend_object_file() system, it already knows this (it uses the "freshen" code paths which actually want to find the entry in the filesystem). Problem (3) is the hardest one, because we don't really distinguish between readers and writers. So if I call has_object_file(), is it because I want to access the object? Or is it because I want to generate a new reference to it, which must ensure that it exists? One of those is OK to look at the tmp_objdir (and indeed, is the whole point of making it in the first place), and the other is the first step to corrupting the repository. ;) With the environment variables we set up for running external commands (like hooks) in the receive-pack quarantine, one bit of safety we have is to prevent ref writes entirely in those commands. That works reasonably well. Even if they may write new objects that might get thrown away (if the push is rejected), they'd ultimately need to point to those objects with a ref. And because those hooks spend their _entire_ run-time in the quarantine state, they can't ever write a ref. Whereas doing it in-process is a little dicier. Right now the textconv cache writes out a new tree for every entry we add, so it happens to do the ref write while the tmp-objdir is still in place. But as this comment notes: $ git grep -B6 notes_cache_write diff.c diff.c- /* diff.c- * we could save up changes and flush them all at the end, diff.c- * but we would need an extra call after all diffing is done. diff.c- * Since generating a cache entry is the slow path anyway, diff.c- * this extra overhead probably isn't a big deal. diff.c- */ diff.c: notes_cache_write(textconv->cache); this is slow, and it would be perfectly reasonable to flush the cache at the end of the process (e.g., using an atexit() handler). In which case we'd hit this problem (3) exactly: we'd generate and remember objects during the tmp-objdir period, but only later actually reference them. This is more likely than the receive-pack hook case, because we're doing it all in process. > > If you remove the tmp_objdir as the primary as soon as you're done with > > the merge, but before you run the diff, you might be OK, though. > > It has to be after I run the diff, because the diff needs access to > the temporary files to diff against them. Right, of course. I was too fixated on the object-write part, forgetting that the whole point of the exercise is to later read them back. :) > > If not, then I think the solution is probably not to install this as the > > "primary", but rather: > > > > - do the specific remerge-diff writes we want using a special "write > > to this object dir" variant of write_object_file() > > > > - install the tmp_objdir as an alternate, so the reading side (which > > is diff code that doesn't otherwise know about our remerge-diff > > trickery) can find them > > > > And that lets other writers avoid writing into the temporary area > > accidentally. > > Doesn't this have the same problem? If something similar to textconv > caching were to create new trees that referenced the existing blobs > and then added a ref that referred to that tree, then you have the > exact same problem, right? The new tree and the new ref would be > corrupt as soon as the tmp-objdir went away. Whether or not other > callers write to the temporary area isn't the issue, it's whether they > write refs that can refer to anything from the temporary area. Or am > I missing something? The key thing here is in the first step, where remerge-diff is explicitly saying "I want to write to this temporary object-dir". But the textconv-cache code does not; it gets to write to the normal spot. So it avoids problem (1). You're right that it does not avoid problem (3) exactly. But triggering that would require some code not just writing other objects or references while the tmp-objdir is in place, but specifically referencing the objects that remerge-diff did put into the tmp-objdir. That seems a lot less likely to me (because the thing we're most worried about is unrelated code that just happens to write while the tmp-objdir is in place). > > In that sense this is kind of like the pretend_object_file() interface, > > except that it's storing the objects on disk instead of in memory. Of > > course another way of doing that would be to stuff the object data into > > tempfiles and just put pointers into the in-memory cached_objects array. > > > > It's also not entirely foolproof (nor is the existing > > pretend_object_file()). Any operation which is fooled into thinking we > > have object X because it's in the fake-object list or the tmp_objdir may > > reference that object erroneously, creating a corruption. But it's still > > safer than allowing arbitrary writes into the tmp_objdir. > > Why is the pretend_object_file() interface safer? Does it disable the > textconv caching somehow? I don't see why it helps avoid this > problem. So hopefully it's clearer now from what I wrote above, but just connecting the dots: 1. Unrelated code calling write_object_file() would still write real, durable objects, as usual. 2. The write_object_file() "skip this write" optimization already ignores the pretend_object_file() objects while checking "do we have this object". > Interesting. I'm not familiar with the pretend_object_file() code, > but I think I catch what you're saying. I don't see anything in > object-file.c that allows removing items from the cache, as I would > want to do, but I'm sure I could add something. Right, it's pretty limited now, and there's only one caller. And in fact I've wanted to get rid of it, exactly because it can create the danger we're discussing here. But I still think it's _less_ dangerous than fully replacing the object directory. > (I'm still not sure how this avoids the problem of things like > textconv caching trying to write an object that references one of > these, though. Should we just manually disable textconv caching > during a remerge-diff?) It can't avoid it completely, but in general, the textconv cache should be writing and referencing its own objects. Even if they happen to be the same as something remerge-diff generated, it won't know that until it has tried to write it (and so that's why having write_object_file() continue to really write the object is important). Hopefully that clears up my line of thinking. I do think a lot of this is kind of theoretical, in the sense that: - textconv caching is an obscure feature that we _could_ just disable during remerge-diffs - there's a reasonable chance that there isn't any other code that wants to write during a diff But this is all scary and error-prone enough that I'd prefer an approach that has the "least surprise". So any solution where random code calling write_object_file() _can't_ accidentally write to a throw-away directory seems like a safer less surprising thing. It does mean that the remerge-diff code needs to be explicit in its object writes to say "this needs to go to the temporary area" (whether it's a flag, or calling into a totally different function or subsystem). I'm hoping that's not hard to do (because its writes are done explicitly by the remerge-diff code), but I worry that it may be (because you are relying on more generic tree code to the write under the hood). -Peff