Re: reftable: new ref storage format

From: Shawn Pearce <spearce@spearce.org>
To: Jeff King <peff@peff.net>
Cc: git <git@vger.kernel.org>
Subject: Re: reftable: new ref storage format
Date: Sat, 15 Jul 2017 23:01:47 -0700	[thread overview]
Message-ID: <CAJo=hJtMo4OSxcYbq4oecTQYnwTR0zK8HgyqVEhOYZ-4eu4S9w@mail.gmail.com> (raw)
In-Reply-To: <20170714200830.iks5drqu72cypkny@sigill.intra.peff.net>

On Fri, Jul 14, 2017 at 1:08 PM, Jeff King <peff@peff.net> wrote:
> On Thu, Jul 13, 2017 at 05:11:52PM -0700, Shawn Pearce wrote:
>
> I think the "stack" implementation is what makes me most uncomfortable
> with this proposal. Atomicity with filesystem operations and especially
> readdir() is one of the things I think is most flaky about the current
> system. Here's an idea for an alternative implementation.
>
>   1. Write out reftables to files named after the hash of their content
>      (e.g., $GIT_DIR/reftables/1234abcd...).
>
>   2. The first block of the each reftable has a backpointer to the
>      previous table.
>
>   3. There's a well-known name (e.g., $GIT_DIR/reftable) that represents
>      the current state. We update it with the usual .lock/rename dance.
>
> That gives readers an easy atomic view; once they've called open() on
> "reftable", they follow back-pointers instead of computing the names
> (n-1, n-2, etc). They may still find that a compaction has removed a
> file they need, but:
>
>   - they only have to restart due to an actual compaction. They'll never
>     be tricked by a regular update.
>
>   - you can compact without immediately deleting the old reftables. So
>     you might compact, and then delete the reftables N seconds later. Any
>     reader which completes the read within N seconds wouldn't have to
>     restart.
>
> I think I can anticipate your answer, though. If you have a system where
> the latency to open and read a file is high, then you've just serialized
> the latencies as you walk the chain. Whereas with predictable names, you
> can pre-fetch refname.i through refname.j in parallel.
>
> How deep would you anticipate stacks getting? Would it be feasible for
> the tip to contain the names of the tables in the entire chain? If we're
> talking about 20 (or even 32) bytes per name, you could still fit over a
> hundred names in a 4K inode.

I think we'd want to keep the stacks under 32, which is a reasonable
amount of space used in the header of each reftable. I don't have this
yet in my updated document + implementation, but I'll look at trying
to add it over the next couple of days. Your idea to hold the explicit
list of the stack in each reftable makes for a very safe atomic reader
view.

> It doesn't escape me that I'm basically reinventing RefTree here, with
> reftables instead of tree objects. But I think breaking away from using
> real Git objects opens up a lot of efficiency tricks (like the prefix
> compression, and the parallel-fetch thing above). And it removes a lot
> of the gc complexity.

Yes, I agree. Also RefTree has trouble scaling due to the flat tree
object format. It depends on the user to sensibly break up the
reference space with '/' characters sprinkled about. This reftable
proposal does not suffer from that limitation, a user can use any
valid ref name structuring.

> So I'm worried about live-locking with a regular updater, not even a
> compacting one.

Ok. I think your idea of tracking an explicit list of the stack in the
top of every reftable solves this in a very neat way, so I'll look to
switch to that.

>> block  restart  size  lookup
>> 4k     16       29M    90.2 usec
>> 8k     16       29M    76.4 usec
>
> Thanks for these numbers. I was really thinking that blocks would be on
> the order of 4K (where you can see that the restarts help very little).
> For local disk that's a pretty reasonable size. For high-latency fetches
> to a specialized database, maybe not.
...
>> > One thing I didn't see is reflogs. They don't strictly have to be part
>> > of a ref-storage solution. But they do still consume at least one inode
>> > per ref in the current system. If you reflog everything, which in my
>> > opinion you should. Having an audit trail of ref updates is very useful
>> > for debugging (both bugs in Git, and trying to figure out what went
>> > wrong when a user goes berserk with "git push --prune").
>>
>> Yea, I glossed over that and ignored them. Mostly because one system
>> where I want to use reftable already has a separate database handling
>> the reflogs. In another (Gerrit Code Review), we disable reflogs for
>> the insane refs/changes/ namespace, as nearly every reference is
>> created once, and never modified.
>
> Even for created-once refs, I've found an audit trail of created when,
> by whom, using what program to be quite valuable.

Dave Borowitz agrees with you, Gerrit Code Review should be recording
reflog data, even for create-once refs. Its a limitation of the
current $GIT_DIR/logs that has forced it to be disabled. I'd really
like something like reftable, so we can record reflog.

>> One could abuse the reftable format to store a reflog. If the refname
>> field is just a sequence of bytes, one could store keys like refname +
>> '\0' + timestamp, and reuse the symbolic ref value format to store the
>> old/new/message, as its just a length delimited string.
>
> Gross, but it could work. I actually think David's LMDB proposal did
> something similar (encoding the entries in the keyname), but I'd have to
> double-check.

I added log support to the reftable format. I updated [1] to reflect
log blocks at the end of the file. I ran a year's worth of log
records, 149,932 log entries on 43,061 refs to test:

format                size
$GIT_DIR/logs  173 M
reftable                  4 M  (avg 30 bytes)

[1]: https://googlers.googlesource.com/sop/jgit/+/reftable/Documentation/technical/reftable.md

reftable gets these kinds of savings by packing many logs into a
single file (so no disk block overheads), clustering log records by
ref name, prefix compressing, and deflating log records using 128k
input blocks. I'm OK with using deflate for log records (and bigger
block sizes), as log records are infrequently accessed compared to
current refs.

There is a log index to perform efficient point-in-time lookup for a
single ref, and then iterate over its log records from that time, and
older. That answers most reflog queries quickly. Large batch log reads
like gc reachability can simply iterate through all log records,
clustered by ref, ordered by time descending.