* RFC/Pull Request: Refs db backend
@ 2015-06-23  0:50 David Turner
  2015-06-23  5:36 ` Junio C Hamano
                   ` (3 more replies)
  0 siblings, 4 replies; 26+ messages in thread
From: David Turner @ 2015-06-23  0:50 UTC (permalink / raw)
  To: git mailing list

I've revived and modified Ronnie Sahlberg's work on the refs db
backend.  

The work is on top of be3c13e5564, Junio's "First batch for 2.5 cycle".
I recognize that there have been changes to the refs code since then,
and that there are some further changes in-flight from e.g. Michael
Haggerty.  If there is interest in this, I can rebase once Michael's
changes land.

The changes can be found here:
https://github.com/dturner-tw/git.git on the dturner/pluggable-backends
branch

The db backend code was added in the penultimate commit; the rest is
just code rearrangement and minor changes to make alternate backends
possible.  There ended up being a fair amount of this rearrangement, but
the end result is that almost the entire git test suite runs under the
db backend without error (see below for details).

The db backend runs git for-each-ref about 30% faster than the files
backend with fully-packed refs on a repo with ~120k refs.  It's also
about 4x faster than using fully-unpacked refs.  In addition, and
perhaps more importantly, it avoids case-conflict issues on OS X.

I chose to use LMDB for the database.  LMDB has a few features that make
it suitable for usage in git:

1. It is relatively lightweight; it requires only one header file, and
the library itself is under 300k (as opposed to 700k for
e.g. sqlite).

2. It is well-tested: it's been used in OpenLDAP for years.

3. It's very fast.  LMDB's benchmarks show that it is among
the fastest key-value stores.

4. It has a relatively simple concurrency story; readers don't
block writers and writers don't block readers.
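
For concreteness, here's roughly what reading a single ref through
LMDB's C API looks like.  The database path and key encoding below are
illustrative, not necessarily what the patches use; error handling is
omitted:

  #include <stdio.h>
  #include <string.h>
  #include <lmdb.h>

  static int read_one_ref(const char *refname)
  {
      MDB_env *env;
      MDB_txn *txn;
      MDB_dbi dbi;
      MDB_val key, val;
      int ret;

      mdb_env_create(&env);
      mdb_env_open(env, ".git/refs.lmdb", 0, 0644);

      /* Read-only snapshot: doesn't block writers, isn't blocked by them. */
      mdb_txn_begin(env, NULL, MDB_RDONLY, &txn);
      mdb_dbi_open(txn, NULL, 0, &dbi);

      key.mv_data = (void *)refname;
      key.mv_size = strlen(refname);
      ret = mdb_get(txn, dbi, &key, &val);
      if (ret == MDB_SUCCESS)
          printf("%.*s\n", (int)val.mv_size, (char *)val.mv_data);

      mdb_txn_abort(txn);   /* read-only txns are simply discarded */
      mdb_env_close(env);
      return ret;
  }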

Ronnie Sahlberg's original version of this patchset used tdb.  The
advantage of tdb is that it's smaller (~125k).  The disadvantages are
that tdb is hard to build on OS X and isn't in Homebrew.  So LMDB
seemed simpler.

To test this backend's correctness, I hacked test-lib.sh and
test-lib-functions.sh to run all tests under the refs backend. Dozens
of tests use manual ref/reflog reading/writing, or create submodules
without passing --refs-backend-type to git init.  If those tests are
changed to use the update-ref machinery or test-refs-be-db (or, in the
case of packed-refs, corrupt refs, and dumb fetch tests, are skipped),
the only remaining failing tests are the git-new-workdir tests and the
gitweb tests.  

Please let me know how it would be best to proceed. 


* Re: RFC/Pull Request: Refs db backend
  2015-06-23  0:50 RFC/Pull Request: Refs db backend David Turner
@ 2015-06-23  5:36 ` Junio C Hamano
  2015-06-23 10:23   ` Duy Nguyen
  2015-06-23 17:29   ` David Turner
  2015-06-23 11:47 ` Jeff King
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 26+ messages in thread
From: Junio C Hamano @ 2015-06-23  5:36 UTC (permalink / raw)
  To: David Turner; +Cc: git mailing list

David Turner <dturner@twopensource.com> writes:

> I've revived and modified Ronnie Sahlberg's work on the refs db
> backend.  
>
> The work is on top of be3c13e5564, Junio's "First batch for 2.5 cycle".
> I recognize that there have been changes to the refs code since then,
> and that there are some further changes in-flight from e.g. Michael
> Haggerty.  If there is interest in this, I can rebase once Michael's
> changes land.
> ...
> The db backend runs git for-each-ref about 30% faster than the files
> backend with fully-packed refs on a repo with ~120k refs.  It's also
> about 4x faster than using fully-unpacked refs.  In addition, and
> perhaps more importantly, it avoids case-conflict issues on OS X.
>
> I chose to use LMDB for the database...
> ...
> Ronnie Sahlberg's original version of this patchset used tdb.  The
> advantage of tdb is that it's smaller (~125k).  The disadvantages are
> that tdb is hard to build on OS X and isn't in Homebrew.  So LMDB
> seemed simpler.

"If there is interest"?  Shut up and take my money ;-)

More seriously, that's great that you stepped up to resurrect this
topic.  In a sense, the choice of sample database backend does not
matter.  I do not care if it is tdb, lmdb, or even Berkeley DB as
long as it functions. ;-)

As long as the interface between ref-transaction system on the Git
side and the database backend is designed right, your lmdb thing can
serve as a reference implementation for other people to plug other
database backends to the same interface, right?  As one step to
validate the interface to the database backends, it would be nice to
eventually have at least two backends that talk to meaningfully
different systems, but we have to start somewhere, and "for now we
have lmdb" is as good a place to start as any other db backend.

I wonder if we can do a "filesystem" backend on top of the same
backend interface---is that too much impedance mismatch to make it
impractical?

Thanks.


* Re: RFC/Pull Request: Refs db backend
  2015-06-23  5:36 ` Junio C Hamano
@ 2015-06-23 10:23   ` Duy Nguyen
  2015-06-23 18:47     ` David Turner
  2015-06-23 17:29   ` David Turner
  1 sibling, 1 reply; 26+ messages in thread
From: Duy Nguyen @ 2015-06-23 10:23 UTC (permalink / raw)
  To: Junio C Hamano, David Turner; +Cc: git mailing list

On Tue, Jun 23, 2015 at 12:36 PM, Junio C Hamano <gitster@pobox.com> wrote:
> "If there is interest"?  Shut up and take my money ;-)

Yeah. This may be the next big thing since pack bitmap. It's even
better if it enters 'master' hand in hand with pack protocol v2, but I
think v2 needs more time.

On Tue, Jun 23, 2015 at 7:50 AM, David Turner <dturner@twopensource.com> wrote:
> To test this backend's correctness, I hacked test-lib.sh and
> test-lib-functions.sh to run all tests under the refs backend.

Now we have two. split-index also benefits from running through the
full test suite like this. I propose we make "make test" run the test
suite twice. The first run is with the default configuration: no split
index, no fancy ref backend. The second run enables split-index and
switches to the new backend, running through all test cases. In the
future we can also enable packv4 in this second run. There won't be a
third run.

When the second ref backend comes, we can switch between the two
backends using a random number generator where we control both
algorithm and seed, so that when a test fails, the user can give us
their seed and we can re-run with the same configuration.

> Dozens of tests use manual ref/reflog reading/writing, or create submodules
> without passing --refs-backend-type to git init.  If those tests are
> changed to use the update-ref machinery or test-refs-be-db (or, in the
> case of packed-refs, corrupt refs, and dumb fetch tests, are skipped),
> the only remaining failing tests are the git-new-workdir tests and the
> gitweb tests.

I haven't read the series, but I guess you should also add a few tests
to run on the first run, so new code is exercised a bit even if people
skip the second run.
-- 
Duy


* Re: RFC/Pull Request: Refs db backend
  2015-06-23  0:50 RFC/Pull Request: Refs db backend David Turner
  2015-06-23  5:36 ` Junio C Hamano
@ 2015-06-23 11:47 ` Jeff King
  2015-06-23 13:10   ` Duy Nguyen
                     ` (2 more replies)
  2015-06-23 15:51 ` Michael Haggerty
  2015-06-23 17:16 ` Stefan Beller
  3 siblings, 3 replies; 26+ messages in thread
From: Jeff King @ 2015-06-23 11:47 UTC (permalink / raw)
  To: David Turner; +Cc: git mailing list

On Mon, Jun 22, 2015 at 08:50:56PM -0400, David Turner wrote:

> The db backend runs git for-each-ref about 30% faster than the files
> backend with fully-packed refs on a repo with ~120k refs.  It's also
> about 4x faster than using fully-unpacked refs.  In addition, and
> perhaps more importantly, it avoids case-conflict issues on OS X.

Neat.

Can you describe a bit more about the reflog handling?

One of the problems we've had with large-ref repos is that the reflog
storage is quite inefficient. You can pack all the refs, but you may
still be stuck with a bunch of reflog files with one entry, wasting a
whole inode. Doing a "git repack" when you have a million of those has
horrible cold-cache performance. Basically anything that isn't
one-file-per-reflog would be a welcome change. :)

It has also been a dream of mine to stop tying the reflogs specifically
to the refs. I.e., have a spot for reflogs of branches that no longer
exist, which allows us to retain them for deleted branches. Then you can
possibly recover from a branch deletion, whereas now you have to dig
through "git fsck"'s dangling output. And the reflog, if you don't
expire it, becomes a suitable audit log to find out what happened to
each branch when (whereas now it is full of holes when things get
deleted).

Does your solution handle something like that?

I was thinking of actually moving to a log-structured ref storage.
Something like:

  - any ref write puts a line at the end of a single logfile that
    contains the ref name, along with the normal reflog data

  - the logfile is the source of truth for the ref state. If you want to
    know the value of any ref, you can read it backwards to find the
    last entry for the ref. Everything else is an optimization.

    Let's call the number of refs N, and the number of ref updates in
    the log U.

  - we keep a key/value index mapping the name of any branch that exists
    to the byte offset of its entry in the logfile. This would probably
    be in some binary key/value store (like LMDB). Without this,
    resolving a ref is O(U), which is horrible. With it, it should be
    O(1) or O(lg N), depending on the index data structure.

  - the index can also contain other optimizations. E.g., rather than
    point to the entry in the logfile, it can include the sha1 directly
    (to avoid an extra level of indirection). It may want to include the
    "peeled" value, as the current packed-refs file does.

  - Reading all of the reflogs (e.g., for repacking) is O(U), just like
    it is today. Except the storage for the logfile is a lot more
    compact than what we store today, with one reflog per file.

  - Reading a single reflog is _also_ O(U), which is not as good as
    today. But if each log entry contains a byte offset of the previous
    entry, you can follow the chain (it is still slightly worse, because
    you are jumping all over the file, rather than reading a compact set
    of lines).

  - Pruning the reflog entries from the logfile requires rewriting the
    whole thing. That's similar to today, where we rewrite each of the
    reflog files.

One of the nice properties of this system is that it should be very
resilient to corruption and races. Most of the operations are either
appending to a file, or writing to a tempfile and renaming in place.
The exception is the key/value index, but if we run into any problems
there, it can be rebuilt by walking over the logfile (for a cost of
O(U)).
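
To make that concrete, here is a sketch of the two record types
(struct names and field layout are mine, purely illustrative):

  #include <stdint.h>

  /* One record appended to the logfile per ref update. */
  struct log_record {
      uint64_t prev_offset;       /* previous entry for this ref, 0 if none */
      unsigned char old_sha1[20];
      unsigned char new_sha1[20];
      /* committer, timestamp, tz, and message follow, as in today's reflog */
  };

  /* Index entry, keyed by refname in the key/value store. */
  struct index_entry {
      uint64_t log_offset;        /* O(1) or O(lg N) jump to latest record */
      unsigned char sha1[20];     /* cached current value; skips the log */
      unsigned char peeled[20];   /* like the "^" lines in packed-refs */
  };

Resolving a ref is one index lookup; walking a single ref's reflog is
a chase down prev_offset until it hits zero.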

I dunno. Maybe I am overthinking it. But it really feels like the _refs_
are a key/value thing, but the _reflogs_ are not. You can cram them into
a key/value store, but you're probably operating on them as a big blob,
then.

> I chose to use LMDB for the database.  LMDB has a few features that make
> it suitable for usage in git:

One of the complaints that Shawn had about sqlite is that there is no
native Java implementation, which makes it hard for JGit to ship a
compatible backend. I suspect the same is true for LMDB, but it is
probably a lot simpler than sqlite (so reimplementation might be
possible).

But it may also be worth going with a slightly slower database if we can
get wider compatibility for free.

> To test this backend's correctness, I hacked test-lib.sh and
> test-lib-functions.sh to run all tests under the refs backend. Dozens
> of tests use manual ref/reflog reading/writing, or create submodules
> without passing --refs-backend-type to git init.  If those tests are
> changed to use the update-ref machinery or test-refs-be-db (or, in the
> case of packed-refs, corrupt refs, and dumb fetch tests, are skipped),
> the only remaining failing tests are the git-new-workdir tests and the
> gitweb tests.

I think we'll need to bump core.repositoryformatversion, too. See the
patches I just posted here:

  http://thread.gmane.org/gmane.comp.version-control.git/272447

-Peff


* Re: RFC/Pull Request: Refs db backend
  2015-06-23 11:47 ` Jeff King
@ 2015-06-23 13:10   ` Duy Nguyen
  2015-06-24  8:51     ` Jeff King
  2015-06-23 18:18   ` David Turner
  2015-06-24  6:09   ` Shawn Pearce
  2 siblings, 1 reply; 26+ messages in thread
From: Duy Nguyen @ 2015-06-23 13:10 UTC (permalink / raw)
  To: Jeff King; +Cc: David Turner, git mailing list

On Tue, Jun 23, 2015 at 6:47 PM, Jeff King <peff@peff.net> wrote:
> I was thinking of actually moving to a log-structured ref storage.
> Something like:
>
>   - any ref write puts a line at the end of a single logfile that
>     contains the ref name, along with the normal reflog data
>
>   - the logfile is the source of truth for the ref state. If you want to
>     know the value of any ref, you can read it backwards to find the
>     last entry for the ref. Everything else is an optimization.
>
>     Let's call the number of refs N, and the number of ref updates in
>     the log U.
>
>   - we keep a key/value index mapping the name of any branch that exists
>     to the byte offset of its entry in the logfile. This would probably

One key/value mapping per branch, pointing to the latest reflog entry,
or one key/value mapping for each reflog entry?

>     be in some binary key/value store (like LMDB). Without this,
>     resolving a ref is O(U), which is horrible. With it, it should be
>     O(1) or O(lg N), depending on the index data structure.

I'm thinking of the user with small or medium repos, in terms of refs,
who does not want an extra dependency. If we store one mapping per
branch, then the size of this mapping is small enough that the index
in a text file is ok. If we also store the offset to the previous
reflog entry of the same branch in the current reflog entry, like a
back pointer, then we could jump back faster.

Or do you have something else in mind? The current reflog structure
won't work, because I think you bring back the reflog graveyard with
this, and I don't want to lose that.

>   - the index can also contain other optimizations. E.g., rather than
>     point to the entry in the logfile, it can include the sha1 directly
>     (to avoid an extra level of indirection). It may want to include the
>     "peeled" value, as the current packed-refs file does.
>
>   - Reading all of the reflogs (e.g., for repacking) is O(U), just like
>     it is today. Except the storage for the logfile is a lot more
>     compact than what we store today, with one reflog per file.
>
>   - Reading a single reflog is _also_ O(U), which is not as good as
>     today. But if each log entry contains a byte offset of the previous
>     entry, you can follow the chain (it is still slightly worse, because
>     you are jumping all over the file, rather than reading a compact set
>     of lines).
>
>   - Pruning the reflog entries from the logfile requires rewriting the
>     whole thing. That's similar to today, where we rewrite each of the
>     reflog files.
-- 
Duy


* Re: RFC/Pull Request: Refs db backend
  2015-06-23  0:50 RFC/Pull Request: Refs db backend David Turner
  2015-06-23  5:36 ` Junio C Hamano
  2015-06-23 11:47 ` Jeff King
@ 2015-06-23 15:51 ` Michael Haggerty
  2015-06-23 19:53   ` David Turner
  2015-06-23 17:16 ` Stefan Beller
  3 siblings, 1 reply; 26+ messages in thread
From: Michael Haggerty @ 2015-06-23 15:51 UTC (permalink / raw)
  To: David Turner, git mailing list

On 06/23/2015 02:50 AM, David Turner wrote:
> I've revived and modified Ronnie Sahlberg's work on the refs db
> backend.  
> 
> The work is on top of be3c13e5564, Junio's "First batch for 2.5 cycle".
> I recognize that there have been changes to the refs code since then,
> and that there are some further changes in-flight from e.g. Michael
> Haggerty.  If there is interest in this, I can rebase once Michael's
> changes land.

It's awesome that you are working on this!

I'm reading through your commits and will add comments as they pop into
my head...

* I initially read "refs-be-files" to be a short version of "references,
they be files". I might never be able to get that pronunciation out of
my head :-)

* It would be more modest to call the files implementing the LMDB
backend "refs-be-lmdb.[c,h]" rather than "refs-be-db.[c,h]".

* I wonder whether `refname_is_safe()` might eventually have to become
backend-specific. For example, maybe one backend will have to impose a
limit of 128 characters or something? No matter, though...it can be
moved later.

* You have put `format_reflog_msg()` in the public interface. It
probably makes sense, because more than one backend might want to use
it. But another backend might want to store (refname, old_sha1,
new_sha1, ...) as separate columns in a database. As long as
`format_reflog_msg()` is seen as a helper function and is not called by
any of the shared code, it shouldn't be a problem.

* I wonder whether `init_backend()` will be general enough. I'm thinking
by analogy with object constructors, which usually need class-specific
arguments during their initialization even though afterwards objects of
different classes can be used interchangeably. So I guess the idea is
that a typical `init_backend()` function will have to dig around itself
to find whatever additional information that it needs (e.g., from the
git configuration or the filesystem or whatever). So I think this is OK.

* Your "methods for bulk updates" are I think analogous to the
`initial_ref_transaction_commit()` function that I recently submitted
[1]. Either way, the goal is to abstract away the fact that the
file-based backend uses packed and loose references with tradeoffs that
callers currently have to know about.

* I don't like the fact that you have replaced `struct ref_transaction
*` with `void *` in the public interface. On a practical level, I like
the bit of type-safety that comes with the more specific declaration.
But on a more abstract level, I think that the concept of a transaction
could be useful across backends, for example in utility functions that
verify that a proposed set of updates are internally consistent. I would
rather see either

  * backends "extend" a basic `struct ref_transaction` to suit their
needs, and upcast/downcast pointers at the module boundary, or

  * `struct ref_transaction` itself gets a `void *` member that backends
can use for whatever purposes they want.
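
In code, the first option is the usual C embedding idiom.  A sketch,
with made-up fields:

  #include <lmdb.h>

  struct ref_transaction {
      unsigned int state;              /* whatever turns out to be shared */
  };

  struct lmdb_ref_transaction {
      struct ref_transaction base;     /* must be the first member */
      MDB_txn *txn;
  };

  /* The casts live only at the module boundary: */
  static struct lmdb_ref_transaction *lmdb_txn(struct ref_transaction *t)
  {
      return (struct lmdb_ref_transaction *)t;
  }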

* Regarding MERGE_HEAD: you take the point of view that it must continue
to be stored as a file. And yet it must also behave somewhat like a
reference; for example, `git rev-parse MERGE_HEAD` works today.
MERGE_HEAD is also used for reachability, right?

Another point of view is that MERGE_HEAD is a plain old boring
reference, but there is some other metadata related to it that the refs
backend has to store. The file-based backend would have special-case
code to read the additional data from the tail of the loose refs file
(and be sure to write the metadata when writing the reference), but
other backends could store the reference with the rest but do their own
thing with the metadata. So I guess I'm wondering whether the refs API
needs a MERGE_HEAD-specific way to read and write MERGE_HEAD along with
its metadata.

* Don't the same considerations that apply to MERGE_HEAD also apply to
FETCH_HEAD?

* I'm showing my ignorance of LMDB, but I don't see where the LMDB
backend initializes its database during `git init-db`. Is that what
`init_env()` does? But it looks like `init_env()` is called on every git
invocation (via `git_config_early()`). Puzzled.

* Meanwhile, `create_default_files()` (in `builtin/init-db`) still
creates directories `refs`, `refs/heads`, and `refs/tags`.

* Rehash of the last two points: I expected one backend function that is
used to initialize the refs backend when a new repository is created
(e.g., in `git init`). The file-based backend would use this function to
create the `refs`, `refs/heads`, and `refs/tags` directories. I expected
a second function that is called once every time git runs in an existing
repository (this one might, for example, open a database connection).
And maybe even a third one that closes down the database connection
before git exits. Would you please explain how this actually works?

* `lmdb_init_backend()` leaks `path` if `env` is already set (in which
case it needn't compute `path` in the first place).

* You have the constraint that submodules need to use the same reference
backend as the main repository. But it looks like each submodule has its
own independent database. So why the restriction?

It might be a sign that the design is not optimal if it is only able to
handle one *type* of reference backend in a single run.

In object-oriented language, I would expect a `Refs` object to represent
the reference storage for a single repository or submodule. The VTABLE
for the object would be a `struct ref_be`. But the object should also
have an object pointer that can store per-instance data. I think the
per-instance data is missing in your design.
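
Roughly, with illustrative names:

  struct ref_be;    /* the method table: resolve, iterate, commit, ... */

  struct refs {
      const struct ref_be *be;    /* vtable, shared per backend *type* */
      void *instance_data;        /* per-repository state, e.g. a
                                   * ref_cache or an open LMDB env */
  };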

When I start up, I would instantiate a `Refs` instance for the main
repository, which creates either a `FileRefs` or a `LMDBRefs` instance.
This instance would be used to access the refs for the main repository.
It could be stored in a global variable, similarly to how `ref_cache` is
currently stored. (Indeed, `struct ref_cache` would be subsumed into
`FileRefs`.)

If I want to access references in a submodule, there would again be an
initialization process for that sub-repository, which checks the type of
references backend that the repository uses and instantiates either a
`FileRefs` or a `LMDBRefs` instance to represent *that* repository. The
submodule instances could be stored by submodule name in a lookup table,
like submodule_ref_caches is currently stored. Since it has its own
instance data, the `Refs` instance for a submodule can coexist with the
`Refs` instance for the main repository even if they are of different types.

But even aside from supporting heterogeneous submodules, I think adding
instance data to the design would make it quite a bit more flexible,
because it would also allow reference backends to be composed. For
example, one could implement a reference backend that maps
`refs/remotes/origin` to a namespace in another repository, for a
transparent view of an upstream repo (using alternates to share objects)
that doesn't have to be updated when references are updated in the
origin. Or we could implement loose and packed references as two
separate backends that are layered on each other. Or we could implement
a lightweight "mirror" clone with copy-on-write semantics for
references. We could arrange to store all of the references for a
top-level repo and its submodules in a single database, potentially
allowing atomic upgrades across repositories [2].

* You explain in the docstring to `lmdb_transaction_begin_flags()` that
it is dangerous to call a callback function while a write transaction is
open if the callback function might want to initiate another write
transaction. This would obviously also apply to running hooks. This is a
limitation of LMDB because writers block each other. I can't think of
anyplace that this would be a problem in our codebase. But if it were,
it seems to me that you could take an approach like the file-based
backend, which collects the transaction in a `ref_transaction` data
structure, and executes the entire transaction within a single call to
`ref_transaction_commit()`. This approach would prevent callers outside
of the refs module from ever bumping against this limitation.

So, that was my stream-of-consciousness about your patch series. Overall
I like it very much. I have only skimmed it so far, and hardly read the
last two patches at all, but what I saw all looked good and well-organized.

Please CC me on future versions of this patch series, because it is very
close to my interests. I've put a lot of effort into encapsulating and
abstracting the refs module with the goal of getting to pluggable
reference backends (plus some other stuff), so it's great to see what
you have accomplished!

Let me know if you need any help rebasing your work onto my recent
changes. It would probably work best if you break your patch series into
smaller pieces to make them easier for the mailing list to digest. For
example, the first installment could be the patches that make sense even
independent of the plan to add support for multiple backends: the first
two patches, plus the ones related to CHERRY_PICK_HEAD and its cousins,
the abstraction of the reflog functions, and the `git reflog create` and
`git reflog exists` subcommands.

Michael

[1] http://article.gmane.org/gmane.comp.version-control.git/272362
[2] To implement this feature, it might be necessary to make the `Refs`
instance for the main repository responsible for instantiating the
`Refs` instance for submodules.

-- 
Michael Haggerty
mhagger@alum.mit.edu


* Re: RFC/Pull Request: Refs db backend
  2015-06-23  0:50 RFC/Pull Request: Refs db backend David Turner
                   ` (2 preceding siblings ...)
  2015-06-23 15:51 ` Michael Haggerty
@ 2015-06-23 17:16 ` Stefan Beller
  2015-06-23 20:04   ` David Turner
  3 siblings, 1 reply; 26+ messages in thread
From: Stefan Beller @ 2015-06-23 17:16 UTC (permalink / raw)
  To: David Turner; +Cc: git mailing list, ronnie sahlberg

[+<ronniesahlberg@gmail.com>, FYI]

On Mon, Jun 22, 2015 at 5:50 PM, David Turner <dturner@twopensource.com> wrote:
> I've revived and modified Ronnie Sahlberg's work on the refs db
> backend.

Awesome!

>
> The work is on top of be3c13e5564, Junio's "First batch for 2.5 cycle".
> I recognize that there have been changes to the refs code since then,
> and that there are some further changes in-flight from e.g. Michael
> Haggerty.  If there is interest in this, I can rebase once Michael's
> changes land.

Originally I wanted to continue Ronnie's work, but because of the churn
in refs I stopped for a while and took care of other projects (and wanted
to come back eventually). Thanks for reviving this topic!

> The changes can be found here:
> https://github.com/dturner-tw/git.git on the dturner/pluggable-backends
> branch
>
> The db backend code was added in the penultimate commit; the rest is
> just code rearrangement and minor changes to make alternate backends
> possible.  There ended up being a fair amount of this rearrangement, but
> the end result is that almost the entire git test suite runs under the
> db backend without error (see below for details).

Looking at the end result in refs-be-db.c, it feels like there are more
functions in the refs_be_db struct; did this originate from other design
choices? IIRC Ronnie wanted to have as few functions in there as
possible, and to share as much of the code between the databases as
possible, so that the glue between the db and the refs code is minimal.

Some random comments from looking over the branch briefly:

In the latest commit (refs: tests for db backend), I am unsure about the
copyright annotations. At least a sole "Copyright (c) 2007 Junio C Hamano"
doesn't make sense to me. ;)

Typo in commit message "bisect: use refs insfrastructure for BISECT_START"

Some commits contain a ChangeId, which is a Gerrit leftover. :(

Thanks,
Stefan

>
> The db backend runs git for-each-ref about 30% faster than the files
> backend with fully-packed refs on a repo with ~120k refs.  It's also
> about 4x faster than using fully-unpacked refs.  In addition, and
> perhaps more importantly, it avoids case-conflict issues on OS X.
>
> I chose to use LMDB for the database.  LMDB has a few features that make
> it suitable for usage in git:
>
> 1. It is relatively lightweight; it requires only one header file, and
> the library itself is under 300k (as opposed to 700k for
> e.g. sqlite).
>
> 2. It is well-tested: it's been used in OpenLDAP for years.
>
> 3. It's very fast.  LMDB's benchmarks show that it is among
> the fastest key-value stores.
>
> 4. It has a relatively simple concurrency story; readers don't
> block writers and writers don't block readers.
>
> Ronnie Sahlberg's original version of this patchset used tdb.  The
> advantage of tdb is that it's smaller (~125k).  The disadvantages are
> that tdb is hard to build on OS X and isn't in Homebrew.  So LMDB
> seemed simpler.
>
> To test this backend's correctness, I hacked test-lib.sh and
> test-lib-functions.sh to run all tests under the refs backend. Dozens
> of tests use manual ref/reflog reading/writing, or create submodules
> without passing --refs-backend-type to git init.  If those tests are
> changed to use the update-ref machinery or test-refs-be-db (or, in the
> case of packed-refs, corrupt refs, and dumb fetch tests, are skipped),
> the only remaining failing tests are the git-new-workdir tests and the
> gitweb tests.
>
> Please let me know how it would be best to proceed.


* Re: RFC/Pull Request: Refs db backend
  2015-06-23  5:36 ` Junio C Hamano
  2015-06-23 10:23   ` Duy Nguyen
@ 2015-06-23 17:29   ` David Turner
  1 sibling, 0 replies; 26+ messages in thread
From: David Turner @ 2015-06-23 17:29 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git mailing list

On Mon, 2015-06-22 at 22:36 -0700, Junio C Hamano wrote:
> David Turner <dturner@twopensource.com> writes:
> 
> > I've revived and modified Ronnie Sahlberg's work on the refs db
> > backend.  
> >
> > The work is on top of be3c13e5564, Junio's "First batch for 2.5 cycle".
> > I recognize that there have been changes to the refs code since then,
> > and that there are some further changes in-flight from e.g. Michael
> > Haggerty.  If there is interest in this, I can rebase once Michael's
> > changes land.
> > ...
> > The db backend runs git for-each-ref about 30% faster than the files
> > backend with fully-packed refs on a repo with ~120k refs.  It's also
> > about 4x faster than using fully-unpacked refs.  In addition, and
> > perhaps more importantly, it avoids case-conflict issues on OS X.
> >
> > I chose to use LMDB for the database...
> > ...
> > Ronnie Sahlberg's original version of this patchset used tdb.  The
> > advantage of tdb is that it's smaller (~125k).  The disadvantages are
> > that tdb is hard to build on OS X and isn't in Homebrew.  So LMDB
> > seemed simpler.
> 
> "If there is interest"?  Shut up and take my money ;-)
> 
> More seriously, that's great that you stepped up to resurrect this
> topic.  In a sense, the choice of sample database backend does not
> matter.  I do not care if it is tdb, lmdb, or even Berkeley DB as
> long as it functions. ;-)
> 
> As long as the interface between ref-transaction system on the Git
> side and the database backend is designed right, your lmdb thing can
> serve as a reference implementation for other people to plug other
> database backends to the same interface, right? 

Yes.

>  As one step to
> validate the interface to the database backends, it would be nice to
> eventually have at least two backends that talk to meaningfully
> different systems, but we have to start somewhere, and "for now we
> have lmdb" is as good a place to start as any other db backend.
> 
> I wonder if we can do a "filesystem" backend on top of the same
> backend interface---is that too much impedance mismatch to make it
> impractical?

The patch series does include a filesystem backend, which is simply the
current ref infrastructure with extremely minor changes.  


* Re: RFC/Pull Request: Refs db backend
  2015-06-23 11:47 ` Jeff King
  2015-06-23 13:10   ` Duy Nguyen
@ 2015-06-23 18:18   ` David Turner
  2015-06-24  9:14     ` Jeff King
  2015-06-24  6:09   ` Shawn Pearce
  2 siblings, 1 reply; 26+ messages in thread
From: David Turner @ 2015-06-23 18:18 UTC (permalink / raw)
  To: Jeff King; +Cc: git mailing list

On Tue, 2015-06-23 at 07:47 -0400, Jeff King wrote:
> On Mon, Jun 22, 2015 at 08:50:56PM -0400, David Turner wrote:
> 
> > The db backend runs git for-each-ref about 30% faster than the files
> > backend with fully-packed refs on a repo with ~120k refs.  It's also
> > about 4x faster than using fully-unpacked refs.  In addition, and
> > perhaps more importantly, it avoids case-conflict issues on OS X.
> 
> Neat.
> 
> Can you describe a bit more about the reflog handling?
> 
> One of the problems we've had with large-ref repos is that the reflog
> storage is quite inefficient. You can pack all the refs, but you may
> still be stuck with a bunch of reflog files with one entry, wasting a
> whole inode. Doing a "git repack" when you have a million of those has
> horrible cold-cache performance. Basically anything that isn't
> one-file-per-reflog would be a welcome change. :)

Reflogs are stored in the database as well.  There is one header entry
per ref to indicate that a reflog is present, and then one database
entry per reflog entry; the entries are stored consecutively and
immediately following the header so that it's fast to iterate over them.
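
Schematically, the key space looks something like this (purely
illustrative, not the exact byte encoding):

  refs/heads/master              -> <sha1>
  logs/refs/heads/master         -> <reflog header>
  logs/refs/heads/master + <1>   -> <oldest reflog entry>
  logs/refs/heads/master + <2>   -> <next reflog entry>
  ...

Since LMDB keeps keys sorted in a B-tree, a ref's log entries end up
adjacent, and iterating over one reflog is a sequential cursor walk.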

> It has also been a dream of mine to stop tying the reflogs specifically
> to the refs. I.e., have a spot for reflogs of branches that no longer
> exist, which allows us to retain them for deleted branches. Then you can
> possibly recover from a branch deletion, whereas now you have to dig
> through "git fsck"'s dangling output. And the reflog, if you don't
> expire it, becomes a suitable audit log to find out what happened to
> each branch when (whereas now it is full of holes when things get
> deleted).

That would be cool, and I don't think it would be hard to add to my
current code; we could simply replace the header with a "tombstone".
But I would prefer to wait until the series is merged; then we can build
on top of it.

> I dunno. Maybe I am overthinking it. But it really feels like the _refs_
> are a key/value thing, but the _reflogs_ are not. You can cram them into
> a key/value store, but you're probably operating on them as a big blob,
> then.

Reflogs are, conceptually, queues. I agree that a raw key-value store is
not a good way to store queues, but a B-Tree is not so terrible, since
it offers relatively fast iteration (amortized constant time IIRC).

> > I chose to use LMDB for the database.  LMDB has a few features that make
> > it suitable for usage in git:
> 
> One of the complaints that Shawn had about sqlite is that there is no
> native Java implementation, which makes it hard for JGit to ship a
> compatible backend. I suspect the same is true for LMDB, but it is
> probably a lot simpler than sqlite (so reimplementation might be
> possible).
> 
> But it may also be worth going with a slightly slower database if we can
> get wider compatibility for free.

There's a JNI interface to LMDB, which is, of course, not native.  I
don't think it would be too hard to entirely rewrite LMDB in Java, but
I'm not going to have time to do it for the foreseeable future.  I've
asked Howard Chu if he knows of any efforts in progress.

> > To test this backend's correctness, I hacked test-lib.sh and
> > test-lib-functions.sh to run all tests under the refs backend. Dozens
> > of tests use manual ref/reflog reading/writing, or create submodules
> > without passing --refs-backend-type to git init.  If those tests are
> > changed to use the update-ref machinery or test-refs-be-db (or, in the
> > case of packed-refs, corrupt refs, and dumb fetch tests, are skipped),
> > the only remaining failing tests are the git-new-workdir tests and the
> > gitweb tests.
> 
> I think we'll need to bump core.repositoryformatversion, too. See the
> patches I just posted here:
> 
>   http://thread.gmane.org/gmane.comp.version-control.git/272447

Thanks, that's valuable.  For the refs backend, opening the LMDB
database for writing is sufficient to block other writers.  Do you think
it would be valuable to provide a git hold-ref-lock command that simply
reads refs from stdin and keeps them locked until it reads EOF from
stdin?  That would allow cross-backend ref locking. 


* Re: RFC/Pull Request: Refs db backend
  2015-06-23 10:23   ` Duy Nguyen
@ 2015-06-23 18:47     ` David Turner
  0 siblings, 0 replies; 26+ messages in thread
From: David Turner @ 2015-06-23 18:47 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Junio C Hamano, git mailing list

On Tue, 2015-06-23 at 17:23 +0700, Duy Nguyen wrote:
> On Tue, Jun 23, 2015 at 7:50 AM, David Turner
> <dturner@twopensource.com> wrote:
> > To test this backend's correctness, I hacked test-lib.sh and
> > test-lib-functions.sh to run all tests under the refs backend.
> 
> Now we have two. split-index also benefits from running through the
> full test suite like this. I propose we make "make test" run the test
> suite twice. The first run is with the default configuration: no split
> index, no fancy ref backend. The second run enables split-index and
> switches to the new backend, running through all test cases. In the
> future we can also enable packv4 in this second run. There won't be a
> third run.
> 
> When the second ref backend comes, we can switch between the two
> backends using a random number generator where we control both
> algorithm and seed, so that when a test fails, the user can give us
> their seed and we can re-run with the same configuration.

I'm not in love with this idea, because it makes it hard to do
exhaustive testing efficiently.  I would rather have make test run
through all tests under all combinations -- or at least all relevant
tests.  We could perhaps mark tests with a list of features that they
exercise, so that we don't have to run e.g. t8xxx with alternate refs
backends.  

> > Dozens of tests use manual ref/reflog reading/writing, or create
> > submodules without passing --refs-backend-type to git init.  If those
> > tests are changed to use the update-ref machinery or test-refs-be-db
> > (or, in the case of packed-refs, corrupt refs, and dumb fetch tests,
> > are skipped), the only remaining failing tests are the
> > git-new-workdir tests and the gitweb tests.
> 
> I haven't read the series, but I guess you should also add a few tests
> to run on the first run, so new code is exercised a bit even if people
> skip the second run.

I did this already, yes.


* Re: RFC/Pull Request: Refs db backend
  2015-06-23 15:51 ` Michael Haggerty
@ 2015-06-23 19:53   ` David Turner
  2015-06-23 21:27     ` Michael Haggerty
  2015-06-23 21:35     ` David Turner
  0 siblings, 2 replies; 26+ messages in thread
From: David Turner @ 2015-06-23 19:53 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: git mailing list

On Tue, 2015-06-23 at 17:51 +0200, Michael Haggerty wrote:
> On 06/23/2015 02:50 AM, David Turner wrote:
> > I've revived and modified Ronnie Sahlberg's work on the refs db
> > backend.  
> > 
> > The work is on top of be3c13e5564, Junio's "First batch for 2.5 cycle".
> > I recognize that there have been changes to the refs code since then,
> > and that there are some further changes in-flight from e.g. Michael
> > Haggerty.  If there is interest in this, I can rebase once Michael's
> > changes land.
> 
> It's awesome that you are working on this!
> 
> I'm reading through your commits and will add comments as they pop into
> my head...
> 
> * I initially read "refs-be-files" to be a short version of "references,
> they be files". I might never be able to get that pronunciation out of
> my head :-)

That's OK so long as I can keep pronouncing "reflog" as "re-flog". ;)

> * It would be more modest to call the files implementing the LMDB
> backend "refs-be-lmdb.[c,h]" rather than "refs-be-db.[c,h]".

Agreed.  Will fix.

> * I wonder whether `refname_is_safe()` might eventually have to become
> backend-specific. For example, maybe one backend will have to impose a
> limit of 128 characters or something? No matter, though...it can be
> moved later.

I think it would be an error to allow backends to impose additional
limits on ref names.  The limits imposed by the current backend have
been the cause of much sadness here at Twitter (primarily,
case-conflicts combined with d/f conflicts).

> * You have put `format_reflog_msg()` in the public interface. It
> probably makes sense, because more than one backend might want to use
> it. But another backend might want to store (refname, old_sha1,
> new_sha1, ...) as separate columns in a database. As long as
> `format_reflog_msg()` is seen as a helper function and is not called by
> any of the shared code, it shouldn't be a problem.

Agreed.

> * I wonder whether `init_backend()` will be general enough. 

We can always upgrade it later.

> * Your "methods for bulk updates" are I think analogous to the
> `initial_ref_transaction_commit()` function that I recently submitted
> [1]. Either way, the goal is to abstract away the fact that the
> file-based backend uses packed and loose references with tradeoffs that
> callers currently have to know about.

Yes, I saw your work after I had already started mine.

> * I don't like the fact that you have replaced `struct ref_transaction
> *` with `void *` in the public interface. On a practical level, I like
> the bit of type-safety that comes with the more specific declaration.
> But on a more abstract level, I think that the concept of a transaction
> could be useful across backends, for example in utility functions that
> verify that a proposed set of updates are internally consistent. I would
> rather see either
> 
>   * backends "extend" a basic `struct ref_transaction` to suit their
> needs, and upcast/downcast pointers at the module boundary, or
> 
>   * `struct ref_transaction` itself gets a `void *` member that backends
> can use for whatever purposes they want.

There are no common fields between refs-be-file transactions and
refs-be-lmdb transactions.  I don't see much gain from adding an empty
ref_transaction that backends could extend, since we would have to
explicitly upcast/downcast all over the place.

> * Regarding MERGE_HEAD: you take the point of view that it must continue
> to be stored as a file. And yet it must also behave somewhat like a
> reference; for example, `git rev-parse MERGE_HEAD` works today.
> MERGE_HEAD is also used for reachability, right?
> 
> Another point of view is that MERGE_HEAD is a plain old boring
> reference, but there is some other metadata related to it that the refs
> backend has to store. The file-based backend would have special-case
> code to read the additional data from the tail of the loose refs file
> (and be sure to write the metadata when writing the reference), but
> other backends could store the reference with the rest but do their own
> thing with the metadata. So I guess I'm wondering whether the refs API
> needs a MERGE_HEAD-specific way to read and write MERGE_HEAD along with
> its metadata.

You are probably right that this is a good idea.

> * Don't the same considerations that apply to MERGE_HEAD also apply to
> FETCH_HEAD?

All of the tests pass without any special handling of FETCH_HEAD.

> * I'm showing my ignorance of LMDB, but I don't see where the LMDB
> backend initializes its database during `git init-db`. Is that what
> `init_env()` does? But it looks like `init_env()` is called on every git
> invocation (via `git_config_early()`). Puzzled.

There is no need to explicitly create the database (other than with
mkdir); init_env does everything for you.

> * Meanwhile, `create_default_files()` (in `builtin/init-db`) still
> creates directories `refs`, `refs/heads`, and `refs/tags`.

Yeah, that's legit.  I'll create a backend initdb function, and rename
init to setup.

> * Rehash of the last two points: I expected one backend function that is
> used to initialize the refs backend when a new repository is created
> (e.g., in `git init`). The file-based backend would use this function to
> create the `refs`, `refs/heads`, and `refs/tags` directories. I expected
> a second function that is called once every time git runs in an existing
> repository (this one might, for example, open a database connection).
> And maybe even a third one that closes down the database connection
> before git exits. Would you please explain how this actually works?

LMDB doesn't really have the concept of a "connection".  It's basically
just a couple of files that communicate using shared memory (and maybe
some other locking that I haven't paid attention to).  There is the
concept of a "transaction", which is the unit of concurrency (each
thread may only have one open transaction).  Transactions are either
read-only or read-write, and there can only be one read-write
transaction open at a time (across the entire system).  Read-only
transactions take a snapshot of the DB state at transaction start time.
This combination of features means that we need to be a bit clever about
read-only transactions; if a read-write transaction occurs in a separate
process, we need to restart any read-only transactions to pick up its
changes.
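
LMDB makes that restart cheap: read-only transactions can be recycled
with mdb_txn_reset()/mdb_txn_renew().  Roughly (error handling
omitted):

  #include <lmdb.h>

  /* Refresh a read-only snapshot after another process commits. */
  static void refresh_snapshot(MDB_env *env, MDB_txn **read_txn)
  {
      if (*read_txn) {
          mdb_txn_reset(*read_txn);    /* release the old snapshot */
          mdb_txn_renew(*read_txn);    /* reacquire at the current state */
      } else {
          mdb_txn_begin(env, NULL, MDB_RDONLY, read_txn);
      }
  }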

Requiring an explicit disconnect from the database would be problematic
because of the number of situations in which git just calls die(). If a
backend desires a disconnect, it would be best to just call atexit().

> * `lmdb_init_backend()` leaks `path` if `env` is already set (in which
> case it needn't compute `path` in the first place).

Will fix, thanks.

> * You have the constraint that submodules need to use the same reference
> backend as the main repository. But it looks like each submodule has its
> own independent database. So why the restriction?
>
> It might be a sign that the design is not optimal if it is only able to
> handle one *type* of reference backend in a single run.

Yes, that is the reason.  I think it would be rather difficult to fix
this, but I guess it's possible.

> In object-oriented language, I would expect a `Refs` object to represent
> the reference storage for a single repository or submodule. The VTABLE
> for the object would be a `struct ref_be`. But the object should also
> have an object pointer that can store per-instance data. I think the
> per-instance data is missing in your design.

For some of the code, that's the transaction.  But since we only ever
have one transaction, we could just move all that to the `Refs` object.

<snip arguments for this> 

I'll try to write some code and see what this looks like.

> * You explain in the docstring to `lmdb_transaction_begin_flags()` that
> it is dangerous to call a callback function while a write transaction is
> open if the callback function might want to initiate another write
> transaction. This would obviously also apply to running hooks.

I carefully limit the scope of write transactions to avoid problems like
this.

>  This is a
> limitation of LMDB because writers block each other. I can't think of
> anyplace that this would be a problem in our codebase. But if it were,
> it seems to me that you could take an approach like the file-based
> backend, which collects the transaction in a `ref_transaction` data
> structure, and executes the entire transaction within a single call to
> `ref_transaction_commit()`. This approach would prevent callers outside
> of the refs module from ever bumping against this limitation.

The file-based backend does per-ref locking early, and then applies the
transactions. Here, taking the write transaction is how the lmdb backend
does locking.  So the situations are not quite the same.  But I think
keeping the scope of transactions small is the best plan.

> So, that was my stream-of-consciousness about your patch series. Overall
> I like it very much. I have only skimmed it so far, and hardly read the
> last two patches at all, but what I saw all looked good and well-organized.

Thanks.  A fair amount of it is Ronnie's work, and I tried to copy his
approach as much as possible.

> Please CC me on future versions of this patch series, because it is very
> close to my interests. I've put a lot of effort into encapsulating and
> abstracting the refs module with the goal of getting to pluggable
> reference backends (plus some other stuff), so it's great to see what
> you have accomplished!

Will do!

> Let me know if you need any help rebasing your work onto my recent
> changes. It would probably work best if you break your patch series into
> smaller pieces to make them easier for the mailing list to digest. For
> example, the first installment could be the patches that make sense even
> independent of the plan to add support for multiple backends: the first
> two patches, plus the ones related to CHERRY_PICK_HEAD and its cousins,
> the abstraction of the reflog functions, and the `git reflog create` and
> `git reflog exists` subcommands.

I would love some help rebasing. I'll break out the patches you suggest
and send them to the list, then create a new branch with the rest of the
changes.  Would that be a good place for you to start?


* Re: RFC/Pull Request: Refs db backend
  2015-06-23 17:16 ` Stefan Beller
@ 2015-06-23 20:04   ` David Turner
  2015-06-23 20:10     ` Randall S. Becker
  0 siblings, 1 reply; 26+ messages in thread
From: David Turner @ 2015-06-23 20:04 UTC (permalink / raw)
  To: Stefan Beller; +Cc: git mailing list, ronnie sahlberg

On Tue, 2015-06-23 at 10:16 -0700, Stefan Beller wrote:
> > The db backend code was added in the penultimate commit; the rest is
> > just code rearrangement and minor changes to make alternate backends
> > possible.  There ended up being a fair amount of this rearrangement,
> > but the end result is that almost the entire git test suite runs
> > under the db backend without error (see below for details).
> 
> Looking at the end result in refs-be-db.c, it feels like there are more
> functions in the refs_be_db struct; did this originate from other design
> choices? IIRC Ronnie wanted to have as few functions in there as
> possible, and to share as much of the code between the databases as
> possible, so that the glue between the db and the refs code is minimal.

I didn't actually spend that much time reading Ronnie's backend code.
My code aims to be extremely thoroughly compatible.  I spent a ton of
time making sure that the git test suite passed.  I don't know if an
alternate approach would have been as compatible.

The requirement for reflog storage did complicate things a bit.

I also didn't see a strong need to abstract the database, since LMDB is
common, widely compatible, and tiny.  

> Some random comments from looking over the branch briefly:
> 
> In the latest commit (refs: tests for db backend), I am unsure about
> the copyright annotations. At least a sole "Copyright (c) 2007 Junio C
> Hamano" doesn't make sense to me. ;)

Will fix, thanks.

> Typo in commit message "bisect: use refs insfrastructure for 
> BISECT_START"

Will fix, thanks.

> Some commits contain a ChangeId, which is a Gerrit leftover. :(

Those were leftover from Ronnie's patches; since you are a Googler and
you think we don't need them, I'll remove them. 


* RE: RFC/Pull Request: Refs db backend
  2015-06-23 20:04   ` David Turner
@ 2015-06-23 20:10     ` Randall S. Becker
  2015-06-23 20:22       ` David Turner
  0 siblings, 1 reply; 26+ messages in thread
From: Randall S. Becker @ 2015-06-23 20:10 UTC (permalink / raw)
  To: 'David Turner', 'Stefan Beller'
  Cc: 'git mailing list', 'ronnie sahlberg'

> -----Original Message-----
> From: git-owner@vger.kernel.org [mailto:git-owner@vger.kernel.org] On
> Behalf Of David Turner
> Sent: June 23, 2015 4:05 PM
> To: Stefan Beller
> Cc: git mailing list; ronnie sahlberg
> Subject: Re: RFC/Pull Request: Refs db backend
> 
> On Tue, 2015-06-23 at 10:16 -0700, Stefan Beller wrote:
> > > The db backend code was added in the penultimate commit; the rest is
> > > just code rearrangement and minor changes to make alternate backends
> > > possible.  There ended up being a fair amount of this rearrangement,
> > > but the end result is that almost the entire git test suite runs
> > > under the db backend without error (see below for details).
> >
> > Looking at the end result in refs-be-db.c, it feels like there are more
> > functions in the refs_be_db struct; did this originate from other design
> > choices? IIRC Ronnie wanted to have as few functions in there as
> > possible, and to share as much of the code between the databases as
> > possible, so that the glue between the db and the refs code is minimal.
> 
> I didn't actually spend that much time reading Ronnie's backend code.
> My code aims to be extremely thoroughly compatible.  I spent a ton of time
> making sure that the git test suite passed.  I don't know if an alternate
> approach would have been as compatible.
> 
> The requirement for reflog storage did complicate things a bit.
> 
> I also didn't see a strong need to abstract the database, since LMDB is common,
> widely compatible, and tiny.

Just to beg a request: LMDB is not available on some MPP architectures
to which git has been ported. If it comes up, I beg you not to add this
as a dependency to base git components.

Thanks,
Randall

-- Brief whoami: NonStop&UNIX developer since approximately UNIX(421664400)/NonStop(211288444200000000)
-- In my real life, I talk too much.


* Re: RFC/Pull Request: Refs db backend
  2015-06-23 20:10     ` Randall S. Becker
@ 2015-06-23 20:22       ` David Turner
  2015-06-23 20:27         ` Randall S. Becker
  0 siblings, 1 reply; 26+ messages in thread
From: David Turner @ 2015-06-23 20:22 UTC (permalink / raw)
  To: Randall S. Becker
  Cc: 'Stefan Beller', 'git mailing list',
	'ronnie sahlberg'

> Just to beg a request: LMDB is not available on some MPP architectures
> to which git has been ported. If it comes up, I beg you not to add this
> as a dependency to base git components.

My changes make `configure` check for the presence of liblmdb. The LMDB
code is only built if liblmdb is present.  So, I think we're good.


* RE: RFC/Pull Request: Refs db backend
  2015-06-23 20:22       ` David Turner
@ 2015-06-23 20:27         ` Randall S. Becker
  0 siblings, 0 replies; 26+ messages in thread
From: Randall S. Becker @ 2015-06-23 20:27 UTC (permalink / raw)
  To: 'David Turner'
  Cc: 'Stefan Beller', 'git mailing list',
	'ronnie sahlberg'

> -----Original Message-----
> From: git-owner@vger.kernel.org [mailto:git-owner@vger.kernel.org] On
> Behalf Of David Turner
> Sent: June 23, 2015 4:22 PM
> To: Randall S. Becker
> Cc: 'Stefan Beller'; 'git mailing list'; 'ronnie sahlberg'
> Subject: Re: RFC/Pull Request: Refs db backend
> 
> > Just to beg a request: LMDB is not available on some MPP architectures
> > to which git has been ported. If it comes up, I beg you not to add this
> > as a dependency to base git components.
> 
> My changes make `configure` check for the presence of liblmdb. The LMDB
> code is only built if liblmdb is present.  So, I think we're good.

Thanks :) You have no idea how much, from a "burnt by that in other projects" POV.

Cheers,
Randall


* Re: RFC/Pull Request: Refs db backend
  2015-06-23 19:53   ` David Turner
@ 2015-06-23 21:27     ` Michael Haggerty
  2015-06-24 17:31       ` David Turner
  2015-06-23 21:35     ` David Turner
  1 sibling, 1 reply; 26+ messages in thread
From: Michael Haggerty @ 2015-06-23 21:27 UTC (permalink / raw)
  To: David Turner; +Cc: git mailing list

On 06/23/2015 09:53 PM, David Turner wrote:
> On Tue, 2015-06-23 at 17:51 +0200, Michael Haggerty wrote:
> [...]
>> * I don't like the fact that you have replaced `struct ref_transaction
>> *` with `void *` in the public interface. On a practical level, I like
>> the bit of type-safety that comes with the more specific declaration.
>> But on a more abstract level, I think that the concept of a transaction
>> could be useful across backends, for example in utility functions that
>> verify that a proposed set of updates are internally consistent. I would
>> rather see either
>>
>>   * backends "extend" a basic `struct ref_transaction` to suit their
>> needs, and upcast/downcast pointers at the module boundary, or
>>
>>   * `struct ref_transaction` itself gets a `void *` member that backends
>> can use for whatever purposes they want.
> 
> There are no common fields between refs-be-file transactions and
> refs-be-lmdb transactions.  I don't see much gain from adding an empty
> ref_transaction that backends could extend, since we would have to
> explicitly upcast/downcast all over the place.

If you ask me, it would be better to do a bunch of up/downcasts within
the single module (via two helper functions that could even do
consistency checks) than have no help from the compiler in preventing
people from passing unrelated pointer types into the `void *transaction`
argument. Plus the `struct ref_transaction *` variables scattered
throughout the code are a lot more self-explanatory than `void *`.
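
To make this concrete, here is a minimal sketch of the pattern I have
in mind (the member names are only illustrative):

    struct ref_transaction {
        /* common, backend-agnostic state could live here */
        unsigned int nr_updates;
    };

    struct lmdb_transaction {
        struct ref_transaction base; /* must be the first member */
        MDB_txn *txn;
    };

    static struct lmdb_transaction *to_lmdb_transaction(
        struct ref_transaction *trans)
    {
        /* a consistency check on a backend tag could go here */
        return (struct lmdb_transaction *)trans;
    }

The upcast back is just &t->base for an lmdb_transaction *t, so the
casts stay confined to the backend module.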

>> * Regarding MERGE_HEAD: you take the point of view that it must continue
>> to be stored as a file. And yet it must also behave somewhat like a
>> reference; for example, `git rev-parse MERGE_HEAD` works today.
>> MERGE_HEAD is also used for reachability, right?
>>
>> Another point of view is that MERGE_HEAD is a plain old boring
>> reference, but there is some other metadata related to it that the refs
>> backend has to store. The file-based backend would have special-case
>> code to read the additional data from the tail of the loose refs file
>> (and be sure to write the metadata when writing the reference), but
>> other backends could store the reference with the rest but do their own
>> thing with the metadata. So I guess I'm wondering whether the refs API
>> needs a MERGE_HEAD-specific way to read and write MERGE_HEAD along with
>> its metadata.
> 
> You are probably right that this is a good idea.
> 
>> * Don't the same considerations that apply to MERGE_HEAD also apply to
>> FETCH_HEAD?
> 
> All of the tests pass without any special handling of FETCH_HEAD.

That's odd. From git-fetch.txt:

    The names of refs that are fetched, together with the object names
    they point at, are written to `.git/FETCH_HEAD`.  This information
    may be used by scripts or other git commands, such as
    linkgit:git-pull[1].

It seems like the test suite is reading FETCH_HEAD via the refs API in a
couple of places. I don't understand why these don't fail when LMDB is
being used...

>> * I'm showing my ignorance of LMDB, but I don't see where the LMDB
>> backend initializes its database during `git init-db`. Is that what
>> `init_env()` does? But it looks like `init_env()` is called on every git
>> invocation (via `git_config_early()`). Puzzled.
> 
> There is no need to explicitly create the database (other than with
> mkdir); init_env does everything for you.

OK.

>> * Rehash of the last two points: I expected one backend function that is
>> used to initialize the refs backend when a new repository is created
>> (e.g., in `git init`). The file-based backend would use this function to
>> create the `refs`, `refs/heads`, and `refs/tags` directories. I expected
>> a second function that is called once every time git runs in an existing
>> repository (this one might, for example, open a database connection).
>> And maybe even a third one that closes down the database connection
>> before git exits. Would you please explain how this actually works?
> 
> LMDB doesn't really have the concept of a "connection".  It's basically
> just a couple of files that communicate using shared memory (and maybe
> some other locking that I haven't paid attention to).  There is the
> concept of a "transaction", which is the unit of concurrency (each
> thread may only have one open transaction).  Transactions are either
> read-only or read-write, and there can only be one read-write
> transaction open at a time (across the entire system).  Read-only
> transactions take a snapshot of the DB state at transaction start time.
> This combination of features means that we need to be a bit clever about
> read-only transactions; if a read-write transaction occurs in a separate
> process, we need to restart any read-only transactions to pick up its
> changes.
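
(If I understand you correctly, in LMDB API terms that dance is
roughly the following, with error handling omitted and env being the
already-opened MDB_env:

    MDB_txn *txn;

    /* take a read-only snapshot of the refs database */
    mdb_txn_begin(env, NULL, MDB_RDONLY, &txn);
    /* ... reads here see a frozen view of the refs ... */

    /*
     * after another process may have committed a write transaction,
     * drop the snapshot and take a fresh one to see its changes
     */
    mdb_txn_reset(txn);
    mdb_txn_renew(txn);

Correct me if that is not what the backend does.)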

If you are thinking about an *unrelated* separate process, then Git's
philosophy is that if our process is reading *some* valid state of the
references, it's all good even if that state is not quite the newest.
After all, who's to say whether our process ran before or after the
other process? As long as each process sees self-consistent views of the
world as it existed at some recent time, we're satisfied.

To be sure, we even fail at that unambitious goal, because loose
reference reads are not atomic. It is possible that we read some
references in the state they had before the other process ran, and
others in the state they had after the other process was finished. This
can get ugly if, for example, the other process renamed a reference,
because we might not see the reference under either its old *or* its new
name. We might therefore conclude that the objects reachable from that
reference are dangling and garbage-collect them.

If the other process is one that we started ourselves, then that is a
different situation and we would want, for example, to invalidate our
reference cache after the other process is done.

One of the long-standing hopes of a DB-backed reference backend would be
to improve this situation--allowing atomic writes *and* reads.

> [...]
> 
>> * You explain in the docstring to `lmdb_transaction_begin_flags()` that
>> it is dangerous to call a callback function while a write transaction is
>> open if the callback function might want to initiate another write
>> transaction. This would obviously also apply to running hooks.
> 
> I carefully limit the scope of write transactions to avoid problems like
> this.
> 
>>  This is a
>> limitation of LMDB because writers block each other. I can't think of
>> anyplace that this would be a problem in our codebase. But if it were,
>> it seems to me that you could take an approach like the file-based
>> backend, which collects the transaction in a `ref_transaction` data
>> structure, and executes the entire transaction within a single call to
>> `ref_transaction_commit()`. This approach would prevent callers outside
>> of the refs module from ever bumping against this limitation.
> 
> The file-based backend does per-ref locking early, and then applies the
> transactions. Here, taking the write transaction is how the lmdb backend
> does locking.  So the situations are not quite the same.  But I think
> keeping the scope of transactions small is the best plan.

The file-based backend locks the references early *within
ref_transaction_commit()*, not as the transaction is being built up
using ref_transaction_update() etc. This is a big difference, because
code anywhere can call

    ref_transaction_begin(...);
    ANYTHING
    ref_transaction_update(...);
    ANYTHING
    ref_transaction_commit(...);

The only way to be sure that ANYTHING can't create a deadlock with the
open transaction (for example by calling a hook script that runs a git
command) is to audit all of that code now and in the future. Whereas the
file-based backend doesn't do anything that is externally observable or
deadlocky except within the single ref_transaction_commit() function
call, so only that one function has to be audited for actions that could
cause a deadlock.

> [...]
>> Let me know if you need any help rebasing your work onto my recent
>> changes. It would probably work best if you break your patch series into
>> smaller pieces to make them easier for the mailing list to digest. For
>> example, the first installment could be the patches that make sense even
>> independent of the plan to add support for multiple backends: the first
>> two patches, plus the ones related to CHERRY_PICK_HEAD and its cousins,
>> the abstraction of the reflog functions, and the `git reflog create` and
>> `git reflog exists` subcommands.
> 
> I would love some help rebasing. I'll break out the patches you suggest
> and send them to the list, then create a new branch with the rest of the
> changes.  Would that be a good place for you to start?

That sounds like a good next step, maybe after waiting a day or so to
see if there are any fundamental objections to what you have done so far.

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: RFC/Pull Request: Refs db backend
  2015-06-23 19:53   ` David Turner
  2015-06-23 21:27     ` Michael Haggerty
@ 2015-06-23 21:35     ` David Turner
  2015-06-23 21:41       ` Junio C Hamano
  1 sibling, 1 reply; 26+ messages in thread
From: David Turner @ 2015-06-23 21:35 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: git mailing list

On Tue, 2015-06-23 at 15:53 -0400, David Turner wrote:
> > * Regarding MERGE_HEAD: you take the point of view that it must continue
> > to be stored as a file. And yet it must also behave somewhat like a
> > reference; for example, `git rev-parse MERGE_HEAD` works today.
> > MERGE_HEAD is also used for reachability, right?
> > 
> > Another point of view is that MERGE_HEAD is a plain old boring
> > reference, but there is some other metadata related to it that the refs
> > backend has to store. The file-based backend would have special-case
> > code to read the additional data from the tail of the loose refs file
> > (and be sure to write the metadata when writing the reference), but
> > other backends could store the reference with the rest but do their own
> > thing with the metadata. So I guess I'm wondering whether the refs API
> > needs a MERGE_HEAD-specific way to read and write MERGE_HEAD along with
> > its metadata.
> 
> You are probably right that this is a good idea.

On reflection, I think it might make sense to keep MERGE_HEAD as a file.
The problem is that not only would refs backends have to add new
MERGE_HEAD-handling functions, but we would also need new plumbing
commands to allow scripts to access the complete contents of MERGE_HEAD.
That seems more complicated to me.  

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: RFC/Pull Request: Refs db backend
  2015-06-23 21:35     ` David Turner
@ 2015-06-23 21:41       ` Junio C Hamano
  0 siblings, 0 replies; 26+ messages in thread
From: Junio C Hamano @ 2015-06-23 21:41 UTC (permalink / raw)
  To: David Turner; +Cc: Michael Haggerty, git mailing list

David Turner <dturner@twopensource.com> writes:

> On Tue, 2015-06-23 at 15:53 -0400, David Turner wrote:
>> > * Regarding MERGE_HEAD: you take the point of view that it must continue
>> > to be stored as a file. And yet it must also behave somewhat like a
>> > reference; for example, `git rev-parse MERGE_HEAD` works today.
>> > MERGE_HEAD is also used for reachability, right?
>> > 
>> > Another point of view is that MERGE_HEAD is a plain old boring
>> > reference, but there is some other metadata related to it that the refs
>> > backend has to store. The file-based backend would have special-case
>> > code to read the additional data from the tail of the loose refs file
>> > (and be sure to write the metadata when writing the reference), but
>> > other backends could store the reference with the rest but do their own
>> > thing with the metadata. So I guess I'm wondering whether the refs API
>> > needs a MERGE_HEAD-specific way to read and write MERGE_HEAD along with
>> > its metadata.
>> 
>> You are probably right that this is a good idea.
>
> On reflection, I think it might make sense to keep MERGE_HEAD as a file.
> The problem is that not only would refs backends have to add new
> MERGE_HEAD-handling functions, but we would also need new plumbing
> commands to allow scripts to access the complete contents of MERGE_HEAD.
> That seems more complicated to me.  

I think you are talking about FETCH_HEAD, but I tend to agree.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: RFC/Pull Request: Refs db backend
  2015-06-23 11:47 ` Jeff King
  2015-06-23 13:10   ` Duy Nguyen
  2015-06-23 18:18   ` David Turner
@ 2015-06-24  6:09   ` Shawn Pearce
  2015-06-24  9:49     ` Jeff King
  2015-06-24 10:18     ` Duy Nguyen
  2 siblings, 2 replies; 26+ messages in thread
From: Shawn Pearce @ 2015-06-24  6:09 UTC (permalink / raw)
  To: Jeff King; +Cc: David Turner, git mailing list

On Tue, Jun 23, 2015 at 4:47 AM, Jeff King <peff@peff.net> wrote:
>
> One of the problems we've had with large-ref repos is that the reflog
> storage is quite inefficient.

Yup. We ran into this with Gerrit Code Review years ago. The
refs/changes/... namespace created by Gerrit Code Review is 1 ref per
snapshot per code review, and never modified. Reflogs for these are
always exactly one record. We broke down and modified JGit to add an
API that allowed Gerrit Code Review to disable recording reflogs for
specific updates just to avoid creating reflogs under refs/changes/.

In our JGit DFS implementation we store reflogs in databases to
eliminate these overheads. It works well for us. Hopefully the feature
can come to git-core through this series.

> It has also been a dream of mine to stop tying the reflogs specifically
> to the refs. I.e., have a spot for reflogs of branches that no longer
> exist, which allows us to retain them for deleted branches. Then you can
> possibly recover from a branch deletion, whereas now you have to dig
> through "git fsck"'s dangling output. And the reflog, if you don't
> expire it, becomes a suitable audit log to find out what happened to
> each branch when (whereas now it is full of holes when things get
> deleted).

Yes. $DAY_JOB's DFS implementation never expires reflogs, allowing it
to be used as a history to inspect what happened. It's been useful a
couple of times to investigate and recover from a few accidental
deletions.

Once you never expire reflog records, you have to consider at what
point you stop paying attention to the reflog entries for graph
reachability during repack and fsck. Users still expect to be able to
force push or delete a branch and have a set of objects disappear from
the repository.

I am looking forward to something like this in git-core. I delete
branches in my local repos and then regret it. Then I remember HEAD
has a reflog and hope I can find the lost tip somewhere in there.
Usually I fail, and am sad. :(

> I was thinking of actually moving to a log-structured ref storage.
> Something like:
>
>   - any ref write puts a line at the end of a single logfile that
>     contains the ref name, along with the normal reflog data
>
>   - the logfile is the source of truth for the ref state. If you want to
>     know the value of any ref, you can read it backwards to find the
>     last entry for the ref. Everything else is an optimization.
>
>     Let's call the number of refs N, and the number of ref updates in
>     the log U.
>
>   - we keep a key/value index mapping the name of any branch that exists
>     to the byte offset of its entry in the logfile. This would probably
>     be in some binary key/value store (like LMDB). Without this,
>     resolving a ref is O(U), which is horrible. With it, it should be
>     O(1) or O(lg N), depending on the index data structure.

This ... would be fantastic.

There are some issues with append. Before appending we would need to
verify that the last record actually ends with an LF. If there was a
power failure and only part of the last record was written, you can't
append without that record separator in place.

If that last record was truncated, and an LF was wedged in to do a new
append, we can't trust that intermediate record. A CRC at the end of
each record would make it possible to tell whether the record is intact
or bogus due to an earlier failed write.
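
The LF check is cheap, something like this on the append path
(hypothetical helper; error handling mostly omitted):

    #include <unistd.h>

    /*
     * Return 1 if the log ends cleanly (empty, or last byte is LF),
     * so a new record can be appended without a fixup.
     */
    static int log_ends_cleanly(int fd)
    {
        char last;

        if (lseek(fd, -1, SEEK_END) < 0)
            return 1; /* empty log (or seek error); nothing to fix up */
        if (read(fd, &last, 1) != 1)
            return 0;
        return last == '\n';
    }

The CRC would then be a trailing field in each record, so a wedged-in
LF can't make a torn write look like a valid entry.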

What about the case of never expiring the reflog? This log would grow
forever. You may eventually need to archive old sections of it (e.g. 1
year ago?) to maintain an audit log, while keeping the "latest" entry
for each ref to rebuild the index.

>   - the index can also contain other optimizations. E.g., rather than
>     point to the entry in the logfile, it can include the sha1 directly
>     (to avoid an extra level of indirection). It may want to include the
>     "peeled" value, as the current packed-refs file does.

+1 to always storing the peeled value. This was a major improvement
for $DAY_JOB's Git servers as peeling tags on the fly can be costly
when your storage is something remote, such as NFS. Unfortunately the
current wire protocol demands peeled information to serve a ref
advertisement.

One thing we do is always peel all refs. We record a bit to state it's
been peeled, but there is no peeled value because the ref is pointing
to a non-tag object (e.g. refs/heads/master points to a commit).

I guess this puts an index structure at something like:

  refname \0 log_idx_4 sha1_20 ('n' | 'p' sha1_20)

Or refname + 26 bytes for heads and refname + 46 bytes for tags.
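
In C terms, the fixed-width tail after the NUL-terminated refname
would be something like (field names invented; on disk the peeled
field is simply absent for 'n' entries):

    #include <stdint.h>

    struct ref_index_tail {
        uint32_t log_idx;         /* offset of latest log record */
        unsigned char sha1[20];   /* current value of the ref */
        unsigned char peel_state; /* 'n' or 'p' */
        unsigned char peeled[20]; /* only when peel_state == 'p' */
    };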


Updating the index on updates to a ref would be costly, as it's O(N).
You could skip some index updates. Record in the header of the index
the length of the reflog file used to build it. When reading the
index, scan the reflog from that position to the end and patch those
updates in memory. Rewrites of the index could then be deferred until
the scan delta on the log is high, or the next gc.

>   - Reading all of the reflogs (e.g., for repacking) is O(U), just like
>     it is today. Except the storage for the logfile is a lot more
>     compact than what we store today, with one reflog per file.
>
>   - Reading a single reflog is _also_ O(U), which is not as good as
>     today. But if each log entry contains a byte offset of the previous
>     entry, you can follow the chain (it is still slightly worse, because
>     you are jumping all over the file, rather than reading a compact set
>     of lines).

But this is like saying reading `git log` is bad because we jump all
over the pack file to parse ancestors and insert them into the
revqueue at the correct position. Feh.

I think given the typical size of reflogs, this is irrelevant.

>   - Pruning the reflog entries from the logfile requires rewriting the
>     whole thing. That's similar to today, where we rewrite each of the
>     reflog files.
>
> One of the nice properties of this system is that it should be very
> resilient to corruption and races. Most of the operations are either
> appending to a file, or writing to a tempfile and renaming in place.
> The exception is the key/value index, but if we run into any problems
> there, it can be rebuilt by walking over the logfile (for a cost of
> O(U)).
>
> I dunno. Maybe I am overthinking it.

Not really. Your idea is quite simple. I like it.

> But it really feels like the _refs_
> are a key/value thing, but the _reflogs_ are not. You can cram them into
> a key/value store, but you're probably operating on them as a big blob,
> then.

+1. Refs are key/value but you need all of the key/value pairs fast
for the current wire protocol.

Reflogs are a long queue that is more or less just table scanned when
it's accessed.

>> I chose to use LMDB for the database.  LMDB has a few features that make
>> it suitable for usage in git:
>
> One of the complaints that Shawn had about sqlite is that there is no
> native Java implementation, which makes it hard for JGit to ship a
> compatible backend. I suspect the same is true for LMDB, but it is
> probably a lot simpler than sqlite (so reimplementation might be
> possible).

Yes. Whatever the default standard format is for git-core, we need
that format to be easily supportable from JGit. Loading code via JNI
is not "easily supportable".

Non-default formats that the user can opt-into (and opt-out of) don't
need JGit compatibility. Users can choose to use $FANCY_DB or JGit and
make the tradeoff that is best for them. If JGit is also able to do
$FANCY_DB, great. If not, that's fine too. Not everyone needs JGit.
Not everyone needs $FANCY_DB.

IIRC some part of Ronnie's series was about setting up a socket
protocol between Git and the ref backend. If non-default backends are
like this, JGit could spawn the backend binary and then speak the
socket protocol just like git-core. This would be preferable to
linking JNI into the JVM.

Think remote helper protocol between transport.c and the helpers. JGit
doesn't yet speak that, but it should, and there is no technical
reason why it can't. Same for a ref helper protocol.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: RFC/Pull Request: Refs db backend
  2015-06-23 13:10   ` Duy Nguyen
@ 2015-06-24  8:51     ` Jeff King
  0 siblings, 0 replies; 26+ messages in thread
From: Jeff King @ 2015-06-24  8:51 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: David Turner, git mailing list

On Tue, Jun 23, 2015 at 08:10:03PM +0700, Duy Nguyen wrote:

> >   - we keep a key/value index mapping the name of any branch that exists
> >     to the byte offset of its entry in the logfile. This would probably
> 
> One key/value mapping per branch, pointing to the latest reflog entry,
> or one key/value mapping for each reflog entry?

Yeah, sorry, I meant to point only to the latest entry (and then from
there if you want to actually walk the reflog, you can do so by
following the backreference to the previous entry).

> >     be in some binary key/value store (like LMDB). Without this,
> >     resolving a ref is O(U), which is horrible. With it, it should be
> >     O(1) or O(lg N), depending on the index data structure.
> 
> I'm thinking of the user with small or medium repos, in terms of refs,
> who does not want an extra dependency. If we store one mapping per
> branch, then the size of this mapping is small enough that the index
> in a text file is ok. If we also store the offset to the previous
> reflog entry of the same branch in the current reflog entry, like a
> back pointer, then we could jump back faster.
> 
> Or do you have something else in mind? Current reflog structure won't
> work because I think you bring back the reflog graveyard with this,
> and I don't want to lose that.

I hadn't really thought about having multiple formats for the index. But
in theory, yes, you could, and the lowest common denominator could just
use the filesystem. Or even something similar to the packed-refs file,
where we have to write the whole thing to make a single update. That
doesn't perform well, but it's dirt simple and might be OK if you have
only a handful of refs.

-Peff

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: RFC/Pull Request: Refs db backend
  2015-06-23 18:18   ` David Turner
@ 2015-06-24  9:14     ` Jeff King
  2015-06-24 17:29       ` David Turner
  0 siblings, 1 reply; 26+ messages in thread
From: Jeff King @ 2015-06-24  9:14 UTC (permalink / raw)
  To: David Turner; +Cc: git mailing list

On Tue, Jun 23, 2015 at 02:18:36PM -0400, David Turner wrote:

> > Can you describe a bit more about the reflog handling?
> > 
> > One of the problems we've had with large-ref repos is that the reflog
> > storage is quite inefficient. You can pack all the refs, but you may
> > still be stuck with a bunch of reflog files with one entry, wasting a
> > whole inode. Doing a "git repack" when you have a million of those has
> > horrible cold-cache performance. Basically anything that isn't
> > one-file-per-reflog would be a welcome change. :)
> 
> Reflogs are stored in the database as well.  There is one header entry
> per ref to indicate that a reflog is present, and then one database
> entry per reflog entry; the entries are stored consecutively and
> immediately following the header so that it's fast to iterate over them.

OK, that makes sense. I did notice that the storage for the refdb grows
rapidly. If I add a million refs (like refs/tags/$i) with a simple
reflog message "foo", I end up with a 500MB database file.

That's _probably_ OK, because a million is getting into crazy
territory[1].  But it's 500 bytes per ref, each with one reflog entry.
Our ideal lower bound is probably something like 100 bytes per reflog
entry:

  - 20 bytes for old sha1
  - 20 bytes for new sha1
  - ~50 bytes for name, email, timestamp
  - ~6 bytes for refname (1000000 is the longest unique part)

That assumes we store binary[2] (and not just the raw reflog lines), and
reconstruct the reflog lines on the fly. It also assumes we use some
kind of trie-like storage (where we can amortize the cost of storing
"refs/tags/" across all of the entries).

Of course that neglects lmdb's overhead, and the storage of the ref tip
itself. But it would hopefully give us a ballpark for an optimal
solution. We don't have to hit that, of course, but it's food for
thought.

[1] The homebrew/homebrew repository on GitHub has almost half a million
    ref updates. Since this is storing not just refs but all ref
    updates, that's actually the interesting number (and optimizing the
    per-reflog-entry size is more interesting than the per-ref size).

[2] I'm hesitant to suggest binary formats in general, but given that
    this is a blob embedded inside lmdb, I think it's OK. If we were to
    pursue the log-structured idea I suggested earlier, I'm torn on
    whether it should be binary or not.

> > It has also been a dream of mine to stop tying the reflogs specifically
> > to the refs. I.e., have a spot for reflogs of branches that no longer
> > exist, which allows us to retain them for deleted branches.
> [...]
> That would be cool, and I don't think it would be hard to add to my
> current code; we could simply replace the header with a "tombstone".
> But I would prefer to wait until the series is merged; then we can build
> on top of it.

Yeah, I think you can add it easily to basically any system that does
not have the filesystem D/F conflicts in its storage (i.e., having
"refs/foo" does not block data under "refs/foo/bar").

> > But it may also be worth going with a slightly slower database if we can
> > get wider compatibility for free.
> 
> There's a JNI interface to LMDB, which is, of course, not native.  I
> don't think it would be too hard to entirely rewrite LMDB in Java, but
I'm not going to have time to do it for the foreseeable future.  I've
> asked Howard Chu if he knows of any efforts in progress.

Yeah, I think JNI is not enough for Eclipse folks. I don't think this is
a task that you would necessarily need to take on. More just something
to think about for the future when picking a format.

> Thanks, that's valuable.  For the refs backend, opening the LMDB
> database for writing is sufficient to block other writers.  Do you think
> it would be valuable to provide a git hold-ref-lock command that simply
> reads refs from stdin and keeps them locked until it reads EOF from
> stdin?  That would allow cross-backend ref locking.

I'm not sure what you would use it for. If you want to update the refs,
then you can specify a whole transaction with "git update-ref --stdin",
and that should work whatever backend you choose. Is there some other
operation you want where you hold the lock for a longer period of time?

-Peff

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: RFC/Pull Request: Refs db backend
  2015-06-24  6:09   ` Shawn Pearce
@ 2015-06-24  9:49     ` Jeff King
  2015-06-25  1:08       ` brian m. carlson
  2015-06-24 10:18     ` Duy Nguyen
  1 sibling, 1 reply; 26+ messages in thread
From: Jeff King @ 2015-06-24  9:49 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: David Turner, git mailing list

On Tue, Jun 23, 2015 at 11:09:40PM -0700, Shawn Pearce wrote:

> Yes. $DAY_JOB's DFS implementation never expires reflogs, allowing it
> to be used as a history to inspect what happened. It's been useful a
> couple of times to investigate and recover from a few accidental
> deletions.
> 
> Once you never expire reflog records, you have to consider at what
> point you stop paying attention to the reflog entries for graph
> reachability during repack and fsck. Users still expect to be able to
> force push or delete a branch and have a set of objects disappear from
> the repository.

Yeah, we face this problem at GitHub. We actually write every single ref
write to $GIT_DIR/audit_log, which is essentially a reflog with the
refname prepended. The key, though, is that it isn't ever _read_ by git
for reachability. So it becomes an immutable log of what happened, and
we can happily prune the reflog to drop objects.

In a log-structured ref storage world, I think I'd include a single bit
per entry for "use this for reachability". Then you could "soft-expire"
reflog entries by dropping their reachability bit, but still retain them
in your audit_log. The alternative is to just copy the entries to an
archival log.

> There are some issues with append. Before appending we would need to
> verify that the last record actually ends with an LF. If there was a
> power failure and only part of the last record was written, you can't
> append without that record separator in place.

Yeah, I think that is straightforward. You have to take a lock on the
whole log anyway, so it's OK to "fixup" the previous entry.

> If that last record was truncated, and an LF was wedged in to do a new
> append, we can't trust that intermediate record. A CRC at the end of
> each record would make it possible to tell whether the record is intact
> or bogus due to an earlier failed write.

I suspect you could get by with just realizing that the entry doesn't
parse (that's what we do now for reflogs). But the idea of per-entry
consistency checks is appealing. You could also include the CRC for the
"previous" entry (remember that we would probably have a back-pointer to
some byte offset to say "this is the current ref state that I am
building on"). Then you can walk back the whole chain to know that it
hasn't been damaged.

If you want to get very fancy, replace your CRC with a cryptographically
strong hash, and you've just reinvented a blockchain. :)
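
Verification would then just walk the chain backwards, something like
(all of these names are hypothetical):

    static int verify_chain(uint64_t off)
    {
        struct log_record rec, prev;

        if (read_record(off, &rec) < 0)
            return -1;
        while (rec.prev_offset) {
            if (read_record(rec.prev_offset, &prev) < 0)
                return -1;
            if (crc32_of_record(&prev) != rec.prev_crc)
                return -1; /* the chain is damaged here */
            rec = prev;
        }
        return 0;
    }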

> What about the case of never expiring the reflog? This log would grow
> forever. You may eventually need to archive old sections of it (e.g. 1
> year ago?) to maintain an audit log, while keeping the "latest" entry
> for each ref to rebuild the index.

Yeah, that's certainly an option. I'd say that's somewhat outside the
scope of git. If git provides the ability to prune entries completely
(i.e., what "reflog expire" does now) and to soft-expire them, then that
is enough for anyone to build whatever sort of archival system they want
(e.g., soft-expire for reachability as desired, and then occasionally
"git reflog show >your-archive && git reflog expire").

> +1 to always storing the peeled value. This was a major improvement
> for $DAY_JOB's Git servers as peeling tags on the fly can be costly
> when your storage is something remote, such as NFS. Unfortunately the
> current wire protocol demands peeled information to serve a ref
> advertisement.

Even on good disks, it makes the initial ref advertisement from
git-upload-pack _way_ cheaper, because we don't have to actually touch
the object database at all. It's basically just blitting out the
packed-refs file.

> One thing we do is always peel all refs. We record a bit to state it's
> been peeled, but there is no peeled value because the ref is pointing
> to a non-tag object (e.g. refs/heads/master points to a commit).

Yeah, since c29c46f (pack-refs: add fully-peeled trait, 2013-03-18) we
implicitly do this in packed-refs; if there's no peel line after the
entry, it cannot be peeled. We could do the same here, but I think I
favor being more explicit (I'd probably add a few bits of "flags" to
each entry, and this could be one such flag).

> Updating the index on updates to a ref would be costly, as it's O(N).

It depends how you implement the index. A straight text index would be
O(N). Replacing the index with a real key/value store should be very
fast. But unless we are going to write our own, that's going to
introduce a dependency (possibly one we can ship as we do with xdiff,
but the whole JGit thing is an open question).

> You could skip some index updates. Record in the header of the index
> the length of the reflog file used to build it. When reading the
> index, scan the reflog from that position to the end and patch those
> updates in memory. Rewrites of the index could then be deferred until
> the scan delta on the log is high, or the next gc.

Yeah, basically use the log as a journal. You save (or at least
amortize) O(# of refs) work for the writers, at the cost of O(# of
recent updates) work for the readers. That might be worth doing. It's
also complicated, and I was hoping to avoid complicated things. :)

> >   - Reading a single reflog is _also_ O(U), which is not as good as
> >     today. But if each log entry contains a byte offset of the previous
> >     entry, you can follow the chain (it is still slightly worse, because
> >     you are jumping all over the file, rather than reading a compact set
> >     of lines).
> 
> But this is like saying reading `git log` is bad because we jump all
> over the pack file to parse ancestors and insert them into the
> revqueue at the correct position. Feh.

Yeah, I agree it's probably not worth caring too much about. Reading the
reflogs at all is not that common an operation, and it's a tradeoff I'd
be happy to make. I was just trying to be upfront about the tradeoffs
versus the current storage format.

> IIRC some part of Ronnie's series was about setting up a socket
> protocol between Git and the ref backend. If non-default backends are
> like this, JGit could spawn the backend binary and then speak the
> socket protocol just like git-core. This would be preferable to
> linking JNI into the JVM.

I am not excited about contacting an already-running program, which is
what I thought Ronnie's patches did. That's one more thing to go wrong
or become confusing when doing basic operations.  If we have to use an
external program, I'd much rather it be something we spawn once (more
like the remote-helper, which I think is what you are proposing).

I don't know how much that helps for the JGit situation. It punts the
native code out of JGit, but people using JGit still have to have the
native helper from git on their system.  I have no problems at all with
pluggable $FANCY_DB that not everybody supports.  But I think we would
want _some_ baseline that is reasonably performant, and that everybody
will support. I'm not sure putting the index into a flat file is
performant enough. Is there any basic key/value store that has both a
C and a pure-Java version (e.g., berkeley db)?

-Peff

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: RFC/Pull Request: Refs db backend
  2015-06-24  6:09   ` Shawn Pearce
  2015-06-24  9:49     ` Jeff King
@ 2015-06-24 10:18     ` Duy Nguyen
  1 sibling, 0 replies; 26+ messages in thread
From: Duy Nguyen @ 2015-06-24 10:18 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: Jeff King, David Turner, git mailing list

On Wed, Jun 24, 2015 at 1:09 PM, Shawn Pearce <spearce@spearce.org> wrote:
>>> I chose to use LMDB for the database.  LMDB has a few features that make
>>> it suitable for usage in git:
>>
>> One of the complaints that Shawn had about sqlite is that there is no
>> native Java implementation, which makes it hard for JGit to ship a
>> compatible backend. I suspect the same is true for LMDB, but it is
>> probably a lot simpler than sqlite (so reimplementation might be
>> possible).
>
> Yes. Whatever the default standard format is for git-core, we need
> that format to be easily supportable from JGit. Loading code via JNI
> is not "easily supportable".

I'm under the impression that this will be opt-in, not completely
replacing the fs-based ref backend. Anyway, any recommendations for a
database format or engine that is more friendly to Java and JGit (and
preferably has good C support too)?
-- 
Duy

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: RFC/Pull Request: Refs db backend
  2015-06-24  9:14     ` Jeff King
@ 2015-06-24 17:29       ` David Turner
  0 siblings, 0 replies; 26+ messages in thread
From: David Turner @ 2015-06-24 17:29 UTC (permalink / raw)
  To: Jeff King; +Cc: git mailing list

On Wed, 2015-06-24 at 05:14 -0400, Jeff King wrote:
> On Tue, Jun 23, 2015 at 02:18:36PM -0400, David Turner wrote:
> 
> > > Can you describe a bit more about the reflog handling?
> > > 
> > > One of the problems we've had with large-ref repos is that the reflog
> > > storage is quite inefficient. You can pack all the refs, but you may
> > > still be stuck with a bunch of reflog files with one entry, wasting a
> > > whole inode. Doing a "git repack" when you have a million of those has
> > > horrible cold-cache performance. Basically anything that isn't
> > > one-file-per-reflog would be a welcome change. :)
> > 
> > Reflogs are stored in the database as well.  There is one header entry
> > per ref to indicate that a reflog is present, and then one database
> > entry per reflog entry; the entries are stored consecutively and
> > immediately following the header so that it's fast to iterate over them.
> 
> OK, that makes sense. I did notice that the storage for the refdb grows
> rapidly. If I add a million refs (like refs/tags/$i) with a simple
> reflog message "foo", I end up with a 500MB database file.
> 
> That's _probably_ OK, because a million is getting into crazy
> territory[1].  But it's 500 bytes per ref, each with one reflog entry.
> Our ideal lower bound is probably something like 100 bytes per reflog
> entry:
> 
>   - 20 bytes for old sha1
>   - 20 bytes for new sha1
>   - ~50 bytes for name, email, timestamp
>   - ~6 bytes for refname (1000000 is the longest unique part)
> 
> That assumes we store binary[2] (and not just the raw reflog lines), and
> reconstruct the reflog lines on the fly. It also assumes we use some
> kind of trie-like storage (where we can amortize the cost of storing
> "refs/tags/" across all of the entries).
> 
> Of course that neglects lmdb's overhead, and the storage of the ref tip
> itself. But it would hopefully give us a ballpark for an optimal
> solution. We don't have to hit that, of course, but it's food for
> thought.
> 
> [1] The homebrew/homebrew repository on GitHub has almost half a million
>     ref updates. Since this is storing not just refs but all ref
>     updates, that's actually the interesting number (and optimizing the
>     per-reflog-entry size is more interesting than the per-ref size).
> 
> [2] I'm hesitant to suggest binary formats in general, but given that
>     this is a blob embedded inside lmdb, I think it's OK. If we were to
>     pursue the log-structured idea I suggested earlier, I'm torn on
>     whether it should be binary or not.

I could try a binary format.  I was optimizing for simplicity,
debuggability, recoverability, and compatibility with the existing text
format, but I wouldn't have to.  I don't know how much this will save.
Unfortunately, given the way LMDB works, trie-like storage to amortize
the cost of common prefixes like refs/tags does not seem possible (of
course, we could hard-code some hacks like \001=refs/tags,
\002=refs/heads, etc., but that is a micro-optimization that might not
be worth it).
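
E.g., the hack would amount to something like (hypothetical, and again
probably not worth it):

    static const char *const prefix_table[] = {
        NULL,            /* \000: unused */
        "refs/tags/",    /* \001 */
        "refs/heads/",   /* \002 */
        "refs/remotes/", /* \003 */
    };

where a key starting with one of those control bytes gets expanded
through the table before use.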

Also, the reflog header has some overhead (it's an entire extra record
per ref). The header exists to implement reflog creation/existence
checking.  I didn't really try to understand why we have the distinction
between empty and nonexistent reflogs; I just copied it.  If we didn't
have that distinction, we could eliminate that overhead.

> > Thanks, that's valuable.  For the refs backend, opening the LMDB
> > database for writing is sufficient to block other writers.  Do you think
> > it would be valuable to provide a git hold-ref-lock command that simply
> > reads refs from stdin and keeps them locked until it reads EOF from
> > stdin?  That would allow cross-backend ref locking.
> 
> I'm not sure what you would use it for. If you want to update the refs,
> then you can specify a whole transaction with "git update-ref --stdin",
> and that should work whatever backend you choose. Is there some other
> operation you want where you hold the lock for a longer period of time?

I'm sure I had a reason for this at the time I wrote it, but now I can't
think of what it was.  Never mind!

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: RFC/Pull Request: Refs db backend
  2015-06-23 21:27     ` Michael Haggerty
@ 2015-06-24 17:31       ` David Turner
  0 siblings, 0 replies; 26+ messages in thread
From: David Turner @ 2015-06-24 17:31 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: git mailing list

On Tue, 2015-06-23 at 23:27 +0200, Michael Haggerty wrote:
> On 06/23/2015 09:53 PM, David Turner wrote:
> > On Tue, 2015-06-23 at 17:51 +0200, Michael Haggerty wrote:
> > [...]
> >> * I don't like the fact that you have replaced `struct ref_transaction
> >> *` with `void *` in the public interface. On a practical level, I like
> >> the bit of type-safety that comes with the more specific declaration.
> >> But on a more abstract level, I think that the concept of a transaction
> >> could be useful across backends, for example in utility functions that
> >> verify that a proposed set of updates are internally consistent. I would
> >> rather see either
> >>
> >>   * backends "extend" a basic `struct ref_transaction` to suit their
> >> needs, and upcast/downcast pointers at the module boundary, or
> >>
> >>   * `struct ref_transaction` itself gets a `void *` member that backends
> >> can use for whatever purposes they want.
> > 
> > There are no common fields between refs-be-file transactions and
> > refs-be-lmdb transactions.  I don't see much gain from adding an empty
> > ref_transaction that backends could extend, since we would have to
> > explicitly upcast/downcast all over the place.
> 
> If you ask me, it would be better to do a bunch of up/downcasts within
> the single module (via two helper functions that could even do
> consistency checks) than have no help from the compiler in preventing
> people from passing unrelated pointer types into the `void *transaction`
> argument. Plus the `struct ref_transaction *` variables scattered
> throughout the code are a lot more self-explanatory than `void *`.

I'll take a look at what that would look like.

> >> * Regarding MERGE_HEAD: you take the point of view that it must continue
> >> to be stored as a file. And yet it must also behave somewhat like a
> >> reference; for example, `git rev-parse MERGE_HEAD` works today.
> >> MERGE_HEAD is also used for reachability, right?
> >>
> >> Another point of view is that MERGE_HEAD is a plain old boring
> >> reference, but there is some other metadata related to it that the refs
> >> backend has to store. The file-based backend would have special-case
> >> code to read the additional data from the tail of the loose refs file
> >> (and be sure to write the metadata when writing the reference), but
> >> other backends could store the reference with the rest but do their own
> >> thing with the metadata. So I guess I'm wondering whether the refs API
> >> needs a MERGE_HEAD-specific way to read and write MERGE_HEAD along with
> >> its metadata.
> > 
> > You are probably right that this is a good idea.
> > 
> >> * Don't the same considerations that apply to MERGE_HEAD also apply to
> >> FETCH_HEAD?
> > 
> > All of the tests pass without any special handling of FETCH_HEAD.
> 
> That's odd. From git-fetch.txt:
> 
>     The names of refs that are fetched, together with the object names
>     they point at, are written to `.git/FETCH_HEAD`.  This information
>     may be used by scripts or other git commands, such as
>     linkgit:git-pull[1].
> 
> It seems like the test suite is reading FETCH_HEAD via the refs API in a
> couple of places. I don't understand why these don't fail when LMDB is
> being used...

You are right; I did add some special-case code for FETCH_HEAD.

> >> * Rehash of the last two points: I expected one backend function that is
> >> used to initialize the refs backend when a new repository is created
> >> (e.g., in `git init`). The file-based backend would use this function to
> >> create the `refs`, `refs/heads`, and `refs/tags` directories. I expected
> >> a second function that is called once every time git runs in an existing
> >> repository (this one might, for example, open a database connection).
> >> And maybe even a third one that closes down the database connection
> >> before git exits. Would you please explain how this actually works?
> > 
> > LMDB doesn't really have the concept of a "connection".  It's basically
> > just a couple of files that communicate using shared memory (and maybe
> > some other locking that I haven't paid attention to).  There is the
> > concept of a "transaction", which is the unit of concurrency (each
> > thread may only have one open transaction).  Transactions are either
> > read-only or read-write, and there can only be one read-write
> > transaction open at a time (across the entire system).  Read-only
> > transactions take a snapshot of the DB state at transaction start time.
> > This combination of features means that we need to be a bit clever about
> > read-only transactions; if a read-write transaction occurs in a separate
> > process, we need to restart any read-only transactions to pick up its
> > changes.
> 
> If you are thinking about an *unrelated* separate process, then Git's
> philosophy is that if our process is reading *some* valid state of the
> references, it's all good even if that state is not quite the newest.
> After all, who's to say whether our process ran before or after the
> other process? As long as each process sees self-consistent views of the
> world as it existed at some recent time, we're satisfied.

No, I'm thinking about a subprocess that we started ourselves.
Unrelated separate processes are fine, I think.

> To be sure, we even fail at that unambitious goal, because loose
> reference reads are not atomic. It is possible that we read some
> references in the state they had before the other process ran, and
> others in the state they had after the other process was finished. This
> can get ugly if, for example, the other process renamed a reference,
> because we might not see the reference under either its old *or* its new
> name. We might therefore conclude that the objects reachable from that
> reference are dangling and garbage-collect them.
> 
> If the other process is one that we started ourselves, then that is a
> different situation and we would want, for example, to invalidate our
> reference cache after the other process is done.

Yep, my code does this.

> One of the long-standing hopes of a DB-backed reference backend would be
> to improve this situation--allowing atomic writes *and* reads.

Reads are atomic across renames, since we do renames within a single
write transaction. 
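
Roughly (key setup and error handling omitted; a real version would
copy the value out before reusing it):

    MDB_txn *txn;
    MDB_val old_key, new_key, val;

    /* one read-write transaction covers the whole rename */
    mdb_txn_begin(env, NULL, 0, &txn);
    mdb_get(txn, dbi, &old_key, &val);
    mdb_put(txn, dbi, &new_key, &val, 0);
    mdb_del(txn, dbi, &old_key, NULL);
    mdb_txn_commit(txn);

Readers see either the state before the commit or the state after it,
never the in-between.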

> > [...]
> > 
> >> * You explain in the docstring to `lmdb_transaction_begin_flags()` that
> >> it is dangerous to call a callback function while a write transaction is
> >> open if the callback function might want to initiate another write
> >> transaction. This would obviously also apply to running hooks.
> > 
> > I carefully limit the scope of write transactions to avoid problems like
> > this.
> > 
> >>  This is a
> >> limitation of LMDB because writers block each other. I can't think of
> >> anyplace that this would be a problem in our codebase. But if it were,
> >> it seems to me that you could take an approach like the file-based
> >> backend, which collects the transaction in a `ref_transaction` data
> >> structure, and executes the entire transaction within a single call to
> >> `ref_transaction_commit()`. This approach would prevent callers outside
> >> of the refs module from ever bumping against this limitation.
> > 
> > The file-based backend does per-ref locking early, and then applies the
> > transactions. Here, taking the write transaction is how the lmdb backend
> > does locking.  So the situations are not quite the same.  But I think
> > keeping the scope of transactions small is the best plan.
> 
> The file-based backend locks the references early *within
> ref_transaction_commit()*, not as the transaction is being built up
> using ref_transaction_update() etc. This is a big difference, because
> code anywhere can call
> 
>     ref_transaction_begin(...);
>     ANYTHING
>     ref_transaction_update(...);
>     ANYTHING
>     ref_transaction_commit(...);
> 
> The only way to be sure that ANYTHING can't create a deadlock with the
> open transaction (for example by calling a hook script that runs a git
> command) is to audit all of that code now and in the future. Whereas the
> file-based backend doesn't do anything that is externally observable or
> deadlocky except within the single ref_transaction_commit() function
> call, so only that one function has to be audited for actions that could
> cause a deadlock.

A deadlock is impossible; a second writer will simply be unable to
acquire the lock and will die (same as in the file-based backend if two
writers try to update the same ref at the same time).

It's true that the scope for this is potentially larger.  On the other
hand, the file-based backend can fail in similar cases -- when trying
to verify refs that have changed out from under it. That failure is
just less likely, since it only happens on conflicting refs.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: RFC/Pull Request: Refs db backend
  2015-06-24  9:49     ` Jeff King
@ 2015-06-25  1:08       ` brian m. carlson
  0 siblings, 0 replies; 26+ messages in thread
From: brian m. carlson @ 2015-06-25  1:08 UTC (permalink / raw)
  To: Jeff King; +Cc: Shawn Pearce, David Turner, git mailing list

On Wed, Jun 24, 2015 at 05:49:20AM -0400, Jeff King wrote:
> I don't know how much that helps for the JGit situation. It punts the
> native code out of JGit, but people using JGit still have to have the
> native helper from git on their system.  I have no problems at all with
> pluggable $FANCY_DB that not everybody supports.  But I think we would
> want _some_ baseline that is reasonably performant, and that everybody
> will support. I'm not sure putting the index into a flat file is
> performant enough. Is there any basic key/value store that has both a
> C and a pure-Java version (e.g., berkeley db)?

Berkeley DB has switched to the AGPLv3 for new versions.  Besides being
unpalatable for many people, it's also incompatible with the GPLv2.  I
do otherwise like Berkeley DB: it performs reasonably well and is
available on most systems.
-- 
brian m. carlson / brian with sandals: Houston, Texas, US
+1 832 623 2791 | http://www.crustytoothpaste.net/~bmc | My opinion only
OpenPGP: RSA v4 4096b: 88AC E9B2 9196 305B A994 7552 F1BA 225C 0223 B187

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread

Thread overview: 26+ messages
2015-06-23  0:50 RFC/Pull Request: Refs db backend David Turner
2015-06-23  5:36 ` Junio C Hamano
2015-06-23 10:23   ` Duy Nguyen
2015-06-23 18:47     ` David Turner
2015-06-23 17:29   ` David Turner
2015-06-23 11:47 ` Jeff King
2015-06-23 13:10   ` Duy Nguyen
2015-06-24  8:51     ` Jeff King
2015-06-23 18:18   ` David Turner
2015-06-24  9:14     ` Jeff King
2015-06-24 17:29       ` David Turner
2015-06-24  6:09   ` Shawn Pearce
2015-06-24  9:49     ` Jeff King
2015-06-25  1:08       ` brian m. carlson
2015-06-24 10:18     ` Duy Nguyen
2015-06-23 15:51 ` Michael Haggerty
2015-06-23 19:53   ` David Turner
2015-06-23 21:27     ` Michael Haggerty
2015-06-24 17:31       ` David Turner
2015-06-23 21:35     ` David Turner
2015-06-23 21:41       ` Junio C Hamano
2015-06-23 17:16 ` Stefan Beller
2015-06-23 20:04   ` David Turner
2015-06-23 20:10     ` Randall S. Becker
2015-06-23 20:22       ` David Turner
2015-06-23 20:27         ` Randall S. Becker
