All of lore.kernel.org
 help / color / mirror / Atom feed
* RFC: Flat directory for notes, or fan-out?  Both!
@ 2009-02-09 21:12 Johannes Schindelin
  2009-02-10  7:58 ` Boyd Stephen Smith Jr.
                   ` (2 more replies)
  0 siblings, 3 replies; 33+ messages in thread
From: Johannes Schindelin @ 2009-02-09 21:12 UTC (permalink / raw)
  To: git, peff, spearce

Hi,

Shawn triggered some well needed thinking on my part about the notes 
implementation.  At the moment, we have flat directory structure, and read 
all of them in one go (when needed).

I think we should support that, because it is relatively easy to generate 
that kind of trees for small-scale applications.

However, I think there is also a benefit to handle fan-out directory 
structures, too: they scale much nicer.

If the commit name was not found as a filename, it could be searched in 
whatever subdirectory whose name is a prefix of said commit name (first 
wins).

So I think it would be a sane plan to do the following when a commit note 
is requested:

- If not done yet, read in the whole top-level directory of the notes ref.

- If the commit name is not found, find the tree entries whose name is a 
  prefix of the commit name (we can even use the same hashmap to store 
  these "incomplete" names, as we use a linear hash, which we fill in 
  ascending order),

  - read the trees one by one, until the commit name is found (or no tree 
    entry is left), deleting the trees from the hashmap on the go.

How does that sound?

Ciao,
Dscho

	

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: RFC: Flat directory for notes, or fan-out?  Both!
  2009-02-09 21:12 RFC: Flat directory for notes, or fan-out? Both! Johannes Schindelin
@ 2009-02-10  7:58 ` Boyd Stephen Smith Jr.
  2009-02-10 13:16   ` Jeff King
  2009-02-10 12:18 ` Jeff King
  2009-02-11  1:14 ` Sam Vilain
  2 siblings, 1 reply; 33+ messages in thread
From: Boyd Stephen Smith Jr. @ 2009-02-10  7:58 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git, peff, spearce

[-- Attachment #1: Type: text/plain, Size: 540 bytes --]

On Monday 09 February 2009 15:12:06 Johannes Schindelin wrote:
> So I think it would be a sane plan to do the following when a commit note
> is requested:

So, something like a Trie data structure?  I think that is a great way to 
store fixed-length strings from a limited alphabet with arbitrary data 
attached.
-- 
Boyd Stephen Smith Jr.                   ,= ,-_-. =.
bss@iguanasuicide.net                   ((_/)o o(\_))
ICQ: 514984 YM/AIM: DaTwinkDaddy         `-'(. .)`-'
http://iguanasuicide.net/                    \_/


[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: RFC: Flat directory for notes, or fan-out?  Both!
  2009-02-09 21:12 RFC: Flat directory for notes, or fan-out? Both! Johannes Schindelin
  2009-02-10  7:58 ` Boyd Stephen Smith Jr.
@ 2009-02-10 12:18 ` Jeff King
  2009-02-10 12:59   ` Johannes Schindelin
  2009-02-11  1:14 ` Sam Vilain
  2 siblings, 1 reply; 33+ messages in thread
From: Jeff King @ 2009-02-10 12:18 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git, spearce

On Mon, Feb 09, 2009 at 10:12:06PM +0100, Johannes Schindelin wrote:

> Shawn triggered some well needed thinking on my part about the notes 
> implementation.  At the moment, we have flat directory structure, and read 
> all of them in one go (when needed).
> 
> I think we should support that, because it is relatively easy to generate 
> that kind of trees for small-scale applications.

Hmm. Do we really care about how easy it is to generate? Are we
expecting people to not use the command interface and instead check out
a notes tree and start putting stuff into $commit/foo?

And if we are encouraging the dual possibilities, how do we handle the
case of merging two trees with equivalent but differently-formatted
content?

Imagine I have three users, A, B, and C, all collaborating on a project
with notes. A and B use the "git notes" interface which generates a
fan-out directory structure. C uses his own script that directly writes
to the notes tree without fan-out.

Now let's imagine A, B, and C all write a note for commit X, and A pulls
from the other two. When he pulls from B, there is a file-level
conflict, and he decides that his note is better and resolves in his
favor. But when he pulls from C, there is _no_ conflict, and now there
are two notes for the same commit in his notes tree. You can give the
multiple notes some sane semantics (one trumps the other, or they are a
list, or whatever), but there is still an inconsistency: B's notes and
C's notes behave differently. So now A has to start caring about how
other people generate their notes.

The only two solutions I can think of are:

  - when A pulls notes, he does a specialized merge that normalizes the
    note trees

  - particular notes trees are specified as being in "fan out" or "not
    fan out" mode. But there is no place to specify that to enforce it.

-Peff

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: RFC: Flat directory for notes, or fan-out?  Both!
  2009-02-10 12:18 ` Jeff King
@ 2009-02-10 12:59   ` Johannes Schindelin
  2009-02-10 13:10     ` Jeff King
  0 siblings, 1 reply; 33+ messages in thread
From: Johannes Schindelin @ 2009-02-10 12:59 UTC (permalink / raw)
  To: Jeff King; +Cc: git, spearce

Hi,

On Tue, 10 Feb 2009, Jeff King wrote:

> On Mon, Feb 09, 2009 at 10:12:06PM +0100, Johannes Schindelin wrote:
> 
> > Shawn triggered some well needed thinking on my part about the notes 
> > implementation.  At the moment, we have flat directory structure, and read 
> > all of them in one go (when needed).
> > 
> > I think we should support that, because it is relatively easy to generate 
> > that kind of trees for small-scale applications.
> 
> Hmm. Do we really care about how easy it is to generate? Are we
> expecting people to not use the command interface and instead check out
> a notes tree and start putting stuff into $commit/foo?

I wanted to be nice to existing users of the feature (it is in 'next', 
after all, and Thomas has produced some awesome examples, that will 
hopefully show the scalability of the thing).

But you're right, it almost, but not quite, too late to switch.

> And if we are encouraging the dual possibilities, how do we handle the 
> case of merging two trees with equivalent but differently-formatted 
> content?
> 
> Imagine I have three users, A, B, and C, all collaborating on a project
> with notes. A and B use the "git notes" interface which generates a
> fan-out directory structure. C uses his own script that directly writes
> to the notes tree without fan-out.
> 
> Now let's imagine A, B, and C all write a note for commit X, and A pulls
> from the other two. When he pulls from B, there is a file-level
> conflict, and he decides that his note is better and resolves in his
> favor. But when he pulls from C, there is _no_ conflict, and now there
> are two notes for the same commit in his notes tree. You can give the
> multiple notes some sane semantics (one trumps the other, or they are a
> list, or whatever), but there is still an inconsistency: B's notes and
> C's notes behave differently. So now A has to start caring about how
> other people generate their notes.
> 
> The only two solutions I can think of are:
> 
>   - when A pulls notes, he does a specialized merge that normalizes the
>     note trees
> 
>   - particular notes trees are specified as being in "fan out" or "not
>     fan out" mode. But there is no place to specify that to enforce it.

You're correct.  This buys all kinds of trouble.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: RFC: Flat directory for notes, or fan-out?  Both!
  2009-02-10 12:59   ` Johannes Schindelin
@ 2009-02-10 13:10     ` Jeff King
  2009-02-10 13:32       ` Johannes Schindelin
  0 siblings, 1 reply; 33+ messages in thread
From: Jeff King @ 2009-02-10 13:10 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git, spearce

On Tue, Feb 10, 2009 at 01:59:06PM +0100, Johannes Schindelin wrote:

> > Hmm. Do we really care about how easy it is to generate? Are we
> > expecting people to not use the command interface and instead check out
> > a notes tree and start putting stuff into $commit/foo?
> 
> I wanted to be nice to existing users of the feature (it is in 'next', 
> after all, and Thomas has produced some awesome examples, that will 
> hopefully show the scalability of the thing).
> 
> But you're right, it almost, but not quite, too late to switch.

OK. I think if you are seeing performance benefits from a 2-character
fanout, then we should standardize on that (do you have new performance
numbers somewhere?).

The notes implementation is now in master. If it's about to change in an
incompatible way, how do you want to handle it? I'm wary of a quick
patch to change the format this late in the release cycle. We could hold
it back from 1.6.2. Alternatively, we could let it release with a "this
is probably going to change" warning.

I think I favor holding it back, but I am not picky.

> > multiple notes some sane semantics (one trumps the other, or they are a
> > list, or whatever), but there is still an inconsistency: B's notes and
> > C's notes behave differently. So now A has to start caring about how
> > other people generate their notes.
> > [...]
> 
> You're correct.  This buys all kinds of trouble.

One other thing to note: I think we discussed in the past other kinds of
"more than one way to store it" strategies (e.g., letting a blob note be
the same as a tree note containing a blob "default"). They suffer from
some of the same issues (though not quite as badly, since you would at
least see that there was a conflict).

-Peff

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: RFC: Flat directory for notes, or fan-out?  Both!
  2009-02-10  7:58 ` Boyd Stephen Smith Jr.
@ 2009-02-10 13:16   ` Jeff King
  2009-02-11  1:58     ` Boyd Stephen Smith Jr.
  0 siblings, 1 reply; 33+ messages in thread
From: Jeff King @ 2009-02-10 13:16 UTC (permalink / raw)
  To: Boyd Stephen Smith Jr.; +Cc: Johannes Schindelin, git, spearce

On Tue, Feb 10, 2009 at 01:58:41AM -0600, Boyd Stephen Smith Jr. wrote:

> On Monday 09 February 2009 15:12:06 Johannes Schindelin wrote:
> > So I think it would be a sane plan to do the following when a commit note
> > is requested:
> 
> So, something like a Trie data structure?  I think that is a great way to 
> store fixed-length strings from a limited alphabet with arbitrary data 
> attached.

I don't think a Trie quite makes sense here. We still have to look
linearly through each git tree (an artifact of the tree implementation).

You could organize the tree into a deeper, more complex data structure
than just a simple fan-out. But remember that traditional data
structures are usually trying to save expensive comparisons, and
following a pointer is inexpensive. In the case of git trees, though,
following a pointer into a subtree is _very_ expensive, since you have
to lookup and decompress the object.

So what we do now is read the tree into an associative hash.
You could replace the hash with a trie, but it is not really the
performance-critical part here. The issue is that without fan-out you
have to read the _whole_ tree into the hash. With a constant-sized
fanout, you get to divide that work by a constant.

Or did you mean something else entirely?

-Peff

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: RFC: Flat directory for notes, or fan-out?  Both!
  2009-02-10 13:10     ` Jeff King
@ 2009-02-10 13:32       ` Johannes Schindelin
  2009-02-10 15:58         ` Junio C Hamano
  2009-02-10 16:44         ` Shawn O. Pearce
  0 siblings, 2 replies; 33+ messages in thread
From: Johannes Schindelin @ 2009-02-10 13:32 UTC (permalink / raw)
  To: Jeff King, gitster; +Cc: git, spearce

Hi,

[Junio: seems like both Peff and me would like to hold the notes out of 
1.6.2, would you mind?]

On Tue, 10 Feb 2009, Jeff King wrote:

> On Tue, Feb 10, 2009 at 01:59:06PM +0100, Johannes Schindelin wrote:
> 
> > > Hmm. Do we really care about how easy it is to generate? Are we
> > > expecting people to not use the command interface and instead check out
> > > a notes tree and start putting stuff into $commit/foo?
> > 
> > I wanted to be nice to existing users of the feature (it is in 'next', 
> > after all, and Thomas has produced some awesome examples, that will 
> > hopefully show the scalability of the thing).
> > 
> > But you're right, it almost, but not quite, too late to switch.
> 
> OK. I think if you are seeing performance benefits from a 2-character
> fanout, then we should standardize on that (do you have new performance
> numbers somewhere?).

The thing is: Shawn is correct when he says that a tree object to hold the 
notes of all commits (which is not an unlikely scenario if you are 
thinking about corporate processes) would be huge.

> The notes implementation is now in master. If it's about to change in an 
> incompatible way, how do you want to handle it? I'm wary of a quick 
> patch to change the format this late in the release cycle. We could hold 
> it back from 1.6.2. Alternatively, we could let it release with a "this 
> is probably going to change" warning.
> 
> I think I favor holding it back, but I am not picky.

Yes, I am also in favor of holding it back.

> > > multiple notes some sane semantics (one trumps the other, or they are a
> > > list, or whatever), but there is still an inconsistency: B's notes and
> > > C's notes behave differently. So now A has to start caring about how
> > > other people generate their notes.
> > > [...]
> > 
> > You're correct.  This buys all kinds of trouble.
> 
> One other thing to note: I think we discussed in the past other kinds of
> "more than one way to store it" strategies (e.g., letting a blob note be
> the same as a tree note containing a blob "default"). They suffer from
> some of the same issues (though not quite as badly, since you would at
> least see that there was a conflict).

Actually, I do not see much of a problem there.  If the entry 
(corresponding to the commit name) in the notes tree points to a blob, 
then that is that, if it points to a tree, then we just read all of the 
objects therein (or maybe at a later stage we allow restricting to a 
certain file basename).

The point you raised earlier, that there would be a lot of ambiguity if 
we allow both flat and fan-out directory structures, is a valid point, 
though.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: RFC: Flat directory for notes, or fan-out?  Both!
  2009-02-10 13:32       ` Johannes Schindelin
@ 2009-02-10 15:58         ` Junio C Hamano
  2009-02-10 16:48           ` Shawn O. Pearce
  2009-02-10 16:48           ` Johannes Schindelin
  2009-02-10 16:44         ` Shawn O. Pearce
  1 sibling, 2 replies; 33+ messages in thread
From: Junio C Hamano @ 2009-02-10 15:58 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Jeff King, git, spearce

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

>> OK. I think if you are seeing performance benefits from a 2-character
>> fanout, then we should standardize on that (do you have new performance
>> numbers somewhere?).
>
> The thing is: Shawn is correct when he says that a tree object to hold the 
> notes of all commits (which is not an unlikely scenario if you are 
> thinking about corporate processes) would be huge.
>
>> The notes implementation is now in master. If it's about to change in an 
>> incompatible way, how do you want to handle it? I'm wary of a quick 
>> patch to change the format this late in the release cycle. We could hold 
>> it back from 1.6.2. Alternatively, we could let it release with a "this 
>> is probably going to change" warning.
>> 
>> I think I favor holding it back, but I am not picky.
>
> Yes, I am also in favor of holding it back.

I could do a revert on 'master' if it is really needed, but I found that
the above reasoning is a bit troublesome.  The thing is, if a tree to hold
the notes would be huge to be unmanageable, then it would still be huge to
be unmanageable if you split it into 256 pieces.

I'd rather prefer to see us first try to find a way to optimze the tree
parser.  Maybe packv4 or Linus's binary search (which IIRC you declared
would not work --- I recall I once thought about it myself but I do not
recall what my conclusions were) play a role in it.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: RFC: Flat directory for notes, or fan-out?  Both!
  2009-02-10 13:32       ` Johannes Schindelin
  2009-02-10 15:58         ` Junio C Hamano
@ 2009-02-10 16:44         ` Shawn O. Pearce
  2009-02-10 17:09           ` Johannes Schindelin
  2009-02-11  3:19           ` Sam Vilain
  1 sibling, 2 replies; 33+ messages in thread
From: Shawn O. Pearce @ 2009-02-10 16:44 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Jeff King, gitster, git, Sam Vilain

Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> [Junio: seems like both Peff and me would like to hold the notes out of 
> 1.6.2, would you mind?]

Sorry I'm getting involved in this notes thing so late.  I was way
too focused on Gerrit2 and just didn't pay much attention to what
was on the git ML recently.  Like Dscho and Peff, I think we may
want to hold notes out of 1.6.2.
 
> On Tue, 10 Feb 2009, Jeff King wrote:
> > On Tue, Feb 10, 2009 at 01:59:06PM +0100, Johannes Schindelin wrote:
> 
> The thing is: Shawn is correct when he says that a tree object to hold the 
> notes of all commits (which is not an unlikely scenario if you are 
> thinking about corporate processes) would be huge.

A notes tree entry requires 6+1+40+1+20=68 bytes per entry.  If I
use it for what I want in Gerrit, which is to annotate every commit,
on a project like git.git with 17,491 commits we're talking about
a tree that is 1.13 MB.

That tree grows at a rate of 276 KB/year.

I'm not sure I want to think about the cost to unpack that tree,
just so I can look at "git log --since=1.week.ago".

My fear here is that over time we will be spending a lot of CPU
time unpacking and indexing the tree in memory, only to then pull
out a handful of recent commits, and then see the pager abort and
kill the revision walk.
 
> The point you raised earlier, that there would be a lot of ambiguity if 
> we allow both flat and fan-out directory structures, is a valid point, 
> though.

Yup.  The flat vs. fan-out is a problem.  In a slightly unrelated
thread offlist I have been talking with Sam Vilain about using Git
as a database backend for tuple storage.  There is a related issue
there about making the tree structure consistent, but never stored
in a way that we wind up with these massive multi-megabyte objects.

We've only started to kick it around, but I think we are both in
agreement that a "database tree" is owned by the database code
and must not be twiddled manually.  Not unless you can honor the
formatting rules.  Just like you shouldn't use "git hash-object"
to create a tree, unless you can honor the basic formatting rules
for trees.

This also means that the "database trees" probably are not going
to be mergeable with a basic merge-recursive sort of algorithm,
but instead need specialized handling to perform the combination.

I think we're leaning in a direction of something more like this
for trees:

- Tuples are stored under a path constructed from their primary key.
  The analog here is, the commit SHA-1 the note is annotating.

- Trees are capped at some reasonable size limit.  For sake of
  argument lets call that MAX_TREE.  My feeling is this would be
  closer to the 16 KB side of the spectrum then to the 1 MB side.

- Initially the database tree starts out as a single root tree that
  is empty.

- Records are inserted, creating new tree entries, until MAX_TREE
  is reached for the root level tree.  Up until this point it is
  a flat tree structure, like the current notes design.

- Once MAX_TREE is reached the root is split, and ranges are used
  to point to the subtrees, which are now flat, and approximately
  are MAX_TREE/2 in size.

Etc.

This would make the git-notes.sh code a *lot* more complex, as you
can't just toss everything into an index file and then update it with
a single update-index call.  Doing a tree split is much more work and
requires removing and adding back all of the affected path names.
(Its also perhaps unreasonable anyway to load 17,491 paths into a
temporary index just to twiddle a note for the latest commit.)


Notes on commits though are a hell of a problem.  SHA-1 is just so
uniform at distributing the commits around the namespace that even
with just the 200 most recent commits we wind up with a commit in
almost every "bucket", assuming a two hex digit fan-out bucket like
the loose object directory.

For the "git database" thing above, I've been contemplating the
idea of an index stored external from the Git object database.
Sam thinks indexes should be in the object database tree, but
I'm considering storing them outside entirely because we can
make the indexes more easily searched by a hash or binary search,
like pack-*.idx.  Whenever the "database ref" gets moved we'd need
to run a "sync" utility to bring these external indexes current.
But they could also be more efficiently scanned.

E.g. in the case of commit notes, we could just mmap() the index into
memory and perform our lookups through the mmap.  Thus we wouldn't
pay massive penalities to index all 17,491 names just to access 200.
Though we may wind up paging in a good part of the index due to
the random access nature, but we can't really do anything about that.

Keeping the indexes current would perhaps mean teaching "git fetch"
to run something after the fetch is complete.  Rather trivial in
the grand scheme of things.  I also liken the external index to the
pack-*.idx, in that its derived from the real sources in the object
database, and can always be generated client side.  So making fetch
do it is really no different then making fetch run index-pack.

Eh.  That wound up being a lot longer than I wanted it to be.


Sam and I may be putting some effort into this "git as a database"
thing, and it could be used as an efficient notes store.  Its just
a very complex notes store.  Much more complex to implement than
the simple notes currently slated for 1.6.2.

-- 
Shawn.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: RFC: Flat directory for notes, or fan-out?  Both!
  2009-02-10 15:58         ` Junio C Hamano
@ 2009-02-10 16:48           ` Shawn O. Pearce
  2009-02-10 16:48           ` Johannes Schindelin
  1 sibling, 0 replies; 33+ messages in thread
From: Shawn O. Pearce @ 2009-02-10 16:48 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Johannes Schindelin, Jeff King, git

Junio C Hamano <gitster@pobox.com> wrote:
> 
> I could do a revert on 'master' if it is really needed, but I found that
> the above reasoning is a bit troublesome.  The thing is, if a tree to hold
> the notes would be huge to be unmanageable, then it would still be huge to
> be unmanageable if you split it into 256 pieces.
> 
> I'd rather prefer to see us first try to find a way to optimze the tree
> parser.  Maybe packv4 or Linus's binary search (which IIRC you declared
> would not work --- I recall I once thought about it myself but I do not
> recall what my conclusions were) play a role in it.

packv4 as proposed wouldn't help a notes tree.  It relied on the fact
that we'd have no more than 64k unique file names in a repository,
and any name which overflowed that 64k limit would force its tree
to be a canonical format tree, which is what we are trying to
avoid here.

-- 
Shawn.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: RFC: Flat directory for notes, or fan-out?  Both!
  2009-02-10 15:58         ` Junio C Hamano
  2009-02-10 16:48           ` Shawn O. Pearce
@ 2009-02-10 16:48           ` Johannes Schindelin
  2009-02-10 16:56             ` Shawn O. Pearce
  1 sibling, 1 reply; 33+ messages in thread
From: Johannes Schindelin @ 2009-02-10 16:48 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, git, spearce

Hi,

On Tue, 10 Feb 2009, Junio C Hamano wrote:

> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
> 
> >> OK. I think if you are seeing performance benefits from a 2-character
> >> fanout, then we should standardize on that (do you have new performance
> >> numbers somewhere?).
> >
> > The thing is: Shawn is correct when he says that a tree object to hold the 
> > notes of all commits (which is not an unlikely scenario if you are 
> > thinking about corporate processes) would be huge.
> >
> >> The notes implementation is now in master. If it's about to change in an 
> >> incompatible way, how do you want to handle it? I'm wary of a quick 
> >> patch to change the format this late in the release cycle. We could hold 
> >> it back from 1.6.2. Alternatively, we could let it release with a "this 
> >> is probably going to change" warning.
> >> 
> >> I think I favor holding it back, but I am not picky.
> >
> > Yes, I am also in favor of holding it back.
> 
> I could do a revert on 'master' if it is really needed, but I found that
> the above reasoning is a bit troublesome.  The thing is, if a tree to hold
> the notes would be huge to be unmanageable, then it would still be huge to
> be unmanageable if you split it into 256 pieces.

The thing is, a tree object of 17 megabyte is unmanagably large if you 
have to read it whenever you access even a single node.  Having 256 trees 
instead, each of which is about 68 kilobyte is much nicer.

> I'd rather prefer to see us first try to find a way to optimze the tree 
> parser.  Maybe packv4 or Linus's binary search (which IIRC you declared 
> would not work --- I recall I once thought about it myself but I do not 
> recall what my conclusions were) play a role in it.

I declared it did not work, and showed an example here:

	http://article.gmane.org/gmane.comp.version-control.git/103297

Now, this example is so concocted that it is not even funny.  For example, 
it falls flat down for notes, as the names never contain spaces there.

I guess that we could detect possible false positives such as my example, 
by searching for NULs and spaces in the vicinity, and just search on if 
there is a salmon of a doubt left.

But the worst part about it: we'd still have to unpack the whole tree 
object to start bisecting (as described in said mail).

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: RFC: Flat directory for notes, or fan-out?  Both!
  2009-02-10 16:48           ` Johannes Schindelin
@ 2009-02-10 16:56             ` Shawn O. Pearce
  2009-02-10 17:31               ` Johannes Schindelin
  2009-02-10 18:35               ` Junio C Hamano
  0 siblings, 2 replies; 33+ messages in thread
From: Shawn O. Pearce @ 2009-02-10 16:56 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Junio C Hamano, Jeff King, git

Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> On Tue, 10 Feb 2009, Junio C Hamano wrote:
> > 
> > I could do a revert on 'master' if it is really needed, but I found that
> > the above reasoning is a bit troublesome.  The thing is, if a tree to hold
> > the notes would be huge to be unmanageable, then it would still be huge to
> > be unmanageable if you split it into 256 pieces.
> 
> The thing is, a tree object of 17 megabyte is unmanagably large if you 
> have to read it whenever you access even a single node.  Having 256 trees 
> instead, each of which is about 68 kilobyte is much nicer.

See my other email on this thread; we'd probably need to unpack
all 256 subtrees *anyway* due to the distribution of SHA-1 names
for commits.

-- 
Shawn.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: RFC: Flat directory for notes, or fan-out?  Both!
  2009-02-10 16:44         ` Shawn O. Pearce
@ 2009-02-10 17:09           ` Johannes Schindelin
  2009-02-10 17:17             ` Shawn O. Pearce
  2009-02-11  3:19           ` Sam Vilain
  1 sibling, 1 reply; 33+ messages in thread
From: Johannes Schindelin @ 2009-02-10 17:09 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Jeff King, gitster, git, Sam Vilain

Hi,

On Tue, 10 Feb 2009, Shawn O. Pearce wrote:

> For the "git database" thing above, I've been contemplating the
> idea of an index stored external from the Git object database.

The whole point of my exercise was to reuse as much as possible of Git's 
framework.  After all, if you store an index external from Git's object 
database, you go back to reimplementing the whole infrastructure for 
fetching/merging just for that index.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: RFC: Flat directory for notes, or fan-out?  Both!
  2009-02-10 17:09           ` Johannes Schindelin
@ 2009-02-10 17:17             ` Shawn O. Pearce
  0 siblings, 0 replies; 33+ messages in thread
From: Shawn O. Pearce @ 2009-02-10 17:17 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Jeff King, gitster, git, Sam Vilain

Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> On Tue, 10 Feb 2009, Shawn O. Pearce wrote:
> 
> > For the "git database" thing above, I've been contemplating the
> > idea of an index stored external from the Git object database.
> 
> The whole point of my exercise was to reuse as much as possible of Git's 
> framework.  After all, if you store an index external from Git's object 
> database, you go back to reimplementing the whole infrastructure for 
> fetching/merging just for that index.

Yea, I know.

It might just be easier to abandon everything in Git and start
from scratch for the "git database" thing.  But we'd lose the
ability to at least piggyback onto the existing Git transport.
And it doesn't help the "git notes" feature we're talking about.

Maybe I was viewing the external index as like the working tree,
where you can't really access the data until the external indexes
are current, just like you can't really (easily anyway) access the
working tree files until you bring the working tree current.  But
yea, it doesn't really use any of the existing machinary.

-- 
Shawn.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: RFC: Flat directory for notes, or fan-out?  Both!
  2009-02-10 16:56             ` Shawn O. Pearce
@ 2009-02-10 17:31               ` Johannes Schindelin
  2009-02-10 18:35               ` Junio C Hamano
  1 sibling, 0 replies; 33+ messages in thread
From: Johannes Schindelin @ 2009-02-10 17:31 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Junio C Hamano, Jeff King, git

Hi,

On Tue, 10 Feb 2009, Shawn O. Pearce wrote:

> Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> > On Tue, 10 Feb 2009, Junio C Hamano wrote:
> > > 
> > > I could do a revert on 'master' if it is really needed, but I found that
> > > the above reasoning is a bit troublesome.  The thing is, if a tree to hold
> > > the notes would be huge to be unmanageable, then it would still be huge to
> > > be unmanageable if you split it into 256 pieces.
> > 
> > The thing is, a tree object of 17 megabyte is unmanagably large if you 
> > have to read it whenever you access even a single node.  Having 256 trees 
> > instead, each of which is about 68 kilobyte is much nicer.
> 
> See my other email on this thread; we'd probably need to unpack
> all 256 subtrees *anyway* due to the distribution of SHA-1 names
> for commits.

No, that is not true.  It is only true if you show more than 94180 
commit, actually, as only then the probability that you hit all 256 
buckets is larger than 50 percent.

In general, you will look at only a few commits, though.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: RFC: Flat directory for notes, or fan-out?  Both!
  2009-02-10 16:56             ` Shawn O. Pearce
  2009-02-10 17:31               ` Johannes Schindelin
@ 2009-02-10 18:35               ` Junio C Hamano
  2009-02-10 19:09                 ` Shawn O. Pearce
  2009-02-10 21:10                 ` Johannes Schindelin
  1 sibling, 2 replies; 33+ messages in thread
From: Junio C Hamano @ 2009-02-10 18:35 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Johannes Schindelin, Jeff King, git

"Shawn O. Pearce" <spearce@spearce.org> writes:

> Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
>> On Tue, 10 Feb 2009, Junio C Hamano wrote:
>> > 
>> > I could do a revert on 'master' if it is really needed, but I found that
>> > the above reasoning is a bit troublesome.  The thing is, if a tree to hold
>> > the notes would be huge to be unmanageable, then it would still be huge to
>> > be unmanageable if you split it into 256 pieces.
>> 
>> The thing is, a tree object of 17 megabyte is unmanagably large if you 
>> have to read it whenever you access even a single node.  Having 256 trees 
>> instead, each of which is about 68 kilobyte is much nicer.
>
> See my other email on this thread; we'd probably need to unpack
> all 256 subtrees *anyway* due to the distribution of SHA-1 names
> for commits.

I wonder if we can solve this by introducing a local cache that is a flat
file that looks like:

    magic number for /usr/bin/file
    tree object SHA-1 the file caches
    Number of entries in this file
    256 fan-out offsets into this file
    N entries of <SHA-1, SHA-1>, sorted
    Checksum of the file itself

and use it when availble (otherwise optionally create it upon the first
lookup).  The file can be used by mmaping it and then doing a newton
raphson or binary search similar to the way patch-ids.c does.

The top-level API for such a hash-map would perhaps look like:

    /*
     * take the object name a tree object that is a hash map,
     * return an opaque struct.
     */
    struct hashmap *hashmap_open(const unsigned char *);

    /*
     * find the value given the key and return 0, or return negative
     * if not found.
     */
    int hashmap_lookup(struct hashmap *map, const unsigned char *key,
    		       unsigned char *val);

    /* discard the thing */
    void hashmap_close(struct hashmap *map);

We should be able to use these in "git log" and friends where Dscho added
the hook in his git-notes topic.

I am hoping that I could eventually rewrite rerere to use something like
this, so that rerere database can be shared, just like the way notes can
be shared, across repositories.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: RFC: Flat directory for notes, or fan-out?  Both!
  2009-02-10 18:35               ` Junio C Hamano
@ 2009-02-10 19:09                 ` Shawn O. Pearce
  2009-02-10 21:10                 ` Johannes Schindelin
  1 sibling, 0 replies; 33+ messages in thread
From: Shawn O. Pearce @ 2009-02-10 19:09 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Johannes Schindelin, Jeff King, git

Junio C Hamano <gitster@pobox.com> wrote:
> 
> I wonder if we can solve this by introducing a local cache that is a flat
> file that looks like:
> 
>     magic number for /usr/bin/file

Don't forget a version number.  Waste 4 bytes now and its easier
to change the format in the future if we need to.

>     tree object SHA-1 the file caches
>     Number of entries in this file
>     256 fan-out offsets into this file
>     N entries of <SHA-1, SHA-1>, sorted
>     Checksum of the file itself
> 
> and use it when availble (otherwise optionally create it upon the first
> lookup).  The file can be used by mmaping it and then doing a newton
> raphson or binary search similar to the way patch-ids.c does.

Yup.  Sort of my thoughts when I was thinking about that external
index for a "git database".

I was considering a much more complex file layout though; one that
would permit editing without completely recopying the file every
time something changes.

More or less a traditional block oriented on-disk M-tree, with
copy-on-write semantics for the blocks.  This would permit us to
quickly append onto the end of the file with new updates, and then
periodically copy and flatten out the the file as necessary to
reclaim the prior dead space.

E.g.:

  magic number
  version
  [intermediate blocks ...]
  [leaf blocks...]
  root block

Writers would append modified leaf and intermediate blocks as
necessary to the end of the file, then append a new root block.

Readers would read the file tail and verify it is a root, then scan
with a traditional M-tree search algorithm.

If the root block has a "magic block header" and a strong checksum
at the tail of the block, readers can concurrently read while a
writer is appending.  Any invalid root block just means the reader
is seeing the middle of a write, or an aborted write, and should
scan backwards to locate the prior valid root.

If the root block also has a commit SHA-1 indicating which commit
that root become valid under, a reader can decide if that root
might give it answers which aren't correct for the current value of
the notes history it is reading, and scan backwards for some older
root block.  We could accelerate that by including the file offset
of the prior root block in each new root.

GC compacting the file is just a matter of write-locking the file
to block out a new writer, then traversing the current root and
copying all blocks that are reachable.

</end-hand-waving>

> I am hoping that I could eventually rewrite rerere to use something like
> this, so that rerere database can be shared, just like the way notes can
> be shared, across repositories.

Ooh, great idea.  If we could toss rerere data into something that
can be transported around, and efficiently accessed.  I like it.

-- 
Shawn.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: RFC: Flat directory for notes, or fan-out?  Both!
  2009-02-10 18:35               ` Junio C Hamano
  2009-02-10 19:09                 ` Shawn O. Pearce
@ 2009-02-10 21:10                 ` Johannes Schindelin
  2009-02-10 22:16                   ` Thomas Rast
  2009-02-11 20:02                   ` Jeff King
  1 sibling, 2 replies; 33+ messages in thread
From: Johannes Schindelin @ 2009-02-10 21:10 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Shawn O. Pearce, Jeff King, git

Hi,

On Tue, 10 Feb 2009, Junio C Hamano wrote:

> "Shawn O. Pearce" <spearce@spearce.org> writes:
> 
> > Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> >> On Tue, 10 Feb 2009, Junio C Hamano wrote:
> >> > 
> >> > I could do a revert on 'master' if it is really needed, but I found that
> >> > the above reasoning is a bit troublesome.  The thing is, if a tree to hold
> >> > the notes would be huge to be unmanageable, then it would still be huge to
> >> > be unmanageable if you split it into 256 pieces.
> >> 
> >> The thing is, a tree object of 17 megabyte is unmanagably large if you 
> >> have to read it whenever you access even a single node.  Having 256 trees 
> >> instead, each of which is about 68 kilobyte is much nicer.
> >
> > See my other email on this thread; we'd probably need to unpack
> > all 256 subtrees *anyway* due to the distribution of SHA-1 names
> > for commits.
> 
> I wonder if we can solve this by introducing a local cache that is a flat
> file that looks like:
> 
>     magic number for /usr/bin/file
>     tree object SHA-1 the file caches
>     Number of entries in this file
>     256 fan-out offsets into this file
>     N entries of <SHA-1, SHA-1>, sorted
>     Checksum of the file itself
> 
> and use it when availble (otherwise optionally create it upon the first
> lookup).  The file can be used by mmaping it and then doing a newton
> raphson or binary search similar to the way patch-ids.c does.

Or we could use an on-disk hashmap.  Oh, wait...

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: RFC: Flat directory for notes, or fan-out?  Both!
  2009-02-10 21:10                 ` Johannes Schindelin
@ 2009-02-10 22:16                   ` Thomas Rast
  2009-02-10 22:26                     ` Thomas Rast
  2009-02-10 22:32                     ` Junio C Hamano
  2009-02-11 20:02                   ` Jeff King
  1 sibling, 2 replies; 33+ messages in thread
From: Thomas Rast @ 2009-02-10 22:16 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Junio C Hamano, Shawn O. Pearce, Jeff King, git

[-- Attachment #1: Type: text/plain, Size: 1595 bytes --]

Johannes Schindelin wrote:
> Or we could use an on-disk hashmap.  Oh, wait...

While reading this thread, I sure wondered ... why don't we use the
one on-disk fast access structure we already have: the index?

Sure, one problem is that the index reading code is inherently written
for a single index state.  However, all notes consumers I can
currently think of (show, log, anything that displays commit messages)
do not have to access the "real" index.

We'd immediately get lots of tool support for free.  Presumably the
real index code has been optimized very well, so it should perform
well.  Perhaps there could even be some definition of a NOTES_HEAD
that tracks the current (albeit not checked out, that would be insane)
state.


On a tangent, I'd really like to see a feature that lets us have
several sets of notes (by whatever mechanism).  Displaying them as
"Notes from remotes/trast/mailnotes" or similar should be ok.  Given
that even before notes are in any release we already have at least two
projects working with mass annotations, it doesn't take much of a
crystal ball to see that the current one-note restriction will be a
limitation.

At a (*very*) cursory glance at read-cache.c, it seems that there is
even support for having several index structures in memory at once,
making this easy.  And it looks like reading the cache is more or less
memcpy() if xmmap() is fast (Windows would suffer once again).


Then again I joined this discussion very late so feel free to ignore
my ramblings.

-- 
Thomas Rast
trast@{inf,student}.ethz.ch

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: RFC: Flat directory for notes, or fan-out?  Both!
  2009-02-10 22:16                   ` Thomas Rast
@ 2009-02-10 22:26                     ` Thomas Rast
  2009-02-10 22:32                     ` Junio C Hamano
  1 sibling, 0 replies; 33+ messages in thread
From: Thomas Rast @ 2009-02-10 22:26 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Junio C Hamano, Shawn O. Pearce, Jeff King, git

[-- Attachment #1: Type: text/plain, Size: 686 bytes --]

Thomas Rast wrote:
> Sure, one problem is that the index reading code is inherently written
> for a single index state.  However, all notes consumers I can
> currently think of (show, log, anything that displays commit messages)
> do not have to access the "real" index.
[...]
> At a (*very*) cursory glance at read-cache.c, it seems that there is
> even support for having several index structures in memory at once,
> making this easy.  And it looks like reading the cache is more or less
> memcpy() if xmmap() is fast (Windows would suffer once again).

Note to self: do not write mail on bus, then pick up later at home.

-- 
Thomas Rast
trast@{inf,student}.ethz.ch

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: RFC: Flat directory for notes, or fan-out?  Both!
  2009-02-10 22:16                   ` Thomas Rast
  2009-02-10 22:26                     ` Thomas Rast
@ 2009-02-10 22:32                     ` Junio C Hamano
  1 sibling, 0 replies; 33+ messages in thread
From: Junio C Hamano @ 2009-02-10 22:32 UTC (permalink / raw)
  To: Thomas Rast; +Cc: Johannes Schindelin, Shawn O. Pearce, Jeff King, git

Thomas Rast <trast@student.ethz.ch> writes:

> Johannes Schindelin wrote:
>> Or we could use an on-disk hashmap.  Oh, wait...
>
> While reading this thread, I sure wondered ... why don't we use the
> one on-disk fast access structure we already have: the index?

Since when the index has become a on-disk fast access structure?

> Sure, one problem is that the index reading code is inherently written
> for a single index state.

That's wrong, but because the index is not a on-disk fast access structure
to begin with, the incorrect statement about it is excused ;-)

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: RFC: Flat directory for notes, or fan-out?  Both!
  2009-02-09 21:12 RFC: Flat directory for notes, or fan-out? Both! Johannes Schindelin
  2009-02-10  7:58 ` Boyd Stephen Smith Jr.
  2009-02-10 12:18 ` Jeff King
@ 2009-02-11  1:14 ` Sam Vilain
  2 siblings, 0 replies; 33+ messages in thread
From: Sam Vilain @ 2009-02-11  1:14 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git, peff, spearce

Johannes Schindelin wrote:
> Hi,
>
> Shawn triggered some well needed thinking on my part about the notes 
> implementation.  At the moment, we have flat directory structure, and read 
> all of them in one go (when needed).
>
> I think we should support that, because it is relatively easy to generate 
> that kind of trees for small-scale applications.
>
> However, I think there is also a benefit to handle fan-out directory 
> structures, too: they scale much nicer.
>
> If the commit name was not found as a filename, it could be searched in 
> whatever subdirectory whose name is a prefix of said commit name (first 
> wins).
>   

Great idea! Glad I thought of it! ;-)

http://thread.gmane.org/gmane.comp.version-control.git/106715/focus=107975

I hoped my approach allowed for smarter things later, such as splitting
into smaller buckets whenever a directory gets more than N entries or
periodically rebalancing if required. But the initial version is at
least forward thinking to support reading it.

Merging them will need to be savvy of this of course.

Sam.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: RFC: Flat directory for notes, or fan-out?  Both!
  2009-02-10 13:16   ` Jeff King
@ 2009-02-11  1:58     ` Boyd Stephen Smith Jr.
  2009-02-11  2:35       ` Linus Torvalds
  0 siblings, 1 reply; 33+ messages in thread
From: Boyd Stephen Smith Jr. @ 2009-02-11  1:58 UTC (permalink / raw)
  To: Jeff King; +Cc: Johannes Schindelin, git, spearce

[-- Attachment #1: Type: text/plain, Size: 2071 bytes --]

On Tuesday 10 February 2009 07:16:00 you wrote:
> On Tue, Feb 10, 2009 at 01:58:41AM -0600, Boyd Stephen Smith Jr. wrote:
> > On Monday 09 February 2009 15:12:06 Johannes Schindelin wrote:
> > > So I think it would be a sane plan to do the following when a commit
> > > note is requested:
> >
> > So, something like a Trie data structure?  I think that is a great way to
> > store fixed-length strings from a limited alphabet with arbitrary data
> > attached.
>
> I don't think a Trie quite makes sense here. We still have to look
> linearly through each git tree (an artifact of the tree implementation).

Perhaps it's not a traditional trie structure but that was the closest analogy 
I could come up with.  I was actually thinking of something between a trie and 
a b-tree, I think.  (It has been a long time since data structures class...)

The issue, as I understand it, it that we don't have gargantuan tree objects.  
Reading and writing are slow and they'd also take up way to much memory if you 
are only trying to find a few commits.

So, we figure out a maximum tree size that is reasonable, figure out a fan-out 
that prevents the tree from growing above that size, but *dynamically* apply 
that fan-out.  I.e. if the fanout is 2 characters, and we've added notes for 
both ff82730c and ff23abc0, then our tree would have ff/ -> some_tree_sha, but 
if we had only a note for the one one our tree would have ff82730c... -> 
some_note_sha.  Unlike .git/objects, we should probably also do dynamic fanout 
in subtrees.

Yes, this would require a custom merge strategy for notes to flatten -> merge 
-> canonicalize.

> Or did you mean something else entirely?

Yeah, that.

While I'm throwing out crazy ideas, why not makes a notes tree look just like 
.git/objects, including info and pack directories?
-- 
Boyd Stephen Smith Jr.                   ,= ,-_-. =.
bss@iguanasuicide.net                   ((_/)o o(\_))
ICQ: 514984 YM/AIM: DaTwinkDaddy         `-'(. .)`-'
http://iguanasuicide.net/                    \_/


[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: RFC: Flat directory for notes, or fan-out?  Both!
  2009-02-11  1:58     ` Boyd Stephen Smith Jr.
@ 2009-02-11  2:35       ` Linus Torvalds
  2009-02-11  3:30         ` Sam Vilain
  0 siblings, 1 reply; 33+ messages in thread
From: Linus Torvalds @ 2009-02-11  2:35 UTC (permalink / raw)
  To: Boyd Stephen Smith Jr.; +Cc: Jeff King, Johannes Schindelin, git, spearce



On Tue, 10 Feb 2009, Boyd Stephen Smith Jr. wrote:
> 
> Yes, this would require a custom merge strategy for notes to flatten -> merge 
> -> canonicalize.

That sounds unnecessarily complicated. It also really sucks for the case 
you want to optimize: small differences between trees, where you don't 
need to even linearize the common parts.

Why not make it just a straight fixed 12-bit prefix, single-level trie.

Sure, if you have less than 4k objects, it's going to add an unnecessary 
indirection, and close to an extra tree object for each object. But it 
should scale pretty well to a fairly huge numbe of notes. IOW, if you have 
less than 2^24 notes (16 million), you'll never have a tree object with 
more than 4k entries.

And with each tree being ~70 bytes/object (40 bytes name, 20 bytes SHA1 + 
overhead), the individual tree objects will still be a reasonable(ish) 
size. And the fixed depth and prefix size means that merging is trivial 
and can use the normal tree merge that avoids touching common subtrees.

The default .git/objects fan-out of just 8 bits might work too, but if 
we're thinking millions of notes (which is not entirely unreasonable), it 
gets ugly pretty fast. The reason it works ok for git is the repacking.

			Linus

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: RFC: Flat directory for notes, or fan-out?  Both!
  2009-02-10 16:44         ` Shawn O. Pearce
  2009-02-10 17:09           ` Johannes Schindelin
@ 2009-02-11  3:19           ` Sam Vilain
  1 sibling, 0 replies; 33+ messages in thread
From: Sam Vilain @ 2009-02-11  3:19 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Johannes Schindelin, Jeff King, gitster, git

Shawn O. Pearce wrote:
>> The point you raised earlier, that there would be a lot of ambiguity if 
>> we allow both flat and fan-out directory structures, is a valid point, 
>> though.
> 
> Yup.  The flat vs. fan-out is a problem.
  [...]
> Notes on commits though are a hell of a problem.  SHA-1 is just so
> uniform at distributing the commits around the namespace that even
> with just the 200 most recent commits we wind up with a commit in
> almost every "bucket", assuming a two hex digit fan-out bucket like
> the loose object directory.

I think my patch from 1 Feb addressed this, at least for the operations
it implemented.

I just don't see why you need to decide up front what the split is going
to be.  Just read the next tree, descend into the closest matching tree
until you find the record you are looking for and that's it.  Sure, my
patch just loads it all and throws it into a hash - this should still be
efficient for short log operations even if the hash table ends up 1MB.
But why take my guess.  Let's stress test it.

'lorem' is the binary in the Text::Lorem Perl module.  It generates a
paragraph of random Latin text.

 wilber:~/src/git$ time git-log | wc -l
 256072

 real    0m0.709s
 user    0m0.608s
 sys     0m0.116s
 wilber:~/src/git$ git rev-list HEAD | wc -l
 17678
 wilber:~/src/git$ cat > my-editor
 #!/bin/sh

 ( lorem; echo ) > $1
 wilber:~/src/git$ chmod +x my-editor
 wilber:~/src/git$ export EDITOR=`pwd`/my-editor
 wilber:~/src/git$ export GIT_NOTES_SPLIT=2
 wilber:~/src/git$ time git-rev-list HEAD | while read rev
 > do ./git-notes.sh edit $rev; done
 fatal: unable to create '.git/refs/notes/commits.lock': File exists
 error: Ref refs/notes/commits is at
5f0732975b4acf237912a31e7ce14aa86d2e8179 but expected
725a2d119d2725e7d821906ad085bfbadbf43c8e
fatal: Cannot lock the ref 'refs/notes/commits'.
 [...]
 fatal: unable to write new index file
 Could not read index
 fatal: unable to write new index file
 Could not read index
 fatal: unable to write new index file
 Could not read index
 fatal: unable to write new index file
 Could not read index
 fatal: unable to write new index file
 Could not read index

 real    76m16.927s
 user    43m55.909s
 sys     19m33.005s

Oo.  Nasty errors there but never mind that for now.  Obviously some
remaining issues in the shell script.

What did I get out of that?

 wilber:~/src/git$ git-ls-tree -r refs/notes/commits | wc
   12043   48172 1144085
 wilber:~/src/git$

Hey well that's not too bad.  Enough to be a good test.  How long does
"git-log" take now?

 wilber:~/src/git$ time ./git-log | wc -l
 292201

 real    0m13.740s
 user    0m0.852s
 sys     0m0.716s
 wilber:~/src/git$ time ./git-log | wc -l
 292201

 real    0m1.335s
 user    0m0.856s
 sys     0m0.512s

Not bad!  Cool cache performance sucked there but only a 50% slowdown
for reading almost twice the number of objects.  Let's try 200 commits:

 wilber:~/src/git$ time git-log -200 | wc -l
 2877

 real    0m0.027s
 user    0m0.008s
 sys     0m0.020s

 wilber:~/src/git$ time ./git-log -200 | wc -l
 3477

 real    0m0.081s
 user    0m0.056s
 sys     0m0.020s

Quite a big slowdown proportionally, but not a huge amount in absolute
terms.  And we didn't even make the builtin-log machinery smart enough
to skip unneeded trees!

>  In a slightly unrelated
> thread offlist I have been talking with Sam Vilain about using Git
> as a database backend for tuple storage.
  [...]
> This would make the git-notes.sh code a *lot* more complex, as you
> can't just toss everything into an index file and then update it with
> a single update-index call.  Doing a tree split is much more work and
> requires removing and adding back all of the affected path names.
> (Its also perhaps unreasonable anyway to load 17,491 paths into a
> temporary index just to twiddle a note for the latest commit.)

Hehe, horribly overcomplicated for this use case... many applicable
ideas though.

> For the "git database" thing above, I've been contemplating the
> idea of an index stored external from the Git object database.
> Sam thinks indexes should be in the object database tree, but
> I'm considering storing them outside entirely because we can
> make the indexes more easily searched by a hash or binary search,
> like pack-*.idx.  Whenever the "database ref" gets moved we'd need
> to run a "sync" utility to bring these external indexes current.
> But they could also be more efficiently scanned.

Well either way it's a file you've got to scan somehow ... guess it
doesn't matter much whether it's in-tree or not.  I was actually saying
that there are some use cases where you might want to keep indexes in
the history and some where you don't.  Keeping them in-tree is not
normalised, but there are good use cases for it - eg efficient retrieval
of pre-computed aggregates that don't need to be up to the second, or
for instances where you want your nodes to be able to "hit the ground
running" after synchronisation without having to reindex.

For the use case we originally talked about I don't think you'd want any
indexes in-tree at all.

But I'd like to steer this thread well away from the database stuff I'm
drafting ... it's a lot more comprehensive, notes are a very simple hash
relationship.
-- 
Sam Vilain, Perl Hacker, Catalyst IT (NZ) Ltd.
phone: +64 4 499 2267        PGP ID: 0x66B25843

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: RFC: Flat directory for notes, or fan-out?  Both!
  2009-02-11  2:35       ` Linus Torvalds
@ 2009-02-11  3:30         ` Sam Vilain
  2009-02-11  3:54           ` Linus Torvalds
  0 siblings, 1 reply; 33+ messages in thread
From: Sam Vilain @ 2009-02-11  3:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Boyd Stephen Smith Jr., Jeff King, Johannes Schindelin, git, spearce

Linus Torvalds wrote:
> That sounds unnecessarily complicated. It also really sucks for the case 
> you want to optimize: small differences between trees, where you don't 
> need to even linearize the common parts.
>
> Why not make it just a straight fixed 12-bit prefix, single-level trie.
>   

My solution suffers from that problem too, but I personally still don't
think that the answer is to fix the trie boundary.

The only case where it hurts is when you want to merge. Nothing else
should care. So, if a merge of these note trees sees two different trie
sizes then it can convert the shorter one to the longer length first,
and then try the merge again. So you get the pain, but only once. And
when a project decides that its split is too small, it can split then
and it should "silently" spread out to others.

Sam.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: RFC: Flat directory for notes, or fan-out?  Both!
  2009-02-11  3:30         ` Sam Vilain
@ 2009-02-11  3:54           ` Linus Torvalds
  2009-02-11  5:05             ` Sam Vilain
  0 siblings, 1 reply; 33+ messages in thread
From: Linus Torvalds @ 2009-02-11  3:54 UTC (permalink / raw)
  To: Sam Vilain
  Cc: Boyd Stephen Smith Jr., Jeff King, Johannes Schindelin, git, spearce



On Wed, 11 Feb 2009, Sam Vilain wrote:
> 
> The only case where it hurts is when you want to merge. Nothing else
> should care. So, if a merge of these note trees sees two different trie
> sizes then it can convert the shorter one to the longer length first,
> and then try the merge again. So you get the pain, but only once. And
> when a project decides that its split is too small, it can split then
> and it should "silently" spread out to others.

But what's the advantage of the added complexity?

The non-fixed trie only helps for the case that doesn't matter - just a 
few annotations. If you have a thousand annotations or less, you _really_ 
don't care. Whatever you do will be fine.

So the whole thing only matters once you have tens of thousands of 
entries, and then you do want to have fan-out. No?

			Linus

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: RFC: Flat directory for notes, or fan-out?  Both!
  2009-02-11  3:54           ` Linus Torvalds
@ 2009-02-11  5:05             ` Sam Vilain
  2009-02-11 12:35               ` Johannes Schindelin
  0 siblings, 1 reply; 33+ messages in thread
From: Sam Vilain @ 2009-02-11  5:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Boyd Stephen Smith Jr., Jeff King, Johannes Schindelin, git, spearce

Linus Torvalds wrote:
>> The only case where it hurts is when you want to merge. Nothing else
>> should care. So, if a merge of these note trees sees two different trie
>> sizes then it can convert the shorter one to the longer length first,
>> and then try the merge again. So you get the pain, but only once. And
>> when a project decides that its split is too small, it can split then
>> and it should "silently" spread out to others.
>>     
>
> But what's the advantage of the added complexity?
>
> The non-fixed trie only helps for the case that doesn't matter - just a 
> few annotations. If you have a thousand annotations or less, you _really_ 
> don't care. Whatever you do will be fine.
>
> So the whole thing only matters once you have tens of thousands of 
> entries, and then you do want to have fan-out. No?
>   

Yeah. I see your point and you may be right, that a 12/28 split hurts
no-one, if we take this to the benchmarks. There's certainly savings in
terms of total object count for the small users by using a smaller split.

I just already wrote the code to handle an arbitrary split for the
features written so far[1]. If *I* can write it, in C, it means it must
not be that complicated ;-)

So it comes down to how complicated things are when merging happens. If
12 is fixed in stone this is simple, because there are no chances for
discrepancies. But refs/notes/commits still needs special treatment to
be fetched, because it is not under refs/heads/* and you wouldn't
normally have a working tree to resolve conflicts.

So I think probably the most productive thing to do is for me to write
the code to handle the merge as I described above, once the code to
handle pulling in - and merging - notes at 'git fetch' time is written.
Then we can see whether it's that much of a complication.

To bench this we need the current builtin-log implementation to be
re-written to be lazy. Which means we can't put it in the next release
unless someone writes that. However my proposal means that we can
release as we are and not care, and let some code - which I hope I have
shown isn't *that* complicated, really - deal with it in a later
release, and not break backwards compatibility.

Sam.

1. see message <1233455960.17688.122.camel@maia.lan>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: RFC: Flat directory for notes, or fan-out?  Both!
  2009-02-11  5:05             ` Sam Vilain
@ 2009-02-11 12:35               ` Johannes Schindelin
  0 siblings, 0 replies; 33+ messages in thread
From: Johannes Schindelin @ 2009-02-11 12:35 UTC (permalink / raw)
  To: Sam Vilain
  Cc: Linus Torvalds, Boyd Stephen Smith Jr., Jeff King, git, spearce

Hi,

On Wed, 11 Feb 2009, Sam Vilain wrote:

> I just already wrote the code to handle an arbitrary split for the 
> features written so far[1]. If *I* can write it, in C, it means it must 
> not be that complicated ;-)

I think I either missed your mail or had to ignore it due to too much day 
job work.

It is a good first step, of course the next step would be to load the 
trees on-demand.

Oh, and the best approach to handle the "to Trie or not to Trie" question 
would be to be strict in what we emit (12/28 it seems, by authority of 
Linus), and liberal in what we accept, IMHO.  That is, accept whatever 
partition of the SHA-1, stopping on the first we found (smaller number of 
slashes, or when that is equal, the smaller first prefixes).

We can always discuss ways to handle merging later, I guess.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: RFC: Flat directory for notes, or fan-out?  Both!
  2009-02-10 21:10                 ` Johannes Schindelin
  2009-02-10 22:16                   ` Thomas Rast
@ 2009-02-11 20:02                   ` Jeff King
  2009-02-11 20:57                     ` Johannes Schindelin
  1 sibling, 1 reply; 33+ messages in thread
From: Jeff King @ 2009-02-11 20:02 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Junio C Hamano, Shawn O. Pearce, git

On Tue, Feb 10, 2009 at 10:10:43PM +0100, Johannes Schindelin wrote:

> > I wonder if we can solve this by introducing a local cache that is a flat
> > file that looks like:
> [...]
> Or we could use an on-disk hashmap.  Oh, wait...

That was my first thought, as well. Maybe your original implementation
wasn't so bad, after all. :)

I searched through the archive to find a list of criticisms, but I
didn't see any. So I guess the problem was just a concern that it might
end up too complex.

-Peff

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: RFC: Flat directory for notes, or fan-out?  Both!
  2009-02-11 20:02                   ` Jeff King
@ 2009-02-11 20:57                     ` Johannes Schindelin
  2009-02-11 21:16                       ` Junio C Hamano
  0 siblings, 1 reply; 33+ messages in thread
From: Johannes Schindelin @ 2009-02-11 20:57 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Shawn O. Pearce, git

Hi,

On Wed, 11 Feb 2009, Jeff King wrote:

> On Tue, Feb 10, 2009 at 10:10:43PM +0100, Johannes Schindelin wrote:
> 
> > > I wonder if we can solve this by introducing a local cache that is a flat
> > > file that looks like:
> > [...]
> > Or we could use an on-disk hashmap.  Oh, wait...
> 
> That was my first thought, as well. Maybe your original implementation
> wasn't so bad, after all. :)
> 
> I searched through the archive to find a list of criticisms, but I
> didn't see any. So I guess the problem was just a concern that it might
> end up too complex.

Nope, the issue was that it would take too long to recreate IIRC.

BTW I am no longer a fan of the on-disk cache; I think it is an ugly 
solution to a problem that should be solved without ugliness using a 
flexible directory layout in the note ref' tree.

I mean, we really can allow different directory layouts as Sam described, 
with a few benefits, and only slight downsides.  If we support multiple 
levels anyway, the code to allow arbitrary splits is not complicated (see 
Sam's patch).

Even the merging should not pose any problem at all; we need a custom 
merge driver anyway, and there is no reason whatsoever why we should not 
just teach the merge driver to remove the slashes before comparing the 
filie names.

At edit time, we can afford to perform a check that is a little more 
expensive than it would have been otherwise.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: RFC: Flat directory for notes, or fan-out?  Both!
  2009-02-11 20:57                     ` Johannes Schindelin
@ 2009-02-11 21:16                       ` Junio C Hamano
  2009-02-11 23:05                         ` Johannes Schindelin
  0 siblings, 1 reply; 33+ messages in thread
From: Junio C Hamano @ 2009-02-11 21:16 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Jeff King, Shawn O. Pearce, git

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> Even the merging should not pose any problem at all; we need a custom 
> merge driver anyway, and there is no reason whatsoever why we should not 
> just teach the merge driver to remove the slashes before comparing the 
> filie names.

Once you start talking about "remove the slashes", you are assuming that
the custom merge algorithm must look at *all the paths* in the two trees
being merged, and it is a sign that your thinking is so trapped in the
inefficient way the current merge-recursive and unpack-trees based merge
works, and cannot think about the possibility that there could be more
efficient way to do merges.  Not very good.

If you have a fixed boundary and if most subtrees are the same between two
notes during a merge, we can do the same optimization as we do for two
input "diff-tree" codepath.  If the top of a subtree matches, we do not
even have to look at their subtree.  But that is true only if you do not
remove the slashes and allow a random hierarchy.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: RFC: Flat directory for notes, or fan-out?  Both!
  2009-02-11 21:16                       ` Junio C Hamano
@ 2009-02-11 23:05                         ` Johannes Schindelin
  0 siblings, 0 replies; 33+ messages in thread
From: Johannes Schindelin @ 2009-02-11 23:05 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, Shawn O. Pearce, git

Hi,

On Wed, 11 Feb 2009, Junio C Hamano wrote:

> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
> 
> > Even the merging should not pose any problem at all; we need a custom 
> > merge driver anyway, and there is no reason whatsoever why we should 
> > not just teach the merge driver to remove the slashes before comparing 
> > the filie names.
> 
> Once you start talking about "remove the slashes", you are assuming that 
> the custom merge algorithm must look at *all the paths* in the two trees 
> being merged, and it is a sign that your thinking is so trapped in the 
> inefficient way the current merge-recursive and unpack-trees based merge 
> works, and cannot think about the possibility that there could be more 
> efficient way to do merges.  Not very good.
> 
> If you have a fixed boundary and if most subtrees are the same between 
> two notes during a merge, we can do the same optimization as we do for 
> two input "diff-tree" codepath.  If the top of a subtree matches, we do 
> not even have to look at their subtree.  But that is true only if you do 
> not remove the slashes and allow a random hierarchy.

Well, I think in the case of notes, we have to optimize for random access 
of dozens of blobs rather than for merging.  I was well aware that the 
merging is more expensive when the hierarchy is not defined a priori.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2009-02-11 23:06 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-02-09 21:12 RFC: Flat directory for notes, or fan-out? Both! Johannes Schindelin
2009-02-10  7:58 ` Boyd Stephen Smith Jr.
2009-02-10 13:16   ` Jeff King
2009-02-11  1:58     ` Boyd Stephen Smith Jr.
2009-02-11  2:35       ` Linus Torvalds
2009-02-11  3:30         ` Sam Vilain
2009-02-11  3:54           ` Linus Torvalds
2009-02-11  5:05             ` Sam Vilain
2009-02-11 12:35               ` Johannes Schindelin
2009-02-10 12:18 ` Jeff King
2009-02-10 12:59   ` Johannes Schindelin
2009-02-10 13:10     ` Jeff King
2009-02-10 13:32       ` Johannes Schindelin
2009-02-10 15:58         ` Junio C Hamano
2009-02-10 16:48           ` Shawn O. Pearce
2009-02-10 16:48           ` Johannes Schindelin
2009-02-10 16:56             ` Shawn O. Pearce
2009-02-10 17:31               ` Johannes Schindelin
2009-02-10 18:35               ` Junio C Hamano
2009-02-10 19:09                 ` Shawn O. Pearce
2009-02-10 21:10                 ` Johannes Schindelin
2009-02-10 22:16                   ` Thomas Rast
2009-02-10 22:26                     ` Thomas Rast
2009-02-10 22:32                     ` Junio C Hamano
2009-02-11 20:02                   ` Jeff King
2009-02-11 20:57                     ` Johannes Schindelin
2009-02-11 21:16                       ` Junio C Hamano
2009-02-11 23:05                         ` Johannes Schindelin
2009-02-10 16:44         ` Shawn O. Pearce
2009-02-10 17:09           ` Johannes Schindelin
2009-02-10 17:17             ` Shawn O. Pearce
2009-02-11  3:19           ` Sam Vilain
2009-02-11  1:14 ` Sam Vilain

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.