* RFC: Flat directory for notes, or fan-out? Both! @ 2009-02-09 21:12 Johannes Schindelin 2009-02-10 7:58 ` Boyd Stephen Smith Jr. ` (2 more replies) 0 siblings, 3 replies; 33+ messages in thread From: Johannes Schindelin @ 2009-02-09 21:12 UTC (permalink / raw) To: git, peff, spearce Hi, Shawn triggered some well needed thinking on my part about the notes implementation. At the moment, we have flat directory structure, and read all of them in one go (when needed). I think we should support that, because it is relatively easy to generate that kind of trees for small-scale applications. However, I think there is also a benefit to handle fan-out directory structures, too: they scale much nicer. If the commit name was not found as a filename, it could be searched in whatever subdirectory whose name is a prefix of said commit name (first wins). So I think it would be a sane plan to do the following when a commit note is requested: - If not done yet, read in the whole top-level directory of the notes ref. - If the commit name is not found, find the tree entries whose name is a prefix of the commit name (we can even use the same hashmap to store these "incomplete" names, as we use a linear hash, which we fill in ascending order), - read the trees one by one, until the commit name is found (or no tree entry is left), deleting the trees from the hashmap on the go. How does that sound? Ciao, Dscho ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: RFC: Flat directory for notes, or fan-out? Both! 2009-02-09 21:12 RFC: Flat directory for notes, or fan-out? Both! Johannes Schindelin @ 2009-02-10 7:58 ` Boyd Stephen Smith Jr. 2009-02-10 13:16 ` Jeff King 2009-02-10 12:18 ` Jeff King 2009-02-11 1:14 ` Sam Vilain 2 siblings, 1 reply; 33+ messages in thread From: Boyd Stephen Smith Jr. @ 2009-02-10 7:58 UTC (permalink / raw) To: Johannes Schindelin; +Cc: git, peff, spearce [-- Attachment #1: Type: text/plain, Size: 540 bytes --] On Monday 09 February 2009 15:12:06 Johannes Schindelin wrote: > So I think it would be a sane plan to do the following when a commit note > is requested: So, something like a Trie data structure? I think that is a great way to store fixed-length strings from a limited alphabet with arbitrary data attached. -- Boyd Stephen Smith Jr. ,= ,-_-. =. bss@iguanasuicide.net ((_/)o o(\_)) ICQ: 514984 YM/AIM: DaTwinkDaddy `-'(. .)`-' http://iguanasuicide.net/ \_/ [-- Attachment #2: This is a digitally signed message part. --] [-- Type: application/pgp-signature, Size: 197 bytes --] ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: RFC: Flat directory for notes, or fan-out? Both! 2009-02-10 7:58 ` Boyd Stephen Smith Jr. @ 2009-02-10 13:16 ` Jeff King 2009-02-11 1:58 ` Boyd Stephen Smith Jr. 0 siblings, 1 reply; 33+ messages in thread From: Jeff King @ 2009-02-10 13:16 UTC (permalink / raw) To: Boyd Stephen Smith Jr.; +Cc: Johannes Schindelin, git, spearce On Tue, Feb 10, 2009 at 01:58:41AM -0600, Boyd Stephen Smith Jr. wrote: > On Monday 09 February 2009 15:12:06 Johannes Schindelin wrote: > > So I think it would be a sane plan to do the following when a commit note > > is requested: > > So, something like a Trie data structure? I think that is a great way to > store fixed-length strings from a limited alphabet with arbitrary data > attached. I don't think a Trie quite makes sense here. We still have to look linearly through each git tree (an artifact of the tree implementation). You could organize the tree into a deeper, more complex data structure than just a simple fan-out. But remember that traditional data structures are usually trying to save expensive comparisons, and following a pointer is inexpensive. In the case of git trees, though, following a pointer into a subtree is _very_ expensive, since you have to lookup and decompress the object. So what we do now is read the tree into an associative hash. You could replace the hash with a trie, but it is not really the performance-critical part here. The issue is that without fan-out you have to read the _whole_ tree into the hash. With a constant-sized fanout, you get to divide that work by a constant. Or did you mean something else entirely? -Peff ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: RFC: Flat directory for notes, or fan-out? Both! 2009-02-10 13:16 ` Jeff King @ 2009-02-11 1:58 ` Boyd Stephen Smith Jr. 2009-02-11 2:35 ` Linus Torvalds 0 siblings, 1 reply; 33+ messages in thread From: Boyd Stephen Smith Jr. @ 2009-02-11 1:58 UTC (permalink / raw) To: Jeff King; +Cc: Johannes Schindelin, git, spearce [-- Attachment #1: Type: text/plain, Size: 2071 bytes --] On Tuesday 10 February 2009 07:16:00 you wrote: > On Tue, Feb 10, 2009 at 01:58:41AM -0600, Boyd Stephen Smith Jr. wrote: > > On Monday 09 February 2009 15:12:06 Johannes Schindelin wrote: > > > So I think it would be a sane plan to do the following when a commit > > > note is requested: > > > > So, something like a Trie data structure? I think that is a great way to > > store fixed-length strings from a limited alphabet with arbitrary data > > attached. > > I don't think a Trie quite makes sense here. We still have to look > linearly through each git tree (an artifact of the tree implementation). Perhaps it's not a traditional trie structure but that was the closest analogy I could come up with. I was actually thinking of something between a trie and a b-tree, I think. (It has been a long time since data structures class...) The issue, as I understand it, it that we don't have gargantuan tree objects. Reading and writing are slow and they'd also take up way to much memory if you are only trying to find a few commits. So, we figure out a maximum tree size that is reasonable, figure out a fan-out that prevents the tree from growing above that size, but *dynamically* apply that fan-out. I.e. if the fanout is 2 characters, and we've added notes for both ff82730c and ff23abc0, then our tree would have ff/ -> some_tree_sha, but if we had only a note for the one one our tree would have ff82730c... -> some_note_sha. Unlike .git/objects, we should probably also do dynamic fanout in subtrees. Yes, this would require a custom merge strategy for notes to flatten -> merge -> canonicalize. > Or did you mean something else entirely? Yeah, that. While I'm throwing out crazy ideas, why not makes a notes tree look just like .git/objects, including info and pack directories? -- Boyd Stephen Smith Jr. ,= ,-_-. =. bss@iguanasuicide.net ((_/)o o(\_)) ICQ: 514984 YM/AIM: DaTwinkDaddy `-'(. .)`-' http://iguanasuicide.net/ \_/ [-- Attachment #2: This is a digitally signed message part. --] [-- Type: application/pgp-signature, Size: 197 bytes --] ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: RFC: Flat directory for notes, or fan-out? Both! 2009-02-11 1:58 ` Boyd Stephen Smith Jr. @ 2009-02-11 2:35 ` Linus Torvalds 2009-02-11 3:30 ` Sam Vilain 0 siblings, 1 reply; 33+ messages in thread From: Linus Torvalds @ 2009-02-11 2:35 UTC (permalink / raw) To: Boyd Stephen Smith Jr.; +Cc: Jeff King, Johannes Schindelin, git, spearce On Tue, 10 Feb 2009, Boyd Stephen Smith Jr. wrote: > > Yes, this would require a custom merge strategy for notes to flatten -> merge > -> canonicalize. That sounds unnecessarily complicated. It also really sucks for the case you want to optimize: small differences between trees, where you don't need to even linearize the common parts. Why not make it just a straight fixed 12-bit prefix, single-level trie. Sure, if you have less than 4k objects, it's going to add an unnecessary indirection, and close to an extra tree object for each object. But it should scale pretty well to a fairly huge numbe of notes. IOW, if you have less than 2^24 notes (16 million), you'll never have a tree object with more than 4k entries. And with each tree being ~70 bytes/object (40 bytes name, 20 bytes SHA1 + overhead), the individual tree objects will still be a reasonable(ish) size. And the fixed depth and prefix size means that merging is trivial and can use the normal tree merge that avoids touching common subtrees. The default .git/objects fan-out of just 8 bits might work too, but if we're thinking millions of notes (which is not entirely unreasonable), it gets ugly pretty fast. The reason it works ok for git is the repacking. Linus ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: RFC: Flat directory for notes, or fan-out? Both! 2009-02-11 2:35 ` Linus Torvalds @ 2009-02-11 3:30 ` Sam Vilain 2009-02-11 3:54 ` Linus Torvalds 0 siblings, 1 reply; 33+ messages in thread From: Sam Vilain @ 2009-02-11 3:30 UTC (permalink / raw) To: Linus Torvalds Cc: Boyd Stephen Smith Jr., Jeff King, Johannes Schindelin, git, spearce Linus Torvalds wrote: > That sounds unnecessarily complicated. It also really sucks for the case > you want to optimize: small differences between trees, where you don't > need to even linearize the common parts. > > Why not make it just a straight fixed 12-bit prefix, single-level trie. > My solution suffers from that problem too, but I personally still don't think that the answer is to fix the trie boundary. The only case where it hurts is when you want to merge. Nothing else should care. So, if a merge of these note trees sees two different trie sizes then it can convert the shorter one to the longer length first, and then try the merge again. So you get the pain, but only once. And when a project decides that its split is too small, it can split then and it should "silently" spread out to others. Sam. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: RFC: Flat directory for notes, or fan-out? Both! 2009-02-11 3:30 ` Sam Vilain @ 2009-02-11 3:54 ` Linus Torvalds 2009-02-11 5:05 ` Sam Vilain 0 siblings, 1 reply; 33+ messages in thread From: Linus Torvalds @ 2009-02-11 3:54 UTC (permalink / raw) To: Sam Vilain Cc: Boyd Stephen Smith Jr., Jeff King, Johannes Schindelin, git, spearce On Wed, 11 Feb 2009, Sam Vilain wrote: > > The only case where it hurts is when you want to merge. Nothing else > should care. So, if a merge of these note trees sees two different trie > sizes then it can convert the shorter one to the longer length first, > and then try the merge again. So you get the pain, but only once. And > when a project decides that its split is too small, it can split then > and it should "silently" spread out to others. But what's the advantage of the added complexity? The non-fixed trie only helps for the case that doesn't matter - just a few annotations. If you have a thousand annotations or less, you _really_ don't care. Whatever you do will be fine. So the whole thing only matters once you have tens of thousands of entries, and then you do want to have fan-out. No? Linus ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: RFC: Flat directory for notes, or fan-out? Both! 2009-02-11 3:54 ` Linus Torvalds @ 2009-02-11 5:05 ` Sam Vilain 2009-02-11 12:35 ` Johannes Schindelin 0 siblings, 1 reply; 33+ messages in thread From: Sam Vilain @ 2009-02-11 5:05 UTC (permalink / raw) To: Linus Torvalds Cc: Boyd Stephen Smith Jr., Jeff King, Johannes Schindelin, git, spearce Linus Torvalds wrote: >> The only case where it hurts is when you want to merge. Nothing else >> should care. So, if a merge of these note trees sees two different trie >> sizes then it can convert the shorter one to the longer length first, >> and then try the merge again. So you get the pain, but only once. And >> when a project decides that its split is too small, it can split then >> and it should "silently" spread out to others. >> > > But what's the advantage of the added complexity? > > The non-fixed trie only helps for the case that doesn't matter - just a > few annotations. If you have a thousand annotations or less, you _really_ > don't care. Whatever you do will be fine. > > So the whole thing only matters once you have tens of thousands of > entries, and then you do want to have fan-out. No? > Yeah. I see your point and you may be right, that a 12/28 split hurts no-one, if we take this to the benchmarks. There's certainly savings in terms of total object count for the small users by using a smaller split. I just already wrote the code to handle an arbitrary split for the features written so far[1]. If *I* can write it, in C, it means it must not be that complicated ;-) So it comes down to how complicated things are when merging happens. If 12 is fixed in stone this is simple, because there are no chances for discrepancies. But refs/notes/commits still needs special treatment to be fetched, because it is not under refs/heads/* and you wouldn't normally have a working tree to resolve conflicts. So I think probably the most productive thing to do is for me to write the code to handle the merge as I described above, once the code to handle pulling in - and merging - notes at 'git fetch' time is written. Then we can see whether it's that much of a complication. To bench this we need the current builtin-log implementation to be re-written to be lazy. Which means we can't put it in the next release unless someone writes that. However my proposal means that we can release as we are and not care, and let some code - which I hope I have shown isn't *that* complicated, really - deal with it in a later release, and not break backwards compatibility. Sam. 1. see message <1233455960.17688.122.camel@maia.lan> ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: RFC: Flat directory for notes, or fan-out? Both! 2009-02-11 5:05 ` Sam Vilain @ 2009-02-11 12:35 ` Johannes Schindelin 0 siblings, 0 replies; 33+ messages in thread From: Johannes Schindelin @ 2009-02-11 12:35 UTC (permalink / raw) To: Sam Vilain Cc: Linus Torvalds, Boyd Stephen Smith Jr., Jeff King, git, spearce Hi, On Wed, 11 Feb 2009, Sam Vilain wrote: > I just already wrote the code to handle an arbitrary split for the > features written so far[1]. If *I* can write it, in C, it means it must > not be that complicated ;-) I think I either missed your mail or had to ignore it due to too much day job work. It is a good first step, of course the next step would be to load the trees on-demand. Oh, and the best approach to handle the "to Trie or not to Trie" question would be to be strict in what we emit (12/28 it seems, by authority of Linus), and liberal in what we accept, IMHO. That is, accept whatever partition of the SHA-1, stopping on the first we found (smaller number of slashes, or when that is equal, the smaller first prefixes). We can always discuss ways to handle merging later, I guess. Ciao, Dscho ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: RFC: Flat directory for notes, or fan-out? Both! 2009-02-09 21:12 RFC: Flat directory for notes, or fan-out? Both! Johannes Schindelin 2009-02-10 7:58 ` Boyd Stephen Smith Jr. @ 2009-02-10 12:18 ` Jeff King 2009-02-10 12:59 ` Johannes Schindelin 2009-02-11 1:14 ` Sam Vilain 2 siblings, 1 reply; 33+ messages in thread From: Jeff King @ 2009-02-10 12:18 UTC (permalink / raw) To: Johannes Schindelin; +Cc: git, spearce On Mon, Feb 09, 2009 at 10:12:06PM +0100, Johannes Schindelin wrote: > Shawn triggered some well needed thinking on my part about the notes > implementation. At the moment, we have flat directory structure, and read > all of them in one go (when needed). > > I think we should support that, because it is relatively easy to generate > that kind of trees for small-scale applications. Hmm. Do we really care about how easy it is to generate? Are we expecting people to not use the command interface and instead check out a notes tree and start putting stuff into $commit/foo? And if we are encouraging the dual possibilities, how do we handle the case of merging two trees with equivalent but differently-formatted content? Imagine I have three users, A, B, and C, all collaborating on a project with notes. A and B use the "git notes" interface which generates a fan-out directory structure. C uses his own script that directly writes to the notes tree without fan-out. Now let's imagine A, B, and C all write a note for commit X, and A pulls from the other two. When he pulls from B, there is a file-level conflict, and he decides that his note is better and resolves in his favor. But when he pulls from C, there is _no_ conflict, and now there are two notes for the same commit in his notes tree. You can give the multiple notes some sane semantics (one trumps the other, or they are a list, or whatever), but there is still an inconsistency: B's notes and C's notes behave differently. So now A has to start caring about how other people generate their notes. The only two solutions I can think of are: - when A pulls notes, he does a specialized merge that normalizes the note trees - particular notes trees are specified as being in "fan out" or "not fan out" mode. But there is no place to specify that to enforce it. -Peff ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: RFC: Flat directory for notes, or fan-out? Both! 2009-02-10 12:18 ` Jeff King @ 2009-02-10 12:59 ` Johannes Schindelin 2009-02-10 13:10 ` Jeff King 0 siblings, 1 reply; 33+ messages in thread From: Johannes Schindelin @ 2009-02-10 12:59 UTC (permalink / raw) To: Jeff King; +Cc: git, spearce Hi, On Tue, 10 Feb 2009, Jeff King wrote: > On Mon, Feb 09, 2009 at 10:12:06PM +0100, Johannes Schindelin wrote: > > > Shawn triggered some well needed thinking on my part about the notes > > implementation. At the moment, we have flat directory structure, and read > > all of them in one go (when needed). > > > > I think we should support that, because it is relatively easy to generate > > that kind of trees for small-scale applications. > > Hmm. Do we really care about how easy it is to generate? Are we > expecting people to not use the command interface and instead check out > a notes tree and start putting stuff into $commit/foo? I wanted to be nice to existing users of the feature (it is in 'next', after all, and Thomas has produced some awesome examples, that will hopefully show the scalability of the thing). But you're right, it almost, but not quite, too late to switch. > And if we are encouraging the dual possibilities, how do we handle the > case of merging two trees with equivalent but differently-formatted > content? > > Imagine I have three users, A, B, and C, all collaborating on a project > with notes. A and B use the "git notes" interface which generates a > fan-out directory structure. C uses his own script that directly writes > to the notes tree without fan-out. > > Now let's imagine A, B, and C all write a note for commit X, and A pulls > from the other two. When he pulls from B, there is a file-level > conflict, and he decides that his note is better and resolves in his > favor. But when he pulls from C, there is _no_ conflict, and now there > are two notes for the same commit in his notes tree. You can give the > multiple notes some sane semantics (one trumps the other, or they are a > list, or whatever), but there is still an inconsistency: B's notes and > C's notes behave differently. So now A has to start caring about how > other people generate their notes. > > The only two solutions I can think of are: > > - when A pulls notes, he does a specialized merge that normalizes the > note trees > > - particular notes trees are specified as being in "fan out" or "not > fan out" mode. But there is no place to specify that to enforce it. You're correct. This buys all kinds of trouble. Ciao, Dscho ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: RFC: Flat directory for notes, or fan-out? Both! 2009-02-10 12:59 ` Johannes Schindelin @ 2009-02-10 13:10 ` Jeff King 2009-02-10 13:32 ` Johannes Schindelin 0 siblings, 1 reply; 33+ messages in thread From: Jeff King @ 2009-02-10 13:10 UTC (permalink / raw) To: Johannes Schindelin; +Cc: git, spearce On Tue, Feb 10, 2009 at 01:59:06PM +0100, Johannes Schindelin wrote: > > Hmm. Do we really care about how easy it is to generate? Are we > > expecting people to not use the command interface and instead check out > > a notes tree and start putting stuff into $commit/foo? > > I wanted to be nice to existing users of the feature (it is in 'next', > after all, and Thomas has produced some awesome examples, that will > hopefully show the scalability of the thing). > > But you're right, it almost, but not quite, too late to switch. OK. I think if you are seeing performance benefits from a 2-character fanout, then we should standardize on that (do you have new performance numbers somewhere?). The notes implementation is now in master. If it's about to change in an incompatible way, how do you want to handle it? I'm wary of a quick patch to change the format this late in the release cycle. We could hold it back from 1.6.2. Alternatively, we could let it release with a "this is probably going to change" warning. I think I favor holding it back, but I am not picky. > > multiple notes some sane semantics (one trumps the other, or they are a > > list, or whatever), but there is still an inconsistency: B's notes and > > C's notes behave differently. So now A has to start caring about how > > other people generate their notes. > > [...] > > You're correct. This buys all kinds of trouble. One other thing to note: I think we discussed in the past other kinds of "more than one way to store it" strategies (e.g., letting a blob note be the same as a tree note containing a blob "default"). They suffer from some of the same issues (though not quite as badly, since you would at least see that there was a conflict). -Peff ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: RFC: Flat directory for notes, or fan-out? Both! 2009-02-10 13:10 ` Jeff King @ 2009-02-10 13:32 ` Johannes Schindelin 2009-02-10 15:58 ` Junio C Hamano 2009-02-10 16:44 ` Shawn O. Pearce 0 siblings, 2 replies; 33+ messages in thread From: Johannes Schindelin @ 2009-02-10 13:32 UTC (permalink / raw) To: Jeff King, gitster; +Cc: git, spearce Hi, [Junio: seems like both Peff and me would like to hold the notes out of 1.6.2, would you mind?] On Tue, 10 Feb 2009, Jeff King wrote: > On Tue, Feb 10, 2009 at 01:59:06PM +0100, Johannes Schindelin wrote: > > > > Hmm. Do we really care about how easy it is to generate? Are we > > > expecting people to not use the command interface and instead check out > > > a notes tree and start putting stuff into $commit/foo? > > > > I wanted to be nice to existing users of the feature (it is in 'next', > > after all, and Thomas has produced some awesome examples, that will > > hopefully show the scalability of the thing). > > > > But you're right, it almost, but not quite, too late to switch. > > OK. I think if you are seeing performance benefits from a 2-character > fanout, then we should standardize on that (do you have new performance > numbers somewhere?). The thing is: Shawn is correct when he says that a tree object to hold the notes of all commits (which is not an unlikely scenario if you are thinking about corporate processes) would be huge. > The notes implementation is now in master. If it's about to change in an > incompatible way, how do you want to handle it? I'm wary of a quick > patch to change the format this late in the release cycle. We could hold > it back from 1.6.2. Alternatively, we could let it release with a "this > is probably going to change" warning. > > I think I favor holding it back, but I am not picky. Yes, I am also in favor of holding it back. > > > multiple notes some sane semantics (one trumps the other, or they are a > > > list, or whatever), but there is still an inconsistency: B's notes and > > > C's notes behave differently. So now A has to start caring about how > > > other people generate their notes. > > > [...] > > > > You're correct. This buys all kinds of trouble. > > One other thing to note: I think we discussed in the past other kinds of > "more than one way to store it" strategies (e.g., letting a blob note be > the same as a tree note containing a blob "default"). They suffer from > some of the same issues (though not quite as badly, since you would at > least see that there was a conflict). Actually, I do not see much of a problem there. If the entry (corresponding to the commit name) in the notes tree points to a blob, then that is that, if it points to a tree, then we just read all of the objects therein (or maybe at a later stage we allow restricting to a certain file basename). The point you raised earlier, that there would be a lot of ambiguity if we allow both flat and fan-out directory structures, is a valid point, though. Ciao, Dscho ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: RFC: Flat directory for notes, or fan-out? Both! 2009-02-10 13:32 ` Johannes Schindelin @ 2009-02-10 15:58 ` Junio C Hamano 2009-02-10 16:48 ` Shawn O. Pearce 2009-02-10 16:48 ` Johannes Schindelin 2009-02-10 16:44 ` Shawn O. Pearce 1 sibling, 2 replies; 33+ messages in thread From: Junio C Hamano @ 2009-02-10 15:58 UTC (permalink / raw) To: Johannes Schindelin; +Cc: Jeff King, git, spearce Johannes Schindelin <Johannes.Schindelin@gmx.de> writes: >> OK. I think if you are seeing performance benefits from a 2-character >> fanout, then we should standardize on that (do you have new performance >> numbers somewhere?). > > The thing is: Shawn is correct when he says that a tree object to hold the > notes of all commits (which is not an unlikely scenario if you are > thinking about corporate processes) would be huge. > >> The notes implementation is now in master. If it's about to change in an >> incompatible way, how do you want to handle it? I'm wary of a quick >> patch to change the format this late in the release cycle. We could hold >> it back from 1.6.2. Alternatively, we could let it release with a "this >> is probably going to change" warning. >> >> I think I favor holding it back, but I am not picky. > > Yes, I am also in favor of holding it back. I could do a revert on 'master' if it is really needed, but I found that the above reasoning is a bit troublesome. The thing is, if a tree to hold the notes would be huge to be unmanageable, then it would still be huge to be unmanageable if you split it into 256 pieces. I'd rather prefer to see us first try to find a way to optimze the tree parser. Maybe packv4 or Linus's binary search (which IIRC you declared would not work --- I recall I once thought about it myself but I do not recall what my conclusions were) play a role in it. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: RFC: Flat directory for notes, or fan-out? Both! 2009-02-10 15:58 ` Junio C Hamano @ 2009-02-10 16:48 ` Shawn O. Pearce 2009-02-10 16:48 ` Johannes Schindelin 1 sibling, 0 replies; 33+ messages in thread From: Shawn O. Pearce @ 2009-02-10 16:48 UTC (permalink / raw) To: Junio C Hamano; +Cc: Johannes Schindelin, Jeff King, git Junio C Hamano <gitster@pobox.com> wrote: > > I could do a revert on 'master' if it is really needed, but I found that > the above reasoning is a bit troublesome. The thing is, if a tree to hold > the notes would be huge to be unmanageable, then it would still be huge to > be unmanageable if you split it into 256 pieces. > > I'd rather prefer to see us first try to find a way to optimze the tree > parser. Maybe packv4 or Linus's binary search (which IIRC you declared > would not work --- I recall I once thought about it myself but I do not > recall what my conclusions were) play a role in it. packv4 as proposed wouldn't help a notes tree. It relied on the fact that we'd have no more than 64k unique file names in a repository, and any name which overflowed that 64k limit would force its tree to be a canonical format tree, which is what we are trying to avoid here. -- Shawn. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: RFC: Flat directory for notes, or fan-out? Both! 2009-02-10 15:58 ` Junio C Hamano 2009-02-10 16:48 ` Shawn O. Pearce @ 2009-02-10 16:48 ` Johannes Schindelin 2009-02-10 16:56 ` Shawn O. Pearce 1 sibling, 1 reply; 33+ messages in thread From: Johannes Schindelin @ 2009-02-10 16:48 UTC (permalink / raw) To: Junio C Hamano; +Cc: Jeff King, git, spearce Hi, On Tue, 10 Feb 2009, Junio C Hamano wrote: > Johannes Schindelin <Johannes.Schindelin@gmx.de> writes: > > >> OK. I think if you are seeing performance benefits from a 2-character > >> fanout, then we should standardize on that (do you have new performance > >> numbers somewhere?). > > > > The thing is: Shawn is correct when he says that a tree object to hold the > > notes of all commits (which is not an unlikely scenario if you are > > thinking about corporate processes) would be huge. > > > >> The notes implementation is now in master. If it's about to change in an > >> incompatible way, how do you want to handle it? I'm wary of a quick > >> patch to change the format this late in the release cycle. We could hold > >> it back from 1.6.2. Alternatively, we could let it release with a "this > >> is probably going to change" warning. > >> > >> I think I favor holding it back, but I am not picky. > > > > Yes, I am also in favor of holding it back. > > I could do a revert on 'master' if it is really needed, but I found that > the above reasoning is a bit troublesome. The thing is, if a tree to hold > the notes would be huge to be unmanageable, then it would still be huge to > be unmanageable if you split it into 256 pieces. The thing is, a tree object of 17 megabyte is unmanagably large if you have to read it whenever you access even a single node. Having 256 trees instead, each of which is about 68 kilobyte is much nicer. > I'd rather prefer to see us first try to find a way to optimze the tree > parser. Maybe packv4 or Linus's binary search (which IIRC you declared > would not work --- I recall I once thought about it myself but I do not > recall what my conclusions were) play a role in it. I declared it did not work, and showed an example here: http://article.gmane.org/gmane.comp.version-control.git/103297 Now, this example is so concocted that it is not even funny. For example, it falls flat down for notes, as the names never contain spaces there. I guess that we could detect possible false positives such as my example, by searching for NULs and spaces in the vicinity, and just search on if there is a salmon of a doubt left. But the worst part about it: we'd still have to unpack the whole tree object to start bisecting (as described in said mail). Ciao, Dscho ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: RFC: Flat directory for notes, or fan-out? Both! 2009-02-10 16:48 ` Johannes Schindelin @ 2009-02-10 16:56 ` Shawn O. Pearce 2009-02-10 17:31 ` Johannes Schindelin 2009-02-10 18:35 ` Junio C Hamano 0 siblings, 2 replies; 33+ messages in thread From: Shawn O. Pearce @ 2009-02-10 16:56 UTC (permalink / raw) To: Johannes Schindelin; +Cc: Junio C Hamano, Jeff King, git Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote: > On Tue, 10 Feb 2009, Junio C Hamano wrote: > > > > I could do a revert on 'master' if it is really needed, but I found that > > the above reasoning is a bit troublesome. The thing is, if a tree to hold > > the notes would be huge to be unmanageable, then it would still be huge to > > be unmanageable if you split it into 256 pieces. > > The thing is, a tree object of 17 megabyte is unmanagably large if you > have to read it whenever you access even a single node. Having 256 trees > instead, each of which is about 68 kilobyte is much nicer. See my other email on this thread; we'd probably need to unpack all 256 subtrees *anyway* due to the distribution of SHA-1 names for commits. -- Shawn. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: RFC: Flat directory for notes, or fan-out? Both! 2009-02-10 16:56 ` Shawn O. Pearce @ 2009-02-10 17:31 ` Johannes Schindelin 2009-02-10 18:35 ` Junio C Hamano 1 sibling, 0 replies; 33+ messages in thread From: Johannes Schindelin @ 2009-02-10 17:31 UTC (permalink / raw) To: Shawn O. Pearce; +Cc: Junio C Hamano, Jeff King, git Hi, On Tue, 10 Feb 2009, Shawn O. Pearce wrote: > Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote: > > On Tue, 10 Feb 2009, Junio C Hamano wrote: > > > > > > I could do a revert on 'master' if it is really needed, but I found that > > > the above reasoning is a bit troublesome. The thing is, if a tree to hold > > > the notes would be huge to be unmanageable, then it would still be huge to > > > be unmanageable if you split it into 256 pieces. > > > > The thing is, a tree object of 17 megabyte is unmanagably large if you > > have to read it whenever you access even a single node. Having 256 trees > > instead, each of which is about 68 kilobyte is much nicer. > > See my other email on this thread; we'd probably need to unpack > all 256 subtrees *anyway* due to the distribution of SHA-1 names > for commits. No, that is not true. It is only true if you show more than 94180 commit, actually, as only then the probability that you hit all 256 buckets is larger than 50 percent. In general, you will look at only a few commits, though. Ciao, Dscho ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: RFC: Flat directory for notes, or fan-out? Both! 2009-02-10 16:56 ` Shawn O. Pearce 2009-02-10 17:31 ` Johannes Schindelin @ 2009-02-10 18:35 ` Junio C Hamano 2009-02-10 19:09 ` Shawn O. Pearce 2009-02-10 21:10 ` Johannes Schindelin 1 sibling, 2 replies; 33+ messages in thread From: Junio C Hamano @ 2009-02-10 18:35 UTC (permalink / raw) To: Shawn O. Pearce; +Cc: Johannes Schindelin, Jeff King, git "Shawn O. Pearce" <spearce@spearce.org> writes: > Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote: >> On Tue, 10 Feb 2009, Junio C Hamano wrote: >> > >> > I could do a revert on 'master' if it is really needed, but I found that >> > the above reasoning is a bit troublesome. The thing is, if a tree to hold >> > the notes would be huge to be unmanageable, then it would still be huge to >> > be unmanageable if you split it into 256 pieces. >> >> The thing is, a tree object of 17 megabyte is unmanagably large if you >> have to read it whenever you access even a single node. Having 256 trees >> instead, each of which is about 68 kilobyte is much nicer. > > See my other email on this thread; we'd probably need to unpack > all 256 subtrees *anyway* due to the distribution of SHA-1 names > for commits. I wonder if we can solve this by introducing a local cache that is a flat file that looks like: magic number for /usr/bin/file tree object SHA-1 the file caches Number of entries in this file 256 fan-out offsets into this file N entries of <SHA-1, SHA-1>, sorted Checksum of the file itself and use it when availble (otherwise optionally create it upon the first lookup). The file can be used by mmaping it and then doing a newton raphson or binary search similar to the way patch-ids.c does. The top-level API for such a hash-map would perhaps look like: /* * take the object name a tree object that is a hash map, * return an opaque struct. */ struct hashmap *hashmap_open(const unsigned char *); /* * find the value given the key and return 0, or return negative * if not found. */ int hashmap_lookup(struct hashmap *map, const unsigned char *key, unsigned char *val); /* discard the thing */ void hashmap_close(struct hashmap *map); We should be able to use these in "git log" and friends where Dscho added the hook in his git-notes topic. I am hoping that I could eventually rewrite rerere to use something like this, so that rerere database can be shared, just like the way notes can be shared, across repositories. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: RFC: Flat directory for notes, or fan-out? Both! 2009-02-10 18:35 ` Junio C Hamano @ 2009-02-10 19:09 ` Shawn O. Pearce 2009-02-10 21:10 ` Johannes Schindelin 1 sibling, 0 replies; 33+ messages in thread From: Shawn O. Pearce @ 2009-02-10 19:09 UTC (permalink / raw) To: Junio C Hamano; +Cc: Johannes Schindelin, Jeff King, git Junio C Hamano <gitster@pobox.com> wrote: > > I wonder if we can solve this by introducing a local cache that is a flat > file that looks like: > > magic number for /usr/bin/file Don't forget a version number. Waste 4 bytes now and its easier to change the format in the future if we need to. > tree object SHA-1 the file caches > Number of entries in this file > 256 fan-out offsets into this file > N entries of <SHA-1, SHA-1>, sorted > Checksum of the file itself > > and use it when availble (otherwise optionally create it upon the first > lookup). The file can be used by mmaping it and then doing a newton > raphson or binary search similar to the way patch-ids.c does. Yup. Sort of my thoughts when I was thinking about that external index for a "git database". I was considering a much more complex file layout though; one that would permit editing without completely recopying the file every time something changes. More or less a traditional block oriented on-disk M-tree, with copy-on-write semantics for the blocks. This would permit us to quickly append onto the end of the file with new updates, and then periodically copy and flatten out the the file as necessary to reclaim the prior dead space. E.g.: magic number version [intermediate blocks ...] [leaf blocks...] root block Writers would append modified leaf and intermediate blocks as necessary to the end of the file, then append a new root block. Readers would read the file tail and verify it is a root, then scan with a traditional M-tree search algorithm. If the root block has a "magic block header" and a strong checksum at the tail of the block, readers can concurrently read while a writer is appending. Any invalid root block just means the reader is seeing the middle of a write, or an aborted write, and should scan backwards to locate the prior valid root. If the root block also has a commit SHA-1 indicating which commit that root become valid under, a reader can decide if that root might give it answers which aren't correct for the current value of the notes history it is reading, and scan backwards for some older root block. We could accelerate that by including the file offset of the prior root block in each new root. GC compacting the file is just a matter of write-locking the file to block out a new writer, then traversing the current root and copying all blocks that are reachable. </end-hand-waving> > I am hoping that I could eventually rewrite rerere to use something like > this, so that rerere database can be shared, just like the way notes can > be shared, across repositories. Ooh, great idea. If we could toss rerere data into something that can be transported around, and efficiently accessed. I like it. -- Shawn. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: RFC: Flat directory for notes, or fan-out? Both! 2009-02-10 18:35 ` Junio C Hamano 2009-02-10 19:09 ` Shawn O. Pearce @ 2009-02-10 21:10 ` Johannes Schindelin 2009-02-10 22:16 ` Thomas Rast 2009-02-11 20:02 ` Jeff King 1 sibling, 2 replies; 33+ messages in thread From: Johannes Schindelin @ 2009-02-10 21:10 UTC (permalink / raw) To: Junio C Hamano; +Cc: Shawn O. Pearce, Jeff King, git Hi, On Tue, 10 Feb 2009, Junio C Hamano wrote: > "Shawn O. Pearce" <spearce@spearce.org> writes: > > > Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote: > >> On Tue, 10 Feb 2009, Junio C Hamano wrote: > >> > > >> > I could do a revert on 'master' if it is really needed, but I found that > >> > the above reasoning is a bit troublesome. The thing is, if a tree to hold > >> > the notes would be huge to be unmanageable, then it would still be huge to > >> > be unmanageable if you split it into 256 pieces. > >> > >> The thing is, a tree object of 17 megabyte is unmanagably large if you > >> have to read it whenever you access even a single node. Having 256 trees > >> instead, each of which is about 68 kilobyte is much nicer. > > > > See my other email on this thread; we'd probably need to unpack > > all 256 subtrees *anyway* due to the distribution of SHA-1 names > > for commits. > > I wonder if we can solve this by introducing a local cache that is a flat > file that looks like: > > magic number for /usr/bin/file > tree object SHA-1 the file caches > Number of entries in this file > 256 fan-out offsets into this file > N entries of <SHA-1, SHA-1>, sorted > Checksum of the file itself > > and use it when availble (otherwise optionally create it upon the first > lookup). The file can be used by mmaping it and then doing a newton > raphson or binary search similar to the way patch-ids.c does. Or we could use an on-disk hashmap. Oh, wait... Ciao, Dscho ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: RFC: Flat directory for notes, or fan-out? Both! 2009-02-10 21:10 ` Johannes Schindelin @ 2009-02-10 22:16 ` Thomas Rast 2009-02-10 22:26 ` Thomas Rast 2009-02-10 22:32 ` Junio C Hamano 2009-02-11 20:02 ` Jeff King 1 sibling, 2 replies; 33+ messages in thread From: Thomas Rast @ 2009-02-10 22:16 UTC (permalink / raw) To: Johannes Schindelin; +Cc: Junio C Hamano, Shawn O. Pearce, Jeff King, git [-- Attachment #1: Type: text/plain, Size: 1595 bytes --] Johannes Schindelin wrote: > Or we could use an on-disk hashmap. Oh, wait... While reading this thread, I sure wondered ... why don't we use the one on-disk fast access structure we already have: the index? Sure, one problem is that the index reading code is inherently written for a single index state. However, all notes consumers I can currently think of (show, log, anything that displays commit messages) do not have to access the "real" index. We'd immediately get lots of tool support for free. Presumably the real index code has been optimized very well, so it should perform well. Perhaps there could even be some definition of a NOTES_HEAD that tracks the current (albeit not checked out, that would be insane) state. On a tangent, I'd really like to see a feature that lets us have several sets of notes (by whatever mechanism). Displaying them as "Notes from remotes/trast/mailnotes" or similar should be ok. Given that even before notes are in any release we already have at least two projects working with mass annotations, it doesn't take much of a crystal ball to see that the current one-note restriction will be a limitation. At a (*very*) cursory glance at read-cache.c, it seems that there is even support for having several index structures in memory at once, making this easy. And it looks like reading the cache is more or less memcpy() if xmmap() is fast (Windows would suffer once again). Then again I joined this discussion very late so feel free to ignore my ramblings. -- Thomas Rast trast@{inf,student}.ethz.ch [-- Attachment #2: This is a digitally signed message part. --] [-- Type: application/pgp-signature, Size: 197 bytes --] ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: RFC: Flat directory for notes, or fan-out? Both! 2009-02-10 22:16 ` Thomas Rast @ 2009-02-10 22:26 ` Thomas Rast 2009-02-10 22:32 ` Junio C Hamano 1 sibling, 0 replies; 33+ messages in thread From: Thomas Rast @ 2009-02-10 22:26 UTC (permalink / raw) To: Johannes Schindelin; +Cc: Junio C Hamano, Shawn O. Pearce, Jeff King, git [-- Attachment #1: Type: text/plain, Size: 686 bytes --] Thomas Rast wrote: > Sure, one problem is that the index reading code is inherently written > for a single index state. However, all notes consumers I can > currently think of (show, log, anything that displays commit messages) > do not have to access the "real" index. [...] > At a (*very*) cursory glance at read-cache.c, it seems that there is > even support for having several index structures in memory at once, > making this easy. And it looks like reading the cache is more or less > memcpy() if xmmap() is fast (Windows would suffer once again). Note to self: do not write mail on bus, then pick up later at home. -- Thomas Rast trast@{inf,student}.ethz.ch [-- Attachment #2: This is a digitally signed message part. --] [-- Type: application/pgp-signature, Size: 197 bytes --] ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: RFC: Flat directory for notes, or fan-out? Both! 2009-02-10 22:16 ` Thomas Rast 2009-02-10 22:26 ` Thomas Rast @ 2009-02-10 22:32 ` Junio C Hamano 1 sibling, 0 replies; 33+ messages in thread From: Junio C Hamano @ 2009-02-10 22:32 UTC (permalink / raw) To: Thomas Rast; +Cc: Johannes Schindelin, Shawn O. Pearce, Jeff King, git Thomas Rast <trast@student.ethz.ch> writes: > Johannes Schindelin wrote: >> Or we could use an on-disk hashmap. Oh, wait... > > While reading this thread, I sure wondered ... why don't we use the > one on-disk fast access structure we already have: the index? Since when the index has become a on-disk fast access structure? > Sure, one problem is that the index reading code is inherently written > for a single index state. That's wrong, but because the index is not a on-disk fast access structure to begin with, the incorrect statement about it is excused ;-) ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: RFC: Flat directory for notes, or fan-out? Both! 2009-02-10 21:10 ` Johannes Schindelin 2009-02-10 22:16 ` Thomas Rast @ 2009-02-11 20:02 ` Jeff King 2009-02-11 20:57 ` Johannes Schindelin 1 sibling, 1 reply; 33+ messages in thread From: Jeff King @ 2009-02-11 20:02 UTC (permalink / raw) To: Johannes Schindelin; +Cc: Junio C Hamano, Shawn O. Pearce, git On Tue, Feb 10, 2009 at 10:10:43PM +0100, Johannes Schindelin wrote: > > I wonder if we can solve this by introducing a local cache that is a flat > > file that looks like: > [...] > Or we could use an on-disk hashmap. Oh, wait... That was my first thought, as well. Maybe your original implementation wasn't so bad, after all. :) I searched through the archive to find a list of criticisms, but I didn't see any. So I guess the problem was just a concern that it might end up too complex. -Peff ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: RFC: Flat directory for notes, or fan-out? Both! 2009-02-11 20:02 ` Jeff King @ 2009-02-11 20:57 ` Johannes Schindelin 2009-02-11 21:16 ` Junio C Hamano 0 siblings, 1 reply; 33+ messages in thread From: Johannes Schindelin @ 2009-02-11 20:57 UTC (permalink / raw) To: Jeff King; +Cc: Junio C Hamano, Shawn O. Pearce, git Hi, On Wed, 11 Feb 2009, Jeff King wrote: > On Tue, Feb 10, 2009 at 10:10:43PM +0100, Johannes Schindelin wrote: > > > > I wonder if we can solve this by introducing a local cache that is a flat > > > file that looks like: > > [...] > > Or we could use an on-disk hashmap. Oh, wait... > > That was my first thought, as well. Maybe your original implementation > wasn't so bad, after all. :) > > I searched through the archive to find a list of criticisms, but I > didn't see any. So I guess the problem was just a concern that it might > end up too complex. Nope, the issue was that it would take too long to recreate IIRC. BTW I am no longer a fan of the on-disk cache; I think it is an ugly solution to a problem that should be solved without ugliness using a flexible directory layout in the note ref' tree. I mean, we really can allow different directory layouts as Sam described, with a few benefits, and only slight downsides. If we support multiple levels anyway, the code to allow arbitrary splits is not complicated (see Sam's patch). Even the merging should not pose any problem at all; we need a custom merge driver anyway, and there is no reason whatsoever why we should not just teach the merge driver to remove the slashes before comparing the filie names. At edit time, we can afford to perform a check that is a little more expensive than it would have been otherwise. Ciao, Dscho ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: RFC: Flat directory for notes, or fan-out? Both! 2009-02-11 20:57 ` Johannes Schindelin @ 2009-02-11 21:16 ` Junio C Hamano 2009-02-11 23:05 ` Johannes Schindelin 0 siblings, 1 reply; 33+ messages in thread From: Junio C Hamano @ 2009-02-11 21:16 UTC (permalink / raw) To: Johannes Schindelin; +Cc: Jeff King, Shawn O. Pearce, git Johannes Schindelin <Johannes.Schindelin@gmx.de> writes: > Even the merging should not pose any problem at all; we need a custom > merge driver anyway, and there is no reason whatsoever why we should not > just teach the merge driver to remove the slashes before comparing the > filie names. Once you start talking about "remove the slashes", you are assuming that the custom merge algorithm must look at *all the paths* in the two trees being merged, and it is a sign that your thinking is so trapped in the inefficient way the current merge-recursive and unpack-trees based merge works, and cannot think about the possibility that there could be more efficient way to do merges. Not very good. If you have a fixed boundary and if most subtrees are the same between two notes during a merge, we can do the same optimization as we do for two input "diff-tree" codepath. If the top of a subtree matches, we do not even have to look at their subtree. But that is true only if you do not remove the slashes and allow a random hierarchy. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: RFC: Flat directory for notes, or fan-out? Both! 2009-02-11 21:16 ` Junio C Hamano @ 2009-02-11 23:05 ` Johannes Schindelin 0 siblings, 0 replies; 33+ messages in thread From: Johannes Schindelin @ 2009-02-11 23:05 UTC (permalink / raw) To: Junio C Hamano; +Cc: Jeff King, Shawn O. Pearce, git Hi, On Wed, 11 Feb 2009, Junio C Hamano wrote: > Johannes Schindelin <Johannes.Schindelin@gmx.de> writes: > > > Even the merging should not pose any problem at all; we need a custom > > merge driver anyway, and there is no reason whatsoever why we should > > not just teach the merge driver to remove the slashes before comparing > > the filie names. > > Once you start talking about "remove the slashes", you are assuming that > the custom merge algorithm must look at *all the paths* in the two trees > being merged, and it is a sign that your thinking is so trapped in the > inefficient way the current merge-recursive and unpack-trees based merge > works, and cannot think about the possibility that there could be more > efficient way to do merges. Not very good. > > If you have a fixed boundary and if most subtrees are the same between > two notes during a merge, we can do the same optimization as we do for > two input "diff-tree" codepath. If the top of a subtree matches, we do > not even have to look at their subtree. But that is true only if you do > not remove the slashes and allow a random hierarchy. Well, I think in the case of notes, we have to optimize for random access of dozens of blobs rather than for merging. I was well aware that the merging is more expensive when the hierarchy is not defined a priori. Ciao, Dscho ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: RFC: Flat directory for notes, or fan-out? Both! 2009-02-10 13:32 ` Johannes Schindelin 2009-02-10 15:58 ` Junio C Hamano @ 2009-02-10 16:44 ` Shawn O. Pearce 2009-02-10 17:09 ` Johannes Schindelin 2009-02-11 3:19 ` Sam Vilain 1 sibling, 2 replies; 33+ messages in thread From: Shawn O. Pearce @ 2009-02-10 16:44 UTC (permalink / raw) To: Johannes Schindelin; +Cc: Jeff King, gitster, git, Sam Vilain Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote: > [Junio: seems like both Peff and me would like to hold the notes out of > 1.6.2, would you mind?] Sorry I'm getting involved in this notes thing so late. I was way too focused on Gerrit2 and just didn't pay much attention to what was on the git ML recently. Like Dscho and Peff, I think we may want to hold notes out of 1.6.2. > On Tue, 10 Feb 2009, Jeff King wrote: > > On Tue, Feb 10, 2009 at 01:59:06PM +0100, Johannes Schindelin wrote: > > The thing is: Shawn is correct when he says that a tree object to hold the > notes of all commits (which is not an unlikely scenario if you are > thinking about corporate processes) would be huge. A notes tree entry requires 6+1+40+1+20=68 bytes per entry. If I use it for what I want in Gerrit, which is to annotate every commit, on a project like git.git with 17,491 commits we're talking about a tree that is 1.13 MB. That tree grows at a rate of 276 KB/year. I'm not sure I want to think about the cost to unpack that tree, just so I can look at "git log --since=1.week.ago". My fear here is that over time we will be spending a lot of CPU time unpacking and indexing the tree in memory, only to then pull out a handful of recent commits, and then see the pager abort and kill the revision walk. > The point you raised earlier, that there would be a lot of ambiguity if > we allow both flat and fan-out directory structures, is a valid point, > though. Yup. The flat vs. fan-out is a problem. In a slightly unrelated thread offlist I have been talking with Sam Vilain about using Git as a database backend for tuple storage. There is a related issue there about making the tree structure consistent, but never stored in a way that we wind up with these massive multi-megabyte objects. We've only started to kick it around, but I think we are both in agreement that a "database tree" is owned by the database code and must not be twiddled manually. Not unless you can honor the formatting rules. Just like you shouldn't use "git hash-object" to create a tree, unless you can honor the basic formatting rules for trees. This also means that the "database trees" probably are not going to be mergeable with a basic merge-recursive sort of algorithm, but instead need specialized handling to perform the combination. I think we're leaning in a direction of something more like this for trees: - Tuples are stored under a path constructed from their primary key. The analog here is, the commit SHA-1 the note is annotating. - Trees are capped at some reasonable size limit. For sake of argument lets call that MAX_TREE. My feeling is this would be closer to the 16 KB side of the spectrum then to the 1 MB side. - Initially the database tree starts out as a single root tree that is empty. - Records are inserted, creating new tree entries, until MAX_TREE is reached for the root level tree. Up until this point it is a flat tree structure, like the current notes design. - Once MAX_TREE is reached the root is split, and ranges are used to point to the subtrees, which are now flat, and approximately are MAX_TREE/2 in size. Etc. This would make the git-notes.sh code a *lot* more complex, as you can't just toss everything into an index file and then update it with a single update-index call. Doing a tree split is much more work and requires removing and adding back all of the affected path names. (Its also perhaps unreasonable anyway to load 17,491 paths into a temporary index just to twiddle a note for the latest commit.) Notes on commits though are a hell of a problem. SHA-1 is just so uniform at distributing the commits around the namespace that even with just the 200 most recent commits we wind up with a commit in almost every "bucket", assuming a two hex digit fan-out bucket like the loose object directory. For the "git database" thing above, I've been contemplating the idea of an index stored external from the Git object database. Sam thinks indexes should be in the object database tree, but I'm considering storing them outside entirely because we can make the indexes more easily searched by a hash or binary search, like pack-*.idx. Whenever the "database ref" gets moved we'd need to run a "sync" utility to bring these external indexes current. But they could also be more efficiently scanned. E.g. in the case of commit notes, we could just mmap() the index into memory and perform our lookups through the mmap. Thus we wouldn't pay massive penalities to index all 17,491 names just to access 200. Though we may wind up paging in a good part of the index due to the random access nature, but we can't really do anything about that. Keeping the indexes current would perhaps mean teaching "git fetch" to run something after the fetch is complete. Rather trivial in the grand scheme of things. I also liken the external index to the pack-*.idx, in that its derived from the real sources in the object database, and can always be generated client side. So making fetch do it is really no different then making fetch run index-pack. Eh. That wound up being a lot longer than I wanted it to be. Sam and I may be putting some effort into this "git as a database" thing, and it could be used as an efficient notes store. Its just a very complex notes store. Much more complex to implement than the simple notes currently slated for 1.6.2. -- Shawn. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: RFC: Flat directory for notes, or fan-out? Both! 2009-02-10 16:44 ` Shawn O. Pearce @ 2009-02-10 17:09 ` Johannes Schindelin 2009-02-10 17:17 ` Shawn O. Pearce 2009-02-11 3:19 ` Sam Vilain 1 sibling, 1 reply; 33+ messages in thread From: Johannes Schindelin @ 2009-02-10 17:09 UTC (permalink / raw) To: Shawn O. Pearce; +Cc: Jeff King, gitster, git, Sam Vilain Hi, On Tue, 10 Feb 2009, Shawn O. Pearce wrote: > For the "git database" thing above, I've been contemplating the > idea of an index stored external from the Git object database. The whole point of my exercise was to reuse as much as possible of Git's framework. After all, if you store an index external from Git's object database, you go back to reimplementing the whole infrastructure for fetching/merging just for that index. Ciao, Dscho ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: RFC: Flat directory for notes, or fan-out? Both! 2009-02-10 17:09 ` Johannes Schindelin @ 2009-02-10 17:17 ` Shawn O. Pearce 0 siblings, 0 replies; 33+ messages in thread From: Shawn O. Pearce @ 2009-02-10 17:17 UTC (permalink / raw) To: Johannes Schindelin; +Cc: Jeff King, gitster, git, Sam Vilain Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote: > On Tue, 10 Feb 2009, Shawn O. Pearce wrote: > > > For the "git database" thing above, I've been contemplating the > > idea of an index stored external from the Git object database. > > The whole point of my exercise was to reuse as much as possible of Git's > framework. After all, if you store an index external from Git's object > database, you go back to reimplementing the whole infrastructure for > fetching/merging just for that index. Yea, I know. It might just be easier to abandon everything in Git and start from scratch for the "git database" thing. But we'd lose the ability to at least piggyback onto the existing Git transport. And it doesn't help the "git notes" feature we're talking about. Maybe I was viewing the external index as like the working tree, where you can't really access the data until the external indexes are current, just like you can't really (easily anyway) access the working tree files until you bring the working tree current. But yea, it doesn't really use any of the existing machinary. -- Shawn. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: RFC: Flat directory for notes, or fan-out? Both! 2009-02-10 16:44 ` Shawn O. Pearce 2009-02-10 17:09 ` Johannes Schindelin @ 2009-02-11 3:19 ` Sam Vilain 1 sibling, 0 replies; 33+ messages in thread From: Sam Vilain @ 2009-02-11 3:19 UTC (permalink / raw) To: Shawn O. Pearce; +Cc: Johannes Schindelin, Jeff King, gitster, git Shawn O. Pearce wrote: >> The point you raised earlier, that there would be a lot of ambiguity if >> we allow both flat and fan-out directory structures, is a valid point, >> though. > > Yup. The flat vs. fan-out is a problem. [...] > Notes on commits though are a hell of a problem. SHA-1 is just so > uniform at distributing the commits around the namespace that even > with just the 200 most recent commits we wind up with a commit in > almost every "bucket", assuming a two hex digit fan-out bucket like > the loose object directory. I think my patch from 1 Feb addressed this, at least for the operations it implemented. I just don't see why you need to decide up front what the split is going to be. Just read the next tree, descend into the closest matching tree until you find the record you are looking for and that's it. Sure, my patch just loads it all and throws it into a hash - this should still be efficient for short log operations even if the hash table ends up 1MB. But why take my guess. Let's stress test it. 'lorem' is the binary in the Text::Lorem Perl module. It generates a paragraph of random Latin text. wilber:~/src/git$ time git-log | wc -l 256072 real 0m0.709s user 0m0.608s sys 0m0.116s wilber:~/src/git$ git rev-list HEAD | wc -l 17678 wilber:~/src/git$ cat > my-editor #!/bin/sh ( lorem; echo ) > $1 wilber:~/src/git$ chmod +x my-editor wilber:~/src/git$ export EDITOR=`pwd`/my-editor wilber:~/src/git$ export GIT_NOTES_SPLIT=2 wilber:~/src/git$ time git-rev-list HEAD | while read rev > do ./git-notes.sh edit $rev; done fatal: unable to create '.git/refs/notes/commits.lock': File exists error: Ref refs/notes/commits is at 5f0732975b4acf237912a31e7ce14aa86d2e8179 but expected 725a2d119d2725e7d821906ad085bfbadbf43c8e fatal: Cannot lock the ref 'refs/notes/commits'. [...] fatal: unable to write new index file Could not read index fatal: unable to write new index file Could not read index fatal: unable to write new index file Could not read index fatal: unable to write new index file Could not read index fatal: unable to write new index file Could not read index real 76m16.927s user 43m55.909s sys 19m33.005s Oo. Nasty errors there but never mind that for now. Obviously some remaining issues in the shell script. What did I get out of that? wilber:~/src/git$ git-ls-tree -r refs/notes/commits | wc 12043 48172 1144085 wilber:~/src/git$ Hey well that's not too bad. Enough to be a good test. How long does "git-log" take now? wilber:~/src/git$ time ./git-log | wc -l 292201 real 0m13.740s user 0m0.852s sys 0m0.716s wilber:~/src/git$ time ./git-log | wc -l 292201 real 0m1.335s user 0m0.856s sys 0m0.512s Not bad! Cool cache performance sucked there but only a 50% slowdown for reading almost twice the number of objects. Let's try 200 commits: wilber:~/src/git$ time git-log -200 | wc -l 2877 real 0m0.027s user 0m0.008s sys 0m0.020s wilber:~/src/git$ time ./git-log -200 | wc -l 3477 real 0m0.081s user 0m0.056s sys 0m0.020s Quite a big slowdown proportionally, but not a huge amount in absolute terms. And we didn't even make the builtin-log machinery smart enough to skip unneeded trees! > In a slightly unrelated > thread offlist I have been talking with Sam Vilain about using Git > as a database backend for tuple storage. [...] > This would make the git-notes.sh code a *lot* more complex, as you > can't just toss everything into an index file and then update it with > a single update-index call. Doing a tree split is much more work and > requires removing and adding back all of the affected path names. > (Its also perhaps unreasonable anyway to load 17,491 paths into a > temporary index just to twiddle a note for the latest commit.) Hehe, horribly overcomplicated for this use case... many applicable ideas though. > For the "git database" thing above, I've been contemplating the > idea of an index stored external from the Git object database. > Sam thinks indexes should be in the object database tree, but > I'm considering storing them outside entirely because we can > make the indexes more easily searched by a hash or binary search, > like pack-*.idx. Whenever the "database ref" gets moved we'd need > to run a "sync" utility to bring these external indexes current. > But they could also be more efficiently scanned. Well either way it's a file you've got to scan somehow ... guess it doesn't matter much whether it's in-tree or not. I was actually saying that there are some use cases where you might want to keep indexes in the history and some where you don't. Keeping them in-tree is not normalised, but there are good use cases for it - eg efficient retrieval of pre-computed aggregates that don't need to be up to the second, or for instances where you want your nodes to be able to "hit the ground running" after synchronisation without having to reindex. For the use case we originally talked about I don't think you'd want any indexes in-tree at all. But I'd like to steer this thread well away from the database stuff I'm drafting ... it's a lot more comprehensive, notes are a very simple hash relationship. -- Sam Vilain, Perl Hacker, Catalyst IT (NZ) Ltd. phone: +64 4 499 2267 PGP ID: 0x66B25843 ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: RFC: Flat directory for notes, or fan-out? Both! 2009-02-09 21:12 RFC: Flat directory for notes, or fan-out? Both! Johannes Schindelin 2009-02-10 7:58 ` Boyd Stephen Smith Jr. 2009-02-10 12:18 ` Jeff King @ 2009-02-11 1:14 ` Sam Vilain 2 siblings, 0 replies; 33+ messages in thread From: Sam Vilain @ 2009-02-11 1:14 UTC (permalink / raw) To: Johannes Schindelin; +Cc: git, peff, spearce Johannes Schindelin wrote: > Hi, > > Shawn triggered some well needed thinking on my part about the notes > implementation. At the moment, we have flat directory structure, and read > all of them in one go (when needed). > > I think we should support that, because it is relatively easy to generate > that kind of trees for small-scale applications. > > However, I think there is also a benefit to handle fan-out directory > structures, too: they scale much nicer. > > If the commit name was not found as a filename, it could be searched in > whatever subdirectory whose name is a prefix of said commit name (first > wins). > Great idea! Glad I thought of it! ;-) http://thread.gmane.org/gmane.comp.version-control.git/106715/focus=107975 I hoped my approach allowed for smarter things later, such as splitting into smaller buckets whenever a directory gets more than N entries or periodically rebalancing if required. But the initial version is at least forward thinking to support reading it. Merging them will need to be savvy of this of course. Sam. ^ permalink raw reply [flat|nested] 33+ messages in thread
end of thread, other threads:[~2009-02-11 23:06 UTC | newest] Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2009-02-09 21:12 RFC: Flat directory for notes, or fan-out? Both! Johannes Schindelin 2009-02-10 7:58 ` Boyd Stephen Smith Jr. 2009-02-10 13:16 ` Jeff King 2009-02-11 1:58 ` Boyd Stephen Smith Jr. 2009-02-11 2:35 ` Linus Torvalds 2009-02-11 3:30 ` Sam Vilain 2009-02-11 3:54 ` Linus Torvalds 2009-02-11 5:05 ` Sam Vilain 2009-02-11 12:35 ` Johannes Schindelin 2009-02-10 12:18 ` Jeff King 2009-02-10 12:59 ` Johannes Schindelin 2009-02-10 13:10 ` Jeff King 2009-02-10 13:32 ` Johannes Schindelin 2009-02-10 15:58 ` Junio C Hamano 2009-02-10 16:48 ` Shawn O. Pearce 2009-02-10 16:48 ` Johannes Schindelin 2009-02-10 16:56 ` Shawn O. Pearce 2009-02-10 17:31 ` Johannes Schindelin 2009-02-10 18:35 ` Junio C Hamano 2009-02-10 19:09 ` Shawn O. Pearce 2009-02-10 21:10 ` Johannes Schindelin 2009-02-10 22:16 ` Thomas Rast 2009-02-10 22:26 ` Thomas Rast 2009-02-10 22:32 ` Junio C Hamano 2009-02-11 20:02 ` Jeff King 2009-02-11 20:57 ` Johannes Schindelin 2009-02-11 21:16 ` Junio C Hamano 2009-02-11 23:05 ` Johannes Schindelin 2009-02-10 16:44 ` Shawn O. Pearce 2009-02-10 17:09 ` Johannes Schindelin 2009-02-10 17:17 ` Shawn O. Pearce 2009-02-11 3:19 ` Sam Vilain 2009-02-11 1:14 ` Sam Vilain
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.