* Re: [RFC] Add --create-cache to repack
2011-01-29 1:32 ` Shawn Pearce
@ 2011-01-29 2:34 ` Shawn Pearce
2011-01-30 8:05 ` Junio C Hamano
2011-01-29 4:08 ` Nicolas Pitre
` (3 subsequent siblings)
4 siblings, 1 reply; 26+ messages in thread
From: Shawn Pearce @ 2011-01-29 2:34 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Johannes Sixt, git, Junio C Hamano, John Hawley
On Fri, Jan 28, 2011 at 17:32, Shawn Pearce <spearce@spearce.org> wrote:
>
> Well, scratch the idea in this thread. I think.
>
> I retested JGit vs. CGit on an identical linux-2.6 repository. The
> repository was fully packed, but had two pack files. 362M and 57M,
> and was created by packing a 1 month old master, marking it .keep, and
> then repacking -a -d to get most recent last month into another pack.
> This results in some files that should be delta compressed together
> being stored whole in the two packs (obviously).
>
> The two implementations take the same amount of time to generate the
> clone. 3m28s / 3m22s for JGit, 3m23s for C Git. The JGit created
> pack is actually smaller 376.30 MiB vs. C Git's 380.59 MiB.
I just tried caching only the object list of what is reachable from a
particular commit. The file starts with a small 20-byte header:
4 byte magic
4 byte version
4 byte number of commits (C)
4 byte number of trees (T)
4 byte number of blobs (B)
Then C commit SHA-1s, followed by T tree SHA-1 + 4 byte path_hash,
followed by B blob SHA-1 + 4 byte path_hash. For any project the size
is basically on par with the .idx file for the pack v1 format, so ~41
MB for linux-2.6. The file is stored as
$GIT_OBJECT_DIRECTORY/cache/$COMMIT_SHA1.list, and is completely
pack-independent.
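The layout described above can be sketched as a small reader/writer. The magic value, byte order, and error handling here are assumptions for illustration; the message does not specify them:

```python
import struct

MAGIC = b"OLST"    # hypothetical magic; the message does not name one
VERSION = 1

def write_cache(path, commits, trees, blobs):
    """commits: 20-byte SHA-1s; trees/blobs: (20-byte SHA-1, 32-bit path_hash)."""
    with open(path, "wb") as f:
        # 20-byte header: 4-byte magic, version, then the C, T, B counts
        f.write(MAGIC)
        f.write(struct.pack(">IIII", VERSION, len(commits), len(trees), len(blobs)))
        for sha in commits:                   # C commit SHA-1s
            f.write(sha)
        for sha, path_hash in trees + blobs:  # T then B entries: SHA-1 + path_hash
            f.write(sha + struct.pack(">I", path_hash))

def read_cache(path):
    with open(path, "rb") as f:
        if f.read(4) != MAGIC:
            raise ValueError("not an object-list cache")
        version, nc, nt, nb = struct.unpack(">IIII", f.read(16))
        if version != VERSION:
            raise ValueError("unsupported version")  # "desupport" old files
        commits = [f.read(20) for _ in range(nc)]
        entry = lambda: (f.read(20), struct.unpack(">I", f.read(4))[0])
        return commits, [entry() for _ in range(nt)], [entry() for _ in range(nb)]
```

At 24 bytes per tree/blob entry, on the order of 1.7M objects gives roughly the ~41 MB figure quoted for linux-2.6.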
Using this for object enumeration shaves almost 1 minute off server
packing time; the clone dropped from 3m28s to 2m29s. That is close to
what I was getting with the cached pack idea, but the network transfer
stayed the small 376 MiB. I think this supports your pack v4 work...
if we can speed up object enumeration to be this simple (scan down a
list of objects with their types declared inline, or implied by
location), we can cut a full minute of CPU time off the server side.
--
Shawn.
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC] Add --create-cache to repack
2011-01-29 2:34 ` Shawn Pearce
@ 2011-01-30 8:05 ` Junio C Hamano
2011-01-30 19:43 ` Shawn Pearce
0 siblings, 1 reply; 26+ messages in thread
From: Junio C Hamano @ 2011-01-30 8:05 UTC (permalink / raw)
To: Shawn Pearce; +Cc: Nicolas Pitre, Johannes Sixt, git, John Hawley
Shawn Pearce <spearce@spearce.org> writes:
> Using this for object enumeration shaves almost 1 minute off server
> packing time; the clone dropped from 3m28s to 2m29s. That is close to
> what I was getting with the cached pack idea, but the network transfer
> stayed the small 376 MiB.
I like this result.
The amount of transfer being that small was something I didn't quite
expect, though. Doesn't it indicate that our pathname based object
clustering heuristics is not as effective as we hoped?
* Re: [RFC] Add --create-cache to repack
2011-01-30 8:05 ` Junio C Hamano
@ 2011-01-30 19:43 ` Shawn Pearce
2011-01-30 20:02 ` Junio C Hamano
2011-01-30 22:26 ` Nicolas Pitre
0 siblings, 2 replies; 26+ messages in thread
From: Shawn Pearce @ 2011-01-30 19:43 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Nicolas Pitre, Johannes Sixt, git, John Hawley
On Sun, Jan 30, 2011 at 00:05, Junio C Hamano <gitster@pobox.com> wrote:
> Shawn Pearce <spearce@spearce.org> writes:
>
>> Using this for object enumeration shaves almost 1 minute off server
>> packing time; the clone dropped from 3m28s to 2m29s. That is close to
>> what I was getting with the cached pack idea, but the network transfer
>> stayed the small 376 MiB.
>
> I like this result.
I'm really leaning towards putting this cached object list into JGit.
I need to shave that 1 minute off server CPU time. I can afford the 41
MiB disk (and kernel buffer cache), but I cannot really continue to
pay the 1 minute of CPU on each clone request for large repositories.
The object list of what is reachable from commit X isn't ever going to
change, and the path hash function is reasonably stable. With a
version code in the file we can desupport old files if the path hash
function changes. 10% more disk/kernel memory, plus some explicit cache
management by the server administrator to construct the file, is cheap
for some of my servers compared to 1 minute of CPU.
> The amount of transfer being that small was something I didn't quite
> expect, though. Doesn't it indicate that our pathname based object
> clustering heuristics is not as effective as we hoped?
I'm not sure I follow your question.
I think the problem here is old side branches that got recently
merged. Their _best_ delta base was some old revision, possibly close
to where they branched off from. Using a newer version of the file
for the delta base created a much larger delta. E.g. consider a file
where in more recent revisions a function was completely rewritten.
If you have to delta compress against that new version, but you use
the older definition of the function, you need to use insert
instructions
for the entire content of that old function. But if you can delta
compress against the version you branched from (or one much closer to
it in time), your delta would be very small as that function is
handled by the smaller copy instruction.
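The effect of base choice can be shown with a toy cost model: count the bytes a delta must carry as literal inserts when the base lacks them. This uses Python's difflib at line granularity, not git's actual byte-level delta format, and the file contents are invented for the example:

```python
import difflib

def inserted_bytes(base, target):
    """Bytes a delta must carry as literal "insert" instructions; lines that
    already exist in the base can be emitted as cheap "copy" instructions.
    (Line-granularity toy model, not git's real delta encoding.)"""
    a = base.splitlines(keepends=True)
    b = target.splitlines(keepends=True)
    sm = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return sum(len(line)
               for op, _i1, _i2, j1, j2 in sm.get_opcodes()
               if op in ("replace", "insert")
               for line in b[j1:j2])

common = 'static const char *name = "demo";\n' * 4
old_fn = "int f(int x) { return x + 1; }\n"          # definition the branch kept
new_fn = "long f(long x) { return x * 2 + 1; }\n"    # rewritten on master
branch = common + old_fn + "/* backported fix */\n"  # old side branch content

# Against the revision the branch forked from, only the fix is inserted;
# against the rewritten tip, the whole old function must be inserted too.
assert inserted_bytes(common + old_fn, branch) < inserted_bytes(common + new_fn, branch)
```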
Our clustering heuristics work fine.
Our thin-pack selection of potential delta base candidates does not.
We are not very aggressive in loading the delta base window with
potential candidates, which means we miss some really good compression
opportunities.
Ooooh.
I think my test was flawed. I injected the cached pack's tip as the
edge for the new stuff to delta compress against. I should have
injected all of the merge bases between the cached pack's tip and the
new stuff. Although the cached pack tip is one of the merge bases,
it's not all of them. If we inject all of the merge bases, we can find
the revision that this old side branch is based on, and possibly get a
better delta candidate for it.
IIRC, upload-pack would have walked backwards further and found the
merge base for that side branch, and it would have been part of the
delta base candidates. I think I need to re-do my cached pack test.
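"All of the merge bases" here are the maximal common ancestors (what `git merge-base --all` reports). A naive toy version over a commit graph, purely to make the idea concrete (git's real machinery is far more efficient):

```python
def ancestors(graph, start):
    """Every commit reachable from start, inclusive; graph maps commit -> parents."""
    seen, stack = set(), [start]
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(graph[c])
    return seen

def merge_bases(graph, a, b):
    """All merge bases of a and b: common ancestors that are not themselves
    an ancestor of another common ancestor (naive O(n^2) toy version)."""
    common = ancestors(graph, a) & ancestors(graph, b)
    return {c for c in common
            if not any(c != d and c in ancestors(graph, d) for d in common)}
```

A criss-cross merge, for instance, has two merge bases, so injecting only one tip misses a candidate.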
Good thing I have history of my source code saved in this fancy
revision control thingy called "git". :-)
--
Shawn.
* Re: [RFC] Add --create-cache to repack
2011-01-30 19:43 ` Shawn Pearce
@ 2011-01-30 20:02 ` Junio C Hamano
2011-01-30 20:20 ` Shawn Pearce
2011-01-30 22:26 ` Nicolas Pitre
1 sibling, 1 reply; 26+ messages in thread
From: Junio C Hamano @ 2011-01-30 20:02 UTC (permalink / raw)
To: Shawn Pearce; +Cc: Nicolas Pitre, Johannes Sixt, git, John Hawley
Shawn Pearce <spearce@spearce.org> writes:
>> The amount of transfer being that small was something I didn't quite
>> expect, though. Doesn't it indicate that our pathname based object
>> clustering heuristics is not as effective as we hoped?
>
> I'm not sure I follow your question.
I didn't see path information in your cachefile that contains C commits, T
trees, etc. that sped up the object enumeration, but you didn't observe
much transfer inflation over the stock git.
> Ooooh.
>
> I think my test was flawed. I injected the cached pack's tip as the
> edge for the new stuff to delta compress against.
That is one of the things I was wondering. I manually created a thin pack
with only the 1-month-old tip as boundary, and another with all the
boundaries that can be found by rev-list. I didn't find much difference
in the result, though, as "rev-list --boundary --all --not $onemontholdtip"
had only a few boundary entries in my test.
* Re: [RFC] Add --create-cache to repack
2011-01-30 20:02 ` Junio C Hamano
@ 2011-01-30 20:20 ` Shawn Pearce
0 siblings, 0 replies; 26+ messages in thread
From: Shawn Pearce @ 2011-01-30 20:20 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Nicolas Pitre, Johannes Sixt, git, John Hawley
On Sun, Jan 30, 2011 at 12:02, Junio C Hamano <gitster@pobox.com> wrote:
> Shawn Pearce <spearce@spearce.org> writes:
>
>>> The amount of transfer being that small was something I didn't quite
>>> expect, though. Doesn't it indicate that our pathname based object
>>> clustering heuristics is not as effective as we hoped?
>>
>> I'm not sure I follow your question.
>
> I didn't see path information in your cachefile that contains C commits, T
> trees, etc. that sped up the object enumeration, but you didn't observe
> much transfer inflation over the stock git.
I didn't store the path itself, I stored the path hash as a 4 byte
int. It's smaller, but it still helps to schedule the object into the
right position in the delta search.
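As a rough illustration, here is a Python port of a name hash in the style of C git's pack-objects.c, where trailing characters of the path dominate the value; whether JGit's function matched this exactly is an assumption:

```python
def path_hash(path):
    """32-bit name hash in the style of C git's pack-objects.c: whitespace is
    skipped and trailing characters dominate, so objects with the same
    basename land near each other when sorted for the delta search."""
    h = 0
    for ch in path:
        if ch.isspace():
            continue
        h = ((h >> 2) + (ord(ch) << 24)) & 0xFFFFFFFF
    return h
```

Two paths sharing a long suffix (say, two Makefiles in different directories) hash to nearby values, which is exactly what the delta-candidate sort wants.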
--
Shawn.
* Re: [RFC] Add --create-cache to repack
2011-01-30 19:43 ` Shawn Pearce
2011-01-30 20:02 ` Junio C Hamano
@ 2011-01-30 22:26 ` Nicolas Pitre
1 sibling, 0 replies; 26+ messages in thread
From: Nicolas Pitre @ 2011-01-30 22:26 UTC (permalink / raw)
To: Shawn Pearce; +Cc: Junio C Hamano, Johannes Sixt, git, John Hawley
On Sun, 30 Jan 2011, Shawn Pearce wrote:
> On Sun, Jan 30, 2011 at 00:05, Junio C Hamano <gitster@pobox.com> wrote:
> > Shawn Pearce <spearce@spearce.org> writes:
> >
> >> Using this for object enumeration shaves almost 1 minute off server
> >> packing time; the clone dropped from 3m28s to 2m29s. That is close to
> >> what I was getting with the cached pack idea, but the network transfer
> >> stayed the small 376 MiB.
> >
> > I like this result.
>
> I'm really leaning towards putting this cached object list into JGit.
>
> I need to shave that 1 minute off server CPU time. I can afford the 41
> MiB disk (and kernel buffer cache), but I cannot really continue to
> pay the 1 minute of CPU on each clone request for large repositories.
> The object list of what is reachable from commit X isn't ever going to
> change, and the path hash function is reasonably stable. With a
> version code in the file we can desupport old files if the path hash
> function changes. 10% more disk/kernel memory is cheap for some of my
> servers compared to 1 minute of CPU, and some explicit cache
> management by the server administrator to construct the file.
Yep, I think this is probably the best short-term solution. Just walk
the commit graph as usual and, whenever the commit tip from the cache
is matched, shove the entire cache content into the object list.
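That walk could look roughly like this; the data model (commit -> parents, commit -> referenced objects) is a schematic stand-in for the real object store:

```python
def enumerate_objects(parents, commit_objects, heads, cache_tip, cached_list):
    """Schematic object enumeration: walk commits from the heads, but when
    the cached tip is reached, splice in the precomputed reachable-object
    list instead of walking the history below it."""
    emitted, out = set(), []

    def emit(objs):
        for o in objs:
            if o not in emitted:
                emitted.add(o)
                out.append(o)

    done, stack = set(), list(heads)
    while stack:
        c = stack.pop()
        if c in done:
            continue
        done.add(c)
        if c == cache_tip:
            emit(cached_list)        # cache hit: shove the whole list in
            continue                 # and do not walk parents below the tip
        emit([c])                    # the commit object itself
        emit(commit_objects[c])      # trees/blobs it references
        stack.extend(parents[c])
    return out
```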
And let's hope that eventually some future developments will make this
cache redundant and obsolete.
Nicolas
* Re: [RFC] Add --create-cache to repack
2011-01-29 1:32 ` Shawn Pearce
2011-01-29 2:34 ` Shawn Pearce
@ 2011-01-29 4:08 ` Nicolas Pitre
2011-01-29 4:35 ` Shawn Pearce
2011-01-30 6:51 ` Junio C Hamano
` (2 subsequent siblings)
4 siblings, 1 reply; 26+ messages in thread
From: Nicolas Pitre @ 2011-01-29 4:08 UTC (permalink / raw)
To: Shawn Pearce; +Cc: Johannes Sixt, git, Junio C Hamano, John Hawley
On Fri, 28 Jan 2011, Shawn Pearce wrote:
> On Fri, Jan 28, 2011 at 13:09, Nicolas Pitre <nico@fluxnic.net> wrote:
> > On Fri, 28 Jan 2011, Shawn Pearce wrote:
> >
> >> On Fri, Jan 28, 2011 at 10:46, Nicolas Pitre <nico@fluxnic.net> wrote:
> >> > On Fri, 28 Jan 2011, Shawn Pearce wrote:
> >> >
> >> >> This started because I was looking for a way to speed up clones coming
> >> >> from a JGit server. Cloning the linux-2.6 repository is painful,
>
> Well, scratch the idea in this thread. I think.
>
> I retested JGit vs. CGit on an identical linux-2.6 repository. The
> repository was fully packed, but had two pack files. 362M and 57M,
> and was created by packing a 1 month old master, marking it .keep, and
> then repacking -a -d to get most recent last month into another pack.
> This results in some files that should be delta compressed together
> being stored whole in the two packs (obviously).
>
> The two implementations take the same amount of time to generate the
> clone. 3m28s / 3m22s for JGit, 3m23s for C Git. The JGit created
> pack is actually smaller 376.30 MiB vs. C Git's 380.59 MiB. I point
> out this data because improvements made to JGit may show similar
> improvements to CGit given how close they are in running time.
What are those improvements?
Now, the fact that JGit is so close to CGit must be because the actual
cost is outside of them such as within zlib, otherwise the C code should
normally always be faster, right?
Looking at the profile for "git rev-list --objects --all > /dev/null"
for the object enumeration phase, we have:
# Samples: 1814637
#
# Overhead  Command  Shared Object          Symbol
# ........  .......  .....................  ......
#
   28.81%   git      /home/nico/bin/git     [.] lookup_object
   12.21%   git      /lib64/libz.so.1.2.3   [.] inflate
   10.49%   git      /lib64/libz.so.1.2.3   [.] inflate_fast
    7.47%   git      /lib64/libz.so.1.2.3   [.] inflate_table
    6.66%   git      /lib64/libc-2.11.2.so  [.] __GI_memcpy
    5.66%   git      /home/nico/bin/git     [.] find_pack_entry_one
    2.98%   git      /home/nico/bin/git     [.] decode_tree_entry
    2.73%   git      /lib64/libc-2.11.2.so  [.] _int_malloc
    2.71%   git      /lib64/libz.so.1.2.3   [.] adler32
    2.63%   git      /home/nico/bin/git     [.] process_tree
    1.58%   git      [kernel]               [k] 0xffffffff8112fc0c
    1.44%   git      /lib64/libc-2.11.2.so  [.] __strlen_sse2
    1.31%   git      /home/nico/bin/git     [.] tree_entry
    1.10%   git      /lib64/libc-2.11.2.so  [.] _int_free
    0.96%   git      /home/nico/bin/git     [.] patch_delta
    0.92%   git      /lib64/libc-2.11.2.so  [.] malloc_consolidate
    0.86%   git      /lib64/libc-2.11.2.so  [.] __GI_vfprintf
    0.80%   git      /home/nico/bin/git     [.] create_object
    0.80%   git      /home/nico/bin/git     [.] lookup_blob
    0.63%   git      /home/nico/bin/git     [.] update_tree_entry
[...]
So we've got lookup_object() clearly at the top. I suspect the
hashcmp() in there, which probably gets inlined, is responsible for most
cycles. There is certainly a better way here, and probably in JGit you
rely on some optimized facility provided by the language/library to
perform that lookup. So there is probably some easy improvements that
can be made here.
Otherwise it is at least 12.21 + 10.49 + 7.47 + 2.71 = 32.88% spent
directly in the zlib code, making it the biggest cost. This is rather
unavoidable unless the data structure is changed. And pack v4 would
probably move things such as find_pack_entry_one, decode_tree_entry,
process_tree and tree_entry off the radar as well.
The object writeout phase should pretty much be network bound.
> I fully implemented the reuse of a cached pack behind a thin pack idea
> I was trying to describe in this thread. It saved 1m7s off the JGit
> running time, but increased the data transfer by 25 MiB. I didn't
> expect this much of an increase, I honestly expected the thin pack
> portion to be well, thinner. The issue is the thin pack cannot delta
> against all of the history, its only delta compressing against the tip
> of the cached pack. So long-lived side branches that forked off an
> older part of the history aren't delta compressing well, or at all,
> and that is significantly bloating the thin pack. (Its also why that
> "newer" pack is 57M, but should be 14M if correctly combined with the
> cached pack.) If I were to consider all of the objects in the cached
> pack as potential delta base candidates for the thin pack, the entire
> benefit of the cached pack disappears.
Yeah... this sucks.
> I'm not sure I like it so much anymore. :-)
>
> The idea was half-baked, and came at the end of a long day, and after
> putting my cranky infant son down to sleep way past his normal bed
> time. I claim I was a sleep deprived new parent who wasn't thinking
> things through enough before writing an email to git@vger.
Well, this is still valuable information to archive.
And I wish I had been able to still write such quality emails when I was
a new parent. ;-)
> >> sendfile() call for the bulk of the content. I think we can just hand
> >> off the major streaming to the kernel.
> >
> > While this might look like a good idea in theory, did you actually
> > profile it to see if that would make a noticeable difference? The
> > pkt-line framing allows for asynchronous messages to be sent over a
> > sideband,
>
> No, of course not. The pkt-line framing is pretty low overhead, but
> copying kernel buffer to userspace back to kernel buffer sort of sucks
> for 400 MiB of data. sendfile() on 400 MiB to a network socket is
> much easier when its all kernel space.
Of course. But still... If you save 0.5 second by avoiding the copy to
and from user space of that 400 MiB (based on my machine which can do
1670MB/s) that's pretty much insignificant compared to the total time
for the clone, and therefore the wrong thing to optimize given the
required protocol changes.
> I figured, if it all worked
> out already to just dump the pack to the wire as-is, then we probably
> should also try to go for broke and reduce the userspace copying. It
> might not matter to your desktop, but ask John Hawley (CC'd) about
> kernel.org and the git traffic volume he is serving. They are doing
> more than 1 million git:// requests per day now.
Impressive. However I suspect that the vast majority of those requests
are from clients making a connection just to realize they're up to date
already. I don't think the user space copying is really a problem.
Of course, if we could have used sendfile() freely in, say,
copy_pack_data(), then we would have done so long ago. But we are
checksumming the data we create on the fly along with the data we
reuse from disk, so this is not necessarily a gain.
> >> Plus we can safely do byte range requests for resumable clone within
> >> the cached pack part of the stream.
> >
> > That part I'm not sure of. We are still facing the same old issues
> > here, as some mirrors might have the same commit edges for a cache pack
> > but not necessarily the same packing result, etc. So I'd keep that out
> > of the picture for now.
>
> I don't think its that hard. If we modify the transfer protocol to
> allow the server to denote boundaries between packs, the server can
> send the pack name (as in pack-$name.pack) and the pack SHA-1 trailer
> to the client. A client asking for resume of a cached pack presents
> its original want list, these two SHA-1s, and the byte offset he wants
> to restart from. The server validates the want set is still
> reachable, that the cached pack exists, and that the cached pack tips
> are reachable from current refs. If all of that is true, it validates
> the trailing SHA-1 in the pack matches what the client gave it. If
> that matches, it should be OK to resume transfer from where the client
> asked for.
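The validation steps described could be sketched as follows; the `Server` and `Pack` types are hypothetical stand-ins for server-side state, not an existing API:

```python
class Pack:
    """Minimal stand-in for a cached pack the server kept on disk."""
    def __init__(self, trailer_sha1, tips, size):
        self.trailer_sha1 = trailer_sha1   # SHA-1 trailer of pack-$name.pack
        self.tips = tips                   # commits the cached pack was built from
        self.size = size                   # total byte length

class Server:
    """Minimal stand-in for repository state."""
    def __init__(self, cached_packs, reachable):
        self.cached_packs = cached_packs   # name -> Pack
        self._reachable = set(reachable)   # reachable from current refs
    def is_reachable(self, sha1):
        return sha1 in self._reachable

def can_resume(server, wants, pack_name, pack_trailer, offset):
    pack = server.cached_packs.get(pack_name)
    if pack is None:
        return False                                    # cached pack is gone
    if pack.trailer_sha1 != pack_trailer:
        return False                                    # same name, different bytes
    if not all(server.is_reachable(w) for w in wants):
        return False                                    # want set no longer reachable
    if not all(server.is_reachable(t) for t in pack.tips):
        return False                                    # cache fell out of history
    return 0 <= offset < pack.size                      # OK to resume from offset
```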
This is still a half solution. If your network connection drops after
the first 52 MiB of transfer given the scenario you provided then you're
still screwed.
Nicolas
* Re: [RFC] Add --create-cache to repack
2011-01-29 4:08 ` Nicolas Pitre
@ 2011-01-29 4:35 ` Shawn Pearce
0 siblings, 0 replies; 26+ messages in thread
From: Shawn Pearce @ 2011-01-29 4:35 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Johannes Sixt, git, Junio C Hamano, John Hawley
On Fri, Jan 28, 2011 at 20:08, Nicolas Pitre <nico@fluxnic.net> wrote:
>> pack is actually smaller 376.30 MiB vs. C Git's 380.59 MiB. I point
>> out this data because improvements made to JGit may show similar
>> improvements to CGit given how close they are in running time.
>
> What are those improvements?
None right now. JGit is similar to CGit algorithm-wise. (Actually it
looks like JGit has a faster diff implementation, but that's a
different email.)
If you are asking about why JGit created a slightly smaller pack
file... it splits the delta window during threaded delta search
differently than CGit does, and we align our blocks slightly
differently when comparing two objects to generate a delta sequence
for them. These two variations mean JGit produces different deltas
than CGit does. Sometimes we are smaller, sometimes we are larger.
But it's a small difference, on the order of 1-4 MiB for something like
linux-2.6. I don't think it's worthwhile trying to analyze the
specific differences in implementations and retrofit those differences
into the other one.
What I was trying to say was, _if_ we made a change to JGit and it
dropped the running time, that same change in CGit should have _at
least_ the same running time improvement, if not better. I was
pointing out that this cached-pack change dropped the running time by
1 minute, so CGit should also see a similar improvement (if not
better). I would prefer to test against CGit for this sort of thing,
but it's been too long since I last poked pack-objects.c and the
revision code in CGit, while the JGit equivalents are really fresh in
my head.
> Now, the fact that JGit is so close to CGit must be because the actual
> cost is outside of them such as within zlib, otherwise the C code should
> normally always be faster, right?
Yup, I mostly agree with this statement. CGit does a lot of
malloc/free activity when reading objects in. JGit does too, but we
often fit into the young generation for the GC, which sometimes can be
faster to clean and recycle memory in. We're not too far off from C
code.
But yes... our profile looks like this too:
> Looking at the profile for "git rev-list --objects --all > /dev/null"
> for the object enumeration phase, we have:
>
> # Samples: 1814637
> #
> # Overhead Command Shared Object Symbol
> # ........ ............... ............. ......
> #
> 28.81% git /home/nico/bin/git [.] lookup_object
> 12.21% git /lib64/libz.so.1.2.3 [.] inflate
> 10.49% git /lib64/libz.so.1.2.3 [.] inflate_fast
> 7.47% git /lib64/libz.so.1.2.3 [.] inflate_table
> 6.66% git /lib64/libc-2.11.2.so [.] __GI_memcpy
> 5.66% git /home/nico/bin/git [.] find_pack_entry_one
> 2.98% git /home/nico/bin/git [.] decode_tree_entry
> [...]
>
> So we've got lookup_object() clearly at the top.
Isn't this the hash table lookup inside the revision pool, to see if
the object has already been visited? That seems horrible: 28% of the
CPU is going to probing that table.
> I suspect the
> hashcmp() in there, which probably gets inlined, is responsible for most
> cycles.
Probably true. I know our hashcmp() is inlined; it's actually written
by hand as 5 word compares, and is marked final, so the JIT is rather
likely to inline it.
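The hand-unrolled five-word compare can be illustrated like this (in Python for brevity; the real JGit code is Java operating on int fields):

```python
import struct

def hashcmp_words(a, b):
    """Compare two 20-byte SHA-1s as five big-endian 32-bit words instead of
    20 individual bytes, mirroring the unrolled compare described above."""
    aw = struct.unpack(">5I", a)
    bw = struct.unpack(">5I", b)
    return (aw[0] == bw[0] and aw[1] == bw[1] and aw[2] == bw[2]
            and aw[3] == bw[3] and aw[4] == bw[4])
```

The word-at-a-time form does a quarter of the comparisons and, since SHA-1s are effectively random, usually bails out on the first word.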
> There is certainly a better way here, and probably in JGit you
> rely on some optimized facility provided by the language/library to
> perform that lookup. So there is probably some easy improvements that
> can be made here.
Nope. Actually we have to bend over backwards and work against the
language to get anything even reasonably sane for performance. Our
"solution" in JGit has actually been used by Rob Pike to promote his
Go programming language and why Java sucks as a language. Its a great
quote of mine that someone dragged up off the git@vger mailing list
and started using to promote Go.
At least once I week I envy how easy it is to use hashcmp() and
hashcpy() inside of CGit. JGit's management of hashes is sh*t because
we have to bend so hard around the language.
> Otherwise it is at least 12.21 + 10.49 + 7.47 + 2.71 = 32.88% spent
> directly in the zlib code, making it the biggest cost.
Yea, that's what we have too, about 33% inside of zlib code... which
is the same implementation that CGit uses.
> This is rather
> unavoidable unless the data structure is changed.
We already knew this from our pack v4 experiments years ago.
> And pack v4 would
> probably move things such as find_pack_entry_one, decode_tree_entry,
> process_tree and tree_entry off the radar as well.
This is hard to do inside of CGit if I recall... but yes, changing the
way trees are handled would really improve things.
> The object writeout phase should pretty much be network bound.
Yes.
>> I fully implemented the reuse of a cached pack behind a thin pack idea
>> I was trying to describe in this thread. It saved 1m7s off the JGit
>> running time, but increased the data transfer by 25 MiB.
>
> Yeah... this sucks.
Very much. :-(
But this is a fundamental issue with our incremental fetch support
anyway. In this exact case if the client was at that 1 month old
commit, and fetched current master, he would pull 25 MiB of data... but
only needed about 4-6 MiB worth of deltas if it was properly delta
compressed against the content we know he already has. Our server
side optimization of only pushing the immediate "have" list of the
client into the delta search window limits how much we can compress
the data we are sending. If we were willing to push more in on the
server side, we could shrink the incremental fetch more. But that's a
CPU problem on the server.
--
Shawn.
* Re: [RFC] Add --create-cache to repack
2011-01-29 1:32 ` Shawn Pearce
2011-01-29 2:34 ` Shawn Pearce
2011-01-29 4:08 ` Nicolas Pitre
@ 2011-01-30 6:51 ` Junio C Hamano
2011-01-30 17:14 ` Nicolas Pitre
2011-01-30 19:29 ` Shawn Pearce
2011-01-30 22:13 ` Shawn Pearce
2011-01-31 18:47 ` Shawn Pearce
4 siblings, 2 replies; 26+ messages in thread
From: Junio C Hamano @ 2011-01-30 6:51 UTC (permalink / raw)
To: Shawn Pearce; +Cc: Nicolas Pitre, Johannes Sixt, git, John Hawley
Shawn Pearce <spearce@spearce.org> writes:
> I fully implemented the reuse of a cached pack behind a thin pack idea
> I was trying to describe in this thread. It saved 1m7s off the JGit
> running time, but increased the data transfer by 25 MiB. I didn't
> expect this much of an increase, I honestly expected the thin pack
> portion to be well, thinner. The issue is the thin pack cannot delta
> against all of the history, its only delta compressing against the tip
> of the cached pack. So long-lived side branches that forked off an
> older part of the history aren't delta compressing well, or at all,
> and that is significantly bloating the thin pack. (Its also why that
> "newer" pack is 57M, but should be 14M if correctly combined with the
> cached pack.) If I were to consider all of the objects in the cached
> pack as potential delta base candidates for the thin pack, the entire
> benefit of the cached pack disappears.
What if you instead use the cached pack this way?
0. You perform the proposed pre-traversal until you hit the tip of cached
pack(s), and realize that you will end up sending everything.
1. Instead of sending the new part of the history first and then sending
the cached pack(s), you send the contents of cached pack(s), but also
note what objects you sent;
2. Then you send the new part of the history, taking full advantage of
what you have already sent, perhaps doing only half of the reuse-delta
logic (i.e. you reuse what you can reuse, but you do _not_ punt on an
object that is not a delta in an existing pack).
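The two phases above can be costed schematically; `delta_size` and `whole_size` are hypothetical callbacks standing in for real pack-objects internals:

```python
def send_with_cached_pack(cached_objects, new_objects, delta_size, whole_size):
    """Toy cost model for the proposed ordering: stream the cached pack
    first, recording what went out, then emit each new object as a delta
    against the cheapest already-sent candidate when that beats storing
    it whole."""
    sent = list(cached_objects)                  # step 1: cached pack as-is
    total = sum(whole_size(o) for o in cached_objects)
    for o in new_objects:                        # step 2: the new history
        best = min((delta_size(o, base) for base in sent), default=None)
        if best is not None and best < whole_size(o):
            total += best                        # delta against something sent
        else:
            total += whole_size(o)               # no good base: store whole
        sent.append(o)
    return total
```

The working-set cost of remembering `sent` is exactly the objection raised in the follow-ups.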
* Re: [RFC] Add --create-cache to repack
2011-01-30 6:51 ` Junio C Hamano
@ 2011-01-30 17:14 ` Nicolas Pitre
2011-01-30 17:41 ` A Large Angry SCM
2011-01-30 19:29 ` Shawn Pearce
1 sibling, 1 reply; 26+ messages in thread
From: Nicolas Pitre @ 2011-01-30 17:14 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Shawn Pearce, Johannes Sixt, git, John Hawley
On Sat, 29 Jan 2011, Junio C Hamano wrote:
> Shawn Pearce <spearce@spearce.org> writes:
>
> > I fully implemented the reuse of a cached pack behind a thin pack idea
> > I was trying to describe in this thread. It saved 1m7s off the JGit
> > running time, but increased the data transfer by 25 MiB. I didn't
> > expect this much of an increase, I honestly expected the thin pack
> > portion to be well, thinner. The issue is the thin pack cannot delta
> > against all of the history, its only delta compressing against the tip
> > of the cached pack. So long-lived side branches that forked off an
> > older part of the history aren't delta compressing well, or at all,
> > and that is significantly bloating the thin pack. (Its also why that
> > "newer" pack is 57M, but should be 14M if correctly combined with the
> > cached pack.) If I were to consider all of the objects in the cached
> > pack as potential delta base candidates for the thin pack, the entire
> > benefit of the cached pack disappears.
>
> What if you instead use the cached pack this way?
>
> 0. You perform the proposed pre-traversal until you hit the tip of cached
> pack(s), and realize that you will end up sending everything.
>
> 1. Instead of sending the new part of the history first and then sending
> the cached pack(s), you send the contents of cached pack(s), but also
> note what objects you sent;
>
> 2. Then you send the new part of the history, taking full advantage of
> what you have already sent, perhaps doing only half of the reuse-delta
> logic (i.e. you reuse what you can reuse, but you do _not_ punt on an
> object that is not a delta in an existing pack).
The problem is to determine the best base object to delta against. If
you end up listing all the already-sent objects and performing delta
attempts against them for the remaining non-delta objects to find the
best match, then you might end up taking more CPU time than the
current enumeration phase.
Nicolas
* Re: [RFC] Add --create-cache to repack
2011-01-30 17:14 ` Nicolas Pitre
@ 2011-01-30 17:41 ` A Large Angry SCM
0 siblings, 0 replies; 26+ messages in thread
From: A Large Angry SCM @ 2011-01-30 17:41 UTC (permalink / raw)
To: Nicolas Pitre
Cc: Junio C Hamano, Shawn Pearce, Johannes Sixt, git, John Hawley
On 01/30/2011 12:14 PM, Nicolas Pitre wrote:
> On Sat, 29 Jan 2011, Junio C Hamano wrote:
>
>> Shawn Pearce<spearce@spearce.org> writes:
>>
>>> I fully implemented the reuse of a cached pack behind a thin pack idea
>>> I was trying to describe in this thread. It saved 1m7s off the JGit
>>> running time, but increased the data transfer by 25 MiB. I didn't
>>> expect this much of an increase, I honestly expected the thin pack
>>> portion to be well, thinner. The issue is the thin pack cannot delta
>>> against all of the history, its only delta compressing against the tip
>>> of the cached pack. So long-lived side branches that forked off an
>>> older part of the history aren't delta compressing well, or at all,
>>> and that is significantly bloating the thin pack. (Its also why that
>>> "newer" pack is 57M, but should be 14M if correctly combined with the
>>> cached pack.) If I were to consider all of the objects in the cached
>>> pack as potential delta base candidates for the thin pack, the entire
>>> benefit of the cached pack disappears.
>>
>> What if you instead use the cached pack this way?
>>
>> 0. You perform the proposed pre-traversal until you hit the tip of cached
>> pack(s), and realize that you will end up sending everything.
>>
>> 1. Instead of sending the new part of the history first and then sending
>> the cached pack(s), you send the contents of cached pack(s), but also
>> note what objects you sent;
>>
>> 2. Then you send the new part of the history, taking full advantage of
>> what you have already sent, perhaps doing only half of the reuse-delta
>> logic (i.e. you reuse what you can reuse, but you do _not_ punt on an
>> object that is not a delta in an existing pack).
>
> The problem is to determine the best base object to delta against. If
> you end up listing all the already sent objects and perform delta
> attempts against them for the remaining non delta objects to find the
> best match then you might end up taking more CPU time than the current
> enumeration phase.
Why worry about best here? Just add the object (or one of the objects)
with the same path from the commit you found in step 0, above, to the
delta base search for each object to pack.
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC] Add --create-cache to repack
2011-01-30 6:51 ` Junio C Hamano
2011-01-30 17:14 ` Nicolas Pitre
@ 2011-01-30 19:29 ` Shawn Pearce
1 sibling, 0 replies; 26+ messages in thread
From: Shawn Pearce @ 2011-01-30 19:29 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Nicolas Pitre, Johannes Sixt, git, John Hawley
On Sat, Jan 29, 2011 at 22:51, Junio C Hamano <gitster@pobox.com> wrote:
> Shawn Pearce <spearce@spearce.org> writes:
>
>> I fully implemented the reuse of a cached pack behind a thin pack idea
>> I was trying to describe in this thread. It saved 1m7s off the JGit
>> running time, but increased the data transfer by 25 MiB. I didn't
>> expect this much of an increase, I honestly expected the thin pack
>> portion to be, well, thinner. The issue is the thin pack cannot delta
>> against all of the history; it's only delta compressing against the tip
>> of the cached pack. So long-lived side branches that forked off an
>> older part of the history aren't delta compressing well, or at all,
>> and that is significantly bloating the thin pack. (It's also why that
>> "newer" pack is 57M, but should be 14M if correctly combined with the
>> cached pack.) If I were to consider all of the objects in the cached
>> pack as potential delta base candidates for the thin pack, the entire
>> benefit of the cached pack disappears.
>
> What if you instead use the cached pack this way?
>
> 0. You perform the proposed pre-traversal until you hit the tip of cached
> pack(s), and realize that you will end up sending everything.
>
> 1. Instead of sending the new part of the history first and then sending
> the cached pack(s), you send the contents of cached pack(s), but also
> note what objects you sent;
This is the part I was trying to avoid. Making this list of objects
from the cached pack(s) costs working set inside the pack-objects
process. I had hoped that the cached packs would let me skip this
step.
But let's say that's an acceptable cost. We cannot efficiently make a
useful list of objects from the pack. Scanning the .idx file only
tells us the SHA-1. It does not tell us the type, nor does it tell us
what the path hash code would be for the object if it were a tree or
blob. So we cannot efficiently use this pack listing to construct the
delta window.
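The "path hash code" here is the small per-object hash pack-objects uses to sort delta candidates by pathname similarity; it is computed from the path string, which is exactly what a .idx file does not record. A rough Python transliteration of C Git's hashing scheme, offered as an approximation of pack-objects' name hash rather than a normative spec:

```python
def path_hash(name):
    """32-bit rolling hash over a path: whitespace is skipped, each new
    character lands in the top byte while older ones shift toward the
    bottom, so paths with the same trailing characters (same basename
    or extension) get numerically close hashes and sort near each other."""
    h = 0
    for ch in name:
        if ch.isspace():
            continue
        h = ((h >> 2) + (ord(ch) << 24)) & 0xFFFFFFFF
    return h
```

Since this hash needs the path, an index that lists only SHA-1s cannot reproduce it, which is why the .idx scan is not enough to build the delta window.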
> 2. Then you send the new part of the history, taking full advantage of
> what you have already sent, perhaps doing only half of the reuse-delta
> logic (i.e. you reuse what you can reuse, but you do _not_ punt on an
> object that is not a delta in an existing pack).
Well, I guess we could go half-way. We could try to use only
non-delta objects from the cached pack as potential delta bases for
this delta search.
To do that we would build the reverse index for the cached pack, then
check each object's type code just before we send that part of the
cached pack. If it's non-delta, we can get its SHA-1 from the reverse
index, toss the object into the delta search list, and copy out the
length of the object until the next object starts.
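A pack reverse index is just the .idx entries re-sorted by pack offset; once sorted, each object's on-disk length falls out as the gap to the next offset, which is what "copy out the length of the object until the next object starts" relies on. A toy sketch over synthetic entries, not real .idx parsing:

```python
def object_extents(idx_entries, pack_data_end):
    """idx_entries: iterable of (sha1_hex, pack_offset) pairs as a .idx
    lists them (sorted by SHA-1). Re-sorting by offset gives the reverse
    index; the stored size of each object is the distance to the next
    object's offset (or to the end of the pack data for the last one)."""
    by_offset = sorted(idx_entries, key=lambda e: e[1])
    extents = []
    for i, (sha, off) in enumerate(by_offset):
        end = by_offset[i + 1][1] if i + 1 < len(by_offset) else pack_data_end
        extents.append((sha, off, end - off))
    return extents
```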
However... I suspect our delta results would be the same as in the
thin-pack-before-cached-pack test I did earlier. The objects that are
non-delta in the cached pack are (in theory) approximately the objects
immediately reachable from the cached pack's tip. That was already
put into the delta window as the base candidates for the thin pack.
This may be a faster way to find that thin pack edge, but the data
transfer will still be sub-optimal because we cannot consider deltas
as bases.
--
Shawn.
* Re: [RFC] Add --create-cache to repack
2011-01-29 1:32 ` Shawn Pearce
` (2 preceding siblings ...)
2011-01-30 6:51 ` Junio C Hamano
@ 2011-01-30 22:13 ` Shawn Pearce
2011-01-31 18:47 ` Shawn Pearce
4 siblings, 0 replies; 26+ messages in thread
From: Shawn Pearce @ 2011-01-30 22:13 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Johannes Sixt, git, Junio C Hamano, John Hawley
On Fri, Jan 28, 2011 at 17:32, Shawn Pearce <spearce@spearce.org> wrote:
>
> I fully implemented the reuse of a cached pack behind a thin pack idea
> I was trying to describe in this thread. It saved 1m7s off the JGit
> running time, but increased the data transfer by 25 MiB. I didn't
> expect this much of an increase, I honestly expected the thin pack
> portion to be well, thinner.
JGit's thin pack creation is crap. For example, this is the same fetch:
$ git fetch ../tmp_linux26
remote: Counting objects: 61521, done.
remote: Compressing objects: 100% (12096/12096), done.
remote: Total 50275 (delta 42578), reused 45220 (delta 37524)
Receiving objects: 100% (50275/50275), 11.13 MiB | 7.29 MiB/s, done.
Resolving deltas: 100% (42578/42578), completed with 4968 local objects.
$ git fetch git://localhost/tmp_linux26
remote: Counting objects: 144190, done
remote: Finding sources: 100% (50275/50275)
remote: Compressing objects: 100% (106568/106568)
remote: Compressing objects: 100% (12750/12750)
Receiving objects: 100% (50275/50275), 24.66 MiB | 10.93 MiB/s, done.
Resolving deltas: 100% (40345/40345), completed with 2218 local objects.
JGit produced an extra 13.53 MiB for this pack, because it missed
about 2,233 delta opportunities. It turns out we are too aggressive
at pushing objects from the edges into the delta windows. JGit pushes
*everything* in the edge commits, rather than only the paths that are
actually used by the objects we need to send. This floods the delta
search window with garbage, and makes it less likely that an object to
be sent will find a relevant delta base in the search window.
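The fix described here can be pictured as a filter on the edge objects before they enter the window: only paths that also occur among the objects to be sent are worth keeping as base candidates. A hypothetical sketch; a real implementation walks tree entries and feeds a bounded sliding window:

```python
def useful_edge_bases(edge_objects, needed_paths, window_size):
    """edge_objects: {path: sha} for everything in the edge commits.
    Dropping paths we will not send keeps irrelevant objects from
    crowding real candidates out of a window of `window_size` slots."""
    picked = [(p, s) for p, s in sorted(edge_objects.items())
              if p in needed_paths]
    return picked[:window_size]
```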
--
Shawn.
* Re: [RFC] Add --create-cache to repack
2011-01-29 1:32 ` Shawn Pearce
` (3 preceding siblings ...)
2011-01-30 22:13 ` Shawn Pearce
@ 2011-01-31 18:47 ` Shawn Pearce
2011-01-31 21:48 ` Nicolas Pitre
4 siblings, 1 reply; 26+ messages in thread
From: Shawn Pearce @ 2011-01-31 18:47 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Johannes Sixt, git, Junio C Hamano, John Hawley
On Fri, Jan 28, 2011 at 17:32, Shawn Pearce <spearce@spearce.org> wrote:
>>> >
>>> >> This started because I was looking for a way to speed up clones coming
>>> >> from a JGit server. Cloning the linux-2.6 repository is painful,
>
> Well, scratch the idea in this thread. I think.
Nope, I'm back in favor of this after fixing JGit's thin pack
generation. Here's why.
Take linux-2.6.git as of Jan 12th, with the cache root as of Dec 28th:
$ git update-ref HEAD f878133bf022717b880d0e0995b8f91436fd605c
$ git-repack.sh --create-cache \
--cache-root=b52e2a6d6d05421dea6b6a94582126af8cd5cca2 \
--cache-include=v2.6.11-tree
$ git repack -a -d
$ ls -lh objects/pack/
total 456M
1.4M pack-74af5edca80797736fe4de7279b2a81af98470a5.idx
38M pack-74af5edca80797736fe4de7279b2a81af98470a5.pack
49M pack-d3e77c8b3045c7f54fa2fb6bbfd4dceca1e2b9fa.idx
89 pack-d3e77c8b3045c7f54fa2fb6bbfd4dceca1e2b9fa.keep
368M pack-d3e77c8b3045c7f54fa2fb6bbfd4dceca1e2b9fa.pack
Our "recent history" is 38M, and our "cached pack" is 368M. It's a bit
more disk than is strictly necessary; this should be ~380M. Call it
~26M of wasted disk. The "cached object list" I proposed elsewhere in
this thread would cost about 41M of disk and is utterly useless except
for initial clones. Here we are wasting about 26M of disk to have
slightly shorter delta chains in the cached pack (otherwise known as
our ancient history). So it's a slightly smaller waste, and we get
some (minor) benefit.
Clone without pack caching:
$ time git clone --bare git://localhost/tmp_linux26_withTag tmp_in.git
Cloning into bare repository tmp_in.git...
remote: Counting objects: 1861830, done
remote: Finding sources: 100% (1861830/1861830)
remote: Getting sizes: 100% (88243/88243)
remote: Compressing objects: 100% (88184/88184)
Receiving objects: 100% (1861830/1861830), 376.01 MiB | 19.01 MiB/s, done.
remote: Total 1861830 (delta 4706), reused 1851053 (delta 1553844)
Resolving deltas: 100% (1564621/1564621), done.
real 3m19.005s
user 1m36.250s
sys 0m10.290s
Clone with pack caching:
$ time git clone --bare git://localhost/tmp_linux26_withTag tmp_in.git
Cloning into bare repository tmp_in.git...
remote: Counting objects: 1601, done
remote: Counting objects: 1828460, done
remote: Finding sources: 100% (50475/50475)
remote: Getting sizes: 100% (18843/18843)
remote: Compressing objects: 100% (7585/7585)
remote: Total 1861830 (delta 2407), reused 1856197 (delta 37510)
Receiving objects: 100% (1861830/1861830), 378.40 MiB | 31.31 MiB/s, done.
Resolving deltas: 100% (1559477/1559477), done.
real 2m2.938s
user 1m35.890s
sys 0m9.830s
Using the cached pack increased our total data transfer by 2.39 MiB,
but saved 1m17s on server computation time. If we go back and look at
our cached pack size (368M), the leading thin-pack should be about
10.4 MiB (378.40M - 368M = 10.4M). If I modify the tmp_in.git client
to have only the cached pack's tip and fetch using CGit, we see the
thin pack to bring ourselves current is 11.07 MiB (JGit does this in
10.96 MiB):
$ cd tmp_in.git
$ git update-ref HEAD b52e2a6d6d05421dea6b6a94582126af8cd5cca2
$ git repack -a -d ; # yay we are at ~1 month ago
$ time git fetch ../tmp_linux26_withTag
remote: Counting objects: 60570, done.
remote: Compressing objects: 100% (11924/11924), done.
remote: Total 49804 (delta 42196), reused 44837 (delta 37231)
Receiving objects: 100% (49804/49804), 11.07 MiB | 7.37 MiB/s, done.
Resolving deltas: 100% (42196/42196), completed with 4956 local objects.
From ../tmp_linux26_withTag
* branch HEAD -> FETCH_HEAD
real 0m35.083s
user 0m25.710s
sys 0m1.190s
The pack caching feature is *no worse* in transfer size than if the
client copied the pack from 1 month ago, and then did an incremental
fetch to bring themselves current. Compared to the naive clone, it
saves an incredible amount of working set space and CPU time. The
server only needs to keep track of the incremental thin pack, and can
completely ignore the ancient history objects. It's a great
alternative for projects that want users to rsync/http dumb transport
down a large stable repository, then incremental fetch themselves
current. Or busy mirror sites that are willing to trade some small
bandwidth for server CPU and memory.
In this particular example, there is ~11 MiB of data that cannot be
safely resumed, or the first 2.9%. At 56 KiB/s, a client needs to get
through the first 3 minutes of transfer before they can reach the
resumable checkpoint (where the thin pack ends, and the cached pack
starts). It would be better if we could resume anywhere in the
stream, but being able to resume the last 97% is infinitely better
than being able to resume nothing. If someone wants to really go
crazy, this is where a "gittorrent" client could start up and handle
the remaining 97% of the transfer. :-)
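The resume figures above check out arithmetically; under the stated assumptions (378.40 MiB total stream, ~11 MiB non-resumable thin-pack prefix, a 56 KiB/s link):

```python
total_mib = 378.40   # full clone stream from the timing above
thin_mib = 11.0      # leading thin pack, the part that cannot be resumed
rate_kib_s = 56.0    # the slow-link rate assumed in the text

non_resumable_pct = 100 * thin_mib / total_mib        # ~2.9% of the stream
seconds_to_checkpoint = thin_mib * 1024 / rate_kib_s  # ~201 s, just over 3 min
```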
I think this is worthwhile. If we are afraid of the extra 2.39 MiB
data transfer this forces on the client when the repository owner
enables the feature, we should go back and improve our thin-pack code.
Transferring 11 MiB to catch up a kernel from Dec 28th to Jan 12th
sounds like a lot of data, and any improvements in the general
thin-pack code would shrink the leading thin-pack, possibly getting us
that 2.39 MiB back.
--
Shawn.
* Re: [RFC] Add --create-cache to repack
2011-01-31 18:47 ` Shawn Pearce
@ 2011-01-31 21:48 ` Nicolas Pitre
0 siblings, 0 replies; 26+ messages in thread
From: Nicolas Pitre @ 2011-01-31 21:48 UTC (permalink / raw)
To: Shawn Pearce; +Cc: Johannes Sixt, git, Junio C Hamano, John Hawley
On Mon, 31 Jan 2011, Shawn Pearce wrote:
> On Fri, Jan 28, 2011 at 17:32, Shawn Pearce <spearce@spearce.org> wrote:
> >>> >
> >>> >> This started because I was looking for a way to speed up clones coming
> >>> >> from a JGit server. Cloning the linux-2.6 repository is painful,
> >
> > Well, scratch the idea in this thread. I think.
>
> Nope, I'm back in favor of this after fixing JGit's thin pack
> generation. Here's why.
>
> Take linux-2.6.git as of Jan 12th, with the cache root as of Dec 28th:
>
> $ git update-ref HEAD f878133bf022717b880d0e0995b8f91436fd605c
> $ git-repack.sh --create-cache \
> --cache-root=b52e2a6d6d05421dea6b6a94582126af8cd5cca2 \
> --cache-include=v2.6.11-tree
> $ git repack -a -d
>
> $ ls -lh objects/pack/
> total 456M
> 1.4M pack-74af5edca80797736fe4de7279b2a81af98470a5.idx
> 38M pack-74af5edca80797736fe4de7279b2a81af98470a5.pack
>
> 49M pack-d3e77c8b3045c7f54fa2fb6bbfd4dceca1e2b9fa.idx
> 89 pack-d3e77c8b3045c7f54fa2fb6bbfd4dceca1e2b9fa.keep
> 368M pack-d3e77c8b3045c7f54fa2fb6bbfd4dceca1e2b9fa.pack
>
> Our "recent history" is 38M, and our "cached pack" is 368M. It's a bit
> more disk than is strictly necessary; this should be ~380M. Call it
> ~26M of wasted disk.
This is fine. When doing an incremental fetch, the thin pack does
minimize the transfer size, but it does increase the stored pack size by
appending a bunch of non delta objects to make the pack complete.
What happens though, is that when gc kicks in, the wasted space is
collected back. Here with a single pack we wouldn't claim that space
back as our current euristics is to reuse delta (non) pairing by
default. Maybe in that case we could simply not reuse deltas if they're
of the REF_DELTA type.
> The "cached object list" I proposed elsewhere in
> this thread would cost about 41M of disk and is utterly useless except
> for initial clones. Here we are wasting about 26M of disk to have
> slightly shorter delta chains in the cached pack (otherwise known as
> our ancient history). So it's a slightly smaller waste, and we get
> some (minor) benefit.
Well, of course the ancient history you're willing to keep stable for a
while could be repacked even more aggressively than usual.
> Using the cached pack increased our total data transfer by 2.39 MiB,
That's more than acceptable IMHO. That's less than 1% of the total
transfer.
> I think this is worthwhile. If we are afraid of the extra 2.39 MiB
> data transfer this forces on the client when the repository owner
> enables the feature, we should go back and improve our thin-pack code.
> Transferring 11 MiB to catch up a kernel from Dec 28th to Jan 12th
> sounds like a lot of data,
Well, your timing for this test corresponds with the 2.6.38 merge window
which is a high activity peak for this repository. Still, that would
probably fit the usage scenario in practice pretty well where the cache
pack would be produced on a tagged release which happens right before
the merge window.
> and any improvements in the general
> thin-pack code would shrink the leading thin-pack, possibly getting us
> that 2.39 MiB back.
Any improvement to the thin pack would require more CPU cycles, possibly
a lot more. So given that this transfer overhead is already less than 1%,
I don't think we need to bother.
Nicolas