* Continue git clone after interruption
@ 2009-08-17 11:42 Tomasz Kontusz
  2009-08-17 12:31 ` Johannes Schindelin
  0 siblings, 1 reply; 39+ messages in thread
From: Tomasz Kontusz @ 2009-08-17 11:42 UTC (permalink / raw)
  To: git

Hi,
is anybody working on making it possible to continue git clone after
interruption? It would be quite useful for people with bad internet
connection (I was downloading a big repo lately, and it was a bit
frustrating to start it over every time git stopped at ~90%).

Tomasz Kontusz


* Re: Continue git clone after interruption
  2009-08-17 11:42 Continue git clone after interruption Tomasz Kontusz
@ 2009-08-17 12:31 ` Johannes Schindelin
  2009-08-17 15:23   ` Shawn O. Pearce
  2009-08-18  5:43   ` Matthieu Moy
  0 siblings, 2 replies; 39+ messages in thread
From: Johannes Schindelin @ 2009-08-17 12:31 UTC (permalink / raw)
  To: Tomasz Kontusz; +Cc: git

Hi,

On Mon, 17 Aug 2009, Tomasz Kontusz wrote:

> is anybody working on making it possible to continue git clone after 
> interruption? It would be quite useful for people with bad internet 
> connection (I was downloading a big repo lately, and it was a bit 
> frustrating to start it over every time git stopped at ~90%).

Unfortunately, we did not have enough GSoC slots for the project to allow 
restartable clones.

There were discussions about how to implement this on the list, though.

Ciao,
Dscho


* Re: Continue git clone after interruption
  2009-08-17 12:31 ` Johannes Schindelin
@ 2009-08-17 15:23   ` Shawn O. Pearce
  2009-08-18  5:43   ` Matthieu Moy
  1 sibling, 0 replies; 39+ messages in thread
From: Shawn O. Pearce @ 2009-08-17 15:23 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Tomasz Kontusz, git

Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> On Mon, 17 Aug 2009, Tomasz Kontusz wrote:
> 
> > is anybody working on making it possible to continue git clone after 
> > interruption? It would be quite useful for people with bad internet 
> > connection (I was downloading a big repo lately, and it was a bit 
> > frustrating to start it over every time git stopped at ~90%).
> 
> Unfortunately, we did not have enough GSoC slots for the project to allow 
> restartable clones.
> 
> There were discussions about how to implement this on the list, though.

Unfortunately, those of us who know how the native protocol works
can't come to an agreement on how it might be restartable.  If you
really read the archives on this topic, you'll see that Nico and I
disagree about how to do this.  IIRC, Nico's position is that it isn't
really possible to implement a restart.

-- 
Shawn.


* Re: Continue git clone after interruption
  2009-08-17 12:31 ` Johannes Schindelin
  2009-08-17 15:23   ` Shawn O. Pearce
@ 2009-08-18  5:43   ` Matthieu Moy
  2009-08-18  6:58     ` Tomasz Kontusz
  1 sibling, 1 reply; 39+ messages in thread
From: Matthieu Moy @ 2009-08-18  5:43 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Tomasz Kontusz, git

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> Hi,
>
> On Mon, 17 Aug 2009, Tomasz Kontusz wrote:
>
>> is anybody working on making it possible to continue git clone after 
>> interruption? It would be quite useful for people with bad internet 
>> connection (I was downloading a big repo lately, and it was a bit 
>> frustrating to start it over every time git stopped at ~90%).
>
> Unfortunately, we did not have enough GSoC slots for the project to allow 
> restartable clones.
>
> There were discussions about how to implement this on the list,
> though.

And a paragraph on the wiki:

http://git.or.cz/gitwiki/SoC2009Ideas#RestartableClone

-- 
Matthieu


* Re: Continue git clone after interruption
  2009-08-18  5:43   ` Matthieu Moy
@ 2009-08-18  6:58     ` Tomasz Kontusz
  2009-08-18 17:56       ` Nicolas Pitre
  0 siblings, 1 reply; 39+ messages in thread
From: Tomasz Kontusz @ 2009-08-18  6:58 UTC (permalink / raw)
  To: git

Dnia 2009-08-18, wto o godzinie 07:43 +0200, Matthieu Moy pisze:
> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
> 
> > Hi,
> >
> > On Mon, 17 Aug 2009, Tomasz Kontusz wrote:
> >
> >> is anybody working on making it possible to continue git clone after 
> >> interruption? It would be quite useful for people with bad internet 
> >> connection (I was downloading a big repo lately, and it was a bit 
> >> frustrating to start it over every time git stopped at ~90%).
> >
> > Unfortunately, we did not have enough GSoC slots for the project to allow 
> > restartable clones.
> >
> > There were discussions about how to implement this on the list,
> > though.
> 
> And a paragraph on the wiki:
> 
> http://git.or.cz/gitwiki/SoC2009Ideas#RestartableClone

Ok, so it looks like it's not implementable without some kind of
server-side cache, so that the server would know what the pack it was
sending looked like.
But here's my idea: make the server send objects in a different order (the
newest commit + whatever it points to first, then the next one, then
another...). Then it would be possible to look at what we got, tell the
server we have nothing, and that we want [the newest commit that was not
complete]. I know the reason why it is sorted the way it is, but I think
that the way data is stored after clone is the client's problem, so the
client should reorganize packs the way it wants.

Tomasz K.


* Re: Continue git clone after interruption
  2009-08-18  6:58     ` Tomasz Kontusz
@ 2009-08-18 17:56       ` Nicolas Pitre
  2009-08-18 18:45         ` Jakub Narebski
  0 siblings, 1 reply; 39+ messages in thread
From: Nicolas Pitre @ 2009-08-18 17:56 UTC (permalink / raw)
  To: Tomasz Kontusz; +Cc: git

On Tue, 18 Aug 2009, Tomasz Kontusz wrote:

> Ok, so it looks like it's not implementable without some kind of cache
> server-side, so the server would know what the pack it was sending
> looked like.
> But here's my idea: make server send objects in different order (the
> newest commit + whatever it points to first, then next one,then
> another...). Then it would be possible to look at what we got, tell
> server we have nothing, and want [the newest commit that was not
> complete]. I know the reason why it is sorted the way it is, but I think
> that the way data is stored after clone is clients problem, so the
> client should reorganize packs the way it wants.

That won't buy you much.  You should realize that a pack is made of:

1) Commit objects.  Yes they're all put together at the front of the pack,
   but they roughly are the equivalent of:

	git log --pretty=raw | gzip | wc -c

   For the Linux repo as of now that is around 32 MB.

2) Tree and blob objects.  Those are the bulk of the content for the top 
   commit.  The top commit is usually not delta compressed because we 
   want fast access to the top commit, and that is used as the base for 
   further delta compression for older commits.  So the very first 
   commit is whole at the front of the pack right after the commit 
   objects.  You can estimate the size of this data with:

	git archive --format=tar HEAD | gzip | wc -c

   On the same Linux repo this is currently 75 MB.

3) Delta objects.  Those make up the rest of the pack, plus a couple of 
   tree/blob objects that were not found in the top commit and are 
   different enough from any object in that top commit not to be 
   represented as deltas.  Still, the majority of objects for all the 
   remaining commits are delta objects.

So... if we reorder objects, all that we can do is to spread commit 
objects around so that the objects referenced by one commit are all seen 
before another commit object is included.  That would cut on that 
initial 32 MB.

However you still have to get that 75 MB in order to at least be able to 
look at _one_ commit.  So you've only reduced your critical download 
size from 107 MB to 75 MB.  This is some improvement, of course, but not 
worth the bother IMHO.  If we're to have restartable clone, it has to 
work for any size.

And that's where the real problem is.  I don't think having servers to 
cache pack results for every fetch requests is sensible as that would be 
an immediate DoS attack vector.

And because the object order in a pack is not defined by the protocol, 
we cannot expect the server to necessarily always provide the same 
object order either.  For example, it is already undefined in which 
order you'll receive objects, as the threaded delta search is 
non-deterministic and two identical fetch requests may end up with slightly 
different packing.  Or load balancing may redirect your fetch requests 
to different git servers which might have different versions of zlib, or 
even git itself, affecting the object packing order and/or size.

Now... What _could_ be done, though, is some extension to the 
git-archive command.  One thing that is well and strictly defined in git 
is the file path sort order.  So given a commit SHA1, you should always 
get the same files in the same order from git-archive.  For an initial 
clone, git could attempt fetching the top commit using the remote 
git-archive service and locally reconstruct that top commit that way.  
If the transfer is interrupted in the middle, then the remote 
git-archive could be told how to resume the transfer by telling it how 
many files and how many bytes in the current file to skip.  This way the 
server doesn't need to perform any sort of caching and remains 
stateless.

You then end up with a pretty shallow repository.  The clone process 
could then fall back to the traditional native git transfer protocol to 
deepen the history of that shallow repository.  And then that special 
packing sort order to distribute commit objects would make sense since 
each commit would then have a fairly small set of new objects, and most 
of them would be deltas anyway, making the data size per commit really 
small and any interrupted transfer much less of an issue.


Nicolas


* Re: Continue git clone after interruption
  2009-08-18 17:56       ` Nicolas Pitre
@ 2009-08-18 18:45         ` Jakub Narebski
  2009-08-18 20:01           ` Nicolas Pitre
  2009-08-19  4:42           ` Sitaram Chamarty
  0 siblings, 2 replies; 39+ messages in thread
From: Jakub Narebski @ 2009-08-18 18:45 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Tomasz Kontusz, git

Nicolas Pitre <nico@cam.org> writes:

> On Tue, 18 Aug 2009, Tomasz Kontusz wrote:
> 
> > Ok, so it looks like it's not implementable without some kind of cache
> > server-side, so the server would know what the pack it was sending
> > looked like.
> > But here's my idea: make server send objects in different order (the
> > newest commit + whatever it points to first, then next one,then
> > another...). Then it would be possible to look at what we got, tell
> > server we have nothing, and want [the newest commit that was not
> > complete]. I know the reason why it is sorted the way it is, but I think
> > that the way data is stored after clone is clients problem, so the
> > client should reorganize packs the way it wants.
> 
> That won't buy you much.  You should realize that a pack is made of:
> 
> 1) Commit objects.  Yes they're all put together at the front of the pack,
>    but they roughly are the equivalent of:
> 
> 	git log --pretty=raw | gzip | wc -c
> 
>    For the Linux repo as of now that is around 32 MB.

For my clone of Git repository this gives 3.8 MB
 
> 2) Tree and blob objects.  Those are the bulk of the content for the top 
>    commit.  The top commit is usually not delta compressed because we 
>    want fast access to the top commit, and that is used as the base for 
>    further delta compression for older commits.  So the very first 
>    commit is whole at the front of the pack right after the commit 
>    objects.  you can estimate the size of this data with:
> 
> 	git archive --format=tar HEAD | gzip | wc -c
> 
>    On the same Linux repo this is currently 75 MB.

On the same Git repository this gives 2.5 MB

> 
> 3) Delta objects.  Those are making the rest of the pack, plus a couple 
>    tree/blob objects that were not found in the top commit and are 
>    different enough from any object in that top commit not to be 
>    represented as deltas.  Still, the majority of objects for all the 
>    remaining commits are delta objects.

You forgot that delta chains are bounded by the pack.depth limit, which
defaults to 50.  You would then have additional full objects.

The single packfile for this (just gc'ed) Git repository is 37 MB.
Much more than 3.8 MB + 2.5 MB = 6.3 MB.

[cut]

There is another way we could go to implement resumable clone.
Let git first try to clone the whole repository (single pack; BTW, what
happens if this pack is larger than the file size limit of the given
filesystem?).  If that fails, the client first asks for the first half of
the repository (half as in bisect, but it is the server that has to
calculate it).  If that downloads, it asks the server for the rest of the
repository.  If it fails, it reduces the size by half again, and asks for
about 1/4 of the repository in a packfile first.

The only extension required is for the server to support an additional
capability, which enables the client to ask for an appropriate 1/2^n part
of the repository (approximately), or 1/2^n between have and want.
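
For comparison, the closest knob that exists today is depth-based rather
than size-based: it splits the clone into several smaller transfers,
although each individual transfer is still all-or-nothing, so it is not a
resume mechanism (host and repository names are placeholders):

        # shallow clone first, then deepen the history in steps
        git clone --depth=1 git://host/repo.git repo
        cd repo
        git fetch --depth=100 origin
        git fetch --depth=1000 origin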

-- 
Jakub Narebski
Poland
ShadeHawk on #git


* Re: Continue git clone after interruption
  2009-08-18 18:45         ` Jakub Narebski
@ 2009-08-18 20:01           ` Nicolas Pitre
  2009-08-18 21:02             ` Jakub Narebski
  2009-08-18 22:28             ` Johannes Schindelin
  2009-08-19  4:42           ` Sitaram Chamarty
  1 sibling, 2 replies; 39+ messages in thread
From: Nicolas Pitre @ 2009-08-18 20:01 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Tomasz Kontusz, git

On Tue, 18 Aug 2009, Jakub Narebski wrote:

> Nicolas Pitre <nico@cam.org> writes:
> 
> > On Tue, 18 Aug 2009, Tomasz Kontusz wrote:
> > 
> > > Ok, so it looks like it's not implementable without some kind of cache
> > > server-side, so the server would know what the pack it was sending
> > > looked like.
> > > But here's my idea: make server send objects in different order (the
> > > newest commit + whatever it points to first, then next one,then
> > > another...). Then it would be possible to look at what we got, tell
> > > server we have nothing, and want [the newest commit that was not
> > > complete]. I know the reason why it is sorted the way it is, but I think
> > > that the way data is stored after clone is clients problem, so the
> > > client should reorganize packs the way it wants.
> > 
> > That won't buy you much.  You should realize that a pack is made of:
> > 
> > 1) Commit objects.  Yes they're all put together at the front of the pack,
> >    but they roughly are the equivalent of:
> > 
> > 	git log --pretty=raw | gzip | wc -c
> > 
> >    For the Linux repo as of now that is around 32 MB.
> 
> For my clone of Git repository this gives 3.8 MB
>  
> > 2) Tree and blob objects.  Those are the bulk of the content for the top 
> >    commit.  The top commit is usually not delta compressed because we 
> >    want fast access to the top commit, and that is used as the base for 
> >    further delta compression for older commits.  So the very first 
> >    commit is whole at the front of the pack right after the commit 
> >    objects.  you can estimate the size of this data with:
> > 
> > 	git archive --format=tar HEAD | gzip | wc -c
> > 
> >    On the same Linux repo this is currently 75 MB.
> 
> On the same Git repository this gives 2.5 MB

Interesting to see that the commit history is larger than the latest 
source tree.  Probably that would be the same with the Linux kernel as 
well if all versions since the beginning with adequate commit logs were 
included in the repo.

> > 3) Delta objects.  Those are making the rest of the pack, plus a couple 
> >    tree/blob objects that were not found in the top commit and are 
> >    different enough from any object in that top commit not to be 
> >    represented as deltas.  Still, the majority of objects for all the 
> >    remaining commits are delta objects.
> 
> You forgot that delta chains are bound by pack.depth limit, which
> defaults to 50.  You would have then additional full objects.

Sure, but that's probably not significant.  The delta chain depth is 
limited, but not the width.  A given base object can have unlimited 
delta "children", and so on at each depth level.

> The single packfile for this (just gc'ed) Git repository is 37 MB.
> Much more than 3.8 MB + 2.5 MB = 6.3 MB.

What I'm saying is that most of that 37 MB - 6.3 MB = 31 MB is likely to 
be occupied by deltas.

> [cut]
> 
> There is another way which we can go to implement resumable clone.
> Let's git first try to clone whole repository (single pack; BTW what
> happens if this pack is larger than file size limit for given
> filesystem?).

We currently fail.  Seems that no one ever had a problem with that so 
far. We'd have to split the pack stream into multiple packs on the 
receiving end.  But frankly, if you have a repository large enough to 
bust your filesystem's file size limit then maybe you should seriously 
reconsider your choice of development environment.

> If it fails, client ask first for first half of of
> repository (half as in bisect, but it is server that has to calculate
> it).  If it downloads, it will ask server for the rest of repository.
> If it fails, it would reduce size in half again, and ask about 1/4 of
> repository in packfile first.

The problem people with slow links have won't be helped at all by this.  
What if the network connection gets broken only after 49% of the 
transfer, and that took 3 hours to download?  You'll attempt a 25% size 
transfer which would take 1.5 hours, despite the fact that you already 
spent that much time downloading that first 1/4 of the repository.  And 
what if you're unlucky and the network craps out on you after 23% of 
that second attempt?

I think it is better to "prime" the repository with the content of the 
top commit in the most straight forward manner using git-archive which 
has the potential to be fully restartable at any point with little 
complexity on the server side.


Nicolas


* Re: Continue git clone after interruption
  2009-08-18 20:01           ` Nicolas Pitre
@ 2009-08-18 21:02             ` Jakub Narebski
  2009-08-18 21:32               ` Nicolas Pitre
  2009-08-18 22:28             ` Johannes Schindelin
  1 sibling, 1 reply; 39+ messages in thread
From: Jakub Narebski @ 2009-08-18 21:02 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Tomasz Kontusz, git

On Tue, 18 Aug 2009, Nicolas Pitre wrote:
> On Tue, 18 Aug 2009, Jakub Narebski wrote:
>> Nicolas Pitre <nico@cam.org> writes:

>>> That won't buy you much.  You should realize that a pack is made of:
>>> 
>>> 1) Commit objects.  Yes they're all put together at the front of the pack,
>>>    but they roughly are the equivalent of:
>>> 
>>> 	git log --pretty=raw | gzip | wc -c
>>> 
>>>    For the Linux repo as of now that is around 32 MB.
>> 
>> For my clone of Git repository this gives 3.8 MB
>>  
>>> 2) Tree and blob objects.  Those are the bulk of the content for the top 
>>>    commit. [...]  You can estimate the size of this data with:
>>> 
>>> 	git archive --format=tar HEAD | gzip | wc -c
>>> 
>>>    On the same Linux repo this is currently 75 MB.
>> 
>> On the same Git repository this gives 2.5 MB
> 
> Interesting to see that the commit history is larger than the latest 
> source tree.  Probably that would be the same with the Linux kernel as 
> well if all versions since the beginning with adequate commit logs were 
> included in the repo.

Note that having a reflog and/or a patch management interface like StGit,
and frequently reworking commits (e.g. using rebase), means more commit
objects in the repository.

Also, the Git repository has 3 independent branches: 'man', 'html' and 'todo',
whose objects are not included in "git archive HEAD".

> 
>>> 3) Delta objects.  Those are making the rest of the pack, plus a couple 
>>>    tree/blob objects that were not found in the top commit and are 
>>>    different enough from any object in that top commit not to be 
>>>    represented as deltas.  Still, the majority of objects for all the 
>>>    remaining commits are delta objects.
>> 
>> You forgot that delta chains are bound by pack.depth limit, which
>> defaults to 50.  You would have then additional full objects.
> 
> Sure, but that's probably not significant.  the delta chain depth is 
> limited, but not the width.  A given base object can have unlimited 
> delta "children", and so on at each depth level.

You can probably get the number and size taken by delta and non-delta (base)
objects in the packfile somehow.  Neither "git verify-pack -v <packfile>"
nor contrib/stats/packinfo.pl helped me arrive at this data.

>> The single packfile for this (just gc'ed) Git repository is 37 MB.
>> Much more than 3.8 MB + 2.5 MB = 6.3 MB.
> 
> What I'm saying is that most of that 37 MB - 6.3 MB = 31 MB is likely to 
> be occupied by deltas.

True.
 
>> [cut]
>> 
>> There is another way which we can go to implement resumable clone.
>> Let's git first try to clone whole repository (single pack; BTW what
>> happens if this pack is larger than file size limit for given
>> filesystem?).
> 
> We currently fail.  Seems that no one ever had a problem with that so 
> far. We'd have to split the pack stream into multiple packs on the 
> receiving end.  But frankly, if you have a repository large enough to 
> bust your filesystem's file size limit then maybe you should seriously 
> reconsider your choice of development environment.

Do we fail gracefully (with an error message), or does git crash then?

If I remember correctly FAT28^W FAT32 has a maximum file size of 2 GB.
FAT is often used on SSDs and USB drives.  Although if you have a 2 GB
packfile, you are doing something wrong, or UGFWIINI (Using Git For
What It Is Not Intended).
 
>> If it fails, client ask first for first half of of
>> repository (half as in bisect, but it is server that has to calculate
>> it).  If it downloads, it will ask server for the rest of repository.
>> If it fails, it would reduce size in half again, and ask about 1/4 of
>> repository in packfile first.
> 
> Problem people with slow links have won't be helped at all with this.  
> What if the network connection gets broken only after 49% of the 
> transfer and that took 3 hours to download?  You'll attempt a 25% size 
> transfer which would take 1.5 hour despite the fact that you already 
> spent that much time downloading that first 1/4 of the repository 
> already.  And yet what if you're unlucky and now the network craps on 
> you after 23% of that second attempt?

A modification then.

First try an ordinary clone.  If it fails because the network is unreliable,
check how much we did download, and ask the server for a packfile of slightly
smaller size; this means that we are asking the server for an approximate
pack size limit, not for a bisect-like partitioning of the revision list.

> I think it is better to "prime" the repository with the content of the 
> top commit in the most straight forward manner using git-archive which 
> has the potential to be fully restartable at any point with little 
> complexity on the server side.

But wouldn't that make only a 2.5 MB part out of the 37 MB packfile fully restartable?

A question about pack protocol negotiation.  If the client presents some
objects as "have", the server can and does assume that the client has all 
prerequisites for such objects, e.g. for a tree object that it has
all objects for the files and directories inside that tree; for a commit it
means all ancestors and all objects in its snapshot (the top tree, and its 
prerequisites).  Do I understand this correctly?

If we have a partial packfile from an interrupted download, can we
extract some full objects from it (including blobs)?  Can we pass
tree and blob objects as "have" to the server, and is that taken into account?
Perhaps instead of a separate step of resumably downloading the top commit's
objects (its snapshot), we could pass to the server what we did download in
full?


BTW. because of compression it might be more difficult to resume 
archive creation in the middle, I think...

-- 
Jakub Narebski
Poland


* Re: Continue git clone after interruption
  2009-08-18 21:02             ` Jakub Narebski
@ 2009-08-18 21:32               ` Nicolas Pitre
  2009-08-19 15:19                 ` Jakub Narebski
  0 siblings, 1 reply; 39+ messages in thread
From: Nicolas Pitre @ 2009-08-18 21:32 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Tomasz Kontusz, git

On Tue, 18 Aug 2009, Jakub Narebski wrote:

> You can probably get number and size taken by delta and non-delta (base)
> objects in the packfile somehow.  Neither "git verify-pack -v <packfile>"
> nor contrib/stats/packinfo.pl did help me arrive at this data.

Documentation for verify-pack says:

|When specifying the -v option the format used is:
|
|        SHA1 type size size-in-pack-file offset-in-packfile
|
|for objects that are not deltified in the pack, and
|
|        SHA1 type size size-in-packfile offset-in-packfile depth base-SHA1
|
|for objects that are deltified.

So a simple script should be able to give you the answer.
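
For instance, something along the following lines, keyed on whether a line
carries the depth and base-SHA1 columns (the pack path is only a
placeholder):

        git verify-pack -v .git/objects/pack/pack-*.idx |
        awk '$2 ~ /^(commit|tree|blob|tag)$/ {
                if (NF >= 7) { nd++; dsum += $4 }   # deltified: has depth + base
                else         { nn++; bsum += $4 }   # not deltified
        } END {
                printf "non-delta: %d objects, %d bytes in pack\n", nn, bsum
                printf "delta:     %d objects, %d bytes in pack\n", nd, dsum
        }'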

> >> (BTW what happens if this pack is larger than file size limit for 
> >> given filesystem?).
> > 
> > We currently fail.  Seems that no one ever had a problem with that so 
> > far. We'd have to split the pack stream into multiple packs on the 
> > receiving end.  But frankly, if you have a repository large enough to 
> > bust your filesystem's file size limit then maybe you should seriously 
> > reconsider your choice of development environment.
> 
> Do we fail gracefully (with an error message), or does git crash then?

If the filesystem is imposing the limit, it will likely return an error 
on the write() call and we'll die().  If the machine has a too small 
off_t for the received pack then we also die("pack too large for current 
definition of off_t").

> If I remember correctly FAT28^W FAT32 has maximum file size of 2 GB.
> FAT is often used on SSD, on USB drive.  Although if you have  2 GB
> packfile, you are doing something wrong, or UGFWIINI (Using Git For
> What It Is Not Intended).

Hopefully you're not performing a 'git clone' off of a FAT filesystem.  
For physical transport you may repack with the appropriate switches.

> >> If it fails, client ask first for first half of of
> >> repository (half as in bisect, but it is server that has to calculate
> >> it).  If it downloads, it will ask server for the rest of repository.
> >> If it fails, it would reduce size in half again, and ask about 1/4 of
> >> repository in packfile first.
> > 
> > Problem people with slow links have won't be helped at all with this.  
> > What if the network connection gets broken only after 49% of the 
> > transfer and that took 3 hours to download?  You'll attempt a 25% size 
> > transfer which would take 1.5 hour despite the fact that you already 
> > spent that much time downloading that first 1/4 of the repository 
> > already.  And yet what if you're unlucky and now the network craps on 
> > you after 23% of that second attempt?
> 
> A modification then.
> 
> First try ordinary clone.  If it fails because network is unreliable,
> check how much we did download, and ask server for packfile of slightly
> smaller size; this means that we are asking server for approximate pack
> size limit, not for bisect-like partitioning revision list.

If the download didn't reach past the critical point (75 MB in my linux 
repo example) then you cannot validate the received data and you've 
wasted that much bandwidth.

> > I think it is better to "prime" the repository with the content of the 
> > top commit in the most straight forward manner using git-archive which 
> > has the potential to be fully restartable at any point with little 
> > complexity on the server side.
> 
> But didn't it make fully restartable 2.5 MB part out of 37 MB packfile?

The front of the pack is the critical point.  If you get enough to 
create the top commit then further transfers can be done incrementally 
with only the deltas between each commit.

> A question about pack protocol negotiation.  If clients presents some
> objects as "have", server can and does assume that client has all 
> prerequisites for such objects, e.g. for tree objects that it has
> all objects for files and directories inside tree; for commit it means
> all ancestors and all objects in snapshot (have top tree, and its 
> prerequisites).  Do I understand this correctly?

That works only for commits.

> If we have partial packfile which crashed during downloading, can we
> extract from it some full objects (including blobs)?  Can we pass
> tree and blob objects as "have" to server, and is it taken into account?

No.

> Perhaps instead of separate step of resumable-downloading of top commit
> objects (in snapshot), we can pass to server what we did download in
> full?

See above.

> BTW. because of compression it might be more difficult to resume 
> archive creation in the middle, I think...

Why so?  the tar+gzip format is streamable.


Nicolas


* Re: Continue git clone after interruption
  2009-08-18 20:01           ` Nicolas Pitre
  2009-08-18 21:02             ` Jakub Narebski
@ 2009-08-18 22:28             ` Johannes Schindelin
  2009-08-18 23:40               ` Nicolas Pitre
  1 sibling, 1 reply; 39+ messages in thread
From: Johannes Schindelin @ 2009-08-18 22:28 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Jakub Narebski, Tomasz Kontusz, git

Hi,

On Tue, 18 Aug 2009, Nicolas Pitre wrote:

> On Tue, 18 Aug 2009, Jakub Narebski wrote:
> 
> > There is another way which we can go to implement resumable clone. 
> > Let's git first try to clone whole repository (single pack; BTW what 
> > happens if this pack is larger than file size limit for given 
> > filesystem?).
> 
> We currently fail.  Seems that no one ever had a problem with that so 
> far.

They just went away, most probably.

But seriously, I miss a very important idea in this discussion: we control 
the Git source code.  So we _can_ add an upload_pack feature that a client 
can ask for after the first failed attempt.

Ciao,
Dscho


* Re: Continue git clone after interruption
  2009-08-18 22:28             ` Johannes Schindelin
@ 2009-08-18 23:40               ` Nicolas Pitre
  2009-08-19  7:35                 ` Johannes Schindelin
  0 siblings, 1 reply; 39+ messages in thread
From: Nicolas Pitre @ 2009-08-18 23:40 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Jakub Narebski, Tomasz Kontusz, git

On Wed, 19 Aug 2009, Johannes Schindelin wrote:

> Hi,
> 
> On Tue, 18 Aug 2009, Nicolas Pitre wrote:
> 
> > On Tue, 18 Aug 2009, Jakub Narebski wrote:
> > 
> > > There is another way which we can go to implement resumable clone. 
> > > Let's git first try to clone whole repository (single pack; BTW what 
> > > happens if this pack is larger than file size limit for given 
> > > filesystem?).
> > 
> > We currently fail.  Seems that no one ever had a problem with that so 
> > far.
> 
> They just went away, most probably.

Most probably they simply don't exist.  I would be highly surprised 
otherwise.

> But seriously, I miss a very important idea in this discussion: we control 
> the Git source code.  So we _can_ add a upload_pack feature that a client 
> can ask for after the first failed attempt.

Indeed.  So what do you think about my proposal?  It was included in my 
first reply to this thread.


Nicolas


* Re: Continue git clone after interruption
  2009-08-18 18:45         ` Jakub Narebski
  2009-08-18 20:01           ` Nicolas Pitre
@ 2009-08-19  4:42           ` Sitaram Chamarty
  2009-08-19  9:53             ` Jakub Narebski
  1 sibling, 1 reply; 39+ messages in thread
From: Sitaram Chamarty @ 2009-08-19  4:42 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Nicolas Pitre, Tomasz Kontusz, git

On Wed, Aug 19, 2009 at 12:15 AM, Jakub Narebski<jnareb@gmail.com> wrote:
> There is another way which we can go to implement resumable clone.
> Let's git first try to clone whole repository (single pack; BTW what
> happens if this pack is larger than file size limit for given
> filesystem?).  If it fails, client ask first for first half of of
> repository (half as in bisect, but it is server that has to calculate
> it).  If it downloads, it will ask server for the rest of repository.
> If it fails, it would reduce size in half again, and ask about 1/4 of
> repository in packfile first.

How about an extension where the user can *ask* for a clone of a
particular HEAD to be sent to him as a git bundle?  Or particular
revisions (say once a week) could be kept as a single-file git bundle,
made available over HTTP -- easily restartable with byte ranges -- and
anyone who has bandwidth problems first gets that, then changes the
origin remote URL and does a "pull" to get up to date?

I've done this manually a few times when sneakernet bandwidth was
better than the normal kind, heh, but it seems to me the lowest impact
solution.

Yes, you'd need some extra space on the server, but you keep only one
bundle, and maybe replace it every week by cron.  It should work fine
right now, as is, with a wee bit of manual work by the user and a
quick cron entry on the server.
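
Spelled out, the client side of that needs only commands that already
exist, assuming the server publishes a bundle over plain HTTP as described
(host and file names are placeholders):

        # resumable download of the bundle; -c continues a partial file
        wget -c http://host/project.bundle

        # clone from the bundle, then point origin at the live repository
        git clone project.bundle project
        cd project
        git config remote.origin.url git://host/project.git

        # fetch whatever has happened since the bundle was made
        git pull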


* Re: Continue git clone after interruption
  2009-08-18 23:40               ` Nicolas Pitre
@ 2009-08-19  7:35                 ` Johannes Schindelin
  2009-08-19  8:25                   ` Nguyen Thai Ngoc Duy
  2009-08-19 17:21                   ` Nicolas Pitre
  0 siblings, 2 replies; 39+ messages in thread
From: Johannes Schindelin @ 2009-08-19  7:35 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Jakub Narebski, Tomasz Kontusz, git


Hi,

On Tue, 18 Aug 2009, Nicolas Pitre wrote:

> On Wed, 19 Aug 2009, Johannes Schindelin wrote:
> 
> > But seriously, I miss a very important idea in this discussion: we 
> > control the Git source code.  So we _can_ add a upload_pack feature 
> > that a client can ask for after the first failed attempt.
> 
> Indeed.  So what do you think about my proposal?  It was included in my 
> first reply to this thread.

Did you not talk about an extension of the archive protocol?  That's not 
what I meant.  The archive protocol can be disabled for completely 
different reasons than to prevent restartable clones.

But you brought up an important point: shallow repositories.

Now, the problem, of course, is that if you cannot even get a single ref 
(shallow'ed to depth 0 -- which reminds me: I think I promised to fix 
that, but I did not do that yet) due to intermittent network failures, you 
are borked, as you said.

But here comes an idea: together with Nguyễn's sparse series, it is 
conceivable that we support a shallow & narrow clone via the upload-pack 
protocol (also making mithro happy).  The problem with narrow clones was 
not the pack generation side, that is done by a rev-list that can be 
limited to certain paths.  The problem was that we end up with missing 
tree objects.  However, if we can make a sparse checkout, we can avoid 
the problem.

Note: this is not well thought-through, but just a brainstorm-like answer 
to your ideas.

Ciao,
Dscho "who should shut up now and get some work done instead ;-)"


* Re: Continue git clone after interruption
  2009-08-19  7:35                 ` Johannes Schindelin
@ 2009-08-19  8:25                   ` Nguyen Thai Ngoc Duy
  2009-08-19  9:52                     ` Johannes Schindelin
  2009-08-19 17:21                   ` Nicolas Pitre
  1 sibling, 1 reply; 39+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2009-08-19  8:25 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Nicolas Pitre, Jakub Narebski, Tomasz Kontusz, git

On Wed, Aug 19, 2009 at 2:35 PM, Johannes
Schindelin<Johannes.Schindelin@gmx.de> wrote:
> But here comes an idea: together with Nguy要's sparse series, it is

FWIW, you can write "Nguyen" instead. It might save you one copy/paste
(I take it you don't have a Vietnamese IM ;-)

> conceivable that we support a shallow & narrow clone via the upload-pack
> protocol (also making mithro happy).  The problem with narrow clones was
> not the pack generation side, that is done by a rev-list that can be
> limited to certain paths.  The problem was that we end up with missing
> tree objects.  However, if we can make a sparse checkout, we can avoid
> the problem.

But then git-fsck, git-archive... will die?
-- 
Duy


* Re: Continue git clone after interruption
  2009-08-19  8:25                   ` Nguyen Thai Ngoc Duy
@ 2009-08-19  9:52                     ` Johannes Schindelin
  0 siblings, 0 replies; 39+ messages in thread
From: Johannes Schindelin @ 2009-08-19  9:52 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy; +Cc: Nicolas Pitre, Jakub Narebski, Tomasz Kontusz, git


Hi,

On Wed, 19 Aug 2009, Nguyen Thai Ngoc Duy wrote:

> On Wed, Aug 19, 2009 at 2:35 PM, Johannes
> Schindelin<Johannes.Schindelin@gmx.de> wrote:
> > But here comes an idea: together with Nguy要's sparse series, it is
> 
> FWIW, you can write "Nguyen" instead. It might save you one copy/paste 
> (I take it you don't have a Vietnamese IM ;-)

FWIW I originally wrote Nguyễn (not that Chinese(?) character)... I look 
it up every time I want to write your name by searching my address book for 
"pclouds". ;-)

> > conceivable that we support a shallow & narrow clone via the 
> > upload-pack protocol (also making mithro happy).  The problem with 
> > narrow clones was not the pack generation side, that is done by a 
> > rev-list that can be limited to certain paths.  The problem was that 
> > we end up with missing tree objects.  However, if we can make a sparse 
> > checkout, we can avoid the problem.
> 
> But then git-fsck, git-archive... will die?

Oh, but they should be made aware of the narrow clone, just like for 
shallow clones.

Ciao,
Dscho


* Re: Continue git clone after interruption
  2009-08-19  4:42           ` Sitaram Chamarty
@ 2009-08-19  9:53             ` Jakub Narebski
  0 siblings, 0 replies; 39+ messages in thread
From: Jakub Narebski @ 2009-08-19  9:53 UTC (permalink / raw)
  To: Sitaram Chamarty; +Cc: Nicolas Pitre, Tomasz Kontusz, git

On Wed, Aug 19, 2009, Sitaram Chamarty wrote:
> On Wed, Aug 19, 2009 at 12:15 AM, Jakub Narebski<jnareb@gmail.com> wrote:

> > There is another way which we can go to implement resumable clone.
> > Let's git first try to clone whole repository (single pack; BTW what
> > happens if this pack is larger than file size limit for given
> > filesystem?).  If it fails, client ask first for first half of of
> > repository (half as in bisect, but it is server that has to calculate
> > it).  If it downloads, it will ask server for the rest of repository.
> > If it fails, it would reduce size in half again, and ask about 1/4 of
> > repository in packfile first.
> 
> How about an extension where the user can *ask* for a clone of a
> particular HEAD to be sent to him as a git bundle?  Or particular
> revisions (say once a week) were kept as a single file git-bundle,
> made available over HTTP -- easily restartable with byte-range -- and
> anyone who has bandwidth problems first gets that, then changes the
> origin remote URL and does a "pull" to get uptodate?
> 
> I've done this manually a few times when sneakernet bandwidth was
> better than the normal kind, heh, but it seems to me the lowest impact
> solution.
> 
> Yes you'd need some extra space on the server, but you keep only one
> bundle, and maybe replace it every week by cron.  Should work fine
> right now, as is, with a wee bit of manual work by the user, and a
> quick cron entry on the server

This is a good idea, I think, and it can be implemented with varying
amounts of effort and changes to git, and varying degrees of seamless
integration.

1. Simplest solution: social (homepage).  Not integrated at all.

   On the project's homepage, the one that describes where the project
   repository is and how to get it, you add a link to the most recent bundle
   (perhaps in addition to the most recent snapshot).  This bundle would be
   served as a static file via HTTP (and perhaps also FTP) by (any) web
   server that supports resuming (range requests).  Or you can make the
   server generate bundles on demand, only when they are first requested.

   Most recent might mean latest tagged release, or it might mean daily
   snapshot^W bundle.

   This solution could be integrated into gitweb, either by a generic 
   'latest bundle' link in the project's README.html (or in the site's 
   GITWEB_HOMETEXT, default indextext.html), or by having gitweb
   generate those links (and perhaps the bundles as well) by itself.

2. Seamless solution: 'bundle' or 'bundles' capability.  Requires 
   changes to both server and client.

   If the server supports (advertises) the 'bundle' capability, it can serve
   a list of bundles (as HTTP / FTP / rsync URLs), either at the client's
   request, or after (or before) the list of refs if the client requests the
   'bundle' capability.

   If the client supports the 'bundle' capability, it terminates the 
   connection to sshd or git-daemon, and does an ordinary resumable HTTP
   fetch using libcurl.  After the bundle is fully downloaded, it clones
   from the bundle, and does a git-fetch against the same server as before,
   which would then have less to transfer.  The client also has to handle
   the situation where the bundle download is interrupted, and not clean up,
   allowing for "git clone --continue".

3. Seamless solution: GitTorrent or its simplification: git mirror-sync.

   I think that GitTorrent (see http://git.or.cz/gitwiki/SoC2009Ideas)
   or even its simplification git-mirror-sync would include restartable
   cloning.  It is even among its intended features.  Also, this would
   help to download faster via mirrors, which can have faster and better
   network connections.

   But this would be the most work.

You can implement solution 1. even now...
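
For example, the server side of solution 1 can be as small as one cron
entry (paths are purely illustrative); a web server that honours range
requests does the rest:

        # weekly entry in the git user's crontab: regenerate a static bundle
        # that gitweb or the project homepage can link to
        0 3 * * 0 git --git-dir=/srv/git/project.git bundle create /var/www/project.bundle --all HEAD
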
-- 
Jakub Narebski
Poland


* Re: Continue git clone after interruption
  2009-08-18 21:32               ` Nicolas Pitre
@ 2009-08-19 15:19                 ` Jakub Narebski
  2009-08-19 19:04                   ` Nicolas Pitre
  0 siblings, 1 reply; 39+ messages in thread
From: Jakub Narebski @ 2009-08-19 15:19 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Tomasz Kontusz, git

On Tue, 18 Aug 2009, Nicolas Pitre wrote:
> On Tue, 18 Aug 2009, Jakub Narebski wrote:
> 
>> You can probably get number and size taken by delta and non-delta (base)
>> objects in the packfile somehow.  Neither "git verify-pack -v <packfile>"
>> nor contrib/stats/packinfo.pl did help me arrive at this data.
> 
> Documentation for verify-pack says:
> 
> |When specifying the -v option the format used is:
> |
> |        SHA1 type size size-in-pack-file offset-in-packfile
> |
> |for objects that are not deltified in the pack, and
> |
> |        SHA1 type size size-in-packfile offset-in-packfile depth base-SHA1
> |
> |for objects that are deltified.
> 
> So a simple script should be able to give you the answer.

Thanks.

There are 114937 objects in this packfile, including 56249 objects
used as a base (they can be deltified or not).  git-verify-pack -v shows
that all objects have a total size-in-packfile of 33 MB (which agrees
with the packfile size of 33 MB), with 17 MB of size-in-packfile taken by
deltified objects, and 16 MB taken by base objects.

  git verify-pack -v | 
    grep -v "^chain" | 
    grep -v "objects/pack/pack-" > verify-pack.out

  sum=0; bsum=0; dsum=0; 
  while read sha1 type size packsize off depth base; do
    echo "$sha1" >> verify-pack.sha1.out
    sum=$(( $sum + $packsize ))
    if [ -n "$base" ]; then 
       echo "$sha1" >> verify-pack.delta.out
       dsum=$(( $dsum + $packsize ))
    else
       echo "$sha1" >> verify-pack.base.out
       bsum=$(( $bsum + $packsize ))
    fi
  done < verify-pack.out
  echo "sum=$sum; bsum=$bsum; dsum=$dsum"
 
>>>> (BTW what happens if this pack is larger than file size limit for 
>>>> given filesystem?).
[...]

>> If I remember correctly FAT28^W FAT32 has maximum file size of 2 GB.
>> FAT is often used on SSD, on USB drive.  Although if you have  2 GB
>> packfile, you are doing something wrong, or UGFWIINI (Using Git For
>> What It Is Not Intended).
> 
> Hopefully you're not performing a 'git clone' off of a FAT filesystem.  
> For physical transport you may repack with the appropriate switches.

Not off a FAT filesystem, but into a FAT filesystem.
 
[...]

>>> I think it is better to "prime" the repository with the content of the 
>>> top commit in the most straight forward manner using git-archive which 
>>> has the potential to be fully restartable at any point with little 
>>> complexity on the server side.
>> 
>> But didn't it make fully restartable 2.5 MB part out of 37 MB packfile?
> 
> The front of the pack is the critical point.  If you get enough to 
> create the top commit then further transfers can be done incrementally 
> with only the deltas between each commits.

How?  You have some objects that can be used as a base; how do you tell 
git-daemon that we have them (but not their prerequisites), and how do
you generate incrementals?

>> A question about pack protocol negotiation.  If clients presents some
>> objects as "have", server can and does assume that client has all 
>> prerequisites for such objects, e.g. for tree objects that it has
>> all objects for files and directories inside tree; for commit it means
>> all ancestors and all objects in snapshot (have top tree, and its 
>> prerequisites).  Do I understand this correctly?
> 
> That works only for commits.

Hmmmm... how do you intend for "prefetch top objects restartable-y first"
to work, then?
 
>> BTW. because of compression it might be more difficult to resume 
>> archive creation in the middle, I think...
> 
> Why so?  the tar+gzip format is streamable.

The gzip format uses a sliding window in compression.  "cat a b | gzip"
is different from "cat <(gzip -c a) <(gzip -c b)".

But that doesn't matter.  If we are interrupted in the middle, we can
uncompress what we have to check how far we got, and tell the server
to send the rest; this way the server wouldn't even have to generate 
(let alone send) what we already got in the partial transfer.
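
A sketch of that client-side check, assuming the partial transfer was
saved to a file instead of being extracted on the fly:

        # list what a truncated tar.gz already contains; both tools will hit
        # an unexpected end of file, which is fine for this purpose
        gzip -dc partial.tar.gz 2>/dev/null | tar -t 2>/dev/null | wc -l    # entries seen so far
        gzip -dc partial.tar.gz 2>/dev/null | tar -t 2>/dev/null | tail -1  # last, possibly incomplete, entry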

P.S. What do you think about 'bundle' capability extension mentioned
     in a side sub-thread?
-- 
Jakub Narebski
Poland


* Re: Continue git clone after interruption
  2009-08-19  7:35                 ` Johannes Schindelin
  2009-08-19  8:25                   ` Nguyen Thai Ngoc Duy
@ 2009-08-19 17:21                   ` Nicolas Pitre
  2009-08-19 22:23                     ` René Scharfe
  1 sibling, 1 reply; 39+ messages in thread
From: Nicolas Pitre @ 2009-08-19 17:21 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Jakub Narebski, Tomasz Kontusz, git


On Wed, 19 Aug 2009, Johannes Schindelin wrote:

> Hi,
> 
> On Tue, 18 Aug 2009, Nicolas Pitre wrote:
> 
> > On Wed, 19 Aug 2009, Johannes Schindelin wrote:
> > 
> > > But seriously, I miss a very important idea in this discussion: we 
> > > control the Git source code.  So we _can_ add a upload_pack feature 
> > > that a client can ask for after the first failed attempt.
> > 
> > Indeed.  So what do you think about my proposal?  It was included in my 
> > first reply to this thread.
> 
> Did you not talk about an extension of the archive protocol?  That's not 
> what I meant.  The archive protocol can be disabled for completely 
> different reasons than to prevent restartable clones.

And those reasons are?

> But you brought up an important point: shallow repositories.
> 
> Now, the problem, of course, is that if you cannot even get a single ref 
> (shallow'ed to depth 0 -- which reminds me: I think I promised to fix 
> that, but I did not do that yet) due to intermittent network failures, you 
> are borked, as you said.

Exact.

> But here comes an idea: together with Nguyễn's sparse series, it is 
> conceivable that we support a shallow & narrow clone via the upload-pack 
> protocol (also making mithro happy).  The problem with narrow clones was 
> not the pack generation side, that is done by a rev-list that can be 
> limited to certain paths.  The problem was that we end up with missing 
> tree objects.  However, if we can make a sparse checkout, we can avoid 
> the problem.

Sure, if you can salvage as much as you can from a partial pack and 
create a shallow and narrow clone out of it then it should be possible 
to do some restartable clone.  I still think this might be much less 
complex to achieve through git-archive, especially if some files, i.e. 
objects, are large enough to be exposed to network outages.  It is 
the same issue as being able to fetch at least one revision, but to 
a lesser degree.  You might be able to get that first revision through 
multiple attempts by gathering missing objects on each attempt.  But if 
you encounter an object large enough you then might be unlucky enough 
not to be able to transfer it all before the next network failure.

With a simple extension to git-archive, any object content could be 
resumed many times from any offset.  Then, deepening the history should 
make use of deltas through the pack protocol which should hopefully 
consist of much smaller transfers and therefore less prone to network 
outage.

That could be sketched like this, supposing the user runs
"git clone git://foo.bar/baz":

1) "git ini baz" etc. as usual.

2) "git ls-remote git://foo.bar/baz HEAD" and store the result in
   .git/CLONE_HEAD so as not to be confused by the remote HEAD possibly 
   changing before we're done.

3) "git archive --remote=git://foo.bar/baz CLONE_HEAD" and store the 
   result locally. Keep track of how many files are received, and how 
   many bytes for the currently received file.

4) If the network connection is broken, loop back to (3), adding
   --skip=${nr_files_received},${nr_bytes_in_curr_file_received} to
   the git-archive argument list.  The remote server simply skips over the 
   specified number of files and bytes into the next file.

5) Get content from remote commit object for CLONE_HEAD somehow. (?)

6) "git add . && git write-tree" and make sure the top tree SHA1 matches 
   the one in the commit from (5).

7) "git hash-object -w -t commit" with data obtained in (5), and make 
   sure it matches SHA1 from CLONE_HEAD.

8) Update local HEAD with CLONE_HEAD and set it up as a shallow clone.
   Delete .git/CLONE_HEAD.

9) Run "git fetch" with the --depth parameter to get more revisions.

Notes:

- This mode of operation should probably be optional, like by using 
  --safe or --restartable with 'git clone'.  And since this mode of 
  operation is really meant for people with slow and unreliable network 
  connections, they're unlikely to wish for the whole history to be 
  fetched.  Hence this mode could simply be triggered by the --depth 
  parameter to 'git clone' which would provide a clear depth value to 
  use in (9).

- If the transfer is interrupted locally with ^C then it should be 
  possible to resume it by noticing the presence of .git/CLONE_HEAD
  up front.  Determining how many files and bytes to skip when resuming 
  with git-archive can be done with $((`git ls-files -o | wc -l` - 1)) and
  $(wc -c < "$(git ls-files -o | tail -1)").

- It would probably be a good idea to have a tgz format for 'git 
  archive', which might be simpler to deal with than the zip format.

- Step (3) could be optimized in many ways, like by directly using 
  hash-object and update-index, or by using a filter to pipe the result 
  directly into fast-import.

- All this to say that the above should be pretty easy to implement even 
  with a shell script (roughly sketched below).  A builtin version could 
  then be made if this proves to actually be useful.  And the server 
  remains stateless, with no additional caching needed, which would 
  otherwise go against any attempt at making a busy server like 
  git.kernel.org share as much of the object store as possible between 
  plenty of mostly identical repositories.
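
A minimal, untested shell rendition of steps (1) to (4); everything here
exists today except the --skip option, which is the proposed extension:

        # (1) + (2): create the repository and pin the remote HEAD
        git init baz && cd baz
        git ls-remote git://foo.bar/baz HEAD | cut -f1 > .git/CLONE_HEAD

        # (3): stream the content of that commit straight into the work tree
        git archive --remote=git://foo.bar/baz --format=tar \
                $(cat .git/CLONE_HEAD) | tar -x

        # (4): after a broken connection, compute how much already arrived
        # and retry with the proposed (not yet existing) --skip option
        nr_files=$(( $(git ls-files -o | wc -l) - 1 ))
        nr_bytes=$(wc -c < "$(git ls-files -o | tail -1)")
        git archive --remote=git://foo.bar/baz --format=tar \
                --skip=$nr_files,$nr_bytes $(cat .git/CLONE_HEAD) | tar -x

        # (5)-(9) would then rebuild and verify the tree ("git add . &&
        # git write-tree"), store the commit object ("git hash-object -w
        # -t commit"), mark the result shallow, and deepen it with
        # "git fetch --depth=..." over the native protocol.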

> Note: this is not well thought-through, but just a brainstorm-like answer 
> to your ideas.

And so is the above.


Nicolas


* Re: Continue git clone after interruption
  2009-08-19 15:19                 ` Jakub Narebski
@ 2009-08-19 19:04                   ` Nicolas Pitre
  2009-08-19 19:42                     ` Jakub Narebski
  0 siblings, 1 reply; 39+ messages in thread
From: Nicolas Pitre @ 2009-08-19 19:04 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Tomasz Kontusz, git

On Wed, 19 Aug 2009, Jakub Narebski wrote:

> There are 114937 objects in this packfile, including 56249 objects
> used as base (can be deltified or not).  git-verify-pack -v shows
> that all objects have total size-in-packfile of 33 MB (which agrees
> with packfile size of 33 MB), with 17 MB size-in-packfile taken by
> deltaified objects, and 16 MB taken by base objects.
> 
>   git verify-pack -v | 
>     grep -v "^chain" | 
>     grep -v "objects/pack/pack-" > verify-pack.out
> 
>   sum=0; bsum=0; dsum=0; 
>   while read sha1 type size packsize off depth base; do
>     echo "$sha1" >> verify-pack.sha1.out
>     sum=$(( $sum + $packsize ))
>     if [ -n "$base" ]; then 
>        echo "$sha1" >> verify-pack.delta.out
>        dsum=$(( $dsum + $packsize ))
>     else
>        echo "$sha1" >> verify-pack.base.out
>        bsum=$(( $bsum + $packsize ))
>     fi
>   done < verify-pack.out
>   echo "sum=$sum; bsum=$bsum; dsum=$dsum"

Your object classification is misleading.  Because an object has no 
base, that doesn't mean it is necessarily a base itself.  You'd have to 
store $base into a separate file and then sort it and remove duplicates 
to know the actual number of base objects.  What you have right now is 
strictly delta objects and non-delta objects. And base objects can 
themselves be delta objects already of course.

Also... my git repo after 'git gc --aggressive' contains a pack which 
size is 22 MB.  Your script tells me:

sum=22930254; bsum=14142012; dsum=8788242

and:

   29558 verify-pack.base.out
   82043 verify-pack.delta.out
  111601 verify-pack.out
  111601 verify-pack.sha1.out

meaning that I have 111601 total objects, of which 29558 are non-deltas 
occupying 14 MB and 82043 are deltas occupying 8 MB.  That certainly 
shows how deltas are space efficient.  And with a minor modification to 
your script, I know that 44985 objects are actually used as a delta 
base.  So, on average, each base is responsible for nearly 2 deltas.
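
That minor modification can be as small as pulling out the base column
(field 7 of the deltified lines in the verify-pack output quoted earlier)
and counting the distinct values:

        # objects actually used as a delta base
        awk 'NF >= 7 { print $7 }' verify-pack.out | sort -u | wc -l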

> >>>> (BTW what happens if this pack is larger than file size limit for 
> >>>> given filesystem?).
> [...]
> 
> >> If I remember correctly FAT28^W FAT32 has maximum file size of 2 GB.
> >> FAT is often used on SSD, on USB drive.  Although if you have  2 GB
> >> packfile, you are doing something wrong, or UGFWIINI (Using Git For
> >> What It Is Not Intended).
> > 
> > Hopefully you're not performing a 'git clone' off of a FAT filesystem.  
> > For physical transport you may repack with the appropriate switches.
> 
> Not off a FAT filesystem, but into a FAT filesystem.

That's what I meant, sorry.  My point still stands.

> > The front of the pack is the critical point.  If you get enough to 
> > create the top commit then further transfers can be done incrementally 
> > with only the deltas between each commits.
> 
> How?  You have some objects that can be used as base; how to tell 
> git-daemon that we have them (but not theirs prerequisites), and how
> to generate incrementals?

Just the same as when you perform a fetch to update your local copy of a 
remote branch: you tell the remote about the commit you have and the one 
you want, and git-repack will create delta objects for the commit you 
want against similar objects from the commit you already have, and skip 
those objects from the commit you want that are already included in the 
commit you have.
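
Roughly, the negotiation underneath looks like this (a heavily 
simplified sketch of the pack protocol exchange):

	# C: want <sha1 of the commit we want>
	# C: have <sha1 of the commit we already have>
	# C: done
	# S: ACK/NAK, then a pack containing only the missing objects,
	#    deltified against objects the client said it has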

> >> A question about pack protocol negotiation.  If clients presents some
> >> objects as "have", server can and does assume that client has all 
> >> prerequisites for such objects, e.g. for tree objects that it has
> >> all objects for files and directories inside tree; for commit it means
> >> all ancestors and all objects in snapshot (have top tree, and its 
> >> prerequisites).  Do I understand this correctly?
> > 
> > That works only for commits.
> 
> Hmmmm... how do you intent for "prefetch top objects restartable-y first"
> to work, then?

See my latest reply to dscho (you were in CC already).

> >> BTW. because of compression it might be more difficult to resume 
> >> archive creation in the middle, I think...
> > 
> > Why so?  the tar+gzip format is streamable.
> 
> gzip format uses sliding window in compression.  "cat a b | gzip"
> is different from "cat <(gzip a) <(gzip b)".
> 
> But that doesn't matter.  If we are interrupted in the middle, we can
> uncompress what we have to check how far did we get, and tell server
> to send the rest; this way server wouldn't have to even generate 
> (but not send) what we get as partial transfer.

You got it.
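
For instance, a client could measure how far it got along these lines 
(a sketch only; GNU tar/gzip assumed, and the bookkeeping for a file cut 
off mid-stream is left out):

	gzip -dc partial.tar.gz 2>/dev/null > partial.tar   # decompress what arrived
	tar -tf partial.tar 2>/dev/null | wc -l             # complete entries so far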

> P.S. What do you think about 'bundle' capability extension mentioned
>      in a side sub-thread?

I don't like it.  Reason is that it forces the server to be (somewhat) 
stateful by having to keep track of those bundles and cycle them, and it 
doubles the disk usage by having one copy of the repository in the form 
of the original pack(s) and another copy as a bundle.

Of course, the idea of having a cron job generating a bundle and 
offering it for download through HTTP or the like is fine if people are 
OK with that, and that requires zero modifications to git.  But I don't 
think that is a solution that scales.

If you think about git.kernel.org, which has maybe hundreds of 
repositories, the big majority of which are actually forks of Linus' 
own repository, then having all those forks reference Linus' repository 
is a big disk space saver (and an IO saver too, as the referenced 
repository is likely to remain cached in memory).  Having a bundle ready 
for each of them would simply kill that space advantage, unless they all 
share the same bundle.

Now sharing that common bundle could be done of course, but that makes 
things yet more complex while still wasting IO because some requests 
will hit the common pack and some others will hit the bundle, making 
less efficient usage of the disk cache on the server.

Yet, that bundle would probably not contain the latest revision if it is 
only periodically updated, even less so if it is shared between multiple 
repositories as outlined above.  And what people with slow/unreliable 
network links are probably most interested in is the latest revision and 
maybe a few older revisions, but probably not the whole repository as 
that is simply too long to wait for.  Hence having a big bundle is not 
flexible either with regards to the actual data transfer size.

Hence having a restartable git-archive service to create the top 
revision, with the ability to cheaply (in terms of network bandwidth) 
deepen the history afterwards, is probably the most straightforward way 
to achieve that.  The server need not be aware of separate bundles, etc.  
And the shared object store still works as usual with the same cached IO 
whether the data is needed for a traditional fetch or a "git archive" 
operation.

Why "git archive"?  Because its content is well defined.  So if you give 
it a commit SHA1 you will always get the same stream of bytes (after 
decompression) since the way git sort files is strictly defined.  It is 
therefore easy to tell a remote "git archive" instance that we want the 
content for commit xyz but that we already got n files already, and that 
the last file we've got has m bytes.  There is simply no confusion about 
what we've got already, unlike with a partial pack which might need 
yet-to-be-received objects in order to make sense of what has been 
already received.  The server simply has to skip that many files and 
resume the transfer at that point, independently of the compression or 
even the archive format.
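
In terms of commands, the resume would look something like this (the 
--skip option is hypothetical, i.e. not in git today, and the two 
counters are whatever the client tracked locally):

	git archive --remote=git://host/project.git HEAD > partial.archive
	# ...the connection drops; later, ask only for what is missing:
	git archive --remote=git://host/project.git \
		--skip=$nr_files_received,$nr_bytes_in_curr_file HEAD >> partial.archive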


Nicolas

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Continue git clone after interruption
  2009-08-19 19:04                   ` Nicolas Pitre
@ 2009-08-19 19:42                     ` Jakub Narebski
  2009-08-19 21:13                       ` Nicolas Pitre
  0 siblings, 1 reply; 39+ messages in thread
From: Jakub Narebski @ 2009-08-19 19:42 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Tomasz Kontusz, git, Johannes Schindelin

Cc-ed Dscho, so he can participate in this subthread more easily.

On Wed, 19 Aug 2009, Nicolas Pitre wrote:
> On Wed, 19 Aug 2009, Jakub Narebski wrote:

> > P.S. What do you think about 'bundle' capability extension mentioned
> >      in a side sub-thread?
> 
> I don't like it.  Reason is that it forces the server to be (somewhat) 
> stateful by having to keep track of those bundles and cycle them, and it 
> doubles the disk usage by having one copy of the repository in the form 
> of the original pack(s) and another copy as a bundle.

I agree about the problems with disk usage, but I disagree about the server
having to be stateful; the server can simply scan for bundles and
offer links to them if the client requests a 'bundles' capability, somewhere
around the initial git-ls-remote list of refs.
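
Something like this, I imagine (entirely hypothetical; no 'bundles' 
capability exists today):

	# S: <sha1> HEAD\0multi_ack side-band-64k bundles ...   (ref advertisement)
	# C: want <sha1> ... bundles
	# S: bundle-url http://host/project/current.bundle
	# C: downloads the bundle over resumable HTTP, then finishes with
	#    a normal incremental fetch on top of it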

> Of course, the idea of having a cron job generating a bundle and 
> offering it for download through HTTP or the like is fine if people are 
> OK with that, and that requires zero modifications to git.  But I don't 
> think that is a solution that scales.

Well, offering a daily bundle in addition to the daily snapshot could be
a good practice, at least until git acquires resumable fetch (resumable
clone).
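
E.g. something as simple as this in the cron job that already builds the
snapshot (paths made up; git-bundle itself exists today):

	cd /srv/git/project.git &&
	git bundle create /var/www/snapshots/project-daily.bundle --all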

> 
> If you think about git.kernel.org which has maybe hundreds of 
> repositories where the big majority of them are actually forks of Linus' 
> own repository, then having all those forks reference Linus' repository 
> is a big disk space saver (and IO too as the referenced repository is 
> likely to remain cached in memory).  Having a bundle ready for each of 
> them will simply kill that space advantage, unless they all share the 
> same bundle.

I am thinking about sharing the same bundle for related projects.

> 
> Now sharing that common bundle could be done of course, but that makes 
> things yet more complex while still wasting IO because some requests 
> will hit the common pack and some others will hit the bundle, making 
> less efficient usage of the disk cache on the server.

Hmmm... true (unless the bundles are on a separate server).

> 
> Yet, that bundle would probably not contain the latest revision if it is 
> only periodically updated, even less so if it is shared between multiple 
> repositories as outlined above.  And what people with slow/unreliable 
> network links are probably most interested in is the latest revision and 
> maybe a few older revisions, but probably not the whole repository as 
> that is simply too long to wait for.  Hence having a big bundle is not 
> flexible either with regards to the actual data transfer size.

I agree that a bundle would be useful for a restartable clone, and not
useful for a restartable fetch.  Well, unless you count (non-existent)
GitTorrent / git-mirror-sync as such a solution... ;-)

> 
> Hence having a restartable git-archive service to create the top 
> revision with the ability to cheaply (in terms of network bandwidth) 
> deepen the history afterwards is probably the most straight forward way 
> to achieve that.  The server needs no be aware of separate bundles, etc.  
> And the shared object store still works as usual with the same cached IO 
> whether the data is needed for a traditional fetch or a "git archive" 
> operation.

It's the "cheaply deepen history" that I doubt would be easy.  This is
the most difficult part, I think (see also below).

> 
> Why "git archive"?  Because its content is well defined.  So if you give 
> it a commit SHA1 you will always get the same stream of bytes (after 
> decompression) since the way git sort files is strictly defined.  It is 
> therefore easy to tell a remote "git archive" instance that we want the 
> content for commit xyz but that we already got n files already, and that 
> the last file we've got has m bytes.  There is simply no confusion about 
> what we've got already, unlike with a partial pack which might need 
> yet-to-be-received objects in order to make sense of what has been 
> already received.  The server simply has to skip that many files and 
> resume the transfer at that point, independently of the compression or 
> even the archive format.

Let's reiterate it to check if I understand it correctly:


Any "restartable clone" / "resumable fetch" solution must begin with
a file which is rock-solid stable wrt. reproductability given the same
parameters.  git-archive has this feature, packfile doesn't (so I guess
that bundle also doesn't, unless it was cached / saved on disk).

It would be useful if it were possible to generate part of this rock-solid
file for a partial (range, resume) request, without needing to generate 
(calculate) the parts the client has already downloaded.  Otherwise the
server has to either waste disk space and IO on caching, or waste CPU
(and IO) on generating the part which is not needed and dropping it to
/dev/null.  git-archive, you say, has this feature.

Next you need to tell the server that you have the objects obtained via
the resumable download part ("git archive HEAD" in your proposal), and
that it can use them and need not include them in the prepared file/pack.
"have" is limited to commits, and "have <sha1>" tells the server that
you have <sha1> and all its prerequisites (dependencies).  You can't 
use "have <sha1>" with the git-archive solution.  I don't know enough
about the 'shallow' capability (and what it enables) to know whether
it can be used for that.  Can you elaborate?

Then you have to finish the clone / fetch.  All solutions so far include
some kind of incremental step.  My first proposal of bisect-fetching
1/nth or a predefined-size pack is a bottom-up solution, where
we build the full clone from the root commits up.  You propose, from what
I understand, building the full clone from the top commit down, using
deepening from a shallow clone.  In this step you either get the full
incremental or not; downloading an incremental (from what I understand)
is not resumable / does not support partial fetch.

Do I understand this correctly?
-- 
Jakub Narebski
Poland

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Continue git clone after interruption
  2009-08-19 19:42                     ` Jakub Narebski
@ 2009-08-19 21:13                       ` Nicolas Pitre
  2009-08-20  0:26                         ` Sam Vilain
  2009-08-20  7:37                         ` Jakub Narebski
  0 siblings, 2 replies; 39+ messages in thread
From: Nicolas Pitre @ 2009-08-19 21:13 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Tomasz Kontusz, git, Johannes Schindelin

On Wed, 19 Aug 2009, Jakub Narebski wrote:

> Cc-ed Dscho, so he can easier participate in this subthread.
> 
> On Wed, 19 Aug 2009, Nicolas Pitre wrote:
> > On Wed, 19 Aug 2009, Jakub Narebski wrote:
> 
> > > P.S. What do you think about 'bundle' capability extension mentioned
> > >      in a side sub-thread?
> > 
> > I don't like it.  Reason is that it forces the server to be (somewhat) 
> > stateful by having to keep track of those bundles and cycle them, and it 
> > doubles the disk usage by having one copy of the repository in the form 
> > of the original pack(s) and another copy as a bundle.
> 
> I agree about problems with disk usage, but I disagree about server
> having to be stateful; server can just simply scan for bundles, and
> offer links to them if client requests 'bundles' capability, somewhere
> around initial git-ls-remote list of refs.

But then it's the client that has to deal with whatever the server wants 
to offer, instead of the server actually serving data as the client wants it.

> Well, offering daily bundle in addition to daily snapshot could be
> a good practice, at least until git acquires resumable fetch (resumable
> clone).

Outside of Git: maybe.  Through the git protocol: no.  And what would 
that bundle contain over the daily snapshot?  The whole history?  If so 
that goes against the idea that people concerned by all this have slow 
links and probably aren't interested in the time to download it all.  If 
the bundle contains only the top revision then it has no advantage over 
the snapshot.  Somewhere in the middle?  Sure, but then where to draw 
the line?  That's for the client to decide, not the server 
administrator.

And what if you start your slow transfer and it breaks in the middle?  
The next morning you want to restart it in the hope that you might 
resume the transfer of the incomplete bundle.  But crap, the 
server has updated its bundle and your half-bundle is now useless. 
You've wasted your bandwidth for nothing.

> > If you think about git.kernel.org which has maybe hundreds of 
> > repositories where the big majority of them are actually forks of Linus' 
> > own repository, then having all those forks reference Linus' repository 
> > is a big disk space saver (and IO too as the referenced repository is 
> > likely to remain cached in memory).  Having a bundle ready for each of 
> > them will simply kill that space advantage, unless they all share the 
> > same bundle.
> 
> I am thinking about sharing the same bundle for related projects.

... meaning more administrative burden.

> > Now sharing that common bundle could be done of course, but that makes 
> > things yet more complex while still wasting IO because some requests 
> > will hit the common pack and some others will hit the bundle, making 
> > less efficient usage of the disk cache on the server.
> 
> Hmmm... true (unless bundles are on separate server).

... meaning additional but avoidable costs.

> > Yet, that bundle would probably not contain the latest revision if it is 
> > only periodically updated, even less so if it is shared between multiple 
> > repositories as outlined above.  And what people with slow/unreliable 
> > network links are probably most interested in is the latest revision and 
> > maybe a few older revisions, but probably not the whole repository as 
> > that is simply too long to wait for.  Hence having a big bundle is not 
> > flexible either with regards to the actual data transfer size.
> 
> I agree that bundle would be useful for restartable clone, and not
> useful for restartable fetch.  Well, unless you count (non-existing)
> GitTorrent / git-mirror-sync as this solution... ;-)

I don't think fetches after a clone are such an issue.  They are 
typically transfers orders of magnitude smaller than the initial 
clone.  The same goes for fetches to deepen a shallow clone, which are in 
fact fetches going back in history instead of forward.  I still stand 
by my assertion that bundles are suboptimal for a restartable clone.

As for GitTorrent / git-mirror-sync... those are still vaporware to me 
and I therefore have doubts about their actual feasibility.  So no, I 
don't count on them.

> > Hence having a restartable git-archive service to create the top 
> > revision with the ability to cheaply (in terms of network bandwidth) 
> > deepen the history afterwards is probably the most straight forward way 
> > to achieve that.  The server needs no be aware of separate bundles, etc.  
> > And the shared object store still works as usual with the same cached IO 
> > whether the data is needed for a traditional fetch or a "git archive" 
> > operation.
> 
> It's the "cheaply deepen history" that I doubt would be easy.  This is
> the most difficult part, I think (see also below).

Don't think so.  Try this:

	mkdir test
	cd test
	git init
	git fetch --depth=1 git://git.kernel.org/pub/scm/git/git.git

Result:

remote: Counting objects: 1824, done.
remote: Compressing objects: 100% (1575/1575), done.
Receiving objects: 100% (1824/1824), 3.01 MiB | 975 KiB/s, done.
remote: Total 1824 (delta 299), reused 1165 (delta 180)
Resolving deltas: 100% (299/299), done.
From git://git.kernel.org/pub/scm/git/git
 * branch            HEAD       -> FETCH_HEAD

You'll get the very latest revision for HEAD, and only that.  The size 
of the transfer will be roughly the size of a daily snapshot, except it 
is fully up to date.  It is however non resumable in the event of a 
network outage.  My proposal is to replace this with a "git archive" 
call.  It won't get all branches, but for the purpose of initialising 
one's repository that should be good enough.  And the "git archive" can 
be fully resumable as I explained.

Now to deepen that history.  Let's say you want 10 more revisions going 
back; then you simply perform the fetch again with --depth=10.  Right 
now it doesn't seem to work optimally, but the pack that is then being 
sent could be made of deltas against objects found in the commits we 
already have.  Currently it seems that a pack is created which also 
includes those objects we already have in addition to those we want, 
which is IMHO a flaw in the shallow support that shouldn't be too hard 
to fix.  Each level of deepening should then be as small as the standard 
fetches going forward when updating the repository with new revisions.
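
Continuing the example above, deepening would be just:

	git fetch --depth=10 git://git.kernel.org/pub/scm/git/git.git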

> > Why "git archive"?  Because its content is well defined.  So if you give 
> > it a commit SHA1 you will always get the same stream of bytes (after 
> > decompression) since the way git sort files is strictly defined.  It is 
> > therefore easy to tell a remote "git archive" instance that we want the 
> > content for commit xyz but that we already got n files already, and that 
> > the last file we've got has m bytes.  There is simply no confusion about 
> > what we've got already, unlike with a partial pack which might need 
> > yet-to-be-received objects in order to make sense of what has been 
> > already received.  The server simply has to skip that many files and 
> > resume the transfer at that point, independently of the compression or 
> > even the archive format.
> 
> Let's reiterate it to check if I understand it correctly:
> 
> Any "restartable clone" / "resumable fetch" solution must begin with
> a file which is rock-solid stable wrt. reproductability given the same
> parameters.  git-archive has this feature, packfile doesn't (so I guess
> that bundle also doesn't, unless it was cached / saved on disk).

Right.

> It would be useful if it was possible to generate part of this rock-solid
> file for partial (range, resume) request, without need to generate 
> (calculate) parts that client already downloaded.  Otherwise server has
> to either waste disk space and IO for caching, or waste CPU (and IO)
> on generating part which is not needed and dropping it to /dev/null.
> git-archive you say has this feature.

"Could easily have" is more appropriate.

> Next you need to tell server that you have those objects got using
> resumable download part ("git archive HEAD" in your proposal), and
> that it can use them and do not include them in prepared file/pack.
> "have" is limited to commits, and "have <sha1>" tells server that
> you have <sha1> and all its prerequisites (dependences).  You can't 
> use "have <sha1>" with git-archive solution.  I don't know enough
> about 'shallow' capability (and what it enables) to know whether
> it can be used for that.  Can you elaborate?

See above, or Documentation/technical/shallow.txt.

> Then you have to finish clone / fetch.  All solutions so far include
> some kind of incremental improvements.  My first proposal of bisect
> fetching 1/nth or predefined size pack is buttom-up solution, where
> we build full clone from root commits up.  You propose, from what
> I understand build full clone from top commit down, using deepening
> from shallow clone.  In this step you either get full incremental
> or not; downloading incremental (from what I understand) is not
> resumable / they do not support partial fetch.

Right.  However, like I said, the incremental part should be much 
smaller and therefore less susceptible to network troubles.


Nicolas

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Continue git clone after interruption
  2009-08-19 17:21                   ` Nicolas Pitre
@ 2009-08-19 22:23                     ` René Scharfe
  0 siblings, 0 replies; 39+ messages in thread
From: René Scharfe @ 2009-08-19 22:23 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Johannes Schindelin, Jakub Narebski, Tomasz Kontusz, git

Nicolas Pitre schrieb:
> 3) "git archive --remote=git://foo.bar/baz CLONE_HEAD" and store the 
>    result locally. Keep track of how many files are received, and how 
>    many bytes for the currently received file.
> 
> 4) if network connection is broken, loop back to (3) adding
>    --skip=${nr_files_received},${nr_bytes_in_curr_file_received} to
>    the git-archive argument list.  REmote server simply skips over 
>    specified number of files and bytes into the next file.
> 
> 5) Get content from remote commit object for CLONE_HEAD somehow. (?)

[...]

> - That probably would be a good idea to have a tgz format to 'git
>   archive' which might be simpler to deal with than the zip format.

Adding support for the tgz format would be useful anyway, I guess, and
is easy to implement.
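
Today the same result is obtained with a pipe, which is essentially what 
a built-in tgz format would wrap (host name made up):

	git archive --format=tar --remote=git://host/project.git HEAD |
		gzip > project.tar.gz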

And adding support for cpio (and cpio.gz) and writing an extractor for
it should be simpler than writing a tar extractor alone.

One needs to take a closer look at the limits of the chosen archive
format (file name length, supported file types and attributes, etc.) to
make sure any archive can be turned back into the same git tree.

The commit object could be sent as the first (fake) file of the archive.

You'd need a way to turn off the effect of the attributes export-subst
and export-ignore.

Currently, convert_to_working_tree() is used on the contents of all
files in an archive.  You'd need a way to turn that off, too.

Adding a new format type is probably the easiest way to bundle the
special requirements of the previous three paragraphs.

René

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Continue git clone after interruption
  2009-08-19 21:13                       ` Nicolas Pitre
@ 2009-08-20  0:26                         ` Sam Vilain
  2009-08-20  7:37                         ` Jakub Narebski
  1 sibling, 0 replies; 39+ messages in thread
From: Sam Vilain @ 2009-08-20  0:26 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Jakub Narebski, Tomasz Kontusz, git, Johannes Schindelin

On Wed, 2009-08-19 at 17:13 -0400, Nicolas Pitre wrote:
> > It's the "cheaply deepen history" that I doubt would be easy.  This is
> > the most difficult part, I think (see also below).
> 
> Don't think so.  Try this:
> 
> 	mkdir test
> 	cd test
> 	git init
> 	git fetch --depth=1 git://git.kernel.org/pub/scm/git/git.git
> 
> REsult:
> 
> remote: Counting objects: 1824, done.
> remote: Compressing objects: 100% (1575/1575), done.
> Receiving objects: 100% (1824/1824), 3.01 MiB | 975 KiB/s, done.
> remote: Total 1824 (delta 299), reused 1165 (delta 180)
> Resolving deltas: 100% (299/299), done.
> From git://git.kernel.org/pub/scm/git/git
>  * branch            HEAD       -> FETCH_HEAD
> 
> You'll get the very latest revision for HEAD, and only that.  The size 
> of the transfer will be roughly the size of a daily snapshot, except it 
> is fully up to date.  It is however non resumable in the event of a 
> network outage.  My proposal is to replace this with a "git archive" 
> call.  It won't get all branches, but for the purpose of initialising 
> one's repository that should be good enough.  And the "git archive" can 
> be fully resumable as I explained.
> 
> Now to deepen that history.  Let's say you want 10 more revisions going 
> back then you simply perform the fetch again with a --depth=10.  Right 
> now it doesn't seem to work optimally, but the pack that is then being 
> sent could be made of deltas against objects found in the commits we 
> already have.  Currently it seems that a pack that also includes those 
> objects we already have in addition to those we want is created, which 
> is IMHO a flaw in the shallow support that shouldn't be too hard to fix.  
> Each level of deepening should then be as small as standard fetches 
> going forward when updating the repository with new revisions.

Nicolas, apart from starting with the most recent commits and working
backwards, this is very similar to the "bundle slicing" idea defined in
GitTorrent.  What the GitTorrent research project has so far achieved is
defining a slicing algorithm, and figuring out how well slicing works,
in terms of wasted bandwidth.

If you do it right, then you can support download spreading across
mirrors, too.  Eg, given a starting point, a 'slice size' - which I
based on uncompressed object size but could as well be based on commit
count - and a slice number to fetch, you should be able to look up in
the revision list index the revisions to select and then make a thin
pack corresponding to those commits.  Currently creating this index is
the slowest part of creating bundle fragments in my Perl implementation.
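
For a single slice that boils down to something like this (a sketch; 
$slice_base and $slice_tip stand for the boundary commits the index 
hands back):

	{
		echo "$slice_tip"
		echo "^$slice_base"
	} | git pack-objects --revs --thin --stdout > slice.pack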

Once Nick Edelen's project is mergeable, we will have a mechanism for
drawing up a manifest of objects for these slices relatively quickly.

So how much bandwidth is lost?

Eg, for git.git, taking the complete object list, slicing it into 1024k
(uncompressed) bundle slices, and making thin packs from those slices we
get:

Generating index...
Length is 1291327524, 1232 blocks
Slice #0: 1050390 => 120406 (11%)
Slice #1: 1058162 => 124978 (11%)
Slice #2: 1049858 => 104363 (9%)
...
Slice #51: 1105090 => 43140 (3%)
Slice #52: 1091282 => 45367 (4%)
Slice #53: 1067675 => 39792 (3%)
...
Slice #211: 1086238 => 25451 (2%)
Slice #212: 1055705 => 31294 (2%)
Slice #213: 1059460 => 7767 (0%)
...
Slice #1129: 1109209 => 38182 (3%)
Slice #1130: 1125925 => 29829 (2%)
Slice #1131: 1120203 => 14446 (1%)
Final slice: 623055 => 49345
Overall compressed: 39585851
Calculating Repository bundle size...
Counting objects: 107369, done.
Compressing objects: 100% (28059/28059), done.
Writing objects: 100% (107369/107369), 29.20 MiB | 48321 KiB/s, done.
Total 107369 (delta 78185), reused 106770 (delta 77609)
Bundle size: 30638967
Overall inefficiency: 29%

In the above output, the first figure is the complete un-delta'd,
uncompressed size of the slice - that is, the size of all of the new
objects that the slice's commits introduce.  The second figure is the full
size of a thin pack with those objects in it.  I.e. the above tells me that
in git.git there are 1.2GB of uncompressed objects.  Each slice ends up
varying in size between about 10k and 200k, but most of the slices end
up between 15k and 50k.

Actually the test script was thrown off by a loose root and that added
about 3MB to the compressed size, so the overall inefficiency with this
block size is actually more like 20%.  I think I am running into the
flaw you mention above, too, especially when I do a larger block size
run:

Generating index...
Length is 1291327524, 62 blocks
Slice #0: 21000218 => 1316165 (6%)
Slice #1: 20988208 => 1107636 (5%)
...
Slice #59: 21102776 => 1387722 (6%)
Slice #60: 20974960 => 876648 (4%)
Final slice: 6715954 => 261218
Overall compressed: 50071857
Calculating Repository bundle size...
Counting objects: 107369, done.
Compressing objects: 100% (28059/28059), done.
Writing objects: 100% (107369/107369), 29.20 MiB | 48353 KiB/s, done.
Total 107369 (delta 78185), reused 106770 (delta 77609)
Bundle size: 30638967
Overall inefficiency: 63%

Somehow, even though we made fewer, larger packs, the total packed size 
came out larger.

Trying with 100MB "blocks" I get:

Generating index...
Length is 1291327524, 13 blocks
Slice #0: 104952661 => 4846553 (4%)
Slice #1: 104898188 => 2830056 (2%)
Slice #2: 105007998 => 2856535 (2%)
Slice #3: 104909972 => 2583402 (2%)
Slice #4: 104909440 => 2187708 (2%)
Slice #5: 104859786 => 2555686 (2%)
Slice #6: 104873317 => 2358914 (2%)
Slice #7: 104881597 => 2183894 (2%)
Slice #8: 104863418 => 3555224 (3%)
Slice #9: 104896599 => 3192564 (3%)
Slice #10: 104876697 => 3895707 (3%)
Slice #11: 104903491 => 3731555 (3%)
Final slice: 32494360 => 1270887
Overall compressed: 38048685
Calculating Repository bundle size...
Counting objects: 107369, done.
Compressing objects: 100% (28059/28059), done.
Writing objects: 100% (107369/107369), 29.20 MiB | 48040 KiB/s, done.
Total 107369 (delta 78185), reused 106770 (delta 77609)
Bundle size: 30638967
Overall inefficiency: 24%

In the above, we broke the git.git download into 13 partial downloads of
a few meg each, at the expense of an extra 24% of download.

Anyway I've hopefully got more to add to this but this will do for a
starting point.

Sam

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Continue git clone after interruption
  2009-08-19 21:13                       ` Nicolas Pitre
  2009-08-20  0:26                         ` Sam Vilain
@ 2009-08-20  7:37                         ` Jakub Narebski
  2009-08-20  7:48                           ` Nguyen Thai Ngoc Duy
                                             ` (2 more replies)
  1 sibling, 3 replies; 39+ messages in thread
From: Jakub Narebski @ 2009-08-20  7:37 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Tomasz Kontusz, git, Johannes Schindelin

On Wed, 19 Aug 2009, Nicolas Pitre wrote:
> On Wed, 19 Aug 2009, Jakub Narebski wrote:
> > 
> > On Wed, 19 Aug 2009, Nicolas Pitre wrote:
> > > On Wed, 19 Aug 2009, Jakub Narebski wrote:

[...]
> > > Yet, that bundle would probably not contain the latest revision if it is 
> > > only periodically updated, even less so if it is shared between multiple 
> > > repositories as outlined above.  And what people with slow/unreliable 
> > > network links are probably most interested in is the latest revision and 
> > > maybe a few older revisions, but probably not the whole repository as 
> > > that is simply too long to wait for.  Hence having a big bundle is not 
> > > flexible either with regards to the actual data transfer size.
> > 
> > I agree that bundle would be useful for restartable clone, and not
> > useful for restartable fetch.  Well, unless you count (non-existing)
> > GitTorrent / git-mirror-sync as this solution... ;-)
> 
> I don't think fetches after a clone are such an issue.  They are 
> typically transfers being orders of magnitude smaller than the initial 
> clone.  Same goes for fetches to deepen a shallow clone which are in 
> fact fetches going back in history instead of forward.  I still stands 
> by my assertion that bundles are suboptimal for a restartable clone.

They are good in a pinch as a workaround for the lack of resumable clone,
but I agree that they are not an optimal solution, because  a) like a
packfile, they cannot guarantee that the same arguments (endpoints) would
generate the same file, and  b) you currently can't generate only the
resumed part of a bundle (and it would probably be difficult to add that).

> As for GitTorrent / git-mirror-sync... those are still vaporwares to me 
> and I therefore have doubts about their actual feasability. So no, I 
> don't count on them.

Well... there is _some_ code.

> > > Hence having a restartable git-archive service to create the top 
> > > revision with the ability to cheaply (in terms of network bandwidth) 
> > > deepen the history afterwards is probably the most straight forward way 
> > > to achieve that.  The server needs no be aware of separate bundles, etc.  
> > > And the shared object store still works as usual with the same cached IO 
> > > whether the data is needed for a traditional fetch or a "git archive" 
> > > operation.
> > 
> > It's the "cheaply deepen history" that I doubt would be easy.  This is
> > the most difficult part, I think (see also below).
> 
> Don't think so.  Try this:
> 
> 	mkdir test
> 	cd test
> 	git init
> 	git fetch --depth=1 git://git.kernel.org/pub/scm/git/git.git
> 
> Result:
> 
> remote: Counting objects: 1824, done.
> remote: Compressing objects: 100% (1575/1575), done.
> Receiving objects: 100% (1824/1824), 3.01 MiB | 975 KiB/s, done.
> remote: Total 1824 (delta 299), reused 1165 (delta 180)
> Resolving deltas: 100% (299/299), done.
> From git://git.kernel.org/pub/scm/git/git
>  * branch            HEAD       -> FETCH_HEAD
> 
> You'll get the very latest revision for HEAD, and only that.  The size 
> of the transfer will be roughly the size of a daily snapshot, except it 
> is fully up to date.  It is however non resumable in the event of a 
> network outage.  My proposal is to replace this with a "git archive" 
> call.  It won't get all branches, but for the purpose of initialising 
> one's repository that should be good enough.  And the "git archive" can 
> be fully resumable as I explained.

It is however only 2.5 MB out of 37 MB that is resumable, which is 7%
(well, that of course depends on the repository).  Not that much of it is
resumable, then.

> Now to deepen that history.  Let's say you want 10 more revisions going 
> back then you simply perform the fetch again with a --depth=10.  Right 
> now it doesn't seem to work optimally, but the pack that is then being 
> sent could be made of deltas against objects found in the commits we 
> already have.  Currently it seems that a pack that also includes those 
> objects we already have in addition to those we want is created, which 
> is IMHO a flaw in the shallow support that shouldn't be too hard to fix.  
> Each level of deepening should then be as small as standard fetches 
> going forward when updating the repository with new revisions.

You would have the same (or at least quite similar) problems with the
deepening part (the 'incrementals' transfer part) as you found with my
first proposal of server-side bisection / division of the rev-list,
serving 1/Nth of the revisions (where N is selected so the packfile is of
reasonable size) to the client as incrementals.  Yours is a top-down,
mine was a bottom-up approach to sending a series of smaller packs.  The
problem is how to select the size of the incrementals, and that
incrementals are all-or-nothing (but see also the comment below).

In the proposal using git-archive and shallow-clone deepening as
incrementals you have this small seed (how small depends on the
repository: 50% - 5%) which is resumable.  And presumably with deepening
you can somehow make some use of an incomplete packfile, only part of
which was transferred before a network error / disconnect, and even tell
the server about objects which you managed to extract from *.pack.part.

> > > Why "git archive"?  Because its content is well defined.  So if you give 
> > > it a commit SHA1 you will always get the same stream of bytes (after 
> > > decompression) since the way git sort files is strictly defined.  It is 
> > > therefore easy to tell a remote "git archive" instance that we want the 
> > > content for commit xyz but that we already got n files already, and that 
> > > the last file we've got has m bytes.  There is simply no confusion about 
> > > what we've got already, unlike with a partial pack which might need 
> > > yet-to-be-received objects in order to make sense of what has been 
> > > already received.  The server simply has to skip that many files and 
> > > resume the transfer at that point, independently of the compression or 
> > > even the archive format.
> > 
> > Let's reiterate it to check if I understand it correctly:
> > 
> > Any "restartable clone" / "resumable fetch" solution must begin with
> > a file which is rock-solid stable wrt. reproductability given the same
> > parameters.  git-archive has this feature, packfile doesn't (so I guess
> > that bundle also doesn't, unless it was cached / saved on disk).
> 
> Right.

*NEW IDEA*

Another solution would be to try to come up with some sort of stable
sorting of objects, so that a packfile generated for the same parameters
(endpoints) would always be byte-for-byte the same.  But that might be
difficult, or even impossible.

Well, we could send the client the list of objects in the pack, in the
order used later for pack creation (a non-resumable but small part), and
if the packfile transport were interrupted in the middle, the client would
compare the list of complete objects in the partial packfile against this
manifest and send the server a request with a *sorted* list of the objects
it doesn't have yet.  The server would probably have to check the validity
of the object list first (the list might need to be more than just a list
of objects; it might need to specify the topology of the deltas, i.e.
which objects are the base for which ones).  Then it would generate the
rest of the packfile.
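
In protocol terms I imagine something like this (none of it exists; the
command names are invented):

	# S: manifest <sha1> <sha1> ...   (object ids in pack order, possibly
	#    with delta-base information)
	# S: <packfile bytes>             -- the connection dies part way in
	# C: salvages the complete objects from the partial pack and diffs
	#    them against the manifest
	# C: rest-want <sorted list of missing object ids>
	# S: validates the list, then streams the rest of the pack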
 
> > It would be useful if it was possible to generate part of this rock-solid
> > file for partial (range, resume) request, without need to generate 
> > (calculate) parts that client already downloaded.  Otherwise server has
> > to either waste disk space and IO for caching, or waste CPU (and IO)
> > on generating part which is not needed and dropping it to /dev/null.
> > git-archive you say has this feature.
> 
> "Could easily have" is more appropriate.

O.K.  And I can see how this can be easily done.

> > Next you need to tell server that you have those objects got using
> > resumable download part ("git archive HEAD" in your proposal), and
> > that it can use them and do not include them in prepared file/pack.
> > "have" is limited to commits, and "have <sha1>" tells server that
> > you have <sha1> and all its prerequisites (dependences).  You can't 
> > use "have <sha1>" with git-archive solution.  I don't know enough
> > about 'shallow' capability (and what it enables) to know whether
> > it can be used for that.  Can you elaborate?
> 
> See above, or Documentation/technical/shallow.txt.
 
Documentation/technical/shallow.txt doesn't cover the "shallow",
"unshallow" and "deepen" commands from the 'shallow' capability extension
to the git pack protocol (http://git-scm.com/gitserver.txt).

> > Then you have to finish clone / fetch.  All solutions so far include
> > some kind of incremental improvements.  My first proposal of bisect
> > fetching 1/nth or predefined size pack is buttom-up solution, where
> > we build full clone from root commits up.  You propose, from what
> > I understand build full clone from top commit down, using deepening
> > from shallow clone.  In this step you either get full incremental
> > or not; downloading incremental (from what I understand) is not
> > resumable / they do not support partial fetch.
> 
> Right.  However, like I said, the incremental part should be much 
> smaller and therefore less susceptible to network troubles.

If the resumable git-archive part is 7% of the total pack size, how small
do you plan to make those incremental deepenings?  Besides, in my 1/Nth
proposal those bottom-up packs were also meant to be sufficiently small
to avoid network troubles.


P.S. As you can see implementing resumable clone isn't easy...

-- 
Jakub Narebski
Poland

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Continue git clone after interruption
  2009-08-20  7:37                         ` Jakub Narebski
@ 2009-08-20  7:48                           ` Nguyen Thai Ngoc Duy
  2009-08-20  8:23                             ` Jakub Narebski
  2009-08-20 18:41                           ` Nicolas Pitre
  2009-08-20 22:57                           ` Sam Vilain
  2 siblings, 1 reply; 39+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2009-08-20  7:48 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Nicolas Pitre, Tomasz Kontusz, git, Johannes Schindelin

On Thu, Aug 20, 2009 at 2:37 PM, Jakub Narebski<jnareb@gmail.com> wrote:
> *NEW IDEA*
>
> Another solution would be to try to come up with some sort of stable
> sorting of objects so that packfile generated for the same parameters
> (endpoints) would be always byte-for-byte the same.  But that might be
> difficult, or even impossible.

Isn't that the idea of commit reels [1] from GitTorrent?

[1] http://gittorrent.utsl.gen.nz/rfc.html#org-reels
-- 
Duy

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Continue git clone after interruption
  2009-08-20  7:48                           ` Nguyen Thai Ngoc Duy
@ 2009-08-20  8:23                             ` Jakub Narebski
  0 siblings, 0 replies; 39+ messages in thread
From: Jakub Narebski @ 2009-08-20  8:23 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy
  Cc: Nicolas Pitre, Tomasz Kontusz, git, Johannes Schindelin

Dnia czwartek 20. sierpnia 2009 09:48, Nguyen Thai Ngoc Duy napisał:
> On Thu, Aug 20, 2009 at 2:37 PM, Jakub Narebski<jnareb@gmail.com> wrote:

> > *NEW IDEA*
> >
> > Another solution would be to try to come up with some sort of stable
> > sorting of objects so that packfile generated for the same parameters
> > (endpoints) would be always byte-for-byte the same.  But that might be
> > difficult, or even impossible.
> 
> Isn't it the idea of commit reels [1] from git-torrent?
> 
> [1] http://gittorrent.utsl.gen.nz/rfc.html#org-reels

Well, I didn't mean that this idea hadn't occurred to anybody else.
What I meant was that it is a new idea within this subthread.

-- 
Jakub Narebski
Poland

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Continue git clone after interruption
  2009-08-20  7:37                         ` Jakub Narebski
  2009-08-20  7:48                           ` Nguyen Thai Ngoc Duy
@ 2009-08-20 18:41                           ` Nicolas Pitre
  2009-08-21 10:07                             ` Jakub Narebski
  2009-08-20 22:57                           ` Sam Vilain
  2 siblings, 1 reply; 39+ messages in thread
From: Nicolas Pitre @ 2009-08-20 18:41 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Tomasz Kontusz, git, Johannes Schindelin

On Thu, 20 Aug 2009, Jakub Narebski wrote:

> On Wed, 19 Aug 2009, Nicolas Pitre wrote:
> > You'll get the very latest revision for HEAD, and only that.  The size 
> > of the transfer will be roughly the size of a daily snapshot, except it 
> > is fully up to date.  It is however non resumable in the event of a 
> > network outage.  My proposal is to replace this with a "git archive" 
> > call.  It won't get all branches, but for the purpose of initialising 
> > one's repository that should be good enough.  And the "git archive" can 
> > be fully resumable as I explained.
> 
> It is however only 2.5 MB out of 37 MB that are resumable, which is 7%
> (well, that of course depends on repository).  Not that much that is
> resumable.

Take the Linux kernel then.  It is more like 75 MB.

> > Now to deepen that history.  Let's say you want 10 more revisions going 
> > back then you simply perform the fetch again with a --depth=10.  Right 
> > now it doesn't seem to work optimally, but the pack that is then being 
> > sent could be made of deltas against objects found in the commits we 
> > already have.  Currently it seems that a pack that also includes those 
> > objects we already have in addition to those we want is created, which 
> > is IMHO a flaw in the shallow support that shouldn't be too hard to fix.  
> > Each level of deepening should then be as small as standard fetches 
> > going forward when updating the repository with new revisions.
> 
> You would have the same (or at least quite similar) problems with 
> deepening part (the 'incrementals' transfer part) as you found with my
> first proposal of server bisection / division of rev-list, and serving
> 1/Nth of revisions (where N is selected so packfile is reasonable) to
> client as incrementals.  Yours is top-down, mine was bottom-up approach
> to sending series of smaller packs.  The problem is how to select size
> of incrementals, and that incrementals are all-or-nothing (but see also
> comment below).

Yes and no.  Combined with a slight reordering of commit objects, it 
could be possible to receive a partial pack and still be able to extract 
a bunch of full revisions.  The biggest issue is transferring the first 
revision x (75 MB for Linux); revision x-1 then usually requires only a 
few kilobytes, revision x-2 another few kilobytes, etc.  Remember that 
you are likely to have only a few deltas from one revision to another, 
which is not the case for the very first revision you get.  A special 
mode for pack-objects could place commit objects only after all the 
objects needed to create that revision.  So once you get a commit object 
on the receiving end, you could assume that all objects reachable from 
that commit have already been received, or you had them locally already.
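
The receiving end could then salvage along these lines (a sketch of the
idea only; nothing of it is implemented):

	# 1. scan the partial pack object by object until the data runs out
	# 2. note the last commit object that arrived completely, call it C
	# 3. by construction, every object reachable from C has already
	#    arrived, so resuming becomes an ordinary shallow-deepening
	#    fetch with C as a "have"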

> In proposal using git-archive and shallow clone deepening as incrementals
> you have this small seed (how small it depends on repository: 50% - 5%)
> which is resumable.  And presumably with deepening you can somehow make
> some use from incomplete packfile, only part of which was transferred 
> before network error / disconnect.  And even tell server about objects
> which you managed to extract from *.pack.part.

Yes.  And at that point resuming the transfer is just another case of 
shallow repository deepening.

> *NEW IDEA*
> 
> Another solution would be to try to come up with some sort of stable
> sorting of objects so that packfile generated for the same parameters
> (endpoints) would be always byte-for-byte the same.  But that might be
> difficult, or even impossible.

And I don't want to commit to that either.  Having some flexibility in 
object ordering makes it possible to improve on the packing heuristics.  
We certainly should avoid imposing strong restrictions like that for 
little gain.  Even the deltas are likely to differ from one 
request to another when using threads, as one thread might get 
more CPU time than another, slightly modifying the outcome.

> Well, we could send the list of objects in pack in order used later by
> pack creation to client (non-resumable but small part), and if packfile
> transport was interrupted in the middle client would compare list of 
> complete objects in part of packfile against this manifest, and sent
> request to server with *sorted* list of object it doesn't have yet.

Well... actually that's one of the items for pack v4.  Lots of SHA1s are 
duplicated in tree and commit objects, in addition to the pack index 
file.  With pack v4 all those SHA1s would be stored only once in a table 
and objects would index that table instead.

Still, that is not _that_ small.  Just look at the size of the 
pack index file for the Linux repository to give you an idea.

> Server would probably have to check validity of objects list first (the
> object list might be needed to be more than just object list; it might
> need to specify topology of deltas, i.e. which objects are base for which
> ones).  Then it would generate rest of packfile.

I'm afraid that has the looks of something adding lots of complexity to 
a piece of git that is already quite complex, namely 
pack-objects.  And there are only a few individuals who have their 
brains around it.

> > > It would be useful if it was possible to generate part of this rock-solid
> > > file for partial (range, resume) request, without need to generate 
> > > (calculate) parts that client already downloaded.  Otherwise server has
> > > to either waste disk space and IO for caching, or waste CPU (and IO)
> > > on generating part which is not needed and dropping it to /dev/null.
> > > git-archive you say has this feature.
> > 
> > "Could easily have" is more appropriate.
> 
> O.K.  And I can see how this can be easy done.
> 
> > > Next you need to tell server that you have those objects got using
> > > resumable download part ("git archive HEAD" in your proposal), and
> > > that it can use them and do not include them in prepared file/pack.
> > > "have" is limited to commits, and "have <sha1>" tells server that
> > > you have <sha1> and all its prerequisites (dependences).  You can't 
> > > use "have <sha1>" with git-archive solution.  I don't know enough
> > > about 'shallow' capability (and what it enables) to know whether
> > > it can be used for that.  Can you elaborate?
> > 
> > See above, or Documentation/technical/shallow.txt.
>  
> Documentation/technical/shallow.txt doesn't cover "shallow", "unshallow"
> and "deepen" commands from 'shallow' capability extension to git pack
> protocol (http://git-scm.com/gitserver.txt).

404 Not Found

Maybe that should be committed to git in Documentation/technical/  as 
well?

> > > Then you have to finish clone / fetch.  All solutions so far include
> > > some kind of incremental improvements.  My first proposal of bisect
> > > fetching 1/nth or predefined size pack is buttom-up solution, where
> > > we build full clone from root commits up.  You propose, from what
> > > I understand build full clone from top commit down, using deepening
> > > from shallow clone.  In this step you either get full incremental
> > > or not; downloading incremental (from what I understand) is not
> > > resumable / they do not support partial fetch.
> > 
> > Right.  However, like I said, the incremental part should be much 
> > smaller and therefore less susceptible to network troubles.
> 
> If you have 7% total pack size of git-archive resumable part, how small
> do you plan to have those incremental deepening?  Besides in my 1/Nth
> proposal those bottom-up packs werealso meant to be sufficiently small
> to avoid network troubles.

Two issues here: 1) people with slow links might not be interested in a 
deep history as it costs them time.  2) Extra revisions should typically 
require only a few KB each, therefore we might manage to ask for the 
full history after the initial revision is downloaded and salvage as 
much as we can if a network outage is encountered.  There is no need for 
an arbitrary size, unless the user arbitrarily decides to get only 10 more 
revisions, or 100 more, etc.

> P.S. As you can see implementing resumable clone isn't easy...

I've been saying that all along for quite a while now.   ;-)


Nicolas

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Continue git clone after interruption
  2009-08-20  7:37                         ` Jakub Narebski
  2009-08-20  7:48                           ` Nguyen Thai Ngoc Duy
  2009-08-20 18:41                           ` Nicolas Pitre
@ 2009-08-20 22:57                           ` Sam Vilain
  2 siblings, 0 replies; 39+ messages in thread
From: Sam Vilain @ 2009-08-20 22:57 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: Nicolas Pitre, Tomasz Kontusz, git, Johannes Schindelin, nick edelen

On Thu, 2009-08-20 at 09:37 +0200, Jakub Narebski wrote:
> You would have the same (or at least quite similar) problems with 
> deepening part (the 'incrementals' transfer part) as you found with my
> first proposal of server bisection / division of rev-list, and serving
> 1/Nth of revisions (where N is selected so packfile is reasonable) to
> client as incrementals.  Yours is top-down, mine was bottom-up approach
> to sending series of smaller packs.  The problem is how to select size
> of incrementals, and that incrementals are all-or-nothing (but see also
> comment below).

I've defined a way to do this which doesn't have the complexity of
bisect in GitTorrent, making the compromise that you can't guarantee
each chunk is exactly the same size... I'll have a crack at doing it
based on the rev-cache code in C instead of the horrendously slow
Perl/Berkeley solution I have at the moment to see how well it fares.

> Another solution would be to try to come up with some sort of stable
> sorting of objects so that packfile generated for the same parameters
> (endpoints) would be always byte-for-byte the same.  But that might be
> difficult, or even impossible.

Delta compression is not repeatable enough for this.

The first version of GitTorrent assumed that it would be an appropriate
solution.

So, first you have to sort the objects - that's fine, --date-order is a
good starting point - and I reasoned that interleaving each commit's new
objects with the commit objects would be a useful sort order.  You also
need a tie-break for commits with the same commit date; I just used the
SHA-1 of the commit for that.  Finally, when making packs, you have to
make sure that they are "thin" packs in order to avoid excessive
transfer.

Currently, thin packs can only work starting at the beginning of history
and working forward, which is opposite to what happens most of the time
in packs.  I think this is the source of much of the inefficiency caused
by chopping up the object lists mentioned in my other e-mail.  It might
be possible, if you could also know which earlier objects were using
this object as a delta base, to try delta'ing against all those objects
and see which one results in the smallest delta.

> Well, we could send the list of objects in pack in order used later by
> pack creation to client (non-resumable but small part), and if packfile
> transport was interrupted in the middle client would compare list of 
> complete objects in part of packfile against this manifest, and sent
> request to server with *sorted* list of object it doesn't have yet.
> Server would probably have to check validity of objects list first (the
> object list might be needed to be more than just object list; it might
> need to specify topology of deltas, i.e. which objects are base for which
> ones).  Then it would generate rest of packfile.

Mmm.  It's a bit chatty, that.  Object lists add another 10-20% on top,
which I think should be avoidable once the thin-pack problem, plus the
problem of some objects ending up in more than one of the thin packs
that are created, is reduced to very little.

Sam

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Continue git clone after interruption
  2009-08-20 18:41                           ` Nicolas Pitre
@ 2009-08-21 10:07                             ` Jakub Narebski
  2009-08-21 10:26                               ` Matthieu Moy
  2009-08-21 21:07                               ` Nicolas Pitre
  0 siblings, 2 replies; 39+ messages in thread
From: Jakub Narebski @ 2009-08-21 10:07 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Tomasz Kontusz, git, Johannes Schindelin, Scott Chacon

On Thu, 20 Aug 2009, Nicolas Pitre wrote:
> On Thu, 20 Aug 2009, Jakub Narebski wrote:
>> On Wed, 19 Aug 2009, Nicolas Pitre wrote:

>>> You'll get the very latest revision for HEAD, and only that.  The size 
>>> of the transfer will be roughly the size of a daily snapshot, except it 
>>> is fully up to date.  It is however non resumable in the event of a 
>>> network outage.  My proposal is to replace this with a "git archive" 
>>> call.  It won't get all branches, but for the purpose of initialising 
>>> one's repository that should be good enough.  And the "git archive" can 
>>> be fully resumable as I explained.
>> 
>> It is however only 2.5 MB out of 37 MB that are resumable, which is 7%
>> (well, that of course depends on repository).  Not that much that is
>> resumable.
> 
> Take the Linux kernel then.  It is more like 75 MB.

Ah... good example.

On the other hand Linux is fairly large project in terms of LoC, but
it had its history cut when moving to Git, so the ratio of git-archive
of HEAD to the size of packfile is overemphasized here.

>>> Now to deepen that history.  Let's say you want 10 more revisions going 
>>> back then you simply perform the fetch again with a --depth=10.  Right 
>>> now it doesn't seem to work optimally, but the pack that is then being 
>>> sent could be made of deltas against objects found in the commits we 
>>> already have.  Currently it seems that a pack that also includes those 
>>> objects we already have in addition to those we want is created, which 
>>> is IMHO a flaw in the shallow support that shouldn't be too hard to fix.  
>>> Each level of deepening should then be as small as standard fetches 
>>> going forward when updating the repository with new revisions.
>> 
>> You would have the same (or at least quite similar) problems with 
>> deepening part (the 'incrementals' transfer part) as you found with my
>> first proposal of server bisection / division of rev-list, and serving
>> 1/Nth of revisions (where N is selected so packfile is reasonable) to
>> client as incrementals.  Yours is top-down, mine was bottom-up approach
>> to sending series of smaller packs.  The problem is how to select size
>> of incrementals, and that incrementals are all-or-nothing (but see also
>> comment below).
> 
> Yes and no.  Combined with a slight reordering of commit objects, it 
> could be possible to receive a partial pack and still be able to extract 
> a bunch of full revisions.  The biggest issue is to be able to transfer 
> revision x (75 MB for Linux), but revision x-1 usually requires only a 
> few kilobytes, revision x-2 a few other kilobytes, etc.  Remember that 
> you are likely to have only a few deltas from one revision to another, 
> which is not the case for the very first revision you get.

Let me reiterate, to be sure that I understand this correctly:


You make use here of a few facts:

1. Objects in a packfile are _usually_ sorted in recency order, with the
   most recent commits, and the most recent versions of trees and tags,
   being at the front of the pack file and being base objects for a large
   set of objects.  Note the "usually" part; it is not set in stone as it
   is for the RCS (and CVS) reverse-delta based repository format.

2. There is support in the git pack format to do 'deepening' of a shallow
   clone, which means that git can generate incrementals in top-down
   order, _similar to how objects are ordered in a packfile_.

3. git-archive output is stable.  _git-archive can be made resumable_
   (with range/partial requests), and it can be made able to create a
   single-head, depth-0 shallow clone.

Also, with a top-down deepening order, even if you don't use 
'git clone --continue' but 'git clone --skip' (or something), you
would still end up with a usable shallow clone.  In the most extreme case,
when you are able to get only the fully resumable part, i.e. the
git-archive part (with the top commit), you would have a single-branch,
depth-0 shallow clone (not very usable, but still better than nothing).

> A special 
> mode to pack-object could place commit objects only after all the 
> objects needed to create that revision.  So once you get a commit object 
> on the receiving end, you could assume that all objects reachable from 
> that commit are already received, or you had them locally already.

Yes, with such a mode (which I think wouldn't reduce / interfere with the
ability of upload-pack to pack more tightly by reordering objects
and choosing different deltas) it would be easy to do a salvage of
a partially completed / transferred packfile.  Even if there is no
extension to tell the git server which objects we have ("have" is only
about commits), if there is at least one commit object in the received
part of the packfile, we can try to continue from a later point;
there is less left to download.

> 
>> In proposal using git-archive and shallow clone deepening as incrementals
>> you have this small seed (how small it depends on repository: 50% - 5%)
>> which is resumable.  And presumably with deepening you can somehow make
>> some use from incomplete packfile, only part of which was transferred 
>> before network error / disconnect.  And even tell server about objects
>> which you managed to extract from *.pack.part.
> 
> yes.  And at that point resuming the transfer is just another case of 
> shallow repository deepening.

Also, for the deepening top-down incrementals in your proposal you can
have a 'salvage' operation which tries to use something out of a partially
transferred packfile (a partially downloaded incremental).  That is not, 
I think, the case with my earlier 'server bisect' bottom-up incrementals
idea.

> 
>> *NEW IDEA*
>> 
>> Another solution would be to try to come up with some sort of stable
>> sorting of objects so that packfile generated for the same parameters
>> (endpoints) would be always byte-for-byte the same.  But that might be
>> difficult, or even impossible.
> 
> And I don't want to commit to that either.  Having some flexibility in 
> object ordering makes it possible to improve on the packing heuristics.  
> We certainly should avoid imposing strong restrictions like that for 
> little gain.  Even the deltas are likely to be different from one 
> request to another when using threads as one thread might be getting 
> more CPU time than another slightly modifying the outcome.

Right.

>> Well, we could send the list of objects in pack in order used later by
>> pack creation to client (non-resumable but small part), and if packfile
>> transport was interrupted in the middle client would compare list of 
>> complete objects in part of packfile against this manifest, and sent
>> request to server with *sorted* list of object it doesn't have yet.
> 
> Well... actually that's one of the item for pack V4.  Lots of SHA1s are 
> duplicated in tree and commit objects, in addition to the pack index 
> file.  With pack v4 all those SHA1s would be stored only once in a table 
> and objects would index that table instead.
> 
> Still, that is not _that_ small though.  Just look at the size of the 
> pack index file for the Linux repository to give you an idea.

Well, such a plan (map) of a packfile wouldn't be much smaller than the
pack index, so it has the same problem if the pack index file is large
enough (or the connection crappy enough) that it couldn't be transferred
without interruption.

Nevertheless, the 34 MB index for the largest 310 MB packfile in the
Linux kernel repository
http://www.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git/objects/pack/
isn't something very large.  And the objects list / plan of the packfile
would be of comparable size.
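
As a quick way to sanity-check that comparison locally, the index size is
easy to look at:

	du -h .git/objects/pack/*.idx    # roughly the size such an objects list would be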
 

I was envisioning the packfile plan (packfile map) as something like this:

  sha1 TERMINATOR

for objects that are not deltified in the packfile, and

  sha1 SEPARATOR base-sha1 TERMINATOR

for objects that are deltified (or something like that; we could use
pkt-line format instead).

>> Server would probably have to check validity of objects list first (the
>> object list might be needed to be more than just object list; it might
>> need to specify topology of deltas, i.e. which objects are base for which
>> ones).  Then it would generate rest of packfile.
> 
> I'm afraid that has the looks of something adding lots of complexity to 
> a piece of git that is already quite complex already, namely 
> pack-objects.  And there is already only a few individuals with their 
> brain around it.

Well, with the added complexity, or the server CPU/IO because one should not 
trust the client (unfortunately), with the 'packfile plan' transfer being 
non-resumable, and also with requiring pack v4, or a temporary file, or 
memory to send the packfile plan (packfile map)... I think we can scrap 
that half-baked idea.

>>>> [...] I don't know enough
>>>> about 'shallow' capability (and what it enables) to know whether
>>>> it can be used for that.  Can you elaborate?
>>> 
>>> See above, or Documentation/technical/shallow.txt.
>>  
>> Documentation/technical/shallow.txt doesn't cover "shallow", "unshallow"
>> and "deepen" commands from 'shallow' capability extension to git pack
>> protocol (http://git-scm.com/gitserver.txt).
> 
> 404 Not Found
> 
> Maybe that should be committed to git in Documentation/technical/  as 
> well?

This was a plain-text RFC for the Git Packfile Protocol, generated from
the rfc2629 XML sources at http://github.com/schacon/gitserver-rfc


Scott, what happened to http://git-scm.com/gitserver.txt? 

And could you create an 'rfc' or 'text' branch in the gitserver-rfc
repository, with the processed plain-text output, similar to the 'man' and
'html' branches in the git.git repository? TIA.


_Some_ description of pack protocol can be found in git mailing list
archives
  http://thread.gmane.org/gmane.comp.version-control.git/118956
in "The Git Community Book"
  http://book.git-scm.com/7_transfer_protocols.html
  http://github.com/schacon/gitbook/blob/master/text/54_Transfer_Protocols/0_Transfer_Protocols.markdown
and in "Pro Git"
  http://progit.org/book/ch9-6.html
  http://github.com/progit/progit/blob/master/en/09-git-internals/01-chapter9.markdown

The description in Documentation/technical/pack-protocol.txt is very
brief, and Documentation/technical/shallow.txt doesn't cover 'shallow'
capability of git pack protocol.
 
>>>> Then you have to finish clone / fetch.  All solutions so far include
>>>> some kind of incremental improvements.  My first proposal of bisect
>>>> fetching 1/nth or predefined size pack is bottom-up solution, where
>>>> we build full clone from root commits up.  You propose, from what
>>>> I understand build full clone from top commit down, using deepening
>>>> from shallow clone.  In this step you either get full incremental
>>>> or not; downloading incremental (from what I understand) is not
>>>> resumable / they do not support partial fetch.
>>> 
>>> Right.  However, like I said, the incremental part should be much 
>>> smaller and therefore less susceptible to network troubles.
>> 
>> If you have 7% total pack size of git-archive resumable part, how small
>> do you plan to have those incremental deepening?  Besides in my 1/Nth
>> proposal those bottom-up packs were also meant to be sufficiently small
>> to avoid network troubles.
> 
> Two issues here: 1) people with slow links might not be interested in a 
> deep history as it costs them time.  2) Extra revisions should typically 
> require only a few KB each, therefore we might manage to ask for the 
> full history after the initial revision is downloaded and salvage as 
> much as we can if a network outage is encountered.  There is no need for 
> arbitrary size, unless the user decides arbitrarily to get only 10 more 
> revisions, or 100 more, etc.

Two features of your proposal make it very compelling:
1.) it is possible to salvage partially transferred packfiles (so there
    is no requirement to guess accurately what size they should be);
2.) after completing the initial git-archive transfer, you can convert 
    the incomplete clone into a functioning repository.  It would be a
    shallow clone, and it can be missing some branches and tags, but you
    can work with it even if the network connection fails completely.

>> P.S. As you can see implementing resumable clone isn't easy...
> 
> I've been saying that all along for quite a while now.   ;-)

Well, on the other hand we have the example of how long it took to
come to the current implementation of git submodules.  But it finally
got done.


The git-archive + deepening approach you proposed can be split into
smaller individual improvements.  You don't need to implement it all
at once.

1. Improve deepening of a shallow clone.  This means sending only the
   required objects, and being able to use objects that the other side
   already has as delta bases, i.e. sending a thin pack (see the sketch
   after this list for what deepening looks like today).

2. Add support for resuming (range requests) git-archive.  It is up
   to the client to translate the size of a partial transfer of the
   compressed file into a range request against the original
   (uncompressed) archive.

3. Create a new git-archive pseudo-format, used to transfer a single
   commit (with the commit object and the original branch name in some
   extended header, similar to how the commit ID is stored in an extended
   pax header or a ZIP comment).  It would imply not using the export-*
   gitattributes.

4. Implement an alternate ordering of objects in the packfile, so that a
   commit object is put immediately after all of its prerequisites.

5. Implement a 'salvage' operation, which given a partially transferred 
   packfile would deepen the shallow clone, or advance tracking branches,
   ensuring that the repository would pass fsck after this operation.

   Probably requires 4; it might not be possible, or much harder, to
   salvage anything with the current ordering of objects in a packfile.

6. Implement resumable clone ("git clone --keep <URL> [<directory>]",
   "git clone --resume" / "git clone --continue", "git clone --abort",
   "git clone --make-shallow" / "git clone --salvage").

   Requires 1-5.
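
As a rough illustration of the deepening building block in step 1, with
commands that already exist today (the repository URL is only an example):

	git clone --depth 1 git://git.kernel.org/pub/scm/git/git.git
	cd git
	git fetch --depth 10    # deepen the shallow history by a few more revisions

The resumable-clone options listed in step 6 are of course hypothetical
at this point.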

-- 
Jakub Narebski
Poland

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Continue git clone after interruption
  2009-08-21 10:07                             ` Jakub Narebski
@ 2009-08-21 10:26                               ` Matthieu Moy
  2009-08-21 21:07                               ` Nicolas Pitre
  1 sibling, 0 replies; 39+ messages in thread
From: Matthieu Moy @ 2009-08-21 10:26 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: Nicolas Pitre, Tomasz Kontusz, git, Johannes Schindelin, Scott Chacon

Jakub Narebski <jnareb@gmail.com> writes:

> On the other hand Linux is fairly large project in terms of LoC, but
> it had its history cut when moving to Git, so the ratio of git-archive
> of HEAD to the size of packfile is overemphasized here.

Emacs can be a good example if you want a project with a loooong
history.

emacs.git$ git ll | wc -l  
100651
emacs.git$ du -sh emacs.tar.gz .git/objects/pack/pack-144583582d53e273028966c6de2b3fb2fe3504bc.pack 
29M	emacs.tar.gz
138M	.git/objects/pack/pack-144583582d53e273028966c6de2b3fb2fe3504bc.pack

(from git://git.savannah.gnu.org/emacs.git )

-- 
Matthieu

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Continue git clone after interruption
  2009-08-21 10:07                             ` Jakub Narebski
  2009-08-21 10:26                               ` Matthieu Moy
@ 2009-08-21 21:07                               ` Nicolas Pitre
  2009-08-21 21:41                                 ` Jakub Narebski
  2009-08-21 23:07                                 ` Sam Vilain
  1 sibling, 2 replies; 39+ messages in thread
From: Nicolas Pitre @ 2009-08-21 21:07 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Tomasz Kontusz, git, Johannes Schindelin, Scott Chacon

On Fri, 21 Aug 2009, Jakub Narebski wrote:

> On Thu, 20 Aug 2009, Nicolas Pitre wrote:
> > On Thu, 20 Aug 2009, Jakub Narebski wrote:
> >> It is however only 2.5 MB out of 37 MB that are resumable, which is 7%
> >> (well, that of course depends on repository).  Not that much that is
> >> resumable.
> > 
> > Take the Linux kernel then.  It is more like 75 MB.
> 
> Ah... good example.
> 
> On the other hand Linux is fairly large project in terms of LoC, but
> it had its history cut when moving to Git, so the ratio of git-archive
> of HEAD to the size of packfile is overemphasized here.

That doesn't matter.  You still need that amount of data up front to do 
anything.  And I doubt people with slow links will want the full history 
anyway, regardless of whether it goes back 4 years or 18 years.

> You make use here of a few facts:
> 
> 1. Objects in packfile are _usually_ sorted in recency order, with most
>    recent commits, and most recent versions of trees and tags being in
>    the front of pack file, and being base objects for a large set of 
>    objects.  Note the "usually" part; it is not set in stone as for RCS
>    (and CVS) reverse delta based repository format.

Exactly.  In theory the object order could be totally random and the pack 
would still be valid.  The only restriction at the moment has to do with 
OFS_DELTA objects as the reference to the base object is encoded as a 
downward offset from the beginning of that OFS_DELTA object.  Hence the 
base object has to appear first.  In the case of REF_DELTA objects, the 
base can be located anywhere in the pack (or anywhere else outside of 
the pack in the thin pack case).

> 2. There is support in git pack format to do 'deepening' of shallow
>    clone, which means that git can generate incrementals in top-down
>    order, _similar to how objects are ordered in packfile_.

Well... the pack format was not meant for that "support".  The fact that 
the typical object order used by pack-objects when serving a fetch request 
is amenable to incremental top-down updates is rather coincidental and 
not really planned.

> 3. git-archive output is stable.  _git-archive can be made resumable_
>    (with range/partial requests), and can be made so it can create
>    single-head depth 0 shallow clone.
> 
> Also, with top-down deepening order even if you don't use 
> 'git clone --continue' but 'git clone --skip' (or something), you
> would have got usable shallow clone.  In the most extreme case when
> you are able to get only the fully resumable part, i.e. git-archive
> part (with top commit), you would have single-branch depth 0
> shallow clone (not very usable, but still better than nothing).

Right.

> > A special 
> > mode to pack-object could place commit objects only after all the 
> > objects needed to create that revision.  So once you get a commit object 
> > on the receiving end, you could assume that all objects reachable from 
> > that commit are already received, or you had them locally already.
> 
> Yes, with such mode (which I think wouldn't reduce / interfere with
> ability for upload-pack to pack more tightly by reordering objects
> and choosing different deltas) it would be easy to do a salvage of
> a partially completed / transferred packfile.  Even if there is no
> extension to tell git server which objects we have ("have" is only
> about commits), if there is at least one commit object in received
> part of packfile, we can try to continue from later (from more);
> there is less left to download.

Exactly.  It suffices to set the last received commit(s) (after validation)
as one of the shallow points.
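
The shallow boundary is just a list of commit object names in
$GIT_DIR/shallow, so conceptually such a salvage step amounts to adding
the last fully received commit to that file:

	cat .git/shallow    # one commit id per line in a shallow repository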

> >> Documentation/technical/shallow.txt doesn't cover "shallow", "unshallow"
> >> and "deepen" commands from 'shallow' capability extension to git pack
> >> protocol (http://git-scm.com/gitserver.txt).
> > 
> > 404 Not Found
> > 
> > Maybe that should be committed to git in Documentation/technical/  as 
> > well?
> 
> This was plain text RFC for the Git Packfile Protocol, generated from
> rfc2629 XML sources at http://github.com/schacon/gitserver-rfc

I suggest you track it down and prod/propose a version for merging in 
the git repository.

> The description in Documentation/technical/pack-protocol.txt is very
> brief, and Documentation/technical/shallow.txt doesn't cover 'shallow'
> capability of git pack protocol.

Yeah.  I finally had a look directly at the code to understand how it 
works.

> >> P.S. As you can see implementing resumable clone isn't easy...
> > 
> > I've been saying that all along for quite a while now.   ;-)
> 
> Well, on the other hand we have the example of how long it took to
> come to the current implementation of git submodules.  But it finally
> got done.

In this case there is still no new line of code whatsoever.  Thinking 
it through is what takes time.

> The git-archive + deepening approach you proposed can be split into
> smaller individual improvements.  You don't need to implement it all
> at once.
> 
> 1. Improve deepening of shallow clone.  This means sending only required
>    objects, and being able to use as a base objects that other side has
>    and send thin pack.

Yes.  And now that I understand how shallow clones are implemented, I will
probably fix that flaw soon.  It won't be hard at all.

> 2. Add support for resuming (range request) of git-archive.  It is up
>    to client to translate size of partial transfer of compressed file
>    into range request of original (uncompressed) archive.
> 
> 3. Create new git-archive pseudoformat, used to transfer single commit
>    (with commit object and original branch name in some extended header,
>    similar to how commit ID is stored in extended pax header or ZIP
>    comment).  It would imply not using export-* gitattributes.

The format I was envisioning is really simple:

First the size of the raw commit object data content in decimal, 
followed by a 0 byte, followed by the actual content of the commit 
object, followed by a 0 byte.  (Note: this could be the exact same 
content as the canonical commit object data with the "commit" prefix, 
but since all the rest is blob content this would be redundant.)

Then, for each file:

 - The file mode in octal notation just as in tree objects
 - a space
 - the size of the file in decimal
 - a tab
 - the full path of the file
 - a 0 byte
 - the file content as found in the corresponding blob
 - a 0 byte

And finally some kind of marker to indicate the end of the stream.

Put the lot through zlib and you're done.
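
Just to make that concrete, here is a rough sketch of producing such a
stream for HEAD with plumbing that exists today (path quoting and the
end-of-stream marker are hand-waved, and gzip merely stands in for the
zlib compression):

	commit=$(git rev-parse HEAD)
	{
		printf '%s\0' "$(git cat-file -s $commit)"   # size of commit content
		git cat-file commit $commit; printf '\0'
		git ls-tree -r $commit | while read -r mode type sha path
		do
			printf '%s %s\t%s\0' "$mode" "$(git cat-file -s $sha)" "$path"
			git cat-file blob $sha; printf '\0'
		done
		printf 'END\0'                               # end-of-stream marker
	} | gzip > snapshot.gitrev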

> 4. Implement alternate ordering of objects in packfile, so commit object
>    is put immediately after all its prerequisites.

That would require some changes in the object enumeration code which is 
an area of the code I don't know well.

> 5. Implement 'salvage' operation, which given partially transferred 
>    packfile would deepen shallow clone, or advance tracking branches,
>    ensuring that repository would pass fsck after this operation.
> 
>    Probably requires 4; might be not possible or much harder to salvage
>    anything with current ordering of objects in packfile.
> 
> 6. Implement resumable clone ("git clone --keep <URL> [<directory>]",
>    "git clone --resume" / "git clone --continue", "git clone --abort",
>    "git clone --make-shallow" / "git clone --salvage").

Right.  This is all doable fairly easily.


Nicolas

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Continue git clone after interruption
  2009-08-21 21:07                               ` Nicolas Pitre
@ 2009-08-21 21:41                                 ` Jakub Narebski
  2009-08-22  0:59                                   ` Nicolas Pitre
  2009-08-21 23:07                                 ` Sam Vilain
  1 sibling, 1 reply; 39+ messages in thread
From: Jakub Narebski @ 2009-08-21 21:41 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Tomasz Kontusz, git, Johannes Schindelin, Scott Chacon

On Fri, 21 Aug 2009, Nicolas Pitre wrote:
> On Fri, 21 Aug 2009, Jakub Narebski wrote:
>> On Thu, 20 Aug 2009, Nicolas Pitre wrote:
>>> On Thu, 20 Aug 2009, Jakub Narebski wrote:

>>>> It is however only 2.5 MB out of 37 MB that are resumable, which is 7%
>>>> (well, that of course depends on repository).  Not that much that is
>>>> resumable.
>>> 
>>> Take the Linux kernel then.  It is more like 75 MB.
>> 
>> Ah... good example.
>> 
>> On the other hand Linux is fairly large project in terms of LoC, but
>> it had its history cut when moving to Git, so the ratio of git-archive
>> of HEAD to the size of packfile is overemphasized here.
> 
> That doesn't matter.  You still need that amount of data up front to do 
> anything.  And I doubt people with slow links will want the full history 
> anyway, regardless if it goes backward 4 years or 18 years back.

On the other hand, an unreliable link doesn't have to mean an unreasonably
slow link.

Hopefully GitTorrent / git-mirror-sync will finally come out of 
vapourware and won't share the fate of Duke Nukem Forever ;-),
and we will have it as an alternative way to clone large repositories.
Well, supposedly there is some code, and last year's GSoC project at
least shook the dust off the initial design and made it simpler, IIUC.
 
>> You make use here of a few facts:
[...]

>> 2. There is support in git pack format to do 'deepening' of shallow
>>    clone, which means that git can generate incrementals in top-down
>>    order, _similar to how objects are ordered in packfile_.
> 
> Well... the pack format was not meant for that "support".  The fact that 
> the typical object order used by pack-objects when serving fetch request 
> is amenable to incremental top-down updates is rather coincidental and 
> not really planned.

Oops.  I meant "git pack PROTOCOL" here, not "git pack _format_";
the one about the want/have/shallow/deepen exchange.
 
[...]
>>> A special 
>>> mode to pack-object could place commit objects only after all the 
>>> objects needed to create that revision.  So once you get a commit object 
>>> on the receiving end, you could assume that all objects reachable from 
>>> that commit are already received, or you had them locally already.
>> 
>> Yes, with such mode (which I think wouldn't reduce / interfere with
>> ability for upload-pack to pack more tightly by reordering objects
>> and choosing different deltas) it would be easy to do a salvage of
>> a partially completed / transferred packfile.  Even if there is no
>> extension to tell git server which objects we have ("have" is only
>> about commits), if there is at least one commit object in received
>> part of packfile, we can try to continue from later (from more);
>> there is less left to download.
> 
> Exact.  Suffice to set the last received commit(s) (after validation) as 
> one of the shallow points.

Assuming that the received commit is complete (has all its prerequisites),
and is connected to the rest of the partially [shallow-]cloned
repository.

>>>> Documentation/technical/shallow.txt doesn't cover "shallow", "unshallow"
>>>> and "deepen" commands from 'shallow' capability extension to git pack
>>>> protocol (http://git-scm.com/gitserver.txt).
>>> 
>>> 404 Not Found
>>> 
>>> Maybe that should be committed to git in Documentation/technical/  as 
>>> well?
>> 
>> This was plain text RFC for the Git Packfile Protocol, generated from
>> rfc2629 XML sources at http://github.com/schacon/gitserver-rfc
> 
> I suggest you track it down and prod/propose a version for merging in 
> the git repository.

Scott Chacon was (and is) CC-ed.
 
I don't know if you remember the discussion about the pack protocol,
stemming from the fact that some git (re)implementations (Dulwich,
JGit) failed to implement it properly, where properly = the same as
git-core, i.e. the original implementation in C... because there was
not enough documentation.


>>>> P.S. As you can see implementing resumable clone isn't easy...
>>> 
>>> I've been saying that all along for quite a while now.   ;-)
>> 
>> Well, on the other hand we have the example of how long it took to
>> come to the current implementation of git submodules.  But it finally
>> got done.
> 
> In this case there is still no new line of code what so ever.  Thinking 
> it through is what takes time.

Measure twice, cut once :-)

In this case I think design upfront is a good solution.
 
>> The git-archive + deepening approach you proposed can be split into
>> smaller individual improvements.  You don't need to implement it all
>> at once.
[...]

>> 3. Create new git-archive pseudoformat, used to transfer single commit
>>    (with commit object and original branch name in some extended header,
>>    similar to how commit ID is stored in extended pax header or ZIP
>>    comment).  It would imply not using export-* gitattributes.
> 
> The format I was envisioning is really simple:
> 
> First the size of the raw commit object data content in decimal, 
> followed by a 0 byte, followed by the actual content of the commit 
> object, followed by a 0 byte.  (Note: this could be the exact same 
> content as the canonical commit object data with the "commit" prefix, 
> but as all the rest are all blob content this would be redundant.)
> 
> Then, for each file:
> 
>  - The file mode in octal notation just as in tree objects
>  - a space
>  - the size of the file in decimal
>  - a tab
>  - the full path of the file
>  - a 0 byte
>  - the file content as found in the corresponding blob
>  - a 0 byte
> 
> And finally some kind of marker to indicate the end of the stream.
> 
> Put the lot through zlib and you're done.

So you don't want to just tack the commit object (as an extended pax
header, or a comment, if that is at all possible) onto the existing 'tar'
and 'zip' archive formats.  Probably better to design the format from
scratch.
 
>> 4. Implement alternate ordering of objects in packfile, so commit object
>>    is put immediately after all its prerequisites.
> 
> That would require some changes in the object enumeration code which is 
> an area of the code I don't know well.

Oh.

-- 
Jakub Narebski
Poland

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Continue git clone after interruption
  2009-08-21 21:07                               ` Nicolas Pitre
  2009-08-21 21:41                                 ` Jakub Narebski
@ 2009-08-21 23:07                                 ` Sam Vilain
  2009-08-22  3:37                                   ` Nicolas Pitre
  1 sibling, 1 reply; 39+ messages in thread
From: Sam Vilain @ 2009-08-21 23:07 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Jakub Narebski, Tomasz Kontusz, git, Johannes Schindelin, Scott Chacon

On Fri, 2009-08-21 at 17:07 -0400, Nicolas Pitre wrote:
> > 2. There is support in git pack format to do 'deepening' of shallow
> >    clone, which means that git can generate incrementals in top-down
> >    order, _similar to how objects are ordered in packfile_.
> 
> Well... the pack format was not meant for that "support".  The fact
> that 
> the typical object order used by pack-objects when serving fetch
> request 
> is amenable to incremental top-down updates is rather coincidental
> and 
> not really planned.

Mmm.  And the problem with 'thin' packs is that they normally allow
deltas the other way.

I think the first step here would be to allow thin pack generation to
accept a bounded range of commits, any of the objects within which may
be used as delta base candidates.  That way, these "top down" thin packs
can be generated.  Currently of course it just uses the --not and makes
"bottom up" thin packs.

> > Another solution would be to try to come up with some sort of stable
> > sorting of objects so that packfile generated for the same
> > parameters (endpoints) would be always byte-for-byte the same.  But
> > that might be difficult, or even impossible.
>
> And I don't want to commit to that either.  Having some flexibility
> in object ordering makes it possible to improve on the packing
> heuristics.

You don't have to lose that for storage.  It's only for generating the
thin packs that it matters; also, the restriction is relaxed when it
comes to objects which are all being sent in the same pack, which can
freely delta amongst themselves in any direction.

What did you think about the bundle slicing stuff?

Sam

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Continue git clone after interruption
  2009-08-21 21:41                                 ` Jakub Narebski
@ 2009-08-22  0:59                                   ` Nicolas Pitre
  0 siblings, 0 replies; 39+ messages in thread
From: Nicolas Pitre @ 2009-08-22  0:59 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Tomasz Kontusz, git, Johannes Schindelin, Scott Chacon

On Fri, 21 Aug 2009, Jakub Narebski wrote:

> On Fri, 21 Aug 2009, Nicolas Pitre wrote:
> > On Fri, 21 Aug 2009, Jakub Narebski wrote:
> >> On Thu, 20 Aug 2009, Nicolas Pitre wrote:
> >>> On Thu, 20 Aug 2009, Jakub Narebski wrote:
> 
> >>>> It is however only 2.5 MB out of 37 MB that are resumable, which is 7%
> >>>> (well, that of course depends on repository).  Not that much that is
> >>>> resumable.
> >>> 
> >>> Take the Linux kernel then.  It is more like 75 MB.
> >> 
> >> Ah... good example.
> >> 
> >> On the other hand Linux is fairly large project in terms of LoC, but
> >> it had its history cut when moving to Git, so the ratio of git-archive
> >> of HEAD to the size of packfile is overemphasized here.
> > 
> > That doesn't matter.  You still need that amount of data up front to do 
> > anything.  And I doubt people with slow links will want the full history 
> > anyway, regardless if it goes backward 4 years or 18 years back.
> 
> On the other hand unreliable link doesn't need to mean unreasonably
> slow link.

In my experience speed and reliability are more or less tied together.
And the slower your link is, the longer your transfer will last, and the
greater the chances that you will run into trouble.

> Hopefully GitTorrent / git-mirror-sync would finally come out of 
> vapourware and wouldn't share the fate of Duke Nukem Forever ;-),
> and we would have this as an alternative to clone large repositories.

Well... Maybe.

> Well, supposedly there is some code, and last year GSoC project at
> least shook the dust out of initial design and made it simplier, IIUC.

The BitTorrent protocol is a nifty thing (although I doubt the 
entertainment industry thinks so).  But its efficiency relies on the fact 
that many, many people are expected to download the same stuff at the 
same time.  I have some doubts about the availability of the right 
conditions in the context of git for a BitTorrent-like protocol to work 
well in practice.  But this is Open Source, and no one has to wait for me 
or anyone else to be convinced before attempting it and showing results 
to the world.

> >> This was plain text RFC for the Git Packfile Protocol, generated from
> >> rfc2629 XML sources at http://github.com/schacon/gitserver-rfc
> > 
> > I suggest you track it down and prod/propose a version for merging in 
> > the git repository.
> 
> Scott Chacon was (and is) CC-ed.

He might not have followed all of our exchange in this thread so closely, 
though.  So another thread with him in the To: field might be required to 
get his attention.

> I don't know if you remember mentioned discussion about pack protocol, 
> stemming from the fact that some of git (re)implementations (Dulwich,
> JGit) failed to implement it properly, where properly = same as 
> git-core, i.e. the original implementation in C... because there were
> not enough documentation.

Yes, I followed the discussion.  I still think that, since that 
documentation exists now, it would be a good idea to have a copy 
included in the git sources.

> > The format I was envisioning is really simple:
> > 
> > First the size of the raw commit object data content in decimal, 
> > followed by a 0 byte, followed by the actual content of the commit 
> > object, followed by a 0 byte.  (Note: this could be the exact same 
> > content as the canonical commit object data with the "commit" prefix, 
> > but as all the rest are all blob content this would be redundant.)
> > 
> > Then, for each file:
> > 
> >  - The file mode in octal notation just as in tree objects
> >  - a space
> >  - the size of the file in decimal
> >  - a tab
> >  - the full path of the file
> >  - a 0 byte
> >  - the file content as found in the corresponding blob
> >  - a 0 byte
> > 
> > And finally some kind of marker to indicate the end of the stream.
> > 
> > Put the lot through zlib and you're done.
> 
> So you don't want to just tack commit object (as extended pax header,
> or a comment - if it is at all possible) to the existing 'tar' and
> 'zip' archive formats.  Probably better to design format from scratch.

As René Scharfe mentioned, the existing archive formats have limitations 
and complexities that we might simply avoid altogether by creating a 
simpler format that is more likely to never fail to faithfully reproduce
the content of a git revision.  Maybe the git-fast-import format could do it 
even better, and maybe not.  That's an implementation detail that needs 
to be worked out once one is ready to get real with actual coding.


Nicolas

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Continue git clone after interruption
  2009-08-21 23:07                                 ` Sam Vilain
@ 2009-08-22  3:37                                   ` Nicolas Pitre
  2009-08-22  5:50                                     ` Sam Vilain
  0 siblings, 1 reply; 39+ messages in thread
From: Nicolas Pitre @ 2009-08-22  3:37 UTC (permalink / raw)
  To: Sam Vilain
  Cc: Jakub Narebski, Tomasz Kontusz, git, Johannes Schindelin, Scott Chacon

On Sat, 22 Aug 2009, Sam Vilain wrote:

> On Fri, 2009-08-21 at 17:07 -0400, Nicolas Pitre wrote:
> > > 2. There is support in git pack format to do 'deepening' of shallow
> > >    clone, which means that git can generate incrementals in top-down
> > >    order, _similar to how objects are ordered in packfile_.
> > 
> > Well... the pack format was not meant for that "support".  The fact
> > that 
> > the typical object order used by pack-objects when serving fetch
> > request 
> > is amenable to incremental top-down updates is rather coincidental
> > and 
> > not really planned.
> 
> Mmm.  And the problem with 'thin' packs is that they normally allow
> deltas the other way.

Sure.  The pack format is flexible.

> I think the first step here would be to allow thin pack generation to
> accept a bounded range of commits, any of the objects within which may
> be used as delta base candidates.  That way, these "top down" thin packs
> can be generated.  Currently of course it just uses the --not and makes
> "bottom up" thin packs.

The pack is still almost top-down.  It's only the missing delta bases
that are in the other direction, referring to objects you have locally,
which are therefore older.

> > > Another solution would be to try to come up with some sort of stable
> > > sorting of objects so that packfile generated for the same
> > > parameters (endpoints) would be always byte-for-byte the same.  But
> > > that might be difficult, or even impossible.
> >
> > And I don't want to commit to that either.  Having some flexibility
> > in object ordering makes it possible to improve on the packing
> > heuristics.
> 
> You don't have to lose that for storage.  It's only for generating the
> thin packs that it matters;

What matters?

> also, the restriction is relaxed when it
> comes to objects which are all being sent in the same pack, which can
> freely delta amongst themselves in any direction.

That's always the case within a pack, but only for REF_DELTA objects.  
The OFS_DELTA objects have to be ordered. And yes, having deltas across 
packs is disallowed to avoid cycles and to keep the database robust.  
The only exception is for thin packs, but those are never created on 
disk. Thin packs are only used for transport and quickly "fixed" upon 
reception by appending the missing objects to them so they are not 
"thin" anymore.

> What did you think about the bundle slicing stuff?

If I didn't comment on it already, then I probably missed it and have no 
idea.


Nicolas

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Continue git clone after interruption
  2009-08-22  3:37                                   ` Nicolas Pitre
@ 2009-08-22  5:50                                     ` Sam Vilain
  2009-08-22  8:13                                       ` Nicolas Pitre
  0 siblings, 1 reply; 39+ messages in thread
From: Sam Vilain @ 2009-08-22  5:50 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Jakub Narebski, Tomasz Kontusz, git, Johannes Schindelin, Scott Chacon

On Fri, 2009-08-21 at 23:37 -0400, Nicolas Pitre wrote:
> > What did you think about the bundle slicing stuff?
> 
> If I didn't comment on it already, then I probably missed it and have no 
> idea.

I really tire of repeating myself for your sole benefit.  Please show
some consideration for other people in the conversation by trying to
listen.  Thank-you.

> > I think the first step here would be to allow thin pack generation to
> > accept a bounded range of commits, any of the objects within which may
> > be used as delta base candidates.  That way, these "top down" thin packs
> > can be generated.  Currently of course it just uses the --not and makes
> > "bottom up" thin packs.
> 
> The pack is still almost top-down.  It's only the missing delta base 
> that are in the other direction, refering to objects you have locally 
> and therefore older.

Ok, but right now there's no way to specify that you want a thin pack,
where the allowable base objects are *newer* than the commit range you
wish to include.

What I said in my other e-mail, where I showed how well it works to take
a given bundle and slice it into a series of thin packs, was that it
seems to add a bit of extra size to the resultant packs; the best I got
for slicing up the entire git.git repository was about 20% overhead.  If
this can be reduced to under 10% (say), then sending bundle slices would
be quite reasonable by default, for the benefit of making large fetches
restartable, or even spreadable across multiple mirrors.

The object sorting stuff is something of a distraction; it's required
for download spreading but not for the case at hand now.

Sam

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Continue git clone after interruption
  2009-08-22  5:50                                     ` Sam Vilain
@ 2009-08-22  8:13                                       ` Nicolas Pitre
  2009-08-23 10:37                                         ` Sam Vilain
  0 siblings, 1 reply; 39+ messages in thread
From: Nicolas Pitre @ 2009-08-22  8:13 UTC (permalink / raw)
  To: Sam Vilain
  Cc: Jakub Narebski, Tomasz Kontusz, git, Johannes Schindelin, Scott Chacon

On Sat, 22 Aug 2009, Sam Vilain wrote:

> On Fri, 2009-08-21 at 23:37 -0400, Nicolas Pitre wrote:
> > > What did you think about the bundle slicing stuff?
> > 
> > If I didn't comment on it already, then I probably missed it and have no 
> > idea.
> 
> I really tire of repeating myself for your sole benefit.  Please show
> some consideration for other people in the conversation by trying to
> listen.  Thank-you.

I'm sorry, but I have way too many emails to consider reading them all.
This is like Ethernet: not a reliable transport, and lost packets mean
you have to retransmit.  Cut and paste does wonders, or even a link to
the previous post.

> > > I think the first step here would be to allow thin pack generation to
> > > accept a bounded range of commits, any of the objects within which may
> > > be used as delta base candidates.  That way, these "top down" thin packs
> > > can be generated.  Currently of course it just uses the --not and makes
> > > "bottom up" thin packs.
> > 
> > The pack is still almost top-down.  It's only the missing delta base 
> > that are in the other direction, refering to objects you have locally 
> > and therefore older.
> 
> Ok, but right now there's no way to specify that you want a thin pack,
> where the allowable base objects are *newer* than the commit range you
> wish to include.

Sure you can.  Try this:

	( echo "-$(git rev-parse v1.6.4)"; \
	  git rev-list --objects v1.6.2..v1.6.3 ) | \
		git pack-objects --progress --stdout > foo.pack

That'll give you a thin pack for the _new_ objects that _appeared_ 
between v1.6.2 and v1.6.3, but whose external delta base objects are
found in v1.6.4.

If you want _all_ the objects that are referenced from commits between 
v1.6.2 and v1.6.3 then you just have to list them all for v1.6.2 in 
addition to the rest:

	( echo "-$(git rev-parse v1.6.4)"; \
	  git rev-list --objects v1.6.2..v1.6.3; \
	  git ls-tree -t -r v1.6.2 | cut  -d' ' -f 3- | tr "\t" " "; ) | \
		git pack-objects --progress --stdout > foo.pack

> What I said in my other e-mail where I showed how well it works taking
> a given bundle, and slicing it into a series of thin packs, was that it
> seems to add a bit of extra size to the resultant packs - best I got for
> slicing up the entire git.git run was about 20%.  If this can be
> reduced to under 10% (say), then sending bundle slices would be quite
> reasonable by default for the benefit of making large fetches
> restartable, or even spreadable across multiple mirrors.

In theory you could have almost no overhead.  That all depends on how you 
slice the pack.  If you want a pack to contain a fixed number of commits 
(such that all objects introduced by a given commit are all in the same 
pack) then you are of course putting a constraint on the possible delta 
matches, and the compression result might be suboptimal.  In comparison, 
with a single big pack a given blob can delta against a blob from a 
completely distant commit in the history graph if that provides a better 
compression ratio.

If you slice your pack according to a size threshold, then you might 
consider the --max-pack-size= argument to pack-objects.  This currently 
doesn't produce thin packs, as objects whose delta base ends up in a 
different pack because of a pack split are simply not stored as deltas.  
Only a few lines of code would need to be modified in order to store 
those deltas nevertheless and turn those packs into thin packs, preserving 
the optimal delta match.  Of course, cross-pack delta references have to 
be REF_DELTA objects with headers about 16 to 17 bytes larger than those 
of OFS_DELTA objects, so you will still have some overhead.
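
For example, slicing a whole history by a size threshold already works
today, minus the thin-pack twist described above (note that how the
--max-pack-size value is interpreted has changed between git versions,
MiB in older releases and bytes later, so check git-pack-objects(1)):

	# write a series of sliced-*.pack files of bounded size
	# (~16 MB in current versions; older gits read the value as MiB)
	git rev-list --objects --all |
		git pack-objects --max-pack-size=16000000 sliced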

> The object sorting stuff is something of a distraction; it's required
> for download spreading but not for the case at hand now.

Well, the idea of spreading small packs has its drawbacks.  You still 
might need to get a sizeable portion of them to get at least one usable 
commit.  And ideally you want the top commit in priority, which pretty 
much imposes an ordering on the packs you're likely to want first, unlike 
with BitTorrent where you don't care as you normally want all 
the blocks anyway.

If the goal is to make for faster downloads, then you could simply make 
a bundle, copy it onto multiple servers, and slice your download across 
those servers.  This has the disadvantage of being static data that 
doubles the disk (and cache) usage.  That doesn't work too well with 
shallow clones though.
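
That static-bundle approach needs nothing new, by the way.  Something
along these lines already works, with the mirroring and the resumable
download left to whatever transport you like:

	git bundle create repo.bundle --all     # static file, mirror it anywhere
	# ...download repo.bundle over HTTP/rsync/whatever, resuming as needed...
	git clone repo.bundle my-repo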

If you were envisioning _clients_ à la BitTorrent putting up pack slices 
instead, then in that case the slices have to be well-defined entities, 
like packs containing objects for a known range of commits, but then we're 
back to the delta inefficiency I mentioned above.  And again, this might 
work only if a lot of people are interested in the same repository at 
the same time, and of course most people have no big incentive to "seed" 
once they have got their copy.  So I'm not sure that this would work that 
well in practice.

This certainly still looks like a pretty cool project.  But not all the
cool stuff works well in real conditions, I'm afraid.  Just my opinion,
of course.


Nicolas

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Continue git clone after interruption
  2009-08-22  8:13                                       ` Nicolas Pitre
@ 2009-08-23 10:37                                         ` Sam Vilain
  0 siblings, 0 replies; 39+ messages in thread
From: Sam Vilain @ 2009-08-23 10:37 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Jakub Narebski, Tomasz Kontusz, git, Johannes Schindelin, Scott Chacon

On Sat, 2009-08-22 at 04:13 -0400, Nicolas Pitre wrote:
> > Ok, but right now there's no way to specify that you want a thin pack,
> > where the allowable base objects are *newer* than the commit range you
> > wish to include.
> 
> Sure you can.  Try this:
> 
> 	( echo "-$(git rev-parse v1.6.4)"; \
> 	  git rev-list --objects v1.6.2..v1.6.3 ) | \
> 		git pack-objects --progress --stdout > foo.pack
> 
> That'll give you a thin pack for the _new_ objects that _appeared_ 
> between v1.6.2 and v1.6.3, but which external delta base objects are 
> found in v1.6.4.

Aha.  I guess I had made an assumption about where that '-' lets
pack-objects find deltas from that isn't true.

> > What I said in my other e-mail where I showed how well it works taking
> > a given bundle, and slicing it into a series of thin packs, was that it
> > seems to add a bit of extra size to the resultant packs - best I got for
> > slicing up the entire git.git run was about 20%.  If this can be
> > reduced to under 10% (say), then sending bundle slices would be quite
> > reasonable by default for the benefit of making large fetches
> > restartable, or even spreadable across multiple mirrors.
> 
> In theory you could have about no overhead.  That all depends how you 
> slice the pack.  If you want a pack to contain a fixed number of commits 
> (such that all objects introduced by a given commit are all in the same 
> pack) then you are of course putting a constraint on the possible delta 
> matches and compression result might be suboptimal.  In comparison, with 
> a single big pack a given blob can delta against a blob from a 
> completely distant commit in the history graph if that provides a better 
> compression ratio.
 [...]
> If you were envisioning _clients_ à la BitTorrent putting up pack slices 
> instead, then in that case the slices have to be well defined entities, 
> like packs containing objects for known range of commits, but then we're 
> back to the delta inefficiency I mentioned above.

I'll do some more experiments to try to quantify this in light of this
new information; I still think that if the overhead is marginal there
are significant wins to this approach.

> And again this might 
> work only if a lot of people are interested in the same repository at 
> the same time, and of course most people have no big insentive to "seed" 
> once they got their copy. So I'm not sure if that might work that well 
> in practice.

Throw away terms like "seeding" and replace them with "mirroring".  Sites
which currently house mirrors could potentially be helping to serve git
repos, too.  Popular projects could have many mirrors, and at the edges
of the internet git servers could mirror many projects for users in
their country.

Sam

^ permalink raw reply	[flat|nested] 39+ messages in thread

Thread overview: 39+ messages
2009-08-17 11:42 Continue git clone after interruption Tomasz Kontusz
2009-08-17 12:31 ` Johannes Schindelin
2009-08-17 15:23   ` Shawn O. Pearce
2009-08-18  5:43   ` Matthieu Moy
2009-08-18  6:58     ` Tomasz Kontusz
2009-08-18 17:56       ` Nicolas Pitre
2009-08-18 18:45         ` Jakub Narebski
2009-08-18 20:01           ` Nicolas Pitre
2009-08-18 21:02             ` Jakub Narebski
2009-08-18 21:32               ` Nicolas Pitre
2009-08-19 15:19                 ` Jakub Narebski
2009-08-19 19:04                   ` Nicolas Pitre
2009-08-19 19:42                     ` Jakub Narebski
2009-08-19 21:13                       ` Nicolas Pitre
2009-08-20  0:26                         ` Sam Vilain
2009-08-20  7:37                         ` Jakub Narebski
2009-08-20  7:48                           ` Nguyen Thai Ngoc Duy
2009-08-20  8:23                             ` Jakub Narebski
2009-08-20 18:41                           ` Nicolas Pitre
2009-08-21 10:07                             ` Jakub Narebski
2009-08-21 10:26                               ` Matthieu Moy
2009-08-21 21:07                               ` Nicolas Pitre
2009-08-21 21:41                                 ` Jakub Narebski
2009-08-22  0:59                                   ` Nicolas Pitre
2009-08-21 23:07                                 ` Sam Vilain
2009-08-22  3:37                                   ` Nicolas Pitre
2009-08-22  5:50                                     ` Sam Vilain
2009-08-22  8:13                                       ` Nicolas Pitre
2009-08-23 10:37                                         ` Sam Vilain
2009-08-20 22:57                           ` Sam Vilain
2009-08-18 22:28             ` Johannes Schindelin
2009-08-18 23:40               ` Nicolas Pitre
2009-08-19  7:35                 ` Johannes Schindelin
2009-08-19  8:25                   ` Nguyen Thai Ngoc Duy
2009-08-19  9:52                     ` Johannes Schindelin
2009-08-19 17:21                   ` Nicolas Pitre
2009-08-19 22:23                     ` René Scharfe
2009-08-19  4:42           ` Sitaram Chamarty
2009-08-19  9:53             ` Jakub Narebski
