* Continue git clone after interruption
@ 2009-08-17 11:42 Tomasz Kontusz
2009-08-17 12:31 ` Johannes Schindelin
0 siblings, 1 reply; 39+ messages in thread
From: Tomasz Kontusz @ 2009-08-17 11:42 UTC (permalink / raw)
To: git
Hi,
is anybody working on making it possible to continue git clone after
interruption? It would be quite useful for people with bad internet
connection (I was downloading a big repo lately, and it was a bit
frustrating to start it over every time git stopped at ~90%).
Tomasz Kontusz
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Continue git clone after interruption
2009-08-17 11:42 Continue git clone after interruption Tomasz Kontusz
@ 2009-08-17 12:31 ` Johannes Schindelin
2009-08-17 15:23 ` Shawn O. Pearce
2009-08-18 5:43 ` Matthieu Moy
0 siblings, 2 replies; 39+ messages in thread
From: Johannes Schindelin @ 2009-08-17 12:31 UTC (permalink / raw)
To: Tomasz Kontusz; +Cc: git
Hi,
On Mon, 17 Aug 2009, Tomasz Kontusz wrote:
> is anybody working on making it possible to continue git clone after
> interruption? It would be quite useful for people with bad internet
> connection (I was downloading a big repo lately, and it was a bit
> frustrating to start it over every time git stopped at ~90%).
Unfortunately, we did not have enough GSoC slots for the project to allow
restartable clones.
There were discussions about how to implement this on the list, though.
Ciao,
Dscho
* Re: Continue git clone after interruption
2009-08-17 12:31 ` Johannes Schindelin
@ 2009-08-17 15:23 ` Shawn O. Pearce
2009-08-18 5:43 ` Matthieu Moy
1 sibling, 0 replies; 39+ messages in thread
From: Shawn O. Pearce @ 2009-08-17 15:23 UTC (permalink / raw)
To: Johannes Schindelin; +Cc: Tomasz Kontusz, git
Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> On Mon, 17 Aug 2009, Tomasz Kontusz wrote:
>
> > is anybody working on making it possible to continue git clone after
> > interruption? It would be quite useful for people with bad internet
> > connection (I was downloading a big repo lately, and it was a bit
> > frustrating to start it over every time git stopped at ~90%).
>
> Unfortunately, we did not have enough GSoC slots for the project to allow
> restartable clones.
>
> There were discussions about how to implement this on the list, though.
Unfortunately, those of us who know how the native protocol works
can't come to an agreement on how it might be restartable. If you
really read the archives on this topic, you'll see that Nico and I
disagree about how to do this. IIRC Nico's position is that it isn't
really possible to implement a restart.
--
Shawn.
* Re: Continue git clone after interruption
2009-08-17 12:31 ` Johannes Schindelin
2009-08-17 15:23 ` Shawn O. Pearce
@ 2009-08-18 5:43 ` Matthieu Moy
2009-08-18 6:58 ` Tomasz Kontusz
1 sibling, 1 reply; 39+ messages in thread
From: Matthieu Moy @ 2009-08-18 5:43 UTC (permalink / raw)
To: Johannes Schindelin; +Cc: Tomasz Kontusz, git
Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
> Hi,
>
> On Mon, 17 Aug 2009, Tomasz Kontusz wrote:
>
>> is anybody working on making it possible to continue git clone after
>> interruption? It would be quite useful for people with bad internet
>> connection (I was downloading a big repo lately, and it was a bit
>> frustrating to start it over every time git stopped at ~90%).
>
> Unfortunately, we did not have enough GSoC slots for the project to allow
> restartable clones.
>
> There were discussions about how to implement this on the list,
> though.
And a paragraph on the wiki:
http://git.or.cz/gitwiki/SoC2009Ideas#RestartableClone
--
Matthieu
* Re: Continue git clone after interruption
2009-08-18 5:43 ` Matthieu Moy
@ 2009-08-18 6:58 ` Tomasz Kontusz
2009-08-18 17:56 ` Nicolas Pitre
0 siblings, 1 reply; 39+ messages in thread
From: Tomasz Kontusz @ 2009-08-18 6:58 UTC (permalink / raw)
To: git
On Tue, 2009-08-18 at 07:43 +0200, Matthieu Moy wrote:
> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
>
> > Hi,
> >
> > On Mon, 17 Aug 2009, Tomasz Kontusz wrote:
> >
> >> is anybody working on making it possible to continue git clone after
> >> interruption? It would be quite useful for people with bad internet
> >> connection (I was downloading a big repo lately, and it was a bit
> >> frustrating to start it over every time git stopped at ~90%).
> >
> > Unfortunately, we did not have enough GSoC slots for the project to allow
> > restartable clones.
> >
> > There were discussions about how to implement this on the list,
> > though.
>
> And a paragraph on the wiki:
>
> http://git.or.cz/gitwiki/SoC2009Ideas#RestartableClone
Ok, so it looks like it's not implementable without some kind of cache
on the server side, so the server would know what the pack it was
sending looked like.
But here's my idea: make the server send objects in a different order
(the newest commit + whatever it points to first, then the next one,
then another...). Then it would be possible to look at what we got, and
tell the server we have nothing and want [the newest commit that was not
complete]. I know the reason why it is sorted the way it is, but I think
that the way data is stored after a clone is the client's problem, so
the client should reorganize packs the way it wants.
Tomasz K.
* Re: Continue git clone after interruption
2009-08-18 6:58 ` Tomasz Kontusz
@ 2009-08-18 17:56 ` Nicolas Pitre
2009-08-18 18:45 ` Jakub Narebski
0 siblings, 1 reply; 39+ messages in thread
From: Nicolas Pitre @ 2009-08-18 17:56 UTC (permalink / raw)
To: Tomasz Kontusz; +Cc: git
On Tue, 18 Aug 2009, Tomasz Kontusz wrote:
> Ok, so it looks like it's not implementable without some kind of cache
> server-side, so the server would know what the pack it was sending
> looked like.
> But here's my idea: make server send objects in different order (the
> newest commit + whatever it points to first, then next one,then
> another...). Then it would be possible to look at what we got, tell
> server we have nothing, and want [the newest commit that was not
> complete]. I know the reason why it is sorted the way it is, but I think
> that the way data is stored after clone is clients problem, so the
> client should reorganize packs the way it wants.
That won't buy you much. You should realize that a pack is made of:
1) Commit objects. Yes they're all put together at the front of the pack,
but they roughly are the equivalent of:
git log --pretty=raw | gzip | wc -c
For the Linux repo as of now that is around 32 MB.
2) Tree and blob objects. Those are the bulk of the content for the top
commit. The top commit is usually not delta compressed because we
want fast access to the top commit, and that is used as the base for
further delta compression for older commits. So the very first
commit is whole at the front of the pack right after the commit
objects. You can estimate the size of this data with:
git archive --format=tar HEAD | gzip | wc -c
On the same Linux repo this is currently 75 MB.
3) Delta objects. Those make up the rest of the pack, plus a couple of
tree/blob objects that were not found in the top commit and are
different enough from any object in that top commit not to be
represented as deltas. Still, the majority of objects for all the
remaining commits are delta objects.
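Taken together, the two estimation commands can be scripted (a rough
back-of-the-envelope sketch only; run it inside any git work tree, and
expect the numbers to vary per repository):

```shell
# Estimate the two "critical" front-of-pack components described above.

# 1) compressed size of all commit objects (roughly `git log` gzipped):
commit_bytes=$(git log --pretty=raw | gzip | wc -c)

# 2) compressed snapshot of the top commit (the tree + blob data needed
#    before even one commit is usable):
snapshot_bytes=$(git archive --format=tar HEAD | gzip | wc -c)

echo "commit objects:      ~$commit_bytes bytes"
echo "top-commit snapshot: ~$snapshot_bytes bytes"
echo "critical download:   ~$((commit_bytes + snapshot_bytes)) bytes"
```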
So... if we reorder objects, all that we can do is to spread commit
objects around so that the objects referenced by one commit are all seen
before another commit object is included. That would cut down on that
initial 32 MB.
However you still have to get that 75 MB in order to at least be able to
look at _one_ commit. So you've only reduced your critical download
size from 107 MB to 75 MB. This is some improvement, of course, but not
worth the bother IMHO. If we're to have restartable clone, it has to
work for any size.
And that's where the real problem is. I don't think having servers
cache pack results for every fetch request is sensible, as that would be
an immediate DoS attack vector.
And because the object order in a pack is not defined by the protocol,
we cannot expect the server to necessarily always provide the same
object order either. For example, it is already undefined in which
order you'll receive objects, as threaded delta search is
non-deterministic and two identical fetch requests may end up with slightly
different packing. Or load balancing may redirect your fetch requests
to different git servers which might have different versions of zlib, or
even git itself, affecting the object packing order and/or size.
Now... What _could_ be done, though, is some extension to the
git-archive command. One thing that is well and strictly defined in git
is the file path sort order. So given a commit SHA1, you should always
get the same files in the same order from git-archive. For an initial
clone, git could attempt fetching the top commit using the remote
git-archive service and locally reconstruct that top commit that way.
If the transfer is interrupted in the middle, then the remote
git-archive could be told how to resume the transfer by telling it how
many files and how many bytes in the current file to skip. This way the
server doesn't need to perform any sort of caching and remains
stateless.
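The resume step can be illustrated with plain shell, using a local file
to stand in for the remote git-archive stream (the server-side extension
itself does not exist; `tail -c` merely plays the server's "skip" role
here):

```shell
# Illustration: byte-offset resume of an interrupted archive transfer.
seq 1 2000 > archive.tar                 # stand-in for the archive payload
head -c 1000 archive.tar > partial.tar   # transfer cut off after 1000 bytes

got=$(wc -c < partial.tar)               # client reports how much it already has
tail -c +$((got + 1)) archive.tar >> partial.tar  # "server" skips that many bytes

cmp -s archive.tar partial.tar && echo "resume complete"
```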
You then end up with a pretty shallow repository. The clone process
could then fall back to the traditional native git transfer protocol to
deepen the history of that shallow repository. And then that special
packing sort order to distribute commit objects would make sense since
each commit would then have a fairly small set of new objects, and most
of them would be deltas anyway, making the data size per commit really
small and any interrupted transfer much less of an issue.
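This two-stage scheme (prime the top commit, then deepen) maps roughly
onto the shallow-clone machinery; a sketch under that assumption, with
`--depth=1` standing in for the proposed resumable git-archive priming
step, and a placeholder URL:

```shell
# Stage 1: get just the top commit. The proposal would use a resumable
# git-archive transfer here; a depth-1 shallow clone stands in for the
# result it would produce.
git clone --depth=1 git://example.com/repo.git repo
cd repo

# Stage 2: deepen the shallow history over the native protocol,
# incrementally pulling in the older (mostly delta) objects.
git fetch --depth=1000000 origin
```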
Nicolas
* Re: Continue git clone after interruption
2009-08-18 17:56 ` Nicolas Pitre
@ 2009-08-18 18:45 ` Jakub Narebski
2009-08-18 20:01 ` Nicolas Pitre
2009-08-19 4:42 ` Sitaram Chamarty
0 siblings, 2 replies; 39+ messages in thread
From: Jakub Narebski @ 2009-08-18 18:45 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Tomasz Kontusz, git
Nicolas Pitre <nico@cam.org> writes:
> On Tue, 18 Aug 2009, Tomasz Kontusz wrote:
>
> > Ok, so it looks like it's not implementable without some kind of cache
> > server-side, so the server would know what the pack it was sending
> > looked like.
> > But here's my idea: make server send objects in different order (the
> > newest commit + whatever it points to first, then next one,then
> > another...). Then it would be possible to look at what we got, tell
> > server we have nothing, and want [the newest commit that was not
> > complete]. I know the reason why it is sorted the way it is, but I think
> > that the way data is stored after clone is clients problem, so the
> > client should reorganize packs the way it wants.
>
> That won't buy you much. You should realize that a pack is made of:
>
> 1) Commit objects. Yes they're all put together at the front of the pack,
> but they roughly are the equivalent of:
>
> git log --pretty=raw | gzip | wc -c
>
> For the Linux repo as of now that is around 32 MB.
For my clone of Git repository this gives 3.8 MB
> 2) Tree and blob objects. Those are the bulk of the content for the top
> commit. The top commit is usually not delta compressed because we
> want fast access to the top commit, and that is used as the base for
> further delta compression for older commits. So the very first
> commit is whole at the front of the pack right after the commit
> objects. you can estimate the size of this data with:
>
> git archive --format=tar HEAD | gzip | wc -c
>
> On the same Linux repo this is currently 75 MB.
On the same Git repository this gives 2.5 MB
>
> 3) Delta objects. Those are making the rest of the pack, plus a couple
> tree/blob objects that were not found in the top commit and are
> different enough from any object in that top commit not to be
> represented as deltas. Still, the majority of objects for all the
> remaining commits are delta objects.
You forgot that delta chains are bound by pack.depth limit, which
defaults to 50. You would have then additional full objects.
The single packfile for this (just gc'ed) Git repository is 37 MB.
Much more than 3.8 MB + 2.5 MB = 6.3 MB.
[cut]
There is another way we could go to implement resumable clone.
Let git first try to clone the whole repository (single pack; BTW, what
happens if this pack is larger than the file size limit of a given
filesystem?). If that fails, the client first asks for the first half of
the repository (half as in bisect, but it is the server that has to
calculate it). If that downloads, it asks the server for the rest of the
repository. If that fails too, it reduces the size by half again, and
first asks for 1/4 of the repository in a packfile.
The only extension required is for the server to support an additional
capability, which enables the client to ask for the appropriate 1/2^n
part of the repository (approximately), or 1/2^n between have and want.
--
Jakub Narebski
Poland
ShadeHawk on #git
* Re: Continue git clone after interruption
2009-08-18 18:45 ` Jakub Narebski
@ 2009-08-18 20:01 ` Nicolas Pitre
2009-08-18 21:02 ` Jakub Narebski
2009-08-18 22:28 ` Johannes Schindelin
2009-08-19 4:42 ` Sitaram Chamarty
1 sibling, 2 replies; 39+ messages in thread
From: Nicolas Pitre @ 2009-08-18 20:01 UTC (permalink / raw)
To: Jakub Narebski; +Cc: Tomasz Kontusz, git
On Tue, 18 Aug 2009, Jakub Narebski wrote:
> Nicolas Pitre <nico@cam.org> writes:
>
> > On Tue, 18 Aug 2009, Tomasz Kontusz wrote:
> >
> > > Ok, so it looks like it's not implementable without some kind of cache
> > > server-side, so the server would know what the pack it was sending
> > > looked like.
> > > But here's my idea: make server send objects in different order (the
> > > newest commit + whatever it points to first, then next one,then
> > > another...). Then it would be possible to look at what we got, tell
> > > server we have nothing, and want [the newest commit that was not
> > > complete]. I know the reason why it is sorted the way it is, but I think
> > > that the way data is stored after clone is clients problem, so the
> > > client should reorganize packs the way it wants.
> >
> > That won't buy you much. You should realize that a pack is made of:
> >
> > 1) Commit objects. Yes they're all put together at the front of the pack,
> > but they roughly are the equivalent of:
> >
> > git log --pretty=raw | gzip | wc -c
> >
> > For the Linux repo as of now that is around 32 MB.
>
> For my clone of Git repository this gives 3.8 MB
>
> > 2) Tree and blob objects. Those are the bulk of the content for the top
> > commit. The top commit is usually not delta compressed because we
> > want fast access to the top commit, and that is used as the base for
> > further delta compression for older commits. So the very first
> > commit is whole at the front of the pack right after the commit
> > objects. you can estimate the size of this data with:
> >
> > git archive --format=tar HEAD | gzip | wc -c
> >
> > On the same Linux repo this is currently 75 MB.
>
> On the same Git repository this gives 2.5 MB
Interesting to see that the commit history is larger than the latest
source tree. Probably that would be the same with the Linux kernel as
well if all versions since the beginning with adequate commit logs were
included in the repo.
> > 3) Delta objects. Those are making the rest of the pack, plus a couple
> > tree/blob objects that were not found in the top commit and are
> > different enough from any object in that top commit not to be
> > represented as deltas. Still, the majority of objects for all the
> > remaining commits are delta objects.
>
> You forgot that delta chains are bound by pack.depth limit, which
> defaults to 50. You would have then additional full objects.
Sure, but that's probably not significant. The delta chain depth is
limited, but not the width. A given base object can have unlimited
delta "children", and so on at each depth level.
> The single packfile for this (just gc'ed) Git repository is 37 MB.
> Much more than 3.8 MB + 2.5 MB = 6.3 MB.
What I'm saying is that most of that 37 MB - 6.3 MB = 31 MB is likely to
be occupied by deltas.
> [cut]
>
> There is another way which we can go to implement resumable clone.
> Let's git first try to clone whole repository (single pack; BTW what
> happens if this pack is larger than file size limit for given
> filesystem?).
We currently fail. Seems that no one ever had a problem with that so
far. We'd have to split the pack stream into multiple packs on the
receiving end. But frankly, if you have a repository large enough to
bust your filesystem's file size limit then maybe you should seriously
reconsider your choice of development environment.
> If it fails, client ask first for first half of of
> repository (half as in bisect, but it is server that has to calculate
> it). If it downloads, it will ask server for the rest of repository.
> If it fails, it would reduce size in half again, and ask about 1/4 of
> repository in packfile first.
The problem people with slow links have won't be helped at all by this.
What if the network connection gets broken only after 49% of the
transfer, and that took 3 hours to download? You'll attempt a 25%-size
transfer which would take 1.5 hours, despite the fact that you already
spent that much time downloading the first 1/4 of the repository. And
yet what if you're unlucky and now the network craps on you after 23%
of that second attempt?
I think it is better to "prime" the repository with the content of the
top commit in the most straightforward manner using git-archive, which
has the potential to be fully restartable at any point with little
complexity on the server side.
Nicolas
* Re: Continue git clone after interruption
2009-08-18 20:01 ` Nicolas Pitre
@ 2009-08-18 21:02 ` Jakub Narebski
2009-08-18 21:32 ` Nicolas Pitre
2009-08-18 22:28 ` Johannes Schindelin
1 sibling, 1 reply; 39+ messages in thread
From: Jakub Narebski @ 2009-08-18 21:02 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Tomasz Kontusz, git
On Tue, 18 Aug 2009, Nicolas Pitre wrote:
> On Tue, 18 Aug 2009, Jakub Narebski wrote:
>> Nicolas Pitre <nico@cam.org> writes:
>>> That won't buy you much. You should realize that a pack is made of:
>>>
>>> 1) Commit objects. Yes they're all put together at the front of the pack,
>>> but they roughly are the equivalent of:
>>>
>>> git log --pretty=raw | gzip | wc -c
>>>
>>> For the Linux repo as of now that is around 32 MB.
>>
>> For my clone of Git repository this gives 3.8 MB
>>
>>> 2) Tree and blob objects. Those are the bulk of the content for the top
>>> commit. [...] You can estimate the size of this data with:
>>>
>>> git archive --format=tar HEAD | gzip | wc -c
>>>
>>> On the same Linux repo this is currently 75 MB.
>>
>> On the same Git repository this gives 2.5 MB
>
> Interesting to see that the commit history is larger than the latest
> source tree. Probably that would be the same with the Linux kernel as
> well if all versions since the beginning with adequate commit logs were
> included in the repo.
Note that having reflogs and/or a patch management interface like StGit,
and frequently reworking commits (e.g. using rebase), means more commit
objects in the repository.
Also the Git repository has 3 independent branches: 'man', 'html' and
'todo', whose objects are not included in "git archive HEAD".
>
>>> 3) Delta objects. Those are making the rest of the pack, plus a couple
>>> tree/blob objects that were not found in the top commit and are
>>> different enough from any object in that top commit not to be
>>> represented as deltas. Still, the majority of objects for all the
>>> remaining commits are delta objects.
>>
>> You forgot that delta chains are bound by pack.depth limit, which
>> defaults to 50. You would have then additional full objects.
>
> Sure, but that's probably not significant. the delta chain depth is
> limited, but not the width. A given base object can have unlimited
> delta "children", and so on at each depth level.
You can probably get the number and size taken by delta and non-delta
(base) objects in the packfile somehow. Neither "git verify-pack -v
<packfile>" nor contrib/stats/packinfo.pl helped me arrive at this data.
>> The single packfile for this (just gc'ed) Git repository is 37 MB.
>> Much more than 3.8 MB + 2.5 MB = 6.3 MB.
>
> What I'm saying is that most of that 37 MB - 6.3 MB = 31 MB is likely to
> be occupied by deltas.
True.
>> [cut]
>>
>> There is another way which we can go to implement resumable clone.
>> Let's git first try to clone whole repository (single pack; BTW what
>> happens if this pack is larger than file size limit for given
>> filesystem?).
>
> We currently fail. Seems that no one ever had a problem with that so
> far. We'd have to split the pack stream into multiple packs on the
> receiving end. But frankly, if you have a repository large enough to
> bust your filesystem's file size limit then maybe you should seriously
> reconsider your choice of development environment.
Do we fail gracefully (with an error message), or does git crash then?
If I remember correctly FAT28^W FAT32 has a maximum file size of 2 GB.
FAT is often used on SSDs and USB drives. Although if you have a 2 GB
packfile, you are doing something wrong, or UGFWIINI (Using Git For
What It Is Not Intended).
>> If it fails, client ask first for first half of of
>> repository (half as in bisect, but it is server that has to calculate
>> it). If it downloads, it will ask server for the rest of repository.
>> If it fails, it would reduce size in half again, and ask about 1/4 of
>> repository in packfile first.
>
> Problem people with slow links have won't be helped at all with this.
> What if the network connection gets broken only after 49% of the
> transfer and that took 3 hours to download? You'll attempt a 25% size
> transfer which would take 1.5 hour despite the fact that you already
> spent that much time downloading that first 1/4 of the repository
> already. And yet what if you're unlucky and now the network craps on
> you after 23% of that second attempt?
A modification, then.
First try an ordinary clone. If it fails because the network is
unreliable, check how much we did download, and ask the server for a
packfile of slightly smaller size; this means that we are asking the
server for an approximate pack size limit, not for bisect-like
partitioning of the revision list.
> I think it is better to "prime" the repository with the content of the
> top commit in the most straight forward manner using git-archive which
> has the potential to be fully restartable at any point with little
> complexity on the server side.
But doesn't that make fully restartable only the 2.5 MB part of a 37 MB packfile?
A question about pack protocol negotiation. If a client presents some
objects as "have", the server can and does assume that the client has
all prerequisites for such objects, e.g. for a tree object that it has
all objects for the files and directories inside the tree; for a commit
it means all ancestors and all objects in its snapshot (it has the top
tree, and its prerequisites). Do I understand this correctly?
If we have a partial packfile from a download that crashed, can we
extract some full objects (including blobs) from it? Can we pass tree
and blob objects as "have" to the server, and is that taken into
account? Perhaps instead of a separate step of resumably downloading the
top commit's objects (the snapshot), we can tell the server what we did
download in full?
BTW, because of compression it might be more difficult to resume
archive creation in the middle, I think...
--
Jakub Narebski
Poland
* Re: Continue git clone after interruption
2009-08-18 21:02 ` Jakub Narebski
@ 2009-08-18 21:32 ` Nicolas Pitre
2009-08-19 15:19 ` Jakub Narebski
0 siblings, 1 reply; 39+ messages in thread
From: Nicolas Pitre @ 2009-08-18 21:32 UTC (permalink / raw)
To: Jakub Narebski; +Cc: Tomasz Kontusz, git
On Tue, 18 Aug 2009, Jakub Narebski wrote:
> You can probably get number and size taken by delta and non-delta (base)
> objects in the packfile somehow. Neither "git verify-pack -v <packfile>"
> nor contrib/stats/packinfo.pl did help me arrive at this data.
Documentation for verify-pack says:
|When specifying the -v option the format used is:
|
| SHA1 type size size-in-pack-file offset-in-packfile
|
|for objects that are not deltified in the pack, and
|
| SHA1 type size size-in-packfile offset-in-packfile depth base-SHA1
|
|for objects that are deltified.
So a simple script should be able to give you the answer.
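A minimal sketch of such a script (the pack path is a placeholder;
field counts follow the documented format above: 5 fields for a full
object, 7 for a deltified one):

```shell
# Summarize `git verify-pack -v` output into delta vs. non-delta totals.
git verify-pack -v .git/objects/pack/pack-*.idx |
awk '$2 ~ /^(commit|tree|blob|tag)$/ {
    if (NF == 5) { full++;  full_bytes  += $4 }   # non-delta objects
    else         { delta++; delta_bytes += $4 }   # delta objects (depth + base)
} END {
    printf "full:  %d objects, %d bytes in pack\n", full,  full_bytes
    printf "delta: %d objects, %d bytes in pack\n", delta, delta_bytes
}'
```

(`$4` is the size-in-packfile column, so the two totals add up to the
object payload of the pack.)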
> >> (BTW what happens if this pack is larger than file size limit for
> >> given filesystem?).
> >
> > We currently fail. Seems that no one ever had a problem with that so
> > far. We'd have to split the pack stream into multiple packs on the
> > receiving end. But frankly, if you have a repository large enough to
> > bust your filesystem's file size limit then maybe you should seriously
> > reconsider your choice of development environment.
>
> Do we fail gracefully (with an error message), or does git crash then?
If the filesystem is imposing the limit, it will likely return an error
on the write() call and we'll die(). If the machine has a too small
off_t for the received pack then we also die("pack too large for current
definition of off_t").
> If I remember correctly FAT28^W FAT32 has maximum file size of 2 GB.
> FAT is often used on SSD, on USB drive. Although if you have 2 GB
> packfile, you are doing something wrong, or UGFWIINI (Using Git For
> What It Is Not Intended).
Hopefully you're not performing a 'git clone' off of a FAT filesystem.
For physical transport you may repack with the appropriate switches.
> >> If it fails, client ask first for first half of of
> >> repository (half as in bisect, but it is server that has to calculate
> >> it). If it downloads, it will ask server for the rest of repository.
> >> If it fails, it would reduce size in half again, and ask about 1/4 of
> >> repository in packfile first.
> >
> > Problem people with slow links have won't be helped at all with this.
> > What if the network connection gets broken only after 49% of the
> > transfer and that took 3 hours to download? You'll attempt a 25% size
> > transfer which would take 1.5 hour despite the fact that you already
> > spent that much time downloading that first 1/4 of the repository
> > already. And yet what if you're unlucky and now the network craps on
> > you after 23% of that second attempt?
>
> A modification then.
>
> First try ordinary clone. If it fails because network is unreliable,
> check how much we did download, and ask server for packfile of slightly
> smaller size; this means that we are asking server for approximate pack
> size limit, not for bisect-like partitioning revision list.
If the download didn't reach past the critical point (75 MB in my Linux
repo example) then you cannot validate the received data and you've
wasted that much bandwidth.
> > I think it is better to "prime" the repository with the content of the
> > top commit in the most straight forward manner using git-archive which
> > has the potential to be fully restartable at any point with little
> > complexity on the server side.
>
> But didn't it make fully restartable 2.5 MB part out of 37 MB packfile?
The front of the pack is the critical point. If you get enough to
create the top commit then further transfers can be done incrementally
with only the deltas between each commit.
> A question about pack protocol negotiation. If clients presents some
> objects as "have", server can and does assume that client has all
> prerequisites for such objects, e.g. for tree objects that it has
> all objects for files and directories inside tree; for commit it means
> all ancestors and all objects in snapshot (have top tree, and its
> prerequisites). Do I understand this correctly?
That works only for commits.
> If we have partial packfile which crashed during downloading, can we
> extract from it some full objects (including blobs)? Can we pass
> tree and blob objects as "have" to server, and is it taken into account?
No.
> Perhaps instead of separate step of resumable-downloading of top commit
> objects (in snapshot), we can pass to server what we did download in
> full?
See above.
> BTW. because of compression it might be more difficult to resume
> archive creation in the middle, I think...
Why so? The tar+gzip format is streamable.
Nicolas
* Re: Continue git clone after interruption
2009-08-18 20:01 ` Nicolas Pitre
2009-08-18 21:02 ` Jakub Narebski
@ 2009-08-18 22:28 ` Johannes Schindelin
2009-08-18 23:40 ` Nicolas Pitre
1 sibling, 1 reply; 39+ messages in thread
From: Johannes Schindelin @ 2009-08-18 22:28 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Jakub Narebski, Tomasz Kontusz, git
Hi,
On Tue, 18 Aug 2009, Nicolas Pitre wrote:
> On Tue, 18 Aug 2009, Jakub Narebski wrote:
>
> > There is another way which we can go to implement resumable clone.
> > Let's git first try to clone whole repository (single pack; BTW what
> > happens if this pack is larger than file size limit for given
> > filesystem?).
>
> We currently fail. Seems that no one ever had a problem with that so
> far.
They just went away, most probably.
But seriously, I miss a very important idea in this discussion: we control
the Git source code. So we _can_ add an upload_pack feature that a client
can ask for after the first failed attempt.
Ciao,
Dscho
* Re: Continue git clone after interruption
2009-08-18 22:28 ` Johannes Schindelin
@ 2009-08-18 23:40 ` Nicolas Pitre
2009-08-19 7:35 ` Johannes Schindelin
0 siblings, 1 reply; 39+ messages in thread
From: Nicolas Pitre @ 2009-08-18 23:40 UTC (permalink / raw)
To: Johannes Schindelin; +Cc: Jakub Narebski, Tomasz Kontusz, git
On Wed, 19 Aug 2009, Johannes Schindelin wrote:
> Hi,
>
> On Tue, 18 Aug 2009, Nicolas Pitre wrote:
>
> > On Tue, 18 Aug 2009, Jakub Narebski wrote:
> >
> > > There is another way which we can go to implement resumable clone.
> > > Let's git first try to clone whole repository (single pack; BTW what
> > > happens if this pack is larger than file size limit for given
> > > filesystem?).
> >
> > We currently fail. Seems that no one ever had a problem with that so
> > far.
>
> They just went away, most probably.
Most probably they simply don't exist. I would be highly surprised
otherwise.
> But seriously, I miss a very important idea in this discussion: we control
> the Git source code. So we _can_ add a upload_pack feature that a client
> can ask for after the first failed attempt.
Indeed. So what do you think about my proposal? It was included in my
first reply to this thread.
Nicolas
* Re: Continue git clone after interruption
2009-08-18 18:45 ` Jakub Narebski
2009-08-18 20:01 ` Nicolas Pitre
@ 2009-08-19 4:42 ` Sitaram Chamarty
2009-08-19 9:53 ` Jakub Narebski
1 sibling, 1 reply; 39+ messages in thread
From: Sitaram Chamarty @ 2009-08-19 4:42 UTC (permalink / raw)
To: Jakub Narebski; +Cc: Nicolas Pitre, Tomasz Kontusz, git
On Wed, Aug 19, 2009 at 12:15 AM, Jakub Narebski <jnareb@gmail.com> wrote:
> There is another way which we can go to implement resumable clone.
> Let's git first try to clone whole repository (single pack; BTW what
> happens if this pack is larger than file size limit for given
> filesystem?). If it fails, client ask first for first half of of
> repository (half as in bisect, but it is server that has to calculate
> it). If it downloads, it will ask server for the rest of repository.
> If it fails, it would reduce size in half again, and ask about 1/4 of
> repository in packfile first.
How about an extension where the user can *ask* for a clone of a
particular HEAD to be sent to him as a git bundle? Or particular
revisions (say once a week) were kept as a single file git-bundle,
made available over HTTP -- easily restartable with byte-range -- and
anyone who has bandwidth problems first gets that, then changes the
origin remote URL and does a "pull" to get up to date?
I've done this manually a few times when sneakernet bandwidth was
better than the normal kind, heh, but it seems to me the lowest impact
solution.
Yes you'd need some extra space on the server, but you keep only one
bundle, and maybe replace it every week by cron. Should work fine
right now, as is, with a wee bit of manual work by the user, and a
quick cron entry on the server
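The manual workflow described above can be sketched end to end with a
local stand-in for the published bundle (all paths and names here are
illustrative; over HTTP the download step would be a resumable
"curl -C - -O <url>"):

```shell
set -e
cd "$(mktemp -d)"

# Stand-in for the server repository the cron job would bundle.
git init -q server
git -C server -c user.name=t -c user.email=t@example.com \
    commit -q --allow-empty -m initial

# What would be published weekly as a static, byte-range-servable file:
git -C server bundle create latest.bundle HEAD --all

# Client side: clone from the (resumably downloaded) bundle, then
# repoint origin at the live repository and fetch what the bundle lacks.
git clone -q server/latest.bundle project
git -C project remote set-url origin "$PWD/server"
git -C project fetch -q origin
git -C project log --oneline -1
```

The only part that needs the network to behave is the static-file
download, which any web server can already resume.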
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Continue git clone after interruption
2009-08-18 23:40 ` Nicolas Pitre
@ 2009-08-19 7:35 ` Johannes Schindelin
2009-08-19 8:25 ` Nguyen Thai Ngoc Duy
2009-08-19 17:21 ` Nicolas Pitre
0 siblings, 2 replies; 39+ messages in thread
From: Johannes Schindelin @ 2009-08-19 7:35 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Jakub Narebski, Tomasz Kontusz, git
Hi,
On Tue, 18 Aug 2009, Nicolas Pitre wrote:
> On Wed, 19 Aug 2009, Johannes Schindelin wrote:
>
> > But seriously, I miss a very important idea in this discussion: we
> > control the Git source code. So we _can_ add a upload_pack feature
> > that a client can ask for after the first failed attempt.
>
> Indeed. So what do you think about my proposal? It was included in my
> first reply to this thread.
Did you not talk about an extension of the archive protocol? That's not
what I meant. The archive protocol can be disabled for completely
different reasons than to prevent restartable clones.
But you brought up an important point: shallow repositories.
Now, the problem, of course, is that if you cannot even get a single ref
(shallow'ed to depth 0 -- which reminds me: I think I promised to fix
that, but I did not do that yet) due to intermittent network failures, you
are borked, as you said.
But here comes an idea: together with Nguyễn's sparse series, it is
conceivable that we support a shallow & narrow clone via the upload-pack
protocol (also making mithro happy). The problem with narrow clones was
not the pack generation side, that is done by a rev-list that can be
limited to certain paths. The problem was that we end up with missing
tree objects. However, if we can make a sparse checkout, we can avoid
the problem.
Note: this is not well thought-through, but just a brainstorm-like answer
to your ideas.
Ciao,
Dscho "who should shut up now and get some work done instead ;-)"
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Continue git clone after interruption
2009-08-19 7:35 ` Johannes Schindelin
@ 2009-08-19 8:25 ` Nguyen Thai Ngoc Duy
2009-08-19 9:52 ` Johannes Schindelin
2009-08-19 17:21 ` Nicolas Pitre
1 sibling, 1 reply; 39+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2009-08-19 8:25 UTC (permalink / raw)
To: Johannes Schindelin; +Cc: Nicolas Pitre, Jakub Narebski, Tomasz Kontusz, git
On Wed, Aug 19, 2009 at 2:35 PM, Johannes
Schindelin<Johannes.Schindelin@gmx.de> wrote:
> But here comes an idea: together with Nguy要's sparse series, it is
FWIW, you can write "Nguyen" instead. It might save you one copy/paste
(I take it you don't have a Vietnamese IM ;-)
> conceivable that we support a shallow & narrow clone via the upload-pack
> protocol (also making mithro happy). The problem with narrow clones was
> not the pack generation side, that is done by a rev-list that can be
> limited to certain paths. The problem was that we end up with missing
> tree objects. However, if we can make a sparse checkout, we can avoid
> the problem.
But then git-fsck, git-archive... will die?
--
Duy
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Continue git clone after interruption
2009-08-19 8:25 ` Nguyen Thai Ngoc Duy
@ 2009-08-19 9:52 ` Johannes Schindelin
0 siblings, 0 replies; 39+ messages in thread
From: Johannes Schindelin @ 2009-08-19 9:52 UTC (permalink / raw)
To: Nguyen Thai Ngoc Duy; +Cc: Nicolas Pitre, Jakub Narebski, Tomasz Kontusz, git
Hi,
On Wed, 19 Aug 2009, Nguyen Thai Ngoc Duy wrote:
> On Wed, Aug 19, 2009 at 2:35 PM, Johannes
> Schindelin<Johannes.Schindelin@gmx.de> wrote:
> > But here comes an idea: together with Nguy要's sparse series, it is
>
> FWIW, you can write "Nguyen" instead. It might save you one copy/paste
> (I take it you don't have a Vietnamese IM ;-)
FWIW I originally wrote Nguyễn (not that Chinese(?) character)... I look
it up everytime I want to write your name by searching my address book for
"pclouds". ;-)
> > conceivable that we support a shallow & narrow clone via the
> > upload-pack protocol (also making mithro happy). The problem with
> > narrow clones was not the pack generation side, that is done by a
> > rev-list that can be limited to certain paths. The problem was that
> > we end up with missing tree objects. However, if we can make a sparse
> > checkout, we can avoid the problem.
>
> But then git-fsck, git-archive... will die?
Oh, but they should be made aware of the narrow clone, just like for
shallow clones.
Ciao,
Dscho
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Continue git clone after interruption
2009-08-19 4:42 ` Sitaram Chamarty
@ 2009-08-19 9:53 ` Jakub Narebski
0 siblings, 0 replies; 39+ messages in thread
From: Jakub Narebski @ 2009-08-19 9:53 UTC (permalink / raw)
To: Sitaram Chamarty; +Cc: Nicolas Pitre, Tomasz Kontusz, git
On Wed, Aug 19, 2009, Sitaram Chamarty wrote:
> On Wed, Aug 19, 2009 at 12:15 AM, Jakub Narebski<jnareb@gmail.com> wrote:
> > There is another way which we can go to implement resumable clone.
> > Let's git first try to clone whole repository (single pack; BTW what
> > happens if this pack is larger than file size limit for given
> > filesystem?). If it fails, client ask first for first half of
> > repository (half as in bisect, but it is server that has to calculate
> > it). If it downloads, it will ask server for the rest of repository.
> > If it fails, it would reduce size in half again, and ask about 1/4 of
> > repository in packfile first.
>
> How about an extension where the user can *ask* for a clone of a
> particular HEAD to be sent to him as a git bundle? Or particular
> revisions (say once a week) were kept as a single file git-bundle,
> made available over HTTP -- easily restartable with byte-range -- and
> anyone who has bandwidth problems first gets that, then changes the
> origin remote URL and does a "pull" to get up to date?
>
> I've done this manually a few times when sneakernet bandwidth was
> better than the normal kind, heh, but it seems to me the lowest impact
> solution.
>
> Yes you'd need some extra space on the server, but you keep only one
> bundle, and maybe replace it every week by cron. Should work fine
> right now, as is, with a wee bit of manual work by the user, and a
> quick cron entry on the server
This is a good idea, i think, and it can be implemented with various
amount of effort and changes to git, and various amount of seamless
integration.
1. Simplest solution: social (homepage). Not integrated at all.
On the project's homepage, the one that describes where the project
repository is and how to get it, you add a link to the most recent bundle
(perhaps in addition to the most recent snapshot). This bundle would be
served as a static file via HTTP (and perhaps also FTP) by (any) web
server that supports resuming (range requests). Or you can make the
server generate bundles on demand, only when they are first requested.
"Most recent" might mean the latest tagged release, or it might mean a
daily snapshot^W bundle.
This solution could be integrated into gitweb, either by generic
'latest bundle' link in project's README.html (or in site's
GITWEB_HOMETEXT, default indextext.html), or by having gitweb
generate those links (and perhaps bundles as well) by itself.
2. Seamless solution: 'bundle' or 'bundles' capability. Requires
changes to both server and client.
If the server supports (advertises) the 'bundle' capability, it can serve
a list of bundles (as HTTP / FTP / rsync URLs), either at the client's
request or after (or before) the list of refs if the client requests the
'bundle' capability.
If the client supports the 'bundles' capability, it terminates the
connection to sshd or git-daemon and does an ordinary resumable HTTP
fetch using libcurl. After the bundle is fully downloaded, it clones
from the bundle and does a git-fetch against the same server as before,
which would then have less to transfer. The client also has to handle
the situation where the bundle download is interrupted, and not clean up,
allowing for "git clone --continue".
3. Seamless solution: GitTorrent or its simplification: git mirror-sync.
I think that GitTorrent (see http://git.or.cz/gitwiki/SoC2009Ideas)
or even its simplification git-mirror-sync would include restartable
cloning. It is even among its intended features. Also this would
help to download faster via mirrors, which can have faster and better
network connections.
But this would be most work.
You can implement solution 1. even now...
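For solution 1, the server side really is just a cron entry plus a
web-served directory; an illustrative crontab line (all paths
hypothetical), writing to a temporary name first so HTTP clients never
see a half-written bundle:

```shell
# m h dom mon dow  command  -- rebuild the published bundle every Sunday
0 3 * * 0  git --git-dir=/srv/git/project.git bundle create /var/www/dl/project.bundle.new HEAD --all && mv /var/www/dl/project.bundle.new /var/www/dl/project.bundle
```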
--
Jakub Narebski
Poland
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Continue git clone after interruption
2009-08-18 21:32 ` Nicolas Pitre
@ 2009-08-19 15:19 ` Jakub Narebski
2009-08-19 19:04 ` Nicolas Pitre
0 siblings, 1 reply; 39+ messages in thread
From: Jakub Narebski @ 2009-08-19 15:19 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Tomasz Kontusz, git
On Tue, 18 Aug 2009, Nicolas Pitre wrote:
> On Tue, 18 Aug 2009, Jakub Narebski wrote:
>
>> You can probably get number and size taken by delta and non-delta (base)
>> objects in the packfile somehow. Neither "git verify-pack -v <packfile>"
>> nor contrib/stats/packinfo.pl did help me arrive at this data.
>
> Documentation for verify-pack says:
>
> |When specifying the -v option the format used is:
> |
> | SHA1 type size size-in-pack-file offset-in-packfile
> |
> |for objects that are not deltified in the pack, and
> |
> | SHA1 type size size-in-packfile offset-in-packfile depth base-SHA1
> |
> |for objects that are deltified.
>
> So a simple script should be able to give you the answer.
Thanks.
There are 114937 objects in this packfile, including 56249 objects
used as base (can be deltified or not). git-verify-pack -v shows
that all objects have total size-in-packfile of 33 MB (which agrees
with packfile size of 33 MB), with 17 MB size-in-packfile taken by
deltified objects, and 16 MB taken by base objects.
git verify-pack -v |
grep -v "^chain" |
grep -v "objects/pack/pack-" > verify-pack.out
sum=0; bsum=0; dsum=0;
while read sha1 type size packsize off depth base; do
echo "$sha1" >> verify-pack.sha1.out
sum=$(( $sum + $packsize ))
if [ -n "$base" ]; then
echo "$sha1" >> verify-pack.delta.out
dsum=$(( $dsum + $packsize ))
else
echo "$sha1" >> verify-pack.base.out
bsum=$(( $bsum + $packsize ))
fi
done < verify-pack.out
echo "sum=$sum; bsum=$bsum; dsum=$dsum"
>>>> (BTW what happens if this pack is larger than file size limit for
>>>> given filesystem?).
[...]
>> If I remember correctly FAT28^W FAT32 has maximum file size of 2 GB.
>> FAT is often used on SSD, on USB drive. Although if you have 2 GB
>> packfile, you are doing something wrong, or UGFWIINI (Using Git For
>> What It Is Not Intended).
>
> Hopefully you're not performing a 'git clone' off of a FAT filesystem.
> For physical transport you may repack with the appropriate switches.
Not off a FAT filesystem, but into a FAT filesystem.
[...]
>>> I think it is better to "prime" the repository with the content of the
>>> top commit in the most straight forward manner using git-archive which
>>> has the potential to be fully restartable at any point with little
>>> complexity on the server side.
>>
>> But didn't it make fully restartable 2.5 MB part out of 37 MB packfile?
>
> The front of the pack is the critical point. If you get enough to
> create the top commit then further transfers can be done incrementally
> with only the deltas between each commits.
How? You have some objects that can be used as base; how to tell
git-daemon that we have them (but not their prerequisites), and how
to generate incrementals?
>> A question about pack protocol negotiation. If the client presents some
>> objects as "have", server can and does assume that client has all
>> prerequisites for such objects, e.g. for tree objects that it has
>> all objects for files and directories inside tree; for commit it means
>> all ancestors and all objects in snapshot (have top tree, and its
>> prerequisites). Do I understand this correctly?
>
> That works only for commits.
Hmmmm... how do you intend for "prefetch top objects restartable-y first"
to work, then?
>> BTW. because of compression it might be more difficult to resume
>> archive creation in the middle, I think...
>
> Why so? the tar+gzip format is streamable.
gzip format uses sliding window in compression. "cat a b | gzip"
is different from "cat <(gzip a) <(gzip b)".
But that doesn't matter. If we are interrupted in the middle, we can
uncompress what we have to check how far we got, and tell the server
to send the rest; this way the server wouldn't even have to generate
(let alone send) what we already got in the partial transfer.
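That difference is easy to demonstrate (gzip -n keeps timestamps out of
the output; note that gunzip happily decompresses concatenated members,
so only the compressed bytes differ, not the recovered data):

```shell
set -e
cd "$(mktemp -d)"
printf 'hello hello hello\n' > a
printf 'hello hello hello\n' > b

cat a b | gzip -n > whole.gz            # one stream, one sliding window
{ gzip -nc a; gzip -nc b; } > parts.gz  # two independent streams

cmp -s whole.gz parts.gz && echo same || echo different   # prints "different"

# ...yet both decompress to the same bytes:
gunzip -c parts.gz > joined
cat a b | cmp -s joined - && echo "decompressed bytes identical"
```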
P.S. What do you think about 'bundle' capability extension mentioned
in a side sub-thread?
--
Jakub Narebski
Poland
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Continue git clone after interruption
2009-08-19 7:35 ` Johannes Schindelin
2009-08-19 8:25 ` Nguyen Thai Ngoc Duy
@ 2009-08-19 17:21 ` Nicolas Pitre
2009-08-19 22:23 ` René Scharfe
1 sibling, 1 reply; 39+ messages in thread
From: Nicolas Pitre @ 2009-08-19 17:21 UTC (permalink / raw)
To: Johannes Schindelin; +Cc: Jakub Narebski, Tomasz Kontusz, git
On Wed, 19 Aug 2009, Johannes Schindelin wrote:
> Hi,
>
> On Tue, 18 Aug 2009, Nicolas Pitre wrote:
>
> > On Wed, 19 Aug 2009, Johannes Schindelin wrote:
> >
> > > But seriously, I miss a very important idea in this discussion: we
> > > control the Git source code. So we _can_ add a upload_pack feature
> > > that a client can ask for after the first failed attempt.
> >
> > Indeed. So what do you think about my proposal? It was included in my
> > first reply to this thread.
>
> Did you not talk about an extension of the archive protocol? That's not
> what I meant. The archive protocol can be disabled for completely
> different reasons than to prevent restartable clones.
And those reasons are?
> But you brought up an important point: shallow repositories.
>
> Now, the problem, of course, is that if you cannot even get a single ref
> (shallow'ed to depth 0 -- which reminds me: I think I promised to fix
> that, but I did not do that yet) due to intermittent network failures, you
> are borked, as you said.
Exact.
> But here comes an idea: together with Nguyễn's sparse series, it is
> conceivable that we support a shallow & narrow clone via the upload-pack
> protocol (also making mithro happy). The problem with narrow clones was
> not the pack generation side, that is done by a rev-list that can be
> limited to certain paths. The problem was that we end up with missing
> tree objects. However, if we can make a sparse checkout, we can avoid
> the problem.
Sure, if you can salvage as much as you can from a partial pack and
create a shallow and narrow clone out of it then it should be possible
to do some restartable clone. I still think this might be much less
complex to achieve through git-archive, especially if some files, i.e.
objects, are large enough to be exposed to network outages. It is
like the same issue as being able to fetch at least one revision but to
a lesser degree. You might be able to get that first revision through
multiple attempts by gathering missing objects on each attempt. But if
you encounter an object large enough you then might be unlucky enough
not to be able to transfer it all before the next network failure.
With a simple extension to git-archive, any object content could be
resumed many times from any offset. Then, deepening the history should
make use of deltas through the pack protocol which should hopefully
consist of much smaller transfers and therefore less prone to network
outage.
That could be sketched like this, supposing the user runs
"git clone git://foo.bar/baz":
1) "git init baz" etc. as usual.
2) "git ls-remote git://foo.bar/baz HEAD" and store the result in
.git/CLONE_HEAD so as not to be confused by the remote HEAD possibly
changing before we're done.
3) "git archive --remote=git://foo.bar/baz CLONE_HEAD" and store the
result locally. Keep track of how many files are received, and how
many bytes for the currently received file.
4) If the network connection is broken, loop back to (3), adding
--skip=${nr_files_received},${nr_bytes_in_curr_file_received} to
the git-archive argument list. The remote server simply skips over
the specified number of files and bytes into the next file.
5) Get content from remote commit object for CLONE_HEAD somehow. (?)
6) "git add . && git write-tree" and make sure the top tree SHA1 matches
the one in the commit from (5).
7) "git hash-object -w -t commit" with data obtained in (5), and make
sure it matches SHA1 from CLONE_HEAD.
8) Update local HEAD with CLONE_HEAD and set it up as a shallow clone.
Delete .git/CLONE_HEAD.
9) Run "git fetch" with the --depth parameter to get more revisions.
Notes:
- This mode of operation should probably be optional, like by using
--safe or --restartable with 'git clone'. And since this mode of
operation is really meant for people with slow and unreliable network
connections, they're unlikely to wish for the whole history to be
fetched. Hence this mode could simply be triggered by the --depth
parameter to 'git clone' which would provide a clear depth value to
use in (9).
- If the transfer is interrupted locally with ^C then it should be
possible to resume it by noticing the presence of .git/CLONE_HEAD
up front. Determining how many files to skip when resuming with
git-archive can be done with $((`git ls-files -o | wc -l` - 1)) and
$(git ls-files -o | tail -1 | wc -c).
- It probably would be a good idea to have a tgz format in 'git
archive', which might be simpler to deal with than the zip format.
- Step (3) could be optimized in many ways, like by directly using
hash-object and update-index, or by using a filter to pipe the result
directly into fast-import.
- All this to say: the above should be pretty easy to implement even
with a shell script. A builtin version could then be made if this
proves to actually be useful. And the server remains stateless, with
no additional caching needed; such caching would go against any attempt
at making a busy server like git.kernel.org share as much of the
object store as possible between plenty of mostly identical repositories.
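Steps (1)-(6) can be exercised locally with a stand-in remote (all
names here are illustrative; the --skip resume of step (4) is the
proposed extension and does not exist yet, so it appears only as a
comment):

```shell
set -e
cd "$(mktemp -d)"
# Stand-in for git://foo.bar/baz.
git init -q remote
echo data > remote/file
git -C remote add file
git -C remote -c user.name=t -c user.email=t@e commit -qm top

git init -q baz && cd baz                                   # (1)
git ls-remote ../remote HEAD | cut -f1 > .git/CLONE_HEAD    # (2)
head=$(cat .git/CLONE_HEAD)

# (3) restartable content transfer; on a broken connection, (4) would
# retry with the proposed --skip=${nr_files},${nr_bytes} argument.
git archive --remote=../remote HEAD | tar -x

commit=$(git -C ../remote cat-file commit "$head")          # (5), "somehow"
git add -A
tree=$(git write-tree)                                      # (6): verify that
test "$tree" = "$(git -C ../remote rev-parse "$head^{tree}")" # the tree matches
echo "tree verified: $tree"
# (7) would recreate the commit object from $commit and check its SHA1
# against $head; (8)-(9) then mark the clone shallow and deepen it.
```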
> Note: this is not well thought-through, but just a brainstorm-like answer
> to your ideas.
And so is the above.
Nicolas
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Continue git clone after interruption
2009-08-19 15:19 ` Jakub Narebski
@ 2009-08-19 19:04 ` Nicolas Pitre
2009-08-19 19:42 ` Jakub Narebski
0 siblings, 1 reply; 39+ messages in thread
From: Nicolas Pitre @ 2009-08-19 19:04 UTC (permalink / raw)
To: Jakub Narebski; +Cc: Tomasz Kontusz, git
On Wed, 19 Aug 2009, Jakub Narebski wrote:
> There are 114937 objects in this packfile, including 56249 objects
> used as base (can be deltified or not). git-verify-pack -v shows
> that all objects have total size-in-packfile of 33 MB (which agrees
> with packfile size of 33 MB), with 17 MB size-in-packfile taken by
> deltified objects, and 16 MB taken by base objects.
>
> git verify-pack -v |
> grep -v "^chain" |
> grep -v "objects/pack/pack-" > verify-pack.out
>
> sum=0; bsum=0; dsum=0;
> while read sha1 type size packsize off depth base; do
> echo "$sha1" >> verify-pack.sha1.out
> sum=$(( $sum + $packsize ))
> if [ -n "$base" ]; then
> echo "$sha1" >> verify-pack.delta.out
> dsum=$(( $dsum + $packsize ))
> else
> echo "$sha1" >> verify-pack.base.out
> bsum=$(( $bsum + $packsize ))
> fi
> done < verify-pack.out
> echo "sum=$sum; bsum=$bsum; dsum=$dsum"
Your object classification is misleading. Because an object has no
base, that doesn't mean it is necessarily a base itself. You'd have to
store $base into a separate file and then sort it and remove duplicates
to know the actual number of base objects. What you have right now is
strictly delta objects and non-delta objects. And base objects can
themselves be delta objects already of course.
Also... my git repo after 'git gc --aggressive' contains a pack which
size is 22 MB. Your script tells me:
sum=22930254; bsum=14142012; dsum=8788242
and:
29558 verify-pack.base.out
82043 verify-pack.delta.out
111601 verify-pack.out
111601 verify-pack.sha1.out
meaning that I have 111601 total objects, of which 29558 are non-deltas
occupying 14 MB and 82043 are deltas occupying 8 MB. That certainly
shows how deltas are space efficient. And with a minor modification to
your script, I know that 44985 objects are actually used as a delta
base. So, on average, each base is responsible for nearly 2 deltas.
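The "minor modification" amounts to collecting the base-SHA1 column and
de-duplicating it. A sketch against a small sample in the verify-pack -v
line format quoted above (SHA1s shortened for readability; real lines
carry full 40-hex IDs):

```shell
cd "$(mktemp -d)"
# Delta lines have 7 fields, the 7th being the base SHA1; non-delta
# lines have 5. Here cccc and dddd delta off aaaa, and eeee off cccc:
cat > verify-pack.out <<'EOF'
aaaa blob 10 12 0
bbbb blob 20 22 12
cccc blob 5 7 34 1 aaaa
dddd blob 6 8 41 1 aaaa
eeee blob 4 6 49 2 cccc
EOF

# Objects actually used as a delta base = de-duplicated base column.
awk 'NF == 7 { print $7 }' verify-pack.out | sort -u > usedbase.out
wc -l < usedbase.out    # here: 2 (aaaa and cccc)
```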
> >>>> (BTW what happens if this pack is larger than file size limit for
> >>>> given filesystem?).
> [...]
>
> >> If I remember correctly FAT28^W FAT32 has maximum file size of 2 GB.
> >> FAT is often used on SSD, on USB drive. Although if you have 2 GB
> >> packfile, you are doing something wrong, or UGFWIINI (Using Git For
> >> What It Is Not Intended).
> >
> > Hopefully you're not performing a 'git clone' off of a FAT filesystem.
> > For physical transport you may repack with the appropriate switches.
>
> Not off a FAT filesystem, but into a FAT filesystem.
That's what I meant, sorry. My point still stands.
> > The front of the pack is the critical point. If you get enough to
> > create the top commit then further transfers can be done incrementally
> > with only the deltas between each commits.
>
> How? You have some objects that can be used as base; how to tell
> git-daemon that we have them (but not their prerequisites), and how
> to generate incrementals?
Just the same as when you perform a fetch to update your local copy of a
remote branch: you tell the remote about the commit you have and the one
you want, and git-repack will create delta objects for the commit you
want against similar objects from the commit you already have, and skip
those objects from the commit you want that are already included in the
commit you have.
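That exchange is the same negotiation an ordinary shallow deepening
already performs; a local sketch (names illustrative; file:// is used
because plain local paths ignore --depth):

```shell
set -e
cd "$(mktemp -d)"
git init -q remote
for i in 1 2 3; do
    git -C remote -c user.name=t -c user.email=t@e \
        commit -q --allow-empty -m "rev $i"
done

# Only the top commit is transferred at first...
git clone -q --depth 1 "file://$PWD/remote" local
git -C local rev-list --count HEAD

# ...and each later fetch transfers just the additional history.
git -C local fetch -q --depth 3
git -C local rev-list --count HEAD
```

The first count is 1, the second 3, without re-transferring the objects
the client already has.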
> >> A question about pack protocol negotiation. If clients presents some
> >> objects as "have", server can and does assume that client has all
> >> prerequisites for such objects, e.g. for tree objects that it has
> >> all objects for files and directories inside tree; for commit it means
> >> all ancestors and all objects in snapshot (have top tree, and its
> >> prerequisites). Do I understand this correctly?
> >
> > That works only for commits.
>
> Hmmmm... how do you intend for "prefetch top objects restartable-y first"
> to work, then?
See my latest reply to dscho (you were in CC already).
> >> BTW. because of compression it might be more difficult to resume
> >> archive creation in the middle, I think...
> >
> > Why so? the tar+gzip format is streamable.
>
> gzip format uses sliding window in compression. "cat a b | gzip"
> is different from "cat <(gzip a) <(gzip b)".
>
> But that doesn't matter. If we are interrupted in the middle, we can
> uncompress what we have to check how far we got, and tell the server
> to send the rest; this way the server wouldn't even have to generate
> (let alone send) what we already got in the partial transfer.
You got it.
> P.S. What do you think about 'bundle' capability extension mentioned
> in a side sub-thread?
I don't like it. Reason is that it forces the server to be (somewhat)
stateful by having to keep track of those bundles and cycle them, and it
doubles the disk usage by having one copy of the repository in the form
of the original pack(s) and another copy as a bundle.
Of course, the idea of having a cron job generating a bundle and
offering it for download through HTTP or the like is fine if people are
OK with that, and that requires zero modifications to git. But I don't
think that is a solution that scales.
If you think about git.kernel.org which has maybe hundreds of
repositories where the big majority of them are actually forks of Linus'
own repository, then having all those forks reference Linus' repository
is a big disk space saver (and IO too as the referenced repository is
likely to remain cached in memory). Having a bundle ready for each of
them will simply kill that space advantage, unless they all share the
same bundle.
Now sharing that common bundle could be done of course, but that makes
things yet more complex while still wasting IO because some requests
will hit the common pack and some others will hit the bundle, making
less efficient usage of the disk cache on the server.
Yet, that bundle would probably not contain the latest revision if it is
only periodically updated, even less so if it is shared between multiple
repositories as outlined above. And what people with slow/unreliable
network links are probably most interested in is the latest revision and
maybe a few older revisions, but probably not the whole repository as
that is simply too long to wait for. Hence having a big bundle is not
flexible either with regards to the actual data transfer size.
Hence having a restartable git-archive service to create the top
revision with the ability to cheaply (in terms of network bandwidth)
deepen the history afterwards is probably the most straight forward way
to achieve that. The server need not be aware of separate bundles, etc.
And the shared object store still works as usual with the same cached IO
whether the data is needed for a traditional fetch or a "git archive"
operation.
Why "git archive"? Because its content is well defined. So if you give
it a commit SHA1 you will always get the same stream of bytes (after
decompression) since the way git sorts files is strictly defined. It is
therefore easy to tell a remote "git archive" instance that we want the
content for commit xyz but that we already got n files, and that
the last file we've got has m bytes. There is simply no confusion about
what we've got already, unlike with a partial pack which might need
yet-to-be-received objects in order to make sense of what has been
already received. The server simply has to skip that many files and
resume the transfer at that point, independently of the compression or
even the archive format.
Nicolas
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Continue git clone after interruption
2009-08-19 19:04 ` Nicolas Pitre
@ 2009-08-19 19:42 ` Jakub Narebski
2009-08-19 21:13 ` Nicolas Pitre
0 siblings, 1 reply; 39+ messages in thread
From: Jakub Narebski @ 2009-08-19 19:42 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Tomasz Kontusz, git, Johannes Schindelin
Cc-ed Dscho, so he can participate in this subthread more easily.
On Wed, 19 Aug 2009, Nicolas Pitre wrote:
> On Wed, 19 Aug 2009, Jakub Narebski wrote:
> > P.S. What do you think about 'bundle' capability extension mentioned
> > in a side sub-thread?
>
> I don't like it. Reason is that it forces the server to be (somewhat)
> stateful by having to keep track of those bundles and cycle them, and it
> doubles the disk usage by having one copy of the repository in the form
> of the original pack(s) and another copy as a bundle.
I agree about the problems with disk usage, but I disagree about the
server having to be stateful; the server can simply scan for bundles and
offer links to them if the client requests the 'bundles' capability,
somewhere around the initial git-ls-remote list of refs.
> Of course, the idea of having a cron job generating a bundle and
> offering it for download through HTTP or the like is fine if people are
> OK with that, and that requires zero modifications to git. But I don't
> think that is a solution that scales.
Well, offering a daily bundle in addition to a daily snapshot could be
a good practice, at least until git acquires resumable fetch (resumable
clone).
>
> If you think about git.kernel.org which has maybe hundreds of
> repositories where the big majority of them are actually forks of Linus'
> own repository, then having all those forks reference Linus' repository
> is a big disk space saver (and IO too as the referenced repository is
> likely to remain cached in memory). Having a bundle ready for each of
> them will simply kill that space advantage, unless they all share the
> same bundle.
I am thinking about sharing the same bundle for related projects.
>
> Now sharing that common bundle could be done of course, but that makes
> things yet more complex while still wasting IO because some requests
> will hit the common pack and some others will hit the bundle, making
> less efficient usage of the disk cache on the server.
Hmmm... true (unless bundles are on separate server).
>
> Yet, that bundle would probably not contain the latest revision if it is
> only periodically updated, even less so if it is shared between multiple
> repositories as outlined above. And what people with slow/unreliable
> network links are probably most interested in is the latest revision and
> maybe a few older revisions, but probably not the whole repository as
> that is simply too long to wait for. Hence having a big bundle is not
> flexible either with regards to the actual data transfer size.
I agree that a bundle would be useful for restartable clone, and not
useful for restartable fetch. Well, unless you count the (non-existent)
GitTorrent / git-mirror-sync as such a solution... ;-)
>
> Hence having a restartable git-archive service to create the top
> revision with the ability to cheaply (in terms of network bandwidth)
> deepen the history afterwards is probably the most straight forward way
> to achieve that. The server need not be aware of separate bundles, etc.
> And the shared object store still works as usual with the same cached IO
> whether the data is needed for a traditional fetch or a "git archive"
> operation.
It's the "cheaply deepen history" part that I doubt would be easy. This is
the most difficult part, I think (see also below).
>
> Why "git archive"? Because its content is well defined. So if you give
> it a commit SHA1 you will always get the same stream of bytes (after
> decompression) since the way git sorts files is strictly defined. It is
> therefore easy to tell a remote "git archive" instance that we want the
> content for commit xyz but that we already got n files, and that
> the last file we've got has m bytes. There is simply no confusion about
> what we've got already, unlike with a partial pack which might need
> yet-to-be-received objects in order to make sense of what has been
> already received. The server simply has to skip that many files and
> resume the transfer at that point, independently of the compression or
> even the archive format.
Let's reiterate it to check if I understand it correctly:
Any "restartable clone" / "resumable fetch" solution must begin with
a file which is rock-solid stable wrt. reproducibility given the same
parameters. git-archive has this feature; a packfile doesn't (so I guess
that a bundle also doesn't, unless it was cached / saved on disk).
It would be useful if it were possible to generate part of this rock-solid
file for a partial (range, resume) request, without needing to generate
(calculate) the parts that the client has already downloaded. Otherwise the
server has to either waste disk space and IO on caching, or waste CPU
(and IO) generating a part which is not needed and dropping it to
/dev/null. git-archive, you say, has this feature.
Next you need to tell the server that you have those objects, obtained
via the resumable download part ("git archive HEAD" in your proposal), and
that it can use them and not include them in the prepared file/pack.
"have" is limited to commits, and "have <sha1>" tells the server that
you have <sha1> and all its prerequisites (dependencies). You can't
use "have <sha1>" with git-archive solution. I don't know enough
about 'shallow' capability (and what it enables) to know whether
it can be used for that. Can you elaborate?
Then you have to finish the clone / fetch. All solutions so far include
some kind of incremental improvement. My first proposal of bisect
fetching 1/nth or a predefined-size pack is a bottom-up solution, where
we build the full clone from the root commits up. You propose, from what
I understand, building the full clone from the top commit down, using
deepening from a shallow clone. In this step you either get the full
incremental or not; downloading incrementals (from what I understand) is
not resumable / they do not support partial fetch.
Do I understand this correctly?
--
Jakub Narebski
Poland
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Continue git clone after interruption
2009-08-19 19:42 ` Jakub Narebski
@ 2009-08-19 21:13 ` Nicolas Pitre
2009-08-20 0:26 ` Sam Vilain
2009-08-20 7:37 ` Jakub Narebski
0 siblings, 2 replies; 39+ messages in thread
From: Nicolas Pitre @ 2009-08-19 21:13 UTC (permalink / raw)
To: Jakub Narebski; +Cc: Tomasz Kontusz, git, Johannes Schindelin
On Wed, 19 Aug 2009, Jakub Narebski wrote:
> Cc-ed Dscho, so he can easier participate in this subthread.
>
> On Wed, 19 Aug 2009, Nicolas Pitre wrote:
> > On Wed, 19 Aug 2009, Jakub Narebski wrote:
>
> > > P.S. What do you think about 'bundle' capability extension mentioned
> > > in a side sub-thread?
> >
> > I don't like it. Reason is that it forces the server to be (somewhat)
> > stateful by having to keep track of those bundles and cycle them, and it
> > doubles the disk usage by having one copy of the repository in the form
> > of the original pack(s) and another copy as a bundle.
>
> I agree about problems with disk usage, but I disagree about server
> having to be stateful; server can just simply scan for bundles, and
> offer links to them if client requests 'bundles' capability, somewhere
> around initial git-ls-remote list of refs.
But that's the client that has to deal with what the server wants to
offer, instead of the server actually serving data as the client wants.
> Well, offering daily bundle in addition to daily snapshot could be
> a good practice, at least until git acquires resumable fetch (resumable
> clone).
Outside of Git: maybe. Through the git protocol: no. And what would
that bundle contain over the daily snapshot? The whole history? If so
that goes against the idea that people concerned by all this have slow
links and probably aren't interested in the time to download it all. If
the bundle contains only the top revision then it has no advantage over
the snapshot. Somewhere in the middle? Sure, but then where to draw
the line? That's for the client to decide, not the server
administrator.
And what if you start your slow transfer which breaks in the middle.
The next morning you want to restart it in the hope that you might
resume the transfer of the bundle that is incomplete. But crap, the
server has updated its bundle and your half-bundle is now useless.
You've wasted your bandwidth for nothing.
> > If you think about git.kernel.org which has maybe hundreds of
> > repositories where the big majority of them are actually forks of Linus'
> > own repository, then having all those forks reference Linus' repository
> > is a big disk space saver (and IO too as the referenced repository is
> > likely to remain cached in memory). Having a bundle ready for each of
> > them will simply kill that space advantage, unless they all share the
> > same bundle.
>
> I am thinking about sharing the same bundle for related projects.
... meaning more administrative burden.
> > Now sharing that common bundle could be done of course, but that makes
> > things yet more complex while still wasting IO because some requests
> > will hit the common pack and some others will hit the bundle, making
> > less efficient usage of the disk cache on the server.
>
> Hmmm... true (unless bundles are on separate server).
... meaning additional but avoidable costs.
> > Yet, that bundle would probably not contain the latest revision if it is
> > only periodically updated, even less so if it is shared between multiple
> > repositories as outlined above. And what people with slow/unreliable
> > network links are probably most interested in is the latest revision and
> > maybe a few older revisions, but probably not the whole repository as
> > that is simply too long to wait for. Hence having a big bundle is not
> > flexible either with regards to the actual data transfer size.
>
> I agree that bundle would be useful for restartable clone, and not
> useful for restartable fetch. Well, unless you count (non-existing)
> GitTorrent / git-mirror-sync as this solution... ;-)
I don't think fetches after a clone are such an issue. They are
typically orders of magnitude smaller than the initial clone. Same goes
for fetches to deepen a shallow clone, which are in fact fetches going
back in history instead of forward. I still stand by my assertion that
bundles are suboptimal for a restartable clone.
As for GitTorrent / git-mirror-sync... those are still vaporware to me
and I therefore have doubts about their actual feasibility. So no, I
don't count on them.
> > Hence having a restartable git-archive service to create the top
> > revision with the ability to cheaply (in terms of network bandwidth)
> > deepen the history afterwards is probably the most straightforward way
> > to achieve that. The server need not be aware of separate bundles, etc.
> > And the shared object store still works as usual with the same cached IO
> > whether the data is needed for a traditional fetch or a "git archive"
> > operation.
>
> It's the "cheaply deepen history" that I doubt would be easy. This is
> the most difficult part, I think (see also below).
Don't think so. Try this:
mkdir test
cd test
git init
git fetch --depth=1 git://git.kernel.org/pub/scm/git/git.git
Result:
remote: Counting objects: 1824, done.
remote: Compressing objects: 100% (1575/1575), done.
Receiving objects: 100% (1824/1824), 3.01 MiB | 975 KiB/s, done.
remote: Total 1824 (delta 299), reused 1165 (delta 180)
Resolving deltas: 100% (299/299), done.
From git://git.kernel.org/pub/scm/git/git
* branch HEAD -> FETCH_HEAD
You'll get the very latest revision for HEAD, and only that. The size
of the transfer will be roughly the size of a daily snapshot, except it
is fully up to date. It is however non resumable in the event of a
network outage. My proposal is to replace this with a "git archive"
call. It won't get all branches, but for the purpose of initialising
one's repository that should be good enough. And the "git archive" can
be fully resumable as I explained.
Now to deepen that history. Let's say you want 10 more revisions going
back; then you simply perform the fetch again with a --depth=10. Right
now it doesn't seem to work optimally, but the pack that is then being
sent could be made of deltas against objects found in the commits we
already have. Currently it seems that a pack that also includes those
objects we already have in addition to those we want is created, which
is IMHO a flaw in the shallow support that shouldn't be too hard to fix.
Each level of deepening should then be as small as standard fetches
going forward when updating the repository with new revisions.
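The fix suggested here amounts to subtracting the shallow client's existing objects when building the deepening pack. Schematically (a toy model for illustration, not pack-objects internals):

```python
def deepening_pack(history, have_depth, new_depth):
    """history[d] is the set of objects introduced at depth d (0 = tip).
    A clone shallow at have_depth, deepened to new_depth, should receive
    only the objects of the newly exposed commits; anything already
    present at depth < have_depth can serve as a delta base instead of
    being resent (resending it is the flaw described above)."""
    wanted = set().union(*history[have_depth:new_depth])
    haves = set().union(*history[:have_depth])
    return wanted - haves

history = [{"a", "b"}, {"b", "c"}, {"c", "d"}]   # depths 0, 1, 2
pack = deepening_pack(history, 1, 3)             # {"c", "d"}: "b" is not resent
```

With that behavior, each deepening step costs roughly what a normal forward fetch of the same number of revisions would.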
> > Why "git archive"? Because its content is well defined. So if you give
> > it a commit SHA1 you will always get the same stream of bytes (after
> > decompression) since the way git sorts files is strictly defined. It is
> > therefore easy to tell a remote "git archive" instance that we want the
> > content for commit xyz but that we already got n files, and that
> > the last file we've got has m bytes. There is simply no confusion about
> > what we've got already, unlike with a partial pack which might need
> > yet-to-be-received objects in order to make sense of what has been
> > already received. The server simply has to skip that many files and
> > resume the transfer at that point, independently of the compression or
> > even the archive format.
>
> Let's reiterate it to check if I understand it correctly:
>
> Any "restartable clone" / "resumable fetch" solution must begin with
> a file which is rock-solid stable wrt. reproducibility given the same
> parameters. git-archive has this feature, packfile doesn't (so I guess
> that bundle also doesn't, unless it was cached / saved on disk).
Right.
> It would be useful if it was possible to generate part of this rock-solid
> file for partial (range, resume) request, without need to generate
> (calculate) parts that client already downloaded. Otherwise server has
> to either waste disk space and IO for caching, or waste CPU (and IO)
> on generating part which is not needed and dropping it to /dev/null.
> git-archive you say has this feature.
"Could easily have" is more appropriate.
> Next you need to tell server that you have those objects got using
> resumable download part ("git archive HEAD" in your proposal), and
> that it can use them and do not include them in prepared file/pack.
> "have" is limited to commits, and "have <sha1>" tells server that
> you have <sha1> and all its prerequisites (dependencies). You can't
> use "have <sha1>" with git-archive solution. I don't know enough
> about 'shallow' capability (and what it enables) to know whether
> it can be used for that. Can you elaborate?
See above, or Documentation/technical/shallow.txt.
> Then you have to finish clone / fetch. All solutions so far include
> some kind of incremental improvements. My first proposal of bisect
> fetching 1/nth or predefined size pack is a bottom-up solution, where
> we build full clone from root commits up. You propose, from what
> I understand build full clone from top commit down, using deepening
> from shallow clone. In this step you either get full incremental
> or not; downloading incremental (from what I understand) is not
> resumable / they do not support partial fetch.
Right. However, like I said, the incremental part should be much
smaller and therefore less susceptible to network troubles.
Nicolas
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Continue git clone after interruption
2009-08-19 17:21 ` Nicolas Pitre
@ 2009-08-19 22:23 ` René Scharfe
0 siblings, 0 replies; 39+ messages in thread
From: René Scharfe @ 2009-08-19 22:23 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Johannes Schindelin, Jakub Narebski, Tomasz Kontusz, git
Nicolas Pitre schrieb:
> 3) "git archive --remote=git://foo.bar/baz CLONE_HEAD" and store the
> result locally. Keep track of how many files are received, and how
> many bytes for the currently received file.
>
> 4) if network connection is broken, loop back to (3) adding
> --skip=${nr_files_received},${nr_bytes_in_curr_file_received} to
> the git-archive argument list. Remote server simply skips over
> specified number of files and bytes into the next file.
>
> 5) Get content from remote commit object for CLONE_HEAD somehow. (?)
[...]
> - That probably would be a good idea to have a tgz format to 'git
> archive' which might be simpler to deal with than the zip format.
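Steps (3)-(4) quoted above can be sketched as a client-side resume loop (illustrative Python only; the `--skip` option is this thread's proposal, not something git-archive actually has, and the toy server below stands in for the remote end):

```python
def fetch_with_resume(recv):
    """Client side of steps (3)-(4): remember how many whole entries and
    how many bytes of the current entry have arrived, and restart the
    transfer from that offset after a network error.  recv(skip_files,
    skip_bytes) stands in for re-running
        git archive --remote=... --skip=<files>,<bytes> CLONE_HEAD
    and yields (name, chunk, done) tuples; it may raise ConnectionError
    at any point mid-stream."""
    done_entries = []            # completed (name, payload) pairs
    partial = b""                # bytes of the entry currently arriving
    while True:
        try:
            for name, chunk, done in recv(len(done_entries), len(partial)):
                partial += chunk
                if done:
                    done_entries.append((name, partial))
                    partial = b""
            return done_entries
        except ConnectionError:
            pass                 # loop back to (3) with updated offsets

# A toy flaky server: streams two entries in 2-byte chunks and drops
# the connection once, mid-way through the first entry.
ENTRIES = [("a", b"hello"), ("b", b"world!")]
state = {"fail_armed": True}

def flaky_recv(skip_files, skip_bytes):
    sent = 0
    for i, (name, data) in enumerate(ENTRIES):
        if i < skip_files:
            continue                     # client already has this entry
        if i == skip_files:
            data = data[skip_bytes:]     # resume mid-file
        for off in range(0, len(data), 2):
            sent += 1
            if state["fail_armed"] and sent == 3:
                state["fail_armed"] = False
                raise ConnectionError("link dropped")
            yield name, data[off:off + 2], off + 2 >= len(data)

result = fetch_with_resume(flaky_recv)   # [("a", b"hello"), ("b", b"world!")]
```

Despite the dropped connection, the second attempt transfers only the bytes still missing.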
Adding support for the tgz format would be useful anyway, I guess, and
is easy to implement.
And adding support for cpio (and cpio.gz) and writing an extractor for
it should be simpler than writing a tar extractor alone.
One needs to take a closer look at the limits of the chosen archive
format (file name length, supported file types and attributes, etc.) to
make sure any archive can be turned back into the same git tree.
The commit object could be sent as the first (fake) file of the archive.
You'd need a way to turn off the effect of the attributes export-subst
and export-ignore.
Currently, convert_to_working_tree() is used on the contents of all
files in an archive. You'd need a way to turn that off, too.
Adding a new format type is probably the easiest way to bundle the
special requirements of the previous three paragraphs.
René
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Continue git clone after interruption
2009-08-19 21:13 ` Nicolas Pitre
@ 2009-08-20 0:26 ` Sam Vilain
2009-08-20 7:37 ` Jakub Narebski
1 sibling, 0 replies; 39+ messages in thread
From: Sam Vilain @ 2009-08-20 0:26 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Jakub Narebski, Tomasz Kontusz, git, Johannes Schindelin
On Wed, 2009-08-19 at 17:13 -0400, Nicolas Pitre wrote:
> > It's the "cheaply deepen history" that I doubt would be easy. This is
> > the most difficult part, I think (see also below).
>
> Don't think so. Try this:
>
> mkdir test
> cd test
> git init
> git fetch --depth=1 git://git.kernel.org/pub/scm/git/git.git
>
> Result:
>
> remote: Counting objects: 1824, done.
> remote: Compressing objects: 100% (1575/1575), done.
> Receiving objects: 100% (1824/1824), 3.01 MiB | 975 KiB/s, done.
> remote: Total 1824 (delta 299), reused 1165 (delta 180)
> Resolving deltas: 100% (299/299), done.
> From git://git.kernel.org/pub/scm/git/git
> * branch HEAD -> FETCH_HEAD
>
> You'll get the very latest revision for HEAD, and only that. The size
> of the transfer will be roughly the size of a daily snapshot, except it
> is fully up to date. It is however non resumable in the event of a
> network outage. My proposal is to replace this with a "git archive"
> call. It won't get all branches, but for the purpose of initialising
> one's repository that should be good enough. And the "git archive" can
> be fully resumable as I explained.
>
> Now to deepen that history. Let's say you want 10 more revisions going
> back then you simply perform the fetch again with a --depth=10. Right
> now it doesn't seem to work optimally, but the pack that is then being
> sent could be made of deltas against objects found in the commits we
> already have. Currently it seems that a pack that also includes those
> objects we already have in addition to those we want is created, which
> is IMHO a flaw in the shallow support that shouldn't be too hard to fix.
> Each level of deepening should then be as small as standard fetches
> going forward when updating the repository with new revisions.
Nicolas, apart from starting with the most recent commits and working
backwards, this is very similar to the "bundle slicing" idea defined in
GitTorrent. What the GitTorrent research project has so far achieved is
defining a slicing algorithm, and figuring out how well slicing works,
in terms of wasted bandwidth.
If you do it right, then you can support download spreading across
mirrors, too. Eg, given a starting point, a 'slice size' - which I
based on uncompressed object size but could as well be based on commit
count - and a slice number to fetch, you should be able to look up in
the revision list index the revisions to select and then make a thin
pack corresponding to those commits. Currently creating this index is
the slowest part of creating bundle fragments in my Perl implementation.
Once Nick Edelen's project is mergeable, we have a mechanism for being
able to relatively quickly draw a manifest of objects for these slices.
So how much bandwidth is lost?
Eg, for git.git, taking the complete object list, slicing it into 1024k
(uncompressed) bundle slices, and making thin packs from those slices we
get:
Generating index...
Length is 1291327524, 1232 blocks
Slice #0: 1050390 => 120406 (11%)
Slice #1: 1058162 => 124978 (11%)
Slice #2: 1049858 => 104363 (9%)
...
Slice #51: 1105090 => 43140 (3%)
Slice #52: 1091282 => 45367 (4%)
Slice #53: 1067675 => 39792 (3%)
...
Slice #211: 1086238 => 25451 (2%)
Slice #212: 1055705 => 31294 (2%)
Slice #213: 1059460 => 7767 (0%)
...
Slice #1129: 1109209 => 38182 (3%)
Slice #1130: 1125925 => 29829 (2%)
Slice #1131: 1120203 => 14446 (1%)
Final slice: 623055 => 49345
Overall compressed: 39585851
Calculating Repository bundle size...
Counting objects: 107369, done.
Compressing objects: 100% (28059/28059), done.
Writing objects: 100% (107369/107369), 29.20 MiB | 48321 KiB/s, done.
Total 107369 (delta 78185), reused 106770 (delta 77609)
Bundle size: 30638967
Overall inefficiency: 29%
In the above output, the first figure is the complete un-delta'd,
uncompressed size of the slice - that is, the size of all of the new
objects that the commit introduces. The second figure is the full size
of a thin pack with those objects in it, i.e. the above tells me that in
git.git there are 1.2 GB of uncompressed objects. Each slice ends up
varying in size between about 10k and 200k, but most of the slices end
up between 15k and 50k.
Actually the test script was thrown off by a loose root and that added
about 3MB to the compressed size, so the overall inefficiency with this
block size is actually more like 20%. I think I am running into the
flaw you mention above, too, especially when I do a larger block size
run:
Generating index...
Length is 1291327524, 62 blocks
Slice #0: 21000218 => 1316165 (6%)
Slice #1: 20988208 => 1107636 (5%)
...
Slice #59: 21102776 => 1387722 (6%)
Slice #60: 20974960 => 876648 (4%)
Final slice: 6715954 => 261218
Overall compressed: 50071857
Calculating Repository bundle size...
Counting objects: 107369, done.
Compressing objects: 100% (28059/28059), done.
Writing objects: 100% (107369/107369), 29.20 MiB | 48353 KiB/s, done.
Total 107369 (delta 78185), reused 106770 (delta 77609)
Bundle size: 30638967
Overall inefficiency: 63%
Somehow, with larger slices, the total packed size went up rather than down.
Trying with 100MB "blocks" I get:
Generating index...
Length is 1291327524, 13 blocks
Slice #0: 104952661 => 4846553 (4%)
Slice #1: 104898188 => 2830056 (2%)
Slice #2: 105007998 => 2856535 (2%)
Slice #3: 104909972 => 2583402 (2%)
Slice #4: 104909440 => 2187708 (2%)
Slice #5: 104859786 => 2555686 (2%)
Slice #6: 104873317 => 2358914 (2%)
Slice #7: 104881597 => 2183894 (2%)
Slice #8: 104863418 => 3555224 (3%)
Slice #9: 104896599 => 3192564 (3%)
Slice #10: 104876697 => 3895707 (3%)
Slice #11: 104903491 => 3731555 (3%)
Final slice: 32494360 => 1270887
Overall compressed: 38048685
Calculating Repository bundle size...
Counting objects: 107369, done.
Compressing objects: 100% (28059/28059), done.
Writing objects: 100% (107369/107369), 29.20 MiB | 48040 KiB/s, done.
Total 107369 (delta 78185), reused 106770 (delta 77609)
Bundle size: 30638967
Overall inefficiency: 24%
In the above, we broke the git.git download into 13 partial downloads of
a few meg each, at the expense of an extra 24% of download.
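For concreteness, the "overall inefficiency" figure is just how much bigger the slices' combined compressed size is than the single monolithic bundle; plugging in the numbers from the three runs above:

```python
def inefficiency(total_sliced, bundle_size):
    """Extra download cost of slicing, as a whole-number percentage:
    combined size of all thin-pack slices vs. one monolithic bundle."""
    return round(100 * (total_sliced - bundle_size) / bundle_size)

bundle = 30638967                          # git.git bundle size from the runs
pct_1m = inefficiency(39585851, bundle)    # ~1 MiB slices
pct_20m = inefficiency(50071857, bundle)   # ~20 MiB slices
pct_100m = inefficiency(38048685, bundle)  # ~100 MB slices
print(pct_1m, pct_20m, pct_100m)           # 29 63 24
```

This reproduces the 29% / 63% / 24% figures reported, before the ~3 MB loose-root correction mentioned above.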
Anyway I've hopefully got more to add to this but this will do for a
starting point.
Sam
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Continue git clone after interruption
2009-08-19 21:13 ` Nicolas Pitre
2009-08-20 0:26 ` Sam Vilain
@ 2009-08-20 7:37 ` Jakub Narebski
2009-08-20 7:48 ` Nguyen Thai Ngoc Duy
` (2 more replies)
1 sibling, 3 replies; 39+ messages in thread
From: Jakub Narebski @ 2009-08-20 7:37 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Tomasz Kontusz, git, Johannes Schindelin
On Wed, 19 Aug 2009, Nicolas Pitre wrote:
> On Wed, 19 Aug 2009, Jakub Narebski wrote:
> >
> > On Wed, 19 Aug 2009, Nicolas Pitre wrote:
> > > On Wed, 19 Aug 2009, Jakub Narebski wrote:
[...]
> > > Yet, that bundle would probably not contain the latest revision if it is
> > > only periodically updated, even less so if it is shared between multiple
> > > repositories as outlined above. And what people with slow/unreliable
> > > network links are probably most interested in is the latest revision and
> > > maybe a few older revisions, but probably not the whole repository as
> > > that is simply too long to wait for. Hence having a big bundle is not
> > > flexible either with regards to the actual data transfer size.
> >
> > I agree that bundle would be useful for restartable clone, and not
> > useful for restartable fetch. Well, unless you count (non-existing)
> > GitTorrent / git-mirror-sync as this solution... ;-)
>
> I don't think fetches after a clone are such an issue. They are
> typically transfers being orders of magnitude smaller than the initial
> clone. Same goes for fetches to deepen a shallow clone which are in
> fact fetches going back in history instead of forward. I still stand
> by my assertion that bundles are suboptimal for a restartable clone.
They are good in a pinch as a workaround for the lack of resumable clone,
but I agree that they are not an optimal solution, because a) like a
packfile, they cannot guarantee that the same arguments (endpoints) would
generate the same file, and b) you currently can't generate only the
resumed part of a bundle (and it would probably be difficult to add that).
> As for GitTorrent / git-mirror-sync... those are still vaporware to me
> and I therefore have doubts about their actual feasibility. So no, I
> don't count on them.
Well... there is _some_ code.
> > > Hence having a restartable git-archive service to create the top
> > > revision with the ability to cheaply (in terms of network bandwidth)
> > > deepen the history afterwards is probably the most straightforward way
> > > to achieve that. The server need not be aware of separate bundles, etc.
> > > And the shared object store still works as usual with the same cached IO
> > > whether the data is needed for a traditional fetch or a "git archive"
> > > operation.
> >
> > It's the "cheaply deepen history" that I doubt would be easy. This is
> > the most difficult part, I think (see also below).
>
> Don't think so. Try this:
>
> mkdir test
> cd test
> git init
> git fetch --depth=1 git://git.kernel.org/pub/scm/git/git.git
>
> Result:
>
> remote: Counting objects: 1824, done.
> remote: Compressing objects: 100% (1575/1575), done.
> Receiving objects: 100% (1824/1824), 3.01 MiB | 975 KiB/s, done.
> remote: Total 1824 (delta 299), reused 1165 (delta 180)
> Resolving deltas: 100% (299/299), done.
> From git://git.kernel.org/pub/scm/git/git
> * branch HEAD -> FETCH_HEAD
>
> You'll get the very latest revision for HEAD, and only that. The size
> of the transfer will be roughly the size of a daily snapshot, except it
> is fully up to date. It is however non resumable in the event of a
> network outage. My proposal is to replace this with a "git archive"
> call. It won't get all branches, but for the purpose of initialising
> one's repository that should be good enough. And the "git archive" can
> be fully resumable as I explained.
It is, however, only 2.5 MB out of 37 MB that is resumable, which is 7%
(well, that of course depends on the repository). Not that much is
resumable.
> Now to deepen that history. Let's say you want 10 more revisions going
> back then you simply perform the fetch again with a --depth=10. Right
> now it doesn't seem to work optimally, but the pack that is then being
> sent could be made of deltas against objects found in the commits we
> already have. Currently it seems that a pack that also includes those
> objects we already have in addition to those we want is created, which
> is IMHO a flaw in the shallow support that shouldn't be too hard to fix.
> Each level of deepening should then be as small as standard fetches
> going forward when updating the repository with new revisions.
You would have the same (or at least quite similar) problems with the
deepening part (the 'incrementals' transfer part) as you found with my
first proposal of server-side bisection / division of rev-list, serving
1/Nth of revisions (where N is selected so the packfile is reasonable) to
the client as incrementals. Yours is a top-down, mine was a bottom-up
approach to sending a series of smaller packs. The problem is how to
select the size of the incrementals, and that incrementals are
all-or-nothing (but see also the comment below).
In the proposal using git-archive and shallow-clone deepening as
incrementals you have this small seed (how small depends on the
repository: 50% - 5%) which is resumable. And presumably with deepening
you can somehow make some use of an incomplete packfile, only part of
which was transferred before a network error / disconnect. And even tell
the server about objects which you managed to extract from *.pack.part.
> > > Why "git archive"? Because its content is well defined. So if you give
> > > it a commit SHA1 you will always get the same stream of bytes (after
> > > decompression) since the way git sorts files is strictly defined. It is
> > > therefore easy to tell a remote "git archive" instance that we want the
> > > content for commit xyz but that we already got n files, and that
> > > the last file we've got has m bytes. There is simply no confusion about
> > > what we've got already, unlike with a partial pack which might need
> > > yet-to-be-received objects in order to make sense of what has been
> > > already received. The server simply has to skip that many files and
> > > resume the transfer at that point, independently of the compression or
> > > even the archive format.
> >
> > Let's reiterate it to check if I understand it correctly:
> >
> > Any "restartable clone" / "resumable fetch" solution must begin with
> > a file which is rock-solid stable wrt. reproductability given the same
> > parameters. git-archive has this feature, packfile doesn't (so I guess
> > that bundle also doesn't, unless it was cached / saved on disk).
>
> Right.
*NEW IDEA*
Another solution would be to try to come up with some sort of stable
sorting of objects, so that a packfile generated for the same parameters
(endpoints) would always be byte-for-byte the same. But that might be
difficult, or even impossible.
Well, we could send the list of objects in the pack, in the order used
later by pack creation, to the client (non-resumable but small); if the
packfile transfer was interrupted in the middle, the client would compare
the list of complete objects in its part of the packfile against this
manifest, and send a request to the server with a *sorted* list of the
objects it doesn't have yet. The server would probably have to check the
validity of the object list first (the object list might need to be more
than just a list of objects; it might need to specify the topology of
deltas, i.e. which objects are bases for which ones). Then it would
generate the rest of the packfile.
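That manifest-plus-sorted-request exchange might look like this (purely illustrative Python with made-up object ids; no such protocol exists in git):

```python
def resume_request(manifest, extracted):
    """manifest: the server's object list in pack-creation order (the
    small, non-resumable part sent up front).  extracted: ids of the
    objects the client fully recovered from the truncated packfile.
    The request is the remaining ids, kept in the server's order so it
    can regenerate just the tail of the pack; a real version would also
    need the delta topology (which objects serve as bases for which)."""
    return [oid for oid in manifest if oid not in extracted]

manifest = ["c1", "t1", "b1", "b2", "c2", "b3"]   # hypothetical object ids
extracted = {"c1", "t1", "b1"}                    # transfer died inside b2
request = resume_request(manifest, extracted)     # ["b2", "c2", "b3"]
```

The manifest itself is small, so losing it to a network error is cheap; only the bulky pack body benefits from resume.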
> > It would be useful if it was possible to generate part of this rock-solid
> > file for partial (range, resume) request, without need to generate
> > (calculate) parts that client already downloaded. Otherwise server has
> > to either waste disk space and IO for caching, or waste CPU (and IO)
> > on generating part which is not needed and dropping it to /dev/null.
> > git-archive you say has this feature.
>
> "Could easily have" is more appropriate.
O.K. And I can see how this can be easy done.
> > Next you need to tell server that you have those objects got using
> > resumable download part ("git archive HEAD" in your proposal), and
> > that it can use them and do not include them in prepared file/pack.
> > "have" is limited to commits, and "have <sha1>" tells server that
> > you have <sha1> and all its prerequisites (dependencies). You can't
> > use "have <sha1>" with git-archive solution. I don't know enough
> > about 'shallow' capability (and what it enables) to know whether
> > it can be used for that. Can you elaborate?
>
> See above, or Documentation/technical/shallow.txt.
Documentation/technical/shallow.txt doesn't cover the "shallow",
"unshallow" and "deepen" commands from the 'shallow' capability
extension to the git pack protocol (http://git-scm.com/gitserver.txt).
> > Then you have to finish clone / fetch. All solutions so far include
> > some kind of incremental improvements. My first proposal of bisect
> > fetching 1/nth or predefined size pack is a bottom-up solution, where
> > we build full clone from root commits up. You propose, from what
> > I understand build full clone from top commit down, using deepening
> > from shallow clone. In this step you either get full incremental
> > or not; downloading incremental (from what I understand) is not
> > resumable / they do not support partial fetch.
>
> Right. However, like I said, the incremental part should be much
> smaller and therefore less susceptible to network troubles.
If the resumable git-archive part is only 7% of the total pack size, how
small do you plan to make those incremental deepenings? Besides, in my
1/Nth proposal those bottom-up packs were also meant to be sufficiently
small to avoid network troubles.
P.S. As you can see implementing resumable clone isn't easy...
--
Jakub Narebski
Poland
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Continue git clone after interruption
2009-08-20 7:37 ` Jakub Narebski
@ 2009-08-20 7:48 ` Nguyen Thai Ngoc Duy
2009-08-20 8:23 ` Jakub Narebski
2009-08-20 18:41 ` Nicolas Pitre
2009-08-20 22:57 ` Sam Vilain
2 siblings, 1 reply; 39+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2009-08-20 7:48 UTC (permalink / raw)
To: Jakub Narebski; +Cc: Nicolas Pitre, Tomasz Kontusz, git, Johannes Schindelin
On Thu, Aug 20, 2009 at 2:37 PM, Jakub Narebski<jnareb@gmail.com> wrote:
> *NEW IDEA*
>
> Another solution would be to try to come up with some sort of stable
> sorting of objects so that packfile generated for the same parameters
> (endpoints) would be always byte-for-byte the same. But that might be
> difficult, or even impossible.
Isn't that the idea of commit reels [1] from GitTorrent?
[1] http://gittorrent.utsl.gen.nz/rfc.html#org-reels
--
Duy
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Continue git clone after interruption
2009-08-20 7:48 ` Nguyen Thai Ngoc Duy
@ 2009-08-20 8:23 ` Jakub Narebski
0 siblings, 0 replies; 39+ messages in thread
From: Jakub Narebski @ 2009-08-20 8:23 UTC (permalink / raw)
To: Nguyen Thai Ngoc Duy
Cc: Nicolas Pitre, Tomasz Kontusz, git, Johannes Schindelin
On Thursday, 20 August 2009 at 09:48, Nguyen Thai Ngoc Duy wrote:
> On Thu, Aug 20, 2009 at 2:37 PM, Jakub Narebski<jnareb@gmail.com> wrote:
> > *NEW IDEA*
> >
> > Another solution would be to try to come up with some sort of stable
> > sorting of objects so that packfile generated for the same parameters
> > (endpoints) would be always byte-for-byte the same. But that might be
> > difficult, or even impossible.
>
> Isn't it the idea of commit reels [1] from git-torrent?
>
> [1] http://gittorrent.utsl.gen.nz/rfc.html#org-reels
Well, I didn't think that this idea hadn't occurred to anybody else.
What I meant was that it is an idea new to this subthread.
--
Jakub Narebski
Poland
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Continue git clone after interruption
2009-08-20 7:37 ` Jakub Narebski
2009-08-20 7:48 ` Nguyen Thai Ngoc Duy
@ 2009-08-20 18:41 ` Nicolas Pitre
2009-08-21 10:07 ` Jakub Narebski
2009-08-20 22:57 ` Sam Vilain
2 siblings, 1 reply; 39+ messages in thread
From: Nicolas Pitre @ 2009-08-20 18:41 UTC (permalink / raw)
To: Jakub Narebski; +Cc: Tomasz Kontusz, git, Johannes Schindelin
On Thu, 20 Aug 2009, Jakub Narebski wrote:
> On Wed, 19 Aug 2009, Nicolas Pitre wrote:
> > You'll get the very latest revision for HEAD, and only that. The size
> > of the transfer will be roughly the size of a daily snapshot, except it
> > is fully up to date. It is however non resumable in the event of a
> > network outage. My proposal is to replace this with a "git archive"
> > call. It won't get all branches, but for the purpose of initialising
> > one's repository that should be good enough. And the "git archive" can
> > be fully resumable as I explained.
>
> It is however only 2.5 MB out of 37 MB that is resumable, which is 7%
> (well, that of course depends on repository). Not that much is
> resumable.
Take the Linux kernel then. It is more like 75 MB.
> > Now to deepen that history. Let's say you want 10 more revisions going
> > back then you simply perform the fetch again with a --depth=10. Right
> > now it doesn't seem to work optimally, but the pack that is then being
> > sent could be made of deltas against objects found in the commits we
> > already have. Currently it seems that a pack that also includes those
> > objects we already have in addition to those we want is created, which
> > is IMHO a flaw in the shallow support that shouldn't be too hard to fix.
> > Each level of deepening should then be as small as standard fetches
> > going forward when updating the repository with new revisions.
>
> You would have the same (or at least quite similar) problems with
> deepening part (the 'incrementals' transfer part) as you found with my
> first proposal of server bisection / division of rev-list, and serving
> 1/Nth of revisions (where N is selected so packfile is reasonable) to
> client as incrementals. Yours is top-down, mine was bottom-up approach
> to sending series of smaller packs. The problem is how to select size
> of incrementals, and that incrementals are all-or-nothing (but see also
> comment below).
Yes and no. Combined with a slight reordering of commit objects, it
could be possible to receive a partial pack and still be able to extract
a bunch of full revisions. The biggest issue is to be able to transfer
revision x (75 MB for Linux), but revision x-1 usually requires only a
few kilobytes, revision x-2 a few other kilobytes, etc. Remember that
you are likely to have only a few deltas from one revision to another,
which is not the case for the very first revision you get. A special
mode to pack-object could place commit objects only after all the
objects needed to create that revision. So once you get a commit object
on the receiving end, you could assume that all objects reachable from
that commit are already received, or you had them locally already.
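The ordering mode described above can be sketched in a few lines; `commit_last_order` and `new_objects_of` are hypothetical names, standing in for a real pack-objects enumeration pass:

```python
def commit_last_order(commits_newest_first, new_objects_of):
    """Hypothetical pack-objects mode from this thread: emit each
    commit object only after every object needed for its revision.

    commits_newest_first: commit ids, most recent first, as a pack
    would usually stream them.
    new_objects_of: commit id -> tree/blob ids introduced relative
    to the commits already emitted.
    """
    stream = []
    for commit in commits_newest_first:
        # First the objects this revision needs that were not sent yet...
        stream.extend(new_objects_of[commit])
        # ...then the commit itself, so a receiver that sees a commit
        # object knows everything reachable from it has arrived.
        stream.append(commit)
    return stream
```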
> In proposal using git-archive and shallow clone deepening as incrementals
> you have this small seed (how small it depends on repository: 50% - 5%)
> which is resumable. And presumably with deepening you can somehow make
> some use from incomplete packfile, only part of which was transferred
> before network error / disconnect. And even tell server about objects
> which you managed to extract from *.pack.part.
Yes. And at that point resuming the transfer is just another case of
shallow repository deepening.
> *NEW IDEA*
>
> Another solution would be to try to come up with some sort of stable
> sorting of objects so that packfile generated for the same parameters
> (endpoints) would be always byte-for-byte the same. But that might be
> difficult, or even impossible.
And I don't want to commit to that either. Having some flexibility in
object ordering makes it possible to improve on the packing heuristics.
We certainly should avoid imposing strong restrictions like that for
little gain. Even the deltas are likely to be different from one
request to another when using threads as one thread might be getting
more CPU time than another slightly modifying the outcome.
> Well, we could send the list of objects in pack in order used later by
> pack creation to client (non-resumable but small part), and if packfile
> transport was interrupted in the middle client would compare list of
> complete objects in part of packfile against this manifest, and send
> request to server with *sorted* list of objects it doesn't have yet.
Well... actually that's one of the item for pack V4. Lots of SHA1s are
duplicated in tree and commit objects, in addition to the pack index
file. With pack v4 all those SHA1s would be stored only once in a table
and objects would index that table instead.
Still, that is not _that_ small though. Just look at the size of the
pack index file for the Linux repository to give you an idea.
> Server would probably have to check validity of objects list first (the
> object list might be needed to be more than just object list; it might
> need to specify topology of deltas, i.e. which objects are base for which
> ones). Then it would generate rest of packfile.
I'm afraid that has the looks of something adding lots of complexity to
a piece of git that is quite complex already, namely pack-objects. And
there are already only a few individuals with their brains around it.
> > > It would be useful if it was possible to generate part of this rock-solid
> > > file for partial (range, resume) request, without need to generate
> > > (calculate) parts that client already downloaded. Otherwise server has
> > > to either waste disk space and IO for caching, or waste CPU (and IO)
> > > on generating part which is not needed and dropping it to /dev/null.
> > > git-archive you say has this feature.
> >
> > "Could easily have" is more appropriate.
>
> O.K. And I can see how this can be easy done.
>
> > > Next you need to tell server that you have those objects got using
> > > resumable download part ("git archive HEAD" in your proposal), and
> > > that it can use them and do not include them in prepared file/pack.
> > > "have" is limited to commits, and "have <sha1>" tells server that
> > > you have <sha1> and all its prerequisites (dependences). You can't
> > > use "have <sha1>" with git-archive solution. I don't know enough
> > > about 'shallow' capability (and what it enables) to know whether
> > > it can be used for that. Can you elaborate?
> >
> > See above, or Documentation/technical/shallow.txt.
>
> Documentation/technical/shallow.txt doesn't cover "shallow", "unshallow"
> and "deepen" commands from 'shallow' capability extension to git pack
> protocol (http://git-scm.com/gitserver.txt).
404 Not Found
Maybe that should be committed to git in Documentation/technical/ as
well?
> > > Then you have to finish clone / fetch. All solutions so far include
> > > some kind of incremental improvements. My first proposal of bisect
> > > fetching 1/nth or predefined size pack is bottom-up solution, where
> > > we build full clone from root commits up. You propose, from what
> > > I understand build full clone from top commit down, using deepening
> > > from shallow clone. In this step you either get full incremental
> > > or not; downloading incremental (from what I understand) is not
> > > resumable / they do not support partial fetch.
> >
> > Right. However, like I said, the incremental part should be much
> > smaller and therefore less susceptible to network troubles.
>
> If you have 7% total pack size of git-archive resumable part, how small
> do you plan to have those incremental deepening? Besides in my 1/Nth
> proposal those bottom-up packs were also meant to be sufficiently small
> to avoid network troubles.
Two issues here: 1) people with slow links might not be interested in a
deep history as it costs them time. 2) Extra revisions should typically
require only a few KB each, therefore we might manage to ask for the
full history after the initial revision is downloaded and salvage as
much as we can if a network outage is encountered. There is no need for
arbitrary size, unless the user decides arbitrarily to get only 10 more
revisions, or 100 more, etc.
> P.S. As you can see implementing resumable clone isn't easy...
I've been saying that all along for quite a while now. ;-)
Nicolas
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Continue git clone after interruption
2009-08-20 7:37 ` Jakub Narebski
2009-08-20 7:48 ` Nguyen Thai Ngoc Duy
2009-08-20 18:41 ` Nicolas Pitre
@ 2009-08-20 22:57 ` Sam Vilain
2 siblings, 0 replies; 39+ messages in thread
From: Sam Vilain @ 2009-08-20 22:57 UTC (permalink / raw)
To: Jakub Narebski
Cc: Nicolas Pitre, Tomasz Kontusz, git, Johannes Schindelin, nick edelen
On Thu, 2009-08-20 at 09:37 +0200, Jakub Narebski wrote:
> You would have the same (or at least quite similar) problems with
> deepening part (the 'incrementals' transfer part) as you found with my
> first proposal of server bisection / division of rev-list, and serving
> 1/Nth of revisions (where N is selected so packfile is reasonable) to
> client as incrementals. Yours is top-down, mine was bottom-up approach
> to sending series of smaller packs. The problem is how to select size
> of incrementals, and that incrementals are all-or-nothing (but see also
> comment below).
I've defined a way to do this which doesn't have the complexity of
bisect in GitTorrent, making the compromise that you can't guarantee
each chunk is exactly the same size... I'll have a crack at doing it
based on the rev-cache code in C instead of the horrendously slow
Perl/Berkeley solution I have at the moment to see how well it fares.
> Another solution would be to try to come up with some sort of stable
> sorting of objects so that packfile generated for the same parameters
> (endpoints) would be always byte-for-byte the same. But that might be
> difficult, or even impossible.
Delta compression is not repeatable enough for this. The first version
of GitTorrent assumed that this would be an appropriate solution.
So, first you have to sort the objects - that's fine, --date-order is a
good starting point, then I reasoned that interleaving new objects for
each commit with commit objects would be a useful sort order. You also
need to tie-break for commits with the same commit date; I just used the
SHA-1 of the commit for that. Finally, when making packs to avoid
excessive transfer you have to try to make sure that they are "thin"
packs.
Currently, thin packs can only work starting at the beginning of history
and working forward, which is opposite to what happens most of the time
in packs. I think this is the source of much of the inefficiency caused
by chopping up the object lists mentioned in my other e-mail. It might
be possible, if you could also know which earlier objects were using
this object as a delta base, to try delta'ing against all those objects
and see which one results in the smallest delta.
> Well, we could send the list of objects in pack in order used later by
> pack creation to client (non-resumable but small part), and if packfile
> transport was interrupted in the middle client would compare list of
> complete objects in part of packfile against this manifest, and send
> request to server with *sorted* list of objects it doesn't have yet.
> Server would probably have to check validity of objects list first (the
> object list might be needed to be more than just object list; it might
> need to specify topology of deltas, i.e. which objects are base for which
> ones). Then it would generate rest of packfile.
Mmm. It's a bit chatty, that. Object lists add another 10-20% on,
which I think should be avoidable if the thin-pack problem, plus the
problem of some objects ending up in more than one of the thin packs
that are created, can be reduced to very little.
Sam
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Continue git clone after interruption
2009-08-20 18:41 ` Nicolas Pitre
@ 2009-08-21 10:07 ` Jakub Narebski
2009-08-21 10:26 ` Matthieu Moy
2009-08-21 21:07 ` Nicolas Pitre
0 siblings, 2 replies; 39+ messages in thread
From: Jakub Narebski @ 2009-08-21 10:07 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Tomasz Kontusz, git, Johannes Schindelin, Scott Chacon
On Thu, 20 Aug 2009, Nicolas Pitre wrote:
> On Thu, 20 Aug 2009, Jakub Narebski wrote:
>> On Wed, 19 Aug 2009, Nicolas Pitre wrote:
>>> You'll get the very latest revision for HEAD, and only that. The size
>>> of the transfer will be roughly the size of a daily snapshot, except it
>>> is fully up to date. It is however non resumable in the event of a
>>> network outage. My proposal is to replace this with a "git archive"
>>> call. It won't get all branches, but for the purpose of initialising
>>> one's repository that should be good enough. And the "git archive" can
>>> be fully resumable as I explained.
>>
>> It is however only 2.5 MB out of 37 MB that are resumable, which is 7%
>> (well, that of course depends on repository). Not that much that is
>> resumable.
>
> Take the Linux kernel then. It is more like 75 MB.
Ah... good example.
On the other hand, Linux is a fairly large project in terms of LoC, but
it had its history cut when moving to Git, so the ratio of a git-archive
of HEAD to the size of the packfile is overemphasized here.
>>> Now to deepen that history. Let's say you want 10 more revisions going
>>> back then you simply perform the fetch again with a --depth=10. Right
>>> now it doesn't seem to work optimally, but the pack that is then being
>>> sent could be made of deltas against objects found in the commits we
>>> already have. Currently it seems that a pack that also includes those
>>> objects we already have in addition to those we want is created, which
>>> is IMHO a flaw in the shallow support that shouldn't be too hard to fix.
>>> Each level of deepening should then be as small as standard fetches
>>> going forward when updating the repository with new revisions.
>>
>> You would have the same (or at least quite similar) problems with
>> deepening part (the 'incrementals' transfer part) as you found with my
>> first proposal of server bisection / division of rev-list, and serving
>> 1/Nth of revisions (where N is selected so packfile is reasonable) to
>> client as incrementals. Yours is top-down, mine was bottom-up approach
>> to sending series of smaller packs. The problem is how to select size
>> of incrementals, and that incrementals are all-or-nothing (but see also
>> comment below).
>
> Yes and no. Combined with a slight reordering of commit objects, it
> could be possible to receive a partial pack and still be able to extract
> a bunch of full revisions. The biggest issue is to be able to transfer
> revision x (75 MB for Linux), but revision x-1 usually requires only a
> few kilobytes, revision x-2 a few other kilobytes, etc. Remember that
> you are likely to have only a few deltas from one revision to another,
> which is not the case for the very first revision you get.
Let's reiterate, to be sure that I understand this correctly:
You make use here of a few facts:
1. Objects in packfile are _usually_ sorted in recency order, with most
recent commits, and most recent versions of trees and tags being in
the front of pack file, and being base objects for a large set of
objects. Note the "usually" part; it is not set in stone as for RCS
(and CVS) reverse delta based repository format.
2. There is support in git pack format to do 'deepening' of shallow
clone, which means that git can generate incrementals in top-down
order, _similar to how objects are ordered in packfile_.
3. git-archive output is stable. _git-archive can be made resumable_
(with range/partial requests), and can be made so it can create
single-head depth 0 shallow clone.
Also, with top-down deepening order, even if you don't use
'git clone --continue' but 'git clone --skip' (or something), you
would still have a usable shallow clone. In the most extreme case,
when you are able to get only the fully resumable part, i.e. the
git-archive part (with the top commit), you would have a single-branch
depth-0 shallow clone (not very usable, but still better than nothing).
> A special
> mode to pack-object could place commit objects only after all the
> objects needed to create that revision. So once you get a commit object
> on the receiving end, you could assume that all objects reachable from
> that commit are already received, or you had them locally already.
Yes, with such a mode (which I think wouldn't reduce or interfere with
upload-pack's ability to pack more tightly by reordering objects and
choosing different deltas) it would be easy to salvage a partially
completed / transferred packfile. Even if there is no extension to tell
the git server which objects we have ("have" is only about commits), if
there is at least one commit object in the received part of the
packfile, we can resume from that point; there is less left to download.
>
>> In proposal using git-archive and shallow clone deepening as incrementals
>> you have this small seed (how small it depends on repository: 50% - 5%)
>> which is resumable. And presumably with deepening you can somehow make
>> some use from incomplete packfile, only part of which was transferred
>> before network error / disconnect. And even tell server about objects
>> which you managed to extract from *.pack.part.
>
> yes. And at that point resuming the transfer is just another case of
> shallow repository deepening.
Also, for top-down deepening incrementals as in your proposal, you can
have a 'salvage' operation which tries to use something out of a
partially transferred packfile (a partially downloaded incremental).
That is not, I think, the case with my earlier 'server bisect'
bottom-up incrementals idea.
>
>> *NEW IDEA*
>>
>> Another solution would be to try to come up with some sort of stable
>> sorting of objects so that packfile generated for the same parameters
>> (endpoints) would be always byte-for-byte the same. But that might be
>> difficult, or even impossible.
>
> And I don't want to commit to that either. Having some flexibility in
> object ordering makes it possible to improve on the packing heuristics.
> We certainly should avoid imposing strong restrictions like that for
> little gain. Even the deltas are likely to be different from one
> request to another when using threads as one thread might be getting
> more CPU time than another slightly modifying the outcome.
Right.
>> Well, we could send the list of objects in pack in order used later by
>> pack creation to client (non-resumable but small part), and if packfile
>> transport was interrupted in the middle client would compare list of
>> complete objects in part of packfile against this manifest, and send
>> request to server with *sorted* list of objects it doesn't have yet.
>
> Well... actually that's one of the item for pack V4. Lots of SHA1s are
> duplicated in tree and commit objects, in addition to the pack index
> file. With pack v4 all those SHA1s would be stored only once in a table
> and objects would index that table instead.
>
> Still, that is not _that_ small though. Just look at the size of the
> pack index file for the Linux repository to give you an idea.
Well, such a plan (map) of a packfile wouldn't be much smaller than the
pack index, which matters if the pack index file is large enough (or
the connection crappy enough) that it couldn't be transferred without
interruption. Nevertheless, the 34 MB index for the largest 310 MB
packfile in the Linux kernel
http://www.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git/objects/pack/
isn't something very large, and the objects list / packfile plan would
be of comparable size.
I was envisioning the packfile plan (packfile map) as something like this:
sha1 TERMINATOR
for objects that are not deltified in the packfile, and
sha1 SEPARATOR base-sha1 TERMINATOR
for objects that are deltified (or something like that; we could use
pkt-line format instead).
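A minimal reader/writer for such a plan might look like this, assuming a newline terminator and a space separator (both left open above; pkt-line framing was the alternative suggested):

```python
def write_plan(entries):
    """Serialize a hypothetical packfile plan: one line per object,
    'sha1' for a plain object, 'sha1 base-sha1' for a deltified one.
    entries: list of (sha1, base_sha1_or_None) in pack order."""
    lines = []
    for sha1, base in entries:
        lines.append(sha1 if base is None else f"{sha1} {base}")
    return "\n".join(lines) + "\n"

def read_plan(text):
    """Parse the plan back into (sha1, base_or_None) pairs."""
    entries = []
    for line in text.splitlines():
        parts = line.split(" ")
        entries.append((parts[0], parts[1] if len(parts) > 1 else None))
    return entries
```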
>> Server would probably have to check validity of objects list first (the
>> object list might be needed to be more than just object list; it might
>> need to specify topology of deltas, i.e. which objects are base for which
>> ones). Then it would generate rest of packfile.
>
> I'm afraid that has the looks of something adding lots of complexity to
> a piece of git that is already quite complex already, namely
> pack-objects. And there is already only a few individuals with their
> brain around it.
Well, between the complexity, the server CPU/IO (because one should not
trust the client, unfortunately), the 'packfile plan' transfer being
non-resumable, and the requirement of pack v4 or a temporary file or
memory to send the packfile plan (packfile map)... I think we can scrap
that half-baked idea.
>>>> [...] I don't know enough
>>>> about 'shallow' capability (and what it enables) to know whether
>>>> it can be used for that. Can you elaborate?
>>>
>>> See above, or Documentation/technical/shallow.txt.
>>
>> Documentation/technical/shallow.txt doesn't cover "shallow", "unshallow"
>> and "deepen" commands from 'shallow' capability extension to git pack
>> protocol (http://git-scm.com/gitserver.txt).
>
> 404 Not Found
>
> Maybe that should be committed to git in Documentation/technical/ as
> well?
This was a plain-text RFC for the Git Packfile Protocol, generated from
rfc2629 XML sources at http://github.com/schacon/gitserver-rfc
Scott, what happened to http://git-scm.com/gitserver.txt?
And could you create an 'rfc' or 'text' branch in the gitserver-rfc
repository, with processed plain-text output, similar to the 'man' and
'html' branches in the git.git repository? TIA.
_Some_ description of the pack protocol can be found in the git mailing
list archives
http://thread.gmane.org/gmane.comp.version-control.git/118956
in "The Git Community Book"
http://book.git-scm.com/7_transfer_protocols.html
http://github.com/schacon/gitbook/blob/master/text/54_Transfer_Protocols/0_Transfer_Protocols.markdown
and in "Pro Git"
http://progit.org/book/ch9-6.html
http://github.com/progit/progit/blob/master/en/09-git-internals/01-chapter9.markdown
The description in Documentation/technical/pack-protocol.txt is very
brief, and Documentation/technical/shallow.txt doesn't cover 'shallow'
capability of git pack protocol.
>>>> Then you have to finish clone / fetch. All solutions so far include
>>>> some kind of incremental improvements. My first proposal of bisect
>>>> fetching 1/nth or predefined size pack is bottom-up solution, where
>>>> we build full clone from root commits up. You propose, from what
>>>> I understand build full clone from top commit down, using deepening
>>>> from shallow clone. In this step you either get full incremental
>>>> or not; downloading incremental (from what I understand) is not
>>>> resumable / they do not support partial fetch.
>>>
>>> Right. However, like I said, the incremental part should be much
>>> smaller and therefore less susceptible to network troubles.
>>
>> If you have 7% total pack size of git-archive resumable part, how small
>> do you plan to have those incremental deepening? Besides in my 1/Nth
>> proposal those bottom-up packs were also meant to be sufficiently small
>> to avoid network troubles.
>
> Two issues here: 1) people with slow links might not be interested in a
> deep history as it costs them time. 2) Extra revisions should typically
> require only a few KB each, therefore we might manage to ask for the
> full history after the initial revision is downloaded and salvage as
> much as we can if a network outage is encountered. There is no need for
> arbitrary size, unless the user decides arbitrarily to get only 10 more
> revisions, or 100 more, etc.
Those two features of your proposal:
1.) Partially transferred packfiles can be salvaged (so there is no
need to guess their size accurately),
2.) After completing the initial git-archive transfer, you can convert
the incomplete clone into a functioning repository. It would be a
shallow clone, and can be missing some branches and tags, but you
can work with it if the network connection fails completely,
make it very compelling.
>> P.S. As you can see implementing resumable clone isn't easy...
>
> I've been saying that all along for quite a while now. ;-)
Well, on the other hand, we have the example of how long it took to
arrive at the current implementation of git submodules. But it finally
got done.
The git-archive + deepening approach you proposed can be split into
smaller individual improvements. You don't need to implement it all
at once.
1. Improve deepening of shallow clones. This means sending only the
required objects, and being able to use objects the other side already
has as delta bases, i.e. sending a thin pack.
2. Add support for resuming (range requests) of git-archive. It is up
to the client to translate the size of a partial transfer of the
compressed file into a range request against the original
(uncompressed) archive.
3. Create a new git-archive pseudoformat, used to transfer a single
commit (with the commit object and the original branch name in some
extended header, similar to how the commit ID is stored in an extended
pax header or ZIP comment). It would imply not using export-*
gitattributes.
4. Implement an alternate ordering of objects in the packfile, so that
each commit object is placed immediately after all its prerequisites.
5. Implement a 'salvage' operation which, given a partially transferred
packfile, would deepen the shallow clone or advance tracking branches,
ensuring that the repository passes fsck after this operation.
Probably requires 4; it might be impossible, or much harder, to salvage
anything with the current ordering of objects in the packfile.
6. Implement resumable clone ("git clone --keep <URL> [<directory>]",
"git clone --resume" / "git clone --continue", "git clone --abort",
"git clone --make-shallow" / "git clone --salvage").
Requires 1-5.
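Step 2 above hinges on translating a partial compressed transfer into an uncompressed offset. A sketch, assuming plain zlib framing for the compressed archive (the actual compression wrapper is not pinned down in this thread):

```python
import zlib

def resume_offset(partial_compressed):
    """Return how many uncompressed archive bytes a truncated zlib
    stream yields; the client would use this as the start of its next
    range request against the original (uncompressed) archive."""
    d = zlib.decompressobj()
    try:
        return len(d.decompress(partial_compressed))
    except zlib.error:
        return 0  # truncated inside the stream header: nothing recovered
```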
--
Jakub Narebski
Poland
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Continue git clone after interruption
2009-08-21 10:07 ` Jakub Narebski
@ 2009-08-21 10:26 ` Matthieu Moy
2009-08-21 21:07 ` Nicolas Pitre
1 sibling, 0 replies; 39+ messages in thread
From: Matthieu Moy @ 2009-08-21 10:26 UTC (permalink / raw)
To: Jakub Narebski
Cc: Nicolas Pitre, Tomasz Kontusz, git, Johannes Schindelin, Scott Chacon
Jakub Narebski <jnareb@gmail.com> writes:
> On the other hand Linux is fairly large project in terms of LoC, but
> it had its history cut when moving to Git, so the ratio of git-archive
> of HEAD to the size of packfile is overemphasized here.
Emacs can be a good example if you want a project with a loooong
history.
emacs.git$ git ll | wc -l
100651
emacs.git$ du -sh emacs.tar.gz .git/objects/pack/pack-144583582d53e273028966c6de2b3fb2fe3504bc.pack
29M emacs.tar.gz
138M .git/objects/pack/pack-144583582d53e273028966c6de2b3fb2fe3504bc.pack
(from git://git.savannah.gnu.org/emacs.git )
--
Matthieu
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Continue git clone after interruption
2009-08-21 10:07 ` Jakub Narebski
2009-08-21 10:26 ` Matthieu Moy
@ 2009-08-21 21:07 ` Nicolas Pitre
2009-08-21 21:41 ` Jakub Narebski
2009-08-21 23:07 ` Sam Vilain
1 sibling, 2 replies; 39+ messages in thread
From: Nicolas Pitre @ 2009-08-21 21:07 UTC (permalink / raw)
To: Jakub Narebski; +Cc: Tomasz Kontusz, git, Johannes Schindelin, Scott Chacon
On Fri, 21 Aug 2009, Jakub Narebski wrote:
> On Thu, 20 Aug 2009, Nicolas Pitre wrote:
> > On Thu, 20 Aug 2009, Jakub Narebski wrote:
> >> It is however only 2.5 MB out of 37 MB that are resumable, which is 7%
> >> (well, that of course depends on repository). Not that much that is
> >> resumable.
> >
> > Take the Linux kernel then. It is more like 75 MB.
>
> Ah... good example.
>
> On the other hand Linux is fairly large project in terms of LoC, but
> it had its history cut when moving to Git, so the ratio of git-archive
> of HEAD to the size of packfile is overemphasized here.
That doesn't matter. You still need that amount of data up front to do
anything. And I doubt people with slow links will want the full history
anyway, regardless of whether it goes back 4 years or 18.
> You make use here of a few facts:
>
> 1. Objects in packfile are _usually_ sorted in recency order, with most
> recent commits, and most recent versions of trees and tags being in
> the front of pack file, and being base objects for a large set of
> objects. Note the "usually" part; it is not set in stone as for RCS
> (and CVS) reverse delta based repository format.
Exact. In theory the object order could be totally random and the pack
would still be valid. The only restriction at the moment has to do with
OFS_DELTA objects as the reference to the base object is encoded as a
downward offset from the beginning of that OFS_DELTA object. Hence the
base object has to appear first. In the case of REF_DELTA objects, the
base can be located anywhere in the pack (or anywhere else outside of
the pack in the thin pack case).
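For reference, the downward base offset of an OFS_DELTA object is stored in a variable-length encoding; the sketch below mirrors the scheme from git's C code (note the +1 quirk per continuation byte, which rules out redundant encodings), so treat the details as a sketch rather than a spec:

```python
def encode_ofs(offset):
    """Encode an OFS_DELTA base offset as git's pack format does:
    the last byte carries the low 7 bits; earlier bytes set the high
    bit, and each continuation level subtracts one first."""
    out = [offset & 0x7F]
    offset >>= 7
    while offset:
        offset -= 1
        out.append(0x80 | (offset & 0x7F))
        offset >>= 7
    return bytes(reversed(out))

def decode_ofs(data):
    """Decode the offset; returns (offset, bytes consumed)."""
    c = data[0]
    offset = c & 0x7F
    i = 1
    while c & 0x80:
        c = data[i]
        i += 1
        offset = ((offset + 1) << 7) | (c & 0x7F)
    return offset, i
```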
> 2. There is support in git pack format to do 'deepening' of shallow
> clone, which means that git can generate incrementals in top-down
> order, _similar to how objects are ordered in packfile_.
Well... the pack format was not meant for that "support". The fact that
the typical object order used by pack-objects when serving a fetch
request is amenable to incremental top-down updates is rather
coincidental and not really planned.
> 3. git-archive output is stable. _git-archive can be made resumable_
> (with range/partial requests), and can be made so it can create
> single-head depth 0 shallow clone.
>
> Also, with top-down deepening order even if you don't use
> 'git clone --continue' but 'git clone --skip' (or something), you
> would have got usable shallow clone. In the most extreme case when
> you are able to get only the fully resumable part, i.e. git-archive
> part (with top commit), you would have single-branch depth 0
> shallow clone (not very usable, but still better than nothing).
Right.
> > A special
> > mode to pack-object could place commit objects only after all the
> > objects needed to create that revision. So once you get a commit object
> > on the receiving end, you could assume that all objects reachable from
> > that commit are already received, or you had them locally already.
>
> Yes, with such a mode (which I think wouldn't reduce or interfere with
> upload-pack's ability to pack more tightly by reordering objects and
> choosing different deltas) it would be easy to salvage a partially
> completed / transferred packfile. Even if there is no extension to tell
> the git server which objects we have ("have" is only about commits), if
> there is at least one commit object in the received part of the
> packfile, we can resume from that point; there is less left to download.
Exact. It suffices to set the last received commit(s) (after
validation) as shallow points.
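Assuming the commit-last ordering mode discussed above, that salvage step is simple to express; `salvage` is a hypothetical helper, with the `(type, sha1)` list standing in for whatever index-pack recovers from the truncated pack:

```python
def salvage(received):
    """received: (type, sha1) pairs in stream order, recovered from a
    truncated pack that used commit-last ordering. Every commit seen is
    complete, so it can become a shallow point for a resumed fetch."""
    commits = [sha for kind, sha in received if kind == "commit"]
    last = max((i for i, (k, _) in enumerate(received) if k == "commit"),
               default=-1)
    # Objects after the last commit belong to a revision that did not
    # finish transferring; keep them, but they prove nothing by themselves.
    incomplete = [sha for _, sha in received[last + 1:]]
    return commits, incomplete
```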
> >> Documentation/technical/shallow.txt doesn't cover "shallow", "unshallow"
> >> and "deepen" commands from 'shallow' capability extension to git pack
> >> protocol (http://git-scm.com/gitserver.txt).
> >
> > 404 Not Found
> >
> > Maybe that should be committed to git in Documentation/technical/ as
> > well?
>
> This was a plain-text RFC for the Git Packfile Protocol, generated from
> rfc2629 XML sources at http://github.com/schacon/gitserver-rfc
I suggest you track it down and prod/propose a version for merging in
the git repository.
> The description in Documentation/technical/pack-protocol.txt is very
> brief, and Documentation/technical/shallow.txt doesn't cover 'shallow'
> capability of git pack protocol.
Yeah. I finally had a look directly at the code to understand how it
works.
> >> P.S. As you can see implementing resumable clone isn't easy...
> >
> > I've been saying that all along for quite a while now. ;-)
>
> Well, on the other hand, we have the example of how long it took to
> arrive at the current implementation of git submodules. But it finally
> got done.
In this case there is still no new line of code whatsoever. Thinking
it through is what takes time.
> The git-archive + deepening approach you proposed can be split into
> smaller individual improvements. You don't need to implement it all
> at once.
>
> 1. Improve deepening of shallow clone. This means sending only required
> objects, and being able to use as a base objects that other side has
> and send thin pack.
Yes. And now that I understand how shallow clones are implemented, I
will probably fix that flaw soon. Won't be hard at all.
> 2. Add support for resuming (range request) of git-archive. It is up
> to client to translate size of partial transfer of compressed file
> into range request of original (uncompressed) archive.
>
> 3. Create new git-archive pseudoformat, used to transfer single commit
> (with commit object and original branch name in some extended header,
> similar to how commit ID is stored in extended pax header or ZIP
> comment). It would imply not using export-* gitattributes.
The format I was envisioning is really simple:
First the size of the raw commit object data content in decimal,
followed by a 0 byte, followed by the actual content of the commit
object, followed by a 0 byte. (Note: this could be the exact same
content as the canonical commit object data with the "commit" prefix,
but as all the rest are all blob content this would be redundant.)
Then, for each file:
- The file mode in octal notation just as in tree objects
- a space
- the size of the file in decimal
- a tab
- the full path of the file
- a 0 byte
- the file content as found in the corresponding blob
- a 0 byte
And finally some kind of marker to indicate the end of the stream.
Put the lot through zlib and you're done.
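A throwaway serializer for the stream sketched above, for a single commit. Assumptions not in the original: the repository and file names are invented, "END" stands in for the unspecified end-of-stream marker, gzip stands in for raw zlib, and paths containing whitespace are not handled:

```shell
# Sketch of the proposed single-commit stream format. Assumptions:
# invented "END" marker, gzip in place of raw zlib, toy repository,
# no paths with whitespace.
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q .
echo hello > greeting.txt
git add greeting.txt
git -c user.name=demo -c user.email=demo@example.com commit -qm "first commit"

commit=$(git rev-parse HEAD)
{
    # <size of raw commit content in decimal> NUL <commit content> NUL
    printf '%s\0' "$(git cat-file -s "$commit")"
    git cat-file commit "$commit"
    printf '\0'

    # Per file: <mode> SP <size> TAB <path> NUL <blob content> NUL
    git ls-tree -r "$commit" | while read -r mode type sha path; do
        printf '%s %s\t%s\0' "$mode" "$(git cat-file -s "$sha")" "$path"
        git cat-file blob "$sha"
        printf '\0'
    done

    # End-of-stream marker (invented here; the proposal leaves it open).
    printf 'END'
} | gzip > snapshot.gz
```

Decompressing snapshot.gz and splitting on NUL bytes recovers the raw commit object, then each file's mode, size, path and content, then the marker.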
> 4. Implement alternate ordering of objects in packfile, so commit object
> is put immediately after all its prerequisites.
That would require some changes in the object enumeration code which is
an area of the code I don't know well.
> 5. Implement 'salvage' operation, which given partially transferred
> packfile would deepen shallow clone, or advance tracking branches,
> ensuring that repository would pass fsck after this operation.
>
> Probably requires 4; it might not be possible, or much harder, to
> salvage anything with the current ordering of objects in the packfile.
>
> 6. Implement resumable clone ("git clone --keep <URL> [<directory>]",
> "git clone --resume" / "git clone --continue", "git clone --abort",
> "git clone --make-shallow" / "git clone --salvage").
Right. This is all doable fairly easily.
Nicolas
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Continue git clone after interruption
2009-08-21 21:07 ` Nicolas Pitre
@ 2009-08-21 21:41 ` Jakub Narebski
2009-08-22 0:59 ` Nicolas Pitre
2009-08-21 23:07 ` Sam Vilain
1 sibling, 1 reply; 39+ messages in thread
From: Jakub Narebski @ 2009-08-21 21:41 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Tomasz Kontusz, git, Johannes Schindelin, Scott Chacon
On Fri, 21 Aug 2009, Nicolas Pitre wrote:
> On Fri, 21 Aug 2009, Jakub Narebski wrote:
>> On Thu, 20 Aug 2009, Nicolas Pitre wrote:
>>> On Thu, 20 Aug 2009, Jakub Narebski wrote:
>>>> It is however only 2.5 MB out of 37 MB that are resumable, which is 7%
>>>> (well, that of course depends on repository). Not that much that is
>>>> resumable.
>>>
>>> Take the Linux kernel then. It is more like 75 MB.
>>
>> Ah... good example.
>>
>> On the other hand Linux is fairly large project in terms of LoC, but
>> it had its history cut when moving to Git, so the ratio of git-archive
>> of HEAD to the size of packfile is overemphasized here.
>
> That doesn't matter. You still need that amount of data up front to do
> anything. And I doubt people with slow links will want the full history
> anyway, regardless if it goes backward 4 years or 18 years back.
On the other hand, an unreliable link doesn't have to mean an
unreasonably slow link.
Hopefully GitTorrent / git-mirror-sync would finally come out of
vapourware and wouldn't share the fate of Duke Nukem Forever ;-),
and we would have this as an alternative to clone large repositories.
Well, supposedly there is some code, and last year's GSoC project at
least shook the dust out of the initial design and made it simpler, IIUC.
>> You make use here of a few facts:
[...]
>> 2. There is support in git pack format to do 'deepening' of shallow
>> clone, which means that git can generate incrementals in top-down
>> order, _similar to how objects are ordered in packfile_.
>
> Well... the pack format was not meant for that "support". The fact that
> the typical object order used by pack-objects when serving fetch request
> is amenable to incremental top-down updates is rather coincidental and
> not really planned.
Ooops. I meant "git pack PROTOCOL" here, not "git pack _format_":
the one about the want/have/shallow/deepen exchange.
[...]
>>> A special
>>> mode to pack-object could place commit objects only after all the
>>> objects needed to create that revision. So once you get a commit object
>>> on the receiving end, you could assume that all objects reachable from
>>> that commit are already received, or you had them locally already.
>>
>> Yes, with such mode (which I think wouldn't reduce / interfere with
>> ability for upload-pack to pack more tightly by reordering objects
>> and choosing different deltas) it would be easy to do a salvage of
>> a partially completed / transferred packfile. Even if there is no
>> extension to tell git server which objects we have ("have" is only
>> about commits), if there is at least one commit object in the received
>> part of the packfile, we can try to continue from a later point;
>> there is less left to download.
>
> Exactly. It suffices to set the last received commit(s) (after
> validation) as one of the shallow points.
Assuming that the received commit is complete (has all its
prerequisites), and is connected to the rest of the body of the
partially [shallow-]cloned repository.
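For the record, the mechanics behind such "shallow points" are simple: a shallow boundary is a commit ID recorded in $GIT_DIR/shallow, past which revision walks do not go. A toy illustration (poking .git directly is for demonstration only, not what an eventual salvage tool would do):

```shell
# A shallow boundary is just a commit ID listed in .git/shallow.
# Toy repository; writing .git/shallow by hand is for illustration only.
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q repo && cd repo
for i in 1 2 3; do
    echo "$i" > f
    git add f
    git -c user.name=demo -c user.email=demo@example.com commit -qm "c$i"
done

git rev-list HEAD | wc -l    # shows 3: full history

# Record the middle commit as a shallow point, as the salvage step would.
git rev-parse HEAD~1 >> .git/shallow

git rev-list HEAD | wc -l    # shows 2: the walk stops at the shallow point
```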
>>>> Documentation/technical/shallow.txt doesn't cover "shallow", "unshallow"
>>>> and "deepen" commands from 'shallow' capability extension to git pack
>>>> protocol (http://git-scm.com/gitserver.txt).
>>>
>>> 404 Not Found
>>>
>>> Maybe that should be committed to git in Documentation/technical/ as
>>> well?
>>
>> This was plain text RFC for the Git Packfile Protocol, generated from
>> rfc2629 XML sources at http://github.com/schacon/gitserver-rfc
>
> I suggest you track it down and prod/propose a version for merging in
> the git repository.
Scott Chacon was (and is) CC-ed.
I don't know if you remember the discussion about the pack protocol
mentioned earlier, stemming from the fact that some git
(re)implementations (Dulwich, JGit) failed to implement it properly,
where properly = same as git-core, i.e. the original implementation
in C... because there was not enough documentation.
>>>> P.S. As you can see implementing resumable clone isn't easy...
>>>
>>> I've been saying that all along for quite a while now. ;-)
>>
>> Well, on the other hand we have the example of how long it took to
>> come to the current implementation of git submodules. But it finally
>> got done.
>
> In this case there is still no new line of code whatsoever. Thinking
> it through is what takes time.
Measure twice, cut once :-)
In this case I think design upfront is a good solution.
>> The git-archive + deepening approach you proposed can be split into
>> smaller individual improvements. You don't need to implement it all
>> at once.
[...]
>> 3. Create new git-archive pseudoformat, used to transfer single commit
>> (with commit object and original branch name in some extended header,
>> similar to how commit ID is stored in extended pax header or ZIP
>> comment). It would imply not using export-* gitattributes.
>
> The format I was envisioning is really simple:
>
> First the size of the raw commit object data content in decimal,
> followed by a 0 byte, followed by the actual content of the commit
> object, followed by a 0 byte. (Note: this could be the exact same
> content as the canonical commit object data with the "commit" prefix,
> but as all the rest are all blob content this would be redundant.)
>
> Then, for each file:
>
> - The file mode in octal notation just as in tree objects
> - a space
> - the size of the file in decimal
> - a tab
> - the full path of the file
> - a 0 byte
> - the file content as found in the corresponding blob
> - a 0 byte
>
> And finally some kind of marker to indicate the end of the stream.
>
> Put the lot through zlib and you're done.
So you don't want to just tack the commit object (as an extended pax
header, or a comment, if that is at all possible) onto the existing
'tar' and 'zip' archive formats. Probably better to design the format
from scratch.
>> 4. Implement alternate ordering of objects in packfile, so commit object
>> is put immediately after all its prerequisites.
>
> That would require some changes in the object enumeration code which is
> an area of the code I don't know well.
Oh.
--
Jakub Narebski
Poland
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Continue git clone after interruption
2009-08-21 21:07 ` Nicolas Pitre
2009-08-21 21:41 ` Jakub Narebski
@ 2009-08-21 23:07 ` Sam Vilain
2009-08-22 3:37 ` Nicolas Pitre
1 sibling, 1 reply; 39+ messages in thread
From: Sam Vilain @ 2009-08-21 23:07 UTC (permalink / raw)
To: Nicolas Pitre
Cc: Jakub Narebski, Tomasz Kontusz, git, Johannes Schindelin, Scott Chacon
On Fri, 2009-08-21 at 17:07 -0400, Nicolas Pitre wrote:
> > 2. There is support in git pack format to do 'deepening' of shallow
> > clone, which means that git can generate incrementals in top-down
> > order, _similar to how objects are ordered in packfile_.
>
> Well... the pack format was not meant for that "support". The fact
> that the typical object order used by pack-objects when serving fetch
> request is amenable to incremental top-down updates is rather
> coincidental and not really planned.
Mmm. And the problem with 'thin' packs is that they normally allow
deltas the other way.
I think the first step here would be to allow thin pack generation to
accept a bounded range of commits, any of the objects within which may
be used as delta base candidates. That way, these "top down" thin packs
can be generated. Currently of course it just uses the --not and makes
"bottom up" thin packs.
> > Another solution would be to try to come up with some sort of stable
> > sorting of objects so that packfile generated for the same
> > parameters (endpoints) would be always byte-for-byte the same. But
> > that might be difficult, or even impossible.
>
> And I don't want to commit to that either. Having some flexibility
> in object ordering makes it possible to improve on the packing
> heuristics.
You don't have to lose that for storage. It's only for generating the
thin packs that it matters; also, the restriction is relaxed when it
comes to objects which are all being sent in the same pack, which can
freely delta amongst themselves in any direction.
What did you think about the bundle slicing stuff?
Sam
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Continue git clone after interruption
2009-08-21 21:41 ` Jakub Narebski
@ 2009-08-22 0:59 ` Nicolas Pitre
0 siblings, 0 replies; 39+ messages in thread
From: Nicolas Pitre @ 2009-08-22 0:59 UTC (permalink / raw)
To: Jakub Narebski; +Cc: Tomasz Kontusz, git, Johannes Schindelin, Scott Chacon
On Fri, 21 Aug 2009, Jakub Narebski wrote:
> On Fri, 21 Aug 2009, Nicolas Pitre wrote:
> > On Fri, 21 Aug 2009, Jakub Narebski wrote:
> >> On Thu, 20 Aug 2009, Nicolas Pitre wrote:
> >>> On Thu, 20 Aug 2009, Jakub Narebski wrote:
>
> >>>> It is however only 2.5 MB out of 37 MB that are resumable, which is 7%
> >>>> (well, that of course depends on repository). Not that much that is
> >>>> resumable.
> >>>
> >>> Take the Linux kernel then. It is more like 75 MB.
> >>
> >> Ah... good example.
> >>
> >> On the other hand Linux is fairly large project in terms of LoC, but
> >> it had its history cut when moving to Git, so the ratio of git-archive
> >> of HEAD to the size of packfile is overemphasized here.
> >
> > That doesn't matter. You still need that amount of data up front to do
> > anything. And I doubt people with slow links will want the full history
> > anyway, regardless if it goes backward 4 years or 18 years back.
>
> On the other hand, an unreliable link doesn't have to mean an
> unreasonably slow link.
In my experience speed and reliability are more or less tied together.
And the slower your link, the longer your transfer will last, and the
greater the chances that you'll run into trouble.
> Hopefully GitTorrent / git-mirror-sync would finally come out of
> vapourware and wouldn't share the fate of Duke Nukem Forever ;-),
> and we would have this as an alternative to clone large repositories.
Well... Maybe.
> Well, supposedly there is some code, and last year's GSoC project at
> least shook the dust out of the initial design and made it simpler, IIUC.
The BitTorrent protocol is a nifty thing (although I doubt the
entertainment industry thinks so). But its efficiency relies on the fact
that many many people are expected to download the same stuff at the
same time. I have some doubts about the availability of the right
conditions in the context of git for a BitTorrent-like protocol to work
well in practice. But this is Open Source and no one has to wait for me
or anyone else to be convinced before attempting it and showing results
to the world.
> >> This was plain text RFC for the Git Packfile Protocol, generated from
> >> rfc2629 XML sources at http://github.com/schacon/gitserver-rfc
> >
> > I suggest you track it down and prod/propose a version for merging in
> > the git repository.
>
> Scott Chacon was (and is) CC-ed.
He might not have followed all of our exchange in this thread so
closely, though. So another thread with him in the To: field might be
needed to get his attention.
> I don't know if you remember the discussion about the pack protocol
> mentioned earlier, stemming from the fact that some git
> (re)implementations (Dulwich, JGit) failed to implement it properly,
> where properly = same as git-core, i.e. the original implementation
> in C... because there was not enough documentation.
Yes, I followed the discussion. I still think that, since that
documentation now exists, it would be a good idea to have a copy
included in the git sources.
> > The format I was envisioning is really simple:
> >
> > First the size of the raw commit object data content in decimal,
> > followed by a 0 byte, followed by the actual content of the commit
> > object, followed by a 0 byte. (Note: this could be the exact same
> > content as the canonical commit object data with the "commit" prefix,
> > but as all the rest are all blob content this would be redundant.)
> >
> > Then, for each file:
> >
> > - The file mode in octal notation just as in tree objects
> > - a space
> > - the size of the file in decimal
> > - a tab
> > - the full path of the file
> > - a 0 byte
> > - the file content as found in the corresponding blob
> > - a 0 byte
> >
> > And finally some kind of marker to indicate the end of the stream.
> >
> > Put the lot through zlib and you're done.
>
> So you don't want to just tack the commit object (as an extended pax
> header, or a comment, if that is at all possible) onto the existing
> 'tar' and 'zip' archive formats. Probably better to design the format
> from scratch.
As René Scharfe mentioned, the existing archive formats have limitations
and complexities that we might simply avoid altogether by creating a
simpler format that is more likely to never fail to faithfully reproduce
a git revision content. Maybe the git-fast-import format could do it
even better, and maybe not. That's an implementation detail that needs
to be worked out once one is ready to get real with actual coding.
Nicolas
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Continue git clone after interruption
2009-08-21 23:07 ` Sam Vilain
@ 2009-08-22 3:37 ` Nicolas Pitre
2009-08-22 5:50 ` Sam Vilain
0 siblings, 1 reply; 39+ messages in thread
From: Nicolas Pitre @ 2009-08-22 3:37 UTC (permalink / raw)
To: Sam Vilain
Cc: Jakub Narebski, Tomasz Kontusz, git, Johannes Schindelin, Scott Chacon
On Sat, 22 Aug 2009, Sam Vilain wrote:
> On Fri, 2009-08-21 at 17:07 -0400, Nicolas Pitre wrote:
> > > 2. There is support in git pack format to do 'deepening' of shallow
> > > clone, which means that git can generate incrementals in top-down
> > > order, _similar to how objects are ordered in packfile_.
> >
> > Well... the pack format was not meant for that "support". The fact
> > that the typical object order used by pack-objects when serving fetch
> > request is amenable to incremental top-down updates is rather
> > coincidental and not really planned.
>
> Mmm. And the problem with 'thin' packs is that they normally allow
> deltas the other way.
Sure. The pack format is flexible.
> I think the first step here would be to allow thin pack generation to
> accept a bounded range of commits, any of the objects within which may
> be used as delta base candidates. That way, these "top down" thin packs
> can be generated. Currently of course it just uses the --not and makes
> "bottom up" thin packs.
The pack is still almost top-down. It's only the missing delta bases
that point in the other direction, referring to objects you have
locally, which are therefore older.
> > > Another solution would be to try to come up with some sort of stable
> > > sorting of objects so that packfile generated for the same
> > > parameters (endpoints) would be always byte-for-byte the same. But
> > > that might be difficult, or even impossible.
> >
> > And I don't want to commit to that either. Having some flexibility
> > in object ordering makes it possible to improve on the packing
> > heuristics.
>
> You don't have to lose that for storage. It's only for generating the
> thin packs that it matters;
What matters?
> also, the restriction is relaxed when it
> comes to objects which are all being sent in the same pack, which can
> freely delta amongst themselves in any direction.
That's always the case within a pack, but only for REF_DELTA objects.
The OFS_DELTA objects have to be ordered. And yes, having deltas across
packs is disallowed to avoid cycles and to keep the database robust.
The only exception is for thin packs, but those are never created on
disk. Thin packs are only used for transport and quickly "fixed" upon
reception by appending the missing objects to them so they are not
"thin" anymore.
> What did you think about the bundle slicing stuff?
If I didn't comment on it already, then I probably missed it and have no
idea.
Nicolas
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Continue git clone after interruption
2009-08-22 3:37 ` Nicolas Pitre
@ 2009-08-22 5:50 ` Sam Vilain
2009-08-22 8:13 ` Nicolas Pitre
0 siblings, 1 reply; 39+ messages in thread
From: Sam Vilain @ 2009-08-22 5:50 UTC (permalink / raw)
To: Nicolas Pitre
Cc: Jakub Narebski, Tomasz Kontusz, git, Johannes Schindelin, Scott Chacon
On Fri, 2009-08-21 at 23:37 -0400, Nicolas Pitre wrote:
> > What did you think about the bundle slicing stuff?
>
> If I didn't comment on it already, then I probably missed it and have no
> idea.
I really tire of repeating myself for your sole benefit. Please show
some consideration for other people in the conversation by trying to
listen. Thank-you.
> > I think the first step here would be to allow thin pack generation to
> > accept a bounded range of commits, any of the objects within which may
> > be used as delta base candidates. That way, these "top down" thin packs
> > can be generated. Currently of course it just uses the --not and makes
> > "bottom up" thin packs.
>
> The pack is still almost top-down. It's only the missing delta bases
> that point in the other direction, referring to objects you have
> locally, which are therefore older.
Ok, but right now there's no way to specify that you want a thin pack,
where the allowable base objects are *newer* than the commit range you
wish to include.
What I said in my other e-mail, where I showed how well it works to
take a given bundle and slice it into a series of thin packs, was that
it seems to add a bit of extra size to the resultant packs - the best I
got for slicing up the entire git.git run was about 20%. If this can be
reduced to under 10% (say), then sending bundle slices would be quite
reasonable by default, for the benefit of making large fetches
restartable, or even spreadable across multiple mirrors.
The object sorting stuff is something of a distraction; it's required
for download spreading but not for the case at hand now.
Sam
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Continue git clone after interruption
2009-08-22 5:50 ` Sam Vilain
@ 2009-08-22 8:13 ` Nicolas Pitre
2009-08-23 10:37 ` Sam Vilain
0 siblings, 1 reply; 39+ messages in thread
From: Nicolas Pitre @ 2009-08-22 8:13 UTC (permalink / raw)
To: Sam Vilain
Cc: Jakub Narebski, Tomasz Kontusz, git, Johannes Schindelin, Scott Chacon
On Sat, 22 Aug 2009, Sam Vilain wrote:
> On Fri, 2009-08-21 at 23:37 -0400, Nicolas Pitre wrote:
> > > What did you think about the bundle slicing stuff?
> >
> > If I didn't comment on it already, then I probably missed it and have no
> > idea.
>
> I really tire of repeating myself for your sole benefit. Please show
> some consideration for other people in the conversation by trying to
> listen. Thank-you.
I'm sorry, but I have way too many emails to consider reading them all.
This is like ethernet: not a reliable transport, and lost packets mean
you have to retransmit. Cut and paste does wonders, or even a link to
the previous post.
> > > I think the first step here would be to allow thin pack generation to
> > > accept a bounded range of commits, any of the objects within which may
> > > be used as delta base candidates. That way, these "top down" thin packs
> > > can be generated. Currently of course it just uses the --not and makes
> > > "bottom up" thin packs.
> >
> > The pack is still almost top-down. It's only the missing delta bases
> > that point in the other direction, referring to objects you have
> > locally, which are therefore older.
>
> Ok, but right now there's no way to specify that you want a thin pack,
> where the allowable base objects are *newer* than the commit range you
> wish to include.
Sure you can. Try this:
( echo "-$(git rev-parse v1.6.4)"; \
git rev-list --objects v1.6.2..v1.6.3 ) | \
git pack-objects --progress --stdout > foo.pack
That'll give you a thin pack for the _new_ objects that _appeared_
between v1.6.2 and v1.6.3, but which external delta base objects are
found in v1.6.4.
If you want _all_ the objects that are referenced from commits between
v1.6.2 and v1.6.3 then you just have to list them all for v1.6.2 in
addition to the rest:
( echo "-$(git rev-parse v1.6.4)"; \
git rev-list --objects v1.6.2..v1.6.3; \
git ls-tree -t -r v1.6.2 | cut -d' ' -f 3- | tr "\t" " "; ) | \
git pack-objects --progress --stdout > foo.pack
> What I said in my other e-mail where I showed how well it works taking
> a given bundle, and slicing it into a series of thin packs, was that it
> seems to add a bit of extra size to the resultant packs - best I got for
> slicing up the entire git.git run was about 20%. If this can be
> reduced to under 10% (say), then sending bundle slices would be quite
> reasonable by default for the benefit of making large fetches
> restartable, or even spreadable across multiple mirrors.
In theory you could have almost no overhead. That all depends on how
you slice the pack. If you want a pack to contain a fixed number of
commits
(such that all objects introduced by a given commit are all in the same
pack) then you are of course putting a constraint on the possible delta
matches and compression result might be suboptimal. In comparison, with
a single big pack a given blob can delta against a blob from a
completely distant commit in the history graph if that provides a better
compression ratio.
If you slice your pack according to a size threshold, then you might
consider the --max-pack-size= argument to pack-objects. This currently
doesn't produce thin packs, as delta objects whose base ends up stored
in a different pack because of a pack split are simply not stored as
deltas. Only a few lines of code would need to be modified in order to
store those deltas nevertheless and turn those packs into thin packs,
preserving the optimal delta match. Of course cross-pack delta
references have to be REF_DELTA objects, with headers about 16 to 17
bytes larger than those of OFS_DELTA objects, so you will still have
some overhead.
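The --max-pack-size splitting can be observed directly. A toy run (made-up repository; pack-objects enforces a 1 MiB minimum for this option, so incompressible data is used to make sure the limit is actually hit):

```shell
# Splitting pack output with --max-pack-size (toy repository).
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q repo && cd repo
for i in 1 2 3; do
    # ~600 KB of incompressible data per commit, so the total exceeds
    # the 1 MiB split threshold even after zlib compression.
    head -c 600000 /dev/urandom > "blob$i"
    git add "blob$i"
    git -c user.name=demo -c user.email=demo@example.com commit -qm "c$i"
done

# Write the same object set as multiple packs, each at most ~1 MiB.
git rev-list --objects --all |
  git pack-objects --max-pack-size=1m pack >/dev/null

ls pack-*.pack | wc -l    # more than one pack file
```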
> The object sorting stuff is something of a distraction; it's required
> for download spreading but not for the case at hand now.
Well, the idea of spreading small packs has its drawbacks. You still
might need to get a sizeable portion of them to obtain at least one
usable commit. And ideally you want the top commit as a priority, which
pretty much imposes an ordering on the packs you're likely to want
first, unlike with BitTorrent, where you don't care as you normally want
all the blocks anyway.
If the goal is faster downloads, then you could simply make a bundle,
copy it onto multiple servers, and slice your download across those
servers. This has the disadvantage of being static data that
doubles the disk (and cache) usage. That doesn't work too well with
shallow clones though.
If you were envisioning _clients_ à la BitTorrent putting up pack slices
instead, then in that case the slices have to be well defined entities,
like packs containing objects for known range of commits, but then we're
back to the delta inefficiency I mentioned above. And again this might
work only if a lot of people are interested in the same repository at
the same time, and of course most people have no big incentive to "seed"
once they've got their copy. So I'm not sure it would work that well
in practice.
This certainly still looks like a pretty cool project. But not all
cool stuff works well in real conditions, I'm afraid. Just my opinion,
of course.
Nicolas
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Continue git clone after interruption
2009-08-22 8:13 ` Nicolas Pitre
@ 2009-08-23 10:37 ` Sam Vilain
0 siblings, 0 replies; 39+ messages in thread
From: Sam Vilain @ 2009-08-23 10:37 UTC (permalink / raw)
To: Nicolas Pitre
Cc: Jakub Narebski, Tomasz Kontusz, git, Johannes Schindelin, Scott Chacon
On Sat, 2009-08-22 at 04:13 -0400, Nicolas Pitre wrote:
> > Ok, but right now there's no way to specify that you want a thin pack,
> > where the allowable base objects are *newer* than the commit range you
> > wish to include.
>
> Sure you can. Try this:
>
> ( echo "-$(git rev-parse v1.6.4)"; \
> git rev-list --objects v1.6.2..v1.6.3 ) | \
> git pack-objects --progress --stdout > foo.pack
>
> That'll give you a thin pack for the _new_ objects that _appeared_
> between v1.6.2 and v1.6.3, but which external delta base objects are
> found in v1.6.4.
Aha. I guess I had made an assumption about where that '-' lets
pack-objects find deltas that isn't true.
> > What I said in my other e-mail where I showed how well it works taking
> > a given bundle, and slicing it into a series of thin packs, was that it
> > seems to add a bit of extra size to the resultant packs - best I got for
> > slicing up the entire git.git run was about 20%. If this can be
> > reduced to under 10% (say), then sending bundle slices would be quite
> > reasonable by default for the benefit of making large fetches
> > restartable, or even spreadable across multiple mirrors.
>
> In theory you could have about no overhead. That all depends how you
> slice the pack. If you want a pack to contain a fixed number of commits
> (such that all objects introduced by a given commit are all in the same
> pack) then you are of course putting a constraint on the possible delta
> matches and compression result might be suboptimal. In comparison, with
> a single big pack a given blob can delta against a blob from a
> completely distant commit in the history graph if that provides a better
> compression ratio.
[...]
> If you were envisioning _clients_ à la BitTorrent putting up pack slices
> instead, then in that case the slices have to be well defined entities,
> like packs containing objects for known range of commits, but then we're
> back to the delta inefficiency I mentioned above.
I'll do some more experiments to try to quantify this in light of this
new information; I still think that if the overhead is marginal there
are significant wins to this approach.
> And again this might
> work only if a lot of people are interested in the same repository at
> the same time, and of course most people have no big insentive to "seed"
> once they got their copy. So I'm not sure if that might work that well
> in practice.
Throw away terms like "seeding" and replace with "mirroring". Sites
which currently house mirrors could potentially be helping serve git
repos, too. Popular projects could have many mirrors, and at the edges
of the internet git servers could mirror many projects for users in
their country.
Sam
^ permalink raw reply [flat|nested] 39+ messages in thread
end of thread, other threads:[~2009-08-23 10:34 UTC | newest]
Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-08-17 11:42 Continue git clone after interruption Tomasz Kontusz
2009-08-17 12:31 ` Johannes Schindelin
2009-08-17 15:23 ` Shawn O. Pearce
2009-08-18 5:43 ` Matthieu Moy
2009-08-18 6:58 ` Tomasz Kontusz
2009-08-18 17:56 ` Nicolas Pitre
2009-08-18 18:45 ` Jakub Narebski
2009-08-18 20:01 ` Nicolas Pitre
2009-08-18 21:02 ` Jakub Narebski
2009-08-18 21:32 ` Nicolas Pitre
2009-08-19 15:19 ` Jakub Narebski
2009-08-19 19:04 ` Nicolas Pitre
2009-08-19 19:42 ` Jakub Narebski
2009-08-19 21:13 ` Nicolas Pitre
2009-08-20 0:26 ` Sam Vilain
2009-08-20 7:37 ` Jakub Narebski
2009-08-20 7:48 ` Nguyen Thai Ngoc Duy
2009-08-20 8:23 ` Jakub Narebski
2009-08-20 18:41 ` Nicolas Pitre
2009-08-21 10:07 ` Jakub Narebski
2009-08-21 10:26 ` Matthieu Moy
2009-08-21 21:07 ` Nicolas Pitre
2009-08-21 21:41 ` Jakub Narebski
2009-08-22 0:59 ` Nicolas Pitre
2009-08-21 23:07 ` Sam Vilain
2009-08-22 3:37 ` Nicolas Pitre
2009-08-22 5:50 ` Sam Vilain
2009-08-22 8:13 ` Nicolas Pitre
2009-08-23 10:37 ` Sam Vilain
2009-08-20 22:57 ` Sam Vilain
2009-08-18 22:28 ` Johannes Schindelin
2009-08-18 23:40 ` Nicolas Pitre
2009-08-19 7:35 ` Johannes Schindelin
2009-08-19 8:25 ` Nguyen Thai Ngoc Duy
2009-08-19 9:52 ` Johannes Schindelin
2009-08-19 17:21 ` Nicolas Pitre
2009-08-19 22:23 ` René Scharfe
2009-08-19 4:42 ` Sitaram Chamarty
2009-08-19 9:53 ` Jakub Narebski