* Resumable clone/Gittorrent (again)
@ 2011-01-05 16:23 Nguyen Thai Ngoc Duy
  2011-01-05 16:56 ` Luke Kenneth Casson Leighton
                   ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2011-01-05 16:23 UTC (permalink / raw)
  To: Git Mailing List; +Cc: Nicolas Pitre, Luke Kenneth Casson Leighton

Hi,

I've been analyzing the BitTorrent protocol and have come up with this. The
last idea about a similar thing [1], gittorrent, was given by Nicolas.
This keeps close to that idea (i.e. the transfer protocol must be built around
git objects, not file chunks) with a slight difference.

The idea is to transfer a chain of objects (trees or blobs), consisting of a
base object and a delta chain. Objects are chained according to the
worktree layout, e.g. all objects of path/to/any/blob form one
chain, from a commit tip down to the root commits. Chains can have
gaps, and don't need to start from the commit tip. The transfer is
resumable because if a delta chain is corrupt at some point, we can
just request another chain from where it stops. The base object is
obviously resumable.
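
For illustration only (not part of the proposal), here is roughly how such a
per-path chain could be enumerated with stock git commands, assuming the
script runs inside a clone with "git" on the PATH:

    import subprocess

    def path_chain(tip, path):
        """Return [(commit, object id at `path`)] pairs, newest first."""
        commits = subprocess.check_output(
            ["git", "rev-list", tip, "--", path], text=True).split()
        chain = []
        for c in commits:
            # "git rev-parse <commit>:<path>" resolves the blob (or tree) at
            # that path, or fails where the path is absent (a gap in the chain).
            r = subprocess.run(["git", "rev-parse", f"{c}:{path}"],
                               capture_output=True, text=True)
            if r.returncode == 0:
                chain.append((c, r.stdout.strip()))
        return chain

    # e.g. path_chain("HEAD", "Documentation/git.txt")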

We start by fetching all commit contents reachable from a commit tip.
This is a chain, therefore resumable. From there each commit can be
examined. Missing trees and blobs will be fetched as chains. Every time
a delta is received, we can recreate the new object and verify it (we
should already have its SHA-1 from its parent trees/commits).
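
As a sketch of that verification step: a git object ID is just the SHA-1 of a
"<type> <size>\0" header followed by the raw content, so a reconstructed
object can be checked against the ID already known from its parent tree or
commit, e.g.:

    import hashlib

    def git_object_id(obj_type: str, content: bytes) -> str:
        # Object ID = SHA-1 of "<type> <size>\0" + content (e.g. "blob 12\0...").
        header = f"{obj_type} {len(content)}\0".encode()
        return hashlib.sha1(header + content).hexdigest()

    def verify_reconstructed(obj_type, content, expected_id):
        return git_object_id(obj_type, content) == expected_id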

These chains are quite independent, in the sense that a blob
chain is independent of any other blob chain (though it requires the tree
chains, of course), so we can fetch as many as we want in parallel once
we're done with the commit chain.

The last thing I like about these chains is that their number
is reasonable. It won't increase too fast over time (compared to
the number of commits). As such it maps well to BitTorrent's "pieces".
When a new gittorrent client comes in, a running client can advertise what
chains it has with a bitmap (*).

So it all looks good to me. It is resumable and verifiable. It can be
fetched in parallel from many servers. It maps pretty well to
BitTorrent (which means we can reuse BitTorrent's design). All transfers
should be compressed so the amount of data transferred is also acceptable (not
as optimized as upload-pack, but hopefully the overhead is low). A minor
bonus is that the latest commit (in full) will be available as soon as
possible.

One thing about the reachability test: in order to avoid running this test
expensively every time a chain is requested, the requested chain
must be given in extended SHA-1 form, starting from a commit tip (e.g.
$SHA1~12^2~34...). Parsing and following that syntax is cheaper, I
hope.
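
For example (illustrative only), a server could resolve such a spec with a
single rev-parse call rather than a full reachability walk:

    import subprocess

    def resolve_chain_spec(spec):
        # e.g. spec = "<advertised-tip-sha1>~12^2~34"; fails if it cannot
        # be followed from that tip, which doubles as the reachability check.
        return subprocess.check_output(
            ["git", "rev-parse", "--verify", spec], text=True).strip()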

Comments?

[1] http://article.gmane.org/gmane.comp.version-control.git/155222

(*) BitTorrent stores the list of pieces in the .torrent file. The bitmap
reflects what pieces a client has, so other clients can ask it for those
pieces. If we follow the same approach, we can record a list of all
possible directories/files in a repository in a .gittorrent file; then
gittorrent clients can advertise with bitmaps too, perhaps a two-bit map
(not fetched, fetching, fully fetched).
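
A minimal sketch of such a two-bit map (the encoding and the state values are
my own assumptions, not a fixed format):

    NOT_FETCHED, FETCHING, FETCHED = 0, 1, 2

    def encode_states(states):
        # Pack one 2-bit state per path/chain, four states per byte.
        out = bytearray((len(states) + 3) // 4)
        for i, s in enumerate(states):
            out[i // 4] |= (s & 0b11) << ((i % 4) * 2)
        return bytes(out)

    def decode_states(data, count):
        return [(data[i // 4] >> ((i % 4) * 2)) & 0b11 for i in range(count)]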
-- 
Duy


* Re: Resumable clone/Gittorrent (again)
  2011-01-05 16:23 Resumable clone/Gittorrent (again) Nguyen Thai Ngoc Duy
@ 2011-01-05 16:56 ` Luke Kenneth Casson Leighton
  2011-01-05 17:13   ` Thomas Rast
  2011-01-05 23:28 ` Maaartin
  2011-01-07  3:21 ` Nicolas Pitre
  2 siblings, 1 reply; 25+ messages in thread
From: Luke Kenneth Casson Leighton @ 2011-01-05 16:56 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy; +Cc: Git Mailing List, Nicolas Pitre

On Wed, Jan 5, 2011 at 4:23 PM, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:
> Hi,
>
> I've been analyzing bittorrent protocol and come up with this. The
> last idea about a similar thing [1], gittorrent, was given by Nicolas.
> This keeps close to that idea (i.e the transfer protocol must be around git
> objects, not file chunks) with a bit difference.

> So it all looks good to me. It is resumable and verifiable. It can be
> fetched in parallel from many servers. It maps pretty good to
> BitTorrent (which means we can reuse BitTorrent design). All transfer
> should be compressed so the amount of transfer is also acceptable (not
> as optimized as upload-pack, but hopefully overhead is low). A minor
> point is latest commit (in full) will be available as soon as
> possible.

 ok.  what i wasn't aware of, about the bittorrent protocol, was that
multiple files, when placed into the same .torrent, are just
concatenated as one monolithic data block.  the "chunking" is just
then slapped on top of that.  the end result is that it's possible
that, if you want to get one specific file (or, in this case "object"
whether it be commit, blob or tree) then you *might* end up getting
two more "chunks" than you actually need, which, worst case could end
up being 2.999x the actual data really required.

 _this_ is the reason why people such as cameron dale criticised
bittorrent as a hierarchical / multi-file delivery mechanism (and
abandoned it in favour of apt-p2p), and i didn't understand this at
the time enough to be able to point out that i'd assumed they knew
what i was thinking :)

 what i was thinking was "duhh!" don't slap multiple files into a
single .torrent, put each file (or, in this case "object" whether it
be commit, blob, tree or other) into a separate torrent!  that's all -
problem goes away!

 now that of course leaves you with the problem that you now have
potentially hundreds if not thousands or tens of thousands of
.torrents to deal with, publish, find etc. etc.   and the solution to
_that_ is to give the name of the .torrent file something
meaningful.... like.... ooo, how about... the object's md5 sum? :)

 so _that_ problem's solved, which leaves just one more problem: how
to find such a ridiculously large number of objects in the first
place, and, surprise-surprise, there's a perfect solution to that, as
well, called DHTs.  and, surprise-surprise, what do you need as the
DHT key?  something like a 128-bit key?   ooo, how about ... the
object's md5 sum? that's 128-bit, that'll do :)

 but, better than that: there happens to have been announced very
recently an upgraded version of a bittorrent client, which claims to
be fantastic as it's no longer dependent on "internet search" sites,
because surprise-surprise, it uses peer-to-peer DHT to do the search
queries.

 so not only is there a solution to the problems but also there's even
a suitable codebase to work from in order to create a working
prototype.

 now i just have to find the damn thing... ah yes, it's called Tribler.

 l.


* Re: Resumable clone/Gittorrent (again)
  2011-01-05 16:56 ` Luke Kenneth Casson Leighton
@ 2011-01-05 17:13   ` Thomas Rast
  2011-01-05 18:07     ` Luke Kenneth Casson Leighton
  0 siblings, 1 reply; 25+ messages in thread
From: Thomas Rast @ 2011-01-05 17:13 UTC (permalink / raw)
  To: Luke Kenneth Casson Leighton
  Cc: Nguyen Thai Ngoc Duy, Git Mailing List, Nicolas Pitre

Luke Kenneth Casson Leighton wrote:
>  now that of course leaves you with the problem that you now have
> potentially hundreds if not thousands or tens of thousands of
> .torrents to deal with, publish, find etc. etc.

Umm, I'm counting 202400 objects in my git.git and 1799525 in a clone
of linux-2.6.git.  So I'm not sure how far you want to split things
into single transfers, but going all the way down to objects will
massively hurt performance.

-- 
Thomas Rast
trast@{inf,student}.ethz.ch


* Re: Resumable clone/Gittorrent (again)
  2011-01-05 17:13   ` Thomas Rast
@ 2011-01-05 18:07     ` Luke Kenneth Casson Leighton
  2011-01-06  1:47       ` Nguyen Thai Ngoc Duy
  0 siblings, 1 reply; 25+ messages in thread
From: Luke Kenneth Casson Leighton @ 2011-01-05 18:07 UTC (permalink / raw)
  To: Thomas Rast; +Cc: Nguyen Thai Ngoc Duy, Git Mailing List, Nicolas Pitre

On Wed, Jan 5, 2011 at 5:13 PM, Thomas Rast <trast@student.ethz.ch> wrote:
> Luke Kenneth Casson Leighton wrote:
>>  now that of course leaves you with the problem that you now have
>> potentially hundreds if not thousands or tens of thousands of
>> .torrents to deal with, publish, find etc. etc.
>
> Umm, I'm counting 202400 objects in my git.git and 1799525 in a clone
> of linux-2.6.git.  So I'm not sure how far you want to split things
> into single transfers, but going all the way down to objects will
> massively hurt performance.

 yeah... this is a key reason why i came up with a protocol which
transferred the exact same pack-objects that HTTP and all the other
"point-to-point" git protocols use, to such good effect.

the problem was that i was going to rely on multiple clients being
able to generate the exact same pack-object, given the exact same
input, and then share that pack-object.  ok, that's not the problem,
that was just the plan :)

 nicolas kindly pointed out, at some length, that in a distributed
environment, however, that plan was naive, because whenever you
request a pack-object for use e.g. normally with HTTP or other git
point-to-point protocol, it's generated there-and-then using
heuristics and multi-threading that pretty much guarantees that even
if you were to make the exact same request of exactly the same system,
you'd get *different* pack-objects!  not to mention the fact that
different people have the same git objects stored in *different* ways
because the object stores, despite having the same commits in them,
were pulled at different times and end up with a completely different
set of git objects that represent those exact same commits that
everyone else has.

that's all a bit wordy, but you get the idea.

 so, nicolas recommended a "simpler" approach, which, well, apologies
nicolas but i didn't really like it - it seemed far too simplistic and
i'm not really one for spending time doing these kinds of
"intermediate baby steps" (wrong choice of words, no offense implied,
but i'm sure you know what i mean).  i much prefer to just hit all the
issues head-on, right from the start :)


so, in the intervening time since this was last discussed i've given
the pack-objects-distributing idea some thought (and NO, nicolas, just
to clarify, this is NOT grabbing the git packed objects that are
actually in the .git/objects store, so NO, this does NOT end up
bypassing security by giving people objects that are from another
branch, it really IS getting that lovely varying data which is
heuristic, store and threadnum dependent!).

 the plan is to turn that variation in the git pack-objects responses,
across multiple peers, into an *advantage* not a liability.  how?
like this:

 * a client requiring objects from commit abcd0123 up to commit
efga3456 sends out a DHT broadcast query to all and sundry who have
commits abcd0123 and everything in between up to efga3456.

 * those clients that can be bothered to respond, do so [refinements below]

 * the requestor selects a few of them, and asks them to create git
pack-objects.  this takes time, but that's ok.  once created, the size
of the git pack-object is sent as part of the acknowledgement.

 * the requestor, on receipt of all the sizes, selects the *smallest*
one to begin the p2p (.torrent) from (by asking the remote client to
create a .torrent specifically for that purpose, with the filename
abcd0123-efga3456).

 in this way you end up with not only an efficient git pack-object but
you get, to 99.5% certainty *THE* most efficient git pack-object.
distributed computing at its best :)
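
A toy sketch of that selection step (the peer objects and their
prepare_pack/make_torrent calls are hypothetical placeholders, not any real
API):

    def pick_seed(peers, commit_range, k=3):
        # Ask up to k peers to prepare a pack for the range; each reports the
        # resulting pack size.  Start the torrent from the smallest offer.
        offers = [(peer.prepare_pack(commit_range), peer) for peer in peers[:k]]
        size, best = min(offers, key=lambda o: o[0])
        return best.make_torrent(commit_range), size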

 now, an immediately obvious refinement of this is that those .torrent
(pack-objects) "stick around", in a cache (with a hard limit defined
on the cache size of course).  and so, when the client that requires a
pack-object makes the request, of course, those remote clients that
*already* have that cached pack-object for that specific commit-range
should be given first priority, to avoid other clients from having to
make massive amounts of git pack-objects.

 a further refinement is of course to collect statistics on the number
of peers doing downloads at the time, prioritising those pack-objects
which are most being distributed at the time.  this has fairly obvious
benefits :)

 yet *another* refinement is slightly less obvious, and it's this:
there *COULD* happen to be some existing pack-objects in the cache,
not of commit abcd0123-efga3456 but in a ready-made "chain": commits
abcd01234-beef7890 packed already and in the cache, and commits
beef7890-efga3456 likewise packed already and in the cache.  again:
the requestor should be informed of these, and make their mind up as
to what to do.

 it gets rather more complex when you have *part* of the chain already
pre-cached (and have to work out err, i got this bit and this bit, but
i'd have to generate a git pack-object for the bit in the middle, i'll
inform the requestor of this, they can make up their mind what to do),
but again i do not imagine for one second that this would be anything
more than an intriguing coding challenge and, importantly, an
optimisation challenge for gittorrent version 3.0 somewhere down the
line, rather than an all-out absolute requirement that it must, must
be done now, now, now.

 what else can i mention, that occurred to me... yeah - abandoning of
a download.  if, for some reason it becomes blindingly obvious that
the p2p transfer just isn't working out, then the requestor simply
stops the process and starts again.  a refinement of this, which is a
bit cheeky i know, is to keep *two* simultaneous requests and
downloads for the *exact* same git pack-object commit-chain but with
different data from different groups of peers, for a short period of
time, and then abandon one of them once it's clear which one is best.
this does seem a bit cheeky, but it has the advantage that if the one
that _was_ fastest goes tits-up, you can at least go back to the
previous one and, assuming that the cache hasn't been cleared, just
join in again.  but this is _really_ something that's wayyy down the
line, for gittorrent version 4.0 or 5.0 or so.

so, can you see that a) this is a far cry from the "simplistic
transfer of blobs and trees" b) it's *not* going to overload peoples'
systems by splattering (eek!) millions of md5 sums across the internet
as bittorrent files c) it _does_ fit neatly into the bittorrent
protocol d) it combines the best of git with the best of p2p
distributed networking principles...

... all of which creates a system which people will _still_ say is a
"hammer looking for nails" :)

... right up until the point where some idiot in the USA government
decides to seize sourceforge.net, github.com, gitorious.org and
savannah.gnu.org because they contain source code of software that
MIGHT be used for copyright infringement.  whilst i realise that the
only one of those that might be missed is sourceforget, you cannot
ignore the fact that the trust placed in governments and large
corporations to run the internet infrastructure is now completely
gone, and that the USA and other countries are now putting in place
hypocritical policies that put them into the same category that used
to be reserved for China, Saudi Arabia, Iran and other regimes accused
of being "Totalitarian".

 thoughts, anyone?  (other than on the last paragraph, please, if that's ok).

l.


* Re: Resumable clone/Gittorrent (again)
  2011-01-05 16:23 Resumable clone/Gittorrent (again) Nguyen Thai Ngoc Duy
  2011-01-05 16:56 ` Luke Kenneth Casson Leighton
@ 2011-01-05 23:28 ` Maaartin
  2011-01-06  1:32   ` Nguyen Thai Ngoc Duy
  2011-01-07  3:21 ` Nicolas Pitre
  2 siblings, 1 reply; 25+ messages in thread
From: Maaartin @ 2011-01-05 23:28 UTC (permalink / raw)
  To: git

Nguyen Thai Ngoc Duy <pclouds <at> gmail.com> writes:

> I've been analyzing bittorrent protocol and come up with this. The
> last idea about a similar thing [1], gittorrent, was given by Nicolas.
> This keeps close to that idea (i.e the transfer protocol must be around git
> objects, not file chunks) with a bit difference.
>
> The idea is to transfer a chain of objects (trees or blobs), including
> base object and delta chain. Objects are chained in according to
> worktree layout, e.g. all objects of path/to/any/blob will form a
> chain, from a commit tip down to the root commits. Chains can have
> gaps, and don't need to start from commit tip. The transfer is
> resumable because if a delta chain is corrupt at some point, we can
> just request another chain from where it stops. Base object is
> obviously resumable.

I may be talking nonsense, please bear with me.

I'm not sure if it works well, since chains defined this way change over time.
I may request commits A and B while declaring that I possess commits C and D. One
server may be ahead of A, so should it send me more data or repack the chain so
that the non-requested versions get excluded? At the same time the server may
be missing B and possess only some of its ancestors. Should it send me only a
part of the chain, or should I rather ask a different server?

Moreover, in case a directory gets renamed, the content may get transferred
needlessly. This is probably no big problem.

I haven't read the whole other thread yet, but what about going the other way
round? Use a single commit as a chain, and create deltas assuming that all
ancestors are already available. The packs may arrive out of order, so the
decompression may have to wait. The number of commits may be one order of
magnitude larger than the number of paths (there are currently 2254 paths
and 24235 commits in git.git), so grouping consecutive commits into one larger
pack may be useful.

The advantage is that the packs stay stable over time; you may create them
using the most aggressive and time-consuming settings and store them forever.
You could create packs for single commits, packs for non-overlapping
consecutive pairs of them, for non-overlapping pairs of pairs, etc. I mean, with
commits numbered 0, 1, 2, ... create packs [0,1], [2,3], ..., [0,3], [4,7],
etc. The reason for this is obviously to allow reading groups of commits from
different servers so that they fit together (similar to buddy memory
allocation). Of course, there are things like branches bringing chaos into this
simple scheme, but I'm sure this can be solved somehow.
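
A small sketch of that grouping (my own illustration of the numbering above),
where level 1 pairs commits, level 2 pairs the pairs, and so on:

    def buddy_intervals(n_commits, max_level):
        # With commits numbered 0..n-1, emit [start, end] ranges aligned on
        # power-of-two boundaries: [0,1], [2,3], ..., then [0,3], [4,7], ...
        intervals = []
        for level in range(1, max_level + 1):
            size = 1 << level
            for start in range(0, n_commits - size + 1, size):
                intervals.append((start, start + size - 1))
        return intervals

    # buddy_intervals(8, 3) -> [(0,1), (2,3), (4,5), (6,7), (0,3), (4,7), (0,7)]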

Another problem is the client requesting commits A and B while declaring that it
possesses commits C and D. When both C and D are ancestors of either A or B, you
can ignore it (as you assume this while packing, anyway). The other case is
less probable, unless e.g. C is master and A is a development branch.
Currently, I've no idea how to optimize this or whether it could be
important.

I see no disadvantage when compared to path-based chains, but am probably 
overlooking something obvious.


* Re: Resumable clone/Gittorrent (again)
  2011-01-05 23:28 ` Maaartin
@ 2011-01-06  1:32   ` Nguyen Thai Ngoc Duy
  2011-01-06  3:34     ` Maaartin-1
  0 siblings, 1 reply; 25+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2011-01-06  1:32 UTC (permalink / raw)
  To: Maaartin; +Cc: git

On Thu, Jan 6, 2011 at 6:28 AM, Maaartin <grajcar1@seznam.cz> wrote:
> Nguyen Thai Ngoc Duy <pclouds <at> gmail.com> writes:
>
>> I've been analyzing bittorrent protocol and come up with this. The
>> last idea about a similar thing [1], gittorrent, was given by Nicolas.
>> This keeps close to that idea (i.e the transfer protocol must be around git
>> objects, not file chunks) with a bit difference.
>>
>> The idea is to transfer a chain of objects (trees or blobs), including
>> base object and delta chain. Objects are chained in according to
>> worktree layout, e.g. all objects of path/to/any/blob will form a
>> chain, from a commit tip down to the root commits. Chains can have
>> gaps, and don't need to start from commit tip. The transfer is
>> resumable because if a delta chain is corrupt at some point, we can
>> just request another chain from where it stops. Base object is
>> obviously resumable.
>
> I may be talking nonsense, please bare with me.
>
> I'm not sure if it works well, since chains defined this way change over time.
> I may request commits A and B while declaring to possess commits C and D. One
> server may be ahead of A, so should it send me more data or repack the chain so
> that the non-requested versions get excluded? At the same time the server may
> be missing B and posses only some ancestors of it. Should it send me only a
> part of the chain or should I better ask a different server?

I'll keep it simple. A chain is defined by one commit head. Such a
chain can't change over time. But you can ask for just part of the
chain; rev-list syntax can be used here. For example, if you already
have commits C and D and 10 deltas in the chain (linear history for
simplicity here), requesting "give me A~10 ^C ^D" should give the required
commits.
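
Illustrated with real rev-list syntax (the names A, C, D stand for actual
SHA-1s):

    import subprocess

    def chain_part(tip_spec, have):
        # "git rev-list A~10 ^C ^D" lists everything reachable from A~10
        # that is not reachable from C or D -- the part of the chain we lack.
        args = ["git", "rev-list", tip_spec] + ["^" + h for h in have]
        return subprocess.check_output(args, text=True).split()

    # chain_part("A~10", ["C", "D"])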

> Moreover, in case a directory gets renamed, the content may get transfered
> needlessly. This is probably no big problem.

Yes, the chain constraint can backfire in these cases. We can mix
standard upload-pack/fetch-pack with this if the server can recognize
such cases, by cutting the commit history into chunks. The chunks around
directory renames can then be fetched with git-fetch.

> I haven't read the whole other thread yet, but what about going the other way
> round? Use a single commit as a chain, create deltas assuming that all
> ancestors are already available. The packs may arrive out of order, so the
> decompression may have to wait. The number of commits may be one order of
> magnitude larger than the the number of paths (there are currently 2254 paths
> and 24235 commits in git.git), so grouping consequent commits into one larger
> pack may be useful.

The number of commits can increase fast. I'd rather have a
small/stable number over time. Also, commits depend on other commits, so
you can't verify a commit until you have all of its parents. The same
applies to a file, but a file chain does not interfere with other
file chains.

> The advantage is that the packs stays stable over time, you may create them
> using the most aggressive and time-consuming settings and store them forever.
> You could create packs for single commits, packs for non-overlapping
> consecutive pairs of them, for non-overlapping pairs of pairs, etc. I mean with
> commits numbered 0, 1, 2, ... create packs [0,1], [2,3], ..., [0,3], [4,7],
> etc. The reason for this is obviously to allow reading groups of commits from
> different servers so that they fit together (similar to Buddy memory
> allocation). Of course, there are things like branches bringing chaos in this
> simple scheme, but I'm sure this can be solved somehow.

Pack encoding can change. And packs can contain objects you don't want
to share (i.e. hidden from public view).

> Another problem is the client requesting commits A and B while declaring to
> possess commits C and D. When both C and D are ancestors of either A or B, you
> can ignore it (as you assume this while packing, anyway). The other case is
> less probable, unless e.g. C is the master and A is a developing branch.
> Currently. I've no idea how to optimize this and whether this could be
> important.

As I said, we can request just part of a chain (from A+B down to C+D).
git-fetch should be used if the repo is already quite up to date, though; it's
just more efficient.
-- 
Duy


* Re: Resumable clone/Gittorrent (again)
  2011-01-05 18:07     ` Luke Kenneth Casson Leighton
@ 2011-01-06  1:47       ` Nguyen Thai Ngoc Duy
  2011-01-06 17:50         ` Luke Kenneth Casson Leighton
  0 siblings, 1 reply; 25+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2011-01-06  1:47 UTC (permalink / raw)
  To: Luke Kenneth Casson Leighton; +Cc: Thomas Rast, Git Mailing List, Nicolas Pitre

On Thu, Jan 6, 2011 at 1:07 AM, Luke Kenneth Casson Leighton
<luke.leighton@gmail.com> wrote:
>  the plan is to turn that variation in the git pack-objects responses,
> across multiple peers, into an *advantage* not a liability.  how?
> like this:
>
>  * a client requiring objects from commit abcd0123 up to commit
> efga3456 sends out a DHT broadcast query to all and sundry who have
> commits abcd0123 and everything in between up to efga3456.
>
>  * those clients that can be bothered to respond, do so [refinements below]
>
>  * the requestor selects a few of them, and asks them to create git
> pack-objects.  this takes time, but that's ok.  once created, the size
> of the git pack-object is sent as part of the acknowledgement.
>
>  * the requestor, on receipt of all the sizes, selects the *smallest*
> one to begin the p2p (.torrent) from (by asking the remote client to
> create a .torrent specifically for that purpose, with the filename
> abcd0123-ebga3456).

That defeats the purpose of distributing. You are putting pressure on
certain peers.

>  now, an immediately obvious refinement of this is that those .torrent
> (pack-objects) "stick around", in a cache (with a hard limit defined
> on the cache size of course).  and so, when the client that requires a
> pack-object makes the request, of course, those remote clients that
> *already* have that cached pack-object for that specific commit-range
> should be given first priority, to avoid other clients from having to
> make massive amounts of git pack-objects.

Caches have their limits too. Suppose I half-fetch a pack, then stop and
go wild for a month. The next month I restart the fetch, and the pack may
no longer be in the cache. A new pack may or may not be identical to the old
one.

Also if you go with packs, you are tied to the peer that generates
that pack. Two different peers can, in theory, generate different
packs (in encoding) for the same input.

Another thing with packs (ok, not exactly with packs) is how you
verify that you have got what you asked for. BitTorrent can verify every
piece a peer receives because the SHA-1 sums of those pieces are recorded
in the .torrent file. We have SHA-1 all over the place, but if you don't
have the base objects to undeltify against, you can't use those SHA-1s to verify.
Verification is an important step before you advertise to other peers
"I have these".

> so, can you see that a) this is a far cry from the "simplistic
> transfer of blobs and trees" b) it's *not* going to overload peoples'
> systems by splattering (eek!) millions of md5 sums across the internet
> as bittorrent files c) it _does_ fit neatly into the bittorrent
> protocol d) it combines the best of git with the best of p2p
> distributed networking principles...

How can you advertise what you have to another peer?
-- 
Duy


* Re: Resumable clone/Gittorrent (again)
  2011-01-06  1:32   ` Nguyen Thai Ngoc Duy
@ 2011-01-06  3:34     ` Maaartin-1
  2011-01-06  6:36       ` Nguyen Thai Ngoc Duy
  0 siblings, 1 reply; 25+ messages in thread
From: Maaartin-1 @ 2011-01-06  3:34 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy; +Cc: git

On 11-01-06 02:32, Nguyen Thai Ngoc Duy wrote:
> On Thu, Jan 6, 2011 at 6:28 AM, Maaartin <grajcar1@seznam.cz> wrote:
>> Nguyen Thai Ngoc Duy <pclouds <at> gmail.com> writes:

>> I haven't read the whole other thread yet, but what about going the other way
>> round? Use a single commit as a chain, create deltas assuming that all
>> ancestors are already available. The packs may arrive out of order, so the
>> decompression may have to wait. The number of commits may be one order of
>> magnitude larger than the the number of paths (there are currently 2254 paths
>> and 24235 commits in git.git), so grouping consequent commits into one larger
>> pack may be useful.
> 
> The number of commits can increase fast. I'd rather have a
> small/stable number over time.

In theory, I could create many commits per second. I could create many
unique paths per second, too. But I don't think that really happens. I
know of no larger repository than git.git and I don't want to download one
just to see how many commits, paths, and objects it contains, but I'd
suppose it's less than one million commits, which should be manageable,
especially when commits get grouped together as I described below.

> And commits depend on other commits so
> you can't verify a commit until you have got all of its parents. That
> does apply to file, but then this file chain does not interfere other
> file chains.

That's true, but the verification is something done locally on the
client; it consumes no network traffic and no server resources, so I
consider it cheap. I need less than half a minute (using only a
single core) to verify the whole git.git repository (36 MB). This
is no problem, even if it had to wait until the download finishes. I'm
sure the OP of [1] would be happy if he could wait for this.

>> The advantage is that the packs stays stable over time, you may create them
>> using the most aggressive and time-consuming settings and store them forever.
>> You could create packs for single commits, packs for non-overlapping
>> consecutive pairs of them, for non-overlapping pairs of pairs, etc. I mean with
>> commits numbered 0, 1, 2, ... create packs [0,1], [2,3], ..., [0,3], [4,7],
>> etc. The reason for this is obviously to allow reading groups of commits from
>> different servers so that they fit together (similar to Buddy memory
>> allocation). Of course, there are things like branches bringing chaos in this
>> simple scheme, but I'm sure this can be solved somehow.
> 
> Pack encoding can change.

I see I didn't explain it clearly enough (or am missing something
completely). I know why the packs normally used by git can't be used for
this purpose. Let me retry: let's assume there's a commit chain
A-B-C-D-E-F-..., the client already has commit B and requests commit F.
It may send requests to up to 4 servers, asking for C, D, E, and F,
respectively. The server being asked for E _creates_ a pack containing
all the information needed to create E given _all of_ A, B, C, D. As
the base for any blob/whatever in E it may choose any blob contained in any
of these commits. Of course, it may also choose a blob already packed in
this pack. It may not choose any other blob, so any client having all
ancestors of E can use the pack. Different server and/or program
versions may create different packs for E, but all of them are
_interchangeable_. Because of this, it makes sense to _store_ them for
future reuse.

Compared to the way git packing normally works, this is a restriction,
but I don't think it leads to significantly worse compression. You guys
working on git can confirm or disprove it.

> And packs can contain objects you don't want
> to share (i.e. hidden from public view).

This pack would contain only commit E. I also described pairing, intended
for greater efficiency. In this case a server creates a pack that allows,
e.g., creating commits E and F given all their ancestors (while another
server creates a pack for C and D). This way the number of packs needed
may be a fraction of the total number of commits requested.

>> Another problem is the client requesting commits A and B while declaring to
>> possess commits C and D. When both C and D are ancestors of either A or B, you
>> can ignore it (as you assume this while packing, anyway). The other case is
>> less probable, unless e.g. C is the master and A is a developing branch.
>> Currently. I've no idea how to optimize this and whether this could be
>> important.
> 
> As I said, we can request just part of a chain (from A+B to C+D).
> git-fetch should be used if the repo is quite uptodate though. It's
> just more efficient.

[1] http://article.gmane.org/gmane.comp.version-control.git/164564


* Re: Resumable clone/Gittorrent (again)
  2011-01-06  3:34     ` Maaartin-1
@ 2011-01-06  6:36       ` Nguyen Thai Ngoc Duy
  2011-01-08  1:04         ` Maaartin-1
  0 siblings, 1 reply; 25+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2011-01-06  6:36 UTC (permalink / raw)
  To: Maaartin-1; +Cc: git

On Thu, Jan 6, 2011 at 10:34 AM, Maaartin-1 <grajcar1@seznam.cz> wrote:
> In theory, I could create many commits per seconds. I could create many
> unique paths per seconds, too. But I don't think it really happens. I do
> know no larger repository than git.git and I don't want to download it
> just to see how many commits, paths, and object it contains, but I'd
> suppose it's less than one million commits, which should be manageable,
> especially when commits get grouped together as I described below.

In practice, commits are created every day in an active project. Paths,
on the other hand, are added less often (perhaps except in webkit).

I've got some numbers:

 - wine.git has 72k commits, 260k trees, 200k blobs, 12k paths
 - git.git has 24k commits, 39k trees, 24k blobs, 2.7k paths
 - linux-2.6.git has 160k commits, 760k trees, 442k blobs, 46k paths

Large repos are more interesting because small ones can be cloned with
git-clone.

Listing all those commits in linux-2.6.git takes 160k*20 = 3M (I suppose
compression is useless because SHA-1s are random). A compressed listing
of those 46k paths takes 200k.

>> And commits depend on other commits so
>> you can't verify a commit until you have got all of its parents. That
>> does apply to file, but then this file chain does not interfere other
>> file chains.
>
> That's true, but the verification is something done locally on the
> client, it consumes no network traffic and no server resources, so I
> consider it to be cheap. I need less than half a minute (using only a
> single core) for verifying of the whole git.git repository (36 MB). This
> is no problem, even when it had to wait until the download finishes. I'm
> sure, the OP of [1] would be happy if he could wait for this.

The point is that you need to fetch a commit's parents first in order to
verify that commit. Fetching a whole commit is more expensive than a
file. So while you can fetch a few commit bases and request packs
from those bases in parallel, the cost of the initial commit bases will be
high.

> I see I didn't explain it clear enough (or am missing something
> completely). I know why the packs normally used by git can't be used for
> this purpose. Let me retry: Let's assume there's a commit chain
> A-B-C-D-E-F-..., the client has already commit B and requests commit F.
> It may send requests to up to 4 servers, asking for C, D, E, and F,
> respectively. The server being asked for E _creates_ a pack containing
> all the information needed to create E given _all of_ A, B, C, D. As
> base for any blob/whatever in E it may choose any blob contained in any
> of these commits. Of course, it may also choose a blob already packed in
> this pack. It may not choose any other blob, so any client having all
> ancestors of E can use the pack. Different server and/or program
> versions may create different packs for E, but all of them are
> _interchangeable_. Because of this, it makes sense to _store_ it for
> future reuse.

They are interchangeable as a whole, yes. But you cannot fetch half
the pack from server A and the other half from server B. You can try
to recover as many deltas as possible in a broken pack, but how do you
request a server to send the rest of the pack to you?
-- 
Duy


* Re: Resumable clone/Gittorrent (again)
  2011-01-06  1:47       ` Nguyen Thai Ngoc Duy
@ 2011-01-06 17:50         ` Luke Kenneth Casson Leighton
  0 siblings, 0 replies; 25+ messages in thread
From: Luke Kenneth Casson Leighton @ 2011-01-06 17:50 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy; +Cc: Thomas Rast, Git Mailing List, Nicolas Pitre

On Thu, Jan 6, 2011 at 1:47 AM, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:
> On Thu, Jan 6, 2011 at 1:07 AM, Luke Kenneth Casson Leighton
> <luke.leighton@gmail.com> wrote:
>>  the plan is to turn that variation in the git pack-objects responses,
>> across multiple peers, into an *advantage* not a liability.  how?
>> like this:
>>
>>  * a client requiring objects from commit abcd0123 up to commit
>> efga3456 sends out a DHT broadcast query to all and sundry who have
>> commits abcd0123 and everything in between up to efga3456.
>>
>>  * those clients that can be bothered to respond, do so [refinements below]
>>
>>  * the requestor selects a few of them, and asks them to create git
>> pack-objects.  this takes time, but that's ok.  once created, the size
>> of the git pack-object is sent as part of the acknowledgement.
>>
>>  * the requestor, on receipt of all the sizes, selects the *smallest*
>> one to begin the p2p (.torrent) from (by asking the remote client to
>> create a .torrent specifically for that purpose, with the filename
>> abcd0123-ebga3456).
>
> That defeats the purpose of distributing. You are putting pressure on
> certain peers.

 that's unavoidable, but it's not actually as bad as it seems.  think
about it.  normally, "pressure" is put onto a git server, by forcing
that server to perform multiple "git pack-object" calculations,
repeatedly, for each and every "git pull".

 so, the principle behind this RFC (is it an RFC? yes, kinda...) is
that a) you cache those git pack-objects, thus avoiding heavy CPU
usage b) you make the requests to _many_ peers that you'll likely find
already are in the process of distributing that particular
commit-range _anyway_ so will _definitely_ have it  ... etc. etc.

 so there's a ton of reasons why it's quite a big improvement over the
present star-network arrangement.

>
>>  now, an immediately obvious refinement of this is that those .torrent
>> (pack-objects) "stick around", in a cache (with a hard limit defined
>> on the cache size of course).  and so, when the client that requires a
>> pack-object makes the request, of course, those remote clients that
>> *already* have that cached pack-object for that specific commit-range
>> should be given first priority, to avoid other clients from having to
>> make massive amounts of git pack-objects.
>
> Cache have its limits too. Suppose I half-fetch a pack then stop and
> go wild for a month. The next month I restart the fetch, the pack may
> no longer in cache. A new pack may or may not be identical to the old
> pack.

 correct.  that's not in the slightest bit a problem.  the peer which
has that new pack will be asked to make a new .torrent for _that_
pack.  with a new name that uniquely identifies it (the md5sum of the
pack would do as the .torrent filename)

> Also if you go with packs, you are tied to the peer that generates
> that pack. Two different peers can, in theory, generate different
> packs (in encoding) for the same input.

 yes.  correct.  i _did_ say that you pick the one that is the
smallest of the two (or three.  or 10).  in this way you actually do
much better than you would otherwise in a "star network" such as a
standard HTTP git server, because you've asked 2, 3 or 10 (whatever)
peers, and you'll end up with _the_ most efficient representation of
that commit-range.  statistically speaking, of course :)


> Another thing with packs (ok, not exactly with packs) is how you
> verify that's you have got what you asked.

 ok - how do you verify that you've got what you asked for, when you
fetch from a git server using HTTP?

> Bittorrent can verify every
> piece a peer receives because sha-1 sum of those pieces are recorded
> in .torrent file.

 yes.  this is simply a part of the bittorrent protocol, to ensure
that the file being transferred is correctly transferred.

 these verifications steps should be _trusted_ and should _not_ be
confused with anything else (i've deleted the rest of the paragraph
you wrote, in order to reduce any opportunity for confusion).

 if you mentally keep git separate from bittorrent it helps.  imagine
that bittorrent is merely a drop-in replacement for git over HTTP
(nicolas kindly explained the plugin system for git which would add
another protocol for downloading of git repos, and yes this can all be
implemented as a plugin)


>> so, can you see that a) this is a far cry from the "simplistic
>> transfer of blobs and trees" b) it's *not* going to overload peoples'
>> systems by splattering (eek!) millions of md5 sums across the internet
>> as bittorrent files c) it _does_ fit neatly into the bittorrent
>> protocol d) it combines the best of git with the best of p2p
>> distributed networking principles...
>
> How can you advertise what you have to another peer?

 you don't.  it's done "on-demand".

 the concept of "git push" becomes virtually a null-op, updating the
bittorrent tracker and that's... about it.

 it's where "git pull" that all the work is done, starting with that
DHT query [no i know "bittorrent the protocol" doesn't have DHT, but
many bittorrent clients _do_ have DHT, and Tribler has an extremely
good one].

 l.


* Re: Resumable clone/Gittorrent (again)
  2011-01-05 16:23 Resumable clone/Gittorrent (again) Nguyen Thai Ngoc Duy
  2011-01-05 16:56 ` Luke Kenneth Casson Leighton
  2011-01-05 23:28 ` Maaartin
@ 2011-01-07  3:21 ` Nicolas Pitre
  2011-01-07  6:34   ` Nguyen Thai Ngoc Duy
  2011-01-07 15:59   ` Luke Kenneth Casson Leighton
  2 siblings, 2 replies; 25+ messages in thread
From: Nicolas Pitre @ 2011-01-07  3:21 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy; +Cc: Git Mailing List, Luke Kenneth Casson Leighton

On Wed, 5 Jan 2011, Nguyen Thai Ngoc Duy wrote:

> Hi,
> 
> I've been analyzing bittorrent protocol and come up with this. The
> last idea about a similar thing [1], gittorrent, was given by Nicolas.
> This keeps close to that idea (i.e the transfer protocol must be around git
> objects, not file chunks) with a bit difference.
> 
> The idea is to transfer a chain of objects (trees or blobs), including
> base object and delta chain. Objects are chained in according to
> worktree layout, e.g. all objects of path/to/any/blob will form a
> chain, from a commit tip down to the root commits. Chains can have
> gaps, and don't need to start from commit tip. The transfer is
> resumable because if a delta chain is corrupt at some point, we can
> just request another chain from where it stops. Base object is
> obviously resumable.

How do you actually define your chain?  Given that Git is conceptually 
snapshot based, there is currently no relationship between two blobs 
forming the content for two different versions of the same file.  Even 
delta objects are not really part of the Git data model as they are only 
an encoding variation of a given primary object.  In fact, we may and 
actually do have deltas where the base object is not from the same 
worktree file as the delta object itself.

The only thing that 
ties this all together is the commit graph.  And that graph might have 
multiple forks and merges, so any attempt at a linear representation
as a chain is rather futile.  Therefore it is not clear to me how you
can define a chain with a beginning and an end, and how this can be 
resumed midway.

> We start by fetching all commit contents reachable from a commit tip.

Sure.  This is doable today and is called a shallow clone with depth=1.

> This is a chain, therefore resumable.

I don't get that part though.  How is this resumable?  That's the very 
issue we have with a clone.

I proposed a solution to that already, which is to use 
git-upload-archive for one of the tip commits, since the data stream 
produced by upload-archive (once decompressed) is actually 
deterministic.  Once completed, this can be converted into a shallow 
clone on the client side, and can be deepened in smaller steps 
afterwards.

> From there each commit can be
> examined. Missing trees and blobs will be fetched as chains. Everytime
> a delta is received, we can recreate the new object and verify it (we
> should have its SHA-1 from its parent trees/commits).

What if the delta is based on an object from another chain?  How do you 
determine which chain to ask for to get that base?

> Because these chains are quite independent, in a sense that a blob
> chain is independent from another blob chain (but requires tree
> chains, of course). We can fetch as many as we want in parallel, once
> we're done with the commit chain.

But in practice, most of those chains will end up containing objects 
which are duplicates of objects in another chain.  How do you tell the 
remote that you want part of a chain because you've got 96% of it in 
another chain already?

> The last thing I like about these chains is that the number of chains
> is reasonable. It won't increase too fast over time (as compared to
> the number of commits). As such it maps well to BitTorrent's "pieces".

My problem right now is that I don't see how this maps well to Git.


Nicolas


* Re: Resumable clone/Gittorrent (again)
  2011-01-07  3:21 ` Nicolas Pitre
@ 2011-01-07  6:34   ` Nguyen Thai Ngoc Duy
  2011-01-07 15:59   ` Luke Kenneth Casson Leighton
  1 sibling, 0 replies; 25+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2011-01-07  6:34 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Git Mailing List, Luke Kenneth Casson Leighton

On Fri, Jan 7, 2011 at 10:21 AM, Nicolas Pitre <nico@fluxnic.net> wrote:
> How do you actually define your chain?  Given that Git is conceptually
> snapshot based, there is currently no relationship between two blobs
> forming the content for two different versions of the same file.  Even
> delta objects are not really part of the Git data model as they are only
> an encoding variation of a given primary object.  In fact, we may and
> actually do have deltas where the base object is not from the same
> worktree file as the delta object itself.
>
> The only thing that
> ties this all together is the commit graph.  And that graph might have
> multiple forks and merges so any attempt at a linearity representation
> into a chain is rather futile.  Therefore it is not clear to me how you
> can define a chain with a beginning and an end, and how this can be
> resumed midway.

There's no need to be linear. OK, it's not a chain but a DAG of
objects that share the same path, with the same structure as the commit DAG.

>> We start by fetching all commit contents reachable from a commit tip.
>
> Sure.  This is doable today and is called a shalow clone with depth=1.

I meant only commit objects, no trees nor blobs.

>> This is a chain, therefore resumable.
>
> I don't get that part though.  How is this resumable?  That's the very
> issue we have with a clone.

I assume that all commits are sent in an order such that parent commits
always come after the commit in question. We can make a pack of undeltified
commit objects in that order. That would make sure we can recover a
connected commit DAG from the tip even if the pack cannot be sent
completely to the client.

We can then traverse the commit graph we have and request a pack of the
missing commits to grow the commit DAG, until we have all commits.
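
For illustration, stock rev-list already produces such an ordering
("--topo-order" guarantees no parent is shown before all of its children), so
a pack written in that order stays connected from the tip even if truncated:

    import subprocess

    def commits_tip_first(tip):
        # Children come before parents, so any prefix of this list is a
        # connected subgraph hanging off the tip.
        return subprocess.check_output(
            ["git", "rev-list", "--topo-order", tip], text=True).split()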

> I proposed a solution to that already, which is to use
> git-upload-archive for one of the tip commit since the data stream
> produced by upload-archive (once decompressed) is actually
> deterministic.  Once completed, this can be converted into a shalow
> clone on the client side, and can be deepened in smaller steps
> afterwards.

You see, I don't send trees and blobs in this phase. There are three
phases. Phase 1 fetches all commits. Once we have all commits, we can
use them to request packs of trees with the same path. Those packs are
like the commit pack, but deltified. That's phase 2. When we have
enough trees, we can proceed to phase 3: fetching packs of blobs.

>> From there each commit can be
>> examined. Missing trees and blobs will be fetched as chains. Everytime
>> a delta is received, we can recreate the new object and verify it (we
>> should have its SHA-1 from its parent trees/commits).
>
> What if the delta is based on an object from another chain?  How do you
> determine which chain to ask for to get that base?

Chains should be independent. If a chain is based on another chain and
we have not got its base yet (because the other chain is not
complete), we can fetch the base separately. In theory we need to
fetch one version of every path for them to become bases. So it's
like a broken-down version of git-upload-archive.

>> Because these chains are quite independent, in a sense that a blob
>> chain is independent from another blob chain (but requires tree
>> chains, of course). We can fetch as many as we want in parallel, once
>> we're done with the commit chain.
>
> But in practice, most of those chains will end up containing objects
> which are duplicate of objects in another chain.  How do you tell the
> remote that you want part of a chain because you've got 96% of it in
> another chain already?

Because all clients should have the full commit graph (without trees and
blobs) before doing anything else, they should be able to specify a rev
list for the chain they need. So if you only need SHA1~76..SHA1~100 of
a path, say so to the remote side. SHA1 must be one of the refs on the remote
side, so it can parse the syntax and quickly verify whether "SHA1~76" is
reachable/allowed to be transferred.

>> The last thing I like about these chains is that the number of chains
>> is reasonable. It won't increase too fast over time (as compared to
>> the number of commits). As such it maps well to BitTorrent's "pieces".
>
> My problem right now is that I don't see how this maps well to Git.

Git sees a repository as a history of snapshots. This way I see it as a
bunch of "git log -- path" runs, which is not that bad.
-- 
Duy


* Re: Resumable clone/Gittorrent (again)
  2011-01-07  3:21 ` Nicolas Pitre
  2011-01-07  6:34   ` Nguyen Thai Ngoc Duy
@ 2011-01-07 15:59   ` Luke Kenneth Casson Leighton
  2011-01-08  2:17     ` Nguyen Thai Ngoc Duy
  1 sibling, 1 reply; 25+ messages in thread
From: Luke Kenneth Casson Leighton @ 2011-01-07 15:59 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Nguyen Thai Ngoc Duy, Git Mailing List

On Fri, Jan 7, 2011 at 3:21 AM, Nicolas Pitre <nico@fluxnic.net> wrote:

>> The last thing I like about these chains is that the number of chains
>> is reasonable. It won't increase too fast over time (as compared to
>> the number of commits). As such it maps well to BitTorrent's "pieces".
>
> My problem right now is that I don't see how this maps well to Git.

 bittorrent as "just another file getting method" maps very well.

 only with some modifications to the bittorrent protocol would the
concept map well to bittorrent "pieces" because the pieces are at
present a) fixed size b) defined by a heuristic based on the file size
(logarithmic) so that the number of pieces are kept to within a
reasonable limit.

 bottom line: my take on this is (sorry to say, nguyen) that i don't
believe bittorrent "pieces" map well to the chains concept, unless the
chains are perfectly fixed identical sizes [which they could well be?
am i mistaken about this?]

 l.


* Re: Resumable clone/Gittorrent (again)
  2011-01-06  6:36       ` Nguyen Thai Ngoc Duy
@ 2011-01-08  1:04         ` Maaartin-1
  2011-01-08  2:40           ` Nguyen Thai Ngoc Duy
  0 siblings, 1 reply; 25+ messages in thread
From: Maaartin-1 @ 2011-01-08  1:04 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy; +Cc: git

On 11-01-06 07:36, Nguyen Thai Ngoc Duy wrote:
> On Thu, Jan 6, 2011 at 10:34 AM, Maaartin-1 <grajcar1@seznam.cz> wrote:
>> In theory, I could create many commits per seconds. I could create many
>> unique paths per seconds, too. But I don't think it really happens. I do
>> know no larger repository than git.git and I don't want to download it
>> just to see how many commits, paths, and object it contains, but I'd
>> suppose it's less than one million commits, which should be manageable,
>> especially when commits get grouped together as I described below.
> 
> In pratice, commits are created every day in an active project. Paths
> on the other hand are added less often (perhaps except webkit).
> 
> I've got some numbers:
> 
>  - wine.git has 72k commits, 260k trees, 200k blobs, 12k paths
>  - git.git has 24k commits, 39k trees, 24k blobs, 2.7k paths
>  - linux-2.6.git has 160k commits, 760k trees, 442k blobs, 46k paths
> 
> Large repos are more interesting because small ones can be cloned with
> git-clone.

Sure. Linux is the winner and has 4 times as many commits as paths.

> Listing all those commits in linux-2.6.git takes 160k*20=3M (I suppose
> compressing is useless because SHA-1 is random). A compressed listing
> of those 46k paths takes 200k.

Sure, Linux has only 4 times as many commits as paths, but the commits
need 30 times more storage. What does that tell us?

IMHO it speaks in favor of my proposal. Imagine a path changing with
nearly every commit. The root directory is such a path, and near-top
directories come close (as may other files like todo lists). For
each such file you need 3MB just for storing the commit SHAs. Of
course, you can invent a scheme that makes storing all the SHAs unnecessary,
but that is another complication.

OTOH, with the commits used as directory entries we get quite a large
directory. Is this the problem you wanted to make me aware of?

> The point is you need to fetch its parent commits first in order to
> verify a commit. Fetching a whole commit is more expensive than a
> file. So while you can fetch a few commit bases and request for packs
> from those bases in parallel, the cost of initial commit bases will be
> high.

You've lost me. I assume you mean something like: there may be
very large commits (e.g., in a project not versioned from the very
beginning). I'd suggest splitting such commits into two parts by
classifying the blobs (and trees) using a fixed bit of their SHAs. Of
course, this can be repeated in order to get even smaller parts. Let's
assume a commit X gets split into X0 and X1. As before, for compressing
X0 you may use the content of any predecessor of X. For compressing
X1 you may additionally use the content of X0. This way the compression
ratio could stay close to optimal, IMHO.
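
A sketch of that split (purely illustrative; which bit of the object ID gets
tested is arbitrary):

    def split_by_id_bit(object_ids, bit=0):
        # Partition a commit's blob/tree IDs into X0/X1 by one fixed bit of
        # their (hex) SHA-1; repeat with further bits for smaller parts.
        x0, x1 = [], []
        for oid in object_ids:
            selected = (int(oid[0], 16) >> (3 - bit)) & 1
            (x1 if selected else x0).append(oid)
        return x0, x1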

> They are interchangeable as a whole, yes. But you cannot fetch half
> the pack from server A and the other half from server B. You can try
> to recover as many deltas as possible in a broken pack, but how do you
> request a server to send the rest of the pack to you?

Indeed, it's not resumable. For most commits that's not needed, since they
are very small. Why? There are more commits than paths, so the commits
are smaller than the paths on average. I expect my scheme to allow for
nearly as good compression as git usually achieves; in particular, I'd hope it's
no worse than when packing paths.

However, there may be very large commits in my scheme (and maybe also
very large "path-packs" in yours). Such large commits get split as I
described above. Small commits get paired (possibly multiple times) as I
described earlier. You end up with only reasonably sized pieces of data,
let's say between 256 and 512 kB, so you don't need to resume.

Actually, with a really bad connection, you could ask the very server
from which you obtained an incomplete pack to resume from a given byte
offset (similar to HTTP ranges). The server may or may not still have it. This
time it should try to keep it available for you in case your connection
aborts again. Don't get me wrong -- this is just additional help for
very badly connected people.


* Re: Resumable clone/Gittorrent (again)
  2011-01-07 15:59   ` Luke Kenneth Casson Leighton
@ 2011-01-08  2:17     ` Nguyen Thai Ngoc Duy
  2011-01-08 17:21       ` Luke Kenneth Casson Leighton
  0 siblings, 1 reply; 25+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2011-01-08  2:17 UTC (permalink / raw)
  To: Luke Kenneth Casson Leighton; +Cc: Nicolas Pitre, Git Mailing List

On Fri, Jan 7, 2011 at 10:59 PM, Luke Kenneth Casson Leighton
<luke.leighton@gmail.com> wrote:
>  bottom line: my take on this is (sorry to say, nguyen) that i don't
> believe bittorrent "pieces" map well to the chains concept, unless the
> chains are perfectly fixed identical sizes [which they could well be?
> am i mistaken about this?]

there are a few characteristics of bittorrent pieces that i see:
verifiable, resumable, uniquely identifiable across peers and
reasonably small in count.

The fixed size helps peers uniquely identify any pieces by splitting
the whole transfer equally and indexing them in 1-dimension. It's a
rule to help identify pieces, not the only valid rule. In git, given a
commit SHA-1 we can identify any parts (or "pieces") that are
reachable from that commit using rev-list syntax. Therefore the
"identifiable" characteristic still holds even if we don't split into
fixed-size pieces.
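
A tiny sketch of why such names suffice, assuming each peer keeps a local
parents() lookup for the commits it already holds; it only handles the ~N
and ^K commit steps, and trees/blobs would additionally be named by path
within the resolved commit:

    import re

    def resolve(tip, spec, parents):
        """tip: commit SHA-1; spec: a suffix like '~12^2~34'; parents(sha) -> [sha, ...].
        Every peer with the same commit graph resolves the same name to the same
        commit, so no fixed-size piece index is needed."""
        cur = tip
        for op, num in re.findall(r"([~^])(\d*)", spec):
            n = int(num) if num else 1
            if op == "~":                    # follow the first parent n times
                for _ in range(n):
                    cur = parents(cur)[0]
            else:                            # '^k' picks the k-th parent of a merge
                cur = parents(cur)[n - 1]
        return cur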
-- 
Duy

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Resumable clone/Gittorrent (again)
  2011-01-08  1:04         ` Maaartin-1
@ 2011-01-08  2:40           ` Nguyen Thai Ngoc Duy
  0 siblings, 0 replies; 25+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2011-01-08  2:40 UTC (permalink / raw)
  To: Maaartin-1; +Cc: git

On Sat, Jan 8, 2011 at 8:04 AM, Maaartin-1 <grajcar1@seznam.cz> wrote:
>> Listing all those commits in linux-2.6.git takes 160k*20=3M (I suppose
>> compressing is useless because SHA-1 is random). A compressed listing
>> of those 46k paths takes 200k.
>
> Sure, Linux has only 4 times as much commits as paths, but the commits
> need 30 times more storage. What does it tell us?
>
> IMHO it speaks in favor of my proposal. Imagine a path changing with
> nearly every commit. The root directory is such a path, and near-top
> directories come close to it (as may other files like todo-lists). For
> each such file you need 3 MB just for storing the commit SHAs. Of
> course, you can invent a schema that makes storing all the SHAs
> unnecessary, but that is another complication.
>
> OTOH, with the commits used as directory entries, we get quite a large
> directory. Is this the problem you wanted to make me aware of?

I merely point out that if we use commit SHA-1s as "pieces", then when
a new peer comes in and asks a running peer "what pieces have you got
(so that I can start fetching from you)?", you will need more
bandwidth for that kind of information.

>> The point is you need to fetch its parent commits first in order to
>> verify a commit. Fetching a whole commit is more expensive than a
>> file. So while you can fetch a few commit bases and request for packs
>> from those bases in parallel, the cost of initial commit bases will be
>> high.
>
> You've lost me. I assume you mean something like this: there may be
> very large commits (e.g., in a project not versioned from the very
> beginning). I'd suggest splitting such commits into two parts by
> classifying the blobs (and trees) using a fixed bit of their SHAs. Of
> course, this can be repeated in order to get even smaller parts. Let's
> assume a commit X gets split into X0 and X1. As before, for compressing
> X0 you may use the content of any predecessor of X. For compressing X1
> you may additionally use the content of X0. This way the compression
> rate could stay close to optimal, IMHO.

Well, if you are going to split a commit, then splitting by paths
sounds more natural to me (assuming that people don't often move files).

>> They are interchangeable as a whole, yes. But you cannot fetch half
>> the pack from server A and the other half from server B. You can try
>> to recover as many deltas as possible in a broken pack, but how do you
>> request a server to send the rest of the pack to you?
>
> Indeed, it's not resumable. For most commits that isn't needed, since
> they are very small. Why? There are more commits than paths, so the
> commits are smaller than paths on average. I expect my schema to allow
> for nearly as good compression as git usually achieves; in particular,
> I'd hope it's no worse than when packing by paths.

A commit diff consists of all the tree and blob diffs compared to the
parent commit (let's ignore merges). How can it be smaller than just a
single tree/blob diff (of the same path, compared to the parent
commit)?

> However, there may be very large commits in my schema (and maybe also
> very large "path-packs" in yours). Such large commits get split as I
> described above. Small commits get paired (possibly multiple times) as I
> described earlier. You end up with only reasonably sized pieces of data,
> let's say between 256 and 512 kB, so you don't need to resume.

Yeah, I started thinking about how to transfer efficiently and came to
a similar thing: assume we have a good packing order and know what to
pack; then we close the pack when its size exceeds a limit and start
sending another pack. If this pack stream is corrupt, resume from the
corrupt pack onward. I currently hardcode the limit at 4k, not larger,
because pack overhead is already very low (12 header bytes and 20
SHA-1 trailer bytes per pack).
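
Roughly like this, as a sketch; send_pack() and the already-serialized
payloads are stand-ins, and only the 12-byte header and the 20-byte SHA-1
trailer mirror the real pack format:

    import hashlib

    LIMIT = 4096   # the hardcoded per-pack size limit mentioned above

    def stream_packs(payloads, send_pack):
        """payloads: serialized object entries (bytes), already in packing order.
        Closes a pack once it grows past LIMIT; per-pack overhead is only the
        12-byte header plus the 20-byte SHA-1 trailer."""
        body, count = b"", 0
        for entry in payloads:
            body += entry
            count += 1
            if len(body) >= LIMIT:
                flush(body, count, send_pack)
                body, count = b"", 0
        if body:
            flush(body, count, send_pack)

    def flush(body, count, send_pack):
        header = b"PACK" + (2).to_bytes(4, "big") + count.to_bytes(4, "big")
        pack = header + body
        send_pack(pack + hashlib.sha1(pack).digest())   # trailer checksums the pack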

> Actually, with a really bad connection, you could ask the very server
> from which you obtained an incomplete pack to resume from a given byte
> offset (similar to HTTP ranges). The server may or may not have it. This
> time it should try to keep it available for you in case your connection
> aborts again. Don't get me wrong -- this is just an additional help for
> very badly connected people.

Replace "byte range" with rev-list syntax (SHA1~10..SHA1~20) and we
have a quite fine-grained way of asking for data. Deltas are usually
very small (I only observed cache.h, so this could be a wrong
assumption). But if SHA1~10..SHA1~11 has a diff so big that it keeps
failing, sending byte ranges of the blob in SHA1~11 is probably the
only way left.
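
A possible shape for such requests, purely illustrative (none of these
field names exist in any protocol yet):

    def next_request(range_spec, failed_blob=None):
        if failed_blob is None:
            # normal case: ask for everything a rev-list range introduces
            return {"want": range_spec}                  # e.g. "SHA1~20..SHA1~10"
        # fallback: one blob's diff keeps failing, so ask for raw byte ranges of it
        return {"want_blob": failed_blob, "bytes": [0, 64 * 1024]}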
-- 
Duy

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Resumable clone/Gittorrent (again)
  2011-01-08  2:17     ` Nguyen Thai Ngoc Duy
@ 2011-01-08 17:21       ` Luke Kenneth Casson Leighton
  2011-01-09  3:34         ` Nguyen Thai Ngoc Duy
  2011-01-10 21:38         ` Sam Vilain
  0 siblings, 2 replies; 25+ messages in thread
From: Luke Kenneth Casson Leighton @ 2011-01-08 17:21 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy; +Cc: Nicolas Pitre, Git Mailing List

On Sat, Jan 8, 2011 at 2:17 AM, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:
> On Fri, Jan 7, 2011 at 10:59 PM, Luke Kenneth Casson Leighton
> <luke.leighton@gmail.com> wrote:
>>  bottom line: my take on this is (sorry to say, nguyen) that i don't
>> believe bittorrent "pieces" map well to the chains concept, unless the
>> chains are perfectly fixed identical sizes [which they could well be?
>> am i mistaken about this?]
>
> there are a few characteristics of bittorrent pieces that i see:
> verifiable, resumable, uniquely identifiable across peers and
> reasonably small in count.
>
> The fixed size helps peers uniquely identify any pieces by splitting
> the whole transfer equally and indexing them in 1-dimension.

 ok - you haven't answered the question: are the chains perfectly
fixed identical sizes?

 if so they can be slotted into the bittorrent protocol by simply
pre-selecting the size to match.  with the downside that if there are
a million such "chains" you now pretty much overwhelm the peers with
the amount of processing, network traffic and memory requirements to
maintain the "pieces" map.

 if not then you now need to modify the bittorrent protocol to cope
with variable-length block sizes: the protocol only allows for the
last block to be of variable-length.

 also, it's worth pointing out that the entire code-base of every
single bittorrent client that you will ever be able to find revolves
around the concept of reassembly of files from "pieces".

 bottom line: the bittorrent protocol and the available bittorrent
source code libraries, both of which save you a great deal of time in
getting something up-and-running, are _not_ the right fit for the
concept of placing the proposed "chains" into bittorrent "pieces".

 translation: if you wish to pursue the "chains" concept, either a
heavily-modified bittorrent protocol and client _or_ an entirely new
peer-to-peer protocol is far more appropriate.

 orrr, doing what i initially suggested, which is to leave the
bittorrent protocol as-is and to open one .torrent per "chain".
especially if these "chains" vary considerably in size (k to gb)

l.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Resumable clone/Gittorrent (again)
  2011-01-08 17:21       ` Luke Kenneth Casson Leighton
@ 2011-01-09  3:34         ` Nguyen Thai Ngoc Duy
  2011-01-09 13:55           ` Luke Kenneth Casson Leighton
  2011-01-10 21:38         ` Sam Vilain
  1 sibling, 1 reply; 25+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2011-01-09  3:34 UTC (permalink / raw)
  To: Luke Kenneth Casson Leighton; +Cc: Nicolas Pitre, Git Mailing List

On Sun, Jan 9, 2011 at 12:21 AM, Luke Kenneth Casson Leighton
<luke.leighton@gmail.com> wrote:
>  ok - you haven't answered the question: are the chains perfectly
> fixed identical sizes?

No.

>  if so they can be slotted into the bittorrent protocol by simply
> pre-selecting the size to match.  with the downside that if there are
> a million such "chains" you now pretty much overwhelm the peers with
> the amount of processing, network traffic and memory requirements to
> maintain the "pieces" map.

No, there are only thousands of them (less than 100k for the repos I
examined). That's precisely the reason I stay away from commits as
pieces: the number of commits can potentially go up to millions.

>  if not then you now need to modify the bittorrent protocol to cope
> with variable-length block sizes: the protocol only allows for the
> last block to be of variable-length.

Ah, I see. I do not reuse the bittorrent code out there, just its
ideas, adapted to the git model. If you don't want to modify the
bittorrent protocol at all, seed a bundle (as mentioned in another
thread).
-- 
Duy

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Resumable clone/Gittorrent (again)
  2011-01-09  3:34         ` Nguyen Thai Ngoc Duy
@ 2011-01-09 13:55           ` Luke Kenneth Casson Leighton
  2011-01-09 17:48             ` Nguyen Thai Ngoc Duy
  0 siblings, 1 reply; 25+ messages in thread
From: Luke Kenneth Casson Leighton @ 2011-01-09 13:55 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy; +Cc: Nicolas Pitre, Git Mailing List

On Sun, Jan 9, 2011 at 3:34 AM, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:
> On Sun, Jan 9, 2011 at 12:21 AM, Luke Kenneth Casson Leighton
> <luke.leighton@gmail.com> wrote:
>>  ok - you haven't answered the question: are the chains perfectly
>> fixed identical sizes?
>
> No.
>
>>  if so they can be slotted into the bittorrent protocol by simply
>> pre-selecting the size to match.  with the downside that if there are
>> a million such "chains" you now pretty much overwhelm the peers with
>> the amount of processing, network traffic and memory requirements to
>> maintain the "pieces" map.
>
> No, there are only thousands of them (less than 100k for the repos I
> examined). That's precisely the reason I stay away from commits as
> pieces: the number of commits can potentially go up to millions.

 ok - thousands is still a lot.  i recommend that you examine:

 * the heuristics algorithm in bittorrent for piece-selection
 * large repositories such as webkit (1.2gb) and the linux kernel (600mb)

 you still have to come up with a mapping from "chains" to "pieces".
in the bittorrent protocol the mapping is done *entirely* implicitly
and algorithmically.  the "meta" info in the .torrent contains
filenames and file lengths.  stack the files one after the other in a
big long data block, get a chopper and just go "whack, whack, whack"
at regular piece-long points, that's your "pieces".  so, reassembly is
a complete bitch, and picking just _one_ file to download rather than
the whole lot becomes a total pain.
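
 for reference, that implicit mapping really is just this arithmetic (a
sketch of stock bittorrent behaviour, nothing git-specific):

    def piece_for(file_lengths, file_index, offset, piece_length):
        """Files are laid end to end and chopped at fixed piece_length boundaries,
        so one file's data usually straddles pieces that also hold bytes of its
        neighbours -- which is exactly why fetching just one file is painful."""
        global_offset = sum(file_lengths[:file_index]) + offset
        return global_offset // piece_length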

why the bloody hell the bittorrent protocol doesn't just have a file
id i _really_ don't know, it would have made things a damn sight
easier.  anyway - if you're going to modify and "be inspired by" the
bittorrent protocol, you really should look at adding some sort of
"chain" identification - f*** the "chains"-to-"pieces" algorithm, just
add a unique chain id to the relevant bittorrent[-like] command.


>>  if not then you now need to modify the bittorrent protocol to cope
>> with variable-length block sizes: the protocol only allows for the
>> last block to be of variable-length.
>
> Ah, I see. I do not reuse the bittorrent code out there, just its
> ideas, adapted to the git model.

 that's hard work and you're now into "unproven" territory.  for the
successful R&D proof-of-concept code that i wrote, i _deliberately_
stayed away from "adapting" a proven bittorrent protocol, and as a
result managed to get that proof-of-concept up and running within ...
i think it was... 3 days.  most of the time was spent arseing about
adding in a VFS layer into bittornado, in order to libratise it.

i mention that just to give you something to think about.  if you're
up to the challenge of writing your own p2p protocol, however, GREAT!
you'll become a world expert on _both_ peer-to-peer protocols _and_
git :)

 l.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Resumable clone/Gittorrent (again)
  2011-01-09 13:55           ` Luke Kenneth Casson Leighton
@ 2011-01-09 17:48             ` Nguyen Thai Ngoc Duy
  2011-01-13 11:39               ` Luke Kenneth Casson Leighton
  0 siblings, 1 reply; 25+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2011-01-09 17:48 UTC (permalink / raw)
  To: Luke Kenneth Casson Leighton; +Cc: Nicolas Pitre, Git Mailing List

On Sun, Jan 9, 2011 at 8:55 PM, Luke Kenneth Casson Leighton
<luke.leighton@gmail.com> wrote:
>  you still have to come up with a mapping from "chains" to "pieces".
> in the bittorrent protocol the mapping is done *entirely* implicitly
> and algorithmically.

Given a commit SHA-1, the mapping can be done algorithmically because
the graph from the commit tip is fixed. Perhaps not mapping all at
once, but as you have more pieces in the graph, you can map more.

> the "meta" info in the .torrent contains
> filenames and file lengths.  stack the files one after the other in a
> big long data block, get a chopper and just go "whack, whack, whack"
> at regular piece-long points, that's your "pieces".  so, reassembly is
> a complete bitch, and picking just _one_ file to download rather than
> the whole lot becomes a total pain.

Well, there won't be .torrent files. Torrent files serve as checksums
for file pieces (let's forget the tracker part). We do SHA-1 checksums
on every object in git. The object graph without the real content _is_
the "info" part of a .torrent file. Instead of passing around torrent
files, I only need to pass around the SHA-1 of the commit tip(s). That
should be enough for any peer to discover the rest.
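
As a sketch, the "info"-equivalent can be recomputed locally from nothing
but a tip with stock plumbing, once the commits and trees are present:

    import subprocess

    def manifest(tip):
        """Enumerate every object reachable from the tip; each line of
        rev-list --objects is '<sha1> [path]', and the sha1 alone names a piece."""
        out = subprocess.run(["git", "rev-list", "--objects", tip],
                             capture_output=True, text=True, check=True).stdout
        return [line.split()[0] for line in out.splitlines()]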

Reassembling, in its simplest form, is just dumping loose objects into
$GIT_DIR/objects. But it's been six years since git's birth now, so
I'll pack them instead.

>  that's hard work and you're now into "unproven" territory.  for the
> successful R&D proof-of-concept code that i wrote, i _deliberately_
> stayed away from "adapting" a proven bittorrent protocol, and as a
> result managed to get that proof-of-concept up and running within ...
> i think it was... 3 days.  most of the time was spent arseing about
> adding in a VFS layer into bittornado, in order to libratise it.
>
> i mention that just to give you something to think about.  if you're
> up to the challenge of writing your own p2p protocol, however, GREAT!
> you'll become a world expert on _both_ peer-to-peer protocols _and_
> git :)

Maybe I have gone insane ;) But I have another aim for this work: to
adjust it to the narrow clone area (pretty much path-based clones). So
while it may not become a real torrent for git (i.e. p2p exchanging,
depending on my needs), a restartable clone from multiple sources is
still worth it.
-- 
Duy

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Resumable clone/Gittorrent (again)
  2011-01-08 17:21       ` Luke Kenneth Casson Leighton
  2011-01-09  3:34         ` Nguyen Thai Ngoc Duy
@ 2011-01-10 21:38         ` Sam Vilain
  1 sibling, 0 replies; 25+ messages in thread
From: Sam Vilain @ 2011-01-10 21:38 UTC (permalink / raw)
  To: Luke Kenneth Casson Leighton
  Cc: Nguyen Thai Ngoc Duy, Nicolas Pitre, Git Mailing List

On 09/01/11 06:21, Luke Kenneth Casson Leighton wrote:
> On Sat, Jan 8, 2011 at 2:17 AM, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:
>> there are a few characteristics of bittorrent pieces that i see:
>> verifiable, resumable, uniquely identifiable across peers and
>> reasonably small in count.
>>
>> The fixed size helps peers uniquely identify any pieces by splitting
>> the whole transfer equally and indexing them in 1-dimension.
>  ok - you haven't answered the question: are the chains perfectly
> fixed identical sizes?
>
>  if so they can be slotted into the bittorrent protocol by simply
> pre-selecting the size to match.  with the downside that if there are
> a million such "chains" you now pretty much overwhelm the peers with
> the amount of processing, network traffic and memory requirements to
> maintain the "pieces" map.

I'll respond also to this sub-point.  This can be done; but instead of
doing it at the pack level, you take the list of objects between A and B
(for a fetch from A to B), order them by some deterministic order
(called the "commit reel" in the Gittorrent RFC) and then carve that
list up into chunks based on the uncompressed object sizes.

The ordering defined in the RFC is such that it is possible to create
"thin" packs for discrete ranges of commits using existing plumbing, so
that the total transfer size is relatively similar to a complete clone. 
In experiments the network overhead was found to be around 10-20% in
this way.
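
A rough sketch of the carving step, assuming the reel order has already
been computed; object_size() stands in for something like
git cat-file --batch-check, and the 512 kB target is only an example, not
a figure from the RFC:

    CHUNK_TARGET = 512 * 1024   # illustrative uncompressed size per chunk

    def carve(reel, object_size):
        """reel: object SHA-1s already in the deterministic 'commit reel' order.
        Cuts the list into chunks by accumulated uncompressed size, so every
        peer that computes the same reel gets the same chunk boundaries."""
        chunk, total = [], 0
        for sha in reel:
            chunk.append(sha)
            total += object_size(sha)
            if total >= CHUNK_TARGET:
                yield chunk
                chunk, total = [], 0
        if chunk:
            yield chunk             # final, possibly short chunk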

However I must discourage looking for "inspiration" from the Bittorrent
protocol; it reinvents many wheels unnecessarily, and contains much
shonky advice in it.  See the revision history for the gittorrent RFC
(github.com/samv/gittorrent) for the gory details.

Sam

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Resumable clone/Gittorrent (again)
  2011-01-09 17:48             ` Nguyen Thai Ngoc Duy
@ 2011-01-13 11:39               ` Luke Kenneth Casson Leighton
  2011-01-13 23:40                 ` Sam Vilain
  0 siblings, 1 reply; 25+ messages in thread
From: Luke Kenneth Casson Leighton @ 2011-01-13 11:39 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy; +Cc: Nicolas Pitre, Git Mailing List

On Sun, Jan 9, 2011 at 5:48 PM, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:
> On Sun, Jan 9, 2011 at 8:55 PM, Luke Kenneth Casson Leighton
> <luke.leighton@gmail.com> wrote:
>>  you still have to come up with a mapping from "chains" to "pieces".
>> in the bittorrent protocol the mapping is done *entirely* implicitly
>> and algorithmically.
>
> Given a commit SHA-1, the mapping can be done algorithmically because
> the graph from the commit tip is fixed. Perhaps not mapping all at
> once, but as you have more pieces in the graph, you can map more.

 you're _sure_ about this?  what happens when new commits get added,
and change that graph?  are you _certain_ that you can write an
algorithm which is capable of generating exactly the same mapping,
even as more commits are added to the repository being mirrored, or,
does that situation not matter?

l.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Resumable clone/Gittorrent (again)
  2011-01-13 11:39               ` Luke Kenneth Casson Leighton
@ 2011-01-13 23:40                 ` Sam Vilain
  2011-01-14 14:26                   ` Luke Kenneth Casson Leighton
  0 siblings, 1 reply; 25+ messages in thread
From: Sam Vilain @ 2011-01-13 23:40 UTC (permalink / raw)
  To: Luke Kenneth Casson Leighton
  Cc: Nguyen Thai Ngoc Duy, Nicolas Pitre, Git Mailing List

On 14/01/11 00:39, Luke Kenneth Casson Leighton wrote:
> On Sun, Jan 9, 2011 at 5:48 PM, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:
>> On Sun, Jan 9, 2011 at 8:55 PM, Luke Kenneth Casson Leighton
>> <luke.leighton@gmail.com> wrote:
>>>  you still have to come up with a mapping from "chains" to "pieces".
>>> in the bittorrent protocol the mapping is done *entirely* implicitly
>>> and algorithmically.
>> Given a commit SHA-1, the mapping can be done algorithmically because
>> the graph from the commit tip is fixed. Perhaps not mapping all at
>> once, but as you have more pieces in the graph, you can map more.
>  you're _sure_ about this?  what happens when new commits get added,
> and change that graph?  are you _certain_ that you can write an
> algorithm which is capable of generating exactly the same mapping,
> even as more commits are added to the repository being mirrored, or,
> does that situation not matter?

For a given set of start and end points, and a given sort algorithm,
walking the commit tree can yield deterministic results.

You need to first make sure topological sanity prevails, then order by
commit date where there are ties.  git rev-list --date-order does this. 
There is the possibility of commits with the same commit date, so if you
need to be really particular you can tie break on those, too.
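
One way to make the order fully deterministic, sketched under the
assumption that topological generation numbers are available (a parent's
generation is always smaller than its child's, so sorting on it first keeps
parents ahead of children); this is an illustration, not the ordering the
gittorrent RFC actually specifies:

    def reel_key(commit):
        """commit has .generation (max distance from a root commit), .commit_date
        (epoch seconds) and .sha (hex).  Generation keeps the order topologically
        sane, commit date covers the common case, and the SHA-1 breaks any
        remaining ties so the order is total and identical on every peer."""
        return (commit.generation, commit.commit_date, commit.sha)

    # commits.sort(key=reel_key)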

Did you look at any of the previous research I linked to before?

Sam

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Resumable clone/Gittorrent (again)
  2011-01-13 23:40                 ` Sam Vilain
@ 2011-01-14 14:26                   ` Luke Kenneth Casson Leighton
  2011-01-16  2:11                     ` Sam Vilain
  0 siblings, 1 reply; 25+ messages in thread
From: Luke Kenneth Casson Leighton @ 2011-01-14 14:26 UTC (permalink / raw)
  To: Sam Vilain; +Cc: Nguyen Thai Ngoc Duy, Nicolas Pitre, Git Mailing List

On Thu, Jan 13, 2011 at 11:40 PM, Sam Vilain <sam@vilain.net> wrote:
> On 14/01/11 00:39, Luke Kenneth Casson Leighton wrote:

>> and change that graph?  are you _certain_ that you can write an
>> algorithm which is capable of generating exactly the same mapping,
>> even as more commits are added to the repository being mirrored, or,
>> does that situation not matter?
>
> For a given set of start and end points, and a given sort algorithm,
> walking the commit tree can yield deterministic results.

 excellent.  out of curiosity, is it as efficient as git pack-objects
for the same start and end points?

> Did you look at any of the previous research I linked to before?

 i've been following this since you first originally started it, sam
:)  it would have been nice if it was a completed implementation
that i could test and see "for real" what you're referring to (above)
- the fact that it's in perl and has "TODO" at some of the critical
points, after trying to work with it for several days i stopped and
went "i'm not getting anywhere with this" and focussed on bittorrent
"as a black box" instead.

 if i recall, the original gittorrent work that you did (mirror-sync),
the primary aim was to rely solely and exclusively on a one-to-one
direct link between one machine and another.  in other words, whilst
syncing, if that peer went "offline", you're screwed - you have to
start again.  is that a fair assessment?  please do correct any
assumptions that i've made.

 because on the basis _of_ that assumption, i decided not to proceed
with mirror-sync, instead to pursue a "cache git pack-objects"
approach and to use bittorrent "black-box-style".  which i
demonstrated (minus the cacheing) works perfectly well, several months
back.

 in this way (i know i didn't reply earlier - apologies), there is
absolutely no need to "take bittorrent apart", no need to modify it,
tinker with it, adjust it, redesign it, learn from it "for
inspiration" - you just get on with it, using the code, protocol and
everything about it as a black-box. "get this file to everyone and
anyone wot needs it".

 as well, after nicolas and others went to all the trouble to explain
what git pack-objects is, how it works, and how damn efficient it is,
i'm pretty much convinced that an approach to uniquely identify, then
pick and cache the *best* git pack-object made [by all the peers
requested to provide a particular commit range], is the best, most
efficient - and importantly simplest and easiest to understand -
approach so far that i've heard.  perhaps that's because i came up
with it, i dunno :)  but the important thing is that i can _show_ that
it works (http://gitorious.org/python-libbittorrent/pybtlib - go back
a few revisions)

 so - perhaps it would help if mirrorsync was revived, so that it can
be used to demonstrate what you mean (there aren't any instructions on
how to set up mirrorsync, for example).  that would then allow people
to do a comparative analysis of the approaches being taken.

 i'd be *very* interested - and i'm sure that there are others
likewise equally as interested - to see if the mirrorsync "commit tree
walking" algorithm can come up with a more efficient method of
transferring git repositories than git pack-objects can.

l.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Resumable clone/Gittorrent (again)
  2011-01-14 14:26                   ` Luke Kenneth Casson Leighton
@ 2011-01-16  2:11                     ` Sam Vilain
  0 siblings, 0 replies; 25+ messages in thread
From: Sam Vilain @ 2011-01-16  2:11 UTC (permalink / raw)
  To: Luke Kenneth Casson Leighton
  Cc: Nguyen Thai Ngoc Duy, Nicolas Pitre, Git Mailing List

On 15/01/11 03:26, Luke Kenneth Casson Leighton wrote:
>>> and change that graph?  are you _certain_ that you can write an
>>> algorithm which is capable of generating exactly the same mapping,
>>> even as more commits are added to the repository being mirrored, or,
>>> does that situation not matter?
>> For a given set of start and end points, and a given sort algorithm,
>> walking the commit tree can yield deterministic results.
>  excellent.  out of curiosity, is it as efficient as git pack-objects
> for the same start and end points?

That isn't a sensible question; walking the revision tree is something
that many commands, including git pack-objects, do internally.

>> Did you look at any of the previous research I linked to before?
>  i've been following this since you first originally started it, sam
> :)  it would have been nice if it was a completed implementation
> that i could test and see "for real" what you're referring to (above)
> - the fact that it's in perl and has "TODO" at some of the critical
> points, after trying to work with it for several days i stopped and
> went "i'm not getting anywhere with this" and focussed on bittorrent
> "as a black box" instead.
>
>  if i recall, the original gittorrent work that you did (mirror-sync),
> the primary aim was to rely solely and exclusively on a one-to-one
> direct link between one machine and another.  in other words, whilst
> syncing, if that peer went "offline", you're screwed - you have to
> start again.  is that a fair assessment?  please do correct any
> assumptions that i've made.

Ok.  Well, first off - I didn't start gittorrent; that was Jonas
Fonseca, it was his Masters thesis.  Criticism about not having a
completed implementation to work with is therefore shared between him
and people who have come along since, such as myself.

I don't know why you got the idea that the protocol is one to one.  It's
one to one just like BitTorrent is - every communication is between two
nodes who share information about what they have and what they need,
before transferring data.  It is supposed to be restartable and it is
not supposed to matter which node data is exchanged with.  In that way,
you could in principle download from multiple nodes at once, or you
could have restartable transfers.  If you lose connectivity then the
most that should have to be re-transferred are incomplete blocks.

>  because on the basis _of_ that assumption, i decided not to proceed
> with mirror-sync, instead to pursue a "cache git pack-objects"
> approach and to use bittorrent "black-box-style".  which i
> demonstrated (minus the cacheing) works perfectly well, several months
> back.

Right, but as others have noted, there are significant drawbacks with
this approach.  For a start, to participate in such a network, you need
to get the particular exact pack that is currently being torrented; just
having a clone is not enough.  This is because the result of git
pack-objects is not repeatable.

That being said, for many projects that would be an acceptable compromise
for the advantages of a restartable clone.  That is why I suggest that a
torrent transfer, treated as a mirror which is infrequently updated, may
be a better approach than trying to overly automate everything.

>  as well, after nicolas and others went to all the trouble to explain
> what git pack-objects is, how it works, and how damn efficient it is,
> i'm pretty much convinced that an approach to uniquely identify, then
> pick and cache the *best* git pack-object made [by all the peers
> requested to provide a particular commit range], is the best, most
> efficient - and importantly simplest and easiest to understand -
> approach so far that i've heard.  perhaps that's because i came up
> with it, i dunno :)  but the important thing is that i can _show_ that
> it works (http://gitorious.org/python-libbittorrent/pybtlib - go back
> a few revisions)

That's great.  If you want to continue this simple approach and ignore
the gittorrent/mirror-sync path altogether, that's fine too.

Trying to determine the "best" pack-object may be counter-productive. 
Here's a simple approach which allows the repository owner to easily
arrange for efficient torrenting of essential object files:

Add to the .torrent manifest just these files:

  .git/objects/pack/pack-*.pack - just the files with .keep files
  .git/packed-refs - just the references which are completely available
via the .keep packs

In that way, a repository owner can periodically re-pack their repo,
mark the new pack files as .keep, then re-generate the .torrent file. 
All nodes will just have to transfer the new packs, not everything.
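
A sketch of that maintenance step; make_torrent() is a hypothetical helper
standing in for whatever tool builds the manifest:

    import glob, pathlib, subprocess

    # repack everything into fresh packs, pin them with .keep so later repacks
    # leave them alone, then rebuild the torrent manifest around them
    subprocess.run(["git", "repack", "-a", "-d"], check=True)
    packs = sorted(glob.glob(".git/objects/pack/pack-*.pack"))
    for pack in packs:
        pathlib.Path(pack[: -len(".pack")] + ".keep").write_text("seeded torrent\n")
    make_torrent(packs + [".git/packed-refs"], out="repo.torrent")   # hypothetical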

>  so - perhaps it would help if mirrorsync was revived, so that it can
> be used to demonstrate what you mean (there aren't any instructions on
> how to set up mirrorsync, for example).  that would then allow people
> to do a comparative analysis of the approaches being taken.

Ok, that sounds like a good plan - I'll see if I can devote some time to
an explanatory series with working examples with reference to the
existing code etc over the coming months.

Cheers,
Sam

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2011-01-16  2:16 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-01-05 16:23 Resumable clone/Gittorrent (again) Nguyen Thai Ngoc Duy
2011-01-05 16:56 ` Luke Kenneth Casson Leighton
2011-01-05 17:13   ` Thomas Rast
2011-01-05 18:07     ` Luke Kenneth Casson Leighton
2011-01-06  1:47       ` Nguyen Thai Ngoc Duy
2011-01-06 17:50         ` Luke Kenneth Casson Leighton
2011-01-05 23:28 ` Maaartin
2011-01-06  1:32   ` Nguyen Thai Ngoc Duy
2011-01-06  3:34     ` Maaartin-1
2011-01-06  6:36       ` Nguyen Thai Ngoc Duy
2011-01-08  1:04         ` Maaartin-1
2011-01-08  2:40           ` Nguyen Thai Ngoc Duy
2011-01-07  3:21 ` Nicolas Pitre
2011-01-07  6:34   ` Nguyen Thai Ngoc Duy
2011-01-07 15:59   ` Luke Kenneth Casson Leighton
2011-01-08  2:17     ` Nguyen Thai Ngoc Duy
2011-01-08 17:21       ` Luke Kenneth Casson Leighton
2011-01-09  3:34         ` Nguyen Thai Ngoc Duy
2011-01-09 13:55           ` Luke Kenneth Casson Leighton
2011-01-09 17:48             ` Nguyen Thai Ngoc Duy
2011-01-13 11:39               ` Luke Kenneth Casson Leighton
2011-01-13 23:40                 ` Sam Vilain
2011-01-14 14:26                   ` Luke Kenneth Casson Leighton
2011-01-16  2:11                     ` Sam Vilain
2011-01-10 21:38         ` Sam Vilain
