All of lore.kernel.org
 help / color / mirror / Atom feed
* Re-Transmission of blobs?
@ 2013-09-10 13:08 Josef Wolf
  2013-09-10 17:51 ` Junio C Hamano
  0 siblings, 1 reply; 19+ messages in thread
From: Josef Wolf @ 2013-09-10 13:08 UTC (permalink / raw)
  To: git

Hello,

as we all know, files are identified by their SHA. Thus I had the impression
that when transfering files, git would know by the SHA whether a given file is
already available in the destination repository and the transfer would be of
no use. But this don't seem to be tha case. Lets see this example:

  $ cat t.sh
  #! /bin/sh -ex
  
  LANG=
  
  rm -rf 1 2
  git init 1
  git clone 1 2
  
  cd 1
  git commit --allow-empty -m "initial structure"
  git co -b somebranch
  dd if=/dev/urandom count=10 bs=1024k >t
  git add t
  git commit -m "blah"
  
  cd ../2
  git pull
  git cherry-pick origin/somebranch
  git push -v
  
  $ ./t.sh
  + LANG=
  + rm -rf 1 2
  + git init 1
  Initialized empty Git repository in /home/jw/test/1/.git/
  + git clone 1 2
  Cloning into '2'...
  warning: You appear to have cloned an empty repository.
  done.
  + cd 1
  + git commit --allow-empty -m 'initial structure'
  [master (root-commit) 97e52e2] initial structure
  + git co -b somebranch
  Switched to a new branch 'somebranch'
  + dd if=/dev/urandom count=10 bs=1024k
  10+0 records in
  10+0 records out
  10485760 bytes (10 MB) copied, 1.3202 s, 7.9 MB/s
  + git add t
  + git commit -m blah
  [somebranch b11cf51] blah
   1 file changed, 0 insertions(+), 0 deletions(-)
   create mode 100644 t
  + cd ../2
  + git pull
  remote: Counting objects: 5, done.
  remote: Compressing objects: 100% (3/3), done.
  remote: Total 5 (delta 0), reused 0 (delta 0)
  Unpacking objects: 100% (5/5), done.
  From /home/jw/test/1
   * [new branch]      master     -> origin/master
   * [new branch]      somebranch -> origin/somebranch
  + git cherry-pick origin/somebranch
  [master 9e8f1c6] blah
   1 file changed, 0 insertions(+), 0 deletions(-)
   create mode 100644 t
  + git push -v
  warning: push.default is unset; its implicit value is changing in
  Git 2.0 from 'matching' to 'simple'. To squelch this message
  and maintain the current behavior after the default changes, use:
  
    git config --global push.default matching
  
  To squelch this message and adopt the new behavior now, use:
  
    git config --global push.default simple
  
  See 'git help config' and search for 'push.default' for further information.
  (the 'simple' mode was introduced in Git 1.7.11. Use the similar mode
  'current' instead of 'simple' if you sometimes use older versions of Git)
  
  Pushing to /home/jw/test/1
  Counting objects: 4, done.
  Delta compression using up to 2 threads.
  Compressing objects: 100% (2/2), done.
  Writing objects: 100% (3/3), 10.00 MiB, done.
  Total 3 (delta 0), reused 0 (delta 0)
  To /home/jw/test/1
     97e52e2..9e8f1c6  master -> master
  updating local tracking ref 'refs/remotes/origin/master'
  $


As we can see in this example, the big file is tranferred back to the first
repository, although it is already available there. This is very annoying if
you have a very slow connection.

Am I missing some important point here?

-- 
Josef Wolf
jw@raven.inka.de

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Re-Transmission of blobs?
  2013-09-10 13:08 Re-Transmission of blobs? Josef Wolf
@ 2013-09-10 17:51 ` Junio C Hamano
  2013-09-11 11:27   ` Josef Wolf
  0 siblings, 1 reply; 19+ messages in thread
From: Junio C Hamano @ 2013-09-10 17:51 UTC (permalink / raw)
  To: Josef Wolf; +Cc: git

Josef Wolf <jw@raven.inka.de> writes:

> as we all know, files are identified by their SHA. Thus I had the impression
> that when transfering files, git would know by the SHA whether a given file is
> already available in the destination repository and the transfer would be of
> no use.

That is unfortunately not how things work.  It is not like the
receiving end sends the names of all objects it has, and the sending
end excludes these objects from what it is going to send.

Consider this simple history with only a handful of commits (as
usual, time flows from left to right):

              E
             /   
    A---B---C---D

where D is at the tip of the sending side, E is at the tip of the
receiving side.  The exchange goes roughly like this:

    (receiving side): what do you have?

    (sending side): my tip is at D.

    (receiving side): D?  I've never heard of it --- please give it
                      to me.  I have E.

    (sending side): E?  I don't know about it; must be something you
                    created since you forked from me.  Tell me about
                    its ancestors.

    (receiving side): OK, I have C.

    (sending side): Oh, C I know about. You do not have to tell me
                    anything more.  A packfile to bring you up to
                    date will follow.

At this point, the sender knows that the receiver needs the commit
D, and trees and blobs in D.  It does also know it has the commit C
and trees and blobs in C.  It does the best thing it can do using
these (and only these) information, namely, to send the commit D,
and send trees and blobs in D that are not in the commit C.

You may happen to have something in E that match what is in D but
not in C.  Because the sender does not know anything about E at all
in the first place, that information cannot be used to reduce the
transfer.

The sender theoretically _could_ also exploit the fact that any
receiver that has C must have B and A and all trees and blobs
associated with these ancestor commits [*1*], but that information
is not currently discovered nor used during the object transfer.

There may happen to be a tree or a blob in A that matches a tree or
a blob in D.  But because the common ancestor discovery exchange
above stops at C, the sender does not bother enumerating all the
objects that are in the ancestor commits of C when figuring out what
objects to send to ensure that the receiving end has all the objects
necessary to complete D.  If you modified a blob at B (or C) and
then resurrected the old version of the blob at D, it is likely that
the blob is going to be sent again when the receiving end asks for
D.

There are some work being done to optimize this further using
various techniques, but they are not ready yet.


[Footnote]

*1* only down to the shallow boundary, if the receiving end is a
shallow clone.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Re-Transmission of blobs?
  2013-09-10 17:51 ` Junio C Hamano
@ 2013-09-11 11:27   ` Josef Wolf
  2013-09-11 17:14     ` Junio C Hamano
  0 siblings, 1 reply; 19+ messages in thread
From: Josef Wolf @ 2013-09-11 11:27 UTC (permalink / raw)
  To: git

On Di, Sep 10, 2013 at 10:51:02 -0700, Junio C Hamano wrote:
> Consider this simple history with only a handful of commits (as
> usual, time flows from left to right):
> 
>               E
>              /   
>     A---B---C---D
> 
> where D is at the tip of the sending side, E is at the tip of the
> receiving side.  The exchange goes roughly like this:
> 
>     (receiving side): what do you have?
> 
>     (sending side): my tip is at D.
> 
>     (receiving side): D?  I've never heard of it --- please give it
>                       to me.  I have E.

At this point, why would the receiving side not tell all the heads it knows
about? That would enable the sending side to see whether it knows any of those
commits. It then would be able to remove from the sending list all the objects
that can be reached from the commits it knows about.

>     (sending side): E?  I don't know about it; must be something you
>                     created since you forked from me.  Tell me about
>                     its ancestors.

This is not exactly true. In the example I had given, the sending side has all
three commits: C, D and E. So the sending side has no reason to say that it
doesn't know anything about E. Therefore the sending side has all information
it needs to deduce exactly which objects need to be sent to the receiving side.

What needs to be sent are all the objects in C..D but without all the objects
in C..E. I guess this operation would be called set-difference in english.

And if the receiving side would have told that it has heads X Y Z in addition,
and the sending side happens to have Y, then the sending side could in
addition remove any objects that can be reached from Y from the sending list.

-- 
Josef Wolf
jw@raven.inka.de

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Re-Transmission of blobs?
  2013-09-11 11:27   ` Josef Wolf
@ 2013-09-11 17:14     ` Junio C Hamano
  2013-09-12  7:42       ` Josef Wolf
  0 siblings, 1 reply; 19+ messages in thread
From: Junio C Hamano @ 2013-09-11 17:14 UTC (permalink / raw)
  To: Josef Wolf; +Cc: git

Josef Wolf <jw@raven.inka.de> writes:

> On Di, Sep 10, 2013 at 10:51:02 -0700, Junio C Hamano wrote:
>> Consider this simple history with only a handful of commits (as
>> usual, time flows from left to right):
>> 
>>               E
>>              /   
>>     A---B---C---D
>> 
>> where D is at the tip of the sending side, E is at the tip of the
>> receiving side.  The exchange goes roughly like this:
>> 
>>     (receiving side): what do you have?
>> 
>>     (sending side): my tip is at D.
>> 
>>     (receiving side): D?  I've never heard of it --- please give it
>>                       to me.  I have E.
>
> At this point, why would the receiving side not tell all the heads it knows
> about?

It did.  The receiving end had only one branch whose tip is E.  It
may have a tracking branch that knows where the tip of the sending
end used to be when it forked (which is C), so the above may say "I
have E and C".  It actually would say "I have B and A and ..." for a
bounded number of commits, but that does not fundamentally change
the picture---the important point is it is bounded and there is a
horizon.

>> There are some work being done to optimize this further using
>> various techniques, but they are not ready yet.

And this still stands.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Re-Transmission of blobs?
  2013-09-11 17:14     ` Junio C Hamano
@ 2013-09-12  7:42       ` Josef Wolf
  2013-09-12  9:23         ` Jeff King
  0 siblings, 1 reply; 19+ messages in thread
From: Josef Wolf @ 2013-09-12  7:42 UTC (permalink / raw)
  To: git

On Mi, Sep 11, 2013 at 10:14:54 -0700, Junio C Hamano wrote:
> Josef Wolf <jw@raven.inka.de> writes:
> > On Di, Sep 10, 2013 at 10:51:02 -0700, Junio C Hamano wrote:
> >> Consider this simple history with only a handful of commits (as
> >> usual, time flows from left to right):
> >> 
> >>               E
> >>              /   
> >>     A---B---C---D
> >> 
> >> where D is at the tip of the sending side, E is at the tip of the
> >> receiving side.  The exchange goes roughly like this:
> >> 
> >>     (receiving side): what do you have?
> >> 
> >>     (sending side): my tip is at D.
> >> 
> >>     (receiving side): D?  I've never heard of it --- please give it
> >>                       to me.  I have E.
> >
> > At this point, why would the receiving side not tell all the heads it knows
> > about?
> 
> It did.  The receiving end had only one branch whose tip is E.  It
> may have a tracking branch that knows where the tip of the sending
> end used to be when it forked (which is C), so the above may say "I
> have E and C".  It actually would say "I have B and A and ..." for a
> bounded number of commits, but that does not fundamentally change
> the picture---the important point is it is bounded and there is a
> horizon.

Therefore, the sending sinde has all information it needs to do any
optimizations you can think of...

> >> There are some work being done to optimize this further using
> >> various techniques, but they are not ready yet.
> 
> And this still stands.

Do you have a pointer or something? I'd like to check out whether I can
contribute to this work.

-- 
Josef Wolf
jw@raven.inka.de

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Re-Transmission of blobs?
  2013-09-12  7:42       ` Josef Wolf
@ 2013-09-12  9:23         ` Jeff King
  2013-09-12 10:35           ` Josef Wolf
  2013-09-12 12:45           ` Pyeron, Jason J CTR (US)
  0 siblings, 2 replies; 19+ messages in thread
From: Jeff King @ 2013-09-12  9:23 UTC (permalink / raw)
  To: Josef Wolf, git

On Thu, Sep 12, 2013 at 09:42:41AM +0200, Josef Wolf wrote:

> > >> There are some work being done to optimize this further using
> > >> various techniques, but they are not ready yet.
> > 
> > And this still stands.
> 
> Do you have a pointer or something? I'd like to check out whether I can
> contribute to this work.

I think Junio is referring to the reachability bitmap work. We may know
that the other side has commit "E" (and therefore every object reachable
from it), but we do not walk the graph to find the complete set of
reachable objects. Doing so requires a lot of CPU and I/O, and in most
cases does not help much.

However, if we had an index of reachable objects (e.g., a bitmap) for
each commit, then we could very cheaply compute the set difference
between what the other side wants and what they have.

JGit has support for pack bitmaps already. There was a patch series a
few months ago to implement a similar functionality for C git, but the
on-disk format was not compatible with JGit's. That series has been
reworked off-list to be compatible with the JGit implementation.

Those patches need a little cleanup before they are ready for the list,
but hopefully that should happen soon-ish.

-Peff

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Re-Transmission of blobs?
  2013-09-12  9:23         ` Jeff King
@ 2013-09-12 10:35           ` Josef Wolf
  2013-09-12 19:44             ` Jeff King
  2013-09-12 12:45           ` Pyeron, Jason J CTR (US)
  1 sibling, 1 reply; 19+ messages in thread
From: Josef Wolf @ 2013-09-12 10:35 UTC (permalink / raw)
  To: git

On Do, Sep 12, 2013 at 05:23:40 -0400, Jeff King wrote:
> On Thu, Sep 12, 2013 at 09:42:41AM +0200, Josef Wolf wrote:
>
> I think Junio is referring to the reachability bitmap work. We may know
> that the other side has commit "E" (and therefore every object reachable
> from it), but we do not walk the graph to find the complete set of
> reachable objects. Doing so requires a lot of CPU and I/O, and in most
> cases does not help much.

I'm not sure I understand correctly. I see that bitmaps can be used to
implement set operations. But how comes that walking the graph requires a lot
of CPU? Isn't it O(n)?

> However, if we had an index of reachable objects (e.g., a bitmap) for
> each commit, then we could very cheaply compute the set difference
> between what the other side wants and what they have.

Those bitmaps would be stored in the git metadata? Is it worth it? Storing a
bitmap for every commit just to be used once-in-a-while seems to be a pretty
big overhead to me. Not to mention the interoperability problems you mentioned
below.

> JGit has support for pack bitmaps already. There was a patch series a
> few months ago to implement a similar functionality for C git, but the
> on-disk format was not compatible with JGit's. That series has been
> reworked off-list to be compatible with the JGit implementation.
> 
> Those patches need a little cleanup before they are ready for the list,
> but hopefully that should happen soon-ish.

Sounds like you're already almost done and don't really need help
anymore. Just out of curiosity, I'd be interested in a pointer anyway ;-)

-- 
Josef Wolf
jw@raven.inka.de

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: Re-Transmission of blobs?
  2013-09-12  9:23         ` Jeff King
  2013-09-12 10:35           ` Josef Wolf
@ 2013-09-12 12:45           ` Pyeron, Jason J CTR (US)
  2013-09-12 19:56             ` Jeff King
  1 sibling, 1 reply; 19+ messages in thread
From: Pyeron, Jason J CTR (US) @ 2013-09-12 12:45 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Josef Wolf

[-- Attachment #1: Type: text/plain, Size: 1497 bytes --]

> -----Original Message-----
> From: Jeff King
> Sent: Thursday, September 12, 2013 5:24 AM
> 
> On Thu, Sep 12, 2013 at 09:42:41AM +0200, Josef Wolf wrote:
> 
> > > >> There are some work being done to optimize this further using
> > > >> various techniques, but they are not ready yet.
> > >
> > > And this still stands.
> >
> > Do you have a pointer or something? I'd like to check out whether I
> can
> > contribute to this work.
> 
> I think Junio is referring to the reachability bitmap work. We may know
> that the other side has commit "E" (and therefore every object
> reachable
> from it), but we do not walk the graph to find the complete set of
> reachable objects. Doing so requires a lot of CPU and I/O, and in most
> cases does not help much.

If the rules of engagement are change a bit, the server side can be release from most of its work (CPU/IO).

Client does the following, looping as needed:

Heads=server->heads();
KnownCommits=Local->AllCommits();
Missingblobs=[];
Foreach(commit:heads) if (!knownCommits->contains(commit)) MissingBlobs[]=commit;
Foreach(commit:knownCommit) if (!commit->isValid()) MissingBlobs[]=commit->blobs();
If (missingBlobs->size()>0) server->FetchBlobs(missingBlobs);


This should work efficiently for the server if
a) the client is empty
b) the client is corrupt
c) the client is up to date

Extending the server->fetchBlobs() to be more fancy, like taking patterns, such as between aaaaaa and dddddd exclusive is an exercise for someone else.



[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 13922 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Re-Transmission of blobs?
  2013-09-12 10:35           ` Josef Wolf
@ 2013-09-12 19:44             ` Jeff King
  2013-09-13 10:09               ` Josef Wolf
  0 siblings, 1 reply; 19+ messages in thread
From: Jeff King @ 2013-09-12 19:44 UTC (permalink / raw)
  To: Josef Wolf, git

On Thu, Sep 12, 2013 at 12:35:32PM +0200, Josef Wolf wrote:

> I'm not sure I understand correctly. I see that bitmaps can be used to
> implement set operations. But how comes that walking the graph requires a lot
> of CPU? Isn't it O(n)?

Yes and no. Your "n" there is the entirety of history. Whereas a simple
"git push" generally only has to look at the recent history. So even
though you are looking at each commit and tree only once, it's still a
large number of them (and each one needs to be pulled off of the disk,
decompressed, and reconstructed from deltas).

Secondly, the graph traversal ends up seeing the same sha1s over and
over again in tree entries (because most entries in the tree don't
change from commit to commit). We spend a non-trivial amount of time
looking those up in a hash table.

Just try "git rev-list --objects --all" in your favorite repository to
get a sense. It takes something like 30 seconds in the kernel repo. You
would probably not want to add 30 seconds of CPU time to a trivial push.

> Those bitmaps would be stored in the git metadata? Is it worth it? Storing a
> bitmap for every commit just to be used once-in-a-while seems to be a pretty
> big overhead to me. Not to mention the interoperability problems you mentioned
> below.

There are tricks to make them smaller (run-length compression,
bitmapping a subset of commits and traversing to the nearest one,
storing bitmaps as deltas against nearby bitmaps). And how often it is
used depends on your git workload. For a repository serving git clones
and fetches, it speeds up every operation.

Try starting a clone of:

  git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

versus

  git://github.com/torvalds/linux.git

and see which one starts sending you data more quickly.

> Sounds like you're already almost done and don't really need help
> anymore. Just out of curiosity, I'd be interested in a pointer anyway
> ;-)

Shawn gave a talk on JGit here:

  http://www.eclipsecon.org/2013/sites/eclipsecon.org.2013/files/Scaling%20Up%20JGit%20-%20EclipseCon%202013.pdf

and the scrapped patches for git are here:

  http://article.gmane.org/gmane.comp.version-control.git/228918

-Peff

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Re-Transmission of blobs?
  2013-09-12 12:45           ` Pyeron, Jason J CTR (US)
@ 2013-09-12 19:56             ` Jeff King
  2013-09-12 20:06               ` Pyeron, Jason J CTR (US)
  0 siblings, 1 reply; 19+ messages in thread
From: Jeff King @ 2013-09-12 19:56 UTC (permalink / raw)
  To: Pyeron, Jason J CTR (US); +Cc: git, Josef Wolf

On Thu, Sep 12, 2013 at 12:45:44PM +0000, Pyeron, Jason J CTR (US) wrote:

> If the rules of engagement are change a bit, the server side can be release from most of its work (CPU/IO).
> 
> Client does the following, looping as needed:
> 
> Heads=server->heads();
> KnownCommits=Local->AllCommits();
> Missingblobs=[];
> Foreach(commit:heads) if (!knownCommits->contains(commit)) MissingBlobs[]=commit;
> Foreach(commit:knownCommit) if (!commit->isValid()) MissingBlobs[]=commit->blobs();
> If (missingBlobs->size()>0) server->FetchBlobs(missingBlobs);

That doesn't quite work. The client does not know the set of missing
objects just from the commits. It knows the sha1 of the root trees it is
missing. And then if it fetches those, it knows the sha1 of any
top-level entries it is missing. And when it gets those, it knows the
sha1 of any 2nd-level entries it is missing, and so forth.

You can progressively ask for each level, but:

  1. You are spending a round-trip for each request. Doing it per-object
     is awful (the dumb http walker will do this if the repo is not
     packed, and it's S-L-O-W). Doing it per-level would be better, but
     not great.

  2. You are losing opportunities for deltas (or you are making the
     state the server needs to maintain very complicated, as it must
     remember from request to request which objects you have gotten that
     can be used as delta bases).

  3. There is a lot of overhead in this protocol. The client has to
     mention each object individually by sha1. It may not seem like a
     lot, but it can easily add 10% to a clone (just look at the size of
     the pack .idx files versus the packfiles themselves).

-Peff

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: Re-Transmission of blobs?
  2013-09-12 19:56             ` Jeff King
@ 2013-09-12 20:06               ` Pyeron, Jason J CTR (US)
  2013-09-13 10:23                 ` Josef Wolf
  2013-09-13 12:16                 ` Duy Nguyen
  0 siblings, 2 replies; 19+ messages in thread
From: Pyeron, Jason J CTR (US) @ 2013-09-12 20:06 UTC (permalink / raw)
  To: git; +Cc: Josef Wolf, Jeff King

[-- Attachment #1: Type: text/plain, Size: 2672 bytes --]

> -----Original Message-----
> From: Jeff King
> Sent: Thursday, September 12, 2013 3:57 PM
> 
> On Thu, Sep 12, 2013 at 12:45:44PM +0000, Pyeron, Jason J CTR (US)
> wrote:
> 
> > If the rules of engagement are change a bit, the server side can be
> release from most of its work (CPU/IO).
> >
> > Client does the following, looping as needed:
> >
> > Heads=server->heads();
> > KnownCommits=Local->AllCommits();
> > Missingblobs=[];
> > Foreach(commit:heads) if (!knownCommits->contains(commit))
> MissingBlobs[]=commit;
> > Foreach(commit:knownCommit) if (!commit->isValid())
> MissingBlobs[]=commit->blobs();
> > If (missingBlobs->size()>0) server->FetchBlobs(missingBlobs);
> 
> That doesn't quite work. The client does not know the set of missing

"looping as needed"

> objects just from the commits. It knows the sha1 of the root trees it
> is
> missing. And then if it fetches those, it knows the sha1 of any
> top-level entries it is missing. And when it gets those, it knows the
> sha1 of any 2nd-level entries it is missing, and so forth.
> 
> You can progressively ask for each level, but:
> 
>   1. You are spending a round-trip for each request. Doing it per-
> object
>      is awful (the dumb http walker will do this if the repo is not
>      packed, and it's S-L-O-W). Doing it per-level would be better, but
>      not great.

Yes, but it is those awfully slow connections (slower that the looping issue) which happen to always drop while cloning from our office. And the round trip should be mitigated by http-keep-alives.

> 
>   2. You are losing opportunities for deltas (or you are making the
>      state the server needs to maintain very complicated, as it must
>      remember from request to request which objects you have gotten
> that
>      can be used as delta bases).

But, again if the connection drops, we have already lost the delta advantage. I would think the scenario would go like this:

git clone url://blah/blah
[fail]
cd blah
git clone --resume #uses normal methods....
[fail]
while ! git clone --resume --HitItWithAStick

replace clone with fetch for that use case too

> 
>   3. There is a lot of overhead in this protocol. The client has to
>      mention each object individually by sha1. It may not seem like a
>      lot, but it can easily add 10% to a clone (just look at the size
> of
>      the pack .idx files versus the packfiles themselves).

But if it finishes in a week, it is a lot better than it never, ever finishes.

I draw attention to the time I had to download DB2 UDB in Thailand via 28k modem. It was resumable with wget, if it were not, it would have required the use of a plane to sneaker net it back.



[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 13922 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Re-Transmission of blobs?
  2013-09-12 19:44             ` Jeff King
@ 2013-09-13 10:09               ` Josef Wolf
  2013-09-16 21:55                 ` Jeff King
  0 siblings, 1 reply; 19+ messages in thread
From: Josef Wolf @ 2013-09-13 10:09 UTC (permalink / raw)
  To: git

On Thu, Sep 12, 2013 at 03:44:53PM -0400, Jeff King wrote:
> On Thu, Sep 12, 2013 at 12:35:32PM +0200, Josef Wolf wrote:
> 
> > I'm not sure I understand correctly. I see that bitmaps can be used to
> > implement set operations. But how comes that walking the graph requires a lot
> > of CPU? Isn't it O(n)?
> 
> Yes and no. Your "n" there is the entirety of history.

Is this really true?

> (and each one needs to be pulled off of the disk,
> decompressed, and reconstructed from deltas).

While you need to unpack commits/trees to traverse further down, I can't see
any reason to unpack/reconstruct blobs just to see whether you need to send
it. The SHA is all you need to know, isn't it?

> Secondly, the graph traversal ends up seeing the same sha1s over and
> over again in tree entries (because most entries in the tree don't
> change from commit to commit).

Whenever you see an object (whether commit or tree) that you already have
seen, you can stop traversing further down this part of the graph/tree, as
everything you will see on this part has already be seen before.

Why would you see the same commits/trees over and over again? You'd stop
traversing on the boundary of the already-seen-territory, leaving the vast
majority of the "duplicated" structure under the carpet. Somehow I fail to see
the problem here.

-- 
Josef Wolf
jw@raven.inka.de

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Re-Transmission of blobs?
  2013-09-12 20:06               ` Pyeron, Jason J CTR (US)
@ 2013-09-13 10:23                 ` Josef Wolf
  2013-09-13 11:51                   ` Jason Pyeron
  2013-09-13 12:16                 ` Duy Nguyen
  1 sibling, 1 reply; 19+ messages in thread
From: Josef Wolf @ 2013-09-13 10:23 UTC (permalink / raw)
  To: git

On Thu, Sep 12, 2013 at 08:06:35PM +0000, Pyeron, Jason J CTR (US) wrote:

> Yes, but it is those awfully slow connections (slower that the looping
> issue) which happen to always drop while cloning from our office. And the
> round trip should be mitigated by http-keep-alives.
[ ... ]
> But, again if the connection drops, we have already lost the delta
> advantage. I would think the scenario would go like this:
> 
> git clone url://blah/blah
> [fail]
> cd blah
> git clone --resume #uses normal methods....
> [fail]
> while ! git clone --resume --HitItWithAStick
> 
> replace clone with fetch for that use case too

Last time I checked, cloning could not be resumed:

http://git.661346.n2.nabble.com/git-clone-and-unreliable-links-td7570652.html

If you're on a slow/unreliable link, you've lost.

:-( :-( :-(

-- 
Josef Wolf
jw@raven.inka.de

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: Re-Transmission of blobs?
  2013-09-13 10:23                 ` Josef Wolf
@ 2013-09-13 11:51                   ` Jason Pyeron
  0 siblings, 0 replies; 19+ messages in thread
From: Jason Pyeron @ 2013-09-13 11:51 UTC (permalink / raw)
  To: git; +Cc: 'Josef Wolf'

> -----Original Message-----
> From: Josef Wolf
> Sent: Friday, September 13, 2013 6:23
> 
> On Thu, Sep 12, 2013 at 08:06:35PM +0000, Pyeron, Jason J CTR 
> (US) wrote:
> 
> > Yes, but it is those awfully slow connections (slower that 
> the looping
> > issue) which happen to always drop while cloning from our 
> office. And 
> > the round trip should be mitigated by http-keep-alives.
> [ ... ]
> > But, again if the connection drops, we have already lost the delta 
> > advantage. I would think the scenario would go like this:
> > 
> > git clone url://blah/blah
> > [fail]
> > cd blah
> > git clone --resume #uses normal methods....

I am using the mythical --resume, where it would fetch packs and indexes.

> > [fail]
> > while ! git clone --resume --HitItWithAStick
> > 
> > replace clone with fetch for that use case too
> 
> Last time I checked, cloning could not be resumed:
> 
> http://git.661346.n2.nabble.com/git-clone-and-unreliable-links-td7570652.html
> 
> If you're on a slow/unreliable link, you've lost.


That is kind of the point. It should be possible.


--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
-                                                               -
- Jason Pyeron                      PD Inc. http://www.pdinc.us -
- Principal Consultant              10 West 24th Street #100    -
- +1 (443) 269-1555 x333            Baltimore, Maryland 21218   -
-                                                               -
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
This message is copyright PD Inc, subject to license 20080407P00.

 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Re-Transmission of blobs?
  2013-09-12 20:06               ` Pyeron, Jason J CTR (US)
  2013-09-13 10:23                 ` Josef Wolf
@ 2013-09-13 12:16                 ` Duy Nguyen
  1 sibling, 0 replies; 19+ messages in thread
From: Duy Nguyen @ 2013-09-13 12:16 UTC (permalink / raw)
  To: Pyeron, Jason J CTR (US); +Cc: git, Josef Wolf, Jeff King

On Fri, Sep 13, 2013 at 3:06 AM, Pyeron, Jason J CTR (US)
<jason.j.pyeron.ctr@mail.mil> wrote:
> But, again if the connection drops, we have already lost the delta advantage. I would think the scenario would go like this:
>
> git clone url://blah/blah
> [fail]
> cd blah
> git clone --resume #uses normal methods....
> [fail]
> while ! git clone --resume --HitItWithAStick
>
> replace clone with fetch for that use case too

Sorry if I missed something in this thread. But I think we could
stablize the transferred pack so that --resume works. The sender
constructs exactly the same pack as in the first "git clone" then it
starts sending from the offset given by the client. For that to work,
the first "git clone" must also be "git clone --resume". I started
working on that but now my focus is pack v4, so that has to wait.
-- 
Duy

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Re-Transmission of blobs?
  2013-09-13 10:09               ` Josef Wolf
@ 2013-09-16 21:55                 ` Jeff King
  2013-09-20  9:27                   ` Josef Wolf
  0 siblings, 1 reply; 19+ messages in thread
From: Jeff King @ 2013-09-16 21:55 UTC (permalink / raw)
  To: Josef Wolf, git

On Fri, Sep 13, 2013 at 12:09:35PM +0200, Josef Wolf wrote:

> > > I'm not sure I understand correctly. I see that bitmaps can be used to
> > > implement set operations. But how comes that walking the graph requires a lot
> > > of CPU? Isn't it O(n)?
> > 
> > Yes and no. Your "n" there is the entirety of history.
> 
> Is this really true?

Yes. If you know that the receiver has commit X, and you want to know if
it has some blob Y, the only way to know for sure is to look at every
tree of every commit reachable from X, and see whether any of them
references Y. You might get lucky and see that one of the first commits
you looked at mentions Y, but in the negative case, you have to go all
the way down to the roots.

> > (and each one needs to be pulled off of the disk,
> > decompressed, and reconstructed from deltas).
> 
> While you need to unpack commits/trees to traverse further down, I can't see
> any reason to unpack/reconstruct blobs just to see whether you need to send
> it. The SHA is all you need to know, isn't it?

Correct. The full sentence that you partially quoted above was:

   So even though you are looking at each commit and tree only once,
   it's still a large number of them (and each one needs to be pulled
   off of the disk, decompressed, and reconstructed from deltas).

I.e., the "each one" is "commits and trees".  Even reading just them
takes a fair bit of time. Pulling each blob off of disk, too, takes even
longer. You can try it yourself like this:

  git rev-list --objects --all |
    cut -d' ' -f1 |
    git cat-file --batch-check >all-objects
  for i in commit tree blob; do
    grep $i all-objects | cut -d' ' -f1 >$i-objects
    echo >&2 "==> $i"
    time git cat-file --batch <$i-objects >/dev/null
  done

For git.git, commits take about 0.5 seconds on my machine, trees 1
second, and blobs 13 seconds. For the kernel, it's 5, 22, and 210
seconds, respectively.

Now those are times to actually cat the content to /dev/null. Just
looking at it internally is cheaper, but it gives you a ballpark figure
(and most of that time goes to zlib inflation, which is the same either
way).

> > Secondly, the graph traversal ends up seeing the same sha1s over and
> > over again in tree entries (because most entries in the tree don't
> > change from commit to commit).
> 
> Whenever you see an object (whether commit or tree) that you already have
> seen, you can stop traversing further down this part of the graph/tree, as
> everything you will see on this part has already be seen before.
> 
> Why would you see the same commits/trees over and over again? You'd stop
> traversing on the boundary of the already-seen-territory, leaving the vast
> majority of the "duplicated" structure under the carpet. Somehow I fail to see
> the problem here.

Yes, you do not have to recurse into sub-trees (or commits) you have
already seen. And we already do that optimization.  So you do not see
the whole recursive tree over and over, but you see "almost same"
single-level trees repeatedly.

Let me try to give an example.  Here's the root tree of git.git's v1.8.4
release:

  $ git ls-tree v1.8.4 | wc -l
  361

So we have to do 361 lookups, one per entry, to find that we haven't
yet processed each one.

Now what happens when we look at the next commit?

  $ git ls-tree v1.8.4^ | wc -l
  361
  $ git diff-tree --abbrev v1.8.4 v1.8.4^
  :040000 040000 f3aec4c... a6e780e... M  Documentation
  :100755 100755 06026ea... 572dfeb... M  GIT-VERSION-GEN

Still 361 entries, but only two are changed. Yet we still have
to go through all 361 to figure out _which_ were changed.

We can do that by going linearly through the tree and checking each sha1
against a "have we seen this?" data structure. Or we could diff
on-the-fly between adjacent trees, and only process those that we know
we didn't just see.

The current code uses the "seen this" strategy with a hash table. I've
tried the diff strategy, but I couldn't make it any faster than using
the hash table. If we had a tree storage format that made diffs cheap
(like packv4), then traversing via diffs would probably be a win.

Note also that the cost of traversing is dependent on the shape of the
tree. Putting all of your files in the root directory does not perform
as well as having a nicely balanced tree structure, because we can't
weed out as many entries by noticing the whole sub-tree is unchanged.

-Peff

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Re-Transmission of blobs?
  2013-09-16 21:55                 ` Jeff King
@ 2013-09-20  9:27                   ` Josef Wolf
  2013-09-24  7:36                     ` Jeff King
  0 siblings, 1 reply; 19+ messages in thread
From: Josef Wolf @ 2013-09-20  9:27 UTC (permalink / raw)
  To: git

On Mon, Sep 16, 2013 at 05:55:36PM -0400, Jeff King wrote:
> On Fri, Sep 13, 2013 at 12:09:35PM +0200, Josef Wolf wrote:
> 
> > > > I'm not sure I understand correctly. I see that bitmaps can be used to
> > > > implement set operations. But how comes that walking the graph requires a lot
> > > > of CPU? Isn't it O(n)?
> > > 
> > > Yes and no. Your "n" there is the entirety of history.
> > 
> > Is this really true?
> 
> Yes. If you know that the receiver has commit X, and you want to know if
> it has some blob Y, the only way to know for sure is to look at every
> tree of every commit reachable from X, and see whether any of them
> references Y.

Jeff, in my original example, I did a cherry-pick from origin/somebranch.
Even without asking, we can assume with great probability that
origin/somebranch is available at origin. And the file in question happens to
reside in the tree at the very tip of origin/somebranch, not in some of its
ancestors. In this case, there's no need to search the history at all. And
even in this pretty simple case, the algorithm seems to fail for some reason.

> Yes, you do not have to recurse into sub-trees (or commits) you have
> already seen. And we already do that optimization.

Why is the file re-transmitted, then?


With a little change in the protocol, a very simple optimization could be
implemented, avoiding the complicated bitmap strategy you were talking about:

Please consider Junio's description of the dialog:

[ Junio wrote: ]
> Consider this simple history with only a handful of commits (as
> usual, time flows from left to right):
>
>              E
>             /   
>    A---B---C---D
>
> where D is at the tip of the sending side, E is at the tip of the
> receiving side.  The exchange goes roughly like this:
>
>    (receiving side): what do you have?
>
>    (sending side): my tip is at D.
>
>    (receiving side): D?  I've never heard of it --- please give it
>                      to me.  I have E.
>
>    (sending side): E?  I don't know about it; must be something you
>                    created since you forked from me.  Tell me about
>                    its ancestors.
>
>    (receiving side): OK, I have C.
>
>    (sending side): Oh, C I know about. You do not have to tell me
>                    anything more.  A packfile to bring you up to
>                    date will follow.

In the last step, instead of sending a packfile, the sending side should send
a list of the SHA's which would be included in this packfile. The receiving
side would then be able to request all the objects it needs to get up-to-date.

I think this change would be considerably simpler than the reachability bitmap
you are talking about. And it would avoid all those time consuming traversals
through the history and the tree. And it would omit _all_ redundant
retransmissions. Even in the case when sender and receiver do not have _any_
common heads at all, _no_ files at all would be retransmitted unnecessarily.

-- 
Josef Wolf
jw@raven.inka.de

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Re-Transmission of blobs?
  2013-09-20  9:27                   ` Josef Wolf
@ 2013-09-24  7:36                     ` Jeff King
  2013-09-24 20:36                       ` Josef Wolf
  0 siblings, 1 reply; 19+ messages in thread
From: Jeff King @ 2013-09-24  7:36 UTC (permalink / raw)
  To: Josef Wolf, git

On Fri, Sep 20, 2013 at 11:27:15AM +0200, Josef Wolf wrote:

> > Yes. If you know that the receiver has commit X, and you want to know if
> > it has some blob Y, the only way to know for sure is to look at every
> > tree of every commit reachable from X, and see whether any of them
> > references Y.
> 
> Jeff, in my original example, I did a cherry-pick from origin/somebranch.

Sorry, I thought we were talking about the general case, not your
specific example.

> Even without asking, we can assume with great probability that
> origin/somebranch is available at origin.

Bear in mind that the transfer process does not know about
cherry-picking at all. It only sees the other side's tips and traverses.

> And the file in question happens to reside in the tree at the very tip
> of origin/somebranch, not in some of its ancestors. In this case,
> there's no need to search the history at all. And even in this pretty
> simple case, the algorithm seems to fail for some reason.

Correct. And in the current code, we should be looking at the tip tree
for your case.  However, the usual reason to do so is to mark those
objects as a "preferred base" in pack-objects for doing deltas. I wonder
if we are not correctly noticing the case that an object is both
requested to be sent and marked as a preferred base (in which case we
should drop it from our sending list).

If that's the problem, it should be easy to fix cheaply. It would not
work in the general case, but it would for your specific example. But
since it costs nothing, there's no reason not to.

I'll see if I can investigate using the example script you posted.

> > Yes, you do not have to recurse into sub-trees (or commits) you have
> > already seen. And we already do that optimization.
> 
> Why is the file re-transmitted, then?

I meant "we do the optimization during history traversal that avoids
going into sub-trees we have already seen". We do _not_ do the full
history traversal for a partial push.

> With a little change in the protocol, a very simple optimization could be
> implemented, avoiding the complicated bitmap strategy you were talking about:
> [...]
> In the last step, instead of sending a packfile, the sending side should send
> a list of the SHA's which would be included in this packfile. The receiving
> side would then be able to request all the objects it needs to get up-to-date.

I think you have a mis-impression of the problem bitmaps are trying to
solve, mostly because it is not the problem you presented in your
thread (but your problem is one that bitmaps can help).

Consider what the sending side has to do to come up with that list of
objects to send. It has to traverse history to do it, looking at each
tree of each commit that is going to be sent. That effort is
proportional to the amount of history we are going to send. For a small
push or fetch, it is not much. For a clone, it can be quite a lot (tens
of seconds of CPU time per clone). Bitmaps drastically reduce the amount
of CPU required.

Bitmaps can _also_ solve other problems, like letting us be more
thorough in realizing which objects the other side has (without spending
effort on an expensive traversal). If that were the only thing they did,
it might not be worth it. But we basically get that for "free" by
solving the other problem.

So I do not think such a protocol extension is an argument against
pack bitmaps; you would want them with or without the protocol change.

> I think this change would be considerably simpler than the reachability bitmap
> you are talking about. And it would avoid all those time consuming traversals
> through the history and the tree. And it would omit _all_ redundant
> retransmissions. Even in the case when sender and receiver do not have _any_
> common heads at all, _no_ files at all would be retransmitted unnecessarily.

Yes, that would be nice. However, in the common cases it would make
things much worse. A clone of linux.git has ~3.5M objects. That's 70MB
of sha1s that the server tells the client "tell me which of these you
need", and then another 70MB for the client to say "yep, I need all of
them".  You could have the client instead say "here are the few I
_don't_ need", which would save the second half in the common cases.

And of course it would be smaller for a smaller fetch/push. Just looking
at "git rev-list --objects --all --since=1.week.ago" in the kernel,
there are ~77K new objects. So over the course of a week, we use an
extra 1.5MB of bandwidth. How many objects did we save ourselves from
sending, and how big were they (keep in mind they will typically be
deltas against object you already have anyway)?

The answer would depend on your cherry-picking and reverting habits. But
I would guess in a normal workflow that you would not come close to
breaking even.

There are, of course, cases where the user _knows_ there is a huge
object that the other side has, and they do not want to have to send it
again. For that case, it would be useful to have something like the
protocol you described as an optional extension to turn on. Of course,
somebody has to implement it. :)

-Peff

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Re-Transmission of blobs?
  2013-09-24  7:36                     ` Jeff King
@ 2013-09-24 20:36                       ` Josef Wolf
  0 siblings, 0 replies; 19+ messages in thread
From: Josef Wolf @ 2013-09-24 20:36 UTC (permalink / raw)
  To: git

On Tue, Sep 24, 2013 at 03:36:13AM -0400, Jeff King wrote:
> On Fri, Sep 20, 2013 at 11:27:15AM +0200, Josef Wolf wrote:

> > Even without asking, we can assume with great probability that
> > origin/somebranch is available at origin.
> Bear in mind that the transfer process does not know about
> cherry-picking at all.

It dosn't need to know.

> It only sees the other side's tips and traverses.

The sender side knows with high probability that origin/somebranch is avalable
at the receivig side (unless it was deleted). And since the file in question
is part of the tree at the tip of origin/somebranch, we can deduce that the
file is available on the other side (unless it was deleted).

> > And the file in question happens to reside in the tree at the very tip
> > of origin/somebranch, not in some of its ancestors. In this case,
> > there's no need to search the history at all. And even in this pretty
> > simple case, the algorithm seems to fail for some reason.
> 
> Correct. And in the current code, we should be looking at the tip tree
> for your case.  However, the usual reason to do so is to mark those
> objects as a "preferred base" in pack-objects for doing deltas. I wonder
> if we are not correctly noticing the case that an object is both
> requested to be sent and marked as a preferred base (in which case we
> should drop it from our sending list).

Further, it seems that the marking as preferred base had no effect, since the
delta should have been zero in this case. Or is this mechanism deactivated for
binary data (/dev/zero in this case)?

> If that's the problem, it should be easy to fix cheaply. It would not
> work in the general case, but it would for your specific example. But
> since it costs nothing, there's no reason not to.
> 
> I'll see if I can investigate using the example script you posted.

Thanks!

> I meant "we do the optimization during history traversal that avoids
> going into sub-trees we have already seen". We do _not_ do the full
> history traversal for a partial push.

OK. I see. Maybe a config option to request a full traversal would be a
reasonable compromise? That way CPU could be traded against bandwidth for
repositories that happen to have slow/unreliable/expensive connections.

> Yes, that would be nice. However, in the common cases it would make
> things much worse. A clone of linux.git has ~3.5M objects.

Of course, if there's nothing you can drop, any attempt to drop objects will
add to overhead. That's similar to compressing compressed files. This will
enlarge the original file. Would that be a reasonable argument to get rid of
all attempts to compress files?

-- 
Josef Wolf
jw@raven.inka.de

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2013-09-24 20:40 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-09-10 13:08 Re-Transmission of blobs? Josef Wolf
2013-09-10 17:51 ` Junio C Hamano
2013-09-11 11:27   ` Josef Wolf
2013-09-11 17:14     ` Junio C Hamano
2013-09-12  7:42       ` Josef Wolf
2013-09-12  9:23         ` Jeff King
2013-09-12 10:35           ` Josef Wolf
2013-09-12 19:44             ` Jeff King
2013-09-13 10:09               ` Josef Wolf
2013-09-16 21:55                 ` Jeff King
2013-09-20  9:27                   ` Josef Wolf
2013-09-24  7:36                     ` Jeff King
2013-09-24 20:36                       ` Josef Wolf
2013-09-12 12:45           ` Pyeron, Jason J CTR (US)
2013-09-12 19:56             ` Jeff King
2013-09-12 20:06               ` Pyeron, Jason J CTR (US)
2013-09-13 10:23                 ` Josef Wolf
2013-09-13 11:51                   ` Jason Pyeron
2013-09-13 12:16                 ` Duy Nguyen

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.