* Re: Multi-threaded 'git clone'
@ 2015-02-17  5:20 Martin Fick
  2015-02-17 23:32 ` Junio C Hamano
  0 siblings, 1 reply; 12+ messages in thread
From: Martin Fick @ 2015-02-17  5:20 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Koosha Khajehmoogahi, David Lang

There currently is a thread on the Gerrit list about how much faster cloning can be when using Gerrit/jgit GCed packs with bitmaps versus C git GCed packs with bitmaps.

Some differences outlined are that jgit seems to have more bitmaps: it creates one for every ref under refs/heads. Is C git doing that?  Another difference is that jgit creates two packs, splitting anything not reachable from refs/heads into its own pack.  This lets a clone use essentially zero server-side CPU in the pristine case.  In the Gerrit use case, this second "unreachable" packfile can be sizeable; I wonder whether there are other use cases where that is also true (and whether it is slowing down clones from C git GCed repos)?
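
As a rough way to compare what the two GC styles leave on disk (the path below is just a placeholder):

    ls /path/to/repo.git/objects/pack/
    # a bitmapped pack has a matching pack-<sha>.bitmap next to its pack-<sha>.pack;
    # the number of pack-*.pack files shows whether GC split things into several packs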

If there is not a lot of parallelism left to squeeze out, perhaps a focus with better returns would be doing whatever is possible to make all clones (and potentially any fetch use case deemed important on a particular server) cost zero CPU?  Depending on what a server's primary mission is, I could envision certain admins being willing to sacrifice significant amounts of disk space to speed up their fetches.  Perhaps some more extreme thinking (such as what must have led to bitmaps) is worth brainstorming about to improve server use cases?

What if an admin were willing to sacrifice a packfile for every use case he deemed important; could git be made to support that easily?  For example, suppose the admin considers a clone or a fetch of master to be important: could zero CPU be achieved regularly for those two use cases?  For cloning it would be possible if the repository were repacked in the jgit style after any push to a head.  Is it worth exploring ways of making GC efficient enough to make this feasible?  Can bitmaps be leveraged to make repacking faster?  I believe that at least reachability checking could potentially be improved with bitmaps.  Are there ways to get better delta reuse during repacking (not bitmap related), by somehow reversing or translating deltas onto newly received objects without actually recalculating them, yet still getting most objects deltified against the newest objects (achieving the same packs that git GC would produce today, but faster)?  What other pieces would need to improve to make repacking faster?

As for the single-branch fetch case, could this somehow be improved by allocating one or more packfiles to that use case?  The simplest single-branch fetch is likely someone doing a git init followed by a fetch of one branch.  I think the android repo tool can be used in this way, so this may actually be a common use case.  With a packfile dedicated to that branch, git should be able to just stream it out without any CPU, but git would need to know the packfile exists in order to use it.  It would be nice if bitmaps could help here, but I believe bitmaps can so far only be used for one packfile.  I understand that making bitmaps span multiple packfiles would be very complicated, but maybe it would not be so hard to support bitmaps on multiple packfiles if each of them were "self contained"?  By self contained I mean that all objects referenced by objects in the packfile are themselves contained in that packfile.
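
For reference, the single-branch fetch pattern I mean looks roughly like this (URL and branch name are just placeholders):

    git init project && cd project
    # fetch only the one branch into a remote-tracking ref
    git fetch git://example.com/project.git refs/heads/master:refs/remotes/origin/master
    git checkout -b master origin/master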

What other still unimplemented caching techniques could be used to improve clone/fetch use cases? 

- Shallow clones (dedicate a special packfile to this; what about another bitmap format that maps only the objects in a single tree, to help here)?

- Small fetches (simple branch fast-forward updates): I suspect these are fast enough, but if not, maybe caching some thin packs (which could make many client requests cost zero CPU) would be useful?  Maybe spread these out exponentially over time so that many are available for recent updates and fewer for older ones?  I know git normally throws thin packs away after receiving and resolving them, but if it kept them around (maybe in a special directory), it seems they could be useful for updating other clients with zero CPU.  A thin pack cache might be really easy to manage based on file timestamps; an admin may simply need to set a max cache size.  But how would git know which thin packs it has and what they are useful for? Perhaps name them with their starting and ending SHAs?

Sorry for the long-winded rant. I suspect that some variation of all my suggestions has already been proposed, but maybe they will rekindle some older, now useful thoughts, or inspire some new ones.  And maybe some of these are better to pursue than more parallelism?

-Martin

Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project

On Feb 16, 2015 8:47 AM, Jeff King <peff@peff.net> wrote:
>
> On Mon, Feb 16, 2015 at 07:31:33AM -0800, David Lang wrote: 
>
> > >Then the server streams the data to the client. It might do some light 
> > >work transforming the data as it comes off the disk, but most of it is 
> > >just blitted straight from disk, and the network is the bottleneck. 
> > 
> > Depending on how close to full the WAN link is, it may be possible to 
> > improve this with multiple connections (again, referencing bbcp), but 
> > there's also the question of if it's worth trying to use the entire WAN for 
> > a single user. The vast majority of the time the server is doing more than 
> > one thing and would rather let any individual user wait a bit and service 
> > the other users. 
>
> Yeah, I have seen clients that make multiple TCP connections to each 
> request a chunk of a file in parallel. The short answer is that this is 
> going to be very hard with git. Each clone generates the pack on the fly 
> based on what's on disk and streams it out. It should _usually_ be the 
> same, but there's nothing to guarantee byte-for-byte equality between 
> invocations. So you'd have to multiplex all of the connections into the 
> same server process. And even then it's hard; that process knows it's 
> going to send you the bytes for object X, but it doesn't know at 
> exactly which offset until it gets there, which makes sending things out 
> of order tricky. And the whole output is checksummed by a single sha1 
> over the whole stream that comes at the end. 
>
> I think the most feasible thing would be to quickly spool it to a server 
> on the LAN, and then use an existing fetch-in-parallel tool to grab it 
> from there over the WAN. 
>
> -Peff 


* Re: Multi-threaded 'git clone'
  2015-02-17  5:20 Multi-threaded 'git clone' Martin Fick
@ 2015-02-17 23:32 ` Junio C Hamano
  2015-02-18  3:14   ` Junio C Hamano
  0 siblings, 1 reply; 12+ messages in thread
From: Junio C Hamano @ 2015-02-17 23:32 UTC (permalink / raw)
  To: Martin Fick; +Cc: Jeff King, git, Koosha Khajehmoogahi, David Lang

Martin Fick <mfick@codeaurora.org> writes:

> Sorry for the long-winded rant. I suspect that some variation of all
> my suggestions has already been proposed, but maybe they will
> rekindle some older, now useful thoughts, or inspire some new ones.
> And maybe some of these are better to pursue than more parallelism?

We avoid doing a grand design document without having some prototype
implementation, but I think the limitations of the current protocol
have become apparent enough that we should do something about them,
and we should do it in a way that different implementations of Git
can all adopt.

I think "multi-threaded clone" is a wrong title for this discussion,
in that the user does not care if it is done by multi-threading the
current logic or in any other way.  The user just wants a faster
clone.

In addition, the current "fetch" protocol has the following problems
that limit us:

 - It is not easy to make it resumable, because we recompute every
   time.  This is especially problematic for the initial fetch aka
   "clone" as we will be talking about a large transfer [*1*].

 - The protocol extension has a fairly low length limit [*2*].

 - Because the protocol exchange starts by the server side
   advertising all its refs, even when the fetcher is interested in
   a single ref, the initial overhead is nontrivial, especially when
   you are doing a small incremental update.  The worst case is an
   auto-builder that polls every five minutes, even when there are no
   new commits to be fetched [*3*].

 - Because we recompute every time, taking into account what the
   fetcher has, in addition to what the fetcher obtained earlier
   from us, in order to reduce the transferred bytes, the payload for
   incremental updates becomes tailor-made for each fetch and cannot
   be easily reused [*4*].

I'd like to see a new protocol that lets us overcome the above
limitations (did I miss others? I am sure people can help here)
sometime this year.



[Footnotes]

*1* The "first fetch this bundle from elsewhere and then come back
    here for incremental updates" raised earlier in this thread may
    be a way to alleviate this, as the large bundle can be served
    from a static file.

*2* An earlier "this symbolic ref points at that concrete ref"
    attempt failed because of this and we only talk about HEAD.

*3* A new "fetch" protocol must avoid this "one side blindly gives a
    large message as the first thing".  I have been toying with the
    idea of making the fetcher talk first, by declaring "I am
    interested in your refs that match refs/heads/* or refs/tags/*,
    and I have a superset of objects that are reachable from the
    set of refs' values X you gave me earlier", where X is a small
    token generated by hashing the output from "git ls-remote $there
    refs/heads/* refs/tags/*".  In the best case where the server
    understands what X is and has cached pack data, it can then
    send:

    - differences in the refs that match the wildcards (e.g. "Back
      then at X I did not have refs/heads/next but now I do and it
      points at this commit.  My refs/heads/master is now at that
      commit.  I no longer have refs/heads/pu.  Everything else in
      the refs/ hierarchy you are interested in is the same as state
      X").

    - The new name of the state Y (again, the hashed value of the
      output from "git ls-remote $there refs/heads/* refs/tags/*")
      to make sure the above differences can be verified at the
      receiving end.

    - The cached pack data that contains all the necessary objects
      between X and Y.

    Note that the above would work if and only if we accept that it
    is OK to send objects between the remote tracking branches the
    fetcher has (i.e. the objects it last fetched from the server)
    and the current tips of branches the server has, without
    optimizing by taking into account that some commits in that set
    may have already been obtained by the fetcher from a
    third-party.

    If the server does not recognize state X (after all it is just a
    SHA-1 hash value, so the server cannot recreate the set of refs
    and their values from it unless it remembers), the exchange
    would have to degenerate to the traditional transfer.

    The server would want to recognize the result of hashing an
    empty string, though.  The fetcher is saying "I have nothing"
    in that case.
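
    As a minimal sketch, the fetcher's token might be computed like
    this (the exact normalization, e.g. the sort, is only an
    assumption here):

        # hash of the ref advertisement as seen at the last fetch
        git ls-remote $there 'refs/heads/*' 'refs/tags/*' | sort | sha1sum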


*4* The scheme in *3* can be extended to bring the fetcher
    step-wise.  If the server's state was X when the fetcher last
    contacted it, and since then the server received multiple pushes
    and has two snapshots of states, Y and Z, then the exchange may
    go like this:

    fetcher: I am interested in refs/heads/* and refs/tags/* and I
             have your state X.

    server:  Here is the incremental difference to the refs and the
             end result should hash to Y.  Here comes the pack data
             to bring you up to date.

    fetcher: (after receiving, unpacking and updating the
             remote-tracking refs) Thanks.  Do you have more?

    server:  Yes, here is the incremental difference to the refs and the
             end result should hash to Z.  Here comes the pack data
             to bring you up to date.

    fetcher: (after receiving, unpacking and updating the
             remote-tracking refs) Thanks.  Do you have more?

    server:  No, you are now fully up to date with me.  Bye.


    


* Re: Multi-threaded 'git clone'
  2015-02-17 23:32 ` Junio C Hamano
@ 2015-02-18  3:14   ` Junio C Hamano
  0 siblings, 0 replies; 12+ messages in thread
From: Junio C Hamano @ 2015-02-18  3:14 UTC (permalink / raw)
  To: Martin Fick; +Cc: Jeff King, git, Koosha Khajehmoogahi, David Lang

On Tue, Feb 17, 2015 at 3:32 PM, Junio C Hamano <gitster@pobox.com> wrote:

A few typofixes and clarifications.

> *4* The scheme in *3* can be extended to bring the fetcher
>     step-wise.  If the server's state was X when the fetcher last

"bring the fetcher up-to-date step-wise", or "update the fetcher step-wise".

>     contacted it, and since then the server received multiple pushes
>     and has two snapshots of states, Y and Z, then the exchange may
>     go like this:
>
>     fetcher: I am interested in refs/heads/* and refs/tags/* and I
>              have your state X.
>
>     server:  Here is the incremental difference to the refs and the
>              end result should hash to Y.  Here comes the pack data
>              to bring you up to date.
>
>     fetcher: (after receiving, unpacking and updating the
>              remote-tracking refs) Thanks.  Do you have more?
>
>     server:  Yes, here is the incremental difference to the refs and the
>              end result should hash to Z.  Here comes the pack data
>              to bring you up to date.
>
>     fetcher: (after receiving, unpacking and updating the
>              remote-tracking refs) Thanks.  Do you have more?
>
>     server:  No, you are now fully up to date with me.  Bye.

The initial part of this exchange may go like this, if the state the
fetcher grabbed the last time from this server is even older than X:

  fetcher: I am interested in refs/heads/* and refs/tags/* and I have
        your state W (or "I know my state is so old that I do not know
        what you call yours").

  server: Sorry, I do not know what W is. Please be more specific.

  fetcher: Here are the refs and their objects: refs/heads/master points
        at commit A, refs/heads/maint points at commit B, ...

  server: (goes and checks and finds out that the fetcher is behind X).
        OK, I'll compute a custom pack to bring you up to date with
        one of my states. Here is the incremental difference to the refs,
        and the end result should hash to X. Here comes the pack data.

  fetcher: (after receiving, unpacking and updating the
        remote-tracking refs) Thanks. Do you have more?

After that, the server would update this client to state Y and then
state Z as above.

Needless to say, this would naturally extend to a case where you
follow only a single branch (you would say "I am interested in your
refs/heads/dev" with a wildcard that matches exactly that branch).
Of course, depending on the access pattern of common project
participants, the server side may choose which sets of refs to prepare
such snapshots for, and uncommon requests may always be served by
the traditional object enumeration codepath.


* Re: Multi-threaded 'git clone'
  2015-02-16 18:43         ` Junio C Hamano
@ 2015-02-17  3:16           ` Shawn Pearce
  0 siblings, 0 replies; 12+ messages in thread
From: Shawn Pearce @ 2015-02-17  3:16 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, David Lang, Koosha Khajehmoogahi, git

On Mon, Feb 16, 2015 at 10:43 AM, Junio C Hamano <gitster@pobox.com> wrote:
> Jeff King <peff@peff.net> writes:
>
>> ... And the whole output is checksummed by a single sha1
>> over the whole stream that comes at the end.
>>
>> I think the most feasible thing would be to quickly spool it to a
>> server on the LAN, and then use an existing fetch-in-parallel tool
>> to grab it from there over the WAN.
>
> One possibility would be for the server to prepare a static bundle
> file to bootstrap all the "clone" clients with and publish it on a
> CDN.  A protocol extension would tell the client where to download
> the bundle from, the client can then grab the bundle to clone from
> it to become "slightly stale but mostly up to date", and then do a
> usual incremental update with the server after that to be complete.
>
> The server would update the bundle used to bootstrap clone clients
> periodically in order to keep the incrementals to the minimum, and
> would make sure their bitmap is anchored at the tips of bundles to
> minimize the object counting load during the incremental phase.
>
> I think "repo" used by folks who do AOSP does something similar to
> that by scripting around "git clone".  I'd imagine that they would
> be happy to see if "git clone" did all that inside.

Yes, the "repo" tool used by Android uses curl to download a
previously cached $URL/clone.bundle using resumable HTTP. For Android
the file is only updated ~every 6 months at major releases and is
easily cached by CDNs and HTTP proxy servers.

This is spooled to a temporary file on disk then unpacked using `git
fetch $path/clone.bundle refs/heads/*:refs/remotes/origin/*`.
Afterwards a normal git fetch is run to bring the new clone current
with the server, picking up any delta that happened since the bundle
was created and cached.
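
The client side of that flow looks roughly like this (the URL is a placeholder, and the exact invocation repo uses may differ):

    curl -C - -o /tmp/clone.bundle "$URL/clone.bundle"    # resumable download
    git init project && cd project
    git fetch /tmp/clone.bundle 'refs/heads/*:refs/remotes/origin/*'
    git remote add origin "$URL"
    git fetch origin    # pick up whatever happened since the bundle was made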

The Android Git servers at android.googlesource.com just recognize
*/clone.bundle GET requests and issue 302 redirects to the CDN farm
that actually stores and serves the precreated bundle files.

We really want to see this in stock git clone for HTTP transports, as
other projects like Chromium want to use it for their ~3 GiB
repository. Being able to build the bulk of the repo every few months
and serve it out using a CDN to bootstrap new clients would really
help developers on slower or flaky network connections.


* Re: Multi-threaded 'git clone'
  2015-02-16 23:16         ` Duy Nguyen
@ 2015-02-17  0:56           ` Jeff King
  0 siblings, 0 replies; 12+ messages in thread
From: Jeff King @ 2015-02-17  0:56 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: David Lang, Koosha Khajehmoogahi, git

On Tue, Feb 17, 2015 at 06:16:39AM +0700, Duy Nguyen wrote:

> On Mon, Feb 16, 2015 at 10:47 PM, Jeff King <peff@peff.net> wrote:
> > Each clone generates the pack on the fly
> > based on what's on disk and streams it out. It should _usually_ be the
> > same, but there's nothing to guarantee byte-for-byte equality between
> > invocations.
> 
> It's usually _not_ the same. I tried this when I wanted to produce stable
> packs. The first condition is single-threaded pack-objects. Otherwise
> the thread scheduler could make object order unpredictable.

True. If you keep your server repositories fully packed, that eliminates
the delta search (and/or makes it feasible to set pack.threads to 1 to
make it deterministic). But any change in the repository (e.g., somebody
else pushing, even to a ref you are not fetching) can cause unexpected
changes in the bytes.
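
A rough way to see that for yourself (a sketch, run single-threaded against an otherwise idle, fully packed repository):

    # compare two pack-objects runs for byte-for-byte equality
    echo HEAD | git -c pack.threads=1 pack-objects --revs --stdout >p1.pack
    echo HEAD | git -c pack.threads=1 pack-objects --revs --stdout >p2.pack
    cmp p1.pack p2.pack && echo identical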

-Peff


* Re: Multi-threaded 'git clone'
  2015-02-16 15:47       ` Jeff King
  2015-02-16 18:43         ` Junio C Hamano
@ 2015-02-16 23:16         ` Duy Nguyen
  2015-02-17  0:56           ` Jeff King
  1 sibling, 1 reply; 12+ messages in thread
From: Duy Nguyen @ 2015-02-16 23:16 UTC (permalink / raw)
  To: Jeff King; +Cc: David Lang, Koosha Khajehmoogahi, git

On Mon, Feb 16, 2015 at 10:47 PM, Jeff King <peff@peff.net> wrote:
> Each clone generates the pack on the fly
> based on what's on disk and streams it out. It should _usually_ be the
> same, but there's nothing to guarantee byte-for-byte equality between
> invocations.

It's usually _not_ the same. I tried this when I wanted to produce stable
packs. The first condition is single-threaded pack-objects. Otherwise
the thread scheduler could make object order unpredictable.
-- 
Duy


* Re: Multi-threaded 'git clone'
  2015-02-16 15:47       ` Jeff King
@ 2015-02-16 18:43         ` Junio C Hamano
  2015-02-17  3:16           ` Shawn Pearce
  2015-02-16 23:16         ` Duy Nguyen
  1 sibling, 1 reply; 12+ messages in thread
From: Junio C Hamano @ 2015-02-16 18:43 UTC (permalink / raw)
  To: Jeff King; +Cc: David Lang, Koosha Khajehmoogahi, git

Jeff King <peff@peff.net> writes:

> ... And the whole output is checksummed by a single sha1
> over the whole stream that comes at the end.
>
> I think the most feasible thing would be to quickly spool it to a
> server on the LAN, and then use an existing fetch-in-parallel tool
> to grab it from there over the WAN.

One possibility would be for the server to prepare a static bundle
file to bootstrap all the "clone" clients with and publish it on a
CDN.  A protocol extension would tell the client where to download
the bundle from, the client can then grab the bundle to clone from
it to become "slightly stale but mostly up to date", and then do a
usual incremental update with the server after that to be complete.

The server would update the bundle used to bootstrap clone clients
periodically in order to keep the incrementals to the minimum, and
would make sure their bitmap is anchored at the tips of bundles to
minimize the object counting load during the incremental phase.
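
Much of this could be approximated today with existing tools; a rough sketch (names and URLs are placeholders):

    # on the server, regenerated periodically and published to the CDN
    git bundle create clone.bundle HEAD --branches --tags

    # on the client, once told where the bundle lives
    git clone clone.bundle project
    cd project
    git remote set-url origin https://git.example.org/project.git
    git fetch origin    # the usual incremental update to become complete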

I think "repo" used by folks who do AOSP does something similar to
that by scripting around "git clone".  I'd imagine that they would
be happy to see if "git clone" did all that inside.


* Re: Multi-threaded 'git clone'
  2015-02-16 15:31     ` David Lang
@ 2015-02-16 15:47       ` Jeff King
  2015-02-16 18:43         ` Junio C Hamano
  2015-02-16 23:16         ` Duy Nguyen
  0 siblings, 2 replies; 12+ messages in thread
From: Jeff King @ 2015-02-16 15:47 UTC (permalink / raw)
  To: David Lang; +Cc: Koosha Khajehmoogahi, git

On Mon, Feb 16, 2015 at 07:31:33AM -0800, David Lang wrote:

> >Then the server streams the data to the client. It might do some light
> >work transforming the data as it comes off the disk, but most of it is
> >just blitted straight from disk, and the network is the bottleneck.
> 
> Depending on how close to full the WAN link is, it may be possible to
> improve this with multiple connections (again, referencing bbcp), but
> there's also the question of if it's worth trying to use the entire WAN for
> a single user. The vast majority of the time the server is doing more than
> one thing and would rather let any individual user wait a bit and service
> the other users.

Yeah, I have seen clients that make multiple TCP connections to each
request a chunk of a file in parallel. The short answer is that this is
going to be very hard with git. Each clone generates the pack on the fly
based on what's on disk and streams it out. It should _usually_ be the
same, but there's nothing to guarantee byte-for-byte equality between
invocations. So you'd have to multiplex all of the connections into the
same server process. And even then it's hard; that process knows it's
going to send you the bytes for object X, but it doesn't know at
exactly which offset until it gets there, which makes sending things out
of order tricky. And the whole output is checksummed by a single sha1
over the whole stream that comes at the end.

I think the most feasible thing would be to quickly spool it to a server
on the LAN, and then use an existing fetch-in-parallel tool to grab it
from there over the WAN.

-Peff


* Re: Multi-threaded 'git clone'
  2015-02-16 15:03   ` Jeff King
@ 2015-02-16 15:31     ` David Lang
  2015-02-16 15:47       ` Jeff King
  0 siblings, 1 reply; 12+ messages in thread
From: David Lang @ 2015-02-16 15:31 UTC (permalink / raw)
  To: Jeff King; +Cc: Koosha Khajehmoogahi, git

On Mon, 16 Feb 2015, Jeff King wrote:

> On Mon, Feb 16, 2015 at 05:31:13AM -0800, David Lang wrote:
>
>> I think it's an interesting question to look at, but before you start
>> looking at changing the architecture of the current code, I would suggest
>> doing a bit more analysis of the problem to see if the bottleneck is really
>> where you think it is.
>>
>> First measure, then optimize :-)
>
> Yes, very much so. Fortunately some people have already done some of
> this work. :)

nice summary

> Then the server streams the data to the client. It might do some light
> work transforming the data as it comes off the disk, but most of it is
> just blitted straight from disk, and the network is the bottleneck.

Depending on how close to full the WAN link is, it may be possible to improve 
this with multiple connections (again, referencing bbcp), but there's also the 
question of if it's worth trying to use the entire WAN for a single user. The 
vast majority of the time the server is doing more than one thing and would 
rather let any individual user wait a bit and service the other users.

David Lang


* Re: Multi-threaded 'git clone'
  2015-02-16 13:31 ` David Lang
@ 2015-02-16 15:03   ` Jeff King
  2015-02-16 15:31     ` David Lang
  0 siblings, 1 reply; 12+ messages in thread
From: Jeff King @ 2015-02-16 15:03 UTC (permalink / raw)
  To: David Lang; +Cc: Koosha Khajehmoogahi, git

On Mon, Feb 16, 2015 at 05:31:13AM -0800, David Lang wrote:

> I think it's an interesting question to look at, but before you start
> looking at changing the architecture of the current code, I would suggest
> doing a bit more analysis of the problem to see if the bottleneck is really
> where you think it is.
> 
> First measure, then optimize :-)

Yes, very much so. Fortunately some people have already done some of
this work. :)

On the server side of a clone, the things that must be done before
sending any data are:

  1. Count up all of the objects that must be sent by traversing the
     object graph.

  2. Find any pairs for delta compression (this is the "Compressing
     objects" phase of the progress reporting).

Step (1) naively takes 30-45 seconds for a kernel repo. However, with
reachability bitmaps, it's instant-ish. I just did a clone from
kernel.org, and it looks like they've turned on bitmaps.

For step (2), git will reuse deltas that already exist in the on-disk
packfile, and will not consider new deltas between objects that are
already in the same pack (because we would already have considered them
when packing in the first place). So the key for servers is to keep
things pretty packed. My kernel.org clone shows that they could probably
stand to repack torvalds/linux.git, but it's not too terrible.

This part is multithreaded, so what work we do happens in parallel. But
note that some servers may turn pack.threads down to 1 (since their many
CPUs are kept busy by multiple requests, rather than trying to finish a
single one).
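
For a server admin, the above boils down to something like this (a sketch; how often to repack is site-specific):

    git repack -a -d --write-bitmap-index   # keep fully packed, with a reachability bitmap
    git config pack.threads 1               # optional: favor overall throughput over per-request speed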

Then the server streams the data to the client. It might do some light
work transforming the data as it comes off the disk, but most of it is
just blitted straight from disk, and the network is the bottleneck.

On the client side, the incoming data streams into an index-pack
process. For each full object it sees, it hashes and records the name of
the object as it comes in. For deltas, it queues them for resolution
after the complete pack arrives.

Once the full pack arrives, then it resolves all of the deltas. This
part is also multithreaded. If you check out "top" during the "resolving
deltas" phase of the clone, you should see multiple cores in use.

So I don't think there is any room for "just multithread it" in this
process. The CPU intensive bits are already multithreaded. There may be
room for optimizing that, though (e.g., reducing lock contention or
similar).

It would also be possible to resolve deltas while the pack is streaming
in, rather than waiting until the whole thing arrives. That's not
possible in all cases (an object may be a delta against a base that
comes later in the pack), but in practice git puts bases before their
deltas. However, it's overall less efficient, because you may end up
walking through the same parts of the delta chain more than once. For
example, imagine you see a stream of objects A, B, C, D. You get B and
see that it's a delta against A. So you resolve it, hash the object, and
are good. Now you see C, which is a delta against B. To generate C, you
have to compute B again. Now you get to D, which is another delta
against B. So now we compute B again.

You can get around this somewhat with a cache of intermediate object
contents, but of course there may be hundreds or thousands of chains
like this in use at once, so you're going to end up with some cache
misses.

What index-pack does instead is to wait until it has all of the objects,
then finds A and says "what objects use A as a base?". Then it computes
B, hashes it, and says "what objects use B as a base?". And finds C and
D, after which it knows it can drop the intermediate result B.

So that's less work over all, though in some workloads it may finish
faster if you were to stream it (because your many processors are
sitting idle while we are blocked on network bandwidth). So that's a
potential area of exploration.

-Peff


* Re: Multi-threaded 'git clone'
  2015-02-16 13:16 Koosha Khajehmoogahi
@ 2015-02-16 13:31 ` David Lang
  2015-02-16 15:03   ` Jeff King
  0 siblings, 1 reply; 12+ messages in thread
From: David Lang @ 2015-02-16 13:31 UTC (permalink / raw)
  To: Koosha Khajehmoogahi; +Cc: git

On Mon, 16 Feb 2015, Koosha Khajehmoogahi wrote:

> Cloning huge repositories like the Linux kernel takes a considerable
> amount of time. Is it possible to incorporate multi-threaded,
> simultaneous-connection functionality for cloning? To what extent would
> we need to change the architecture of the current code, and how large
> would the scope of the work be? This just seems like an interesting idea
> to me and I would like to share it with the community.

The key question is what it is that takes the time in cloning, and whether that
can be multi-threaded.

If it's the network traffic that takes the most time, where is the bottleneck?

Is it in the server software assembling what will be sent? Is it in the 
receiving software processing it? If so, multiple threads could help.

Is it in network bandwidth? If so, doing multiple connections won't help much.
TCP favours a few connections passing a lot of data rather than many
connections passing a little. The one place where multiple connections can help
is when you have non-congestion-induced packet loss, as a lost packet on a
connection will cause the throughput of that connection to drop (if the drop is
due to congestion, this is TCP working as designed, throttling back to match the
available bandwidth). This can be a significant effect if you have a very
high-bandwidth, high-latency connection (think multiple Gb on international
links), but for lower-bandwidth connections it's much less of a factor.
You can look at projects like bbcp.

I think it's an interesting question to look at, but before you start looking at
changing the architecture of the current code, I would suggest doing a bit more
analysis of the problem to see if the bottleneck is really where you think it
is.

First measure, then optimize :-)
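
For example, a first rough pass could be as simple as this (the URL is just an example, and the traces are verbose):

    time git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
    # protocol-level behaviour can be inspected with git's trace variables
    GIT_TRACE=1 GIT_TRACE_PACKET=1 git clone <url> 2>clone.log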

David Lang


* Multi-threaded 'git clone'
@ 2015-02-16 13:16 Koosha Khajehmoogahi
  2015-02-16 13:31 ` David Lang
  0 siblings, 1 reply; 12+ messages in thread
From: Koosha Khajehmoogahi @ 2015-02-16 13:16 UTC (permalink / raw)
  To: git

Greetings,

Cloning huge repositories like the Linux kernel takes a considerable
amount of time. Is it possible to incorporate multi-threaded,
simultaneous-connection functionality for cloning? To what extent would
we need to change the architecture of the current code, and how large
would the scope of the work be? This just seems like an interesting idea
to me and I would like to share it with the community.

Regards

-- 
Koosha

