git.vger.kernel.org archive mirror
* Re: faster git clone
       [not found] <20210122030103.GA73465@gmail.com>
@ 2021-01-22 19:53 ` Emily Shaffer
  2021-01-23  0:41   ` brian m. carlson
  0 siblings, 1 reply; 2+ messages in thread
From: Emily Shaffer @ 2021-01-22 19:53 UTC
  To: William Chen, Git List

On Thu, Jan 21, 2021 at 7:01 PM William Chen <williamchen32335@gmail.com> wrote:
>
> Dear Emily,
>
> I see your excellent contribution to git clone. I hope that you are well.

Hi William, this is a question much better directed at the Git list as a whole.

>
> When I try to clone a repo of a large size from github, it is slow.
>
> $ git clone https://github.com/git/git
> ...
> remote: Enumerating objects: 56, done.
> remote: Counting objects: 100% (56/56), done.
> remote: Compressing objects: 100% (25/25), done.
> Receiving objects:  23% (70386/299751), 33.00 MiB | 450.00 KiB/s
>
> The following aria2c command, which can use multiple downloading threads, is much faster. Would you please let me know whether there is a way to speed up git clone (maybe by using parallelization)?

In general, it would be more compelling to see actual numbers than
"much faster", e.g. the outputs of `time git clone
https://github.com/git/git` and `time aria2c
https://github.com/git/git/archive/master.zip` - or even an estimation
from you, like, "I think clone takes a minute or two but aria does the
same thing in only a couple of seconds". "Much faster" means something
different to everyone :)
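
One way to collect comparable numbers, as a minimal sketch: the helper
below runs any command and prints its wall-clock duration in whole
seconds. The name elapsed_seconds is hypothetical; it is not part of
Git or aria2.

```shell
# elapsed_seconds CMD [ARGS...]
# Runs the command, discards its output, and prints the wall-clock
# duration in whole seconds. The helper name is hypothetical.
elapsed_seconds() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
}
```

For example, `elapsed_seconds git clone https://github.com/git/git` and
`elapsed_seconds aria2c https://github.com/git/git/archive/master.zip`
each print a single number that can be compared directly. Whole-second
resolution is coarse, but it is enough to tell "a couple of seconds"
from "a minute or two".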

>
> Your help is much appreciated! I look forward to hearing from you. Thanks.
>
> $ aria2c https://github.com/git/git/archive/master.zip
>
> 01/21 20:16:04 [NOTICE] Downloading 1 item(s)
>
> 01/21 20:16:04 [NOTICE] CUID#7 - Redirecting to https://codeload.github.com/git/git/zip/master

Right here it looks like your zip download redirects to a CDN or
something, which is probably better optimized for serving archives
than the Git server itself, so I would guess that has something to do
with it too.

> [#59b6a2 8.2MiB/0B CN:1 DL:3.8MiB]
> 01/21 20:16:08 [NOTICE] Download complete: /private/tmp/git-master.zip
>
> Download Results:
> gid   |stat|avg speed  |path/URI
> ======+====+===========+=======================================================
> 59b6a2|OK  |   2.9MiB/s|/private/tmp/git-master.zip
>
> Status Legend:
> (OK):download completed.

There are others on the list who are better able to explain this than
me. But I'd guess the upshot is that 'git clone
https://github.com/git/git' is asking a Git server, which is good at
Git repo management (e.g. accepting pushes, generating packfiles to
send you a specific object or branch, etc) - but when you ask for
"git/git/archive/master.zip" you're getting the result of some work
the Git server already did a while ago to zip up the current 'master'
into an archive and give it to some other server.

We've done some other work[1] around enabling use of CDNs and prebuilt
chunks lately, but again, there are others on the list better able to
explain than me.

[1]: https://github.com/git/git/blob/master/Documentation/technical/packfile-uri.txt


* Re: faster git clone
  2021-01-22 19:53 ` faster git clone Emily Shaffer
@ 2021-01-23  0:41   ` brian m. carlson
  0 siblings, 0 replies; 2+ messages in thread
From: brian m. carlson @ 2021-01-23  0:41 UTC
  To: Emily Shaffer; +Cc: William Chen, Git List


On 2021-01-22 at 19:53:21, Emily Shaffer wrote:
> On Thu, Jan 21, 2021 at 7:01 PM William Chen <williamchen32335@gmail.com> wrote:
> > When I try to clone a repo of a large size from github, it is slow.
> >
> > $ git clone https://github.com/git/git
> > ...
> > remote: Enumerating objects: 56, done.
> > remote: Counting objects: 100% (56/56), done.
> > remote: Compressing objects: 100% (25/25), done.
> > Receiving objects:  23% (70386/299751), 33.00 MiB | 450.00 KiB/s
> >
> > The following aria2c command, which can use multiple downloading threads, is much faster. Would you please let me know whether there is a way to speed up git clone (maybe by using parallelization)?
> 
> In general, it would be more compelling to see actual numbers than
> "much faster", e.g. the outputs of `time git clone
> https://github.com/git/git` and `time aria2c
> https://github.com/git/git/archive/master.zip` - or even an estimation
> from you, like, "I think clone takes a minute or two but aria does the
> same thing in only a couple of seconds". "Much faster" means something
> different to everyone :)

When Git shows the download speed, I believe it shows the speed over
the most recent interval rather than an average for the whole transfer,
so the number you see at any given moment can vary.  Part of that is
server side, since Git performs compression of data, and part of that
is client side.  For instance, if you're using SHA-1-DC (which you
should, for security), there is a theoretical performance limit for
clones of about 50 MiB/s on the client side, which would be improved if
you were cloning a SHA-256 repository[0].

If this is a fresh clone, the pack is probably already cached on the
server side, since frequently requested pack files are cached at GitHub
(although I'm not sure whether it's possible to determine if a given
request was served from cache), so it could be that it's just pushing
data over the pipe as fast as your system can process it.

> > Your help is much appreciated! I look forward to hearing from you. Thanks.
> >
> > $ aria2c https://github.com/git/git/archive/master.zip
> >
> > 01/21 20:16:04 [NOTICE] Downloading 1 item(s)
> >
> > 01/21 20:16:04 [NOTICE] CUID#7 - Redirecting to https://codeload.github.com/git/git/zip/master
> 
> Right here it looks like your zip download redirects to a CDN or
> something, which is probably better optimized for serving archives
> than the Git server itself, so I would guess that has something to do
> with it too.

This is indeed backed by a CDN which may be much closer to you
physically.  Without seeing the full request, it's hard for me to say
where your request was served from (CDN or not).  I should point out
that in this case you're downloading a single revision (so much less
data), the data is usually cached, and your end of the system is not
performing any decompression or hash verification, so it may appear
faster simply because you're not performing equivalent work.
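
A closer like-for-like comparison, as a sketch only: a shallow clone
also fetches just the tip revision, although the client still
decompresses and verifies objects, so it is not strictly equivalent to
a zip download. The function name below is hypothetical; --depth and
rev-list --count are standard Git options.

```shell
# shallow_commit_count URL DIR
# Clones only the tip commit of the default branch of URL into DIR and
# prints how many commits the resulting clone contains.
# The function name is hypothetical.
shallow_commit_count() {
    git clone --quiet --depth 1 "$1" "$2" &&
        git -C "$2" rev-list --count HEAD
}
```

Run against a repository with a long history, this prints 1, which
confirms that only a single revision was transferred. Timing that
clone, rather than a full one, gives a fairer baseline against the
archive download.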

(I would kindly ask that you not try to download every revision in
history for a comparison, because that would be a clearly excessive and
abusive level of usage.)

I should also point out that you can't use multiple download threads to
download from these endpoints because they don't, in general, handle
Range requests.  (Basically, they honor Range requests when the archive
is already cached, but not otherwise.)
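
Whether a particular response honored a Range request can be checked
from its status line: 206 means a partial response was served, while
200 means the server ignored the range and sent the whole file. The
filter below is a minimal sketch that reads HTTP response headers on
stdin; the name honors_range is hypothetical.

```shell
# honors_range
# Reads HTTP response headers on stdin (possibly several responses in a
# row when redirects were followed) and prints "yes" if the final
# response had status 206 Partial Content, otherwise "no".
honors_range() {
    awk '
        { sub(/\r$/, "") }             # drop the CR of CRLF line endings
        /^HTTP\// { status = $2 }      # remember the most recent status line
        END { print (status + 0 == 206 ? "yes" : "no") }
    '
}
```

Headers can be fed to it with something like
`curl -sL -r 0-0 -o /dev/null -D - URL | honors_range`, which asks for
only the first byte, follows redirects, and dumps the response headers
to stdout. Because the result depends on cache state, the same URL may
answer differently on different attempts.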

> > [#59b6a2 8.2MiB/0B CN:1 DL:3.8MiB]
> > 01/21 20:16:08 [NOTICE] Download complete: /private/tmp/git-master.zip
> >
> > Download Results:
> > gid   |stat|avg speed  |path/URI
> > ======+====+===========+=======================================================
> > 59b6a2|OK  |   2.9MiB/s|/private/tmp/git-master.zip
> >
> > Status Legend:
> > (OK):download completed.
> 
> There are others on the list who are better able to explain this than
> me. But I'd guess the upshot is that 'git clone
> https://github.com/git/git' is asking a Git server, which is good at
> Git repo management (e.g. accepting pushes, generating packfiles to
> send you a specific object or branch, etc) - but when you ask for
> "git/git/archive/master.zip" you're getting the result of some work
> the Git server already did a while ago to zip up the current 'master'
> into an archive and give it to some other server.

It's impossible for me to say definitively what the performance problem
is in this case, but I don't think the bottleneck is intrinsically Git
if you're seeing speeds below 50 MiB/s.  Git can and does process data
at that speed on the local system (and at 15 MiB/s on my local network
over Wi-Fi), so I'd guess that it's either a limitation in network
performance based on two different serving locations or perhaps a
temporarily overloaded server combined with the packfile not being
cached.
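
As a rough check on that reasoning, the sketch below expresses an
observed rate as a percentage of the approximate 50 MiB/s client-side
ceiling mentioned above. The function name is hypothetical, and the
inputs are the figures quoted in this thread.

```shell
# percent_of_ceiling OBSERVED_KIB_PER_S CEILING_MIB_PER_S
# Prints the observed rate as a percentage of the ceiling, with two
# decimal places. The function name is hypothetical.
percent_of_ceiling() {
    LC_ALL=C awk -v observed_kib="$1" -v ceiling_mib="$2" \
        'BEGIN { printf "%.2f\n", observed_kib / (ceiling_mib * 1024) * 100 }'
}
```

`percent_of_ceiling 450 50` prints 0.88, so the 450 KiB/s seen during
the clone is under 1% of that ceiling, which is consistent with the
bottleneck being somewhere other than client-side hashing.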

[0] I personally think 50 MiB/s is a very reasonable transfer speed, but
some people disagree.
-- 
brian m. carlson (he/him or they/them)
Houston, Texas, US
