* Add a "Flattened Cache" to `git --clone`?
@ 2020-05-14 14:34 Caleb Gray
  2020-05-14 20:33 ` Konstantin Ryabitsev
  0 siblings, 1 reply; 17+ messages in thread
From: Caleb Gray @ 2020-05-14 14:34 UTC (permalink / raw)
  To: git

I've done some searching around the Internet, mailing lists, and
reached out in IRC a couple of days ago... and haven't found anyone
else asking about a long-brewed contribution idea that I'd finally
like to implement. First I wanted to run it by you guys, though, since
this is my first time reaching out.

Assuming my idea doesn't contradict other best practices or standards
already in place,  I'd like to transform the typical `git clone` flow
from:

 Cloning into 'linux'...
 remote: Enumerating objects: 4154, done.
 remote: Counting objects: 100% (4154/4154), done.
 remote: Compressing objects: 100% (2535/2535), done.
 remote: Total 7344127 (delta 2564), reused 2167 (delta 1612),
pack-reused 7339973
 Receiving objects: 100% (7344127/7344127), 1.22 GiB | 8.51 MiB/s, done.
 Resolving deltas: 100% (6180880/6180880), done.

To subsequent clones (until cache invalidated) using the "flattened
cache" version (presumably built while fulfilling the first clone
request above):

 Cloning into 'linux'...
 Receiving cache: 100% (7344127/7344127), 1.22 GiB | 8.51 MiB/s, done.

I've always imagined that this feature would only apply to a "vanilla"
clone (that is, one without any flags that change the end result)...
but that's only because I haven't yet cracked open the `git` codebase
to validate/invalidate the complexity of this feature.
I'm writing in hopes that someone else has thought about it... and
might share what they already know. :P

Thanks so much for your time!

Sincerely,
Caleb

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Add a "Flattened Cache" to `git --clone`?
  2020-05-14 14:34 Add a "Flattened Cache" to `git --clone`? Caleb Gray
@ 2020-05-14 20:33 ` Konstantin Ryabitsev
  2020-05-14 20:54   ` Bryan Turner
                     ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Konstantin Ryabitsev @ 2020-05-14 20:33 UTC (permalink / raw)
  To: Caleb Gray; +Cc: git

On Thu, May 14, 2020 at 07:34:08AM -0700, Caleb Gray wrote:
> I've done some searching around the Internet, mailing lists, and
> reached out in IRC a couple of days ago... and haven't found anyone
> else asking about a long-brewed contribution idea that I'd finally
> like to implement. First I wanted to run it by you guys, though, since
> this is my first time reaching out.
> 
> Assuming my idea doesn't contradict other best practices or standards
> already in place,  I'd like to transform the typical `git clone` flow
> from:
> 
>  Cloning into 'linux'...
>  remote: Enumerating objects: 4154, done.
>  remote: Counting objects: 100% (4154/4154), done.
>  remote: Compressing objects: 100% (2535/2535), done.
>  remote: Total 7344127 (delta 2564), reused 2167 (delta 1612),
> pack-reused 7339973
>  Receiving objects: 100% (7344127/7344127), 1.22 GiB | 8.51 MiB/s, done.
>  Resolving deltas: 100% (6180880/6180880), done.
> 
> To subsequent clones (until cache invalidated) using the "flattened
> cache" version (presumably built while fulfilling the first clone
> request above):
> 
>  Cloning into 'linux'...
>  Receiving cache: 100% (7344127/7344127), 1.22 GiB | 8.51 MiB/s, done.

I don't think it's a common workflow for someone to repeatedly clone 
linux.git. Automated processes like CI would be doing it, but they tend 
to blow away the local disk between jobs, so they are unlikely to 
benefit from any native git local cache for something like this (in 
fact, we recommend that people use clone.bundle files for their CI 
needs, as described here: 
https://www.kernel.org/best-way-to-do-linux-clones-for-your-ci.html).
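The clone.bundle flow recommended there can be sketched offline; this is a
minimal stand-in (the bundle is generated locally here, whereas kernel.org
serves pre-built clone.bundle files as static downloads, and the final
set-url target is illustrative):

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"

# Stand-in for the server-side repo; the hosting side pre-builds a
# bundle like this once and serves it as a plain file.
git init -q src
git -C src -c user.email=ci@example.com -c user.name=ci \
    commit -q --allow-empty -m 'initial commit'
git -C src bundle create clone.bundle HEAD --all

# CI job: clone from the cached bundle instead of the live server,
# then point origin at the real URL for incremental fetches later.
git clone -q src/clone.bundle work
git -C work remote set-url origin https://example.com/src.git
git -C work rev-list --count HEAD
```

After this, a `git fetch` in `work` would only transfer objects newer than
the bundle, which is the whole point of the scheme.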

I believe there's quite a bit of work being done by Gitlab folks to make 
it possible to offload more object fetching to lookaside-caches like 
CDN. Perhaps one of them can provide an update on how that is going.

-K


* Re: Add a "Flattened Cache" to `git --clone`?
  2020-05-14 20:33 ` Konstantin Ryabitsev
@ 2020-05-14 20:54   ` Bryan Turner
  2020-05-14 21:05   ` Theodore Y. Ts'o
  2020-05-14 21:19   ` Junio C Hamano
  2 siblings, 0 replies; 17+ messages in thread
From: Bryan Turner @ 2020-05-14 20:54 UTC (permalink / raw)
  To: Caleb Gray, Git Users

On Thu, May 14, 2020 at 1:33 PM Konstantin Ryabitsev
<konstantin@linuxfoundation.org> wrote:
>
> On Thu, May 14, 2020 at 07:34:08AM -0700, Caleb Gray wrote:
> > I've done some searching around the Internet, mailing lists, and
> > reached out in IRC a couple of days ago... and haven't found anyone
> > else asking about a long-brewed contribution idea that I'd finally
> > like to implement. First I wanted to run it by you guys, though, since
> > this is my first time reaching out.
> >
> > Assuming my idea doesn't contradict other best practices or standards
> > already in place,  I'd like to transform the typical `git clone` flow
> > from:
> >
> >  Cloning into 'linux'...
> >  remote: Enumerating objects: 4154, done.
> >  remote: Counting objects: 100% (4154/4154), done.
> >  remote: Compressing objects: 100% (2535/2535), done.
> >  remote: Total 7344127 (delta 2564), reused 2167 (delta 1612),
> > pack-reused 7339973
> >  Receiving objects: 100% (7344127/7344127), 1.22 GiB | 8.51 MiB/s, done.
> >  Resolving deltas: 100% (6180880/6180880), done.
> >
> > To subsequent clones (until cache invalidated) using the "flattened
> > cache" version (presumably built while fulfilling the first clone
> > request above):
> >
> >  Cloning into 'linux'...
> >  Receiving cache: 100% (7344127/7344127), 1.22 GiB | 8.51 MiB/s, done.
>
> I don't think it's a common workflow for someone to repeatedly clone
> linux.git. Automated processes like CI would be doing it, but they tend
> to blow away the local disk between jobs, so they are unlikely to
> benefit from any native git local cache for something like this (in
> fact, we recommend that people use clone.bundle files for their CI
> needs, as described here:
> https://www.kernel.org/best-way-to-do-linux-clones-for-your-ci.html).
>
> I believe there's quite a bit of work being done by Gitlab folks to make
> it possible to offload more object fetching to lookaside-caches like
> CDN. Perhaps one of them can provide an update on how that is going.

I can't speak for Gitlab, but Bitbucket Server (formerly Stash) has
done this for years, and I believe Github does as well. For Bitbucket
Server, our caching doesn't change what the client sees (i.e. they
still see "Counting objects", "Compressing objects"), but the early
steps essentially jump straight to 100% (since that progress
information is included in our cached data) and then the client starts
receiving the pack.

I'm not sure how straightforward--or desirable--it would be for
something like this to be done natively by Git itself. Certainly it
would make building hosting solutions simpler, which could be a win
for simpler setups that don't use something like Bitbucket Server,
Gitlab or Github, but I'm not sure that's a big "win". Effort on
something like clonebundles (in Mercurial parlance) or similar seems
likely to offer a lot more bang for the buck than caching packs for
specific wants/haves.

Just my 2 cents as someone who has directly worked on this sort of caching.

Bryan Turner


* Re: Add a "Flattened Cache" to `git --clone`?
  2020-05-14 20:33 ` Konstantin Ryabitsev
  2020-05-14 20:54   ` Bryan Turner
@ 2020-05-14 21:05   ` Theodore Y. Ts'o
  2020-05-14 21:09     ` Eric Sunshine
                       ` (2 more replies)
  2020-05-14 21:19   ` Junio C Hamano
  2 siblings, 3 replies; 17+ messages in thread
From: Theodore Y. Ts'o @ 2020-05-14 21:05 UTC (permalink / raw)
  To: Caleb Gray, git

On Thu, May 14, 2020 at 04:33:26PM -0400, Konstantin Ryabitsev wrote:
> > Assuming my idea doesn't contradict other best practices or standards
> > already in place,  I'd like to transform the typical `git clone` flow
> > from:
> > 
> >  Cloning into 'linux'...
> >  remote: Enumerating objects: 4154, done.
> >  remote: Counting objects: 100% (4154/4154), done.
> >  remote: Compressing objects: 100% (2535/2535), done.
> >  remote: Total 7344127 (delta 2564), reused 2167 (delta 1612),
> > pack-reused 7339973
> >  Receiving objects: 100% (7344127/7344127), 1.22 GiB | 8.51 MiB/s, done.
> >  Resolving deltas: 100% (6180880/6180880), done.
> > 
> > To subsequent clones (until cache invalidated) using the "flattened
> > cache" version (presumably built while fulfilling the first clone
> > request above):
> > 
> >  Cloning into 'linux'...
> >  Receiving cache: 100% (7344127/7344127), 1.22 GiB | 8.51 MiB/s, done.
> 
> I don't think it's a common workflow for someone to repeatedly clone 
> linux.git. Automated processes like CI would be doing it, but they tend 
> to blow away the local disk between jobs, so they are unlikely to 
> benefit from any native git local cache for something like this (in 
> fact, we recommend that people use clone.bundle files for their CI 
> needs, as described here: 
> https://www.kernel.org/best-way-to-do-linux-clones-for-your-ci.html).

If the goal is a git local cache, we have this today.  I'm not sure
this is what Caleb was asking for, though:

git clone --bare https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git base
git clone --reference base https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git ext4

							- Ted


* Re: Add a "Flattened Cache" to `git --clone`?
  2020-05-14 21:05   ` Theodore Y. Ts'o
@ 2020-05-14 21:09     ` Eric Sunshine
  2020-05-14 21:10     ` Konstantin Ryabitsev
  2020-05-14 21:33     ` Caleb Gray
  2 siblings, 0 replies; 17+ messages in thread
From: Eric Sunshine @ 2020-05-14 21:09 UTC (permalink / raw)
  To: Theodore Y. Ts'o; +Cc: Caleb Gray, Git List

On Thu, May 14, 2020 at 5:05 PM Theodore Y. Ts'o <tytso@mit.edu> wrote:
> If the goal is a git local cache, we have this today.  I'm not sure
> this is what Caleb was asking for, though:
>
> git clone --bare https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git base
> git clone --reference base https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git ext4

For that sort of use-case, git-worktree may also be a suitable solution.
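For completeness, a minimal sketch of that approach (paths here are
illustrative): one clone's object store backing several working trees,
instead of several full clones:

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"

git init -q main
git -C main -c user.email=me@example.com -c user.name=me \
    commit -q --allow-empty -m 'initial commit'

# Add a second working tree (on a new "hotfix" branch) that shares
# main's object database rather than re-cloning it.
git -C main worktree add -q ../hotfix
git -C hotfix rev-parse --abbrev-ref HEAD
```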


* Re: Add a "Flattened Cache" to `git --clone`?
  2020-05-14 21:05   ` Theodore Y. Ts'o
  2020-05-14 21:09     ` Eric Sunshine
@ 2020-05-14 21:10     ` Konstantin Ryabitsev
  2020-05-14 21:23       ` Junio C Hamano
  2020-05-14 21:33     ` Caleb Gray
  2 siblings, 1 reply; 17+ messages in thread
From: Konstantin Ryabitsev @ 2020-05-14 21:10 UTC (permalink / raw)
  To: Theodore Y. Ts'o; +Cc: Caleb Gray, git

On Thu, May 14, 2020 at 05:05:01PM -0400, Theodore Y. Ts'o wrote:
> > 
> > I don't think it's a common workflow for someone to repeatedly clone 
> > linux.git. Automated processes like CI would be doing it, but they tend 
> > to blow away the local disk between jobs, so they are unlikely to 
> > benefit from any native git local cache for something like this (in 
> > fact, we recommend that people use clone.bundle files for their CI 
> > needs, as described here: 
> > https://www.kernel.org/best-way-to-do-linux-clones-for-your-ci.html).
> 
> If the goal is a git local cache, we have this today.  I'm not sure
> this is what Caleb was asking for, though:

Right, I think I misunderstood his request -- I'd assumed we were 
talking about local cache, whereas it's about server-side or proxy-side 
cache.

I think something like git-caching-proxy would be a neat project, 
because it would significantly improve mirroring for CI deployments 
without requiring that each individual job implements clone.bundle 
prefetching.

-K


* Re: Add a "Flattened Cache" to `git --clone`?
  2020-05-14 20:33 ` Konstantin Ryabitsev
  2020-05-14 20:54   ` Bryan Turner
  2020-05-14 21:05   ` Theodore Y. Ts'o
@ 2020-05-14 21:19   ` Junio C Hamano
  2 siblings, 0 replies; 17+ messages in thread
From: Junio C Hamano @ 2020-05-14 21:19 UTC (permalink / raw)
  To: Konstantin Ryabitsev; +Cc: Caleb Gray, git

Konstantin Ryabitsev <konstantin@linuxfoundation.org> writes:

> On Thu, May 14, 2020 at 07:34:08AM -0700, Caleb Gray wrote:
>> ...
>> To subsequent clones (until cache invalidated) using the "flattened
>> cache" version (presumably built while fulfilling the first clone
>> request above):
>> 
>>  Cloning into 'linux'...
>>  Receiving cache: 100% (7344127/7344127), 1.22 GiB | 8.51 MiB/s, done.
>
> I don't think it's a common workflow for someone to repeatedly clone 
> linux.git. Automated processes like CI would be doing it, but they tend 
> to blow away the local disk between jobs, so they are unlikely to 
> benefit from any native git local cache for something like this (in 
> fact, we recommend that people use clone.bundle files for their CI 
> needs, as described here: 
> https://www.kernel.org/best-way-to-do-linux-clones-for-your-ci.html).

I have a feeling that the use case you are talking about is
different from the use case the original message assumes needs
help (even though the original message lacks substance and it is
hard to guess what idea is being proposed).

Given the phrase like "while fulfilling the first clone request", I
took it to mean that a cache would sit on the source side, not on
the client side.  You seem to be talking about keeping a copy of
what you earlier cloned to save incoming bandwidth on the client
side.



* Re: Add a "Flattened Cache" to `git --clone`?
  2020-05-14 21:10     ` Konstantin Ryabitsev
@ 2020-05-14 21:23       ` Junio C Hamano
  2020-05-14 21:44         ` Konstantin Ryabitsev
  0 siblings, 1 reply; 17+ messages in thread
From: Junio C Hamano @ 2020-05-14 21:23 UTC (permalink / raw)
  To: Konstantin Ryabitsev; +Cc: Theodore Y. Ts'o, Caleb Gray, git

Konstantin Ryabitsev <konstantin@linuxfoundation.org> writes:

> I think something like git-caching-proxy would be a neat project, 
> because it would significantly improve mirroring for CI deployments 
> without requiring that each individual job implements clone.bundle 
> prefetching.

What are we improving with such a proxy, though?

Not bandwidth to the client, apparently.  I thought that with the
reachability bitmap on the server side with reusing packed object,
it was more or less a solved problem that the server end spends way
too much time enumerating, deltifying and compressing the object
data?


* Re: Add a "Flattened Cache" to `git --clone`?
  2020-05-14 21:05   ` Theodore Y. Ts'o
  2020-05-14 21:09     ` Eric Sunshine
  2020-05-14 21:10     ` Konstantin Ryabitsev
@ 2020-05-14 21:33     ` Caleb Gray
  2020-05-14 21:56       ` Junio C Hamano
  2 siblings, 1 reply; 17+ messages in thread
From: Caleb Gray @ 2020-05-14 21:33 UTC (permalink / raw)
  To: git

To Clarify: I'm talking about a server-side only cache which behaves
much like a `tar` file: it is a flat version of exactly(*) what ends
up on the client's storage. When a client runs `git --clone` and
there's a valid cache on the other end, that's all that gets streamed.

Konstantin's point that a repo like Linux is bound to see little/no
benefit (in fact, it'll just constantly invalidate/rewrite the ~1 GB
cache) is reasonable. This feature definitely targets the "niche"
audience of repos with less-frequent-pushes-to-master-than-clones.

Bryan is exactly on the right track for what I'm referring to: the CDN
approach did come to mind (and is superior in nearly every way).

Junio nailed it: I'm not hoping for anything revolutionary here, just
hoping to reduce the redundant steps in clone down to a single
(presumably faster) step.

If the community agrees that there's little/no benefit given the
limitations of having a "cache for master and that's all," I'm also
more than capable of designing a more useful (and more complex)
graph/reduce-based solution which could dynamically bundle the most
statistically relevant data for whatever context the code is working
in, though I can't commit to any sort of deadline for that sort of
contribution.



On Thu, May 14, 2020 at 2:05 PM Theodore Y. Ts'o <tytso@mit.edu> wrote:
>
> On Thu, May 14, 2020 at 04:33:26PM -0400, Konstantin Ryabitsev wrote:
> > > Assuming my idea doesn't contradict other best practices or standards
> > > already in place,  I'd like to transform the typical `git clone` flow
> > > from:
> > >
> > >  Cloning into 'linux'...
> > >  remote: Enumerating objects: 4154, done.
> > >  remote: Counting objects: 100% (4154/4154), done.
> > >  remote: Compressing objects: 100% (2535/2535), done.
> > >  remote: Total 7344127 (delta 2564), reused 2167 (delta 1612),
> > > pack-reused 7339973
> > >  Receiving objects: 100% (7344127/7344127), 1.22 GiB | 8.51 MiB/s, done.
> > >  Resolving deltas: 100% (6180880/6180880), done.
> > >
> > > To subsequent clones (until cache invalidated) using the "flattened
> > > cache" version (presumably built while fulfilling the first clone
> > > request above):
> > >
> > >  Cloning into 'linux'...
> > >  Receiving cache: 100% (7344127/7344127), 1.22 GiB | 8.51 MiB/s, done.
> >
> > I don't think it's a common workflow for someone to repeatedly clone
> > linux.git. Automated processes like CI would be doing it, but they tend
> > to blow away the local disk between jobs, so they are unlikely to
> > benefit from any native git local cache for something like this (in
> > fact, we recommend that people use clone.bundle files for their CI
> > needs, as described here:
> > https://www.kernel.org/best-way-to-do-linux-clones-for-your-ci.html).
>
> If the goal is a git local cache, we have this today.  I'm not sure
> this is what Caleb was asking for, though:
>
> git clone --bare https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git base
> git clone --reference base https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git ext4
>
>                                                         - Ted


* Re: Add a "Flattened Cache" to `git --clone`?
  2020-05-14 21:23       ` Junio C Hamano
@ 2020-05-14 21:44         ` Konstantin Ryabitsev
  2020-05-15 21:42           ` Eric Wong
  0 siblings, 1 reply; 17+ messages in thread
From: Konstantin Ryabitsev @ 2020-05-14 21:44 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Theodore Y. Ts'o, Caleb Gray, git

On Thu, May 14, 2020 at 02:23:44PM -0700, Junio C Hamano wrote:
> > I think something like git-caching-proxy would be a neat project, 
> > because it would significantly improve mirroring for CI deployments 
> > without requiring that each individual job implements clone.bundle 
> > prefetching.
> 
> What are we improving with such a proxy, though?
> 
> Not bandwidth to the client, apparently. 

Well, if it sits in front of the CI subnet, then it *does* save 
bandwidth.

Here's an example with the exact situation we have:

- the Gerrit server is on the US West Coast
- the CI builder is on the East Coast
- each CI job does a full transfer of the multi-MB repo across the 
  continent, even when cloning shallow

We solve this by having a local mirror of the repository, but this 
requires active mirroring to be set up in advance. A caching proxy that could:

- receive a request for a repository
- stream the response back to the client
- cache objects locally
- use local cache to construct future requests, so only missing objects 
  are fetched from the remote repo regardless of the haves on the actual 
  client...

...now, that would be kinda neat, but I'm not sure how sane or fragile 
that setup would be. :)

> I thought that with the
> reachability bitmap on the server side with reusing packed object,
> it was more or less a solved problem that the server end spends way
> too much time enumerating, deltifying and compressing the object
> data?

Indeed, it's not really solving anything for this case.

-K


* Re: Add a "Flattened Cache" to `git --clone`?
  2020-05-14 21:33     ` Caleb Gray
@ 2020-05-14 21:56       ` Junio C Hamano
  2020-05-14 22:04         ` Caleb Gray
  0 siblings, 1 reply; 17+ messages in thread
From: Junio C Hamano @ 2020-05-14 21:56 UTC (permalink / raw)
  To: Caleb Gray; +Cc: git

Caleb Gray <hey@calebgray.com> writes:

> To Clarify: I'm talking about a server-side only cache which behaves
> much like a `tar` file: it is a flat version of exactly(*) what ends
> up on the client's storage. When a client runs `git --clone` and
> there's a valid cache on the other end, that's all that gets streamed.

So this is to save server processing time only.  It does not save
bandwidth (the "cache" is a bit-for-bit identical replay of the clone
request it served earlier), and it does not save client processing
cycles (as the receiving end must validate the whole packdata it
received before it can even know what objects it received).

OK.






* Re: Add a "Flattened Cache" to `git --clone`?
  2020-05-14 21:56       ` Junio C Hamano
@ 2020-05-14 22:04         ` Caleb Gray
  2020-05-14 22:30           ` Junio C Hamano
  2020-05-14 22:44           ` Bryan Turner
  0 siblings, 2 replies; 17+ messages in thread
From: Caleb Gray @ 2020-05-14 22:04 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

Actually those are the steps that I'm explicitly hoping can be
skipped, both on server and client, after the first successful clone
request transaction. The cache itself would be of the end resulting
`.git` directory (client side)... unless I have misconceptions about
the complexity of reproducing what ends up on the client side from the
server side... I figured the shared library probably offers endpoints
for the information I'd need to achieve that.


On Thu, May 14, 2020 at 2:56 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Caleb Gray <hey@calebgray.com> writes:
>
> > To Clarify: I'm talking about a server-side only cache which behaves
> > much like a `tar` file: it is a flat version of exactly(*) what ends
> > up on the client's storage. When a client runs `git --clone` and
> > there's a valid cache on the other end, that's all that gets streamed.
>
> So this is to save server processing time only.  It does not save
> bandwidth (the "cache" is a bit-for-bit identical replay of the clone
> request it served earlier), and it does not save client processing
> cycles (as the receiving end must validate the whole packdata it
> received before it can even know what objects it received).
>
> OK.
>
>
>
>


* Re: Add a "Flattened Cache" to `git --clone`?
  2020-05-14 22:04         ` Caleb Gray
@ 2020-05-14 22:30           ` Junio C Hamano
  2020-05-14 22:44           ` Bryan Turner
  1 sibling, 0 replies; 17+ messages in thread
From: Junio C Hamano @ 2020-05-14 22:30 UTC (permalink / raw)
  To: Caleb Gray; +Cc: git

Caleb Gray <hey@calebgray.com> writes:

> Actually those are the steps that I'm explicitly hoping can be
> skipped, both on server and client, after the first successful clone
> request transaction. The cache itself would be of the end resulting
> `.git` directory (client side)... unless I have misconceptions about

If you look at .git/objects/pack/ directory in your repository that
is a clone of somebody else, most likely you'd find even number of
files in there, those whose filename ends with .pack and their
counterparts whose filename ends with .idx extension.  Both files
must exist to perform any local operation, but during the initial
cloning, only the bits in the former are transferred, and the
contents of the latter must be constructed from the bits in the
former.
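This is observable with `git index-pack`, which is (roughly) what the
client runs during a clone: given only the .pack, it reconstructs the
.idx, validating every object along the way. A small local sketch:

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"

git init -q repo
git -C repo -c user.email=me@example.com -c user.name=me \
    commit -q --allow-empty -m 'initial commit'
git -C repo repack -adq            # collapse everything into one pack

# Take only the .pack and rebuild its .idx from scratch, exactly as
# a cloning client would.
cp repo/.git/objects/pack/*.pack lone.pack
git -C repo index-pack ../lone.pack
ls lone.idx
```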

You can introduce a new protocol that copies the contents of the
.idx, but the contents of that file MUST be validated on the
receiving end, which entails the same amount of computation as our
clients currently spend to construct it out of .pack, so in the end,
you'd be wasting more bandwidth to transfer .idx which is redundant
information without saving processing cycles.


* Re: Add a "Flattened Cache" to `git --clone`?
  2020-05-14 22:04         ` Caleb Gray
  2020-05-14 22:30           ` Junio C Hamano
@ 2020-05-14 22:44           ` Bryan Turner
  1 sibling, 0 replies; 17+ messages in thread
From: Bryan Turner @ 2020-05-14 22:44 UTC (permalink / raw)
  To: Caleb Gray; +Cc: Junio C Hamano, Git Users

On Thu, May 14, 2020 at 3:05 PM Caleb Gray <hey@calebgray.com> wrote:
>
> Actually those are the steps that I'm explicitly hoping can be
> skipped, both on server and client, after the first successful clone
> request transaction. The cache itself would be of the end resulting
> `.git` directory (client side)... unless I have misconceptions about
> the complexity of reproducing what ends up on the client side from the
> server side... I figured the shared library probably offers endpoints
> for the information I'd need to achieve that.

I don't know that such an approach would ever get accepted. At most it
could only be a partial replica of a `.git` directory. For example,
including `.git/config` or `.git/hooks` carries some heavy security
considerations that make it very unlikely such a change would get
accepted. When you pare down the things from the `.git` directory that
can reasonably be included, I suspect you're pretty much going to be
left with `.git/objects/pack/pack-<something>.pack`, and perhaps
`.git/packed-refs` (although the ref negotiations the client needs to
do in order to even request a pack means you're unlikely to actually
_benefit_ from including `packed-refs`).

For such client-side caching, really you may be better off not trying
to reinvent the wheel and instead, as others have suggested, simply
use `git clone --reference` (possibly plus `--dissociate` if you don't
want any long-term connection between clones) to allow `git clone` to
reference all the objects you have available locally to skip most of
the pack transfer. If you do this, then `git clone` can _already_ make
use of local `idx` files, in addition to packs, to save work.
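A minimal offline sketch of that suggestion, with a local repo standing in
for the remote URL:

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"

git init -q upstream
git -C upstream -c user.email=me@example.com -c user.name=me \
    commit -q --allow-empty -m 'initial commit'

# "base.git" plays the role of the long-lived local cache.
git clone -q --bare upstream base.git

# Borrow objects from the cache, then dissociate so the new clone
# owns a full copy and no longer depends on base.git existing.
git clone -q --reference base.git --dissociate upstream work
git -C work rev-list --count HEAD
```

After `--dissociate`, `work/.git/objects/info/alternates` is gone, so
deleting `base.git` later cannot corrupt the clone.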

Bryan Turner


* Re: Add a "Flattened Cache" to `git --clone`?
  2020-05-14 21:44         ` Konstantin Ryabitsev
@ 2020-05-15 21:42           ` Eric Wong
  2020-05-17 22:12             ` Konstantin Ryabitsev
  0 siblings, 1 reply; 17+ messages in thread
From: Eric Wong @ 2020-05-15 21:42 UTC (permalink / raw)
  To: Konstantin Ryabitsev
  Cc: Junio C Hamano, Theodore Y. Ts'o, Caleb Gray, git

Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> On Thu, May 14, 2020 at 02:23:44PM -0700, Junio C Hamano wrote:
> > > I think something like git-caching-proxy would be a neat project, 
> > > because it would significantly improve mirroring for CI deployments 
> > > without requiring that each individual job implements clone.bundle 
> > > prefetching.
> > 
> > What are we improving with such a proxy, though?
> > 
> > Not bandwidth to the client, apparently. 
> 
> Well, if it sits in front of the CI subnet, then it *does* save 
> bandwidth.

Agreed.

> Here's an example with the exact situation we have:
> 
> - the Gerrit server is on the US West Coast
> - the CI builder is on the East Coast
> - each CI job does a full transfer of the multi-MB repo across the 
>   continent, even when cloning shallow
> 
> We solve this by having a local mirror of the repository, but this 
> requires active mirroring to be pre-setup. A caching proxy that could:
> 
> - receive a request for a repository
> - stream the response back to the client
> - cache objects locally
> - use local cache to construct future requests, so only missing objects 
>   are fetched from the remote repo regardless of the haves on the actual 
>   client...

An off-the-shelf HTTP caching proxy (e.g. polipo, Squid) could
do a good enough job with dumb HTTP clones (via GIT_SMART_HTTP=0
env).
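As a sketch of the client side of that setup (the proxy host below is
hypothetical; `GIT_SMART_HTTP=0` is the documented switch for forcing the
dumb protocol):

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"
git init -q repo

# Route this repo's HTTP traffic through a (hypothetical) caching proxy.
git -C repo config http.proxy http://cache.example.internal:3128
git -C repo config http.proxy

# With the dumb protocol forced, a generic HTTP cache can store the
# individual ref and pack files the client requests:
#   GIT_SMART_HTTP=0 git clone https://example.com/foo/bar.git
```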

With well-packed repos, the dumb HTTP transfer cost shouldn't be
too high (and git 2.10+ got way faster on the client side with
poorly-packed repos, thanks to the Linux kernel-derived list.h).

The occasional full repack on the source git server will
invalidate caches and result in a giant download; but it's
better than no caching at all and doing giant cross-country
transfers all day long.

That said, I'm not sure if any client-side caching proxies can
MITM HTTPS and save bandwidth with HTTPS everywhere, nowadays.
I seem to recall polipo being abandoned because of HTTPS.
Maybe there's a caching HTTPS MITM proxy out there...


* Re: Add a "Flattened Cache" to `git --clone`?
  2020-05-15 21:42           ` Eric Wong
@ 2020-05-17 22:12             ` Konstantin Ryabitsev
       [not found]               ` <1061511589863147@mail.yandex.ru>
  0 siblings, 1 reply; 17+ messages in thread
From: Konstantin Ryabitsev @ 2020-05-17 22:12 UTC (permalink / raw)
  To: Eric Wong; +Cc: Junio C Hamano, Theodore Y. Ts'o, Caleb Gray, git

On Fri, May 15, 2020 at 09:42:57PM +0000, Eric Wong wrote:
> That said, I'm not sure if any client-side caching proxies can
> MITM HTTPS and save bandwidth with HTTPS everywhere, nowadays.
> I seem to recall polipo being abandoned because of HTTPS.
> Maybe there's a caching HTTPS MITM proxy out there...

Right, this can't operate as a transparent proxy. However, it could work 
in combination with insteadOf on the client, e.g., if the repo URL is 
https://example.com/foo/bar.git, the CI builder could set a global 
insteadOf in /etc/gitconfig before kicking off the job:

[url http://local.proxy]
  insteadOf = https://example.com

This way CI job maintainers could continue to use canonical repo URLs, 
but actual requests would go out to the local proxy and be cached.
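The same rewrite can be set from the command line (proxy URL hypothetical,
as above):

```shell
set -e
export HOME=$(mktemp -d)   # isolate the "global" config for this sketch

# Rewrite the canonical URL to the local caching proxy.
git config --global 'url.http://local.proxy.insteadOf' 'https://example.com'
git config --global --get 'url.http://local.proxy.insteadOf'
```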

-K


* Re: Add a "Flattened Cache" to `git --clone`?
       [not found]               ` <1061511589863147@mail.yandex.ru>
@ 2020-05-25 14:02                 ` Caleb Gray
  0 siblings, 0 replies; 17+ messages in thread
From: Caleb Gray @ 2020-05-25 14:02 UTC (permalink / raw)
  To: Konstantin Tokarev
  Cc: Konstantin Ryabitsev, Eric Wong, Junio C Hamano,
	Theodore Y. Ts'o, git

For a repo like git itself, the assertions regarding the way git
currently builds its data (in fact, including the `checkout` portion)
do compete directly with the "cached result" methodology! Holy shit
guys, I'm impressed as hell.

tl;dr: The way I read the raw numbers, `git` ends up being as fast as
(or faster than) a "cache" of the .git folder. Without doing further
research, I'm inclined to agree that the previously mentioned bitmap
method is already effectively as efficient as (more efficient
than!?) a cache.


Methodology/Reasoning:
virtualized: verified zero network chatter on eth0 before and after each test.
tcpflow: to capture the bits for the entire transaction... the listener
was opened just before `git clone` started and closed just after it
ended. (not worrying about protocols/overhead)
tar: to compare the size of the repository on disk with the tcpflow
results. (not worrying about compensating for
headers/metadata/overhead)
gzip: to compensate (theoretically; I haven't verified this) for
seemingly arbitrary size differences when downloading over HTTPS.
time: (really) rough measure of execution time.


Commands used to generate files:
*.tcpflow: `sudo tcpflow -p -c -i eth0 > $filename.tcpflow`
*.tar: `tar cf $filename.tar .git`
*.gz: `gzip -9 $filename.tar`


Results:

75M kernelorg.tar
72M kernelorg.tar.gz
69M kernelorg_git.tcpflow
69M kernelorg_https.tcpflow

145M github.tar
143M github.tar.gz
143M github_git.tcpflow
142M github_https.tcpflow


Other Tests (sanity checks):

Cloned a gitea mirror of kernel.org's git:
69M gitea_git.tcpflow
69M gitea_https.tcpflow

Cloned a bitbucket mirror of kernel.org's git:
69M bitbucket_git.tcpflow
69M bitbucket_https.tcpflow

$ time git clone git://git.kernel.org/pub/scm/git/git.git
Cloning into 'git'...
remote: Enumerating objects: 15475, done.
remote: Counting objects: 100% (15475/15475), done.
remote: Compressing objects: 100% (861/861), done.
remote: Total 287977 (delta 14910), reused 14907 (delta 14610),
pack-reused 272502
Receiving objects: 100% (287977/287977), 66.09 MiB | 4.87 MiB/s, done.
Resolving deltas: 100% (217420/217420), done.

real    0m20.000s
user    0m15.414s
sys     0m1.606s

$ time wget https://calebgray.com/public/kernelorg.tar.gz
--2020-05-25 06:11:29--  https://calebgray.com/public/kernelorg.tar.gz
Resolving calebgray.com (calebgray.com)... 192.3.203.78
Connecting to calebgray.com (calebgray.com)|192.3.203.78|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 74593708 (71M) [application/octet-stream]
Saving to: ‘kernelorg.tar.gz’

kernelorg.tar.gz
100%[========================================================================================>]
 71.14M  4.81MB/s    in 19s

2020-05-25 06:11:48 (3.79 MB/s) - ‘kernelorg.tar.gz’ saved [74593708/74593708]

real 0m19.420s
user 0m0.030s
sys 0m0.280s


Thanks everyone for your input and time! I love git, you guys do great work!

P.S. I ran a few other benchmarks outside of these, and the timing
always worked out to be more or less the same between the reported
transfer rate (as told by my router, as well) and the "real" time it
took to download (for both `git` and `wget`).

P.P.S. I haven't investigated the reason for the github repo being
nearly twice the size of the kernel.org hosted copy. That one stands
out as potentially relevant to the proxy discussion, or there may
actually be a difference in the repos' data. Curiosity will likely get
the best of me eventually.




On Mon, May 18, 2020 at 9:40 PM Konstantin Tokarev <annulen@yandex.ru> wrote:
>
>
>
> 18.05.2020, 01:12, "Konstantin Ryabitsev" <konstantin@linuxfoundation.org>:
> > On Fri, May 15, 2020 at 09:42:57PM +0000, Eric Wong wrote:
> >>  That said, I'm not sure if any client-side caching proxies can
> >>  MITM HTTPS and save bandwidth with HTTPS everywhere, nowadays.
> >>  I seem to recall polipo being abandoned because of HTTPS.
> >>  Maybe there's a caching HTTPS MITM proxy out there...
> >
> > Right, this can't operate as a transparent proxy.
>
> AFAIK, Squid can do MITM, caching and operate transparently.
> In the past it was done via ssl_bump directive, but seems like syntax changed a bit
> in modern versions.

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2020-05-25 14:02 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-14 14:34 Add a "Flattened Cache" to `git --clone`? Caleb Gray
2020-05-14 20:33 ` Konstantin Ryabitsev
2020-05-14 20:54   ` Bryan Turner
2020-05-14 21:05   ` Theodore Y. Ts'o
2020-05-14 21:09     ` Eric Sunshine
2020-05-14 21:10     ` Konstantin Ryabitsev
2020-05-14 21:23       ` Junio C Hamano
2020-05-14 21:44         ` Konstantin Ryabitsev
2020-05-15 21:42           ` Eric Wong
2020-05-17 22:12             ` Konstantin Ryabitsev
     [not found]               ` <1061511589863147@mail.yandex.ru>
2020-05-25 14:02                 ` Caleb Gray
2020-05-14 21:33     ` Caleb Gray
2020-05-14 21:56       ` Junio C Hamano
2020-05-14 22:04         ` Caleb Gray
2020-05-14 22:30           ` Junio C Hamano
2020-05-14 22:44           ` Bryan Turner
2020-05-14 21:19   ` Junio C Hamano

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.