git.vger.kernel.org archive mirror
* Questions about partial clone with '--filter=tree:0'
@ 2020-10-20 17:09 Alexandr Miloslavskiy
  2020-10-20 22:29 ` Taylor Blau
  2020-10-26 18:24 ` Jonathan Tan
  0 siblings, 2 replies; 9+ messages in thread
From: Alexandr Miloslavskiy @ 2020-10-20 17:09 UTC (permalink / raw)
  To: git; +Cc: christian.couder, jonathantanmy, Marc Strapetz

This is an edited copy of a message I sent 2 weeks ago, which
unfortunately didn't receive any replies. I tried to make it shorter
this time :)

----

We are implementing a git UI. One interesting case is a repository
cloned with '--filter=tree:0', because such a filter makes basic git
operations such as file log and blame a lot harder to run.

The problems and potential solutions are outlined below. We should be
able to make patches for (2) and (3) if it makes sense to patch these.

(1) Is it even considered a realistic use case?
-----------------------------------------------
Summary: is '--filter=tree:0' a realistic or "crazy" scenario that is
not considered worthy of supporting?

I decided to use the Linux repo, which is reasonably large, and it
seems that '--filter=tree:0' could be desirable because it saves disk
space (~0.66gb) and network (~0.54gb):

https://github.com/torvalds/linux.git
   951025 commits total.

   git clone --bare <url>
	7'624'042 objects
	   2.86gb network
	   3.10gb disk
   git clone --bare --filter=blob:none <url>
	5'484'714 (71.9%) objects
	   1.01gb (35.3%) network
	   1.16gb (37.4%) disk
   git clone --bare --filter=tree:0 <url>
	  951'693 (12.5%) objects
	   0.47gb (16.4%) network
	   0.50gb (16.1%) disk
   git clone --bare --depth 1 --branch master <url>
	   74'380 ( 0.9%) objects
	   0.19gb ( 6.6%) network
	   0.19gb ( 6.1%) disk

(2) A command to enrich repo with trees
---------------------------------------
There is no good way to "un-partial" a repository that was cloned with
'--filter=tree:0' so that it has all trees but no blobs.

There seems to be a dirty way of doing that by abusing 'fetch --deepen'
which happens to skip "ref tip already present locally" check, but
it will also re-download all commits, which means extra ~0.5gb network
in case of Linux repo.

(3) A command to download ALL trees and/or blobs for a subpath
--------------------------------------------------------------
Summary: Running a blame or file log in a '--filter=tree:0' repo is
currently very inefficient, to the point where it arguably does not
really work.

The suggested command will be able to accept a path and download ALL
trees and/or blobs that match it.

This will solve many problems at once:
* Solve (2)
* Make it possible to prepare for efficient blame and file log
* Enable a new experience with super-mono-repos, where the user will
   be able to download only part of the repo by path.

Currently '--filter=sparse:oid' is there to support that, but it is
very hard to use on the client side, because it requires the paths to
be already present in a commit on the server.

For a possible solution, it sounds reasonable to have a filter like:
   --filter=sparse:pathlist=/1/2
The path list could be delimited with some special character, and the
paths themselves could be escaped.
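Since the 'sparse:pathlist' filter is only a proposal, the encoding below is purely hypothetical; it is a sketch of how a client might build such a filter value, percent-escaping paths so the delimiter stays unambiguous:

```python
from urllib.parse import quote, unquote

def encode_pathlist(paths):
    """Hypothetical encoding for the proposed 'sparse:pathlist' filter:
    percent-escape each path (so the ':' delimiter cannot occur inside
    a path) and join the escaped paths with ':'."""
    return "sparse:pathlist=" + ":".join(quote(p, safe="/") for p in paths)

def decode_pathlist(value):
    """Inverse: split on the delimiter and unescape each path."""
    return [unquote(p) for p in value.split("=", 1)[1].split(":")]

print(encode_pathlist(["/1/2", "/docs/a:b.txt"]))
# sparse:pathlist=/1/2:/docs/a%3Ab.txt
```

Any escaping scheme would do; percent-escaping is used here only because it round-trips cleanly.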

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Questions about partial clone with '--filter=tree:0'
  2020-10-20 17:09 Questions about partial clone with '--filter=tree:0' Alexandr Miloslavskiy
@ 2020-10-20 22:29 ` Taylor Blau
  2020-10-21 17:10   ` Alexandr Miloslavskiy
  2020-10-26 18:24 ` Jonathan Tan
  1 sibling, 1 reply; 9+ messages in thread
From: Taylor Blau @ 2020-10-20 22:29 UTC (permalink / raw)
  To: Alexandr Miloslavskiy; +Cc: git, christian.couder, jonathantanmy, Marc Strapetz

Hi Alexandr,

On Tue, Oct 20, 2020 at 07:09:36PM +0200, Alexandr Miloslavskiy wrote:
> This is an edited copy of a message I sent 2 weeks ago, which
> unfortunately didn't receive any replies. I tried to make it shorter
> this time :)

Oops. That can happen sometimes, but thanks for re-sending. I'll try to
answer the basic points below.

> ----
>
> We are implementing a git UI. One interesting case is a repository
> cloned with '--filter=tree:0', because such a filter makes basic git
> operations such as file log and blame a lot harder to run.
>
> The problems and potential solutions are outlined below. We should be
> able to make patches for (2) and (3) if it makes sense to patch these.
>
> (1) Is it even considered a realistic use case?
> -----------------------------------------------
> Summary: is '--filter=tree:0' a realistic or "crazy" scenario that is
> not considered worthy of supporting?

It's not an unrealistic scenario, but it might be for what you're trying
to build. If your UI needs to run, say, 'git log --patch' to show a
historical revision, then you're going to need to fault in a lot of
missing objects.

If that's not something that you need to do often or ever, then having
'--filter=tree:0' is a good way to get the least amount of data possible
when using a partial clone. But if you're going to be performing
operations that need those missing objects, you're probably better off
eating the network/storage cost all at once, rather than making the
user wait for Git to fault in the set of missing objects that it
happens to need.

> (2) A command to enrich repo with trees
> ---------------------------------------
> There is no good way to "un-partial" a repository that was cloned with
> '--filter=tree:0' so that it has all trees but no blobs.

There is no command to do that directly, but it is something that Git is
capable of.

It would look something like:

  $ git config remote.origin.partialclonefilter 'blob:none'

Now your repository is in a state where it has no blobs or trees, but
the filter does not prohibit it from getting the trees, so you can ask
it to grab everything you're missing with:

  $ git fetch origin

This should even be a pretty fast operation for repositories that have
bitmaps due to some topics that Peff and I sent to the list a while ago.
If it isn't, please let me know.

> There seems to be a dirty way of doing that by abusing 'fetch --deepen'
> which happens to skip "ref tip already present locally" check, but
> it will also re-download all commits, which means extra ~0.5gb network
> in case of Linux repo.

Mmm, this is probably not what you're looking for. You may be confusing
shallow clones (to which --deepen is relevant) with partial clones
(to which --deepen is irrelevant).

> (3) A command to download ALL trees and/or blobs for a subpath
> --------------------------------------------------------------
> Summary: Running a blame or file log in a '--filter=tree:0' repo is
> currently very inefficient, to the point where it arguably does not
> really work.

This may be a "don't hold it that way" kind of response, but I don't
think that this is quite what you want. Recall that cloning a
repository with an object filter happens in two steps: first, an initial
download of all of the objects that it thinks you need, and then
(second) a follow-up fetch requesting the objects that you need to
populate your checkout.

I think what you probably want is a step 1.5 to tell Git "I'm not going
to ask for or care about the entirety of my working copy, I really just
want objects in path...", and you can do that with sparse checkouts. See
https://git-scm.com/docs/git-sparse-checkout for more.

The flow might be something like:

  $ git clone --sparse --filter=tree:0 git@yourhost.com:repo.git

and then:

  $ cd repo
  $ git sparse-checkout add foo bar baz
  $ git checkout .

Thanks,
Taylor


* Re: Questions about partial clone with '--filter=tree:0'
  2020-10-20 22:29 ` Taylor Blau
@ 2020-10-21 17:10   ` Alexandr Miloslavskiy
  2020-10-21 17:31     ` Taylor Blau
  0 siblings, 1 reply; 9+ messages in thread
From: Alexandr Miloslavskiy @ 2020-10-21 17:10 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, christian.couder, jonathantanmy, Marc Strapetz

On 21.10.2020 0:29, Taylor Blau wrote:
> Oops. That can happen sometimes, but thanks for re-sending. I'll try to
> answer the basic points below.

Thanks for stepping in!

>> (1) Is it even considered a realistic use case?
>> -----------------------------------------------
>> Summary: is '--filter=tree:0' a realistic or "crazy" scenario that is
>> not considered worthy of supporting?
>
> It's not an unrealistic scenario, but it might be for what you're trying
> to build. If your UI needs to run, say, 'git log --patch' to show a
> historical revision, then you're going to need to fault in a lot of
> missing objects.
>
> If that's not something that you need to do often or ever, then having
> '--filter=tree:0' is a good way to get the least amount of data possible
> when using a partial clone. But if you're going to be performing
> operations that need those missing objects, you're probably better off
> eating the network/storage cost all at once, rather than making the
> user wait for Git to fault in the set of missing objects that it
> happens to need.

We currently do not intend to use '--filter=tree:0' ourselves, but we
are trying to support all kinds of user repositories with our UI. So
we basically have these choices:

A) Declare '--filter=tree:0' repos as completely wrong and unsupported
    in our UI, also giving an option to "un-partial" them.

B) Support '--filter=tree:0' repos, but don't support operations such
    as blame and file log

C) Use some magic to efficiently download objects that will be needed
    for a command such as Blame, while keeping the rest of the repository
    partial. This is where the command described in (3) will help a lot.

We would of course prefer (C) if it's reasonably possible.

>> (2) A command to enrich repo with trees
>> ---------------------------------------
> There is no good way to "un-partial" a repository that was cloned with
> '--filter=tree:0' so that it has all trees but no blobs.
>
> There is no command to do that directly, but it is something that Git is
> capable of.
>
> It would look something like:
>
>    $ git config remote.origin.partialclonefilter 'blob:none'
>
> Now your repository is in a state where it has no blobs or trees, but
> the filter does not prohibit it from getting the trees, so you can ask
> it to grab everything you're missing with:
>
>    $ git fetch origin
>
> This should even be a pretty fast operation for repositories that have
> bitmaps due to some topics that Peff and I sent to the list a while ago.
> If it isn't, please let me know.

Unfortunately this does not work as expected. Try the following steps:

A) Clone repo with '--filter=tree:0'
    $ git clone --bare --filter=tree:0 --branch master 
https://github.com/git/git.git

B) Change filter to 'blob:none'
    $ cd git.git
    $ git config remote.origin.partialclonefilter 'blob:none'

C) fetch
    $ git fetch origin
    Note that there is no 'Receiving objects:' output.

D) Verify that trees were downloaded
    $ git cat-file -p HEAD | grep tree
      tree ee5b5b41305cda618862beebc9c94859ae276e5a
    $ git cat-file -t ee5b5b41305cda618862beebc9c94859ae276e5a
      Note that 1 object gets downloaded. This confirms that (C) didn't
      achieve the goal.

This happens due to the 'check_exist_and_connected()' check in
'fetch_refs()'. Since the tip of the ref is already available locally
(even though all of its trees are missing), nothing is downloaded.

>> There seems to be a dirty way of doing that by abusing 'fetch --deepen'
>> which happens to skip "ref tip already present locally" check, but
>> it will also re-download all commits, which means extra ~0.5gb network
>> in case of Linux repo.
>
> Mmm, this is probably not what you're looking for. You may be confusing
> shallow clones (to which --deepen is relevant) with partial clones
> (to which --deepen is irrelevant).

Yes, '--deepen' is intended for shallow clones. But abusing it for
partial clones allows skipping the 'check_exist_and_connected()' check.
However, I did more testing today, and in many cases the server itself
refuses to send objects, probably due to the sent 'have' lines or
something else. So even '--deepen' doesn't really help.

> I think what you probably want is a step 1.5 to tell Git "I'm not going
> to ask for or care about the entirety of my working copy, I really just
> want objects in path...", and you can do that with sparse checkouts. See
> https://git-scm.com/docs/git-sparse-checkout for more.

For simplicity of discussion, let's focus on the problem of running
Blame efficiently in a repo that was cloned with '--filter=tree:0'. In
order to blame file '/1/2/Foo.txt', we will need the following:

* Trees '/1'
* Trees '/1/2'
* Blobs '/1/2/Foo.txt'

All of these will be needed to an unknown commit depth. For simplicity,
the proposed command will download these for all commits. Specifying
a range of revisions could be nice, but I feel that it's not worth the
complexity.
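The list above can be enumerated mechanically. Here is a small sketch (plain Python over path strings, not real git objects) of the per-commit object set implied by that breakdown:

```python
def objects_needed_for_blame(path):
    """Per the breakdown above: blaming 'path' may need, for each commit,
    one tree per leading directory (to walk down to the file) plus the
    blob for the file itself."""
    parts = path.strip("/").split("/")
    trees = ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]
    return trees, "/" + "/".join(parts)

trees, blob = objects_needed_for_blame("/1/2/Foo.txt")
print(trees, blob)  # ['/1', '/1/2'] /1/2/Foo.txt
```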

Correct me if I'm wrong: I think that sparse checkout will not help to
achieve the goal?

This is why I suggest a command that will accept paths and send
requested objects, also forcing server to assume that all of them are
missing in client's repository.


* Re: Questions about partial clone with '--filter=tree:0'
  2020-10-21 17:10   ` Alexandr Miloslavskiy
@ 2020-10-21 17:31     ` Taylor Blau
  2020-10-21 17:46       ` Alexandr Miloslavskiy
  0 siblings, 1 reply; 9+ messages in thread
From: Taylor Blau @ 2020-10-21 17:31 UTC (permalink / raw)
  To: Alexandr Miloslavskiy
  Cc: Taylor Blau, git, christian.couder, jonathantanmy, Marc Strapetz

On Wed, Oct 21, 2020 at 07:10:02PM +0200, Alexandr Miloslavskiy wrote:
> We currently do not intend to use '--filter=tree:0' ourselves, but we
> are trying to support all kinds of user repositories with our UI. So
> we basically have these choices:
>
> A) Declare '--filter=tree:0' repos as completely wrong and unsupported
>    in our UI, also giving an option to "un-partial" them.
>
> B) Support '--filter=tree:0' repos, but don't support operations such
>    as blame and file log
>
> C) Use some magic to efficiently download objects that will be needed
>    for a command such as Blame, while keeping the rest of the repository
>    partial. This is where the command described in (3) will help a lot.
>
> We would of course prefer (C) if it's reasonably possible.

(C) is probably the most reasonable. If you have a promisor remote which
is missing objects, running 'git blame' etc. will transparently download
whatever objects it is missing.

> Unfortunately this does not work as expected. Try the following steps:
>
> A) Clone repo with '--filter=tree:0'
>    $ git clone --bare --filter=tree:0 --branch master
> https://github.com/git/git.git
>
> B) Change filter to 'blob:none'
>    $ cd git.git
>    $ git config remote.origin.partialclonefilter 'blob:none'
>
> C) fetch
>    $ git fetch origin
>    Note that there is no 'Receiving objects:' output.

Ah; I would have thought that the server would have sent objects, even
though we have lots of 'have' lines, since we are treating the server as
a promisor remote and might not have the full reachability closure over
the haves.

Jonathan Tan knows better than I do here. Maybe he could chime in.

> > I think what you probably want is a step 1.5 to tell Git "I'm not going
> > to ask for or care about the entirety of my working copy, I really just
> > want objects in path...", and you can do that with sparse checkouts. See
> > https://git-scm.com/docs/git-sparse-checkout for more.
>
> For simplicity of discussion, let's focus on the problem of running
> Blame efficiently in a repo that was cloned with '--filter=tree:0'. In
> order to blame file '/1/2/Foo.txt', we will need the following:
>
> * Trees '/1'
> * Trees '/1/2'
> * Blobs '/1/2/Foo.txt'
>
> All of these will be needed to an unknown commit depth. For simplicity,
> the proposed command will download these for all commits. Specifying
> a range of revisions could be nice, but I feel that it's not worth the
> complexity.
>
> Correct me if I'm wrong: I think that sparse checkout will not help to
> achieve the goal?

I see what you're saying. Here sparse-checkout and partial clones
confusingly diverge: what you really want is to say "I want all of the
objects that I need to construct this directory at any point in history"
so that you can run "git blame" on some path within that directory
without the need for a follow-up fetch.

> This is why I suggest a command that will accept paths and send
> requested objects, also forcing server to assume that all of them are
> missing in client's repository.

In any case, the '--filter=sparse:<oid>' bit is not recommended for
use, but perhaps this is a convincing use-case. I didn't follow the
partial clone development closely enough to know whether this has
already been discussed, but I'm sure that it has.

Thanks,
Taylor


* Re: Questions about partial clone with '--filter=tree:0'
  2020-10-21 17:31     ` Taylor Blau
@ 2020-10-21 17:46       ` Alexandr Miloslavskiy
  0 siblings, 0 replies; 9+ messages in thread
From: Alexandr Miloslavskiy @ 2020-10-21 17:46 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, christian.couder, jonathantanmy, Marc Strapetz

On 21.10.2020 19:31, Taylor Blau wrote:

 > If you have a promisor remote which is missing objects, running
 > 'git blame' etc. will transparently download whatever objects it
 > is missing.

This is correct, but it downloads things one at a time, which in the
case of a larger repo such as Linux could take weeks to complete. And
downloading more things at once isn't easy without the suggested
command.

It is possible to traverse the commit graph, requesting all discovered
objects at once, but again, in the case of Linux, that would mean
sending multiple requests with lists of 1 million+ OIDs. And the number
of requests is around the maximum tree depth. Doesn't sound nice.

> Jonathan Tan knows better than I do here. Maybe he could chime in.

I already CC'ed him, I hope he finds time to reply.

> I see what you're saying. Here sparse-checkout and partial clones
> confusingly diverge: what you really want is to say "I want all of the
> objects that I need to construct this directory at any point in history"
> so that you can run "git blame" on some path within that directory
> without the need for a follow-up fetch.

Right.

> In any case, the '--filter=sparse:<oid>' bit is not recommended for
> use, but perhaps this is a convincing use-case. I didn't follow the
> partial clone development closely enough to know whether this has
> already been discussed, but I'm sure that it has.

Unfortunately '--filter=sparse:<oid>' requires the list to be already
committed on the server, which greatly limits its usefulness.


* Re: Questions about partial clone with '--filter=tree:0'
  2020-10-20 17:09 Questions about partial clone with '--filter=tree:0' Alexandr Miloslavskiy
  2020-10-20 22:29 ` Taylor Blau
@ 2020-10-26 18:24 ` Jonathan Tan
  2020-10-26 18:44   ` Alexandr Miloslavskiy
  1 sibling, 1 reply; 9+ messages in thread
From: Jonathan Tan @ 2020-10-26 18:24 UTC (permalink / raw)
  To: alexandr.miloslavskiy
  Cc: git, christian.couder, jonathantanmy, marc.strapetz, me

> (1) Is it even considered a realistic use case?
> -----------------------------------------------
> Summary: is '--filter=tree:0' a realistic or "crazy" scenario that is
> not considered worthy of supporting?
> 
> I decided to use the Linux repo, which is reasonably large, and it
> seems that '--filter=tree:0' could be desirable because it saves disk
> space (~0.66gb) and network (~0.54gb):

Sorry for the late reply - I have been out of office for a while.

As Taylor said in another email, it's good for some use cases but
perhaps not for the "blame" one that you describe later.

> (2) A command to enrich repo with trees
> ---------------------------------------
> There is no good way to "un-partial" a repository that was cloned with
> '--filter=tree:0' so that it has all trees but no blobs.
> 
> There seems to be a dirty way of doing that by abusing 'fetch --deepen'
> which happens to skip "ref tip already present locally" check, but
> it will also re-download all commits, which means extra ~0.5gb network
> in case of Linux repo.

That's true. I made some progress with cbe566a071 ("negotiator/noop: add
noop fetch negotiator", 2020-08-18) (which adds a no-op negotiator, so
the client never reports its own commits as "have") but as you said in
another email, we still run into the problem that if we already have the
commit that we're fetching, we won't fetch it again.

> (3) A command to download ALL trees and/or blobs for a subpath
> --------------------------------------------------------------
> Summary: Running a blame or file log in a '--filter=tree:0' repo is
> currently very inefficient, to the point where it arguably does not
> really work.
> 
> The suggested command will be able to accept a path and download ALL
> trees and/or blobs that match it.
> 
> This will solve many problems at once:
> * Solve (2)
> * Make it possible to prepare for efficient blame and file log
> * Enable a new experience with super-mono-repos, where the user will
>    be able to download only part of the repo by path.

To clarify: we partially support the last point - "git clone" now
supports "--sparse". When used with "--filter", only the blobs in the
sparse checkout specification will be fetched, so users are already able
to download only the objects in a specific path. Having said that, I
think you also want the histories of these objects, so admittedly this
is not complete for your use case.

> Currently '--filter=sparse:oid' is there to support that, but it is
> very hard to use on the client side, because it requires the paths to
> be already present in a commit on the server.
> 
> For a possible solution, it sounds reasonable to have a filter like:
>    --filter=sparse:pathlist=/1/2
> The path list could be delimited with some special character, and the
> paths themselves could be escaped.

Having such an option (and teaching "blame" to use it to prefetch) would
indeed speed up "blame". But if we implement this, what would happen if
the user ran "blame" on the same file twice? I can't think of a way of
preventing the same fetch from happening twice except by checking the
existence of, say, the last 10 OIDs corresponding to that path. But if
we have the list of those 10 OIDs, we could just prefetch those 10 OIDs
without needing a new filter.
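The existence probe described here could look something like this (a sketch with assumed inputs: 'recent_oids' is the last few OIDs recorded for the path, 'have' is the set of objects present locally):

```python
def need_prefetch(recent_oids, have):
    """If every recent OID for the path is already present locally,
    assume an earlier blame already prefetched this path and skip the
    fetch; otherwise a prefetch is still needed."""
    return any(oid not in have for oid in recent_oids)

have = {"a1", "b2", "c3"}
print(need_prefetch(["a1", "b2"], have))  # False: all present, skip
print(need_prefetch(["a1", "zz"], have))  # True: something is missing
```

As the email notes, once you can compute 'recent_oids' you could just fetch those OIDs directly, which is the argument against needing a new filter.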

Another issue (but a smaller one) is that this does not fetch all the
objects necessary if the file being "blame"d has been renamed, but that
is probably solvable - we can just refetch with the old name.

Another possible solution that has been discussed before (but a much
more involved one) is to teach Git to be able to serve results of
computations, and then have "blame" be able to stitch that with local
data. (For example, "blame" could check the history of a certain path to
find the commit(s) that the remote has information of, query the remote
for those commits, and then stitch the results together with local
history.) This scheme would work not only for "blame" but for things
like "grep" (with history) and "log -S", whereas
"--filter=sparse:pathlist" would only work with "blame". But
admittedly, this solution is more involved.


* Re: Questions about partial clone with '--filter=tree:0'
  2020-10-26 18:24 ` Jonathan Tan
@ 2020-10-26 18:44   ` Alexandr Miloslavskiy
  2020-10-26 19:46     ` Jonathan Tan
  0 siblings, 1 reply; 9+ messages in thread
From: Alexandr Miloslavskiy @ 2020-10-26 18:44 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, christian.couder, marc.strapetz, me

On 26.10.2020 19:24, Jonathan Tan wrote:
> Sorry for the late reply - I have been out of office for a while.

I'm quite happy to get the replies at all, even if later. Thanks!

> As Taylor said in another email, it's good for some use cases but
> perhaps not for the "blame" one that you describe later.

OK, so our expectations seem to match your expectations, that's good.

> That's true. I made some progress with cbe566a071 ("negotiator/noop: add
> noop fetch negotiator", 2020-08-18) (which adds a no-op negotiator, so
> the client never reports its own commits as "have") but as you said in
> another email, we still run into the problem that if we already have the
> commit that we're fetching, we won't fetch it again.

Right, I already discovered 'fetch.negotiationAlgorithm=noop' and gave 
it a quick try, but it didn't seem to help at all.

> To clarify: we partially support the last point - "git clone" now
> supports "--sparse". When used with "--filter", only the blobs in the
> sparse checkout specification will be fetched, so users are already able
> to download only the objects in a specific path.

I see. Still, it seems that two other problems will be solved.

> Having said that, I
> think you also want the histories of these objects, so admittedly this
> is not complete for your use case.

Right.

> Having such an option (and teaching "blame" to use it to prefetch) would
> indeed speed up "blame". But if we implement this, what would happen if
> the user ran "blame" on the same file twice? I can't think of a way of
> preventing the same fetch from happening twice except by checking the
> existence of, say, the last 10 OIDs corresponding to that path. But if
> we have the list of those 10 OIDs, we could just prefetch those 10 OIDs
> without needing a new filter.

I must admit that I didn't notice this problem. Still, it seems easy 
enough to solve with this approach:

1) Estimate number of missing things
2) If "many", just download everything for <path> as described before
    and consider it done.
3) If "not so many", assemble a list of OIDs on the boundary of unknown
    (for example, all root tree OIDs for commits that are missing any
    trees) and use the usual fetch to download all OIDs in one go.
4) Repeat step 3 multiple times. Only N=<maximum tree depth> requests
    are needed, regardless of the number of commits.
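Steps 3) and 4) above can be sketched as a toy simulation. The object graph here is an assumed in-memory stand-in (a dict mapping an OID to its tree entries), not real git data:

```python
def prefetch_rounds(graph, roots, have):
    """Simulate steps 3)-4): starting from the root tree OIDs, repeatedly
    collect every missing OID on the boundary of the unknown and 'fetch'
    the whole batch in one round. Returns the per-round batches; the
    number of rounds is bounded by the maximum tree depth."""
    rounds = []
    frontier = [o for o in roots if o not in have]
    while frontier:
        rounds.append(sorted(set(frontier)))
        have.update(frontier)
        # Next boundary: children of what we just fetched that are
        # still missing locally.
        frontier = [c for o in rounds[-1] for c in graph.get(o, ())
                    if c not in have]
    return rounds

# Toy graph: two commits whose root trees share a subtree; depth 3
# means exactly 3 batched requests, regardless of commit count.
graph = {"t1": ["t2"], "t2": ["b1"], "t3": ["t2"]}
print(prefetch_rounds(graph, ["t1", "t3"], set()))
# [['t1', 't3'], ['t2'], ['b1']]
```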

> Another issue (but a smaller one) is that this does not fetch all the
> objects necessary if the file being "blame"d has been renamed, but that
> is probably solvable - we can just refetch with the old name.

Right, we also discussed this and figured that we'd just query more
things as needed. Maybe also other individual blobs for rename detection.

> Another possible solution that has been discussed before (but a much
> more involved one) is to teach Git to be able to serve results of
> computations, and then have "blame" be able to stitch that with local
> data. (For example, "blame" could check the history of a certain path to
> find the commit(s) that the remote has information of, query the remote
> for those commits, and then stitch the results together with local
> history.) This scheme would work not only for "blame" but for things
> like "grep" (with history) and "log -S", whereas
> "--filter=sparse:pathlist" would only work with "blame". But
> admittedly, this solution is more involved.

I understand that you're basically talking about implementing
prefetching in git itself? To my understanding, this will still need
either the command I suggested, or implementing graph walking with
massive OID requests as described above in 1)2)3)4). The latter will
not require protocol changes, but will involve sending quite a few
OIDs around.


* Re: Questions about partial clone with '--filter=tree:0'
  2020-10-26 18:44   ` Alexandr Miloslavskiy
@ 2020-10-26 19:46     ` Jonathan Tan
  2020-10-26 20:08       ` Alexandr Miloslavskiy
  0 siblings, 1 reply; 9+ messages in thread
From: Jonathan Tan @ 2020-10-26 19:46 UTC (permalink / raw)
  To: alexandr.miloslavskiy
  Cc: jonathantanmy, git, christian.couder, marc.strapetz, me

> > Having such an option (and teaching "blame" to use it to prefetch) would
> > indeed speed up "blame". But if we implement this, what would happen if
> > the user ran "blame" on the same file twice? I can't think of a way of
> > preventing the same fetch from happening twice except by checking the
> > existence of, say, the last 10 OIDs corresponding to that path. But if
> > we have the list of those 10 OIDs, we could just prefetch those 10 OIDs
> > without needing a new filter.
> 
> I must admit that I didn't notice this problem. Still, it seems easy 
> enough to solve with this approach:
> 
> 1) Estimate number of missing things
> 2) If "many", just download everything for <path> as described before
>     and consider it done.
> 3) If "not so many", assemble a list of OIDs on the boundary of unknown
>     (for example, all root tree OIDs for commits that are missing any
>     trees) and use the usual fetch to download all OIDs in one go.
> 4) Repeat step 3 multiple times. Only N=<maximum tree depth> requests
>     are needed, regardless of the number of commits.

My point was that if you can estimate it ("have the list of those 10
OIDs"), then you can just fetch it. This does send "quite a few OIDs",
as you said below - I'll address that below.

> > Another possible solution that has been discussed before (but a much
> > more involved one) is to teach Git to be able to serve results of
> > computations, and then have "blame" be able to stitch that with local
> > data. (For example, "blame" could check the history of a certain path to
> > find the commit(s) that the remote has information of, query the remote
> > for those commits, and then stitch the results together with local
> > history.) This scheme would work not only for "blame" but for things
> > like "grep" (with history) and "log -S", whereas
> > "--filter=sparse:pathlist" would only work with "blame". But
> > admittedly, this solution is more involved.
> 
> I understand that you're basically talking about implementing 
> prefetching in git itself?

No - I did talk about prefetching earlier, but here I mean having Git on
the server perform the "blame" computation itself.

For example, let's say I want to run "blame" on foo.txt at HEAD. HEAD
and HEAD^ are commits that only the local client has, whereas HEAD^^ was
fetched from the remote. By comparing HEAD, HEAD^, and HEAD^^, Git knows
which lines come from HEAD and HEAD^. For the rest, Git would make a
request to the server, passing the commit ID and the path, and would get
back a list of line numbers and commits.
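A toy simulation of that stitching step (the names and the server callback are invented for illustration; this is not a real git API):

```python
def stitched_blame(local_blame, boundary, query_server):
    """Sketch of the stitching idea: keep every line the client resolved
    from its own commits; for lines only traced back to the fetch
    boundary, ask the server once and merge its answer in."""
    result = {ln: c for ln, c in local_blame.items() if c != boundary}
    unresolved = [ln for ln, c in local_blame.items() if c == boundary]
    if unresolved:
        # One request: boundary commit ID (plus path, omitted here) in,
        # (line number, commit) pairs out.
        result.update(query_server(boundary, unresolved))
    return result

# Toy data: lines 1-2 resolved locally; line 3 only reaches HEAD^^,
# which the (hypothetical) server traces further back for us.
local = {1: "HEAD", 2: "HEAD^", 3: "HEAD^^"}
server = lambda commit, lines: {ln: "older-commit" for ln in lines}
print(stitched_blame(local, "HEAD^^", server))
# {1: 'HEAD', 2: 'HEAD^', 3: 'older-commit'}
```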

> To my understanding, this will still need
> either the command I suggested, or implementing graph walking with
> massive OID requests as described above in 1)2)3)4). The latter will
> not require protocol changes, but will involve sending quite a few
> OIDs around.

Yes, prefetching will require graph walking with large OID requests but
will not require protocol changes, as you say. I'm not too worried about
the large numbers of OIDs - Git servers already have to support
relatively large numbers of OIDs to support the bulk prefetch we do
during things like checkout and diff.


* Re: Questions about partial clone with '--filter=tree:0'
  2020-10-26 19:46     ` Jonathan Tan
@ 2020-10-26 20:08       ` Alexandr Miloslavskiy
  0 siblings, 0 replies; 9+ messages in thread
From: Alexandr Miloslavskiy @ 2020-10-26 20:08 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, christian.couder, marc.strapetz, me

On 26.10.2020 20:46, Jonathan Tan wrote:
 > No - I did talk about prefetching earlier, but here I mean having
 > Git on the server perform the "blame" computation itself.

Oh! That's an interesting twist. Unfortunately for us, we are
implementing our own blame logic. Speaking of which, I'm now becoming
more convinced that graph walking could be the best solution for us,
because it allows any logic, including custom file rename detection.

 > For example, let's say I want to run "blame" on foo.txt at HEAD. HEAD
 > and HEAD^ are commits that only the local client has, whereas HEAD^^ was
 > fetched from the remote. By comparing HEAD, HEAD^, and HEAD^^, Git knows
 > which lines come from HEAD and HEAD^. For the rest, Git would make a
 > request to the server, passing the commit ID and the path, and would get
 > back a list of line numbers and commits.

Sounds quite involved indeed! It's curious how git kind of shifts
towards a classic server-side VCS such as SVN when partial clones are
involved.

 > Yes, prefetching will require graph walking with large OID requests but
 > will not require protocol changes, as you say. I'm not too worried about
 > the large numbers of OIDs - Git servers already have to support
 > relatively large numbers of OIDs to support the bulk prefetch we do
 > during things like checkout and diff.

Hmm, let's talk about the Linux repository for the sake of the numbers.
The number of commits is ~1M. For a typical blame (without rename
detection), every request will traverse the trees one level deeper, and
for just one file blamed, that means 1 or 0 trees per commit
(depending on whether the tree was modified by the commit). The first
request, which discovers the root trees, is going to be the largest and
will request (1*numCommits) OIDs. That makes 1M OIDs in the worst case,
with subsequent requests probably at ~0.1M, and there will be 1 request
per path component in the blamed path.
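A back-of-envelope helper for the estimate above (the ~0.1M figure for later rounds is the guess from the text, not a measurement):

```python
def worst_case_requests(num_commits, path_depth, later_fraction=0.1):
    """One request per path component: the first asks for up to one
    root-tree OID per commit; later rounds for roughly 'later_fraction'
    of that (an assumed ratio, not measured)."""
    first = num_commits
    later = int(num_commits * later_fraction)
    return [first] + [later] * (path_depth - 1)

# A Linux-sized repo, blaming a path three components deep.
print(worst_case_requests(1_000_000, 3))  # [1000000, 100000, 100000]
```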

So the question is: will a git server (or git hosting) become upset
about requests for 1M OIDs? I never really tried to measure the cost
of such a request - what do you think?


end of thread, other threads:[~2020-10-26 20:08 UTC | newest]
