From: Alexandr Miloslavskiy <alexandr.miloslavskiy@syntevo.com>
To: Taylor Blau <me@ttaylorr.com>
Cc: git@vger.kernel.org, christian.couder@gmail.com,
	jonathantanmy@google.com,
	Marc Strapetz <marc.strapetz@syntevo.com>
Subject: Re: Questions about partial clone with '--filter=tree:0'
Date: Wed, 21 Oct 2020 19:10:02 +0200
Message-ID: <a4a20c67-4ee3-77b2-8d57-f30843572aa4@syntevo.com>
In-Reply-To: <20201020222934.GB93217@nand.local>

On 21.10.2020 0:29, Taylor Blau wrote:
> Oops. That can happen sometimes, but thanks for re-sending. I'll try to
> answer the basic points below.

Thanks for stepping in!

>> (1) Is it even considered a realistic use case?
>> -----------------------------------------------
>> Summary: is '--filter=tree:0' a realistic or "crazy" scenario that is
>> not considered worthy of supporting?
>
> It's not an unrealistic scenario, but it might be for what you're trying
> to build. If your UI needs to run, say, 'git log --patch' to show a
> historical revision, then you're going to need to fault in a lot of
> missing objects.
>
> If that's not something that you need to do often or ever, then having
> '--filter=tree:0' is a good way to get the least amount of data possible
> when using a partial clone. But if you're going to be performing
> operations that need those missing objects, you're probably better off
> eating the network/storage cost of it all at once, rather than making the user
> wait for Git to fault in the set of missing objects that it happens to
> need.

We currently do not intend to use '--filter=tree:0' ourselves, but we are
trying to support all kinds of user repositories with our UI. So we
basically have these choices:

A) Declare '--filter=tree:0' repos as completely wrong and unsupported
    in our UI, also giving an option to "un-partial" them.

B) Support '--filter=tree:0' repos, but don't support operations such
    as blame and file log.

C) Use some magic to efficiently download objects that will be needed
    for a command such as Blame, while keeping the rest of the repository
    partial. This is where the command described in (3) will help a lot.

We would of course prefer (C) if it's reasonably possible.

>> (2) A command to enrich repo with trees
>> ---------------------------------------
>> There is no good way to "un-partial" repository that was cloned with
>> '--filter=tree:0' to have all trees, but no blobs.
>
> There is no command to do that directly, but it is something that Git is
> capable of.
>
> It would look something like:
>
>    $ git config remote.origin.partialclonefilter 'blob:none'
>
> Now your repository is in a state where it has no blobs or trees, but
> the filter does not prohibit it from getting the trees, so you can ask
> it to grab everything you're missing with:
>
>    $ git fetch origin
>
> This should even be a pretty fast operation for repositories that have
> bitmaps due to some topics that Peff and I sent to the list a while ago.
> If it isn't, please let me know.

Unfortunately this does not work as expected. Try the following steps:

A) Clone repo with '--filter=tree:0'
    $ git clone --bare --filter=tree:0 --branch master https://github.com/git/git.git

B) Change filter to 'blob:none'
    $ cd git.git
    $ git config remote.origin.partialclonefilter 'blob:none'

C) Fetch
    $ git fetch origin
    Note that there is no 'Receiving objects:' output.

D) Check whether trees were downloaded
    $ git cat-file -p HEAD | grep tree
      tree ee5b5b41305cda618862beebc9c94859ae276e5a
    $ git cat-file -t ee5b5b41305cda618862beebc9c94859ae276e5a
      Note that 1 object gets downloaded on demand here. This confirms
      that (C) didn't achieve the goal.

This happens because of the 'check_exist_and_connected()' test in
'fetch_refs()'. Since the tip of the ref is already available locally
(even though it is missing all trees), nothing is downloaded.
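
One way to check whether a fetch brought in the trees, without triggering
the on-demand download that 'git cat-file' causes, is to let
'git rev-list' report absent objects. A minimal sketch: 'count_missing'
is a hypothetical helper name, and it assumes a Git version that
supports '--missing=print'.

```shell
# Hypothetical helper: count objects reachable from a revision that are
# absent from the local object store.
# --missing=print lists absent objects with a leading '?' instead of
# fetching them on demand.
#   $1 = path to the repository
#   $2 = revision to walk (defaults to HEAD)
count_missing() {
    git -C "$1" rev-list --objects --missing=print "${2:-HEAD}" | grep -c '^?'
}

# Usage (assumed layout): run before and after 'git fetch origin';
# if the count does not drop, the fetch did not download the trees.
#   count_missing git.git
```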

>> There seems to be a dirty way of doing that by abusing 'fetch --deepen'
>> which happens to skip "ref tip already present locally" check, but
>> it will also re-download all commits, which means extra ~0.5gb network
>> in case of Linux repo.
>
> Mmm, this is probably not what you're looking for. You may be confusing
> shallow clones (of which --deepen is relevant) with partial clones
> (to which --deepen is irrelevant).

Yes, '--deepen' is intended for shallow clones. But abusing it for
partial clones makes it possible to skip the
'check_exist_and_connected()' test. However, I did more testing today,
and in many cases the server itself refuses to send objects, probably
because of the 'HAVE' lines that were sent, or something else. So even
'--deepen' doesn't really help.

> I think what you probably want is a step 1.5 to tell Git "I'm not going
> to ask for or care about the entirety of my working copy, I really just
> want objects in path...", and you can do that with sparse checkouts. See
> https://git-scm.com/docs/git-sparse-checkout for more.

For simplicity of discussion, let's focus on the problem of running
Blame efficiently in a repo that was cloned with '--filter=tree:0'. In
order to blame file '/1/2/Foo.txt', we will need the following:

* Trees '/1'
* Trees '/1/2'
* Blobs '/1/2/Foo.txt'

All of these will be needed to unknown commit depth. For simplicity,
the proposed command will download these for all commits. Specifying
a range of revisions could be nice, but I feel that it's not worth the
complexity.
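
For illustration, the list of paths above can be derived from the file
path alone. A minimal sketch: 'blame_paths' is a hypothetical helper
name, not an existing Git command, and it only computes the path list,
it does not download anything.

```shell
# Hypothetical helper: print every directory leading to a file, then the
# file itself, one path per line. Trees would be needed for the
# directories and blobs for the file.
blame_paths() {
    path="${1#/}"                          # drop a leading slash, if any
    prefix=""
    rest="$path"
    while [ "${rest#*/}" != "$rest" ]; do  # loop while a '/' remains
        prefix="$prefix/${rest%%/*}"       # append the next directory
        printf '%s\n' "$prefix"
        rest="${rest#*/}"                  # strip that directory
    done
    printf '%s\n' "/$path"                 # finally, the file itself
}

blame_paths /1/2/Foo.txt
```

For '/1/2/Foo.txt' this prints '/1', '/1/2' and '/1/2/Foo.txt', matching
the list above.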

Correct me if I'm wrong: I think that sparse checkout will not help to
achieve the goal?

This is why I suggest a command that will accept paths and make the
server send the requested objects, forcing the server to assume that
all of them are missing from the client's repository.
