Re: git-clone --single-branch clones objects outside of branch

From: Jeff King <peff@peff.net>
To: Chris Jerdonek <chris.jerdonek@gmail.com>
Cc: git@vger.kernel.org
Subject: Re: git-clone --single-branch clones objects outside of branch
Date: Tue, 28 Jan 2020 04:48:01 -0500	[thread overview]
Message-ID: <20200128094801.GC574544@coredump.intra.peff.net> (raw)
In-Reply-To: <CAOTb1wd9D3YytevTt0cGnw1o-9cN1-yxCqbuH4oLH1KB6mzEeA@mail.gmail.com>

On Sun, Jan 26, 2020 at 10:46:07PM -0800, Chris Jerdonek wrote:

> Thanks for the reply. It's okay for that to be the expected behavior.
> My suggestion would just be that the documentation for --single-branch
> be updated to clarify that objects unreachable from the specified
> branch can still be in the cloned repo when run using the --local
> optimizations. For example, it can matter for security if one is
> trying to create a clone of a repo that doesn't include data from
> branches with sensitive info (e.g. in following Git's advice to create
> a separate repo if security of private data is desired:
> https://git-scm.com/docs/gitnamespaces#_security ).

I think it would make sense to talk about that under "--local". There
are other subtle reasons you might want "--no-local", too: it will
perform more consistency checks, which could be valuable if you're
thinking about deleting the old copy.

I'm not sure how much Git guarantees in general that you won't get extra
objects. It's true that we try to avoid sending objects that aren't
needed, but it's mostly as an optimization. If we later modified
pack-objects to sometimes send unneeded objects (say, because we're able
to compute the set more efficiently if we use an approximation that errs
on the conservative side), then I think that's something we'd consider.
And I suppose it's already possible with the dumb-http protocol, which
has to fetch whole packfiles. And the same would be true of recently
proposed schemes to clients to a pre-generated packfile URL.

> I'm guessing other flags also don't apply when --local is being used.
> For example, I'm guessing --reference is also ignored when using
> --local, but I haven't checked yet to confirm. It would be nice if the
> documentation gave a heads up in cases like these. Even if hard links
> are being used, it's not clear from the docs whether the objects are
> filtered first, prior to hard linking, when flags like --single-branch
> and --reference are passed.

No, "--reference" behaves as usual. However, "--depth" is ignored (and
issues a warning). I don't think it would be wrong to issue a warning
when --single-branch is used locally (though it would not be "single
branch is ignored, since it does impact which refs are copied). But I
kind of wonder if it would be annoying for people who don't care about
having the extra objects reachable.

> > This one behaves as you expected because git-fetch does not perform the
> > same optimizations (it wouldn't make as much sense there, as generally
> > in a fetch we already have most of the objects from the other side
> > anyway, so hard-linking would just give us duplicates).
> 
> Incidentally, here's a thread from 2010 requesting that this
> optimization be available in the git-fetch case:
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=573909
> (I don't know how reports on that Debian list relate to this list.)

Sometimes they get forwarded here, and sometimes not. :)

I think there are subtle issues that make naively using the optimization
a bad idea, as it could actually backfire and cause more disk usage.
E.g., consider a sequence like this:

  1. Repo A has a 100MB packfile.

  2. "git clone A B" uses hardlinks. Now we have two copies of the repo,
     storing 100MB.

  3. Repo A adds a few more commits, and then does a "git gc", breaking
     the hardlinks. Now we have ~200MB used. We're no worse off than if
     we hadn't done the hardlinks in the first place.

  4. Repo B fetches from A. It wants the new commits, but they're in
     repo A's big packfile. So it hardlinks that. We have ~200MB in repo
     B, but half of that is hardlinked and shared with A. So we're still
     using ~200MB. So far so good.

  5. Repo A repacks again, breaking the hardlinks. Now it's using
     ~100MB, but repo B is still using ~300MB. We're worse off than we
     would be without the optimization.

If you really want to keep sharing objects over time, I think using
"clone -s" is a better choice (though it comes with its own
complications and dangers, too; see the git-clone documentation).

-Peff