From: Jeff King <peff@peff.net>
To: Konstantin Ryabitsev <konstantin@linuxfoundation.org>
Cc: Junio C Hamano <gitster@pobox.com>, git@vger.kernel.org
Subject: Re: Compressing packed-refs
Date: Mon, 20 Jul 2020 13:32:20 -0400
Message-ID: <20200720173220.GB2045458@coredump.intra.peff.net>
In-Reply-To: <20200718182618.yqo5dcljf3h6q57q@chatter.i7.local>

On Sat, Jul 18, 2020 at 02:26:18PM -0400, Konstantin Ryabitsev wrote:

> >   - getting new objects into the object store. It sounds like you might
> >     do this with "git fetch", which does need up-to-date refs. We used
> >     to do that, too, but it can be quite slow. These days we migrate the
> >     objects directly via hardlinks, and then use "update-ref --stdin" to
> >     sync the refs into the shared storage repo.
> [...]
> Can you elaborate on the details of that operation, if it's not secret 
> sauce? Say, I have two repos:

No secret sauce, but it's pretty much what you wrote out. A few
comments:

> 1. locate all pack/* and XX/* files in repoA/objects (what about the 
>    info/packs file, or do you loosen all packs first?)

We only copy the pack and loose object files. We don't generate
info/packs at all, since we don't allow dumb-http access. Nor do we copy
any commit-graph files over (those are generated only in the shared
storage repo, and then every fork gets to use them).
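
(Regenerating the commit-graph there is just the stock command, run in
the shared repo instead of in each fork; roughly:

  git --git-dir=repoS commit-graph write --reachable

and every fork that borrows objects from repoS gets to use the result.)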

Definitely don't loosen packs. It's very expensive. :)

> 2. hardlink them into the same location in repoS/objects

Yep. And now they're available atomically in both places.
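
In shell, that hardlinking boils down to something like this (a sketch,
not our actual code; it assumes both repos are bare and live on the same
filesystem, since hardlinks can't cross filesystems):

  # sketch only: hardlink packfiles (and their .idx, etc.) into the shared repo
  ln repoA/objects/pack/* repoS/objects/pack/ 2>/dev/null

  # hardlink loose objects, preserving the two-character fan-out dirs
  (cd repoA/objects && find ?? -type f 2>/dev/null) |
  while read f; do
    mkdir -p "repoS/objects/$(dirname "$f")"
    ln "repoA/objects/$f" "repoS/objects/$f" 2>/dev/null
  done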

> 3. use git-show-ref from repoA to generate stdin for git-update-ref in 
>    repoS

Use for-each-ref for this. It's received more optimizations over the
years (especially around reading as little of the packed-refs file as
it can). Don't forget to delete refs that have gone away. We do something
like (typed in email, so watch out for errors):

  id=123
  git --git-dir=repoA for-each-ref \
    --format="%(objectname) refs/remotes/$id/%(refname)' >want
  git --git-dir=repoS for-each-ref \
    --format="%(objectname) %(refname)" refs/remotes/$id/ >have

and then compare the results (our code is in ruby using hashes, but you
could do it with comm or similar). And then you should end up with a set
of updates and deletions, which you can feed to "git update-ref --stdin"
(which is smart enough to do deletions before additions to save you from
directory/file conflicts in the namespace).
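
As a sketch, the compare step can be done with a few lines of awk
instead of ruby (want/have are the files from the snippet above):

  awk '
    NR==FNR { want[$2] = $1; next }    # first file: want
    { have[$2] = $1 }                  # second file: have
    END {
      for (r in want) if (want[r] != have[r]) print "update", r, want[r]
      for (r in have) if (!(r in want))       print "delete", r
    }
  ' want have | git --git-dir=repoS update-ref --stdin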

(There's no particular reason you need to use refs/remotes/ in the
shared repo; for us it's just historical, since we really did configure
remotes for each fork many many years ago).

> 4. Consequent runs of repack in repoA should unreference the hardlinked 
>    files in repoA/objects and leave only their copy in repoS

Yeah, I think it would do so, but we just unlink them immediately.

> I'm not sure I'm quite comfortable doing this kind of spinal surgery on 
> git repos yet, but I'm willing to wet my feet in some safe environments.  
> :)

We resisted it for a long time, too, because I didn't want to violate
any of Git's assumptions. But the cost of fetches was just getting too
high (especially because we queue a sync job after every push, and some
users like to push a _lot_).

> Yes, I did ponder using this, especially when dealing with objstore 
> repos with hundreds of thousands of refs -- thanks for another nudge in 
> this direction. I am planning to add a concept of indicating "baseline" 
> repos to grokmirror, which allows us to:
> 
> 1. set them as islandCore in objstore repositories
> 2. return only their refs via alternateRefsCommand
> 
> This one seems fairly straightforward and I will probably add that in 
> next week.

Yeah, it is. Our alternateRefsCommand is a script that basically does:

  # receive parent id info out-of-band in environment; if it's not
  # there, then show no alternate tips
  test -z "$parent_repo_id" && exit 0

  git --git-dir="$1" for-each-ref \
    --format='%(objectname)' refs/remotes/$parent_repo_id/heads/

Note that we only advertise "heads/" from the fork, and ignore the tags.
I don't know that we did a very rigorous study, but our general finding
was that tags don't often help much, and do clutter up the response for
some repos (again, some users think that 50,000 tags is reasonable).
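
If it helps, hooking such a script up is just the usual config knob,
set in the fork (the repo whose alternates point at the shared repo);
the script path here is made up:

  # path to the script above is hypothetical
  git --git-dir=repoA config core.alternateRefsCommand \
    /usr/local/bin/advertise-parent-tips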

-Peff
