* [RFC] Extending git-replace @ 2020-01-14 5:33 Kaushik Srenevasan 2020-01-14 6:55 ` Elijah Newren 2020-01-14 18:19 ` David Turner 0 siblings, 2 replies; 9+ messages in thread
From: Kaushik Srenevasan @ 2020-01-14 5:33 UTC (permalink / raw)
To: git

We’ve been trying to get rid of objects larger than a certain size from one of our repositories, which contains tens of thousands of branches and hundreds of thousands of commits. While we’re able to accomplish this using BFG[0], it results in ~90% of the repository’s history being rewritten. This presents the following problems:

1. There are various systems (Phabricator, for one) that use the commit hash as a key in various databases. Rewriting history will require that we update all of these systems.
2. We’ll have to force everyone to reclone a copy of this repository.

I was looking through the git code base to see if there is a way around this when I chanced upon `git-replace`. While the basic idea of `git-replace` is what I am looking for, it doesn’t quite fit the bill due to the `--no-replace-objects` switch, the `GIT_NO_REPLACE_OBJECTS` environment variable, and `--no-replace-objects` being the default for certain git commands, namely fsck, upload-pack, pack/unpack-objects, prune, and index-pack. That Git may still try to load a replaced object when a command is run with `--no-replace-objects` prevents me from removing the object from the ODB permanently. Not being able to run prune and fsck on a repository where we’ve deleted an object that’s been replaced with `git-replace` effectively rules this option out for us.

A feature that allowed such permanent replacement (say a `git-blacklist` or a `git-replace --blacklist`) might work as follows:

1. Blacklisted objects are stored as references under a new namespace -- `refs/blacklist`.
2. The object loader unconditionally translates a blacklisted OID into the OID it’s been replaced with.
3. The `+refs/blacklist/*:refs/blacklist/*` refspec is implicitly always a part of fetch and push transactions.

This essentially turns the blacklist references namespace into an additional piece of metadata that gets transmitted to a client when a repository is cloned and is kept updated automatically.

I’ve been playing around with a prototype I wrote and haven’t observed any breakage yet. I’m writing to seek advice on this approach and to understand whether this is something (if not in its current form, some version of it) that has a chance of making it into the product if we were to implement it. Happy to write up a more detailed design and share my prototype as a starting point for discussion.

-- Kaushik

[0] https://rtyley.github.io/bfg-repo-cleaner/

^ permalink raw reply [flat|nested] 9+ messages in thread
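The limitation described above is easy to reproduce with stock git: the replaced object has to stay in the ODB, because fsck deliberately ignores replacements. A minimal sketch in a throwaway repository (all commands below are real; only the repo contents are made up):

```shell
# Replace a blob, delete the original, and watch fsck complain even
# though normal object reads keep working through the replacement.
set -e
dir=$(mktemp -d); cd "$dir"
git init -q demo; cd demo
echo "huge payload" > big.bin
git add big.bin
git -c user.name=t -c user.email=t@example.com commit -qm "add big.bin"
big=$(git rev-parse :big.bin)                        # blob we want gone
small=$(echo "stripped" | git hash-object -w --stdin)
git replace "$big" "$small"                          # refs/replace/<big> -> <small>
git cat-file blob "$big"                             # reads resolve via the replacement
rm ".git/objects/$(echo "$big" | sed 's|^..|&/|')"   # delete the original loose blob
git fsck 2>&1 | grep missing                         # fsck ignores the replacement
```

With the replacement in place, `git cat-file` prints the stub content, but `git fsck` (which runs with replacement disabled) reports the blob as missing and exits non-zero; that is exactly why the original object cannot be deleted today.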
* Re: [RFC] Extending git-replace 2020-01-14 5:33 [RFC] Extending git-replace Kaushik Srenevasan @ 2020-01-14 6:55 ` Elijah Newren 2020-01-14 19:11 ` Jonathan Tan 2020-01-16 3:30 ` Kaushik Srenevasan 2020-01-14 18:19 ` David Turner 1 sibling, 2 replies; 9+ messages in thread From: Elijah Newren @ 2020-01-14 6:55 UTC (permalink / raw) To: Kaushik Srenevasan; +Cc: Git Mailing List, Jonathan Tan Hi Kaushik, On Mon, Jan 13, 2020 at 9:39 PM Kaushik Srenevasan <kaushik@twitter.com> wrote: > > We’ve been trying to get rid of objects larger than a certain size > from one of our repositories that contains tens of thousands of > branches and hundreds of thousands of commits. While we’re able to > accomplish this using BFG[0] , it results in ~ 90% of the repository’s > history being rewritten. This presents the following problems > 1. There are various systems (Phabricator for one) that use the commit > hash as a key in various databases. Rewriting history will require > that we update all of these systems. Not necessarily... > 2. We’ll have to force everyone to reclone a copy of this repository. True. > I was looking through the git code base to see if there is a way > around it when I chanced upon `git-replace`. While the basic idea of > `git-replace` is what I am looking for, it doesn’t quite fit the bill > due to the `--no-replace-objects` switch, the `GIT_NO_REPLACE_OBJECTS` > environment variable, and `--no-replace-objects` being the default for > certain git commands. Namely fsck, upload-pack, pack/unpack-objects, > prune and index-pack. That Git may still try to load a replaced object > when a git command is run with the `--no-replace-objects` option > prevents me from removing it from the ODB permanently. Not being able > to run prune and fsck on a repository where we’ve deleted the object > that’s been replaced with git-replace effectively rules this option > out for us. 
> > A feature that allowed such permanent replacement (say a > `git-blacklist` or a `git-replace --blacklist`) might work as follows: > 1. Blacklisted objects are stored as references under a new namespace > -- `refs/blacklist`. > 2. The object loader unconditionally translates a blacklisted OID into > the OID it’s been replaced with. > 3. The `+refs/blacklist/*:refs/blacklist/*` refspec is implicitly > always a part of fetch and push transactions. > > This essentially turns the blacklist references namespace into an > additional piece of metadata that gets transmitted to a client when a > repository is cloned and is kept updated automatically. > > I’ve been playing around with a prototype I wrote and haven’t observed > any breakage yet. I’m writing to seek advice on this approach and to > understand if this is something (if not in its current form, some > version of it) that has a chance of making it into the product if we > were to implement it. Happy to write up a more detailed design and > share my prototype as a starting point for discussion. I'll get back to this in a minute, but wanted to point out a couple of other ideas for consideration: 1) You can rewrite history, and then use replace references to map old commit IDs to new commit IDs. This allows anyone to continue using old commit IDs (which aren't even part of the new repository anymore) in git commands and git automatically uses and shows the new commit IDs. No problems with fsck or prune or fetch either. Creating these replace refs is fairly simple if your repository rewriting program (e.g. git-filter-repo or BFG Repo Cleaner) provides a mapping of old IDs to new IDs, and if you are using git-filter-repo it even creates the replace refs for you. (One downside is that you can't use abbreviated refs to refer to replace refs, thus you can't use abbreviated old commit IDs in this scheme.) A bigger downside is that various repository hosting tools ignore replace refs. 
Thus if you try to browse to a commit in the web UI of Gerrit or GitHub using the old commit IDs, it'll just show you a commit not found page. Phabricator and GitLab may well be the same (haven't tried yet). However, teaching these tools to pay attention to replace refs would make this simple mechanism for rewriting feel close to seamless other than asking people to reclone. It's possible that teaching the Webby tools to pay attention to replace refs might not be too difficult, at least for the open source systems, though I admit I haven't dug into it myself. 2) Some folks might be okay with a clone that won't pass fsck or prune, at least in special circumstances. We're actually doing that on purpose to deal with one of our large repositories. We don't provide that to normal developers, but we do use "cheap, fake clones" in our CI systems. These slim clones have 99% of all objects, but happen to be missing the really big ones, resulting in only needing 1/7 of the time to download. (And no, don't try to point out shallow clones to me. I hate those things, they're an awful hack, *and* they don't work for us. It's nice getting all commit history, all trees, and most blobs including all for at least the last two years while still saving lots of space.) [For the curious, I did make a simple script to create these "cheap, fake clones" for repositories of interest. See https://github.com/newren/sequester-old-big-blobs. But they are definitely a hack with some sharp corners, with failing fsck and prunes only being part of the story.] 3) Back to your idea... What you're proposing actually sounds very similar to partial clones, whose idea is to make it okay to download a subset of history. 
The primary problems with partial clones are (a) they are still under development and are just experimental, (b) they are currently implemented with a "promisor" mode, meaning that if a command tries to run over any piece of missing data then the command pauses while the objects are downloaded from the server. I want an offline mode (even if I'm online) where only explicit downloading from the server (clone, fetch, etc.) occurs. Instead of inventing yet another partial-clone-like thing, it'd be nice if your new mechanism could just be implemented in terms of partial clones, extending them as you need. I don't like the idea of supporting multiple competing implementations of partial clones within git.git, but if it's just some extensions of the existing capability then it sounds great. But you may want to talk with Jonathan Tan if you want to go this route (cc'd), since he's the partial clone expert. ^ permalink raw reply [flat|nested] 9+ messages in thread
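Idea (1) above can be scripted from the old->new commit map that a rewriting tool emits. The map file name git-filter-repo actually writes is `.git/filter-repo/commit-map`; the sketch below fabricates a tiny repo and a stand-in map file so the loop can be exercised on its own (git-filter-repo would normally create these replace refs for you, as noted above):

```shell
# Turn an "old-oid new-oid" map into replace refs with plain git-replace.
set -e
dir=$(mktemp -d); cd "$dir"
git init -q demo; cd demo
git -c user.name=t -c user.email=t@example.com commit -q --allow-empty -m original
old=$(git rev-parse HEAD)
# Fabricate a parentless "rewritten" commit to stand in for the rewrite result.
new=$(git -c user.name=t -c user.email=t@example.com commit-tree -m rewritten "HEAD^{tree}")
printf '%s %s\n' "$old" "$new" > commit-map      # stand-in for .git/filter-repo/commit-map
while read o n; do
    [ "$o" = "$n" ] || git replace "$o" "$n"     # map each old ID to its new ID
done < commit-map
git log --format=%s -n1 "$old"                   # old ID now resolves through the replacement
```

After the loop, commands given the old ID transparently show the rewritten commit, which is the "continue using old commit IDs" behavior described above.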
* Re: [RFC] Extending git-replace 2020-01-14 6:55 ` Elijah Newren @ 2020-01-14 19:11 ` Jonathan Tan 2020-01-16 3:30 ` Kaushik Srenevasan 1 sibling, 0 replies; 9+ messages in thread From: Jonathan Tan @ 2020-01-14 19:11 UTC (permalink / raw) To: newren; +Cc: kaushik, git, jonathantanmy > 2) Some folks might be okay with a clone that won't pass fsck or > prune, at least in special circumstances. We're actually doing that > on purpose to deal with one of our large repositories. We don't > provide that to normal developers, but we do use "cheap, fake clones" > in our CI systems. These slim clones have 99% of all objects, but > happen to be missing the really big ones, resulting in only needing > 1/7 of the time to download. (And no, don't try to point out shallow > clones to me. I hate those things, they're an awful hack, *and* they > don't work for us. It's nice getting all commit history, all trees, > and most blobs including all for at least the last two years while > still saving lots of space.) > > [For the curious, I did make a simple script to create these "cheap, > fake clones" for repositories of interest. See > https://github.com/newren/sequester-old-big-blobs. But they are > definitely a hack with some sharp corners, with failing fsck and > prunes only being part of the story.] If you want to reduce the sharpness of the corners, it might be possible to designate the pack with missing blobs as a promisor pack (add a .promisor file - which is just like the .keep file except s/keep/promisor/) and a fake promisor remote. That will make fsck and repack (GC) work. > 3) Back to your idea... > > What you're proposing actually sounds very similar to partial clones, > whose idea is to make it okay to download a subset of history. 
The > primary problems with partial clones are (a) they are still under > development and are just experimental, (b) they are currently > implemented with a "promisor" mode, meaning that if a command tries to > run over any piece of missing data then the command pauses while the > objects are downloaded from the server. I want an offline mode (even > if I'm online) where only explicit downloading from the server (clone, > fetch, etc.) occurs. David Turner had an idea of what could be done (instead of fetching) in such an offline mode [1], so I replied there. [1] https://lore.kernel.org/git/d4361b6d34513a3fdefa564d10269f60d4732208.camel@novalis.org/ > Instead of inventing yet another partial-clone-like thing, it'd be > nice if your new mechanism could just be implemented in terms of > partial clones, extending them as you need. I don't like the idea of > supporting multiple competing implementations of partial clones > withing git.git, but if it's just some extensions of the existing > capability then it sounds great. But you may want to talk with > Jonathan Tan if you want to go this route (cc'd), since he's the > partial clone expert. Ah, thanks for your kind words. ^ permalink raw reply [flat|nested] 9+ messages in thread
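The `.promisor` suggestion above is mechanical: drop an empty `pack-*.promisor` file next to the corresponding `pack-*.idx`, and designate a (possibly fake) promisor remote. A sketch, assuming a git new enough to have partial-clone support; the config keys are real, and `origin` here is just a stand-in name:

```shell
# Mark every local pack as a promisor pack so that objects they
# reference but do not contain are tolerated by fsck and repack.
set -e
dir=$(mktemp -d); cd "$dir"
git init -q demo; cd demo
git -c user.name=t -c user.email=t@example.com commit -q --allow-empty -m seed
git repack -adq                              # make sure at least one pack exists
for idx in .git/objects/pack/pack-*.idx; do
    : > "${idx%.idx}.promisor"               # empty .promisor beside each .idx
done
git config core.repositoryformatversion 1
git config extensions.partialClone origin    # designate the promisor remote
git fsck --no-progress                       # passes; missing referenced objects would be tolerated
```

No fetch ever happens in this sketch, so the fake remote is never contacted; it only has to exist as a name for the promisor machinery.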
* Re: [RFC] Extending git-replace 2020-01-14 6:55 ` Elijah Newren 2020-01-14 19:11 ` Jonathan Tan @ 2020-01-16 3:30 ` Kaushik Srenevasan 1 sibling, 0 replies; 9+ messages in thread From: Kaushik Srenevasan @ 2020-01-16 3:30 UTC (permalink / raw) To: Elijah Newren; +Cc: Git Mailing List, Jonathan Tan Hi Elijah, On Mon, Jan 13, 2020 at 10:55 PM Elijah Newren <newren@gmail.com> wrote: > 1) You can rewrite history, and then use replace references to map old > commit IDs to new commit IDs. This allows anyone to continue using > old commit IDs (which aren't even part of the new repository anymore) > in git commands and git automatically uses and shows the new commit > IDs. No problems with fsck or prune or fetch either. Creating these > replace refs is fairly simple if your repository rewriting program > (e.g. git-filter-repo or BFG Repo Cleaner) provides a mapping of old > IDs to new IDs, and if you are using git-filter-repo it even creates > the replace refs for you. (The one downside is that you can't use > abbreviated refs to refer to replace refs, thus you can't use > abbreviated old commit IDs in this scheme.) > This is the path we're considering taking unless something easier comes out of this (or other) proposal(s). We're working on determining compatibility with tools. Thanks for the pointer to git-filter-repo. It looks great! > Instead of inventing yet another partial-clone-like thing, it'd be > nice if your new mechanism could just be implemented in terms of > partial clones, extending them as you need. I don't like the idea of > supporting multiple competing implementations of partial clones > withing git.git, but if it's just some extensions of the existing > capability then it sounds great. But you may want to talk with > Jonathan Tan if you want to go this route (cc'd), since he's the > partial clone expert. I agree that it isn't worth inventing another partial clone like feature. 
It sounds, however, like something based on partial clone would not solve the problem on the "server" side, or perhaps I'm missing something (as I've not had a chance to check out the implementation yet). While I'm not at all insisting that `git-blacklist` be the way to achieve it, we'd (Twitter) like to be able to permanently get rid of the objects in question while retaining the ability to run GC and FSCK on all copies of the repository, preferably without having to rewrite history. Even merely making `--no-replace-objects` be FALSE by default for GC and FSCK (and printing a warning instead), while retaining existing behavior when it is explicitly requested, would significantly improve `git-replace`'s usability (for this purpose). The bits related to ref transfer in my proposal are optional. Git users can either be required to explicitly fetch the refs/replace namespace (as they do today), or we could print a message (at the end of clone), letting the user know that there are replacements available on the server. I'd only proposed a new command because changing `git-replace` in this way would break backward compatibility. -- Kaushik ^ permalink raw reply [flat|nested] 9+ messages in thread
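For reference, the "explicitly fetch the replace namespace" behavior mentioned above looks like this today; the refspec is real, and `src`/`dst` are throwaway repositories standing in for a server and a client:

```shell
# Replace refs are not transferred by a default clone; an explicit
# refspec is needed to mirror them.
set -e
dir=$(mktemp -d); cd "$dir"
git init -q src; cd src
git -c user.name=t -c user.email=t@example.com commit -q --allow-empty -m one
blob=$(echo secret | git hash-object -w --stdin)
repl=$(echo redacted | git hash-object -w --stdin)
git replace "$blob" "$repl"                   # server-side replacement
cd ..
git clone -q src dst; cd dst
git rev-parse -q --verify "refs/replace/$blob" || echo "clone did not bring replace refs"
git config --add remote.origin.fetch '+refs/replace/*:refs/replace/*'
git fetch -q origin                           # now the namespace is mirrored
git rev-parse --verify "refs/replace/$blob"
```

The proposal in this thread would, in effect, make an equivalent of that extra refspec implicit for a `refs/blacklist` namespace.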
* Re: [RFC] Extending git-replace 2020-01-14 5:33 [RFC] Extending git-replace Kaushik Srenevasan 2020-01-14 6:55 ` Elijah Newren @ 2020-01-14 18:19 ` David Turner 2020-01-14 19:03 ` Jonathan Tan 1 sibling, 1 reply; 9+ messages in thread From: David Turner @ 2020-01-14 18:19 UTC (permalink / raw) To: Kaushik Srenevasan, git On Mon, 2020-01-13 at 21:33 -0800, Kaushik Srenevasan wrote: > A feature that allowed such permanent replacement (say a > `git-blacklist` or a `git-replace --blacklist`) might work as > follows: > 1. Blacklisted objects are stored as references under a new namespace > -- `refs/blacklist`. > 2. The object loader unconditionally translates a blacklisted OID > into > the OID it’s been replaced with. > 3. The `+refs/blacklist/*:refs/blacklist/*` refspec is implicitly > always a part of fetch and push transactions. There are definitely some security implications here. I assume that there's a config on the client to trust the server's refs/blacklist/*, and that the documentation for this explains that it allows your repo to be messed with in quite dangerous ways. And on the server, I would expect that only privileged users could push to refs/blacklist/*. To Elijah's point that this is related to partial clones and promisors, I think Kaushik's idea is subtly different in that it involves replacements, while promisors try to offer a seamless experience. I wonder whether Kaushik actually needs the replacement functionality? That is, would it be sufficient if every replaced file were replaced with the exact text "me caga en la leche" instead of a custom hand-crafted replacement? I guess it's a bit complicated because while that's a reasonable blob, it's not a valid commit. So maybe this mechanism would be limited to blobs. I thought about whether we could use a different flavor of replacement for commits, but those generally have to be custom because they each have different parents. And if that would be sufficient, could promisors be used for this? 
I don't know how those interact with fsck and the other commands that you're worried about. Basically, the idea would be to use most of the existing promisor code, and then have a mode where instead of visiting the promisor, we just always return "me caga en la leche" (and this does not have its SHA checked, of course). This could work together with some sort of refs/blacklist mechanism to enable the server to choose which objects the client replaces. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC] Extending git-replace 2020-01-14 18:19 ` David Turner @ 2020-01-14 19:03 ` Jonathan Tan 2020-01-14 20:39 ` Elijah Newren 0 siblings, 1 reply; 9+ messages in thread From: Jonathan Tan @ 2020-01-14 19:03 UTC (permalink / raw) To: novalis; +Cc: kaushik, git, Jonathan Tan > That is, would it be sufficient if every replaced file were replaced > with the exact text "me caga en la leche" instead of a custom hand- > crafted replacement? I guess it's a bit complicated because while > that's a reasonable blob, it's not a valid commit. So maybe this > mechanism would be limited to blobs. I thought about whether we could > a different flavor of replacement for commits, but those generally have > to be custom because they each have different parents. Since the original email just discussed blobs, I'll confine myself to discussing blobs. (Commits are trickier, as you said.) > And if that would be sufficient, could promisors be used for this? I > don't know how those interact with fsck and the other commands that > you're worried about. Basically, the idea would be to use most of the > existing promisor code, and then have a mode where instead of visiting > the promisor, we just always return "me caga en la leche" (and this > does not have its SHA checked, of course). Missing promisor objects do not prevent fsck from passing - this is part of the original design (any packfiles we download from the specifically designated promisor remote are marked as such, and any objects that the objects in the packfile refer to are considered OK to be missing). Currently, when a missing object is read, it is first fetched (there are some more details that I can go over if you have any specific questions). What you're suggesting here is to return a fake blob with wrong hash - I haven't looked at all the callers of read-object functions in detail, but I don't think all of them are ready for such a behavioral change. 
Maybe it would be sufficient to just make this work in a more limited scope (e.g. checkout only - and if we need different replacement blobs for different object IDs, maybe we could have something similar to the clean/smudge filters). > This could work together with some sort refs/blacklist mechanism to > enable the server to choose which objects the client replaces. In the original email, Kaushik mentioned objects larger than a certain size - we already have support for that (--filter=blob:limit=1000000, for example). Having said that, Git is already able to tolerate any exclusion (of tree or blob) from the server - we already need this in order to support changing of filters, for example. ^ permalink raw reply [flat|nested] 9+ messages in thread
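The size filter mentioned above is usable today. A sketch of a partial clone that leaves large blobs behind, using local `file://` transport (which requires `uploadpack.allowfilter` on the serving side) and a git new enough to support partial clone:

```shell
# Partial clone that omits blobs over 1 MB; -n skips checkout so the
# big blob is not fetched back on demand.
set -e
dir=$(mktemp -d); cd "$dir"
git init -q src; cd src
dd if=/dev/zero of=big.bin bs=1024 count=2048 2>/dev/null   # ~2 MB blob
echo small > small.txt
git add big.bin small.txt
git -c user.name=t -c user.email=t@example.com commit -qm files
git config uploadpack.allowfilter true
cd ..
git clone -q -n --filter=blob:limit=1000000 "file://$dir/src" dst
cd dst
git rev-list --objects --missing=print HEAD | grep -c '^?'  # count of omitted blobs
```

The `?`-prefixed entries from `rev-list --missing=print` are the objects the promisor remote still owes the client; here only the 2 MB blob is omitted.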
* Re: [RFC] Extending git-replace 2020-01-14 19:03 ` Jonathan Tan @ 2020-01-14 20:39 ` Elijah Newren 2020-01-14 21:57 ` Jonathan Tan 0 siblings, 1 reply; 9+ messages in thread From: Elijah Newren @ 2020-01-14 20:39 UTC (permalink / raw) To: Jonathan Tan; +Cc: novalis, Kaushik Srenevasan, Git Mailing List On Tue, Jan 14, 2020 at 11:05 AM Jonathan Tan <jonathantanmy@google.com> wrote: > > > That is, would it be sufficient if every replaced file were replaced > > with the exact text "me caga en la leche" instead of a custom hand- > > crafted replacement? I guess it's a bit complicated because while > > that's a reasonable blob, it's not a valid commit. So maybe this > > mechanism would be limited to blobs. I thought about whether we could > > a different flavor of replacement for commits, but those generally have > > to be custom because they each have different parents. > > Since the original email just discussed blobs, I'll confine myself to > discussing blobs. (Commits are trickier, as you said.) > > > And if that would be sufficient, could promisors be used for this? I > > don't know how those interact with fsck and the other commands that > > you're worried about. Basically, the idea would be to use most of the > > existing promisor code, and then have a mode where instead of visiting > > the promisor, we just always return "me caga en la leche" (and this > > does not have its SHA checked, of course). Maybe; it doesn't necessarily need to be the same object returned, and these replacements could be user-specified via replace refs... > Missing promisor objects do not prevent fsck from passing - this is part > of the original design (any packfiles we download from the specifically > designated promisor remote are marked as such, and any objects that the > objects in the packfile refer to are considered OK to be missing). 
Is there ever a risk that objects in the downloaded packfile come across as deltas against other objects that are missing/excluded, or does the partial clone machinery ensure that doesn't happen? (Because this was certainly the biggest pain-point with my "fake cheap clone" hacks.) > Currently, when a missing object is read, it is first fetched (there are > some more details that I can go over if you have any specific > questions). What you're suggesting here is to return a fake blob with > wrong hash - I haven't looked at all the callers of read-object > functions in detail, but I don't think all of them are ready for such a > behavioral change. git-replace already took care of that for you and provides that guarantee, modulo the --no-replace-objects & fsck & prune & fetch & whatnot cases that ignore replace objects as Kaushik mentioned. I took advantage of this to great effect with my "fake cheap clone" hacks. Based in part on your other email where you made a suggestion about promisors, I'm starting to think a pretty good first cut solution might look like the following: * user manually adds a bunch of replace refs to map the unwanted big blobs to something else (e.g. a README about how the files were stripped, or something similar to this) * a partial clone specification that says "exclude objects that are referenced by replace refs" * add a fake promisor to the downloaded promisor pack so that if anyone runs with --no-replace-objects or similar then they get an error saying the specified objects don't exist and can't be downloaded. Anyone see any obvious problems with this? > Maybe it would be sufficient to just make this work > in a more limited scope (e.g. checkout only - and if we need different > replacement blobs for different object IDs, maybe we could have > something similar to the clean/smudge filters). > > > This could work together with some sort refs/blacklist mechanism to > > enable the server to choose which objects the client replaces. 
> > In the original email, Kaushik mentioned objects larger than a certain > size - we already have support for that (--filter=blob:limit=1000000, > for example). Having said that, Git is already able to tolerate any > exclusion (of tree or blob) from the server - we already need this in > order to support changing of filters, for example. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC] Extending git-replace 2020-01-14 20:39 ` Elijah Newren @ 2020-01-14 21:57 ` Jonathan Tan 2020-01-14 22:46 ` Elijah Newren 0 siblings, 1 reply; 9+ messages in thread From: Jonathan Tan @ 2020-01-14 21:57 UTC (permalink / raw) To: newren; +Cc: jonathantanmy, novalis, kaushik, git > > Missing promisor objects do not prevent fsck from passing - this is part > > of the original design (any packfiles we download from the specifically > > designated promisor remote are marked as such, and any objects that the > > objects in the packfile refer to are considered OK to be missing). > > Is there ever a risk that objects in the downloaded packfile come > across as deltas against other objects that are missing/excluded, or > does the partial clone machinery ensure that doesn't happen? (Because > this was certainly the biggest pain-point with my "fake cheap clone" > hacks.) The server may send thin packs during a fetch or clone, but because the client runs index-pack (which calculates the hash of every object downloaded, necessitating having the full object, which in turn triggers fetches of any delta bases), this should not happen. But if you create the packfile in some other way and then manually set a fake promisor remote (as I perhaps too naively suggested) then the mechanism will attempt to fetch missing delta bases, which (I think) is not what you want. > > Currently, when a missing object is read, it is first fetched (there are > > some more details that I can go over if you have any specific > > questions). What you're suggesting here is to return a fake blob with > > wrong hash - I haven't looked at all the callers of read-object > > functions in detail, but I don't think all of them are ready for such a > > behavioral change. > > git-replace already took care of that for you and provides that > guarantee, modulo the --no-replace-objects & fsck & prune & fetch & > whatnot cases that ignore replace objects as Kaushik mentioned. 
I > took advantage of this to great effect with my "fake cheap clone" > hacks. Based in part on your other email where you made a suggestion > about promisors, I'm starting to think a pretty good first cut > solution might look like the following: > > * user manually adds a bunch of replace refs to map the unwanted big > blobs to something else (e.g. a README about how the files were > stripped, or something similar to this) > * a partial clone specification that says "exclude objects that are > referenced by replace refs" > * add a fake promisor to the downloaded promisor pack so that if > anyone runs with --no-replace-objects or similar then they get an > error saying the specified objects don't exist and can't be > downloaded. > > Anyone see any obvious problems with this? Looking at the list of commands given in the original email (fsck, upload-pack, pack/unpack-objects, prune and index-pack), if we use a filter by blob size (instead of the partial clone specification suggested), this would satisfy the purposes of fsck and prune only. If we had a partial clone specification that excludes objects referenced by replace refs, then upload-pack from this partial repository (and pack-objects) would work too. But there might be non-obvious problems that I haven't thought of. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC] Extending git-replace 2020-01-14 21:57 ` Jonathan Tan @ 2020-01-14 22:46 ` Elijah Newren 0 siblings, 0 replies; 9+ messages in thread From: Elijah Newren @ 2020-01-14 22:46 UTC (permalink / raw) To: Jonathan Tan; +Cc: novalis, Kaushik Srenevasan, Git Mailing List On Tue, Jan 14, 2020 at 1:57 PM Jonathan Tan <jonathantanmy@google.com> wrote: > > > > Missing promisor objects do not prevent fsck from passing - this is part > > > of the original design (any packfiles we download from the specifically > > > designated promisor remote are marked as such, and any objects that the > > > objects in the packfile refer to are considered OK to be missing). > > > > Is there ever a risk that objects in the downloaded packfile come > > across as deltas against other objects that are missing/excluded, or > > does the partial clone machinery ensure that doesn't happen? (Because > > this was certainly the biggest pain-point with my "fake cheap clone" > > hacks.) > > The server may send thin packs during a fetch or clone, but because the > client runs index-pack (which calculates the hash of every object > downloaded, necessitating having the full object, which in turn triggers > fetches of any delta bases), this should not happen. So if a user does a partial clone, filtering by blob size >= 1M, and if they have several blobs of size just above and just below that limit, then the partial clone will work but probably cause them to still download several blobs above the limit size anyway? (Which, if I'm understanding correctly, happens because the blobs just smaller than 1M likely will delta well against the blobs just larger than 1M.) > But if you create the packfile in some other way and then manually set a > fake promisor remote (as I perhaps too naively suggested) then the > mechanism will attempt to fetch missing delta bases, which (I think) is > not what you want. 
Well, it's not optimal, but we're currently just dying with cryptic errors whenever we have missing delta bases, and this happens whenever we have an accidental fetch of older branches (although this does have the nice side effect of notifying us of stray fetches in our CI scripts). Your promisor suggestion would at least permit gc's & prunes if we use it in more places, so should be an improvement. I just wanted to verify whether this problem with delta bases would remain. > > > Currently, when a missing object is read, it is first fetched (there are > > > some more details that I can go over if you have any specific > > > questions). What you're suggesting here is to return a fake blob with > > > wrong hash - I haven't looked at all the callers of read-object > > > functions in detail, but I don't think all of them are ready for such a > > > behavioral change. > > > > git-replace already took care of that for you and provides that > > guarantee, modulo the --no-replace-objects & fsck & prune & fetch & > > whatnot cases that ignore replace objects as Kaushik mentioned. I > > took advantage of this to great effect with my "fake cheap clone" > > hacks. Based in part on your other email where you made a suggestion > > about promisors, I'm starting to think a pretty good first cut > > solution might look like the following: > > > > * user manually adds a bunch of replace refs to map the unwanted big > > blobs to something else (e.g. a README about how the files were > > stripped, or something similar to this) > > * a partial clone specification that says "exclude objects that are > > referenced by replace refs" > > * add a fake promisor to the downloaded promisor pack so that if > > anyone runs with --no-replace-objects or similar then they get an > > error saying the specified objects don't exist and can't be > > downloaded. > > > > Anyone see any obvious problems with this? 
> > Looking at the list of commands given in the original email (fsck, > upload-pack, pack/unpack-objects, prune and index-pack), if we use a > filter by blob size (instead of the partial clone specification > suggested), this would satisfy the purposes of fsck and prune only. > > If we had a partial clone specification that excludes object referenced > by replace refs, then upload-pack from this partial repository (and > pack-objects) would work too. > > But there might be non-obvious problems that I haven't thought of. Cool, sounds like it's at least worth investigating. Maybe Kaushik is interested, or maybe I consider throwing it on my backlog and coming back to it in a year or two. :-) ^ permalink raw reply [flat|nested] 9+ messages in thread