* [RFC] Extending git-replace @ 2020-01-14 5:33 Kaushik Srenevasan 2020-01-14 6:55 ` Elijah Newren 2020-01-14 18:19 ` David Turner 0 siblings, 2 replies; 9+ messages in thread
From: Kaushik Srenevasan @ 2020-01-14 5:33 UTC (permalink / raw)
To: git

We’ve been trying to get rid of objects larger than a certain size from one of our repositories, which contains tens of thousands of branches and hundreds of thousands of commits. While we’re able to accomplish this using BFG[0], it results in ~90% of the repository’s history being rewritten. This presents the following problems:

1. There are various systems (Phabricator, for one) that use the commit hash as a key in various databases. Rewriting history will require that we update all of these systems.
2. We’ll have to force everyone to reclone a copy of this repository.

I was looking through the git code base to see if there is a way around this when I chanced upon `git-replace`. While the basic idea of `git-replace` is what I am looking for, it doesn’t quite fit the bill due to the `--no-replace-objects` switch, the `GIT_NO_REPLACE_OBJECTS` environment variable, and `--no-replace-objects` being the default for certain git commands, namely fsck, upload-pack, pack/unpack-objects, prune, and index-pack. That Git may still try to load a replaced object when a command is run with `--no-replace-objects` prevents me from removing the object from the ODB permanently. Not being able to run prune and fsck on a repository where we’ve deleted an object that’s been replaced with `git-replace` effectively rules this option out for us.

A feature that allowed such permanent replacement (say a `git-blacklist` or a `git-replace --blacklist`) might work as follows:

1. Blacklisted objects are stored as references under a new namespace -- `refs/blacklist`.
2. The object loader unconditionally translates a blacklisted OID into the OID it’s been replaced with.
3. The `+refs/blacklist/*:refs/blacklist/*` refspec is implicitly always a part of fetch and push transactions.

This essentially turns the blacklist references namespace into an additional piece of metadata that gets transmitted to a client when a repository is cloned and is kept updated automatically.

I’ve been playing around with a prototype I wrote and haven’t observed any breakage yet. I’m writing to seek advice on this approach and to understand whether this is something (if not in its current form, some version of it) that has a chance of making it into the product if we were to implement it. Happy to write up a more detailed design and share my prototype as a starting point for discussion.

-- Kaushik

[0] https://rtyley.github.io/bfg-repo-cleaner/

^ permalink raw reply [flat|nested] 9+ messages in thread
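The limitation described above is easy to reproduce with stock git: the replaced object has to stay in the ODB, because fsck deliberately ignores replacements. A minimal sketch in a throwaway repository (all commands below are real; only the repo contents are made up):

```shell
# Replace a blob, delete the original, and watch fsck complain even
# though normal object reads keep working through the replacement.
set -e
dir=$(mktemp -d); cd "$dir"
git init -q demo; cd demo
echo "huge payload" > big.bin
git add big.bin
git -c user.name=t -c user.email=t@example.com commit -qm "add big.bin"
big=$(git rev-parse :big.bin)                        # blob we want gone
small=$(echo "stripped" | git hash-object -w --stdin)
git replace "$big" "$small"                          # refs/replace/<big> -> <small>
git cat-file blob "$big"                             # reads resolve via the replacement
rm ".git/objects/$(echo "$big" | sed 's|^..|&/|')"   # delete the original loose blob
git fsck 2>&1 | grep missing                         # fsck ignores the replacement
```

With the replacement in place, `git cat-file` prints the stub content, but `git fsck` (which runs with replacement disabled) reports the blob as missing and exits non-zero; that is exactly why the original object cannot be deleted today.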
* Re: [RFC] Extending git-replace 2020-01-14 5:33 [RFC] Extending git-replace Kaushik Srenevasan @ 2020-01-14 6:55 ` Elijah Newren 2020-01-14 19:11 ` Jonathan Tan 2020-01-16 3:30 ` Kaushik Srenevasan 2020-01-14 18:19 ` David Turner 1 sibling, 2 replies; 9+ messages in thread From: Elijah Newren @ 2020-01-14 6:55 UTC (permalink / raw) To: Kaushik Srenevasan; +Cc: Git Mailing List, Jonathan Tan Hi Kaushik, On Mon, Jan 13, 2020 at 9:39 PM Kaushik Srenevasan <kaushik@twitter.com> wrote: > > We’ve been trying to get rid of objects larger than a certain size > from one of our repositories that contains tens of thousands of > branches and hundreds of thousands of commits. While we’re able to > accomplish this using BFG[0] , it results in ~ 90% of the repository’s > history being rewritten. This presents the following problems > 1. There are various systems (Phabricator for one) that use the commit > hash as a key in various databases. Rewriting history will require > that we update all of these systems. Not necessarily... > 2. We’ll have to force everyone to reclone a copy of this repository. True. > I was looking through the git code base to see if there is a way > around it when I chanced upon `git-replace`. While the basic idea of > `git-replace` is what I am looking for, it doesn’t quite fit the bill > due to the `--no-replace-objects` switch, the `GIT_NO_REPLACE_OBJECTS` > environment variable, and `--no-replace-objects` being the default for > certain git commands. Namely fsck, upload-pack, pack/unpack-objects, > prune and index-pack. That Git may still try to load a replaced object > when a git command is run with the `--no-replace-objects` option > prevents me from removing it from the ODB permanently. Not being able > to run prune and fsck on a repository where we’ve deleted the object > that’s been replaced with git-replace effectively rules this option > out for us. 
> > A feature that allowed such permanent replacement (say a > `git-blacklist` or a `git-replace --blacklist`) might work as follows: > 1. Blacklisted objects are stored as references under a new namespace > -- `refs/blacklist`. > 2. The object loader unconditionally translates a blacklisted OID into > the OID it’s been replaced with. > 3. The `+refs/blacklist/*:refs/blacklist/*` refspec is implicitly > always a part of fetch and push transactions. > > This essentially turns the blacklist references namespace into an > additional piece of metadata that gets transmitted to a client when a > repository is cloned and is kept updated automatically. > > I’ve been playing around with a prototype I wrote and haven’t observed > any breakage yet. I’m writing to seek advice on this approach and to > understand if this is something (if not in its current form, some > version of it) that has a chance of making it into the product if we > were to implement it. Happy to write up a more detailed design and > share my prototype as a starting point for discussion. I'll get back to this in a minute, but wanted to point out a couple of other ideas for consideration: 1) You can rewrite history, and then use replace references to map old commit IDs to new commit IDs. This allows anyone to continue using old commit IDs (which aren't even part of the new repository anymore) in git commands and git automatically uses and shows the new commit IDs. No problems with fsck or prune or fetch either. Creating these replace refs is fairly simple if your repository rewriting program (e.g. git-filter-repo or BFG Repo Cleaner) provides a mapping of old IDs to new IDs, and if you are using git-filter-repo it even creates the replace refs for you. (One downside is that you can't use abbreviated refs to refer to replace refs, thus you can't use abbreviated old commit IDs in this scheme.) A bigger downside is that various repository hosting tools ignore replace refs. 
Thus if you try to browse to a commit in the web UI of Gerrit or GitHub using the old commit IDs, it'll just show you a commit not found page. Phabricator and GitLab may well be the same (haven't tried yet). However, teaching these tools to pay attention to replace refs would make this simple mechanism for rewriting feel close to seamless other than asking people to reclone. It's possible that teaching the Webby tools to pay attention to replace refs might not be too difficult, at least for the open source systems, though I admit I haven't dug into it myself. 2) Some folks might be okay with a clone that won't pass fsck or prune, at least in special circumstances. We're actually doing that on purpose to deal with one of our large repositories. We don't provide that to normal developers, but we do use "cheap, fake clones" in our CI systems. These slim clones have 99% of all objects, but happen to be missing the really big ones, resulting in only needing 1/7 of the time to download. (And no, don't try to point out shallow clones to me. I hate those things, they're an awful hack, *and* they don't work for us. It's nice getting all commit history, all trees, and most blobs including all for at least the last two years while still saving lots of space.) [For the curious, I did make a simple script to create these "cheap, fake clones" for repositories of interest. See https://github.com/newren/sequester-old-big-blobs. But they are definitely a hack with some sharp corners, with failing fsck and prunes only being part of the story.] 3) Back to your idea... What you're proposing actually sounds very similar to partial clones, whose idea is to make it okay to download a subset of history. 
The primary problems with partial clones are (a) they are still under development and are just experimental, (b) they are currently implemented with a "promisor" mode, meaning that if a command tries to run over any piece of missing data then the command pauses while the objects are downloaded from the server. I want an offline mode (even if I'm online) where only explicit downloading from the server (clone, fetch, etc.) occurs. Instead of inventing yet another partial-clone-like thing, it'd be nice if your new mechanism could just be implemented in terms of partial clones, extending them as you need. I don't like the idea of supporting multiple competing implementations of partial clones within git.git, but if it's just some extensions of the existing capability then it sounds great. But you may want to talk with Jonathan Tan if you want to go this route (cc'd), since he's the partial clone expert. ^ permalink raw reply [flat|nested] 9+ messages in thread
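Idea (1) above can be scripted from the old->new commit map that a rewriting tool emits. The map file name git-filter-repo actually writes is `.git/filter-repo/commit-map`; the sketch below fabricates a tiny repo and a stand-in map file so the loop can be exercised on its own (git-filter-repo would normally create these replace refs for you, as noted above):

```shell
# Turn an "old-oid new-oid" map into replace refs with plain git-replace.
set -e
dir=$(mktemp -d); cd "$dir"
git init -q demo; cd demo
git -c user.name=t -c user.email=t@example.com commit -q --allow-empty -m original
old=$(git rev-parse HEAD)
# Fabricate a parentless "rewritten" commit to stand in for the rewrite result.
new=$(git -c user.name=t -c user.email=t@example.com commit-tree -m rewritten "HEAD^{tree}")
printf '%s %s\n' "$old" "$new" > commit-map      # stand-in for .git/filter-repo/commit-map
while read o n; do
    [ "$o" = "$n" ] || git replace "$o" "$n"     # map each old ID to its new ID
done < commit-map
git log --format=%s -n1 "$old"                   # old ID now resolves through the replacement
```

After the loop, commands given the old ID transparently show the rewritten commit, which is the "continue using old commit IDs" behavior described above.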
* Re: [RFC] Extending git-replace 2020-01-14 6:55 ` Elijah Newren @ 2020-01-14 19:11 ` Jonathan Tan 2020-01-16 3:30 ` Kaushik Srenevasan 1 sibling, 0 replies; 9+ messages in thread From: Jonathan Tan @ 2020-01-14 19:11 UTC (permalink / raw) To: newren; +Cc: kaushik, git, jonathantanmy > 2) Some folks might be okay with a clone that won't pass fsck or > prune, at least in special circumstances. We're actually doing that > on purpose to deal with one of our large repositories. We don't > provide that to normal developers, but we do use "cheap, fake clones" > in our CI systems. These slim clones have 99% of all objects, but > happen to be missing the really big ones, resulting in only needing > 1/7 of the time to download. (And no, don't try to point out shallow > clones to me. I hate those things, they're an awful hack, *and* they > don't work for us. It's nice getting all commit history, all trees, > and most blobs including all for at least the last two years while > still saving lots of space.) > > [For the curious, I did make a simple script to create these "cheap, > fake clones" for repositories of interest. See > https://github.com/newren/sequester-old-big-blobs. But they are > definitely a hack with some sharp corners, with failing fsck and > prunes only being part of the story.] If you want to reduce the sharpness of the corners, it might be possible to designate the pack with missing blobs as a promisor pack (add a .promisor file - which is just like the .keep file except s/keep/promisor/) and a fake promisor remote. That will make fsck and repack (GC) work. > 3) Back to your idea... > > What you're proposing actually sounds very similar to partial clones, > whose idea is to make it okay to download a subset of history. 
The > primary problems with partial clones are (a) they are still under > development and are just experimental, (b) they are currently > implemented with a "promisor" mode, meaning that if a command tries to > run over any piece of missing data then the command pauses while the > objects are downloaded from the server. I want an offline mode (even > if I'm online) where only explicit downloading from the server (clone, > fetch, etc.) occurs. David Turner had an idea of what could be done (instead of fetching) in such an offline mode [1], so I replied there. [1] https://lore.kernel.org/git/d4361b6d34513a3fdefa564d10269f60d4732208.camel@novalis.org/ > Instead of inventing yet another partial-clone-like thing, it'd be > nice if your new mechanism could just be implemented in terms of > partial clones, extending them as you need. I don't like the idea of > supporting multiple competing implementations of partial clones > withing git.git, but if it's just some extensions of the existing > capability then it sounds great. But you may want to talk with > Jonathan Tan if you want to go this route (cc'd), since he's the > partial clone expert. Ah, thanks for your kind words. ^ permalink raw reply [flat|nested] 9+ messages in thread
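The `.promisor` suggestion above is mechanical: drop an empty `pack-*.promisor` file next to the corresponding `pack-*.idx`, and designate a (possibly fake) promisor remote. A sketch, assuming a git new enough to have partial-clone support; the config keys are real, and `origin` here is just a stand-in name:

```shell
# Mark every local pack as a promisor pack so that objects they
# reference but do not contain are tolerated by fsck and repack.
set -e
dir=$(mktemp -d); cd "$dir"
git init -q demo; cd demo
git -c user.name=t -c user.email=t@example.com commit -q --allow-empty -m seed
git repack -adq                              # make sure at least one pack exists
for idx in .git/objects/pack/pack-*.idx; do
    : > "${idx%.idx}.promisor"               # empty .promisor beside each .idx
done
git config core.repositoryformatversion 1
git config extensions.partialClone origin    # designate the promisor remote
git fsck --no-progress                       # passes; missing referenced objects would be tolerated
```

No fetch ever happens in this sketch, so the fake remote is never contacted; it only has to exist as a name for the promisor machinery.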
* Re: [RFC] Extending git-replace 2020-01-14 6:55 ` Elijah Newren 2020-01-14 19:11 ` Jonathan Tan @ 2020-01-16 3:30 ` Kaushik Srenevasan 1 sibling, 0 replies; 9+ messages in thread From: Kaushik Srenevasan @ 2020-01-16 3:30 UTC (permalink / raw) To: Elijah Newren; +Cc: Git Mailing List, Jonathan Tan Hi Elijah, On Mon, Jan 13, 2020 at 10:55 PM Elijah Newren <newren@gmail.com> wrote: > 1) You can rewrite history, and then use replace references to map old > commit IDs to new commit IDs. This allows anyone to continue using > old commit IDs (which aren't even part of the new repository anymore) > in git commands and git automatically uses and shows the new commit > IDs. No problems with fsck or prune or fetch either. Creating these > replace refs is fairly simple if your repository rewriting program > (e.g. git-filter-repo or BFG Repo Cleaner) provides a mapping of old > IDs to new IDs, and if you are using git-filter-repo it even creates > the replace refs for you. (The one downside is that you can't use > abbreviated refs to refer to replace refs, thus you can't use > abbreviated old commit IDs in this scheme.) > This is the path we're considering taking unless something easier comes out of this (or other) proposal(s). We're working on determining compatibility with tools. Thanks for the pointer to git-filter-repo. It looks great! > Instead of inventing yet another partial-clone-like thing, it'd be > nice if your new mechanism could just be implemented in terms of > partial clones, extending them as you need. I don't like the idea of > supporting multiple competing implementations of partial clones > withing git.git, but if it's just some extensions of the existing > capability then it sounds great. But you may want to talk with > Jonathan Tan if you want to go this route (cc'd), since he's the > partial clone expert. I agree that it isn't worth inventing another partial clone like feature. 
It sounds, however, like something based on partial clone would not solve the problem on the "server" side, or perhaps I'm missing something (as I've not had a chance to check out the implementation yet). While I'm not at all insisting that `git-blacklist` be the way to achieve it, we'd (Twitter) like to be able to permanently get rid of the objects in question while retaining the ability to run GC and FSCK on all copies of the repository, preferably without having to rewrite history. Even merely making `--no-replace-objects` be FALSE by default for GC and FSCK (and printing a warning instead), while retaining existing behavior when it is explicitly requested, would significantly improve `git-replace`'s usability (for this purpose). The bits related to ref transfer in my proposal are optional. Git users can either be required to explicitly fetch the refs/replace namespace (as they do today), or we could print a message (at the end of clone), letting the user know that there are replacements available on the server. I'd only proposed a new command because changing `git-replace` in this way would break backward compatibility. -- Kaushik ^ permalink raw reply [flat|nested] 9+ messages in thread
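For reference, the "explicitly fetch the replace namespace" behavior mentioned above looks like this today; the refspec is real, and `src`/`dst` are throwaway repositories standing in for a server and a client:

```shell
# Replace refs are not transferred by a default clone; an explicit
# refspec is needed to mirror them.
set -e
dir=$(mktemp -d); cd "$dir"
git init -q src; cd src
git -c user.name=t -c user.email=t@example.com commit -q --allow-empty -m one
blob=$(echo secret | git hash-object -w --stdin)
repl=$(echo redacted | git hash-object -w --stdin)
git replace "$blob" "$repl"                   # server-side replacement
cd ..
git clone -q src dst; cd dst
git rev-parse -q --verify "refs/replace/$blob" || echo "clone did not bring replace refs"
git config --add remote.origin.fetch '+refs/replace/*:refs/replace/*'
git fetch -q origin                           # now the namespace is mirrored
git rev-parse --verify "refs/replace/$blob"
```

The proposal in this thread would, in effect, make an equivalent of that extra refspec implicit for a `refs/blacklist` namespace.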
* Re: [RFC] Extending git-replace 2020-01-14 5:33 [RFC] Extending git-replace Kaushik Srenevasan 2020-01-14 6:55 ` Elijah Newren @ 2020-01-14 18:19 ` David Turner 2020-01-14 19:03 ` Jonathan Tan 1 sibling, 1 reply; 9+ messages in thread From: David Turner @ 2020-01-14 18:19 UTC (permalink / raw) To: Kaushik Srenevasan, git On Mon, 2020-01-13 at 21:33 -0800, Kaushik Srenevasan wrote: > A feature that allowed such permanent replacement (say a > `git-blacklist` or a `git-replace --blacklist`) might work as > follows: > 1. Blacklisted objects are stored as references under a new namespace > -- `refs/blacklist`. > 2. The object loader unconditionally translates a blacklisted OID > into > the OID it’s been replaced with. > 3. The `+refs/blacklist/*:refs/blacklist/*` refspec is implicitly > always a part of fetch and push transactions. There are definitely some security implications here. I assume that there's a config on the client to trust the server's refs/blacklist/*, and that the documentation for this explains that it allows your repo to be messed with in quite dangerous ways. And on the server, I would expect that only privileged users could push to refs/blacklist/*. To Elijah's point that this is related to partial clones and promisors, I think Kaushik's idea is subtly different in that it involves replacements, while promisors try to offer a seamless experience. I wonder whether Kaushik actually needs the replacement functionality? That is, would it be sufficient if every replaced file were replaced with the exact text "me caga en la leche" instead of a custom hand-crafted replacement? I guess it's a bit complicated because while that's a reasonable blob, it's not a valid commit. So maybe this mechanism would be limited to blobs. I thought about whether we could use a different flavor of replacement for commits, but those generally have to be custom because they each have different parents. And if that would be sufficient, could promisors be used for this? 
I don't know how those interact with fsck and the other commands that you're worried about. Basically, the idea would be to use most of the existing promisor code, and then have a mode where instead of visiting the promisor, we just always return "me caga en la leche" (and this does not have its SHA checked, of course). This could work together with some sort of refs/blacklist mechanism to enable the server to choose which objects the client replaces. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC] Extending git-replace 2020-01-14 18:19 ` David Turner @ 2020-01-14 19:03 ` Jonathan Tan 2020-01-14 20:39 ` Elijah Newren 0 siblings, 1 reply; 9+ messages in thread From: Jonathan Tan @ 2020-01-14 19:03 UTC (permalink / raw) To: novalis; +Cc: kaushik, git, Jonathan Tan > That is, would it be sufficient if every replaced file were replaced > with the exact text "me caga en la leche" instead of a custom hand- > crafted replacement? I guess it's a bit complicated because while > that's a reasonable blob, it's not a valid commit. So maybe this > mechanism would be limited to blobs. I thought about whether we could > a different flavor of replacement for commits, but those generally have > to be custom because they each have different parents. Since the original email just discussed blobs, I'll confine myself to discussing blobs. (Commits are trickier, as you said.) > And if that would be sufficient, could promisors be used for this? I > don't know how those interact with fsck and the other commands that > you're worried about. Basically, the idea would be to use most of the > existing promisor code, and then have a mode where instead of visiting > the promisor, we just always return "me caga en la leche" (and this > does not have its SHA checked, of course). Missing promisor objects do not prevent fsck from passing - this is part of the original design (any packfiles we download from the specifically designated promisor remote are marked as such, and any objects that the objects in the packfile refer to are considered OK to be missing). Currently, when a missing object is read, it is first fetched (there are some more details that I can go over if you have any specific questions). What you're suggesting here is to return a fake blob with wrong hash - I haven't looked at all the callers of read-object functions in detail, but I don't think all of them are ready for such a behavioral change. 
Maybe it would be sufficient to just make this work in a more limited scope (e.g. checkout only - and if we need different replacement blobs for different object IDs, maybe we could have something similar to the clean/smudge filters). > This could work together with some sort refs/blacklist mechanism to > enable the server to choose which objects the client replaces. In the original email, Kaushik mentioned objects larger than a certain size - we already have support for that (--filter=blob:limit=1000000, for example). Having said that, Git is already able to tolerate any exclusion (of tree or blob) from the server - we already need this in order to support changing of filters, for example. ^ permalink raw reply [flat|nested] 9+ messages in thread
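The size filter mentioned above is usable today. A sketch of a partial clone that leaves large blobs behind, using local `file://` transport (which requires `uploadpack.allowfilter` on the serving side) and a git new enough to support partial clone:

```shell
# Partial clone that omits blobs over 1 MB; -n skips checkout so the
# big blob is not fetched back on demand.
set -e
dir=$(mktemp -d); cd "$dir"
git init -q src; cd src
dd if=/dev/zero of=big.bin bs=1024 count=2048 2>/dev/null   # ~2 MB blob
echo small > small.txt
git add big.bin small.txt
git -c user.name=t -c user.email=t@example.com commit -qm files
git config uploadpack.allowfilter true
cd ..
git clone -q -n --filter=blob:limit=1000000 "file://$dir/src" dst
cd dst
git rev-list --objects --missing=print HEAD | grep -c '^?'  # count of omitted blobs
```

The `?`-prefixed entries from `rev-list --missing=print` are the objects the promisor remote still owes the client; here only the 2 MB blob is omitted.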
* Re: [RFC] Extending git-replace 2020-01-14 19:03 ` Jonathan Tan @ 2020-01-14 20:39 ` Elijah Newren 2020-01-14 21:57 ` Jonathan Tan 0 siblings, 1 reply; 9+ messages in thread From: Elijah Newren @ 2020-01-14 20:39 UTC (permalink / raw) To: Jonathan Tan; +Cc: novalis, Kaushik Srenevasan, Git Mailing List On Tue, Jan 14, 2020 at 11:05 AM Jonathan Tan <jonathantanmy@google.com> wrote: > > > That is, would it be sufficient if every replaced file were replaced > > with the exact text "me caga en la leche" instead of a custom hand- > > crafted replacement? I guess it's a bit complicated because while > > that's a reasonable blob, it's not a valid commit. So maybe this > > mechanism would be limited to blobs. I thought about whether we could > > a different flavor of replacement for commits, but those generally have > > to be custom because they each have different parents. > > Since the original email just discussed blobs, I'll confine myself to > discussing blobs. (Commits are trickier, as you said.) > > > And if that would be sufficient, could promisors be used for this? I > > don't know how those interact with fsck and the other commands that > > you're worried about. Basically, the idea would be to use most of the > > existing promisor code, and then have a mode where instead of visiting > > the promisor, we just always return "me caga en la leche" (and this > > does not have its SHA checked, of course). Maybe; it doesn't necessarily need to be the same object returned, and these replacements could be user-specified via replace refs... > Missing promisor objects do not prevent fsck from passing - this is part > of the original design (any packfiles we download from the specifically > designated promisor remote are marked as such, and any objects that the > objects in the packfile refer to are considered OK to be missing). 
Is there ever a risk that objects in the downloaded packfile come across as deltas against other objects that are missing/excluded, or does the partial clone machinery ensure that doesn't happen? (Because this was certainly the biggest pain-point with my "fake cheap clone" hacks.) > Currently, when a missing object is read, it is first fetched (there are > some more details that I can go over if you have any specific > questions). What you're suggesting here is to return a fake blob with > wrong hash - I haven't looked at all the callers of read-object > functions in detail, but I don't think all of them are ready for such a > behavioral change. git-replace already took care of that for you and provides that guarantee, modulo the --no-replace-objects & fsck & prune & fetch & whatnot cases that ignore replace objects as Kaushik mentioned. I took advantage of this to great effect with my "fake cheap clone" hacks. Based in part on your other email where you made a suggestion about promisors, I'm starting to think a pretty good first cut solution might look like the following: * user manually adds a bunch of replace refs to map the unwanted big blobs to something else (e.g. a README about how the files were stripped, or something similar to this) * a partial clone specification that says "exclude objects that are referenced by replace refs" * add a fake promisor to the downloaded promisor pack so that if anyone runs with --no-replace-objects or similar then they get an error saying the specified objects don't exist and can't be downloaded. Anyone see any obvious problems with this? > Maybe it would be sufficient to just make this work > in a more limited scope (e.g. checkout only - and if we need different > replacement blobs for different object IDs, maybe we could have > something similar to the clean/smudge filters). > > > This could work together with some sort refs/blacklist mechanism to > > enable the server to choose which objects the client replaces. 
> > In the original email, Kaushik mentioned objects larger than a certain > size - we already have support for that (--filter=blob:limit=1000000, > for example). Having said that, Git is already able to tolerate any > exclusion (of tree or blob) from the server - we already need this in > order to support changing of filters, for example. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC] Extending git-replace 2020-01-14 20:39 ` Elijah Newren @ 2020-01-14 21:57 ` Jonathan Tan 2020-01-14 22:46 ` Elijah Newren 0 siblings, 1 reply; 9+ messages in thread From: Jonathan Tan @ 2020-01-14 21:57 UTC (permalink / raw) To: newren; +Cc: jonathantanmy, novalis, kaushik, git > > Missing promisor objects do not prevent fsck from passing - this is part > > of the original design (any packfiles we download from the specifically > > designated promisor remote are marked as such, and any objects that the > > objects in the packfile refer to are considered OK to be missing). > > Is there ever a risk that objects in the downloaded packfile come > across as deltas against other objects that are missing/excluded, or > does the partial clone machinery ensure that doesn't happen? (Because > this was certainly the biggest pain-point with my "fake cheap clone" > hacks.) The server may send thin packs during a fetch or clone, but because the client runs index-pack (which calculates the hash of every object downloaded, necessitating having the full object, which in turn triggers fetches of any delta bases), this should not happen. But if you create the packfile in some other way and then manually set a fake promisor remote (as I perhaps too naively suggested) then the mechanism will attempt to fetch missing delta bases, which (I think) is not what you want. > > Currently, when a missing object is read, it is first fetched (there are > > some more details that I can go over if you have any specific > > questions). What you're suggesting here is to return a fake blob with > > wrong hash - I haven't looked at all the callers of read-object > > functions in detail, but I don't think all of them are ready for such a > > behavioral change. > > git-replace already took care of that for you and provides that > guarantee, modulo the --no-replace-objects & fsck & prune & fetch & > whatnot cases that ignore replace objects as Kaushik mentioned. 
I > took advantage of this to great effect with my "fake cheap clone" > hacks. Based in part on your other email where you made a suggestion > about promisors, I'm starting to think a pretty good first cut > solution might look like the following: > > * user manually adds a bunch of replace refs to map the unwanted big > blobs to something else (e.g. a README about how the files were > stripped, or something similar to this) > * a partial clone specification that says "exclude objects that are > referenced by replace refs" > * add a fake promisor to the downloaded promisor pack so that if > anyone runs with --no-replace-objects or similar then they get an > error saying the specified objects don't exist and can't be > downloaded. > > Anyone see any obvious problems with this? Looking at the list of commands given in the original email (fsck, upload-pack, pack/unpack-objects, prune and index-pack), if we use a filter by blob size (instead of the partial clone specification suggested), this would satisfy the purposes of fsck and prune only. If we had a partial clone specification that excludes objects referenced by replace refs, then upload-pack from this partial repository (and pack-objects) would work too. But there might be non-obvious problems that I haven't thought of. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC] Extending git-replace 2020-01-14 21:57 ` Jonathan Tan @ 2020-01-14 22:46 ` Elijah Newren 0 siblings, 0 replies; 9+ messages in thread From: Elijah Newren @ 2020-01-14 22:46 UTC (permalink / raw) To: Jonathan Tan; +Cc: novalis, Kaushik Srenevasan, Git Mailing List On Tue, Jan 14, 2020 at 1:57 PM Jonathan Tan <jonathantanmy@google.com> wrote: > > > > Missing promisor objects do not prevent fsck from passing - this is part > > > of the original design (any packfiles we download from the specifically > > > designated promisor remote are marked as such, and any objects that the > > > objects in the packfile refer to are considered OK to be missing). > > > > Is there ever a risk that objects in the downloaded packfile come > > across as deltas against other objects that are missing/excluded, or > > does the partial clone machinery ensure that doesn't happen? (Because > > this was certainly the biggest pain-point with my "fake cheap clone" > > hacks.) > > The server may send thin packs during a fetch or clone, but because the > client runs index-pack (which calculates the hash of every object > downloaded, necessitating having the full object, which in turn triggers > fetches of any delta bases), this should not happen. So if a user does a partial clone, filtering by blob size >= 1M, and if they have several blobs of size just above and just below that limit, then the partial clone will work but probably cause them to still download several blobs above the limit size anyway? (Which, if I'm understanding correctly, happens because the blobs just smaller than 1M likely will delta well against the blobs just larger than 1M.) > But if you create the packfile in some other way and then manually set a > fake promisor remote (as I perhaps too naively suggested) then the > mechanism will attempt to fetch missing delta bases, which (I think) is > not what you want. 
Well, it's not optimal, but we're currently just dying with cryptic errors whenever we have missing delta bases, and this happens whenever we have an accidental fetch of older branches (although this does have the nice side effect of notifying us of stray fetches in our CI scripts). Your promisor suggestion would at least permit gc's & prunes if we use it in more places, so should be an improvement. I just wanted to verify whether this problem with delta bases would remain. > > > Currently, when a missing object is read, it is first fetched (there are > > > some more details that I can go over if you have any specific > > > questions). What you're suggesting here is to return a fake blob with > > > wrong hash - I haven't looked at all the callers of read-object > > > functions in detail, but I don't think all of them are ready for such a > > > behavioral change. > > > > git-replace already took care of that for you and provides that > > guarantee, modulo the --no-replace-objects & fsck & prune & fetch & > > whatnot cases that ignore replace objects as Kaushik mentioned. I > > took advantage of this to great effect with my "fake cheap clone" > > hacks. Based in part on your other email where you made a suggestion > > about promisors, I'm starting to think a pretty good first cut > > solution might look like the following: > > > > * user manually adds a bunch of replace refs to map the unwanted big > > blobs to something else (e.g. a README about how the files were > > stripped, or something similar to this) > > * a partial clone specification that says "exclude objects that are > > referenced by replace refs" > > * add a fake promisor to the downloaded promisor pack so that if > > anyone runs with --no-replace-objects or similar then they get an > > error saying the specified objects don't exist and can't be > > downloaded. > > > > Anyone see any obvious problems with this? 
> > Looking at the list of commands given in the original email (fsck, > upload-pack, pack/unpack-objects, prune and index-pack), if we use a > filter by blob size (instead of the partial clone specification > suggested), this would satisfy the purposes of fsck and prune only. > > If we had a partial clone specification that excludes object referenced > by replace refs, then upload-pack from this partial repository (and > pack-objects) would work too. > > But there might be non-obvious problems that I haven't thought of. Cool, sounds like it's at least worth investigating. Maybe Kaushik is interested, or maybe I consider throwing it on my backlog and coming back to it in a year or two. :-) ^ permalink raw reply [flat|nested] 9+ messages in thread