* removing content from git history
@ 2007-02-21 16:45 Michael Hendricks
  2007-02-21 16:56 ` Shawn O. Pearce
  2007-02-21 17:14 ` Linus Torvalds
  0 siblings, 2 replies; 25+ messages in thread
From: Michael Hendricks @ 2007-02-21 16:45 UTC
To: git

I assume that this question has already been addressed on the mailing list, but I wasn't able to find anything about it in the archives.

Is it possible to remove content entirely from git's history? I have a client who does not use git for version control. A couple months ago they committed some sensitive client information which should never have been committed. Recently, they realized the mistake and now want to remove all traces of the mistake from history.

I would like to migrate them to git at some point. However, if they had been using git for version control already, I'm not sure how I would have solved this particular problem. Any suggestions?

--
Michael
* Re: removing content from git history
From: Shawn O. Pearce @ 2007-02-21 16:56 UTC
To: Michael Hendricks; +Cc: git

Michael Hendricks <michael@ndrix.org> wrote:
> Is it possible to remove content entirely from git's history?

No, not once it has been published to another repository. Since every developer has a copy of the repository it's very difficult to remove something, as it must be removed from every developer's repository, and each developer must perform an action to agree to that removal. So just one hold-out will keep the bad content around.

> I have a
> client who does not use git for version control. A couple months ago
> they committed some sensitive client information which should never have
> been committed. Recently, they realized the mistake and now want to
> remove all traces of the mistake from history.
>
> I would like to migrate them to git at some point. However, if they had
> been using git for version control already, I'm not sure how I would
> have solved this particular problem. Any suggestions?

The *only* way to do this in Git is to completely recreate every commit after that point. This changes all commit IDs and basically forks the project into two completely different histories: the one with the bad thing in it, and the one without the bad thing. Users who have the bad thing will continue to have the bad thing until they take explicit action to throw away all of that history and switch to the other one.

Now this is actually not a huge deal if you do it on your local repository and go "whoops, I should not have committed that". If you have not yet pushed the commit to another repository (and someone has not yet fetched it from you either) you can use git-rebase to discard it.
But once it's been pushed/fetched the genie is out of the bottle, and it's not going back in.

--
Shawn.
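As a rough sketch of the purely local case described above: when the bad commit has not left your repository, `git rebase --onto` can replay the later commits on top of the bad commit's parent, which leaves the bad commit out of the branch. The scratch repository, file names and identity below are made up for illustration; they are not from the thread.

```shell
#!/bin/sh
# Sketch: drop one unpublished commit from the current branch.
set -e
repo=$(mktemp -d)                 # throwaway repository, illustration only
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

echo one > a.txt
git add a.txt
git commit -q -m "first"

echo hunter2 > secret.txt         # the commit that should never have happened
git add secret.txt
git commit -q -m "oops: secret"
bad=$(git rev-parse HEAD)

echo two > b.txt
git add b.txt
git commit -q -m "third"

# Replay every commit after $bad onto the parent of $bad.
git rebase -q --onto "$bad^" "$bad"

commits=$(git rev-list --count HEAD)
leaked=$(git log --format=%s -- secret.txt)
echo "commits left on branch: $commits"
```

Every commit after the dropped one gets a new ID. The old commit also stays in the local object store (it is still referenced by the reflog) until the reflog entries expire and the repository is pruned, so this only cleans up the branch, not the disk.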
* Re: removing content from git history 2007-02-21 16:56 ` Shawn O. Pearce @ 2007-02-21 17:17 ` J. Bruce Fields 2007-02-21 18:02 ` Linus Torvalds 0 siblings, 1 reply; 25+ messages in thread From: J. Bruce Fields @ 2007-02-21 17:17 UTC (permalink / raw) To: Shawn O. Pearce; +Cc: Michael Hendricks, git On Wed, Feb 21, 2007 at 11:56:36AM -0500, Shawn O. Pearce wrote: > Now this is actually not a huge deal if you do it on your local > repository and go "whoops, I should not have committed that". If you > have not yet pushed the commit to another repository (and someone > has not yet fetched it from you either) you can use git-rebase to > discard it. Also it can't have done any (non-fast-forward) merges since then. Reconstructing history with a bunch of merges seems like something that could be a huge pain. (Though with some tools it might be doable.) --b. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: removing content from git history
From: Linus Torvalds @ 2007-02-21 18:02 UTC
To: J. Bruce Fields; +Cc: Shawn O. Pearce, Michael Hendricks, git

On Wed, 21 Feb 2007, J. Bruce Fields wrote:
>
> Reconstructing history with a bunch of merges seems like something that
> could be a huge pain. (Though with some tools it might be doable.)

It's not actually that painful, but it *is* expensive.

I wrote git-convert-cache (now "git convert-objects") back when we did the SHA1/compression switchover changes and the date format translation, so we've actually had a tool that can do history rewriting pretty much since day 1 (well, "day 14", to be exact, but still.. April 2005).

BUT:

 - I'm not guaranteeing that it works any more. We haven't changed the fundamental object format since, so that particular program has never gotten any testing. It still compiles, but does it work? I dunno.

   I actually tested it on git itself. It converted the top of the git tree successfully, and generated a *new* git history. Why? Because it will actually rewrite the old git tree entries that have permission 0664 into 0644: the *data* will be identical (and no git tools except for "git fsck --pedantic" will even notice the difference), but the converted tree avoids one of the legacy decisions that we never fixed in the git repository itself.

   So it works at least to *some* degree, but I would suggest you be very very careful!

 - it can be slow. For something like git, which isn't *that* big, and where we actually don't need to do a lot of rewriting (ie all the blobs stay the same, and only a few trees have to be rewritten, and so it's really just rewriting commits), it's not that bad. It actually converted the whole git history in less than ten seconds for me.
   But if you have a *huge* tree, and you actually convert objects too (say, you started using git on Windows before the "autocrlf" thing, and want to convert the old blobs from CRLF -> LF), it would

    (a) require some extensions to convert-object.c to do the blob conversion
    (b) be *much* slower
    (c) generate tons of unpacked objects (because git-convert-objects doesn't know to pack in between, and doesn't use anything newfangled like "git-fast-import" to do anything clever)

   For the kernel, it took 2 minutes, but again, it was exactly the same thing: just a few old tree objects that it rewrote, and as a result, every single commit SHA1 changed. Still, it was almost _only_ commits (it generated 49521 new objects, 49332 of which were the new commit history)

   If you want to rewrite a *lot* (ie something that exists in more than just a few trees), and you have lots of history, it can be very expensive indeed.

 - It currently doesn't convert the SHA1 numbers that show up in commit messages. It could, and it should. But it doesn't. So once you convert a git project, it doesn't do the nice "gitk does links from the SHA1 text in a commit message to the commit it talks about" any more.

   Somebody should fix that.

Anyway, git-convert-objects does kind of give you a starting point. It should be fixed to use "git-fast-import" or repack once in a while (so that it doesn't leave tons and tons of unpacked objects), and it should be fixed to fix up any commit messages that mention SHA1's that it has already converted to something else, but it seems to still work. It would not be impossible at all to extend the tree-rewriting logic to remove some file or a particular SHA1 object you want to replace.

Linus
* Re: removing content from git history
From: Linus Torvalds @ 2007-02-21 18:24 UTC
To: J. Bruce Fields; +Cc: Shawn O. Pearce, Michael Hendricks, git

On Wed, 21 Feb 2007, Linus Torvalds wrote:
>
> For the kernel, it took 2 minutes, but again, it was exactly the same
> thing: just a few old tree objects that it rewrote, and as a result,
> every single commit SHA1 changed. Still, it was almost _only_ commits
> (it generated 49521 new objects, 49332 of which was the new commit
> history)

Side note: I wasn't entirely accurate. The kernel had trees with file mode 0644 for all the early commits, because my umask is 0022. So everything up to commit 4bfa437cf1 is shared after the conversion.

But the next one (commit 5dfa9c1b4f) introduced the file include/asm-mips/vr41xx/pci.h with file mode 0664, and I'm not 100% sure why that one happened with that file mode, but as a result, every single commit ever after will have a different SHA1, because the tree got rewritten (and subsequent commits - even if their trees did *not* get rewritten - will obviously have different parent SHA1's).

So 56 commits are shared, and "only" 49276 commits were rewritten (and apparently 245 trees).

Linus
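The point about parent SHA1's can be tried out with plumbing commands. The following is a minimal sketch, not anything from the thread: it builds two commits with the same tree, message, author and dates, differing only in their parent, and shows that they still get different IDs. The repository, identity and timestamp are assumptions chosen for illustration.

```shell
#!/bin/sh
# Sketch: identical tree + message + dates, different parent => different commit ID.
set -e
repo=$(mktemp -d)                 # throwaway repository, illustration only
cd "$repo"
git init -q

# Pin identity and dates so the parent is the only thing that differs.
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"
export GIT_AUTHOR_DATE="1172082240 +0000" GIT_COMMITTER_DATE="1172082240 +0000"

echo data > file.txt
git add file.txt
tree=$(git write-tree)

# Two different root commits that happen to share the same tree.
root_a=$(echo "root A" | git commit-tree "$tree")
root_b=$(echo "root B" | git commit-tree "$tree")

# Children that are identical except for the parent they point at.
child_a=$(echo "same message" | git commit-tree "$tree" -p "$root_a")
child_b=$(echo "same message" | git commit-tree "$tree" -p "$root_b")

tree_a=$(git rev-parse "$child_a^{tree}")
tree_b=$(git rev-parse "$child_b^{tree}")
echo "child of A: $child_a"
echo "child of B: $child_b"
```

This is why one rewritten tree early in history changes the ID of every later commit, even those whose own trees are untouched.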
* Re: removing content from git history
From: Shawn O. Pearce @ 2007-02-21 21:00 UTC
To: Linus Torvalds; +Cc: J. Bruce Fields, Michael Hendricks, git

Linus Torvalds <torvalds@linux-foundation.org> wrote:
> Anyway, git-convert-objects does kind of give you a starting point. It
> should be fixed to use "git-fast-import" or repack once in a while (so
> that it doesn't leave tons and tons of unpacked objects), and it should be
> fixed to fix up any commit messages that mention SHA1's that it has
> already converted to something else, but it seems to still work. It would
> not be impossible at all to extend the tree-rewriting logic to remove some
> file or a particular SHA1 object you want to replace.

One idea Junio and I kicked around on #git a short while ago was to arrange for a pipe between the current Git process and git-fast-import, where the pipe was used from within write_sha1_file() rather than creating the loose object.

This way an existing process like git-apply or git-convert-objects could easily spew hundreds of thousands of objects without needing to worry about repacking in the middle; nor would we need to worry about the complexity of trying to disentangle the multiobject packing parts of fast-import into some sort of library.

Obviously this is only a good idea if we are going to be making enough objects to warrant using a packfile; small bursts of 10-20 objects from a git-apply don't really justify a packfile. But applying 100s of patches in a row might, if we could keep them all fed through the same git-fast-import backend (and thus into the same packfile).

--
Shawn.
* Re: removing content from git history
From: Linus Torvalds @ 2007-02-21 21:11 UTC
To: Shawn O. Pearce; +Cc: J. Bruce Fields, Michael Hendricks, git

On Wed, 21 Feb 2007, Shawn O. Pearce wrote:
>
> One idea Junio and I kicked around on #git a short while ago
> was to arrange for a pipe between the current Git process
> and git-fast-import, where the pipe was used from within
> write_sha1_file() rather than creating the loose object.

The problem there is that most conversion scripts that use "write_sha1_file()" will want to *read* that file later. If git-fast-import hasn't generated the pack yet (because it's still waiting for more data), that will not work at all.

So then you basically force the conversion script to keep remembering all the old object data (using something like pretend_sha1_file), or you limit it to things that just always re-write the whole object and never need any old object references that they might have written.

A lot of conversions tend to be incremental, ie they will depend on the data they converted previously.

Linus
* Re: removing content from git history
From: Shawn O. Pearce @ 2007-02-21 21:21 UTC
To: Linus Torvalds; +Cc: J. Bruce Fields, Michael Hendricks, git

Linus Torvalds <torvalds@linux-foundation.org> wrote:
> The problem there is that most conversion scripts that use
> "write_sha1_file()" will want to *read* that file later. If
> git-fast-import hasn't generated the pack yet (because it's still waiting
> for more data), that will not work at all.

Yes, indeed...

> So then you basically force the conversion script to keep remembering all
> the old object data (using something like pretend_sha1_file), or you limit
> it to things that just always re-write the whole object and never need any
> old object references that they might have written.
>
> A lot of conversions tend to be incremental, ie they will depend on the
> data they converted previously.

Which is why I was actually thinking of flipping this on its head. Libify git-apply and embed that into fast-import, then one of the native input formats might just be an mbox, or something close enough that a simple C/perl/sed prefilter could make an mbox into the input.

fast-import can (and does if necessary) go back to access the packfile it is writing. It has the index data held in memory and uses only OBJ_OFS_REF so that sha1_file.c can unpack deltas just fine, even though we lack an index file and have not completely checksummed the pack itself.

So although no other Git process can use the packfile, it is usable from within fast-import...

--
Shawn.
* Re: removing content from git history
From: Bill Lear @ 2007-10-09 20:58 UTC
To: Shawn O. Pearce; +Cc: Linus Torvalds, J. Bruce Fields, Michael Hendricks, git

I'm resurrecting this old thread, as we have come across a similar need and I could not tell if this has been settled. More below...

On Wednesday, February 21, 2007 at 16:21:30 (-0500) Shawn O. Pearce writes:
>Linus Torvalds <torvalds@linux-foundation.org> wrote:
>> The problem there is that most conversion scripts that use
>> "write_sha1_file()" will want to *read* that file later. If
>> git-fast-import hasn't generated the pack yet (because it's still waiting
>> for more data), that will not work at all.
>
>Yes, indeed...
>
>> So then you basically force the conversion script to keep remembering all
>> the old object data (using something like pretend_sha1_file), or you limit
>> it to things that just always re-write the whole object and never need any
>> old object references that they might have written.
>>
>> A lot of conversions tend to be incremental, ie they will depend on the
>> data they converted previously.
>
>Which is why I was actually thinking of flipping this on its head.
>Libify git-apply and embed that into fast-import, then one of the
>native input formats might just be an mbox, or something close enough
>that a simple C/perl/sed prefilter could make an mbox into the input.
>
>fast-import can (and does if necessary) go back to access the
>packfile it is writing. It has the index data held in memory and
>uses only OBJ_OFS_REF so that sha1_file.c can unpack deltas just
>fine, even though we lack an index file and have not completely
>checksummed the pack itself.
>
>So although no other Git process can use the packfile, it is usable
>from within fast-import...
As I understand this thread, it does not appear that a resolution was reached. Our company has content in our central git repository that we need to remove per a contractual obligation. I believe the content in question is limited to one sub-directory, that has existed since (or near to) the beginning of the repo, if that matters. We obviously would just like to issue a "git nuke" operation and be done with it, if that is available. Barring that, we could probably follow reasonably simple steps to purge the content and rebuild the repo. So, what options do we have at present? Bill ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: removing content from git history 2007-10-09 20:58 ` Bill Lear @ 2007-10-09 21:02 ` J. Bruce Fields 2007-10-09 22:25 ` Bill Lear 2007-10-10 14:41 ` Johannes Schindelin 1 sibling, 1 reply; 25+ messages in thread From: J. Bruce Fields @ 2007-10-09 21:02 UTC (permalink / raw) To: Bill Lear; +Cc: Shawn O. Pearce, Linus Torvalds, Michael Hendricks, git On Tue, Oct 09, 2007 at 03:58:57PM -0500, Bill Lear wrote: > As I understand this thread, it does not appear that a resolution > was reached. Our company has content in our central git repository > that we need to remove per a contractual obligation. I believe the > content in question is limited to one sub-directory, that has existed > since (or near to) the beginning of the repo, if that matters. We > obviously would just like to issue a "git nuke" operation and be done > with it, if that is available. Barring that, we could probably follow > reasonably simple steps to purge the content and rebuild the repo. > > So, what options do we have at present? Have you looked at git-filter-branch in a recent version of git? The man page has some good examples. --b. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: removing content from git history 2007-10-09 21:02 ` J. Bruce Fields @ 2007-10-09 22:25 ` Bill Lear 0 siblings, 0 replies; 25+ messages in thread From: Bill Lear @ 2007-10-09 22:25 UTC (permalink / raw) To: J. Bruce Fields; +Cc: Shawn O. Pearce, Linus Torvalds, Michael Hendricks, git On Tuesday, October 9, 2007 at 17:02:35 (-0400) J. Bruce Fields writes: >On Tue, Oct 09, 2007 at 03:58:57PM -0500, Bill Lear wrote: >> As I understand this thread, it does not appear that a resolution >> was reached. Our company has content in our central git repository >> that we need to remove per a contractual obligation. I believe the >> content in question is limited to one sub-directory, that has existed >> since (or near to) the beginning of the repo, if that matters. We >> obviously would just like to issue a "git nuke" operation and be done >> with it, if that is available. Barring that, we could probably follow >> reasonably simple steps to purge the content and rebuild the repo. >> >> So, what options do we have at present? > >Have you looked at git-filter-branch in a recent version of git? The >man page has some good examples. Ah, no, though I will do so. It is apparently not in the version I have (1.5.2.4), but it is in 1.5.3.1. We'll give this a shot and complain if we can't handle it. Thank you. Bill ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: removing content from git history 2007-10-09 20:58 ` Bill Lear 2007-10-09 21:02 ` J. Bruce Fields @ 2007-10-10 14:41 ` Johannes Schindelin 1 sibling, 0 replies; 25+ messages in thread From: Johannes Schindelin @ 2007-10-10 14:41 UTC (permalink / raw) To: Bill Lear Cc: Shawn O. Pearce, Linus Torvalds, J. Bruce Fields, Michael Hendricks, git Hi, On Tue, 9 Oct 2007, Bill Lear wrote: > Our company has content in our central git repository that we need to > remove per a contractual obligation. I believe the content in question > is limited to one sub-directory, that has existed since (or near to) the > beginning of the repo, if that matters. We obviously would just like to > issue a "git nuke" operation and be done with it, if that is available. > Barring that, we could probably follow reasonably simple steps to purge > the content and rebuild the repo. > > So, what options do we have at present? git filter-branch. I suggest using the index filter. There is even a nice example in the man page of git filter-branch. Which reminds me that I have some TODOs left in filter-branch... Ciao, Dscho ^ permalink raw reply [flat|nested] 25+ messages in thread
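For readers following the filter-branch suggestion, here is a minimal sketch of the index-filter approach applied to a whole sub-directory, run in a scratch repository. The directory name `vendor-secret` and the file contents are hypothetical. The clean-up commands at the end are an assumption about what a purge usually also needs: filter-branch keeps the pre-rewrite commits reachable under `refs/original/` and in the reflogs, so the old objects remain on disk until those are removed.

```shell
#!/bin/sh
# Sketch: remove a directory from every commit with git filter-branch.
set -e
repo=$(mktemp -d)                 # throwaway repository, illustration only
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

mkdir vendor-secret               # hypothetical directory to purge
echo code > main.c
echo confidential > vendor-secret/data.txt
git add .
git commit -q -m "initial import"
echo more >> main.c
echo more >> vendor-secret/data.txt
git commit -q -a -m "update both"

# Newer Git versions print a warning and pause before filter-branch runs;
# this variable suppresses that. Older versions ignore it.
export FILTER_BRANCH_SQUELCH_WARNING=1

# Rewrite all refs, dropping the directory from the index of each commit.
git filter-branch -f --index-filter \
    'git rm -r -q --cached --ignore-unmatch vendor-secret' -- --all >/dev/null

# Drop the backup refs and reflog entries, then prune unreachable objects.
git for-each-ref --format='%(refname)' refs/original |
    while read -r ref; do git update-ref -d "$ref"; done
git reflog expire --expire=now --all
git gc -q --prune=now

touched=$(git log --all --format=%H -- vendor-secret)
files=$(git ls-tree -r --name-only HEAD)
echo "files in rewritten HEAD: $files"
```

As discussed earlier in the thread, this rewrites every commit ID from the first affected commit onward, and it only cleans this one repository. Every existing clone still holds the old history until its owner discards it.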
* Re: removing content from git history
From: Linus Torvalds @ 2007-02-21 17:14 UTC
To: Michael Hendricks; +Cc: git

On Wed, 21 Feb 2007, Michael Hendricks wrote:
>
> I assume that this question has already been addressed on the mailing
> list, but I wasn't able to find anything about it in the archives.
>
> Is it possible to remove content entirely from git's history?

It's been discussed. There are two options for doing it:

 - rewriting history. There are a few tools for this already, and for specific needs it would be fairly easy to resurrect git-convert-objects to do it for any kind of object.

   See "cg-admin-rewritehist" from cogito for an example of a tool that would do what you need done. In fact, it has this exact thing as the first example.

   (Btw, I think cg-admin-rewritehist is one of the few things that cogito had that was really a good idea. Not that people probably _used_ it much, but it's something that makes sense in the plumbing)

 - explicit support for "missing objects". We don't do it right now, but we could add it. It was discussed for things like limited history etc (the "shallow clone" kind of thing, before people actually added shallow clones), and it would support the notion of "we export all our history, but for internal reasons we cannot make certain objects available" kinds of workflows.

So right now, rewriting history is an option that you can do. It will effectively create a totally new branch (which you can then make into a new repository) which has nothing in common with the old branch from the point where it was modified. So you can never really merge the two ever again, and you need to make sure that everybody who had the old repo contents will destroy it.
But at least in theory, it wouldn't be impossible to extend on the ".git/grafts" kind of setup to say "this object has been consciously deleted", and that could in some circumstances be a better model. The biggest headache there would be the need to extend the native git protocol with a way to add such objects. Linus ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: removing content from git history
From: Nicolas Pitre @ 2007-02-21 18:02 UTC
To: Linus Torvalds; +Cc: Michael Hendricks, git

On Wed, 21 Feb 2007, Linus Torvalds wrote:

> But at least in theory, it wouldn't be impossible to extend on the
> ".git/grafts" kind of setup to say "this object has been consciously
> deleted", and that could in some circumstances be a better model. The
> biggest headache there would be the need to extend the native git protocol
> with a way to add such objects.

I think that would be a big security issue. Right now the GIT history can be validated and more importantly trusted from a single commit signature. If poking holes in that model is allowed by the graft mechanism, it must remain a local thing and a very conscious one; otherwise the GIT trust model would be greatly weakened.

If your goal is to remove content from a repository then the only sensible way is to rewrite history before publishing. It is pointless to add mechanisms to remove content after it has been distributed.

Nicolas
* Re: removing content from git history
From: Linus Torvalds @ 2007-02-21 18:13 UTC
To: Nicolas Pitre; +Cc: Michael Hendricks, git

On Wed, 21 Feb 2007, Nicolas Pitre wrote:
>
> If your goal is to remove content from a repository then the only
> sensible way is to rewrite history before publishing. It is pointless
> to add mechanisms to remove content after it has been distributed.

I'm not entirely in disagreement, but I can see the model where some company wants to make their work available (with the same history as their own internal stuff), but doesn't want to make a single file available for some reason. So they'd have an external thing that just has the file excised.

Now, arguably, it's a lot better to use a "supermodule" approach for something like this: have two separate git trees, publish the public one, and use an internal supermodule that ties the public and internal trees together.

So supermodules might be a way to solve it in a better (and safer - the "remove objects from the public tree" thing is very error prone, since if you *ever* expose the object by mistake, it's now public) way. But I don't think the "filter out objects" thing is necessarily fundamentally flawed as an approach.

Linus
* Re: removing content from git history
From: Nicolas Pitre @ 2007-02-21 18:39 UTC
To: Linus Torvalds; +Cc: Michael Hendricks, git

On Wed, 21 Feb 2007, Linus Torvalds wrote:

> So supermodules might be a way to solve it in a better (and safer - the
> "remove objects from the public tree" thing is very error prone, since if
> you *ever* expose the object by mistake, it's now public) way. But I don't
> think the "filter out objects" thing is necessarily fundamentally flawed
> as an approach.

Well if you really wanted to do such a thing then you could use a new object type that only serves as a stub pretending to be another object whose SHA1 would have been xyz. When referenced this object would generate a warning indicating to the user that the given object has been excised, but otherwise the whole reachability validation would still work as usual. And since this object would be distributed through standard mechanisms there would be no need for protocol extensions.

I don't know if this could help in creating SHA1 collisions though. We've dismissed them as highly improbable because a collision meant to hide compromised material would most probably require a binary blob somewhere to balance the hash, and would hardly be compilable or go undetected. But object stubs with the ability to pretend to have any possible SHA1 are in fact a nice way to hide 20-byte binary blobs in the hash chain, possibly making it "easier" to create "useful" collisions. This is where I see a weakening of the trust model.

Nicolas
* Re: removing content from git history 2007-02-21 17:14 ` Linus Torvalds 2007-02-21 18:02 ` Nicolas Pitre @ 2007-02-21 18:30 ` Michael Hendricks 2007-02-21 18:37 ` Shawn O. Pearce ` (2 more replies) 2007-02-21 19:01 ` Junio C Hamano 2 siblings, 3 replies; 25+ messages in thread From: Michael Hendricks @ 2007-02-21 18:30 UTC (permalink / raw) To: Linus Torvalds; +Cc: git On Wed, Feb 21, 2007 at 09:14:44AM -0800, Linus Torvalds wrote: > > See "cg-admin-rewritehist" from cogito for an example of a tool that > would do what you need done. In fact, it has this exact thing as the > first example. That's just what I was looking for. Thanks. > So right now, rewriting history is an option that you can do. It will > effectively create a totally new branch (which you can then make into a > new repository) which has nothing in common with the old branch from the > point where it was modified. So you can never really merge the two ever > again, and you need to make sure that everybody who had the old repo > contents will destroy it. What's a decent way to make a branch into a new repository? My first inclination is to "cp -a" the existing repository, checkout the branch, delete all other branches and repack. That seems to have worked in my quick test, but is there a better way? -- Michael ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: removing content from git history
From: Shawn O. Pearce @ 2007-02-21 18:37 UTC
To: Michael Hendricks; +Cc: Linus Torvalds, git

Michael Hendricks <michael@ndrix.org> wrote:
> What's a decent way to make a branch into a new repository? My first
> inclination is to "cp -a" the existing repository, checkout the branch,
> delete all other branches and repack. That seems to have worked in my
> quick test, but is there a better way?

Don't "cp -a" the repository, use git-clone.

And actually, if you just want to pull one branch out into its own repository you can do something like this:

	mkdir ../theonebranch
	cd ../theonebranch
	git init
	git fetch ../oldstuff theonebranch:master

and you have just the content of `theonebranch` from ../oldstuff stored here, as master. Optionally if you now want to actually see the files, you would do:

	git checkout

--
Shawn.
* Re: removing content from git history
From: Linus Torvalds @ 2007-02-21 18:47 UTC
To: Michael Hendricks; +Cc: git

On Wed, 21 Feb 2007, Michael Hendricks wrote:
>
> What's a decent way to make a branch into a new repository? My first
> inclination is to "cp -a" the existing repository, checkout the branch,
> delete all other branches and repack. That seems to have worked in my
> quick test, but is there a better way?

That works. As does just "clone repo, delete all unwanted branches, and prune" (of course, if you don't want the old repo, you can skip the "clone" part, and just do the "delete all unwanted branches and prune" thing).

In some ways, a more straightforward approach may be to just create a new repo, and populate it with just one branch (I say "more straightforward", not "easier", because I just think it's conceptually simpler):

	mkdir new-repo
	cd new-repo
	git init
	git pull old-repo <branch>

(add "--bare" and "--shared" to taste - with bare repos you can also do it the other way by doing a push into it from outside after you've created it, which can be the "logical" way to do it if you want to just publish the end result on some shared site)

Linus
* Re: removing content from git history
From: Linus Torvalds @ 2007-02-21 18:56 UTC
To: Michael Hendricks; +Cc: git

On Wed, 21 Feb 2007, Linus Torvalds wrote:
>
> On Wed, 21 Feb 2007, Michael Hendricks wrote:
> >
> > What's a decent way to make a branch into a new repository? My first
> > inclination is to "cp -a" the existing repository, checkout the branch,
> > delete all other branches and repack. That seems to have worked in my
> > quick test, but is there a better way?
>
> That works.

Btw, when I say "works", I do mean that "yeah, 'cp -a' works, but generally you're better off cloning".

When you use 'cp -a' you have to re-build the index at the very least. It so happens that since you checked out the branch explicitly, that will do it for you anyway, but it's still often a good idea to just *not* use the regular "copy everything by hand" approach.

If you want to be really efficient, there are actually better ways. For example (since you want to avoid having any of the old objects even reachable by mistake), you're probably better off with an explicit pull of the explicit branch, if only because that also involves a re-pack of only the reachable objects, and you know that there won't be any reflogs etc that might still make the object you try to remove be accessible to people who can access the resulting repository directly.

(Yeah, the "cp -a" is faster than the "git pull", but since you want to do the packing that git pull does for you *anyway* to get rid of the old objects, "git pull" actually ends up being better).

Linus
* Re: removing content from git history
From: Nicolas Pitre @ 2007-02-21 18:52 UTC
To: Michael Hendricks; +Cc: Linus Torvalds, git

On Wed, 21 Feb 2007, Michael Hendricks wrote:

> What's a decent way to make a branch into a new repository? My first
> inclination is to "cp -a" the existing repository, checkout the branch,
> delete all other branches and repack. That seems to have worked in my
> quick test, but is there a better way?

Like Shawn said, the better way is simply to fetch that branch into a new
repo.

If you do a cp -a and delete unwanted branches it'll work as well of
course, but repacking won't get rid of all the data from the believed to
be deleted branches since some reflog, the HEAD reflog in particular, will
most probably have references to commits from the removed branches.
Therefore the pack will still contain that data, at least until the reflog
entries expire and get pruned.

Of course if you want to publish just the wanted branch and perform a push
to a public place then only those objects for that branch will be sent
like for the fetch case.


Nicolas
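The reflog point above can be checked on a scratch repository. This is a sketch, not a recommended cleanup procedure: the branch name `unwanted` and the file `secret.txt` are hypothetical, and it assumes a git recent enough to accept `git reflog expire --expire=now --all` and `git gc --prune=now`.

```shell
#!/bin/sh
# Sketch: a deleted branch stays alive through the reflog until the reflog
# entries are expired. Names (repo, unwanted, secret.txt) are hypothetical.
set -eu
work=$(mktemp -d)
cd "$work"
git init -q repo
cd repo
git config user.name "Example"
git config user.email "example@example.invalid"
echo "public" > README
git add README
git commit -q -m "public content"
keep=$(git symbolic-ref --short HEAD)

git checkout -q -b unwanted
echo "secret" > secret.txt
git add secret.txt
git commit -q -m "oops, sensitive data"
blob=$(git rev-parse HEAD:secret.txt)
git checkout -q "$keep"
git branch -q -D unwanted

# Repack and prune while the HEAD reflog still mentions the old commit.
git gc -q --prune=now
if git cat-file -e "$blob" 2>/dev/null; then before=present; else before=absent; fi

# Expire every reflog entry, then repack and prune again.
git reflog expire --expire=now --all
git gc -q --prune=now
if git cat-file -e "$blob" 2>/dev/null; then after=present; else after=absent; fi

echo "before reflog expiry: $before"
echo "after reflog expiry:  $after"
```

Expiring reflogs throws away the safety net they provide, so this is something to do on a copy intended for publication rather than on a working repository.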
* Re: removing content from git history
From: Junio C Hamano @ 2007-02-21 19:01 UTC
To: Linus Torvalds; +Cc: Michael Hendricks, git

Linus Torvalds <torvalds@linux-foundation.org> writes:

> - explicit support for "missing objects". We don't do it right now, but
>   we could add it. It was discussed for things like limited history etc
>   (the "shallow clone" kind of thing, before people actually added
>   shallow clones), and it would support the notion of "we export all our
>   history, but for internal reasons we cannot make certain objects
>   available" kinds of workflows.
> ...
> But at least in theory, it wouldn't be impossible to extend on the
> ".git/grafts" kind of setup to say "this object has been consciously
> deleted", and that could in some circumstances be a better model. The
> biggest headache there would be the need to extend the native git protocol
> with a way to add such objects.

While I agree in principle to the argument that there is no
taking it back what's already published, I've heard people
wanting to just stop distributing further, without worrying
about copies already out there. 'missing objects' support would
help us in such a situation.

Supporting 'missing objects' in general would be painful, when
they contain pointers to other objects (i.e. tags, commits, and
trees).

Thinking aloud...

 * missing blob: we can have 'stub blob' objects. Probably the
   object header for such an object would look like:

	stub <length> NUL
	-----------------
	object <object name of the real blob object>
	type blob

   Hashing a 'stub' object (along with its header as usual, in
   write_sha1_file_prepare()) would instead just report the
   object name recorded there.

   When packing (this applies both to local repacking and
   push/fetch object transfer to other repositories), the stub
   object is included. The delta algorithm would probably not
   delta other objects with it.

 * missing commit and tag: 'stub object' needs to be extended to
   include these object types, and we would also need 'stub
   commit' and 'stub tag' objects, that copy the structural
   fields from the corresponding true object. So a stub commit
   would probably look like:

	stub <length> NUL
	-----------------
	object <object name of the real commit object>
	type commit
	tree <object name of the tree contained in the real commit object>
	parent <object name of the first parent in the real commit object>
	parent <object name of the second parent in the real commit object>

 * missing tree would only be useful to conceal pathnames
   recorded in the real tree object. I am not sure if that is
   needed.

 * fsck and verify-pack need to be taught about 'stub' objects,
   so that they know that their filenames (or the data pointed
   at by pack .idx) do not match the result of hashing them.

If we were to do this, I suspect we can probably do nothing but
'missing blob' first to cover a lot of ground, but we would
eventually need 'missing commit' to replace real commit objects
that have sensitive information in their log messages.

As Nico pointed out, this has serious security implications. We
would need a separate list of objects that are Ok to be stubbed
out, with probably explanation of why they are stubbed out, and
fsck should compare the stub objects found in the repository
against that list.
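To see why the proposal says hashing, fsck and verify-pack would all have to be taught about stubs, it helps to recall how an ordinary blob gets its name: git hashes a `blob <length>` header, a NUL byte, and then the content. The sketch below reproduces that by hand and compares it with `git hash-object`. The second half builds a stub in the layout proposed above, which is purely hypothetical; as far as I know git never implemented it. It assumes `sha1sum` from coreutils and a repository format using SHA-1.

```shell
#!/bin/sh
# Sketch: object names are the hash of "<type> <length>\0<content>".
set -eu
real=$(mktemp)
printf 'sensitive data\n' > "$real"

# Hash a blob by hand and let git do the same for comparison.
len=$(wc -c < "$real" | tr -d ' ')
by_hand=$( { printf 'blob %s\000' "$len"; cat "$real"; } | sha1sum | cut -d' ' -f1 )
by_git=$(git hash-object --stdin < "$real")
echo "by hand: $by_hand"
echo "by git:  $by_git"

# Hypothetical stub in the layout proposed in the message (not a real
# git object type): it records the name of the blob it stands in for.
stub=$(mktemp)
printf 'object %s\ntype blob\n' "$by_git" > "$stub"
stub_len=$(wc -c < "$stub" | tr -d ' ')
stub_hash=$( { printf 'stub %s\000' "$stub_len"; cat "$stub"; } | sha1sum | cut -d' ' -f1 )
echo "stub bytes hash to: $stub_hash"

rm -f "$real" "$stub"
```

The stub's own bytes hash to a different value than the name it is meant to be stored under, which is exactly the mismatch a stub-aware fsck would have to accept, and the reason a list of approved stubs matters.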
* Re: removing content from git history
From: Nicolas Pitre @ 2007-02-21 19:33 UTC
To: Junio C Hamano; +Cc: Linus Torvalds, Michael Hendricks, git

On Wed, 21 Feb 2007, Junio C Hamano wrote:

> While I agree in principle to the argument that there is no
> taking it back what's already published, I've heard people
> wanting to just stop distributing further, without worrying
> about copies already out there. 'missing objects' support would
> help us in such a situation.

I still think this is a "put your head in the sand and pretend that some
sensitive data never existed in the wild" attitude. And I really don't
see the point of supporting that illusion in GIT with technical means.

Either you care about published data or you don't.

If you do then you are screwed anyway irrespective of any missing object
support we might implement. There will always be someone somewhere with
the real thing, and we all know how fast forbidden material travels on
the Internet.

If you don't then it is just better to rewrite history and have a clean
and unambiguous repository. And because you don't care about existing
copies you shouldn't bother with the fact that the rewritten repo is not
compatible with the previously published one.

Sure rewriting history is a potentially expensive operation depending on
the size and nature of the change, but it is done only once. And actually
it can't be _that_ much more expensive than a git-repack -a -f.

I think it is much better to provide a tool to properly rewrite history
than adding support for missing objects and be stuck with them forever.


Nicolas
* Re: removing content from git history
From: Junio C Hamano @ 2007-02-21 20:22 UTC
To: Nicolas Pitre; +Cc: Linus Torvalds, Michael Hendricks, git

Nicolas Pitre <nico@cam.org> writes:

> On Wed, 21 Feb 2007, Junio C Hamano wrote:
>
>> While I agree in principle to the argument that there is no
>> taking it back what's already published, I've heard people
>> wanting to just stop distributing further, without worrying
>> about copies already out there. 'missing objects' support would
>> help us in such a situation.
>
> I still think this is a "put your head in the sand and pretend that some
> sensitive data never existed in the wild" attitude. And I really don't
> see the point of supporting that illusion in GIT with technical means.

Well, I think we are in agreement (and that is why I said "I've
heard people wanting").

But it is entirely possible that somebody has a project that is
internal to a company managed for a long time with git, that he
wants to go open source, with (almost) full history. And the
project may have some proprietary add-on bit which cannot be
published, while building the public bits does not require that
part. Stubbing things out may help that kind of situation.

The development team can keep going forward, internally using
the real objects, while pushing stub objects out to the public
repository, without having to rewrite the history and
re-partition the project.

But after having thought about that, I think it would not buy us
much. You would want to re-partition the project sooner or
later in such a situation *anyway*, so our time is better spent
on giving better support to split existing projects. It may
already be sufficient in the form of admin-rewritehist, in which
case we can worry about other things ;-).
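For readers following along with a current git: the history-rewriting tool mentioned above later became `git filter-branch`, which is what the sketch below uses. Treat it as an illustration under assumptions rather than a recipe from the thread: the file name `secret.txt` is hypothetical, `FILTER_BRANCH_SQUELCH_WARNING` only exists in newer git releases, and newer documentation points to other tools for large rewrites.

```shell
#!/bin/sh
# Sketch: rewrite history so a file never appears in any commit.
# Names (repo, secret.txt) are hypothetical.
set -eu
export FILTER_BRANCH_SQUELCH_WARNING=1   # skip the deprecation pause in newer git
work=$(mktemp -d)
cd "$work"
git init -q repo
cd repo
git config user.name "Example"
git config user.email "example@example.invalid"

echo "public" > README
echo "secret" > secret.txt
git add README secret.txt
git commit -q -m "initial import, secret included by mistake"
echo "more" >> README
git commit -q -a -m "later work"
old_tip=$(git rev-parse HEAD)

# Drop secret.txt from the index of every commit on every branch.
git filter-branch -f \
    --index-filter 'git rm -q --cached --ignore-unmatch secret.txt' \
    -- --all >/dev/null 2>&1
new_tip=$(git rev-parse HEAD)

echo "old tip: $old_tip"
echo "new tip: $new_tip"
echo "commits on HEAD touching secret.txt: $(git log --oneline HEAD -- secret.txt | wc -l | tr -d ' ')"
```

Every commit from the offending one onward gets a new ID, which is the "fork into two histories" effect described earlier in the thread. The old commits also remain in the rewritten repository under `refs/original/` and in the reflogs, so publishing a clean copy still means fetching the rewritten branch into a fresh repository.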
* Re: removing content from git history
From: Nicolas Pitre @ 2007-02-21 20:49 UTC
To: Junio C Hamano; +Cc: Linus Torvalds, Michael Hendricks, git

On Wed, 21 Feb 2007, Junio C Hamano wrote:

> Well, I think we are in agreement (and that is why I said "I've
> heard people wanting").
>
> But it is entirely possible that somebody has a project that is
> internal to a company managed for a long time with git, that he
> wants to go open source, with (almost) full history. And the
> project may have some proprietary add-on bit which cannot be
> published, while building the public bits does not require that
> part. Stubbing things out may help that kind of situation.

It might help, or it might create a management nightmare. It would be
really easy to accidentally push the real objects out since a repo with
them would be indistinguishable from a repo with stubs (that's the point
of stub objects isn't it?), and because of the distributed nature of GIT
the leak could come from anyone with access to the private objects.

In such a scenario I think it is still more sensible to rewrite the repo
history before going open source. You need only to worry about isolating
the proprietary stuff once.


Nicolas