* removing content from git history
@ 2007-02-21 16:45 Michael Hendricks
  2007-02-21 16:56 ` Shawn O. Pearce
  2007-02-21 17:14 ` Linus Torvalds
  0 siblings, 2 replies; 25+ messages in thread
From: Michael Hendricks @ 2007-02-21 16:45 UTC (permalink / raw)
  To: git

I assume that this question has already been addressed on the mailing
list, but I wasn't able to find anything about it in the archives.

Is it possible to remove content entirely from git's history?  I have a
client who does not use git for version control.  A couple months ago
they committed some sensitive client information which should never have
been committed.  Recently, they realized the mistake and now want to
remove all traces of the mistake from history.

I would like to migrate them to git at some point.  However, if they had
been using git for version control already, I'm not sure how I would
have solved this particular problem.  Any suggestions?

-- 
Michael

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: removing content from git history
  2007-02-21 16:45 removing content from git history Michael Hendricks
@ 2007-02-21 16:56 ` Shawn O. Pearce
  2007-02-21 17:17   ` J. Bruce Fields
  2007-02-21 17:14 ` Linus Torvalds
  1 sibling, 1 reply; 25+ messages in thread
From: Shawn O. Pearce @ 2007-02-21 16:56 UTC (permalink / raw)
  To: Michael Hendricks; +Cc: git

Michael Hendricks <michael@ndrix.org> wrote:
> Is it possible to remove content entirely from git's history?

No, not once it has been published around to another repository.
Since every developer has a copy of the repository it's very difficult
to remove something, as it must be removed from every developer's
repository, and each developer must perform an action to agree to
that removal.  So just one hold-out will keep the bad content around.

> I have a
> client who does not use git for version control.  A couple months ago
> they committed some sensitive client information which should never have
> been committed.  Recently, they realized the mistake and now want to
> remove all traces of the mistake from history.
> 
> I would like to migrate them to git at some point.  However, if they had
> been using git for version control already, I'm not sure how I would
> have solved this particular problem.  Any suggestions?

The *only* way to do this in Git is to completely recreate every
commit after that point.  This changes all commit IDs and basically
forks the project into two completely different histories: the
one with the bad thing in it, and the one without the bad thing.
Users who have the bad thing will continue to have the bad thing
until they take explicit action to throw away all of that history
and switch to the other one.

Now this is actually not a huge deal if you do it on your local
repository and go "whoops, I should not have committed that".  If you
have not yet pushed the commit to another repository (and someone
has not yet fetched it from you either) you can use git-rebase to
discard it.  But once it's been pushed/fetched the genie is out of
the bottle, and it's not going back in.
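
As a rough illustration of the unpublished case described above, the following sketch builds a throwaway repository and drops one commit with git-rebase. The file names (README, passwords.txt), the branch name "work" and the demo identity are all made up for the example; it is a sketch of one way to do it, not a recommended procedure for a real repository.

```shell
#!/bin/sh
# Sketch only: the repository, file names and branch name are hypothetical.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/work     # name the (still unborn) branch

echo base > README
git add README && git commit -q -m "base"

echo hunter2 > passwords.txt
git add passwords.txt && git commit -q -m "oops: sensitive data"
bad=$(git rev-parse HEAD)

echo more >> README
git commit -q -a -m "later work"
old_tip=$(git rev-parse HEAD)

# Replay everything after $bad onto the parent of $bad, leaving $bad out.
git rebase -q --onto "$bad^" "$bad" work
new_tip=$(git rev-parse HEAD)

echo "tip before: $old_tip"
echo "tip after:  $new_tip"
```

The two printed IDs differ because the replayed commit has a new parent and a new tree. The discarded commit is no longer part of the branch, but it normally stays in the local object store (reachable through the reflog) until that is expired and pruned.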

-- 
Shawn.


* Re: removing content from git history
  2007-02-21 16:45 removing content from git history Michael Hendricks
  2007-02-21 16:56 ` Shawn O. Pearce
@ 2007-02-21 17:14 ` Linus Torvalds
  2007-02-21 18:02   ` Nicolas Pitre
                     ` (2 more replies)
  1 sibling, 3 replies; 25+ messages in thread
From: Linus Torvalds @ 2007-02-21 17:14 UTC (permalink / raw)
  To: Michael Hendricks; +Cc: git



On Wed, 21 Feb 2007, Michael Hendricks wrote:
>
> I assume that this question has already been addressed on the mailing
> list, but I wasn't able to find anything about it in the archives.
> 
> Is it possible to remove content entirely from git's history?

It's been discussed.

There are two options for doing it:

 - rewriting history. There are a few tools for this already, and for 
   specific needs it would be fairly easy to resurrect git-convert-objects 
   to do it for any kind of object.

   See "cg-admin-rewritehist" from cogito for an example of a tool that 
   would do what you need done. In fact, it has this exact thing as the 
   first example.

   (Btw, I think cg-admin-rewritehist is one of the few things that cogito 
   had that was really a good idea. Not that people probably _used_ it 
   much, but it's something that makes sense in the plumbing)

 - explicit support for "missing objects". We don't do it right now, but 
   we could add it. It was discussed for things like limited history etc 
   (the "shallow clone" kind of thing, before people actually added 
   shallow clones), and it would support the notion of "we export all our 
   history, but for internal reasons we cannot make certain objects 
   available" kinds of workflows.

So right now, rewriting history is an option that you can do. It will 
effectively create a totally new branch (which you can then make into a 
new repository) which has nothing in common with the old branch from the 
point where it was modified. So you can never really merge the two ever 
again, and you need to make sure that everybody who had the old repo 
contents will destroy it.

But at least in theory, it wouldn't be impossible to extend on the 
".git/grafts" kind of setup to say "this object has been consciously 
deleted", and that could in some circumstances be a better model. The 
biggest headache there would be the need to extend the native git protocol 
with a way to add such objects.
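
For readers on a current git, a history rewrite of the kind described above can be sketched with git filter-branch, the tool that later grew out of cg-admin-rewritehist. This is an assumption-laden sketch, not the command from the thread: the file name passwords.txt, the branch name "work" and the demo identity are hypothetical, and it runs against a throwaway repository.

```shell
#!/bin/sh
# Sketch only: file and branch names are hypothetical; try it on a copy first.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
export FILTER_BRANCH_SQUELCH_WARNING=1    # skip the deprecation notice in newer git

repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/work

echo base > README
git add README && git commit -q -m "base"
base=$(git rev-parse HEAD)

echo hunter2 > passwords.txt
git add passwords.txt && git commit -q -m "add a secret by mistake"
echo more >> README
git commit -q -a -m "later work"
old_tip=$(git rev-parse HEAD)

# Rewrite every commit on every ref, dropping passwords.txt from each tree.
git filter-branch -f --index-filter \
    'git rm -q --cached --ignore-unmatch passwords.txt' -- --all >/dev/null 2>&1

new_tip=$(git rev-parse HEAD)
new_base=$(git rev-list --max-parents=0 HEAD)

echo "root commit unchanged: $([ "$base" = "$new_base" ] && echo yes || echo no)"
echo "tip commit changed:    $([ "$old_tip" != "$new_tip" ] && echo yes || echo no)"
```

Commits from before the offending one keep their IDs, while everything from that point on gets a new ID. The old commits are still in the object store afterwards (filter-branch keeps a backup under refs/original/, and the reflog refers to them too), so the rewrite alone does not remove the data.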

			Linus


* Re: removing content from git history
  2007-02-21 16:56 ` Shawn O. Pearce
@ 2007-02-21 17:17   ` J. Bruce Fields
  2007-02-21 18:02     ` Linus Torvalds
  0 siblings, 1 reply; 25+ messages in thread
From: J. Bruce Fields @ 2007-02-21 17:17 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Michael Hendricks, git

On Wed, Feb 21, 2007 at 11:56:36AM -0500, Shawn O. Pearce wrote:
> Now this is actually not a huge deal if you do it on your local
> repository and go "whoops, I should not have committed that".  If you
> have not yet pushed the commit to another repository (and someone
> has not yet fetched it from you either) you can use git-rebase to
> discard it.

Also, there can't have been any (non-fast-forward) merges since then.

Reconstructing history with a bunch of merges seems like something that
could be a huge pain.  (Though with some tools it might be doable.)

--b.


* Re: removing content from git history
  2007-02-21 17:17   ` J. Bruce Fields
@ 2007-02-21 18:02     ` Linus Torvalds
  2007-02-21 18:24       ` Linus Torvalds
  2007-02-21 21:00       ` Shawn O. Pearce
  0 siblings, 2 replies; 25+ messages in thread
From: Linus Torvalds @ 2007-02-21 18:02 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Shawn O. Pearce, Michael Hendricks, git



On Wed, 21 Feb 2007, J. Bruce Fields wrote:
> 
> Reconstructing history with a bunch of merges seems like something that
> could be a huge pain.  (Though with some tools it might be doable.)

It's not actually that painful, but it *is* expensive.

I wrote git-convert-cache (now "git convert-objects") back when we did the 
SHA1/compression switchover changes and the date format translation, so 
we've actually had a tool that can do history rewriting pretty much since 
day 1 (well, "day 14", to be exact, but still.. April 2005).

BUT:

 - I'm not guaranteeing that it works any more. We haven't changed the 
   fundamental object format since, so that particular program has never 
   gotten any testing. It still compiles, but does it work? I dunno.

   I actually tested it on git itself. It converted the top of the git 
   tree successfully, and generated a *new* git history. Why? Because it 
   will actually rewrite the old git tree entries that have permission 
   0664 into 0644: the *data* will be identical (and no git tools except 
   for "git fsck --pedantic" will even notice the difference), but the 
   converted tree avoids one of the legacy decisions that we never fixed 
   in the git repository itself.

   So it works at least to *some* degree, but I would suggest you be very 
   very careful!

 - it can be slow. For something like git, which isn't *that* big, and 
   where we actually don't need to do a lot of rewriting (ie all the blobs 
   stay the same, and only a few trees have to be rewritten, and so it's 
   really just rewriting commits), it's not that bad. It actually 
   converted the whole git history in less than ten seconds for me.

   But if you have a *huge* tree, and you actually convert objects too 
   (say, you started using git on Windows before the "autocrlf" thing, and 
   want to convert the old blobs from CRLF -> LF), it would

    (a) require some extensions to convert-objects.c to do the blob 
        conversion
    (b) be *much* slower
    (c) generate tons of unpacked objects (because git-convert-objects 
        doesn't know to pack in between, and doesn't use anything 
        newfangled like "git-fast-import" to do anything clever)

   For the kernel, it took 2 minutes, but again, it was exactly the same 
   thing: just a few old tree objects that it rewrote, and as a result, 
   every single commit SHA1 changed. Still, it was almost _only_ commits 
   (it generated 49521 new objects, 49332 of which were the new commit 
   history)

   If you want to rewrite a *lot* (ie something that exists in more than 
   just a few trees), and you have lots of history, it can be very 
   expensive indeed.

 - It currently doesn't convert the SHA1 numbers that show up in commit 
   messages. It could, and it should. But it doesn't. So once you convert 
   a git project, it doesn't do the nice "gitk does links from the SHA1 
   text in a commit message to the commit it talks about" any more.

   Somebody should fix that.

Anyway, git-convert-objects does kind of give you a starting point. It 
should be fixed to use "git-fast-import" or repack once in a while (so 
that it doesn't leave tons and tons of unpacked objects), and it should be 
fixed to fix up any commit messages that mention SHA1's that it has 
already converted to something else, but it seems to still work. It would 
not be impossible at all to extend the tree-rewriting logic to remove some 
file or a particular SHA1 object you want to replace.

			Linus


* Re: removing content from git history
  2007-02-21 17:14 ` Linus Torvalds
@ 2007-02-21 18:02   ` Nicolas Pitre
  2007-02-21 18:13     ` Linus Torvalds
  2007-02-21 18:30   ` Michael Hendricks
  2007-02-21 19:01   ` Junio C Hamano
  2 siblings, 1 reply; 25+ messages in thread
From: Nicolas Pitre @ 2007-02-21 18:02 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Michael Hendricks, git

On Wed, 21 Feb 2007, Linus Torvalds wrote:

> But at least in theory, it wouldn't be impossible to extend on the 
> ".git/grafts" kind of setup to say "this object has been consciously 
> deleted", and that could in some circumstances be a better model. The 
> biggest headache there would be the need to extend the native git protocol 
> with a way to add such objects.

I think that would be a big security issue.  Right now the GIT history 
can be validated and more importantly trusted from a single commit 
signature.  If poking holes in that model is allowed by the graft 
mechanism, it must remain a local thing and a very conscious one 
otherwise the GIT trust model would be greatly weakened.

If your goal is to remove content from a repository then the only 
sensible way is to rewrite history before publishing.  It is pointless 
to add mechanisms to remove content after it has been distributed.


Nicolas


* Re: removing content from git history
  2007-02-21 18:02   ` Nicolas Pitre
@ 2007-02-21 18:13     ` Linus Torvalds
  2007-02-21 18:39       ` Nicolas Pitre
  0 siblings, 1 reply; 25+ messages in thread
From: Linus Torvalds @ 2007-02-21 18:13 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Michael Hendricks, git



On Wed, 21 Feb 2007, Nicolas Pitre wrote:
> 
> If your goal is to remove content from a repository then the only 
> sensible way is to rewrite history before publishing.  It is pointless 
> to add mechanisms to remove content after it has been distributed.

I'm not entirely in disagreement, but I can see the model where some 
company wants to make their work available (with the same history as their 
own internal stuff), but doesn't want to make a single file available for 
some reason.

So they'd have an external thing that just has the file excised.

Now, arguably, it's a lot better to use a "supermodule" approach for 
something like this: have two separate git trees, publish the public one, 
and use an internal supermodule that ties the public and internal trees 
together.

So supermodules might be a way to solve it in a better (and safer - the 
"remove objects from the public tree" thing is very error prone, since if 
you *ever* expose the object by mistake, it's now public) way. But I don't 
think the "filter out objects" thing is necessarily fundamentally flawed 
as an approach.

			Linus


* Re: removing content from git history
  2007-02-21 18:02     ` Linus Torvalds
@ 2007-02-21 18:24       ` Linus Torvalds
  2007-02-21 21:00       ` Shawn O. Pearce
  1 sibling, 0 replies; 25+ messages in thread
From: Linus Torvalds @ 2007-02-21 18:24 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Shawn O. Pearce, Michael Hendricks, git



On Wed, 21 Feb 2007, Linus Torvalds wrote:
> 
>    For the kernel, it took 2 minutes, but again, it was exactly the same 
>    thing: just a few old tree objects that it rewrote, and as a result, 
>    every single commit SHA1 changed. Still, it was almost _only_ commits 
>    (it generated 49521 new objects, 49332 of which were the new commit 
>    history)

Side note: I wasn't entirely accurate. The kernel had trees with file mode 
0644 for all the early commits, because my umask is 0022. So everything up 
to commit 4bfa437cf1 is shared after the conversion.

But the next one (commit 5dfa9c1b4f) introduced the file 
include/asm-mips/vr41xx/pci.h with file mode 0664, and I'm not 100% sure 
why that one happened with that file mode, but as a result, every single 
commit ever after will have a different SHA1, because the tree got 
rewritten (and subsequent commits - even if their trees did *not* get 
rewritten - will obviously have different parent SHA1's).

So 56 commits are shared, and "only" 49276 commits were rewritten (and 
apparently 245 trees).
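
The knock-on effect on later commits can be reproduced in isolation. The sketch below uses git commit-tree in a throwaway repository; the identity, dates and messages are arbitrary fixed values chosen for the example, not anything from the kernel history.

```shell
#!/bin/sh
# Sketch only: identities, dates and messages are arbitrary fixed values.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
export GIT_AUTHOR_DATE="2007-02-21 18:24:00 +0000"
export GIT_COMMITTER_DATE="2007-02-21 18:24:00 +0000"

repo=$(mktemp -d) && cd "$repo"
git init -q
tree=$(git write-tree)                    # tree of the empty index

root_a=$(echo "root A" | git commit-tree "$tree")
root_b=$(echo "root B" | git commit-tree "$tree")

# Same tree, same message, same author and dates; only the parent differs.
child_a=$(echo "child" | git commit-tree "$tree" -p "$root_a")
child_b=$(echo "child" | git commit-tree "$tree" -p "$root_b")
child_a_again=$(echo "child" | git commit-tree "$tree" -p "$root_a")

echo "child of A: $child_a"
echo "child of B: $child_b"
```

The two children point at the same tree yet get different IDs, because the parent ID is part of what is hashed. Recreating the child of A with identical inputs gives the same ID again, which is why commits before the first rewritten tree stay shared.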

			Linus


* Re: removing content from git history
  2007-02-21 17:14 ` Linus Torvalds
  2007-02-21 18:02   ` Nicolas Pitre
@ 2007-02-21 18:30   ` Michael Hendricks
  2007-02-21 18:37     ` Shawn O. Pearce
                       ` (2 more replies)
  2007-02-21 19:01   ` Junio C Hamano
  2 siblings, 3 replies; 25+ messages in thread
From: Michael Hendricks @ 2007-02-21 18:30 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

On Wed, Feb 21, 2007 at 09:14:44AM -0800, Linus Torvalds wrote:
> 
>    See "cg-admin-rewritehist" from cogito for an example of a tool that 
>    would do what you need done. In fact, it has this exact thing as the 
>    first example.

That's just what I was looking for.  Thanks.

> So right now, rewriting history is an option that you can do. It will 
> effectively create a totally new branch (which you can then make into a 
> new repository) which has nothing in common with the old branch from the 
> point where it was modified. So you can never really merge the two ever 
> again, and you need to make sure that everybody who had the old repo 
> contents will destroy it.

What's a decent way to make a branch into a new repository?  My first
inclination is to "cp -a" the existing repository, checkout the branch,
delete all other branches and repack.  That seems to have worked in my
quick test, but is there a better way?

-- 
Michael


* Re: removing content from git history
  2007-02-21 18:30   ` Michael Hendricks
@ 2007-02-21 18:37     ` Shawn O. Pearce
  2007-02-21 18:47     ` Linus Torvalds
  2007-02-21 18:52     ` Nicolas Pitre
  2 siblings, 0 replies; 25+ messages in thread
From: Shawn O. Pearce @ 2007-02-21 18:37 UTC (permalink / raw)
  To: Michael Hendricks; +Cc: Linus Torvalds, git

Michael Hendricks <michael@ndrix.org> wrote:
> What's a decent way to make a branch into a new repository?  My first
> inclination is to "cp -a" the existing repository, checkout the branch,
> delete all other branches and repack.  That seems to have worked in my
> quick test, but is there a better way?

Don't "cp -a" the repository, use git-clone.

And actually, if you just want to pull one branch out into its
own repository you can do something like this:

	mkdir ../theonebranch
	cd ../theonebranch
	git init
	git fetch ../oldstuff theonebranch:master

and you have just the content of `theonebranch` from ../oldstuff
stored here, as master.

Optionally if you now want to actually see the files, you would do:

	git checkout
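
One way to convince yourself that a single-branch fetch like this brings over only what that branch needs is to check for an object from another branch afterwards. The sketch below is self-contained and uses made-up names (repositories "old" and "new", branches "clean", "tainted", "published", file passwords.txt). It fetches into a branch that is not checked out, since newer git versions refuse to fetch into the current branch.

```shell
#!/bin/sh
# Sketch only: repository, branch and file names are hypothetical.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

top=$(mktemp -d) && cd "$top"
git init -q old && cd old
git symbolic-ref HEAD refs/heads/clean
echo public > README
git add README && git commit -q -m "public work"

git checkout -q -b tainted
echo hunter2 > passwords.txt
git add passwords.txt && git commit -q -m "secret"
secret_blob=$(git rev-parse HEAD:passwords.txt)

cd "$top"
git init -q new && cd new
git symbolic-ref HEAD refs/heads/scratch  # unborn placeholder branch
git fetch -q ../old clean:refs/heads/published
git checkout -q published

if git cat-file -e "$secret_blob" 2>/dev/null; then
    echo "secret blob present in new repository"
else
    echo "secret blob absent from new repository"
fi
```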

-- 
Shawn.


* Re: removing content from git history
  2007-02-21 18:13     ` Linus Torvalds
@ 2007-02-21 18:39       ` Nicolas Pitre
  0 siblings, 0 replies; 25+ messages in thread
From: Nicolas Pitre @ 2007-02-21 18:39 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Michael Hendricks, git

On Wed, 21 Feb 2007, Linus Torvalds wrote:

> 
> 
> On Wed, 21 Feb 2007, Nicolas Pitre wrote:
> So supermodules might be a way to solve it in a better (and safer - the 
> "remove objects from the public tree" thing is very error prone, since if 
> you *ever* expose the object by mistake, it's now public) way. But I don't 
> think the "filter out objects" thing is necessarily fundamentally flawed 
> as an approach.

Well, if you really wanted to do such a thing, you could use a new 
object type that serves only as a stub pretending to be another object 
whose SHA1 would have been xyz.  When referenced, this object would 
generate a warning telling the user that the given object has been 
excised, but otherwise the whole reachability validation would still 
work as usual.

And since this object would be distributed through standard mechanisms 
then there would be no need for protocol extensions.

I don't know if this could help in creating SHA1 collisions, though.  
We've dismissed them as highly improbable because a collision that hides 
compromised material would most probably require a binary blob somewhere 
to balance the hash, and would hardly be compilable or go undetected.  
But object stubs, with their ability to pretend to have any possible 
SHA1, are in fact a nice way to hide 20-byte binary blobs in the hash 
chain, possibly making it "easier" to create "useful" collisions.  This 
is where I see a weakening of the trust model.


Nicolas


* Re: removing content from git history
  2007-02-21 18:30   ` Michael Hendricks
  2007-02-21 18:37     ` Shawn O. Pearce
@ 2007-02-21 18:47     ` Linus Torvalds
  2007-02-21 18:56       ` Linus Torvalds
  2007-02-21 18:52     ` Nicolas Pitre
  2 siblings, 1 reply; 25+ messages in thread
From: Linus Torvalds @ 2007-02-21 18:47 UTC (permalink / raw)
  To: Michael Hendricks; +Cc: git



On Wed, 21 Feb 2007, Michael Hendricks wrote:
> 
> What's a decent way to make a branch into a new repository?  My first
> inclination is to "cp -a" the existing repository, checkout the branch,
> delete all other branches and repack.  That seems to have worked in my
> quick test, but is there a better way?

That works.

As does just "clone repo, delete all unwanted branches, and prune" (of 
course, if you don't want the old repo, you can skip the "clone" part, and 
just do the "delete all unwanted branches and prune" thing).

In some ways, a more straightforward approach may be to just create a new 
repo, and populate it with just one branch (I say "more straightforward", 
not "easier", because I just think it's conceptually simpler):

	mkdir new-repo
	cd new-repo
	git init
	git pull old-repo <branch>

(add "--bare" and "--shared" to taste - with bare repos you can also do it 
the other way by doing a push into it from outside after you've created 
it, which can be the "logical" way to do it if you want to just publish 
the end result on some shared site)

		Linus


* Re: removing content from git history
  2007-02-21 18:30   ` Michael Hendricks
  2007-02-21 18:37     ` Shawn O. Pearce
  2007-02-21 18:47     ` Linus Torvalds
@ 2007-02-21 18:52     ` Nicolas Pitre
  2 siblings, 0 replies; 25+ messages in thread
From: Nicolas Pitre @ 2007-02-21 18:52 UTC (permalink / raw)
  To: Michael Hendricks; +Cc: Linus Torvalds, git

On Wed, 21 Feb 2007, Michael Hendricks wrote:

> What's a decent way to make a branch into a new repository?  My first
> inclination is to "cp -a" the existing repository, checkout the branch,
> delete all other branches and repack.  That seems to have worked in my
> quick test, but is there a better way?

Like Shawn said the better way is simply to fetch that branch into a new 
repo.

If you do a cp -a and delete unwanted branches it'll work as well of 
course, but repacking won't get rid of all the data from the supposedly 
deleted branches, since some reflog, the HEAD reflog in particular, 
will most probably have references to commits from the removed branches. 
Therefore the pack will still contain that data, at least until the 
reflog entries expire and get pruned.
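
The reflog effect can be observed directly. The sketch below, using a throwaway repository with made-up branch and file names ("keep", "tainted", passwords.txt), deletes a branch, repacks, and then checks whether a blob from the deleted branch is still in the object store before and after the reflogs are expired. The commands used (git reflog expire, git gc --prune=now) are from current git and are offered as one possible cleanup, not as the only one.

```shell
#!/bin/sh
# Sketch only: branch and file names are hypothetical.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/keep
echo public > README
git add README && git commit -q -m "public work"

git checkout -q -b tainted
echo hunter2 > passwords.txt
git add passwords.txt && git commit -q -m "secret"
secret=$(git rev-parse HEAD:passwords.txt)

git checkout -q keep
git branch -q -D tainted

# Report whether the secret blob is still in the object store.
has_secret() {
    git cat-file -e "$secret" 2>/dev/null && echo present || echo absent
}

git gc -q --prune=now
after_gc=$(has_secret)          # the HEAD reflog still refers to the commit

git reflog expire --expire=now --all
git gc -q --prune=now
after_expire=$(has_secret)

echo "after gc alone:         $after_gc"
echo "after expiring reflogs: $after_expire"
```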

Of course if you want to publish just the wanted branch and perform a 
push to a public place then only those objects for that branch will be 
sent like for the fetch case.


Nicolas


* Re: removing content from git history
  2007-02-21 18:47     ` Linus Torvalds
@ 2007-02-21 18:56       ` Linus Torvalds
  0 siblings, 0 replies; 25+ messages in thread
From: Linus Torvalds @ 2007-02-21 18:56 UTC (permalink / raw)
  To: Michael Hendricks; +Cc: git



On Wed, 21 Feb 2007, Linus Torvalds wrote:

> 
> 
> On Wed, 21 Feb 2007, Michael Hendricks wrote:
> > 
> > What's a decent way to make a branch into a new repository?  My first
> > inclination is to "cp -a" the existing repository, checkout the branch,
> > delete all other branches and repack.  That seems to have worked in my
> > quick test, but is there a better way?
> 
> That works.

Btw, when I say "works", I do mean that "yeah, 'cp -a' works, but 
generally you're better off cloning".

When you use 'cp -a' you have to re-build the index at the very least. It 
so happens that since you checked out the branch explicitly, that will do 
it for you anyway, but it's still often a good idea to just *not* use the 
regular "copy everything by hand" approach.

If you want to be really efficient, there are actually better ways. For 
example (since you want to avoid having any of the old objects even 
reachable by mistake), you're probably better off with an explicit pull of 
the explicit branch, if only because that also involves a re-pack of only 
the reachable objects, and you know that there won't be any reflogs etc 
that might still make the object you try to remove be accessible to people 
who can access the resulting repository directly.

(Yeah, the "cp -a" is faster than the "git pull", but since you want to do 
the packing that git pull does for you *anyway* to get rid of the old 
objects, "git pull" actually ends up being better).

			Linus


* Re: removing content from git history
  2007-02-21 17:14 ` Linus Torvalds
  2007-02-21 18:02   ` Nicolas Pitre
  2007-02-21 18:30   ` Michael Hendricks
@ 2007-02-21 19:01   ` Junio C Hamano
  2007-02-21 19:33     ` Nicolas Pitre
  2 siblings, 1 reply; 25+ messages in thread
From: Junio C Hamano @ 2007-02-21 19:01 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Michael Hendricks, git

Linus Torvalds <torvalds@linux-foundation.org> writes:

>  - explicit support for "missing objects". We don't do it right now, but 
>    we could add it. It was discussed for things like limited history etc 
>    (the "shallow clone" kind of thing, before people actually added 
>    shallow clones), and it would support the notion of "we export all our 
>    history, but for internal reasons we cannot make certain objects 
>    available" kinds of workflows.
> ...
> But at least in theory, it wouldn't be impossible to extend on the 
> ".git/grafts" kind of setup to say "this object has been consciously 
> deleted", and that could in some circumstances be a better model. The 
> biggest headache there would be the need to extend the native git protocol 
> with a way to add such objects.

While I agree in principle with the argument that there is no
taking back what's already published, I've heard people
wanting to just stop distributing further, without worrying
about copies already out there.  'missing objects' support would
help us in such a situation.

Supporting 'missing objects' in general would be painful, when
they contain pointers to other objects (i.e. tags, commits, and
trees).

Thinking aloud...

 * missing blob: we can have 'stub blob' objects.  Probably the
   object header for such an object would look like:

	stub <length> NUL
	-----------------
        object <object name of the real blob object>
        type blob

   Hashing a 'stub' object (along with its header as usual, in
   write_sha1_file_prepare()) would instead just report the
   object name recorded there.

   When packing (this applies both to local repacking and
   push/fetch object transfer to other repositories), the stub
   object is included.  The delta algorithm would probably want
   to avoid deltifying other objects against it.

 * missing commit and tag: 'stub object' needs to be extended to
   include these object types, and we would also need 'stub
   commit' and 'stub tag' objects, that copy the structural
   fields from the corresponding true object.  So a stub commit
   would probably look like:

	stub <length> NUL
	-----------------
        object <object name of the real commit object>
        type commit
        tree <object name of the tree contained in the real commit object>
        parent <object name of the first parent in the real commit object>
        parent <object name of the second parent in the real commit object>

 * missing tree would only be useful to conceal pathnames
   recorded in the real tree object.  I am not sure if that is
   needed.

 * fsck and verify-pack need to be taught about 'stub' objects,
   so that they know that their filenames (or the data pointed
   at by pack .idx) do not match the result of hashing them.

If we were to do this, I suspect we can probably do nothing but
'missing blob' first to cover a lot of ground, but we would
eventually need 'missing commit' to replace real commit objects
that have sensitive information in their log messages.

As Nico pointed out, this has serious security implications.  We
would need a separate list of objects that are Ok to be stubbed
out, with probably explanation of why they are stubbed out, and
fsck should compare the stub objects found in the repository
against that list.
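
For context on why fsck and verify-pack would need special handling: an ordinary object's name is simply the hash of its header plus its content, and a stub would be the first object whose name does not follow from its own bytes. The sketch below recomputes a blob name by hand and compares it with git hash-object. It assumes the default SHA-1 object format and the availability of a sha1sum binary; the sample content is arbitrary.

```shell
#!/bin/sh
# Sketch only: assumes the SHA-1 object format and a sha1sum binary.
set -e
content='hello, world'
len=$(printf '%s' "$content" | wc -c | tr -d ' ')

# Object name = SHA-1 over the header "blob <length>\0" followed by the content.
by_hand=$({ printf 'blob %s\0' "$len"; printf '%s' "$content"; } | sha1sum | cut -d' ' -f1)
by_git=$(printf '%s' "$content" | git hash-object --stdin)

echo "by hand: $by_hand"
echo "by git:  $by_git"
```

Both lines print the same name. A stub object as proposed above would have to report the name recorded inside it instead, so any tool that re-hashes objects to validate them would need to know about the exception.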


* Re: removing content from git history
  2007-02-21 19:01   ` Junio C Hamano
@ 2007-02-21 19:33     ` Nicolas Pitre
  2007-02-21 20:22       ` Junio C Hamano
  0 siblings, 1 reply; 25+ messages in thread
From: Nicolas Pitre @ 2007-02-21 19:33 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Linus Torvalds, Michael Hendricks, git

On Wed, 21 Feb 2007, Junio C Hamano wrote:

> While I agree in principle to the argument that there is no
> taking it back what's already published, I've heard people
> wanting to just stop distributing further, without worrying
> about copies already out there.  'missing objects' support would
> help us in such a situation.

I still think this is a "put your head in the sand and pretend that some 
sensitive data never existed in the wild" attitude.  And I really don't 
see the point of supporting that illusion in GIT with technical means.

Either you care about published data or you don't.

If you do then you are screwed anyway irrespective of any missing object 
support we might implement.  There will always be someone somewhere with 
the real thing, and we all know how fast forbidden material travels 
on the Internet.

If you don't then it is just better to rewrite history and have a clean 
and unambiguous repository.  And because you don't care about existing 
copies you shouldn't bother with the fact that the rewritten repo is not 
compatible with the previously published one.

Sure rewriting history is a potentially expensive operation depending on 
the size and nature of the change, but it is done only once.  And 
actually it can't be _that_ much more expensive than a git-repack -a -f.

I think it is much better to provide a tool to properly rewrite history 
than adding support for missing objects and be stuck with them forever.


Nicolas


* Re: removing content from git history
  2007-02-21 19:33     ` Nicolas Pitre
@ 2007-02-21 20:22       ` Junio C Hamano
  2007-02-21 20:49         ` Nicolas Pitre
  0 siblings, 1 reply; 25+ messages in thread
From: Junio C Hamano @ 2007-02-21 20:22 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Linus Torvalds, Michael Hendricks, git

Nicolas Pitre <nico@cam.org> writes:

> On Wed, 21 Feb 2007, Junio C Hamano wrote:
>
>> While I agree in principle to the argument that there is no
>> taking it back what's already published, I've heard people
>> wanting to just stop distributing further, without worrying
>> about copies already out there.  'missing objects' support would
>> help us in such a situation.
>
> I still think this is a "put your head in the sand and pretend that some 
> sensitive data never existed in the wild" attitude.  And I really don't 
> see the point of supporting that illusion in GIT with technical means.

Well, I think we are in agreement (and that is why I said "I've
heard people wanting").

But it is entirely possible that somebody has a project that is
internal to a company managed for a long time with git, that he
wants to go open source, with (almost) full history.  And the
project may have some proprietary add-on bit which cannot be
published, while building the public bits does not require that
part.  Stubbing things out may help that kind of situation.  The
development team can keep going forward, internally using the
real objects, while pushing stub objects out to the public
repository, without having to rewrite the history and re-partition
the project.

But after having thought about that, I think it would not buy us
much.  You would want to re-partition the project sooner or
later in such a situation *anyway*, so our time is better spent
on giving better support to split existing projects.  It may
already be sufficient in the form of admin-rewritehist, in which
case we can worry about other things ;-).


* Re: removing content from git history
  2007-02-21 20:22       ` Junio C Hamano
@ 2007-02-21 20:49         ` Nicolas Pitre
  0 siblings, 0 replies; 25+ messages in thread
From: Nicolas Pitre @ 2007-02-21 20:49 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Linus Torvalds, Michael Hendricks, git

On Wed, 21 Feb 2007, Junio C Hamano wrote:

> Well, I think we are in agreement (and that is why I said "I've
> heard people wanting").
> 
> But it is entirely possible that somebody has a project that is
> internal to a company managed for a long time with git, that he
> wants to go open source, with (almost) full history.  And the
> project may have some proprietary add-on bit which cannot be
> published, while building the public bits does not require that
> part.  Stubbing things out may help that kind of situation.

It might help, or it might create a management nightmare.  It would be 
really easy to accidentally push the real objects out since a repo with 
them would be indistinguishable from a repo with stubs (that's the 
point of stub objects isn't it?), and because of the distributed nature 
of GIT the leak could come from anyone with access to the private 
objects.
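
To make the naming point concrete, here is a minimal sketch, assuming only the standard git object naming rule (SHA-1 over a "<type> <size>\0" header followed by the content). The helper name git_object_name and both sample contents are made up for illustration. A stub has to be stored under the real object's name to keep the history intact, which is exactly why the two kinds of repository look alike from the outside.

```python
import hashlib


def git_object_name(kind, data):
    """Return the hex object name git would assign to `data` of type `kind`."""
    # git hashes a small header ("blob 12\0", "tree 340\0", ...) plus the content.
    header = b"%s %d\x00" % (kind, len(data))
    return hashlib.sha1(header + data).hexdigest()


# Hypothetical contents: the real proprietary file and a placeholder for it.
real = b"proprietary add-on source\n"
stub = b"(content withheld)\n"

real_name = git_object_name(b"blob", real)
stub_name = git_object_name(b"blob", stub)

# Trees and commits refer to real_name, so a stub would have to be filed
# under a name that its own content does not hash to.
print(real_name)
print(stub_name)
```

Nothing in the tree or commit objects changes between the two cases; only the bytes stored behind the name differ.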

In such a scenario I think it is still more sensible to rewrite the repo 
history before going open source.  You need only to worry about 
isolating the proprietary stuff once.


Nicolas


* Re: removing content from git history
  2007-02-21 18:02     ` Linus Torvalds
  2007-02-21 18:24       ` Linus Torvalds
@ 2007-02-21 21:00       ` Shawn O. Pearce
  2007-02-21 21:11         ` Linus Torvalds
  1 sibling, 1 reply; 25+ messages in thread
From: Shawn O. Pearce @ 2007-02-21 21:00 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: J. Bruce Fields, Michael Hendricks, git

Linus Torvalds <torvalds@linux-foundation.org> wrote:
> Anyway, git-convert-objects does kind of give you a starting point. It 
> should be fixed to use "git-fast-import" or repack once in a while (so 
> that it doesn't leave tons and tons of unpacked objects), and it should be 
> fixed to fix up any commit messages that mention SHA1's that it has 
> already converted to something else, but it seems to still work. It would 
> not be impossible at all to extend the tree-rewriting logic to remove some 
> file or a particular SHA1 object you want to replace.

One idea Junio and I kicked around on #git a short while ago
was to arrange for a pipe between the current Git process
and git-fast-import, where the pipe was used from within
write_sha1_file() rather than creating the loose object.

This way an existing process like git-apply or git-convert-objects
could easily spew hundreds of thousands of objects without needing
to worry about repacking in the middle; nor would we need to worry
about the complexity of trying to disentangle the multi-object packing
parts of fast-import into some sort of library.

Obviously this is only a good idea if we are going to be making
enough objects to warrant using a packfile; small bursts of 10-20
objects from a git-apply don't really justify a packfile.
But applying 100s of patches in a row might, if we could keep them
all fed through the same git-fast-import backend (and thus into
the same packfile).

-- 
Shawn.


* Re: removing content from git history
  2007-02-21 21:00       ` Shawn O. Pearce
@ 2007-02-21 21:11         ` Linus Torvalds
  2007-02-21 21:21           ` Shawn O. Pearce
  0 siblings, 1 reply; 25+ messages in thread
From: Linus Torvalds @ 2007-02-21 21:11 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: J. Bruce Fields, Michael Hendricks, git



On Wed, 21 Feb 2007, Shawn O. Pearce wrote:
> 
> One idea Junio and I kicked around on #git a short while ago
> was to arrange for a pipe between the current Git process
> and git-fast-import, where the pipe was used from within
> write_sha1_file() rather than creating the loose object.

The problem there is that most conversion scripts that use 
"write_sha1_file()" will want to *read* that file later. If 
git-fast-import hasn't generated the pack yet (because it's still waiting 
for more data), that will not work at all.

So then you basically force the conversion script to keep remembering all 
the old object data (using something like pretend_sha1_file), or you limit 
it to things that just always re-write the whole object and never need any 
old object references that they might have written.

A lot of conversions tend to be incremental, ie they will depend on the 
data they converted previously.

			Linus


* Re: removing content from git history
  2007-02-21 21:11         ` Linus Torvalds
@ 2007-02-21 21:21           ` Shawn O. Pearce
  2007-10-09 20:58             ` Bill Lear
  0 siblings, 1 reply; 25+ messages in thread
From: Shawn O. Pearce @ 2007-02-21 21:21 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: J. Bruce Fields, Michael Hendricks, git

Linus Torvalds <torvalds@linux-foundation.org> wrote:
> The problem there is that most conversion scripts that use 
> "write_sha1_file()" will want to *read* that file later. If 
> git-fast-import hasn't generated the pack yet (because it's still waiting 
> for more data), that will not work at all.

Yes, indeed...
 
> So then you basically force the conversion script to keep remembering all 
> the old object data (using something like pretend_sha1_file), or you limit 
> it to things that just always re-write the whole object and never need any 
> old object references that they might have written.
> 
> A lot of conversions tend to be incremental, ie they will depend on the 
> data they converted previously.

Which is why I was actually thinking of flipping this on its head.
Libify git-apply and embed that into fast-import, then one of the
native input formats might just be an mbox, or something close enough
that a simple C/perl/sed prefilter could make an mbox into the input.

fast-import can (and does if necessary) go back to access the
packfile it is writing.  It has the index data held in memory and
uses only OBJ_OFS_DELTA so that sha1_file.c can unpack deltas just
fine, even though we lack an index file and have not completely
checksummed the pack itself.

So although no other Git process can use the packfile, it is usable
from within fast-import...
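
A toy sketch of that arrangement, not fast-import's actual code: objects are appended to a single stream and an in-memory table of offsets lets the writer read its own earlier output before any index file exists. The class name StreamingObjectStore and its methods are invented for illustration, and each object is simply zlib-compressed rather than stored as an offset delta.

```python
import hashlib
import io
import zlib


class StreamingObjectStore:
    """Append-only object store that can read back what it has written."""

    def __init__(self, stream):
        self.stream = stream  # seekable binary stream standing in for the pack
        self.index = {}       # object name -> (offset, compressed length)

    def write(self, kind, data):
        # Name the object the way git does: SHA-1 of header plus content.
        header = b"%s %d\x00" % (kind, len(data))
        name = hashlib.sha1(header + data).hexdigest()
        if name not in self.index:
            packed = zlib.compress(data)
            self.stream.seek(0, io.SEEK_END)
            offset = self.stream.tell()
            self.stream.write(packed)
            self.index[name] = (offset, len(packed))
        return name

    def read(self, name):
        # Only this process can do this: the offsets live in memory alone.
        offset, length = self.index[name]
        self.stream.seek(offset)
        return zlib.decompress(self.stream.read(length))


store = StreamingObjectStore(io.BytesIO())
first = store.write(b"blob", b"version 1\n")
# An incremental conversion step that depends on data written a moment ago.
second = store.write(b"blob", store.read(first) + b"version 2\n")
print(store.read(second).decode(), end="")
```

The second write is the incremental case Linus describes: it needs the first object back, and the in-memory table is what makes that possible before the pack is finished.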

-- 
Shawn.


* Re: removing content from git history
  2007-02-21 21:21           ` Shawn O. Pearce
@ 2007-10-09 20:58             ` Bill Lear
  2007-10-09 21:02               ` J. Bruce Fields
  2007-10-10 14:41               ` Johannes Schindelin
  0 siblings, 2 replies; 25+ messages in thread
From: Bill Lear @ 2007-10-09 20:58 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Linus Torvalds, J. Bruce Fields, Michael Hendricks, git

I'm resurrecting this old thread, as we have come across a similar need and
I could not tell if this has been settled.  More below...

On Wednesday, February 21, 2007 at 16:21:30 (-0500) Shawn O. Pearce writes:
>Linus Torvalds <torvalds@linux-foundation.org> wrote:
>> The problem there is that most conversion scripts that use 
>> "write_sha1_file()" will want to *read* that file later. If 
>> git-fast-import hasn't generated the pack yet (because it's still waiting 
>> for more data), that will not work at all.
>
>Yes, indeed...
> 
>> So then you basically force the conversion script to keep remembering all 
>> the old object data (using something like pretend_sha1_file), or you limit 
>> it to things that just always re-write the whole object and never need any 
>> old object references that they might have written.
>> 
>> A lot of conversions tend to be incremental, ie they will depend on the 
>> data they converted previously.
>
>Which is why I was actually thinking of flipping this on its head.
>Libify git-apply and embed that into fast-import, then one of the
>native input formats might just be an mbox, or something close enough
>that a simple C/perl/sed prefilter could make an mbox into the input.
>
>fast-import can (and does if necessary) go back to access the
>packfile it is writing.  It has the index data held in memory and
>uses only OBJ_OFS_DELTA so that sha1_file.c can unpack deltas just
>fine, even though we lack an index file and have not completely
>checksummed the pack itself.
>
>So although no other Git process can use the packfile, it is usable
>from within fast-import...

As I understand this thread, it does not appear that a resolution
was reached.  Our company has content in our central git repository
that we need to remove per a contractual obligation.  I believe the
content in question is limited to one sub-directory, that has existed
since (or near to) the beginning of the repo, if that matters.  We
obviously would just like to issue a "git nuke" operation and be done
with it, if that is available.  Barring that, we could probably follow
reasonably simple steps to purge the content and rebuild the repo.

So, what options do we have at present?


Bill


* Re: removing content from git history
  2007-10-09 20:58             ` Bill Lear
@ 2007-10-09 21:02               ` J. Bruce Fields
  2007-10-09 22:25                 ` Bill Lear
  2007-10-10 14:41               ` Johannes Schindelin
  1 sibling, 1 reply; 25+ messages in thread
From: J. Bruce Fields @ 2007-10-09 21:02 UTC (permalink / raw)
  To: Bill Lear; +Cc: Shawn O. Pearce, Linus Torvalds, Michael Hendricks, git

On Tue, Oct 09, 2007 at 03:58:57PM -0500, Bill Lear wrote:
> As I understand this thread, it does not appear that a resolution
> was reached.  Our company has content in our central git repository
> that we need to remove per a contractual obligation.  I believe the
> content in question is limited to one sub-directory, that has existed
> since (or near to) the beginning of the repo, if that matters.  We
> obviously would just like to issue a "git nuke" operation and be done
> with it, if that is available.  Barring that, we could probably follow
> reasonably simple steps to purge the content and rebuild the repo.
> 
> So, what options do we have at present?

Have you looked at git-filter-branch in a recent version of git?  The
man page has some good examples.

--b.


* Re: removing content from git history
  2007-10-09 21:02               ` J. Bruce Fields
@ 2007-10-09 22:25                 ` Bill Lear
  0 siblings, 0 replies; 25+ messages in thread
From: Bill Lear @ 2007-10-09 22:25 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Shawn O. Pearce, Linus Torvalds, Michael Hendricks, git

On Tuesday, October 9, 2007 at 17:02:35 (-0400) J. Bruce Fields writes:
>On Tue, Oct 09, 2007 at 03:58:57PM -0500, Bill Lear wrote:
>> As I understand this thread, it does not appear that a resolution
>> was reached.  Our company has content in our central git repository
>> that we need to remove per a contractual obligation.  I believe the
>> content in question is limited to one sub-directory, that has existed
>> since (or near to) the beginning of the repo, if that matters.  We
>> obviously would just like to issue a "git nuke" operation and be done
>> with it, if that is available.  Barring that, we could probably follow
>> reasonably simple steps to purge the content and rebuild the repo.
>> 
>> So, what options do we have at present?
>
>Have you looked at git-filter-branch in a recent version of git?  The
>man page has some good examples.

Ah, no, though I will do so.  It is apparently not in the version
I have (1.5.2.4), but it is in 1.5.3.1.  We'll give this a shot
and complain if we can't handle it.

Thank you.


Bill


* Re: removing content from git history
  2007-10-09 20:58             ` Bill Lear
  2007-10-09 21:02               ` J. Bruce Fields
@ 2007-10-10 14:41               ` Johannes Schindelin
  1 sibling, 0 replies; 25+ messages in thread
From: Johannes Schindelin @ 2007-10-10 14:41 UTC (permalink / raw)
  To: Bill Lear
  Cc: Shawn O. Pearce, Linus Torvalds, J. Bruce Fields, Michael Hendricks, git

Hi,

On Tue, 9 Oct 2007, Bill Lear wrote:

> Our company has content in our central git repository that we need to 
> remove per a contractual obligation.  I believe the content in question 
> is limited to one sub-directory, that has existed since (or near to) the 
> beginning of the repo, if that matters.  We obviously would just like to 
> issue a "git nuke" operation and be done with it, if that is available.  
> Barring that, we could probably follow reasonably simple steps to purge 
> the content and rebuild the repo.
> 
> So, what options do we have at present?

git filter-branch.  I suggest using the index filter.  There is even a 
nice example in the man page of git filter-branch.
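
For the sub-directory case, a minimal sketch of driving the index filter from a script. The options used (--index-filter, git rm -r --cached --ignore-unmatch, and "-- --all" to cover every ref) are the documented ones; the helper names and the directory "vendor/secret" are hypothetical. Try it on a fresh clone first: the old objects stay reachable through the refs/original backups and the reflogs until those are removed and the repository is repacked.

```python
import shlex
import subprocess


def filter_branch_argv(subdir):
    """Build the argv that drops `subdir` from every commit on every ref."""
    # The index filter runs once per commit and only touches the index,
    # so no working tree checkout is needed for each rewritten commit.
    index_filter = "git rm -r --cached --ignore-unmatch " + shlex.quote(subdir)
    return ["git", "filter-branch", "--index-filter", index_filter, "--", "--all"]


def purge_subdir(repo_path, subdir):
    """Run the rewrite inside repo_path; raises CalledProcessError on failure."""
    subprocess.run(filter_branch_argv(subdir), cwd=repo_path, check=True)


# Show the command without running it; "vendor/secret" is a made-up path.
print(" ".join(shlex.quote(arg) for arg in filter_branch_argv("vendor/secret")))
```

Every commit that ever contained the directory gets a new name, so everyone with a clone has to re-clone or rebase onto the rewritten history.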

Which reminds me that I have some TODOs left in filter-branch...

Ciao,
Dscho


end of thread, other threads:[~2007-10-10 14:41 UTC | newest]

Thread overview: 25+ messages
2007-02-21 16:45 removing content from git history Michael Hendricks
2007-02-21 16:56 ` Shawn O. Pearce
2007-02-21 17:17   ` J. Bruce Fields
2007-02-21 18:02     ` Linus Torvalds
2007-02-21 18:24       ` Linus Torvalds
2007-02-21 21:00       ` Shawn O. Pearce
2007-02-21 21:11         ` Linus Torvalds
2007-02-21 21:21           ` Shawn O. Pearce
2007-10-09 20:58             ` Bill Lear
2007-10-09 21:02               ` J. Bruce Fields
2007-10-09 22:25                 ` Bill Lear
2007-10-10 14:41               ` Johannes Schindelin
2007-02-21 17:14 ` Linus Torvalds
2007-02-21 18:02   ` Nicolas Pitre
2007-02-21 18:13     ` Linus Torvalds
2007-02-21 18:39       ` Nicolas Pitre
2007-02-21 18:30   ` Michael Hendricks
2007-02-21 18:37     ` Shawn O. Pearce
2007-02-21 18:47     ` Linus Torvalds
2007-02-21 18:56       ` Linus Torvalds
2007-02-21 18:52     ` Nicolas Pitre
2007-02-21 19:01   ` Junio C Hamano
2007-02-21 19:33     ` Nicolas Pitre
2007-02-21 20:22       ` Junio C Hamano
2007-02-21 20:49         ` Nicolas Pitre
