WARNING! Object DB conversion (was Re: [PATCH] write-tree performance problems)

From: Linus Torvalds <torvalds@osdl.org>
To: "H. Peter Anvin" <hpa@zytor.com>, Git Mailing List <git@vger.kernel.org>
Cc: Chris Mason <mason@suse.com>
Subject: WARNING! Object DB conversion (was Re: [PATCH] write-tree performance problems)
Date: Wed, 20 Apr 2005 02:08:26 -0700 (PDT)
Message-ID: <Pine.LNX.4.58.0504200144260.6467@ppc970.osdl.org>
In-Reply-To: <42660708.60109@zytor.com>

I converted my git archives (kernel and git itself) to do the SHA1 hash 
_before_ the compression phase.

So I'll just have to publicly admit that everybody who complained about 
that particular design decision was right. Oh, well.

On Wed, 20 Apr 2005, H. Peter Anvin wrote:
> Linus Torvalds wrote:
> > 
> > So I'll see if I can turn the current fsck into a "convert into
> > uncompressed format", and do a nice clean format conversion. 
> > 
> 
> Just let me know what you want to do, and I can trivially change the 
> conversion scripts I've already written to do what you want.

I actually wrote a trivial converter myself, and while I have to say that 
this object database conversion is a bit painful, the nice thing is that I 
tried very hard to make it so that the "git" programs will work with both 
a pre-conversion and a post-conversion database.

The only program where that isn't true is "fsck-cache", since fsck-cache
for obvious reasons is very very unhappy if the sha1 of a file doesn't
match what it should be. But even there, a post-conversion fsck will eat
old objects, it will just warn about a sha1 mismatch (and eventually it
will refuse to touch them).
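
To make the fsck behaviour above concrete, here is a minimal sketch of how a
checker could tell the two naming schemes apart. It is not the actual
fsck-cache code: the function name `classify`, its return strings, and the
idea of explicitly recognising the old scheme are assumptions for
illustration only.

```python
import hashlib
import zlib


def classify(name: str, stored: bytes) -> str:
    """Classify one stored object file by how its name relates to its bytes.

    Hypothetical helper, not taken from fsck-cache.
    """
    raw = zlib.decompress(stored)
    if hashlib.sha1(raw).hexdigest() == name:
        return "ok"                  # new scheme: name = SHA1 of uncompressed data
    if hashlib.sha1(stored).hexdigest() == name:
        return "old-style mismatch"  # old scheme: name = SHA1 of compressed file
    return "corrupt"                 # matches neither scheme


# Assumed object layout: "<type> <length>", a NUL byte, then the content.
raw = b"blob 6\0hello\n"
packed = zlib.compress(raw)
new_name = hashlib.sha1(raw).hexdigest()
old_name = hashlib.sha1(packed).hexdigest()

print(classify(new_name, packed))
print(classify(old_name, packed))
```

The same compressed bytes are acceptable under either name; only the
relationship between the name and the bytes differs, which is why an old
object can be read but still draws a warning.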

Anyway, what this means is that you should actually be able to get my
already-converted git database even using an older version of git: the
old fsck will complain mightily, so don't run it.

What I've done is to just switch the SHA1 calculation and the compression
around, but I've left all other data structures in their original format,
including the low-level object details like the fact that all objects are
tagged with their type and length.

As a result, the _only_ thing that breaks is that a new object will not
have a SHA1 that matches the expectations of an old git, but since
_checking_ the SHA1 is only done by fsck, not normal operations, all
normal ops should work fine.

So to convert your old git setup to a new git setup, do the following:

 - save your old setup. Just in case. I've converted my whole kernel tree 
   this way, so it's actually tested and I felt comfortable enough with it 
   to blow the old one away, but never take risks.

 - do _not_ update to my new version first. Instead, while you still have 
   an fsck that is happy with your old archive, make sure to fsck 
   everything you have with

	fsck-cache --unreachable $(cat .git/HEAD)

   and it shouldn't complain about anything. Use "git-prune-script" to 
   remove dangling objects if you want.

   (If you read this after you already updated, no worries - everything 
   should still work. It's just a good idea to verify your old repo first)

 - update to my new git tools. checkout, build, install

 - convert your git object database with

	convert-cache $(cat .git/HEAD)

   which will give you a new head object. Just for fun, you can 
   double-check it by converting again: "re-converting" should always
   result in the same head object. If it doesn't, something is wrong.

 - take the new head object, and make it your new head:

	echo xxxxxx > .git/HEAD

 - run the new "fsck-cache". It should complain about "sha1 mismatch" for 
   all your old objects, and they should all be unreachable (and you 
   should have two root objects: your old root and your new root)

 - run "git-prune-script" to remove all the unreachable objects (which are 
   all old).

 - run "fsck-cache --unreachable $(cat .git/HEAD)" with the new fsck
   again, just to check that it is now quiet.

 - blow your old index file away by re-reading your HEAD tree:

	cat-file commit $(cat .git/HEAD)
	read-tree .....

 - "update-cache --refresh"

Doing this on the git repository is nearly instantaneous. Doing it on the
kernel takes maybe a minute or so, depending on how fast your machine is.
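
As a rough illustration of why the conversion produces a new head and why
repeating it must give the same answer, here is a toy model. It is not
convert-cache itself: the object layout in `serialize`, the in-memory
dictionaries, and all function names are assumptions for this sketch.

```python
import hashlib
import zlib


def serialize(refs, data):
    # Toy layout (an assumption): one referenced name per line, blank line, payload.
    return "".join(r + "\n" for r in refs).encode() + b"\n" + data


def old_name(raw):
    return hashlib.sha1(zlib.compress(raw)).hexdigest()  # hash after compression


def new_name(raw):
    return hashlib.sha1(raw).hexdigest()                 # hash before compression


def convert(name, old_db, new_db, memo):
    """Convert object `name` and everything it references; return its new name."""
    if name in memo:
        return memo[name]
    refs, data = old_db[name]
    # Children first: the parent's bytes embed the children's *new* names.
    raw = serialize([convert(r, old_db, new_db, memo) for r in refs], data)
    memo[name] = new_name(raw)
    new_db[memo[name]] = raw
    return memo[name]


def put_old(old_db, refs, data):
    name = old_name(serialize(refs, data))
    old_db[name] = (refs, data)
    return name


old_db = {}
blob = put_old(old_db, [], b"hello\n")
tree = put_old(old_db, [blob], b"tree payload")
head = put_old(old_db, [tree], b"commit message")

new_db = {}
first = convert(head, old_db, new_db, {})
second = convert(head, old_db, {}, {})
print(first == second, first != head)
```

Because every object that names another object has to be rewritten with the
new name, the change ripples all the way up to the head. The result depends
only on the contents, so running the conversion twice from the same old head
lands on the same new head.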

Sorry about this, but it's a hell of a lot simpler to do it now than it
will be after we have lots of users, and I've really tried to make the
conversion be as simple and painless as possible.

And while it doesn't matter right now (since git still does exactly the
same - I did the minimal changes necessary to get the new hashes, and
that's it), this _will_ allow us to notice existing objects before we
compress them, and we can now play with different compression levels
without it being horribly painful.
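
A sketch of what that could look like, under stated assumptions: the
`ObjectStore` class, its in-memory dictionary and the `compressions` counter
are hypothetical and exist only to show the idea, not how git writes objects.

```python
import hashlib
import zlib


class ObjectStore:
    """Toy in-memory store keyed by the SHA1 of the uncompressed object."""

    def __init__(self, level: int = 6):
        self.level = level
        self.files = {}        # name -> compressed bytes
        self.compressions = 0  # how many times zlib was actually run

    def write(self, raw: bytes) -> str:
        name = hashlib.sha1(raw).hexdigest()
        if name in self.files:
            return name        # already stored: skip compression entirely
        self.files[name] = zlib.compress(raw, self.level)
        self.compressions += 1
        return name

    def read(self, name: str) -> bytes:
        return zlib.decompress(self.files[name])


store = ObjectStore(level=6)
a = store.write(b"blob 6\0hello\n")
b = store.write(b"blob 6\0hello\n")   # second write finds the existing object
c = ObjectStore(level=9).write(b"blob 6\0hello\n")
print(a == b == c, store.compressions)
```

Since the name is known before any compression happens, a duplicate write
costs one hash and one lookup, and stores using different compression levels
still agree on every object name.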

				Linus