linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* RE: more git updates..
@ 2005-04-10 22:07 Luck, Tony
  2005-04-10 22:11 ` Petr Baudis
  0 siblings, 1 reply; 23+ messages in thread
From: Luck, Tony @ 2005-04-10 22:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Petr Baudis, Randy.Dunlap, Ross Vandegrift, Kernel Mailing List

>Also, I did actually debate that issue with myself, and decided that even
>if we do have tons of files per directory, git doesn't much care. The
>reason? Git never _searches_ for them. Assuming you have enough memory to
>cache the tree, you just end up doing a "lookup", and inside the kernel
>that's done using an efficient hash, which doesn't actually care _at_all_
>about how many files there are per directory.

So long as the hash *is* efficient when the directory is packed full of
38 character filenames made only of [0-9a-f] ... which might not match
the test cases under which the hash was picked :-)  When there are some
full-sized kernel git images, someone should do a sanity check.

>Hey, I may end up being wrong, and yes, maybe I should have done a 
>two-level one. The good news is that we can trivially fix it later (even 
>dynamically - we can make the "sha1 object tree layout" be a per-tree 
>config option, and there would be no real issue, so you could make small 
>projects use a flat version and big projects use a very deep structure 
>etc). You'd just have to script some renames to move the files around.

It depends on how many eco-system shell scripts get built that need to
know about the layout ... if some shell/perl "libraries" encode this
filename layout (and people use them) ... then switching later would
indeed be painless.

-Tony

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: RE: more git updates..
  2005-04-10 22:07 more git updates Luck, Tony
@ 2005-04-10 22:11 ` Petr Baudis
  0 siblings, 0 replies; 23+ messages in thread
From: Petr Baudis @ 2005-04-10 22:11 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Linus Torvalds, Randy.Dunlap, Ross Vandegrift, Kernel Mailing List

Dear diary, on Mon, Apr 11, 2005 at 12:07:37AM CEST, I got a letter
where "Luck, Tony" <tony.luck@intel.com> told me that...
..snip..
> >Hey, I may end up being wrong, and yes, maybe I should have done a 
> >two-level one. The good news is that we can trivially fix it later (even 
> >dynamically - we can make the "sha1 object tree layout" be a per-tree 
> >config option, and there would be no real issue, so you could make small 
> >projects use a flat version and big projects use a very deep structure 
> >etc). You'd just have to script some renames to move the files around.
> 
> It depends on how many eco-system shell scripts get built that need to
> know about the layout ... if some shell/perl "libraries" encode this
> filename layout (and people use them) ... then switching later would
> indeed be painless.

FWIW, my short-term plans include support for monotone-like hash ID
shortening - it's enough to use the shortest leading unique part of the
ID to identify the revision. I will poke to the object repository for
that. I also already have Randy Dunlap's git lsobj, which will list all
objects of a specified type (very useful especially when looking for
orphaned commits and such rather lowlevel work).

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Re: more git updates..
  2005-04-13  1:10                       ` Linus Torvalds
  2005-04-13 10:59                         ` Andrea Arcangeli
@ 2005-04-13 20:44                         ` Matt Mackall
  1 sibling, 0 replies; 23+ messages in thread
From: Matt Mackall @ 2005-04-13 20:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrea Arcangeli, David Eger, Petr Baudis, Randy.Dunlap,
	Ross Vandegrift, Kernel Mailing List

On Tue, Apr 12, 2005 at 06:10:27PM -0700, Linus Torvalds wrote:
> 
> 
> On Wed, 13 Apr 2005, Andrea Arcangeli wrote:
> > 
> > I wasn't suggesting to use CVS. I meant that for a newly developed SCM,
> > the CVS/SCCS format as storage may be more appealing than the current
> > git format.
> 
> Go wild. I did mine in six days, and you've been whining about other 
> peoples SCM's for three years.

I wrote a hack to do efficient delta storage with O(1) seeks for
lookup and append last week, I believe it's been integrated into the
latest Bazaar-NG. I expect it'll give better compression and
performance than BK. Of course it ends up being O(revisions) for
modifications or insertions (but that is probably a non-issue for the
SCM models we're looking at).

The git model is obviously very different, but I worry about the slop
space implied. With 200k file revision and an average of 2k slop per
file, that's 400MB of slop, or almost the size of an equivalent delta
compressed kernel repo.

Now if you can assume that blobs never change and are never deleted,
you can simply append them all onto a log, and then index them with a
separate file containing an htree of (sha1, offset, length) or the
like. Since the key is already a strong hash, this is an excellent
match and avoids rehashing in the kernel's directory lookup. And it'll
save an inode, a directory entry, and about half a data block per
entry. "Open" will also be cheaper as there's no per-revision inode to
grab.

I could hack on this if you think it fits with the git model,
otherwise I'll go back to my other experiments..

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Re: more git updates..
  2005-04-13  9:30                     ` Russell King
  2005-04-13 10:20                       ` Andrea Arcangeli
@ 2005-04-13 14:43                       ` Linus Torvalds
  1 sibling, 0 replies; 23+ messages in thread
From: Linus Torvalds @ 2005-04-13 14:43 UTC (permalink / raw)
  To: Russell King
  Cc: Andrea Arcangeli, David Eger, Petr Baudis, Randy.Dunlap,
	Ross Vandegrift, Kernel Mailing List



On Wed, 13 Apr 2005, Russell King wrote:
> 
> And my entire 2.6.12-rc2 BK tree, unchecked out, is about 220MB, which
> is more dense than CVS.
> 
> BK is also a lot better than CVS.  So _your_ point is?

Hey, anybody who wants to argue that BK is getter than GIT won't be 
getting any counter-arguments from me.

The fact is, I have constraints. Like needing something to work within a
few days. If somebody comes up with a ultra-fast, replicatable, space
efficient SCM in three days, I'm all over it. 

In the meantime, I'd suggest people who worry about network bandwidth try 
to work out a synchronization protocol that allows you to send "diff 
updates" between git repositories. The git model doesn't preclude looking 
at the objects and sending diffs instead (and re-creating the objects on 
the other side). But my time-constraints _do_.

		Linus

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Re: more git updates..
  2005-04-13  1:10                       ` Linus Torvalds
@ 2005-04-13 10:59                         ` Andrea Arcangeli
  2005-04-13 20:44                         ` Matt Mackall
  1 sibling, 0 replies; 23+ messages in thread
From: Andrea Arcangeli @ 2005-04-13 10:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Eger, Petr Baudis, Randy.Dunlap, Ross Vandegrift,
	Kernel Mailing List

On Tue, Apr 12, 2005 at 06:10:27PM -0700, Linus Torvalds wrote:
> Go wild. I did mine in six days, and you've been whining about other 
> peoples SCM's for three years.

Even if I spend 6 days doing git, you'd never have thrown away BK in
exchange for git.

> In other words - go and _do_ something instead of whining. I'm not 
> interested.

CVS and SVN are already an order of magnitude more efficient than git at
storing and exporting the data and they shouldn't annoy you during the
checkins either, they have a backend much more efficient than git too,
and yet you seem not to care about them.

My suggestion was simply to at least change git to coalesce the diffs
like CVS/SCCS, I'm only making a suggestion to give git a chance to have
a backend at least as efficient as the one that CVS uses and to avoid
running rsync on a 2.8G uncompressible blob. I don't have enough spare
time to do something myself, my spare time would be too short anyway to
make a difference in SCM space, so I'd rather spend it all in more
innovative space where it might have a slight change to make a
difference.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Re: more git updates..
  2005-04-13  9:30                     ` Russell King
@ 2005-04-13 10:20                       ` Andrea Arcangeli
  2005-04-13 14:43                       ` Linus Torvalds
  1 sibling, 0 replies; 23+ messages in thread
From: Andrea Arcangeli @ 2005-04-13 10:20 UTC (permalink / raw)
  To: Linus Torvalds, David Eger, Petr Baudis, Randy.Dunlap,
	Ross Vandegrift, Kernel Mailing List

On Wed, Apr 13, 2005 at 10:30:52AM +0100, Russell King wrote:
> And my entire 2.6.12-rc2 BK tree, unchecked out, is about 220MB, which
> is more dense than CVS.

Yep, this is why I mentioned SCCS format too, I didn't know it was even
smaller, but I expected a similar density from SCCS.

> Note: I'm _not_ arguing with your sentiments towards CVS.  However, I
> think the space usage point still stands.

If it wasn't for network synchronization it almost wouldn't matter, but
fetching 2.8G uncompressible when I could simply fetch 220MB
compressible (that will compress with zlib at little cost during rsync
to less than 78M), sounds a bit overkill.

> What is the space usage behaviour when you have multiple git trees?

Multiple trees in the sense of pulls from multiple developers aren't
more costly than a normal checkin, due the "soft hardlink" property of
the hashes. It's just every checkin taking lots of space, and generating
a new uncompressible blobs every time a changeset touches one file.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Re: more git updates..
  2005-04-12 23:45                   ` Linus Torvalds
  2005-04-13  0:14                     ` Andrea Arcangeli
@ 2005-04-13  9:30                     ` Russell King
  2005-04-13 10:20                       ` Andrea Arcangeli
  2005-04-13 14:43                       ` Linus Torvalds
  1 sibling, 2 replies; 23+ messages in thread
From: Russell King @ 2005-04-13  9:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrea Arcangeli, David Eger, Petr Baudis, Randy.Dunlap,
	Ross Vandegrift, Kernel Mailing List

On Tue, Apr 12, 2005 at 04:45:07PM -0700, Linus Torvalds wrote:
> On Wed, 13 Apr 2005, Andrea Arcangeli wrote:
> > At the rate of 9M for every 198 changeset checkins, that means I'll have
> > to download 2.7G _uncompressible_ (i.e. already compressed with a bad
> > per-file ratio due the too-small files) for a whole pack including all
> > changesets without accounting the original 111MB of the original tree,
> > with rsync -z of git.  That compares with 514M _compressible_ with CVS
> > format on-disk, and with ~79M of the CVS-network download with rsync -z of
> > the CVS repository (assuming default gzip compression level).
> 
> Yes. CVS is much denser.
> 
> CVS is also total crap. So your point is?

And my entire 2.6.12-rc2 BK tree, unchecked out, is about 220MB, which
is more dense than CVS.

BK is also a lot better than CVS.  So _your_ point is?

8)

Note: I'm _not_ arguing with your sentiments towards CVS.  However, I
think the space usage point still stands.

What is the space usage behaviour when you have multiple git trees?
Do we need a git relink command in git-pasky? 8)

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:  2.6 Serial core

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Re: more git updates..
  2005-04-13  0:14                     ` Andrea Arcangeli
@ 2005-04-13  1:10                       ` Linus Torvalds
  2005-04-13 10:59                         ` Andrea Arcangeli
  2005-04-13 20:44                         ` Matt Mackall
  0 siblings, 2 replies; 23+ messages in thread
From: Linus Torvalds @ 2005-04-13  1:10 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: David Eger, Petr Baudis, Randy.Dunlap, Ross Vandegrift,
	Kernel Mailing List



On Wed, 13 Apr 2005, Andrea Arcangeli wrote:
> 
> I wasn't suggesting to use CVS. I meant that for a newly developed SCM,
> the CVS/SCCS format as storage may be more appealing than the current
> git format.

Go wild. I did mine in six days, and you've been whining about other 
peoples SCM's for three years.

In other words - go and _do_ something instead of whining. I'm not 
interested.

		Linus

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Re: more git updates..
  2005-04-12 23:45                   ` Linus Torvalds
@ 2005-04-13  0:14                     ` Andrea Arcangeli
  2005-04-13  1:10                       ` Linus Torvalds
  2005-04-13  9:30                     ` Russell King
  1 sibling, 1 reply; 23+ messages in thread
From: Andrea Arcangeli @ 2005-04-13  0:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Eger, Petr Baudis, Randy.Dunlap, Ross Vandegrift,
	Kernel Mailing List

On Tue, Apr 12, 2005 at 04:45:07PM -0700, Linus Torvalds wrote:
> Yes. CVS is much denser.
>
> CVS is also total crap. So your point is?

I wasn't suggesting to use CVS. I meant that for a newly developed SCM,
the CVS/SCCS format as storage may be more appealing than the current
git format. I guess I should have said RCS instead of CVS, sorry if that
created any confusion. The arch/darcs approach of pratically storing
patches would also be much denser but it has no efficient way of doing
"rcs up -p 1.x" on a file, that doesn't involve potentially unpacking
tons of unrelated changesets.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Re: more git updates..
  2005-04-12 22:36                 ` David Eger
@ 2005-04-12 23:48                   ` Panagiotis Issaris
  0 siblings, 0 replies; 23+ messages in thread
From: Panagiotis Issaris @ 2005-04-12 23:48 UTC (permalink / raw)
  To: David Eger
  Cc: Linus Torvalds, Petr Baudis, Randy.Dunlap, Ross Vandegrift,
	Kernel Mailing List

Hi David,

On Tue, Apr 12, 2005 at 06:36:23PM -0400, David Eger wrote:
> > No. A tree is not the full data. A tree contains enough information
> > to 
> > _recreate_ the full data, but the tree itself just tells you _how_
> > to do 
> > that. It doesn't contain very much of the data itself at all.
> 
> Perhaps I'd understand this if you tell me what "recreate" means.
> If a have a SHA1 hash of a file, and I have the file, I can verify
> that said
> file has the SHA1 hash it's supposed to have, but I can't generate the
> file
> from it's hash...

But, but if you have that hexified SHA1 hash of a particular file you
want to access, there would be a file with a filename equal to that
hexified SHA1 hash which contained the compressed contents of the file
you're looking for.

At least, that's how I understood it...

With friendly regards,
Takis

-- 
OpenPGP key: http://lumumba.luc.ac.be/takis/takis_public_key.txt
fingerprint: 6571 13A3 33D9 3726 F728  AA98 F643 B12E ECF3 E029

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Re: more git updates..
  2005-04-12 23:40                 ` Andrea Arcangeli
@ 2005-04-12 23:45                   ` Linus Torvalds
  2005-04-13  0:14                     ` Andrea Arcangeli
  2005-04-13  9:30                     ` Russell King
  0 siblings, 2 replies; 23+ messages in thread
From: Linus Torvalds @ 2005-04-12 23:45 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: David Eger, Petr Baudis, Randy.Dunlap, Ross Vandegrift,
	Kernel Mailing List



On Wed, 13 Apr 2005, Andrea Arcangeli wrote:
> 
> At the rate of 9M for every 198 changeset checkins, that means I'll have
> to download 2.7G _uncompressible_ (i.e. already compressed with a bad
> per-file ratio due the too-small files) for a whole pack including all
> changesets without accounting the original 111MB of the original tree,
> with rsync -z of git.  That compares with 514M _compressible_ with CVS
> format on-disk, and with ~79M of the CVS-network download with rsync -z of
> the CVS repository (assuming default gzip compression level).

Yes. CVS is much denser.

CVS is also total crap. So your point is?

		Linus

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Re: more git updates..
  2005-04-12 21:21               ` Linus Torvalds
  2005-04-12 22:36                 ` David Eger
@ 2005-04-12 23:40                 ` Andrea Arcangeli
  2005-04-12 23:45                   ` Linus Torvalds
  1 sibling, 1 reply; 23+ messages in thread
From: Andrea Arcangeli @ 2005-04-12 23:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Eger, Petr Baudis, Randy.Dunlap, Ross Vandegrift,
	Kernel Mailing List

On Tue, Apr 12, 2005 at 02:21:58PM -0700, Linus Torvalds wrote:
> The full .git archive for 199 versions of the kernel (the 2.6.12-rc2 one
> and a test-run of 198 patches from Andrew) is 111MB. In other words,
> adding 198 "full" new kernels only grew the archive by 9MB (that's all
> "actual disk usage" btw - the files themselves are smaller, but since they
> all end up taking up a full disk block..)

reiserfs can do tail packing, plus the disk block is meaningless when
fetching the data from the network which is the real cost to worry about
when synchronizing and downloading (disk cost isn't a big deal).

The pagecache cost sounds a very minor one too, since you don't need
the whole data in ram, not even all dentries need to be in cache.  This
is one of the reasons why you don't need to run readdir, and why you can
discard the old trees anytime.

At the rate of 9M for every 198 changeset checkins, that means I'll have
to download 2.7G _uncompressible_ (i.e. already compressed with a bad
per-file ratio due the too-small files) for a whole pack including all
changesets without accounting the original 111MB of the original tree,
with rsync -z of git.  That compares with 514M _compressible_ with CVS
format on-disk, and with ~79M of the CVS-network download with rsync -z of
the CVS repository (assuming default gzip compression level).

What BKCVS provided with 79M of rsync -z, now is provided with 2.8G of
rsync -z, with a network-bound slowdown of -97.2%. Similar slowdowns
should be expected for synchronizations over time while fetching new
blobs etc...

Ok, BKCVS has less than 60000 checkins due the linearization and
coalescing of pulls that couldn't be represented losslessy in CVS, so
the network-bound slowdown is less than -97.2%, my math is
approximative, but the order of magnitude should remain the same.

Clearly one can write an ad-hoc network protocol instead of using
rsync/wget, but the server will need quite a bit of cpu and ram to do a
checkout/update/sync efficiently to unpack all data and create all
changesets to gzip and transfer.

Anyway git simplicity and immutable hashes robustness certainly makes it
an ideal interim format (and it may even be a very pratical local
live format on-disk, except for the backups), I'm only unsure if it's a
wise idea to build an SCM on top of the current git format or if it's
better to use something like SCCS or CVS to coalesce all diffs of a
single file together and to save space and make rsync -z very efficient
too (or an approach like arch and darcs that stores changesets per file,
i.e. patches).

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Re: more git updates..
  2005-04-12 21:21               ` Linus Torvalds
@ 2005-04-12 22:36                 ` David Eger
  2005-04-12 23:48                   ` Panagiotis Issaris
  2005-04-12 23:40                 ` Andrea Arcangeli
  1 sibling, 1 reply; 23+ messages in thread
From: David Eger @ 2005-04-12 22:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Petr Baudis, Randy.Dunlap, Ross Vandegrift, Kernel Mailing List

On Tue, Apr 12, 2005 at 02:21:58PM -0700, Linus Torvalds wrote:
> 
> Yes. A tree is defined by the blobs it references (and the subtrees) but 
> it doesn't _contain_ them. It just contains a pointer to them.

A pointer to them?  You mean a SHA1 hash of them? or what?
Where is the *real* data stored?  The real files, the real patches?
Are these somewhere completely outside of git?

> > Therefore, "TREE" must be the *full* data, and since we have the following
> > definition for CHANGESET:
> 
> No. A tree is not the full data. A tree contains enough information to 
> _recreate_ the full data, but the tree itself just tells you _how_ to do 
> that. It doesn't contain very much of the data itself at all.

Perhaps I'd understand this if you tell me what "recreate" means.
If a have a SHA1 hash of a file, and I have the file, I can verify that said
file has the SHA1 hash it's supposed to have, but I can't generate the file
from it's hash...

Sorry for being stubbornly dumb, but you'll have a couple of us puzzling 
at the README ;-)

-dte

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Re: more git updates..
  2005-04-12 20:44             ` David Eger
@ 2005-04-12 21:21               ` Linus Torvalds
  2005-04-12 22:36                 ` David Eger
  2005-04-12 23:40                 ` Andrea Arcangeli
  0 siblings, 2 replies; 23+ messages in thread
From: Linus Torvalds @ 2005-04-12 21:21 UTC (permalink / raw)
  To: David Eger
  Cc: Petr Baudis, Randy.Dunlap, Ross Vandegrift, Kernel Mailing List



On Tue, 12 Apr 2005, David Eger wrote:
> 
> The reason I am questioning this point is the GIT README file.
> 
> Linus makes explicit that a "blob" is just the "file contents," and that
> really, a "blob" is not just the SHA1 of the "blob":
> 
> > In particular, the "current directory cache" certainly does not need to
> > be consistent with the current directory contents, but it has two very
> > important attributes:
> > 
> > (a) it can re-generate the full state it caches (not just the directory
> >     structure: through the "blob" object it can regenerate the data too)
> 
> And he defines "TREE" with the same name: blob

Yes. A tree is defined by the blobs it references (and the subtrees) but 
it doesn't _contain_ them. It just contains a pointer to them.

> Therefore, "TREE" must be the *full* data, and since we have the following
> definition for CHANGESET:

No. A tree is not the full data. A tree contains enough information to 
_recreate_ the full data, but the tree itself just tells you _how_ to do 
that. It doesn't contain very much of the data itself at all.

> That each changeset remembers *everything* for *each point in the tree*.

But only BY REFERENCE. A "commit" is usually very small. For example, the
top-of-tree commit-file for my currest kernel test is literally 401
_bytes_ in size. Because it just references a tree (20 bytes of
_reference_).

> Linus, if you actually mean to differentiate between the full data
> and a SHA1 of the data

There is no differentiation. The sha1 _is_ the data as far as git is 
concerned. 

It's only confusing if you think they are different. 

> Also, the details of just what data constitutes a 'changeset' would be
> lovely... i.e. a precise spec of what Pat is describing below...

	torvalds@ppc970:~/test-tools/linux-2.6.12-rc2> cat-file commit `cat .git/HEAD `
	tree cf9fd295d3048cd84c65d5e1a5a6b606bf4fddc6
	parent c7a1a189dd0fe2c6ecd0aa33f2bd2f414c7892a0
	author NeilBrown <neilb@cse.unsw.edu.au> Tue Apr 12 08:27:08 2005
	committer Linus Torvalds <torvalds@ppc970.osdl.org> Tue Apr 12 08:27:08 2005

	[PATCH] md: remove a number of misleading calls to MD_BUG

	The conditions that cause these calls to MD_BUG are not kernel bugs, just
	oddities in what userspace is asking for.

	Also convert analyze_sbs to return void, and the value it returned was
	always 0.

	Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
	Signed-off-by: Andrew Morton <akpm@osdl.org>
	Signed-off-by: Linus Torvalds <torvalds@osdl.org>

That's it. In all it's glory. Compressed and tagged it's 401 bytes. 

The tree it references is 677 bytes in size. That in turn references a 
number of subtrees, but almost all of the sub-trees are shared with 
_other_ tree commits, so their size is spread out over all the commits.

The full archive of the 2.6.12-rc2 kernel that I used for testing (only
_one_ version) is 102MB in size. That's about half of what the kernel is
uncompressed.

The full .git archive for 199 versions of the kernel (the 2.6.12-rc2 one
and a test-run of 198 patches from Andrew) is 111MB. In other words,
adding 198 "full" new kernels only grew the archive by 9MB (that's all
"actual disk usage" btw - the files themselves are smaller, but since they
all end up taking up a full disk block..)

Basically, the whole point of git is that objects are equated with their 
sha1 name, and that you can thus "include" an object by just referring to 
its name. The two are equivalent. 

		Linus

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Re: more git updates..
  2005-04-12  8:16           ` Petr Baudis
@ 2005-04-12 20:44             ` David Eger
  2005-04-12 21:21               ` Linus Torvalds
  0 siblings, 1 reply; 23+ messages in thread
From: David Eger @ 2005-04-12 20:44 UTC (permalink / raw)
  To: Petr Baudis
  Cc: Linus Torvalds, Randy.Dunlap, Ross Vandegrift, Kernel Mailing List


The reason I am questioning this point is the GIT README file.

Linus makes explicit that a "blob" is just the "file contents," and that
really, a "blob" is not just the SHA1 of the "blob":

> In particular, the "current directory cache" certainly does not need to
> be consistent with the current directory contents, but it has two very
> important attributes:
> 
> (a) it can re-generate the full state it caches (not just the directory
>     structure: through the "blob" object it can regenerate the data too)

And he defines "TREE" with the same name: blob

> TREE: The next hierarchical object type is the "tree" object.  A tree
> object is a list of permission/name/blob data, sorted by name.

Therefore, "TREE" must be the *full* data, and since we have the following
definition for CHANGESET:

> A "changeset" is defined by the tree-object that it results in, the
> parent changesets (zero, one or more) that led up to that point, and a
> comment on what happened.

That each changeset remembers *everything* for *each point in the tree*.

Linus, if you actually mean to differentiate between the full data
and a SHA1 of the data, *please please please* say "blob" in one place
and "SHA1 of the blob" elsewhere.  It's quite confusing, to me at least.

Also, the details of just what data constitutes a 'changeset' would be
lovely... i.e. a precise spec of what Pat is describing below...

-dte 

> where David Eger <eger@havoc.gtf.org> told me that...
> > So with git, *every* changeset is an entire (compressed) copy of the
> > kernel.  Really?  Every patch you accept adds 37 MB to your hard disk?
> > 
> > Am I missing something here?
> 
> Yes. Only changes files re-appear. The unchanged files keep the same
> SHA1 hash, therefore they don't re-appear in the repository.
> 
> So, if Linus gets a patch which sanitizes drivers/char/selection.c,
> only these new objects appear in the repository:
> 
> 	drivers/char/selection.c
> 	drivers/char
> 	drivers
> 	. (project root)
> 	commit message
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Re: more git updates..
  2005-04-12  4:05         ` David Eger
@ 2005-04-12  8:16           ` Petr Baudis
  2005-04-12 20:44             ` David Eger
  0 siblings, 1 reply; 23+ messages in thread
From: Petr Baudis @ 2005-04-12  8:16 UTC (permalink / raw)
  To: David Eger
  Cc: Linus Torvalds, Randy.Dunlap, Ross Vandegrift, Kernel Mailing List

Dear diary, on Tue, Apr 12, 2005 at 06:05:19AM CEST, I got a letter
where David Eger <eger@havoc.gtf.org> told me that...
> So with git, *every* changeset is an entire (compressed) copy of the
> kernel.  Really?  Every patch you accept adds 37 MB to your hard disk?
> 
> Am I missing something here?

Yes. Only changes files re-appear. The unchanged files keep the same
SHA1 hash, therefore they don't re-appear in the repository.

So, if Linus gets a patch which sanitizes drivers/char/selection.c,
only these new objects appear in the repository:

	drivers/char/selection.c
	drivers/char
	drivers
	. (project root)
	commit message

Kind regards,

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Re: more git updates..
  2005-04-11 15:49                 ` Randy.Dunlap
@ 2005-04-11 18:30                   ` Petr Baudis
  0 siblings, 0 replies; 23+ messages in thread
From: Petr Baudis @ 2005-04-11 18:30 UTC (permalink / raw)
  To: Randy.Dunlap; +Cc: Linus Torvalds, pj, junkio, ross, linux-kernel

Dear diary, on Mon, Apr 11, 2005 at 05:49:31PM CEST, I got a letter
where "Randy.Dunlap" <rddunlap@osdl.org> told me that...
> On Sun, 10 Apr 2005 16:38:00 -0700 (PDT) Linus Torvalds wrote:
..snip..
> | Yes. Crappy old tree, but it can still read my git.git directory, so you 
> | can use it to update to my current source base.
> 
> Please go into a little more detail about how to do this step...
> that seems to be the most basic concept that I am missing.
> i.e., how to find the "latest/current" tree (version/commit)
> and check it out (read-tree, checkout-cache, etc.).

Well, its ID is by convention kept in .dircache/HEAD. But that is really
only a convention, no "core git" tool reads it directly, and you need to
update it manually after you do commit-tree.

First, you need to get the accompanying tree's id. git-pasky's shortcut
is $(tree-id), but manually you can do it by

	$(cat-file commit $(cat .dircache/HEAD)) | egrep '^tree'

Note that if you ever forgot to update HEAD or if you have multiple
branches in your repository, you can list all "head commits" (that is,
commits which have no other commits referencing them as parents) by
doing fsck-cache.

Now, you need to populate the directory cache by the tree (see Paul
Jackson's diagram):

	read-tree $tree_id

And now you want to update your working tree from the cache:

	checkout-cache -a -f

This will bring your tree in sync with the cache (it won't remove any
stale files, though). That means it will overwrite your local changes
too - turn that off by omitting the "-f". If you want to update only
some files, omit the "-a" and list them.

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Re: more git updates..
  2005-04-10 23:14             ` Paul Jackson
  2005-04-10 23:38               ` Linus Torvalds
@ 2005-04-11  0:10               ` Petr Baudis
  1 sibling, 0 replies; 23+ messages in thread
From: Petr Baudis @ 2005-04-11  0:10 UTC (permalink / raw)
  To: Paul Jackson; +Cc: Linus Torvalds, junkio, rddunlap, ross, linux-kernel

Dear diary, on Mon, Apr 11, 2005 at 01:14:57AM CEST, I got a letter
where Paul Jackson <pj@engr.sgi.com> told me that...
> Useful explanation - thanks, Linus.
> 
> Is this picture and description accurate:
> 
> ==============================================================
> 
> 
>              < working directory files (foo.c) >
>                            ^
>   ^                        |
>   |  upward ops            |            downward ops  |
>   |  ----------            |            ------------  |
>   | checkout-cache         |            update-cache  |
>   | show-diff              |                          v
>                            v
>         < current directory cache (".dircache/index") >
>                            ^
>   ^                        |
>   |  upward ops            |            downward ops  |
>   |  ----------            |            ------------  |
>   |   read-tree            |             write-tree   |
>   |                        |            commit-tree   |
>                            |                          v
>                            v
> < git filesystem (blobs, trees, commits: .dircache/{HEAD,objects}) >

Well, except that from purely technical standpoint commit-tree has
nothing to do in this picture - it creates new object in the git
filesystem based on its input data, but regardless to the directory
cache or current tree. It probably still belongs where it is from the
workflow standpoint, though.

..snip..
> Minor question:
> 
>   I must have an old version - I got 'git-0.03', but
>   it doesn't have 'checkout-cache', and its 'read-tree'
>   directly writes my working files.
>  
>   How do I get a current version?  Well, one way I see,
>   and that's to pick up Pasky's:
>     
>     http://pasky.or.cz/~pasky/dev/git/git-pasky-base.tar.bz2
>  
>   Perhaps that's the best way?

You can take mine, and do:

	git pull pasky
	git pull linus
	cp .dircache/HEAD .dircache/HEAD.local

Now, your tree and git filesystem is up to date.

	git track local

Now, when you do git pull pasky, your working tree will not be updated
automatically anymore.

	git track linus

Now, you start tracking Linus' tree instead. Note that the initial
update will blow away the scripts in your current tree, so before you do
the last two steps you will probably want to clone the tree and set PATH
to the one still tracking me, so you get all the comfort. ;-)

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Re: more git updates..
  2005-04-10 18:42             ` Christopher Li
@ 2005-04-10 22:30               ` Petr Baudis
  0 siblings, 0 replies; 23+ messages in thread
From: Petr Baudis @ 2005-04-10 22:30 UTC (permalink / raw)
  To: Christopher Li; +Cc: Paul Jackson, torvalds, rddunlap, ross, linux-kernel

Dear diary, on Sun, Apr 10, 2005 at 08:42:53PM CEST, I got a letter
where Christopher Li <lkml@chrisli.org> told me that...
> I totally agree that odds is really really small.
> That is why it is not worthy to handle the case. People hit that
> can just add a new line or some thing to avoid it, if
> it happen after all.
> 
> It is the little peace of mind to know for sure that did
> not happen. I am just paranoid. 

BTW, I've merged the check to git-pasky some time ago, you can disable
it in the Makefile. It is by default on now, until someone convinces me
it actually affects performance measurably.

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Re: more git updates..
  2005-04-10  9:28         ` Junio C Hamano
@ 2005-04-10  9:48           ` Petr Baudis
  0 siblings, 0 replies; 23+ messages in thread
From: Petr Baudis @ 2005-04-10  9:48 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Christopher Li, Linus Torvalds, Randy.Dunlap, Ross Vandegrift,
	Kernel Mailing List

Dear diary, on Sun, Apr 10, 2005 at 11:28:54AM CEST, I got a letter
where Junio C Hamano <junkio@cox.net> told me that...
> >>>>> "CL" == Christopher Li <lkml@chrisli.org> writes:
> 
> CL> On Sun, Apr 10, 2005 at 12:51:59AM -0700, Junio C Hamano wrote:
> >> 
> >> But I am wondering what your plans are to handle renames---or
> >> does git already represent them?
> >> 
> 
> CL> Rename should just work.  It will create a new tree object and you
> CL> will notice that in the entry that changed, the hash for the blob
> CL> object is the same.
> 
> Sorry, I was unclear.  But doesn't that imply that a SCM built
> on top of git storage needs to read all the commit and tree
> records up to the common ancestor to show tree diffs between two
> forked tree?

No. See diff-tree output and
http://pasky.or.cz/~pasky/dev/git/gitdiff-do for how it's done.
Basically, you just take the two trees and compare them linearily (do a
normal diff on them, essentialy). Then the differences you spot this way
are everything what needs to appear in the patch.

> I suspect that another problem is that noticing the move of the
> same SHA1 hash from one pathname to another and recognizing that
> as a rename would not always work in the real world, because
> sometimes people move files *and* make small changes at the same
> time.  If git is meant to be an intermediate format to suck
> existing kernel history out of BK so that the history can be
> converted for the next SCM chosen for the kernel work, I would
> imagine that there needs to be a way to represent such a case.
> Maybe convert a file rename as two git trees (one tree for pure
> move which immediately followed by another tree for edit) if it
> is not a pure move?

Actually, this could be possible too I think. We will have to make
diff-tree two-pass, but it is already so blinding fast that I guess that
doesn't hurt too much. I might try to get my hands on that.

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Re: more git updates..
  2005-04-10  5:53       ` Christopher Li
  2005-04-10  9:28         ` Junio C Hamano
@ 2005-04-10  9:41         ` Petr Baudis
  2005-04-10  7:09           ` Christopher Li
  1 sibling, 1 reply; 23+ messages in thread
From: Petr Baudis @ 2005-04-10  9:41 UTC (permalink / raw)
  To: Christopher Li
  Cc: Junio C Hamano, Linus Torvalds, Randy.Dunlap, Ross Vandegrift,
	Kernel Mailing List

Dear diary, on Sun, Apr 10, 2005 at 07:53:40AM CEST, I got a letter
where Christopher Li <lkml@chrisli.org> told me that...
> On Sun, Apr 10, 2005 at 12:51:59AM -0700, Junio C Hamano wrote:
> > 
> > But I am wondering what your plans are to handle renames---or
> > does git already represent them?
> >
> 
> Rename should just work.  It will create a new tree object and you
> will notice that in the entry that changed, the hash for the blob
> object is the same.

Which is of course wrong when you want to do proper merging, examine
per-file history, etc. One solution which springs to my mind is to have
a UUID accompany each blob and tree; that will take relatively lot of
space though, and I'm not sure it is really worth it.

How many renames were there in the 64k commits so far anyway?

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Re: more git updates..
  2005-04-10  9:41         ` Petr Baudis
@ 2005-04-10  7:09           ` Christopher Li
  0 siblings, 0 replies; 23+ messages in thread
From: Christopher Li @ 2005-04-10  7:09 UTC (permalink / raw)
  To: Petr Baudis
  Cc: Junio C Hamano, Linus Torvalds, Randy.Dunlap, Ross Vandegrift,
	Kernel Mailing List

On Sun, Apr 10, 2005 at 11:41:53AM +0200, Petr Baudis wrote:
> Dear diary, on Sun, Apr 10, 2005 at 07:53:40AM CEST, I got a letter
> where Christopher Li <lkml@chrisli.org> told me that...
> > On Sun, Apr 10, 2005 at 12:51:59AM -0700, Junio C Hamano wrote:
> > > 
> > > But I am wondering what your plans are to handle renames---or
> > > does git already represent them?
> > >
> > 
> > Rename should just work.  It will create a new tree object and you
> > will notice that in the entry that changed, the hash for the blob
> > object is the same.
> 
> Which is of course wrong when you want to do proper merging, examine
> per-file history, etc. One solution which springs to my mind is to have
> a UUID accompany each blob and tree; that will take relatively lot of
> space though, and I'm not sure it is really worth it.

It should just use the rename + change two step then it is tractable
with git now.

Chris

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Re: more git updates..
  2005-04-09 23:31       ` Linus Torvalds
@ 2005-04-10  2:41         ` Petr Baudis
  2005-04-10  6:53         ` Christopher Li
  2005-04-12  4:05         ` David Eger
  2 siblings, 0 replies; 23+ messages in thread
From: Petr Baudis @ 2005-04-10  2:41 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Randy.Dunlap, Ross Vandegrift, Kernel Mailing List

Dear diary, on Sun, Apr 10, 2005 at 01:31:10AM CEST, I got a letter
where Linus Torvalds <torvalds@osdl.org> told me that...
> On Sat, 9 Apr 2005, Linus Torvalds wrote:
> > 
> > Actually, I guess I wouldn't have to change the format. I could just 
> > extend the existing "tree" object to be able to point to other trees, and 
> > that's it.
> 
> Done, and pushed out. The current git.git repository seems to do all of 
> this correctly.
..snip..

Ok, so now I can dare announce it, I hope. I hacked my branch of git
somewhat, kept in sync with Linus, and now I have something to show.
Please see it at

	http://pasky.or.cz/~pasky/dev/git/

It is basically a set of (still rather crude) shell scripts upon Linus'
git, which make it sanely usable by mere humans for actual version
tracking. Its usage _is_ going to change, so don't get too used to it
(that'd be hard anyway, I suspect), but it should be working nicely.

I have described most of the interesting parts and some basic usage in
the README at that page. It wraps commits, supports log retrieval and
comfortable diffing between any two trees. And on top of that, it can do
some basic remote repositories - it will pull (rsync) from them and it
can make the local copy track them - on pull, it will be updated
accordingly (and your local commits on the tracked branch will get
orphaned).

I didn't attach a patch against Linus since I think it's pretty much
useless now. It's available as against-linus.patch on the web, and
you can apply it to the latest git tree (NOT 0.03). But it's probably
better idea to wget my tree. You can then watch us making progress by

	gitpull.sh linus
	gitpull.sh pasky

and see where we differ by:

	gitdiff.sh linus pasky

(This is how the against-linus.patch was generated. I'd easily generate
even 0.03 patch this way, but I forgot to merge the fsck at that time,
so it would suck.)

(Note that the tree you wget is set up to track my branch. If you want
to stop tracking it (basically necessary now if you want to do local
commits), do:

	cp .dircache/HEAD .dircache/HEAD.local
	gittrack.sh

The cp says that something like "I want to pick up where the tracked
branch left off". Otherwise, untracking would return you to your "local"
branch, which is just some ancient predecessor of the pasky branch here
anyway.)

Note that I didn't really test it on anything but git itself yet, so I'm
not sure how will it cope especially with directories - I tried to make
it aware of them though. I will do some more practical testing tomorrow.

Otherwise, I will probably try to consolidate the usage and
documentation now, and beautify the scripts. I might start pondering
some merging too. Oh, and gitpatch.sh. :-)

Have fun and please share your opinions,

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2005-04-13 20:45 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-04-10 22:07 more git updates Luck, Tony
2005-04-10 22:11 ` Petr Baudis
  -- strict thread matches above, loose matches on Subject: below --
2005-04-09 19:45 Linus Torvalds
2005-04-09 20:07 ` Petr Baudis
2005-04-09 21:00   ` Linus Torvalds
2005-04-09 21:08     ` Linus Torvalds
2005-04-09 23:31       ` Linus Torvalds
2005-04-10  2:41         ` Petr Baudis
2005-04-10  6:53         ` Christopher Li
2005-04-10 19:23           ` Paul Jackson
2005-04-10 18:42             ` Christopher Li
2005-04-10 22:30               ` Petr Baudis
2005-04-12  4:05         ` David Eger
2005-04-12  8:16           ` Petr Baudis
2005-04-12 20:44             ` David Eger
2005-04-12 21:21               ` Linus Torvalds
2005-04-12 22:36                 ` David Eger
2005-04-12 23:48                   ` Panagiotis Issaris
2005-04-12 23:40                 ` Andrea Arcangeli
2005-04-12 23:45                   ` Linus Torvalds
2005-04-13  0:14                     ` Andrea Arcangeli
2005-04-13  1:10                       ` Linus Torvalds
2005-04-13 10:59                         ` Andrea Arcangeli
2005-04-13 20:44                         ` Matt Mackall
2005-04-13  9:30                     ` Russell King
2005-04-13 10:20                       ` Andrea Arcangeli
2005-04-13 14:43                       ` Linus Torvalds
2005-04-10  7:51     ` Junio C Hamano
2005-04-10  5:53       ` Christopher Li
2005-04-10  9:28         ` Junio C Hamano
2005-04-10  9:48           ` Petr Baudis
2005-04-10  9:41         ` Petr Baudis
2005-04-10  7:09           ` Christopher Li
2005-04-10 15:44       ` Linus Torvalds
2005-04-10 18:50         ` Paul Jackson
2005-04-10 20:57           ` Linus Torvalds
2005-04-10 23:14             ` Paul Jackson
2005-04-10 23:38               ` Linus Torvalds
2005-04-11 15:49                 ` Randy.Dunlap
2005-04-11 18:30                   ` Petr Baudis
2005-04-11  0:10               ` Petr Baudis

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).