linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Matt Mackall <mpm@selenic.com>
To: Linus Torvalds <torvalds@osdl.org>
Cc: Bill Davidsen <davidsen@tmr.com>,
	Morten Welinder <mwelinder@gmail.com>,
	Sean <seanlkml@sympatico.ca>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	git@vger.kernel.org
Subject: Re: Mercurial 0.4b vs git patchbomb benchmark
Date: Mon, 2 May 2005 17:00:12 -0700	[thread overview]
Message-ID: <20050503000011.GA22038@waste.org> (raw)
In-Reply-To: <Pine.LNX.4.58.0505021540070.3594@ppc970.osdl.org>

On Mon, May 02, 2005 at 03:49:49PM -0700, Linus Torvalds wrote:
> > >  - you can drop old objects.
> > 
> > You can't drop old objects without dropping all the changesets that
> > refer to them or otherwise being prepared to deal with the broken
> > links.

[...]
 
> I could write this up in ten minutes. It's really simple.

It's still simple in Mercurial, but more importantly Mercurial _won't
need it_. Dropping history is a work-around, not a feature.
 
> > > delta models very fundamentally don't support this. 
> > 
> > The latter can be done in a pretty straightforward manner in mercurial
> > with one pass over the data. But I have a goal to make keeping the
> > whole history cheap enough that no one balks at it.
> 
> With delta's, you have two choices:
> 
>  - change all the sha1 names (ie a pruned tree would no longer be 
>    compatible with a non-pruned one)
>  - make the delta part not show up as part of the sha1 name (which means 
>    that it's unprotected).
> 
> which one would you have?

Umm.. I am _not_ calculating the SHA of the delta itself. That'd be
silly.

There are an arbitrary number of ways to calculate a delta between two
files. Similarly, there are an arbitrary number of ways to compress a
file (gzip has at least 9, not counting all the permutations of
flush). The only sensible thing to do is store a hash of the raw text
and check it against the fully restored text, because that's what you
care about being correct.

In Mercurial, deltas are just a storage detail and are effectively
completely hidden from everything except the innermost part of the
back-end. What's important is that Mercurial knows that A is a
revision of B in the backend and thus has enough information to
opportunistically attempt to calculate a delta.

So if the day ever comes when I want to prune the head of a log, I
simply reconstruct the first version to keep, store it in a new file,
then append all the deltas, unmodified. And fix up the offsets in the
indices. None of the hashes change.

> > What is a tree re-linker? Finds duplicate files and hard-links them?
> > Ok, that makes some sense. But it's a win on one machine and a lose
> > everywhere else.
> 
> Where would it be a loss? Esepcially since with git, it's cheap (you don't 
> need to compare content to find objects to link - you can just compare 
> filename listings).

Git repositories will be 10x larger than Mercurial everywhere that
doesn't benefit from this linking of unrelated trees. That is, folks
who aren't running gitbits.net.

> > I've added an "hg verify" command to Mercurial. It doesn't attempt to
> > fix anything up yet, but it can catch a couple things that git
> > probably can't (like file revisions that aren't owned by any
> > changeset), namely because there's more metadata around to look at.
> 
> git-fsck-cache catches exactly those kinds of things. And since it checks
> pretty much every _single_ assumption in git (which is not a lot, since
> git doesn't have a lot of assumptions), I guarantee you that you can't
> find any more than it does (the filename ordering is the big missing
> piece: I _still_ don't verify that trees are ordered. I've been mentioning
> it since the beginning, but I'm lazy).
> 
> In other words, your verifier can't verify anything more. It's entirely 
> possible that more things can go _wrong_, since you have more indexes, so 
> your verifier will have more to check, but that's not an advantage, that's 
> a downside.

Uh, no. It's just like a filesystem. Redundancy is what lets you
recover.

The extra indices are also very useful in their own right:

- they let you do easily do delta storage
- they let you efficiently do delta transmission
- they let you find past revisions of a file in O(1)
- they let you efficiently do "annotate"
- they let you do smarter merge

At least the first four seem fairly critical to me.

As various people have pointed out, you can hack delta transmission
and file revision indexing on top of git. But to do that, you'll need
to build the same indices that Mercurial has. And you'll need to check
their integrity.

Unfortunately, since the git back-end refuses to know anything about
the relation between file revisions, this will all happen in the front
end, and you'll have done almost all the work needed to do delta
storage without actually getting it. How sad.

You'll also likely end up with something quite a bit more complicated
than Mercurial because of the extra layering. This all strongly suggests
to me that the git back-end is just a little bit too simple.

-- 
Mathematics is the supreme nostalgia of our time.

  reply	other threads:[~2005-05-03  0:00 UTC|newest]

Thread overview: 106+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2005-04-26  0:41 Mercurial 0.3 vs git benchmarks Matt Mackall
2005-04-26  1:49 ` Daniel Phillips
2005-04-26  2:08 ` Linus Torvalds
2005-04-26  2:30   ` Mike Taht
2005-04-26  3:04     ` Linus Torvalds
2005-04-26  4:00       ` Linus Torvalds
2005-04-26 11:13         ` Chris Mason
2005-04-26 15:09           ` Magnus Damm
2005-04-26 15:38             ` Chris Mason
2005-04-26 16:23               ` Magnus Damm
2005-04-26 18:18                 ` Chris Mason
2005-04-26 20:56                 ` Andrew Morton
2005-04-26 21:07                   ` Linus Torvalds
2005-04-26 22:50                     ` H. Peter Anvin
2005-04-26 22:56                     ` Andrew Morton
2005-04-26 23:43                       ` H. Peter Anvin
2005-04-27 15:01                         ` Florian Weimer
2005-04-27 15:13                           ` Thomas Glanzmann
2005-04-27 18:54                             ` H. Peter Anvin
2005-04-27 19:01                               ` Thomas Glanzmann
2005-04-27 19:57                                 ` Theodore Ts'o
2005-04-27 20:06                                   ` Thomas Glanzmann
2005-04-27 20:35                                 ` H. Peter Anvin
2005-04-27 20:39                                   ` Thomas Glanzmann
2005-04-27 20:47                                   ` Florian Weimer
2005-04-27 20:55                                 ` Florian Weimer
2005-04-27 21:04                                   ` H. Peter Anvin
2005-04-27 21:06                                     ` Florian Weimer
2005-04-27 21:32                                       ` Theodore Ts'o
2005-04-27 19:55                       ` Theodore Ts'o
2005-04-27  6:34                   ` Ingo Molnar
2005-04-27 21:10                     ` Bill Davidsen
2005-04-27 21:39                       ` Linus Torvalds
2005-04-26 16:42           ` Linus Torvalds
2005-04-26 17:39             ` Chris Mason
2005-04-26 19:52               ` Chris Mason
2005-04-26 18:15         ` H. Peter Anvin
2005-04-26 20:30           ` Bill Davidsen
2005-04-26 16:11       ` Bill Davidsen
2005-04-26  4:01   ` Matt Mackall
2005-04-26  4:20     ` Linus Torvalds
2005-04-26  4:09   ` Chris Wedgwood
2005-04-26  4:22     ` Andreas Gal
2005-04-26  4:22     ` Linus Torvalds
2005-04-29  6:01   ` Mercurial 0.4b vs git patchbomb benchmark Matt Mackall
2005-04-29  6:40     ` Sean
2005-04-29  7:40       ` Matt Mackall
2005-04-29  8:40         ` Sean
2005-04-29 14:34         ` Linus Torvalds
2005-04-29 15:18           ` Morten Welinder
2005-04-29 16:52             ` Matt Mackall
2005-05-02 16:10               ` Bill Davidsen
2005-05-02 19:02                 ` Sean
2005-05-02 22:02                 ` Linus Torvalds
2005-05-02 22:30                   ` Matt Mackall
2005-05-02 22:49                     ` Linus Torvalds
2005-05-03  0:00                       ` Matt Mackall [this message]
2005-05-03  2:48                         ` Linus Torvalds
2005-05-03  3:29                           ` Matt Mackall
2005-05-03  4:18                             ` Linus Torvalds
2005-05-03  4:24                         ` Linus Torvalds
2005-05-03  4:27                           ` Matt Mackall
2005-05-03  8:45                           ` Chris Wedgwood
2005-04-29 15:44           ` Tom Lord
2005-04-29 15:58             ` Linus Torvalds
2005-04-29 17:34               ` Tom Lord
2005-04-29 17:56                 ` Linus Torvalds
2005-04-29 18:08                   ` Tom Lord
2005-04-29 18:33                     ` Sean
2005-04-29 18:54                       ` Tom Lord
2005-04-29 19:13                         ` Sean
2005-05-02 16:15                           ` Bill Davidsen
2005-04-29 16:37           ` Matt Mackall
2005-04-29 17:09             ` Linus Torvalds
2005-04-29 19:12               ` Matt Mackall
2005-04-29 19:50                 ` Linus Torvalds
2005-04-29 20:23                   ` Matt Mackall
2005-04-29 20:49                     ` Linus Torvalds
2005-04-29 21:20                       ` Matt Mackall
2005-04-29 16:46           ` Bill Davidsen
2005-04-29 20:19       ` Andrea Arcangeli
2005-04-29 22:30         ` Olivier Galibert
2005-04-29 22:47           ` Andrea Arcangeli
2005-04-29 20:30     ` Andrea Arcangeli
2005-04-29 20:39       ` Matt Mackall
2005-04-30  2:52         ` Andrea Arcangeli
2005-04-30 15:20           ` Matt Mackall
2005-04-30 16:37             ` Andrea Arcangeli
2005-05-02 15:49           ` Bill Davidsen
2005-05-02 16:14             ` Valdis.Kletnieks
2005-05-03 17:40               ` Bill Davidsen
2005-05-04  2:10                 ` Mercurial 0.4b vs git patchbomb benchmark (/usr/bin/env again) David A. Wheeler
2005-05-02 16:17             ` Mercurial 0.4b vs git patchbomb benchmark Andrea Arcangeli
2005-05-02 16:31             ` Linus Torvalds
2005-05-02 17:18               ` Daniel Jacobowitz
2005-05-02 17:32                 ` Linus Torvalds
2005-05-02 20:54                 ` Sam Ravnborg
2005-05-02 17:20               ` Ryan Anderson
2005-05-02 17:31                 ` Linus Torvalds
2005-05-02 21:17               ` Kyle Moffett
2005-05-03 17:43               ` Bill Davidsen
     [not found] <3YQn9-8qX-5@gated-at.bofh.it>
     [not found] ` <3ZLEF-56n-1@gated-at.bofh.it>
     [not found]   ` <3ZM7L-5ot-13@gated-at.bofh.it>
     [not found]     ` <3ZN3P-69A-9@gated-at.bofh.it>
     [not found]       ` <3ZNdz-6gK-9@gated-at.bofh.it>
2005-05-03  1:16         ` Bodo Eggert <harvested.in.lkml@posting.7eggert.dyndns.org>
2005-05-03  1:29           ` Matt Mackall
2005-05-03 16:22             ` Bill Davidsen
2005-05-03 17:14               ` Rene Scharfe
2005-05-04 17:51                 ` Bill Davidsen

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20050503000011.GA22038@waste.org \
    --to=mpm@selenic.com \
    --cc=davidsen@tmr.com \
    --cc=git@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mwelinder@gmail.com \
    --cc=seanlkml@sympatico.ca \
    --cc=torvalds@osdl.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).