linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Linus Torvalds <torvalds@osdl.org>
To: Jeff Garzik <jgarzik@pobox.com>
Cc: Matthias-Christian Ott <matthias.christian@tiscali.de>,
	Andrea Arcangeli <andrea@suse.de>, Chris Wedgwood <cw@f00f.org>,
	Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: Kernel SCM saga..
Date: Fri, 8 Apr 2005 11:47:10 -0700 (PDT)	[thread overview]
Message-ID: <Pine.LNX.4.58.0504081114220.28951@ppc970.osdl.org> (raw)
In-Reply-To: <4256C0F8.6030008@pobox.com>



On Fri, 8 Apr 2005, Jeff Garzik wrote:
> 
> Well...  it took me over 30 seconds just to 'rm -rf' the unpacked 
> tarballs of git and sparse-git, over my LAN's NFS.

Don't use NFS for development. It sucks for BK too. 

That said, normal _use_ should actually be pretty efficient even over NFS.  
It will "stat" a hell of a lot of files to do the "show-diff", but that
part you really can't avoid unless you depend on all the tools marking
their changes somewhere. Which BK does, actually, but that was pretty
painful, and means that bk needed to re-implement all the normal ops (ie
"bk patch").

What's also nice is that exactly because "git" depends on totally 
immutable files, they actually cache very well over NFS. Even if you were 
to share a database across machines (which is _not_ what git is meant to 
do, but it's certainly possible). 

So I actually suspect that if you actually _work_ with a tree in "git", 
you will find performance very good indeed. The fact that it creates a 
number of files when you pull in a new repository is a different thing.

> Granted that this sort of stuff works well with (a) rsync and (b) 
> hardlinks, but it's still punishment on the i/dcache.

Actually, it's not. Not once it is set up. Exactly because "git" doesn't
actually access those files unless it literally needs the data in one
file, and then it's always set up so that it needs either none or _all_ of
the file. There is no data sharing anywhere, so you are never in the
situation where it needs "ten bytes from file X" and "25 bytes from file
Y".

For example, if you don't have any changes in your tree, there is exactly 
_one_ file that a "show-diff" will read: the .dircache/index file. That's 
it. After that, it will "stat()" exactly the files you are tracking, and 
nothing more. It will not touch any internal "git" data AT ALL. That 
"stat" will be somewhat expensive unless your client caches stat data too, 
but that's it.

And if it turns out that you have changed a file (or even just touched it, 
so that the data is the same, but the index file can no longer guarantee 
it with just a single "stat()"), then git will open have exactly _one_ 
file (no searching, no messing around), which contains absolutely nothing 
except for the compressed (and SHA1-signed) old contents of the file. It 
obviously _has_ to do that, because in order to know whether you've 
changed it, it needs to now compare it to the original.

IOW, "git" will literally touch the minimum IO necessary, and absolutely 
minimum cache-footprint.

The fact is, when tracking the 17,000 files in the kernel directory, most
of them are never actually changed. They literally are "free". They aren't
brought into the cache by "git" - not the file itself, not the backing
store. You set up the index file once, and you never ever touch them
again.  You could put the sha1 files on a tape, for all git cares.

The one exception obviously being when you actually instantiate the git 
archive for the first time (or when you throw it away). At that time you 
do touch all of the data, but that should be the only time.

THAT is what git is good at. It optimized for the "not a lot of changes"  
things, and pretty much all the operations are O(n) in the "size of
change", not in "size of repo".

That includes even things like "give me the diff between the top of tree
and the tree 10 days ago". If you know what your head was 10 days ago,
"git" will open exactly _four_ small files for this operation (the current
"top"  commit, the commit file of ten days ago, and the two "tree" files
associated with those). It will then need to open the backing store files 
for the files that are different between the two versions, but IT WILL 
NEVER EVEN LOOK at the files that it immediately sees are the same.

And that's actually true whether we're talking about the top-of-tree or
not. If I had the kernel history in git format (I don't - I estimate that
it would be about 1.5GB - 2GB in size, and would take me about ten days to
extract from BK ;), I could do a diff between _any_ tagged version (and I
mention "tagged" only as a way to look up the commit ID - it doesn't have
to be tagged if you know it some other way) in O(n) where 'n' is the
number of files that have changed between the revisions.

Number of changesets doesn't matter. Number of files doesn't matter. The
_only_ thing that matters is the size of the change.

Btw, I don't actually have a git command to do this yet. A bit of
scripting required to do it, but it's pretty trivial: you open the two
"commit" files that are the beginning/end of the thing, you look up what
the tree state was at each point, you open up the two tree files involved,
and you ignore all entries that match.

Since the tree files are already sorted, that "ignoring matches" is
basically free (technically that's O(n) in the number of files described,
but we're talking about something that even a slow machine can do so fast
you probably can't even time it with a stop-watch). You now have the 
complete list of files that have been changed (removed, added or "exists 
in both trees, but different contents"), and you can thus trivially create 
the whole tree with opening up _only_ the indexes for those files.

Ergo: O(n) in size of change. Both in work and in disk/cache access (where
the latter is often the more important one). Absolutely _zero_ indirection
anywhere apart from the initial stage to go from "commit" to "tree", so
there's no seeking except to actually read the files once you know what
they are (and since you know them up-front and there are no dependencies
at that point, you could even tell the OS to prefetch them if you really
cared about getting minimal disk seeks).

		Linus

  reply	other threads:[~2005-04-08 18:45 UTC|newest]

Thread overview: 201+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2005-04-06 15:42 Kernel SCM saga Linus Torvalds
2005-04-06 16:00 ` Greg KH
2005-04-07 16:40   ` Rik van Riel
2005-04-08  0:53     ` Jesse Barnes
2005-04-06 16:09 ` Daniel Phillips
2005-04-06 19:07 ` Jon Smirl
2005-04-06 19:24   ` Matan Peled
2005-04-06 19:49     ` Jon Smirl
2005-04-06 20:34       ` Hua Zhong
2005-04-07  1:31       ` Christoph Lameter
2005-04-06 19:39 ` Paul P Komkoff Jr
2005-04-07  1:40   ` Martin Pool
2005-04-07  1:47     ` Jeff Garzik
2005-04-07  2:26       ` Martin Pool
2005-04-07  2:32         ` David Lang
2005-04-07  5:38           ` Martin Pool
2005-04-07 23:27             ` Linus Torvalds
2005-04-08  5:56               ` Martin Pool
2005-04-08  6:41                 ` Linus Torvalds
2005-04-08  8:38                   ` Andrea Arcangeli
2005-04-08 23:38                     ` Daniel Phillips
2005-04-09  2:54                       ` Andrea Arcangeli
2005-04-09  0:12                     ` Linus Torvalds
2005-04-09  2:27                       ` Andrea Arcangeli
2005-04-09  2:32                         ` David Lang
2005-04-09  3:08                         ` Brian Gerst
2005-04-09  3:15                           ` Andrea Arcangeli
2005-04-09  5:45                         ` Linus Torvalds
2005-04-09 22:55                           ` David S. Miller
2005-04-09 23:13                             ` Linus Torvalds
2005-04-10  0:14                               ` Chris Wedgwood
2005-04-10  1:56                                 ` Paul Jackson
2005-04-10 12:03                                   ` Ingo Molnar
2005-04-10 17:38                                     ` Paul Jackson
2005-04-10 17:46                                       ` Ingo Molnar
2005-04-10 17:56                                         ` Paul Jackson
2005-04-10  0:22                             ` Paul Jackson
2005-04-10 11:33                             ` Ingo Molnar
2005-04-10 17:55                         ` Matthias Andree
2005-04-09 16:33                       ` Roman Zippel
2005-04-09 23:31                         ` Tupshin Harper
2005-04-10 17:24                         ` Code snippet to reconstruct ancestry graph from bk repo Paul P Komkoff Jr
2005-04-10 18:19                           ` Roman Zippel
2005-04-08 16:46                   ` Kernel SCM saga Catalin Marinas
2005-04-07  8:14           ` Magnus Damm
2005-04-07  7:53       ` Zwane Mwaikambo
2005-04-07  3:35     ` Daniel Phillips
2005-04-07 15:08       ` Daniel Phillips
2005-04-07  6:36   ` bert hubert
2005-04-06 23:22 ` Jon Masters
2005-04-07  6:51 ` Paul Mackerras
2005-04-07  7:48   ` Arjan van de Ven
2005-04-07 15:10   ` Linus Torvalds
2005-04-07 17:00     ` Daniel Phillips
2005-04-07 17:38       ` Linus Torvalds
2005-04-07 17:47         ` Chris Wedgwood
2005-04-07 18:06         ` Magnus Damm
2005-04-07 18:36         ` Daniel Phillips
2005-04-08  3:35         ` Jeff Garzik
2005-04-07 19:56       ` Sam Ravnborg
2005-04-07 23:21     ` Dave Airlie
2005-04-07  7:18 ` David Woodhouse
2005-04-07  8:50   ` Andrew Morton
2005-04-07  9:20     ` Paul Mackerras
2005-04-07  9:46       ` Andrew Morton
2005-04-07 11:17         ` Paul Mackerras
2005-04-07 10:41       ` Geert Uytterhoeven
2005-04-07  9:25     ` David Woodhouse
2005-04-07  9:49       ` Andrew Morton
2005-04-07  9:55       ` Russell King
2005-04-07 10:11         ` David Woodhouse
2005-04-07  9:40     ` David Vrabel
2005-04-07  9:24   ` Sergei Organov
2005-04-07 10:30     ` Matthias Andree
2005-04-07 10:54       ` Andrew Walrond
2005-04-09 16:17       ` David Roundy
2005-04-10  9:24         ` Giuseppe Bilotta
2005-04-10 13:51           ` David Roundy
2005-04-07 15:32   ` Linus Torvalds
2005-04-07 17:09     ` Daniel Phillips
2005-04-07 17:10     ` Al Viro
2005-04-07 17:47       ` Linus Torvalds
2005-04-07 18:04         ` Jörn Engel
2005-04-07 18:27           ` Daniel Phillips
2005-04-07 20:54           ` Arjan van de Ven
2005-04-08  3:41         ` Jeff Garzik
2005-04-07 17:52       ` Bartlomiej Zolnierkiewicz
2005-04-07 17:54       ` Daniel Phillips
2005-04-07 18:13         ` Dmitry Yusupov
2005-04-07 18:29           ` Daniel Phillips
2005-04-10 22:33             ` Troy Benjegerdes
2005-04-11  0:00               ` Christian Parpart
2005-04-08 17:24         ` Jon Masters
2005-04-08 22:05           ` Daniel Phillips
2005-04-08 22:52     ` Roman Zippel
2005-04-08 23:46       ` Tupshin Harper
2005-04-09  1:00         ` Roman Zippel
2005-04-09  1:23           ` Tupshin Harper
2005-04-09 16:52       ` Eric D. Mudama
2005-04-09 17:40         ` Roman Zippel
2005-04-09 18:56           ` Ray Lee
2005-04-07  7:44 ` Jan Hudec
2005-04-08  6:14   ` Matthias Urlichs
2005-04-09  1:01   ` Marcin Dalecki
2005-04-09  8:32     ` Jan Hudec
2005-04-11  2:26     ` Miles Bader
2005-04-11  2:56       ` Marcin Dalecki
2005-04-11  6:36         ` Jan Hudec
2005-04-07 10:56 ` Andrew Walrond
2005-04-08  0:57 ` Ian Wienand
2005-04-08  4:13 ` Chris Wedgwood
2005-04-08  4:42   ` Linus Torvalds
2005-04-08  5:04     ` Chris Wedgwood
2005-04-08  5:14       ` H. Peter Anvin
2005-04-08  7:05         ` Rogan Dawes
2005-04-08  7:21           ` Daniel Phillips
2005-04-08  7:49             ` H. Peter Anvin
2005-04-08  7:14     ` Andrea Arcangeli
2005-04-08 12:02       ` Matthias Andree
2005-04-08 12:21         ` Florian Weimer
2005-04-08 14:26       ` Linus Torvalds
2005-04-08 16:15         ` Matthias-Christian Ott
2005-04-08 17:14           ` Linus Torvalds
2005-04-08 17:15             ` Chris Wedgwood
2005-04-08 17:46               ` Linus Torvalds
2005-04-08 18:05                 ` Chris Wedgwood
2005-04-08 19:03                   ` Linus Torvalds
2005-04-08 19:16                     ` Chris Wedgwood
2005-04-08 19:38                       ` Florian Weimer
2005-04-08 19:48                         ` Chris Wedgwood
2005-04-08 19:39                       ` Linus Torvalds
2005-04-08 20:11                         ` Uncached stat performace [ Was: Re: Kernel SCM saga.. ] Ragnar Kjørstad
2005-04-08 20:14                           ` Chris Wedgwood
2005-04-08 20:50                       ` Kernel SCM saga Luck, Tony
2005-04-08 21:27                         ` Linus Torvalds
2005-04-09 17:14                           ` Roman Zippel
2005-04-09  7:20                     ` Willy Tarreau
2005-04-09 15:15                     ` Paul Jackson
2005-04-08 17:25             ` Matthias-Christian Ott
2005-04-08 18:14               ` Linus Torvalds
2005-04-08 18:28                 ` Jon Smirl
2005-04-08 18:58                   ` Florian Weimer
2005-04-09  1:11                   ` Marcin Dalecki
2005-04-09  1:50                     ` David Lang
2005-04-09 22:12                       ` Florian Weimer
2005-04-08 19:16                 ` Matthias-Christian Ott
2005-04-08 19:32                   ` Linus Torvalds
2005-04-08 19:44                     ` Matthias-Christian Ott
2005-04-09  1:09                 ` Marcin Dalecki
2005-04-08 17:35             ` Jeff Garzik
2005-04-08 18:47               ` Linus Torvalds [this message]
2005-04-08 18:56                 ` Chris Wedgwood
2005-04-09  7:37                   ` Willy Tarreau
2005-04-09  7:47                     ` Neil Brown
2005-04-09  8:00                       ` Willy Tarreau
2005-04-09  9:34                         ` Neil Brown
2005-04-09 15:40                 ` Paul Jackson
2005-04-09 16:16                   ` Linus Torvalds
2005-04-09 17:15                     ` Paul Jackson
2005-04-09 17:35                     ` Paul Jackson
2005-04-09  1:04             ` Marcin Dalecki
2005-04-09 15:42               ` Paul Jackson
2005-04-09 18:45                 ` Marcin Dalecki
2005-04-09  1:00           ` Marcin Dalecki
2005-04-09  1:09             ` Chris Wedgwood
2005-04-09  1:21               ` Marcin Dalecki
2005-04-08  7:17     ` ross
2005-04-08 15:50       ` Linus Torvalds
2005-04-09  2:53         ` Petr Baudis
2005-04-09  7:08           ` Randy.Dunlap
2005-04-09 18:06             ` [PATCH] " Petr Baudis
2005-04-10  1:01           ` Phillip Lougher
2005-04-10  1:42             ` Petr Baudis
2005-04-10  1:57               ` Phillip Lougher
2005-04-09 15:50         ` Paul Jackson
2005-04-09 16:26           ` Linus Torvalds
2005-04-09 17:08             ` Paul Jackson
2005-04-10  3:41             ` Paul Jackson
2005-04-10  8:39             ` David Lang
2005-04-10  9:40               ` Junio C Hamano
2005-04-10 16:46                 ` Bill Davidsen
2005-04-10 17:50                   ` Paul Jackson
2005-04-12 23:20                     ` Pavel Machek
2005-04-08  7:34     ` Marcel Lanz
2005-04-08  9:23       ` Geert Uytterhoeven
2005-04-08  8:38     ` Matt Johnston
2005-04-12  7:14     ` Kernel SCM saga.. (bk license?) Kedar Sovani
2005-04-12  9:34       ` Catalin Marinas
2005-04-13  4:04       ` Ricky Beam
2005-04-08 11:42   ` Kernel SCM saga Catalin Marinas
     [not found] <Pine.LNX.4.58.0504060800280.2215 () ppc970 ! osdl ! org>
2005-04-06 21:13 ` kfogel
2005-04-06 22:39   ` Jeff Garzik
2005-04-09  1:00   ` Marcin Dalecki
2005-04-08 22:27 Rajesh Venkatasubramanian
2005-04-08 23:29 ` Linus Torvalds
2005-04-09  0:29   ` Linus Torvalds
2005-04-09 16:20   ` Paul Jackson
2005-04-09  4:06 Walter Landry
2005-04-09 11:02 Samium Gromoff
2005-04-09 11:29 Samium Gromoff
2005-04-10  4:20 Albert Cahalan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Pine.LNX.4.58.0504081114220.28951@ppc970.osdl.org \
    --to=torvalds@osdl.org \
    --cc=andrea@suse.de \
    --cc=cw@f00f.org \
    --cc=jgarzik@pobox.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=matthias.christian@tiscali.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).