* Re: Bug#569505: git-core: 'git add' corrupts repository if the working directory is modified as it runs
       [not found] <20100211234753.22574.48799.reportbug@gibbs.hungrycats.org>
@ 2010-02-12  0:27 ` Jonathan Nieder
  2010-02-12  1:23   ` Zygo Blaxell
  2010-02-14  1:36   ` mmap with MAP_PRIVATE is useless (was Re: Bug#569505: git-core: 'git add' corrupts repository if the working directory is modified as it runs) Paolo Bonzini
  0 siblings, 2 replies; 84+ messages in thread
From: Jonathan Nieder @ 2010-02-12  0:27 UTC (permalink / raw)
  To: git, 569505-forwarded; +Cc: Zygo Blaxell

Hi gitsters,

Zygo Blaxell reported through http://bugs.debian.org/569505 that ‘git
update-index’ has some issues when the files it is adding change under
its feet:

My thoughts:

 - Low-hanging fruit: it should be possible for update-index to check
   the stat information to see if the file has changed between when it
   first opens it and when it finishes (a rough sketch follows this list).

 - Zygo reported that ‘git gc’ didn’t notice the problem.
   Should ‘git gc’ imply a ‘git fsck --no-full’?

 - Recovering from this kind of mistake in early history is indeed
   hard.  Any tricks for doing this?  Maybe fast-export | fast-import
   can do something with this, or maybe replace + filter-branch once
   it learns to be a little smarter.

 - How do checkout-index and cat-file blob react to a blob whose
   contents do not reflect its object name?  Are they behaving
   appropriately?  I would want cat-file blob to be able to retrieve
   such a broken blob’s contents, checkout-index not so much.
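
For illustration, a rough sketch of that stat re-check (this is not
git's code, and which stat fields are worth comparing is an open
question; the ones below are only examples):

#include <sys/stat.h>

/*
 * Illustration only -- not git's actual code.  Take a stat before the
 * file is read and another after hashing finishes; treat any
 * difference as "the file changed under us".
 */
static int changed_while_hashing(const struct stat *before, int fd)
{
	struct stat after;

	if (fstat(fd, &after) < 0)
		return 1;	/* be conservative: treat errors as "changed" */
	return before->st_size  != after.st_size  ||
	       before->st_mtime != after.st_mtime ||
	       before->st_ctime != after.st_ctime ||
	       before->st_ino   != after.st_ino;
}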

I imagine there are other things to learn, too.  The report and
reproduction recipe follow.

Thoughts?
Jonathan

Package: git-core
Version: 1:1.6.6.1-1
Severity: important

'git add' will happily corrupt a git repo if it is run while files in
the working directory are being modified.  A blob is added to the index
with contents that do not match its SHA1 hash.  If the index is then
committed, the corrupt blob cannot be checked out (or is checked out
with incorrect contents, depending on which tool you use to try to get
the file out of git) in the future.

Surprisingly, it's possible to clone, fetch, push, pull, and sometimes
even gc the corrupted repo several times before anyone notices the
corruption.  If the affected commit is included in a merge with history
from other git users, the only way to fix it is to rebase (or come up
with a blob whose contents match the affected SHA1 hash somehow).

It is usually possible to retrieve data committed before the corruption
by simply checking out an earlier tree in the affected branch's history.

The following shell code demonstrates this problem.  It runs a thread
which continuously modifies a file, and another thread that does
'git commit -am' and 'git fsck' in a continuous loop until corruption
is detected.  This might take up to 20 seconds on a slow machine.

	#!/bin/sh
	set -e

	# Create an empty git repo in /tmp/git-test
	rm -fr /tmp/git-test
	mkdir /tmp/git-test
	cd /tmp/git-test
	git init

	# Create a file named foo and add it to the repo
	touch foo
	git add foo

	# Thread 1:  continuously modify foo:
	while echo -n .; do
		dd if=/dev/urandom of=foo count=1024 bs=1k conv=notrunc >/dev/null 2>&1
	done &

	# Thread 2:  loop until the repo is corrupted
	while git fsck; do
		# Note the implied 'git add' in 'commit -a'
		# It will do the same with explicit 'git add'
		git commit -a -m'Test'
	done

	# Kill thread 1, we don't need it any more
	kill $!

	# Success!  Well, sort of.
	echo Repository is corrupted.  Have a nice day.

I discovered this bug accidentally when I was using inotifywait (from
the inotify-tools package) to automatically commit snapshots of a working
directory triggered by write events.

I tested this with a number of kernel versions from 2.6.27 to 2.6.31.
All of them reproduced this problem.  I checked this because strace
shows 'git add' doing a mmap(..., MAP_PRIVATE, ...) of its input file,
so I was wondering if there might have been a recent change in mmap()
behavior in either git or the kernel.

git 1.5.6.5 has this problem too, but some of the error messages are
different, and the problem sometimes manifests itself as silent corruption
of other objects (e.g. if someone checks out a corrupt tree and then does
'git add -u' or 'git commit -a', they will include the corrupt data in
their commit).


* Re: Bug#569505: git-core: 'git add' corrupts repository if the working directory is modified as it runs
  2010-02-12  0:27 ` Bug#569505: git-core: 'git add' corrupts repository if the working directory is modified as it runs Jonathan Nieder
@ 2010-02-12  1:23   ` Zygo Blaxell
  2010-02-13 12:12     ` Jonathan Nieder
  2010-02-14  1:36   ` mmap with MAP_PRIVATE is useless (was Re: Bug#569505: git-core: 'git add' corrupts repository if the working directory is modified as it runs) Paolo Bonzini
  1 sibling, 1 reply; 84+ messages in thread
From: Zygo Blaxell @ 2010-02-12  1:23 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: git, 569505-forwarded

On Thu, Feb 11, 2010 at 06:27:41PM -0600, Jonathan Nieder wrote:
> Zygo Blaxell reported through http://bugs.debian.org/569505 that ‘git
> update-index’ has some issues when the files it is adding change under
> its feet:
> 
> My thoughts:
> 
>  - Low-hanging fruit: it should be possible for update-index to check
>    the stat information to see if the file has changed between when it
>    first opens it and when it finishes.

I don't think this is a good idea.  stat() is very coarse-grained, and
provides accuracy of only a second on a lot of file systems where git
working directories might be found.  If you run the test script on an
ext3 filesystem on a modern machine the stat() data won't change at all
even though the file contents change completely many times.

What would be a good idea is to make sure that the code that copies a
file into the index and calculates its hash does both in a single pass
over the same input data.  That might require replacing a simple mmap()
of the input file with a read-hash-copy loop.
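
For illustration, a rough sketch of such a single-pass loop (not git's
code; OpenSSL's SHA-1 is used here only for the example, git has its
own SHA-1 wrappers).  The point is that the bytes fed to the hash are
exactly the bytes that would then be handed to zlib and written out:

#include <openssl/sha.h>
#include <stdio.h>
#include <unistd.h>

static int hash_blob_fd(int fd, off_t size,
			unsigned char sha1[SHA_DIGEST_LENGTH])
{
	SHA_CTX ctx;
	char hdr[32], buf[8192];
	off_t remaining = size;
	/* a git blob is hashed as "blob <size>\0" followed by the contents */
	int hdrlen = snprintf(hdr, sizeof(hdr), "blob %llu",
			      (unsigned long long)size) + 1;

	SHA1_Init(&ctx);
	SHA1_Update(&ctx, hdr, hdrlen);
	while (remaining > 0) {
		ssize_t n = read(fd, buf, sizeof(buf));
		if (n <= 0)
			return -1;	/* read error, or the file shrank */
		if (n > remaining)
			n = remaining;	/* the file grew: ignore the extra bytes */
		SHA1_Update(&ctx, buf, n);
		/* ...feed the same buf/n to zlib here, so hash and copy agree */
		remaining -= n;
	}
	SHA1_Final(sha1, &ctx);
	return 0;
}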


* Re: 'git add' corrupts repository if the working directory is modified as it runs
  2010-02-12  1:23   ` Zygo Blaxell
@ 2010-02-13 12:12     ` Jonathan Nieder
  2010-02-13 13:39       ` Ilari Liusvaara
  0 siblings, 1 reply; 84+ messages in thread
From: Jonathan Nieder @ 2010-02-13 12:12 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: git

Zygo Blaxell wrote:
> On Thu, Feb 11, 2010 at 06:27:41PM -0600, Jonathan Nieder wrote:

>>  - Low-hanging fruit: it should be possible for update-index to check
>>    the stat information to see if the file has changed between when it
>>    first opens it and when it finishes.
>
> I don't think this is a good idea.  stat() is very coarse-grained

You’re probably right.  For many file types, st_size is likely to
change (in this way your script is testing something unusual), but
that is no excuse to behave poorly when it doesn’t.

> What would be a good idea is to make sure that the code that copies a
> file into the index and calculates its hash does both in a single pass
> over the same input data.  That might require replacing a simple mmap()
> of the input file with a read-hash-copy loop.

This leaves me nervous about speed.  Consider the following simple
case: suppose the file to be added is already in the object
repository somewhere (maybe the user has tried this code before, or
a file was renamed with 'mv', or a patch applied with 'patch', or an
unmount and remount dirtied the stat information).

With the current code, write_sha1_file() will hash the file, notice
that object is already in .git/objects, and return.  With a
read-hash-copy loop, git would have to store a (compressed or
uncompressed) copy of the file somewhere in the meantime.

But I’d be happy to see code appear that proves me wrong. ;-)  One
simple benchmark to try is running the git test suite.

Cheers,
Jonathan


* Re: 'git add' corrupts repository if the working directory is modified as it runs
  2010-02-13 12:12     ` Jonathan Nieder
@ 2010-02-13 13:39       ` Ilari Liusvaara
  2010-02-13 14:39         ` Thomas Rast
  0 siblings, 1 reply; 84+ messages in thread
From: Ilari Liusvaara @ 2010-02-13 13:39 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: Zygo Blaxell, git

On Sat, Feb 13, 2010 at 06:12:38AM -0600, Jonathan Nieder wrote:
> 
> This leaves me nervous about speed.  Consider the following simple
> case: suppose the file to be added is already in the object
> repository somewhere (maybe the user has tried this code before, or
> a file was renamed with 'mv', or a patch applied with 'patch', or an
> unmount and remount dirtied the stat information).
> 
> With the current code, write_sha1_file() will hash the file, notice
> that object is already in .git/objects, and return.  With a
> read-hash-copy loop, git would have to store a (compressed or
> uncompressed) copy of the file somewhere in the meantime.

It could be done by first reading the file and computing hash,
if the hash matches existing object, return that hash. Otherwise
read the file for object write, hashing it again and use that value
for object ID.

This would require two hash computations in the non-existing case,
but SHA-1 is pretty fast. If the first computation produces a match,
then it doesn't matter whether the file is being modified, since adding
and modifying in parallel results in undefined contents anyway (it just
should not corrupt the repository).
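
A control-flow sketch of this idea (not a real patch): has_sha1_file()
is git's existing "do we already have this object?" check, while
hash_blob_in_one_pass() and write_blob_in_one_pass() are hypothetical
one-pass helpers along the lines discussed above:

#include <unistd.h>

int hash_blob_in_one_pass(int fd, unsigned char *sha1);	/* hypothetical */
int write_blob_in_one_pass(int fd, unsigned char *sha1);	/* hypothetical */
int has_sha1_file(const unsigned char *sha1);	/* git's existing check */

static int index_fd_twopass(unsigned char *sha1, int fd)
{
	/* First pass: only hash.  If the object already exists, the file
	 * contents no longer matter; reuse the known object ID. */
	if (hash_blob_in_one_pass(fd, sha1) < 0)
		return -1;
	if (has_sha1_file(sha1))
		return 0;

	/* Second pass: rewind, read again, and store whatever is read
	 * now under the hash of *those* bytes.  Even if the file changed
	 * between the passes, the stored object matches its ID, so the
	 * repository stays consistent. */
	if (lseek(fd, 0, SEEK_SET) < 0)
		return -1;
	return write_blob_in_one_pass(fd, sha1);
}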

-Ilari


* Re: 'git add' corrupts repository if the working directory is modified as it runs
  2010-02-13 13:39       ` Ilari Liusvaara
@ 2010-02-13 14:39         ` Thomas Rast
  2010-02-13 16:29           ` Ilari Liusvaara
  0 siblings, 1 reply; 84+ messages in thread
From: Thomas Rast @ 2010-02-13 14:39 UTC (permalink / raw)
  To: Ilari Liusvaara; +Cc: Jonathan Nieder, Zygo Blaxell, git

On Saturday 13 February 2010 14:39:52 Ilari Liusvaara wrote:
> On Sat, Feb 13, 2010 at 06:12:38AM -0600, Jonathan Nieder wrote:
> > 
> > With the current code, write_sha1_file() will hash the file, notice
> > that object is already in .git/objects, and return.  With a
> > read-hash-copy loop, git would have to store a (compressed or
> > uncompressed) copy of the file somewhere in the meantime.
> 
> It could be done by first reading the file and computing hash,
> if the hash matches existing object, return that hash. Otherwise
> read the file for object write, hashing it again and use that value
> for object ID.

That is still racy.  The real problem is that the file is mmap()ed,
and git then first computes the SHA1 of that buffer, next it
compresses it.[*]

Note the last sentence in the following snippet from mmap(2):

  MAP_PRIVATE
           Create a private copy-on-write mapping.  Updates to the map-
           ping are not visible to other  processes  mapping  the  same
           file,  and  are  not carried through to the underlying file.
           It is unspecified whether changes made to the file after the
           mmap() call are visible in the mapped region.

This is racy despite the use of MAP_PRIVATE: the mapped contents can
change at any time.

AFAICS there are only two possible solutions:

* Copy the file (possibly block-by-block) as we go, to make sure that
  the data we SHA1 is the same we compress.

* Unpack and re-hash the compressed data to verify that the SHA1 is
  correct (a control-flow sketch follows at the end of this message).
  In case of failure either retry (but you could have to do this
  infinitely often if the user just hates you!) or abort.

(Of course, in neither case does the user have any sort of guarantee
about what data ended up in the repository, but he never had that, we
only try to ensure repo consistency.)


[*] The "do we have this" check actually happens before the
compression, and that arm is thus race-free.
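
A control-flow sketch of the second option, verify-after-write with a
bounded number of retries (illustration only; write_blob_object() and
verify_loose_object() are hypothetical helpers, not existing git
functions):

/* hypothetical: store a loose object from the file at `path' and report
 * the ID it was stored under */
int write_blob_object(const char *path, unsigned char *sha1);
/* hypothetical: re-read the stored object, uncompress it and check that
 * its contents still hash to `sha1' */
int verify_loose_object(const unsigned char *sha1);

static int add_and_verify(const char *path, unsigned char *sha1)
{
	int attempt;

	for (attempt = 0; attempt < 3; attempt++) {
		if (write_blob_object(path, sha1) < 0)
			return -1;
		/* the file may have changed while it was hashed and
		 * compressed; unpacking and re-hashing the stored copy
		 * catches exactly that */
		if (verify_loose_object(sha1) == 0)
			return 0;
		/* mismatch: the race hit us; a real implementation would
		 * also delete the bad object before retrying */
	}
	return -1;	/* give up instead of retrying forever */
}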

-- 
Thomas Rast
trast@{inf,student}.ethz.ch


* Re: 'git add' corrupts repository if the working directory is modified as it runs
  2010-02-13 14:39         ` Thomas Rast
@ 2010-02-13 16:29           ` Ilari Liusvaara
  2010-02-13 22:09             ` Dmitry Potapov
  0 siblings, 1 reply; 84+ messages in thread
From: Ilari Liusvaara @ 2010-02-13 16:29 UTC (permalink / raw)
  To: Thomas Rast; +Cc: Jonathan Nieder, Zygo Blaxell, git

On Sat, Feb 13, 2010 at 03:39:53PM +0100, Thomas Rast wrote:
> On Saturday 13 February 2010 14:39:52 Ilari Liusvaara wrote:
> > On Sat, Feb 13, 2010 at 06:12:38AM -0600, Jonathan Nieder wrote:
> > > 
> > > With the current code, write_sha1_file() will hash the file, notice
> > > that object is already in .git/objects, and return.  With a
> > > read-hash-copy loop, git would have to store a (compressed or
> > > uncompressed) copy of the file somewhere in the meantime.
> > 
> > It could be done by first reading the file and computing hash,
> > if the hash matches existing object, return that hash. Otherwise
> > read the file for object write, hashing it again and use that value
> > for object ID.
> 
> That is still racy.  The real problem is that the file is mmap()ed,
> and git then first computes the SHA1 of that buffer, next it
> compresses it.[*]

Hmm... One needs to copy the data a block at a time into a temporary
buffer and use that for feeding zlib and SHA-1. That ensures that
whatever SHA-1 hashes and zlib compresses is consistent.

-Ilari


* Re: 'git add' corrupts repository if the working directory is  modified as it runs
  2010-02-13 16:29           ` Ilari Liusvaara
@ 2010-02-13 22:09             ` Dmitry Potapov
  2010-02-13 22:37               ` Zygo Blaxell
  0 siblings, 1 reply; 84+ messages in thread
From: Dmitry Potapov @ 2010-02-13 22:09 UTC (permalink / raw)
  To: Ilari Liusvaara; +Cc: Thomas Rast, Jonathan Nieder, Zygo Blaxell, git

On Sat, Feb 13, 2010 at 7:29 PM, Ilari Liusvaara
<ilari.liusvaara@elisanet.fi> wrote:
>
> Hmm... One needs to copy the data a block at a time into a temporary
> buffer and use that for feeding zlib and SHA-1. That ensures that
> whatever SHA-1 hashes and zlib compresses is consistent.

If you want this then just compile Git with NO_MMAP = YesPlease
It should solve the described problem.

Dmitry


* Re: 'git add' corrupts repository if the working directory is modified as it runs
  2010-02-13 22:09             ` Dmitry Potapov
@ 2010-02-13 22:37               ` Zygo Blaxell
  2010-02-14  1:18                 ` [PATCH] don't use mmap() to hash files Dmitry Potapov
  0 siblings, 1 reply; 84+ messages in thread
From: Zygo Blaxell @ 2010-02-13 22:37 UTC (permalink / raw)
  To: Dmitry Potapov; +Cc: Ilari Liusvaara, Thomas Rast, Jonathan Nieder, git

On Sun, Feb 14, 2010 at 01:09:23AM +0300, Dmitry Potapov wrote:
> On Sat, Feb 13, 2010 at 7:29 PM, Ilari Liusvaara
> <ilari.liusvaara@elisanet.fi> wrote:
> > Hmm... One needs to copy the data a block at a time into a temporary
> > buffer and use that for feeding zlib and SHA-1. That ensures that
> > whatever SHA-1 hashes and zlib compresses is consistent.
> 
> If you want this then just compile Git with NO_MMAP = YesPlease
> It should solve the described problem.

Doesn't that also turn off mmap in other places where it's harmless or
even beneficial?  Otherwise, why use mmap in Git at all?  It also doesn't
solve the problem in cases when mmap support is compiled in.

There is a performance-robustness trade-off here.  If we do an extra
copy of the file data, we always get consistent repo contents but lose
speed (not very much speed, since sha1 and zlib are much slower than
a memory copy).  If we don't, we still get consistent repo contents
if--and only if--the files never happen to be modified during a git add.
I can live with that, as long as the limitation is documented and there's
a config switch somewhere that I can turn on for cases when I can't make
such assurances.

I imagine similar reasoning led to the existence of the
receive.fsckObjects option.  Personally, checking all objects fetched
from remote repos is not a feature I'd ever want to be able to turn off;
however, it's easy to think of cases where integrity matters less than
speed (e.g. build systems doing clone-build-destroy cycles from a trusted,
reliable server over a similarly trusted network).


* [PATCH] don't use mmap() to hash files
  2010-02-13 22:37               ` Zygo Blaxell
@ 2010-02-14  1:18                 ` Dmitry Potapov
  2010-02-14  1:37                   ` Junio C Hamano
                                     ` (3 more replies)
  0 siblings, 4 replies; 84+ messages in thread
From: Dmitry Potapov @ 2010-02-14  1:18 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Ilari Liusvaara, Thomas Rast, Jonathan Nieder, git

If a mmapped file is changed by another program during git-add, it
causes repository corruption. Disabling mmap() in index_fd() does not
have any negative impact on the overall speed of Git. In fact, it makes
git hash-object work slightly faster. Here is the best result before
and after the patch, based on 5 runs on the Linux kernel repository:

Before:

$ git ls-files | time git hash-object --stdin-path > /dev/null
2.15user 0.36system 0:02.52elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+103248minor)pagefaults 0swaps

After:

$ git ls-files | time ../git/git hash-object --stdin-path > /dev/null
2.09user 0.33system 0:02.42elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+1073minor)pagefaults 0swaps

Signed-off-by: Dmitry Potapov <dpotapov@gmail.com>
---

I think more people should test this change to see its impact on
performance. For me, it was positive. Here are my results:

Using mmap (git version 1.7.0)

2.18user 0.33system 0:02.52elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+103248minor)pagefaults 0swaps
$ git ls-files | time git hash-object --stdin-path > /dev/null
2.23user 0.28system 0:02.53elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+103249minor)pagefaults 0swaps
$ git ls-files | time git hash-object --stdin-path > /dev/null
2.20user 0.31system 0:02.52elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+103248minor)pagefaults 0swaps
$ git ls-files | time git hash-object --stdin-path > /dev/null
2.21user 0.30system 0:02.51elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+103248minor)pagefaults 0swaps
$ git ls-files | time git hash-object --stdin-path > /dev/null
2.15user 0.36system 0:02.52elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+103248minor)pagefaults 0swaps

Using read() instead of mmap() (1.7.0 with the above patch)

$ git ls-files | time ../git/git hash-object --stdin-path > /dev/null
2.19user 0.24system 0:02.42elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+1073minor)pagefaults 0swaps
$ git ls-files | time ../git/git hash-object --stdin-path > /dev/null
2.15user 0.26system 0:02.42elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+1073minor)pagefaults 0swaps
$ git ls-files | time ../git/git hash-object --stdin-path > /dev/null
2.18user 0.24system 0:02.42elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+1073minor)pagefaults 0swaps
$ git ls-files | time ../git/git hash-object --stdin-path > /dev/null
2.18user 0.25system 0:02.42elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+1073minor)pagefaults 0swaps
$ git ls-files | time ../git/git hash-object --stdin-path > /dev/null
2.09user 0.33system 0:02.42elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+1073minor)pagefaults 0swaps


 sha1_file.c |   22 +++++++---------------
 1 files changed, 7 insertions(+), 15 deletions(-)

diff --git a/sha1_file.c b/sha1_file.c
index 657825e..83f82a2 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2438,22 +2438,14 @@ int index_fd(unsigned char *sha1, int fd, struct stat *st, int write_object,
 	     enum object_type type, const char *path)
 {
 	int ret;
-	size_t size = xsize_t(st->st_size);
+	struct strbuf sbuf = STRBUF_INIT;
 
-	if (!S_ISREG(st->st_mode)) {
-		struct strbuf sbuf = STRBUF_INIT;
-		if (strbuf_read(&sbuf, fd, 4096) >= 0)
-			ret = index_mem(sha1, sbuf.buf, sbuf.len, write_object,
-					type, path);
-		else
-			ret = -1;
-		strbuf_release(&sbuf);
-	} else if (size) {
-		void *buf = xmmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
-		ret = index_mem(sha1, buf, size, write_object, type, path);
-		munmap(buf, size);
-	} else
-		ret = index_mem(sha1, NULL, size, write_object, type, path);
+	if (strbuf_read(&sbuf, fd, 4096) >= 0)
+		ret = index_mem(sha1, sbuf.buf, sbuf.len, write_object,
+				type, path);
+	else
+		ret = -1;
+	strbuf_release(&sbuf);
 	close(fd);
 	return ret;
 }
-- 
1.7.0


* mmap with MAP_PRIVATE is useless (was Re: Bug#569505: git-core: 'git add' corrupts repository if the working directory is modified as it runs)
  2010-02-12  0:27 ` Bug#569505: git-core: 'git add' corrupts repository if the working directory is modified as it runs Jonathan Nieder
  2010-02-12  1:23   ` Zygo Blaxell
@ 2010-02-14  1:36   ` Paolo Bonzini
  2010-02-14  1:53     ` mmap with MAP_PRIVATE is useless Junio C Hamano
  1 sibling, 1 reply; 84+ messages in thread
From: Paolo Bonzini @ 2010-02-14  1:36 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: git, 569505-forwarded, Zygo Blaxell


> I tested this with a number of kernel versions from 2.6.27 to 2.6.31.
> All of them reproduced this problem.  I checked this because strace
> shows 'git add' doing a mmap(..., MAP_PRIVATE, ...) of its input file,
> so I was wondering if there might have been a recent change in mmap()
> behavior in either git or the kernel.

From mmap(2): "it is unspecified whether changes made to the file after
the mmap() call are visible in the mapped region".

You may think that doing a dummy "*p = *p" every 4096 bytes `fixes' it 
(because it causes copy-on-write for every page) but even that does not 
work because you can get a SIGBUS if the file is truncated while you 
have it mapped privately (e.g. by fopen ("file", "w") or open with O_TRUNC).

Testcase:

#include <sys/mman.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main ()
{
   system ("echo foo > file.test");
   int f = open ("file.test", O_RDONLY);
   char *p = mmap (NULL, 4096, PROT_READ, MAP_PRIVATE, f, 0);
   close (f);
   printf ("%s", p);       // prints "foo\n"

   f = open ("file.test", O_RDWR | O_CREAT | O_TRUNC, 0666);
   write (f, "bar\n", 4);  // comment out and the next printf gets SIGBUS
   printf ("%s", p);       // prints "bar\n" despite MAP_PRIVATE
   return 0;
}

This means that MAP_PRIVATE is utterly useless.

Paolo


* Re: [PATCH] don't use mmap() to hash files
  2010-02-14  1:18                 ` [PATCH] don't use mmap() to hash files Dmitry Potapov
@ 2010-02-14  1:37                   ` Junio C Hamano
  2010-02-14  2:18                     ` Dmitry Potapov
  2010-02-14  1:53                   ` Johannes Schindelin
                                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 84+ messages in thread
From: Junio C Hamano @ 2010-02-14  1:37 UTC (permalink / raw)
  To: Dmitry Potapov
  Cc: Zygo Blaxell, Ilari Liusvaara, Thomas Rast, Jonathan Nieder, git

Dmitry Potapov <dpotapov@gmail.com> writes:

> If a mmapped file is changed by another program during git-add, it
> causes repository corruption. Disabling mmap() in index_fd() does not
> have any negative impact on the overall speed of Git. In fact, it
> makes git hash-object work slightly faster....
> ...
> I think more people should test this change to see its impact on
> performance. For me, it was positive. Here are my results:

I wasn't particularly impressed by the original problem description, but
this is a very interesting result.


* Re: mmap with MAP_PRIVATE is useless
  2010-02-14  1:36   ` mmap with MAP_PRIVATE is useless (was Re: Bug#569505: git-core: 'git add' corrupts repository if the working directory is modified as it runs) Paolo Bonzini
@ 2010-02-14  1:53     ` Junio C Hamano
  2010-02-14  2:11       ` Paolo Bonzini
  0 siblings, 1 reply; 84+ messages in thread
From: Junio C Hamano @ 2010-02-14  1:53 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Jonathan Nieder, git, 569505-forwarded, Zygo Blaxell

Paolo Bonzini <bonzini@gnu.org> writes:

> This means that MAP_PRIVATE is utterly useless.

I do not think we ever used MAP_PRIVATE in order to protect ourselves from
uncontrolled changes made by the outside world in the first place.  Back
when most of these mmap calls were written by Linus and myself, we weren't
interested in using MAP_PRIVATE, or any other trick for that matter, to
deal with the case where the user tells git to go index a file, and then
mucks with the file before git finishes and gives back control.

We do use mmap in read-write mode when reading from the index file, and we
use MAP_PRIVATE to protect the outside world from our writing into the
mapped memory.  As far as I know that is the only mmap for which
MAP_PRIVATE matters in the core git codebase.
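
A minimal standalone illustration of that point (not git code;
"some-file" is just a placeholder): the mapping is writable, but
because it is MAP_PRIVATE the stores stay in our private copy and never
reach the file.

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	struct stat st;
	int fd = open("some-file", O_RDONLY);

	if (fd < 0 || fstat(fd, &st) < 0 || st.st_size == 0)
		return 1;
	/* writable mapping, but MAP_PRIVATE: our stores are copy-on-write
	 * and are never carried through to "some-file" on disk */
	char *p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE, fd, 0);
	if (p == MAP_FAILED)
		return 1;
	p[0] = '#';		/* modifies only our private copy */
	printf("first byte in memory: %c\n", p[0]);
	munmap(p, st.st_size);
	close(fd);
	return 0;
}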

Our calls to mmap() almost all have MAP_PRIVATE, even for read-only mmap,
but that is more or less from inertia, aka "an existing call to mmap is
with these options, I'll add another call imitating that".


* Re: [PATCH] don't use mmap() to hash files
  2010-02-14  1:18                 ` [PATCH] don't use mmap() to hash files Dmitry Potapov
  2010-02-14  1:37                   ` Junio C Hamano
@ 2010-02-14  1:53                   ` Johannes Schindelin
  2010-02-14  2:00                     ` Junio C Hamano
  2010-02-14  2:42                     ` Dmitry Potapov
  2010-02-14  3:05                   ` [PATCH v2] " Dmitry Potapov
  2010-02-18  1:16                   ` [PATCH] Teach "git add" and friends to be paranoid Junio C Hamano
  3 siblings, 2 replies; 84+ messages in thread
From: Johannes Schindelin @ 2010-02-14  1:53 UTC (permalink / raw)
  To: Dmitry Potapov
  Cc: Zygo Blaxell, Ilari Liusvaara, Thomas Rast, Jonathan Nieder, git

Hi,

On Sun, 14 Feb 2010, Dmitry Potapov wrote:

> +	if (strbuf_read(&sbuf, fd, 4096) >= 0)

How certain are you at this point that all of fd's contents fit into your 
memory?

And even if you could be certain, a hint is missing that strbuf_read(), 
its name notwithstanding, does not read NUL-terminated strings. Oh, and 
the size is just a hint for the initial size, and it reads until EOF. That 
has to be said in the commit message.

Ciao,
Dscho


* Re: [PATCH] don't use mmap() to hash files
  2010-02-14  1:53                   ` Johannes Schindelin
@ 2010-02-14  2:00                     ` Junio C Hamano
  2010-02-14  2:42                     ` Dmitry Potapov
  1 sibling, 0 replies; 84+ messages in thread
From: Junio C Hamano @ 2010-02-14  2:00 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Dmitry Potapov, Zygo Blaxell, Ilari Liusvaara, Thomas Rast,
	Jonathan Nieder, git

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> And even if you could be certain, a hint is missing that strbuf_read(), 
> its name notwithstanding, does not read NUL-terminated strings. Oh, and 
> the size is just a hint for the initial size, and it reads until EOF. That 
> has to be said in the commit message.

It is healthy to ask for an explanation, especially when the patch cannot
justify itself by reading only the diff and the reviewer needs to check
the surrounding code.

I however think you are asking a little too much in this particular case.
That strbuf_read() code is exactly the same as the one used for the
"reading from a pipe and we don't know how big it is" case in the original
code.  You can see it in the lines deleted by the patch.


* Re: mmap with MAP_PRIVATE is useless
  2010-02-14  1:53     ` mmap with MAP_PRIVATE is useless Junio C Hamano
@ 2010-02-14  2:11       ` Paolo Bonzini
  0 siblings, 0 replies; 84+ messages in thread
From: Paolo Bonzini @ 2010-02-14  2:11 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jonathan Nieder, git, Zygo Blaxell

On 02/14/2010 02:53 AM, Junio C Hamano wrote:
> Back when most of these mmap calls were written by Linus and myself,
> we weren't interested in using MAP_PRIVATE, or any other trick for
> that matter, to deal with the case where the user tells git to go
> index a file, and then mucks with the file before git finishes and
> gives back control.

Eh, this one in particular (in index_fd) is quite ancient...

commit e83c5163316f89bfbde7d9ab23ca2e25604af290
Author: Linus Torvalds <torvalds@ppc970.osdl.org>
Date:   Thu Apr 7 15:13:13 2005 -0700

     Initial revision of "git", the information manager from hell

:-)

There were three mmap calls -- in read_sha1_file, read_cache and 
index_fd -- and all three were of the same mmap (NULL, st.st_size, 
PROT_READ, MAP_PRIVATE, fd, 0) shape.

Paolo


* Re: [PATCH] don't use mmap() to hash files
  2010-02-14  1:37                   ` Junio C Hamano
@ 2010-02-14  2:18                     ` Dmitry Potapov
  2010-02-14  3:14                       ` Junio C Hamano
  0 siblings, 1 reply; 84+ messages in thread
From: Dmitry Potapov @ 2010-02-14  2:18 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Zygo Blaxell, Ilari Liusvaara, Thomas Rast, Jonathan Nieder, git

On Sat, Feb 13, 2010 at 05:37:21PM -0800, Junio C Hamano wrote:
> Dmitry Potapov <dpotapov@gmail.com> writes:
> 
> > If a mmapped file is changed by another program during git-add, it
> > causes repository corruption. Disabling mmap() in index_fd() does not
> > have any negative impact on the overall speed of Git. In fact, it
> > makes git hash-object work slightly faster....
> > ...
> > I think more people should test this change to see its impact on
> > performance. For me, it was positive. Here are my results:
> 
> I wasn't particularly impressed by the original problem description, but
> this is a very interesting result.

My initial reaction was to write that the whole problem is due to abusing
Git for purposes it was not intended for. But then I decided to do some
testing to see what impact the change has. Because I did not see any
negative impact (in fact, a slight speed improvement), I decided to ask
other people (who are interested in this patch) to do more testing.


Dmitry


* Re: [PATCH] don't use mmap() to hash files
  2010-02-14  1:53                   ` Johannes Schindelin
  2010-02-14  2:00                     ` Junio C Hamano
@ 2010-02-14  2:42                     ` Dmitry Potapov
  2010-02-14 11:07                       ` Jakub Narebski
                                         ` (2 more replies)
  1 sibling, 3 replies; 84+ messages in thread
From: Dmitry Potapov @ 2010-02-14  2:42 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Zygo Blaxell, Ilari Liusvaara, Thomas Rast, Jonathan Nieder, git

On Sun, Feb 14, 2010 at 02:53:58AM +0100, Johannes Schindelin wrote:
> On Sun, 14 Feb 2010, Dmitry Potapov wrote:
> 
> > +	if (strbuf_read(&sbuf, fd, 4096) >= 0)
> 
> How certain are you at this point that all of fd's contents fit into your 
> memory?

You can't be sure... In fact, we know mmap() also may fail for huge
files, so can strbuf_read(). Perhaps, mmap() behaves better when you
want to hash a huge file that does not fit in free physical memory, but
I do not think it is an important use case for any VCS, which mostly
stores small text files and a few not so big binary files. Git is not
designed to store your video collection. (probably, Git can be improved to
handle big files better but I leave that exercise to those who want to
store their media files in Git).

> 
> And even if you could be certain, a hint is missing that strbuf_read(), 
> its name notwithstanding, does not read NUL-terminated strings. Oh, and 
> the size is just a hint for the initial size, and it reads until EOF. That 
> has to be said in the commit message.

I did not add _any_ new code, including the above line. It was there
before my patch. I only removed a few lines for the case when we used
mmap() and left the code that used strbuf_read() (mmap() was used for
regular files and strbuf_read() for other types of descriptors).

The fact that I re-aligned those lines could not introduce any bug, so
if you think this code is incorrect then it was incorrect before my
patch too, but I do not see why. And I see no reason to comment on code
that does not change at all (including that the third parameter of
strbuf_read() is just a hint). I may agree with you that strbuf_read()
taking a 'hint' is a bit confusing, but it has nothing to do with my
patch...


Dmitry


* [PATCH v2] don't use mmap() to hash files
  2010-02-14  1:18                 ` [PATCH] don't use mmap() to hash files Dmitry Potapov
  2010-02-14  1:37                   ` Junio C Hamano
  2010-02-14  1:53                   ` Johannes Schindelin
@ 2010-02-14  3:05                   ` Dmitry Potapov
  2010-02-18  1:16                   ` [PATCH] Teach "git add" and friends to be paranoid Junio C Hamano
  3 siblings, 0 replies; 84+ messages in thread
From: Dmitry Potapov @ 2010-02-14  3:05 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Johannes Schindelin, Zygo Blaxell,
	Ilari Liusvaara, Thomas Rast, Jonathan Nieder

If a mmapped file is changed by another program during git-add, it
causes repository corruption. Disabling mmap() in index_fd() does not
have any negative impact on the overall speed of Git. In fact, it makes
git hash-object work slightly faster. Here is the best result before
and after the patch, based on 5 runs on the Linux kernel repository:

Before:

$ git ls-files | time git hash-object --stdin-path > /dev/null
2.15user 0.36system 0:02.52elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+103248minor)pagefaults 0swaps

After:

$ git ls-files | time ../git/git hash-object --stdin-path > /dev/null
2.09user 0.33system 0:02.42elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+1073minor)pagefaults 0swaps

Signed-off-by: Dmitry Potapov <dpotapov@gmail.com>
---

In this version, I have improved the hint value for regular files to
avoid useless re-allocation and copy.

 sha1_file.c |   27 +++++++++++----------------
 1 files changed, 11 insertions(+), 16 deletions(-)

diff --git a/sha1_file.c b/sha1_file.c
index 657825e..26c6231 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2438,22 +2438,17 @@ int index_fd(unsigned char *sha1, int fd, struct stat *st, int write_object,
 	     enum object_type type, const char *path)
 {
 	int ret;
-	size_t size = xsize_t(st->st_size);
-
-	if (!S_ISREG(st->st_mode)) {
-		struct strbuf sbuf = STRBUF_INIT;
-		if (strbuf_read(&sbuf, fd, 4096) >= 0)
-			ret = index_mem(sha1, sbuf.buf, sbuf.len, write_object,
-					type, path);
-		else
-			ret = -1;
-		strbuf_release(&sbuf);
-	} else if (size) {
-		void *buf = xmmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
-		ret = index_mem(sha1, buf, size, write_object, type, path);
-		munmap(buf, size);
-	} else
-		ret = index_mem(sha1, NULL, size, write_object, type, path);
+	struct strbuf sbuf = STRBUF_INIT;
+	/* for regular files, we supply the real file size, otherwise
+	   `size' is just a hint */
+	size_t size = S_ISREG(st->st_mode) ? xsize_t(st->st_size) : 4096;
+
+	if (strbuf_read(&sbuf, fd, size) >= 0)
+		ret = index_mem(sha1, sbuf.buf, sbuf.len, write_object,
+				type, path);
+	else
+		ret = -1;
+	strbuf_release(&sbuf);
 	close(fd);
 	return ret;
 }
-- 
1.7.0


* Re: [PATCH] don't use mmap() to hash files
  2010-02-14  2:18                     ` Dmitry Potapov
@ 2010-02-14  3:14                       ` Junio C Hamano
  2010-02-14 11:14                         ` Thomas Rast
  0 siblings, 1 reply; 84+ messages in thread
From: Junio C Hamano @ 2010-02-14  3:14 UTC (permalink / raw)
  To: Dmitry Potapov
  Cc: Zygo Blaxell, Ilari Liusvaara, Thomas Rast, Jonathan Nieder, git

Dmitry Potapov <dpotapov@gmail.com> writes:

> My initial reaction was to write that the whole problem is due to abusing
> Git for purposes it was not intended for. But then I decided to do some
> testing to see what impact the change has. Because I did not see any
> negative impact (in fact, a slight speed improvement), I decided to ask
> other people (who are interested in this patch) to do more testing.

Yes, I know, and I greatly appreciate it.

Later, we might want to split the codepath again to:

 (0) see if it is huge or small if we are not reading from pipe;

 (1) if we do not know the size or if it is moderately tiny, keep doing
     what your code does;

 (2) if we know we are reading something huge with known size, then have a
     loop to read-a-bit-compress-and-write-it-out-while-hashing, and
     finally rename the loose resulting object to the final name.  Or we
     may even want to do that into a new pack on its own (a rough sketch
     of the streaming loop follows below).

but obviously all of that can come after this initial round ;-)
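
For illustration, a rough sketch of the inner loop of (2), assuming
OpenSSL SHA-1 and zlib only for the example (this is not git's code): it
hashes and compresses each chunk as it is read and streams the
compressed bytes to a temporary file, so memory use stays bounded.  A
real implementation would also run the "blob <size>\0" header through
the hash and the compressor first, and rename the temporary file once
the object name is known.

#include <openssl/sha.h>
#include <zlib.h>
#include <unistd.h>

static int stream_compress_and_hash(int fd, off_t size, int tmpfd,
				    unsigned char sha1[SHA_DIGEST_LENGTH])
{
	unsigned char in[8192], out[8192];
	SHA_CTX ctx;
	z_stream z = { 0 };
	off_t remaining = size;
	int flush, zret;

	SHA1_Init(&ctx);
	if (deflateInit(&z, Z_DEFAULT_COMPRESSION) != Z_OK)
		return -1;
	do {
		ssize_t n = remaining ? read(fd, in, sizeof(in)) : 0;
		if (n < 0 || (n == 0 && remaining))
			goto fail;	/* read error, or the file shrank */
		if (n > remaining)
			n = remaining;	/* the file grew: ignore extra bytes */
		remaining -= n;
		SHA1_Update(&ctx, in, n);

		z.next_in = in;
		z.avail_in = n;
		flush = remaining ? Z_NO_FLUSH : Z_FINISH;
		do {	/* drain all pending compressed output */
			z.next_out = out;
			z.avail_out = sizeof(out);
			zret = deflate(&z, flush);
			if (write(tmpfd, out, sizeof(out) - z.avail_out) < 0)
				goto fail;
		} while (z.avail_out == 0);
	} while (flush != Z_FINISH);
	if (zret != Z_STREAM_END)
		goto fail;
	deflateEnd(&z);
	SHA1_Final(sha1, &ctx);
	return 0;
fail:
	deflateEnd(&z);
	return -1;
}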


* Re: [PATCH] don't use mmap() to hash files
  2010-02-14  2:42                     ` Dmitry Potapov
@ 2010-02-14 11:07                       ` Jakub Narebski
  2010-02-14 11:55                       ` Paolo Bonzini
  2010-02-14 18:10                       ` Johannes Schindelin
  2 siblings, 0 replies; 84+ messages in thread
From: Jakub Narebski @ 2010-02-14 11:07 UTC (permalink / raw)
  To: Dmitry Potapov
  Cc: Johannes Schindelin, Zygo Blaxell, Ilari Liusvaara, Thomas Rast,
	Jonathan Nieder, git

Dmitry Potapov <dpotapov@gmail.com> writes:
> On Sun, Feb 14, 2010 at 02:53:58AM +0100, Johannes Schindelin wrote:
> > On Sun, 14 Feb 2010, Dmitry Potapov wrote:
> > 
> > > +	if (strbuf_read(&sbuf, fd, 4096) >= 0)
> > 
> > How certain are you at this point that all of fd's contents fit into your 
> > memory?
> 
> You can't be sure... In fact, we know mmap() also may fail for huge
> files, so can strbuf_read(). Perhaps, mmap() behaves better when you
> want to hash a huge file that does not fit in free physical memory, but
> I do not think it is an important use case for any VCS, which mostly
> stores small text files and a few not so big binary files. Git is not
> designed to store your video collection. (probably, Git can be improved to
> handle big files better but I leave that exercise to those who want to
> store their media files in Git).

Something like git-bigfiles: http://caca.zoy.org/wiki/git-bigfiles ?

(found via http://blog.bitquabit.com/2010/02/10/fightings-been-fun-and-all-its-time-shut-and-get-along/
 found via http://tomayko.com/, entry for 10 Feb 2010).
-- 
Jakub Narebski
Poland
ShadeHawk on #git


* Re: [PATCH] don't use mmap() to hash files
  2010-02-14  3:14                       ` Junio C Hamano
@ 2010-02-14 11:14                         ` Thomas Rast
  2010-02-14 11:46                           ` Junio C Hamano
  0 siblings, 1 reply; 84+ messages in thread
From: Thomas Rast @ 2010-02-14 11:14 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Dmitry Potapov, Zygo Blaxell, Ilari Liusvaara, Jonathan Nieder, git

On Sunday 14 February 2010 04:14:01 Junio C Hamano wrote:
> Later, we might want to split the codepath again to:
> 
>  (0) see if it is huge or small if we are not reading from pipe;
> 
>  (1) if we do not know the size or if it is moderately tiny, keep doing
>      what your code does;
> 
>  (2) if we know we are reading something huge with known size, then have a
>      loop to read-a-bit-compress-and-write-it-out-while-hashing, and
>      finally rename the loose resulting object to the final name.  Or we
>      may even want to do that into a new pack on its own.

There's a slight problem with that code (I tried to finish last
night's attempt but got stuck on this):

The create_tmpfile() and move_temp_to_file() duo goes to some lengths
to ensure that the file is created in the same directory that we want
it to end up in.  However, in the block-based scheme, you cannot know
which directory this will be before you have already written the
entire output.

So again I guess there are a few possible solutions:

* Try the cross-directory rename anyway, but if it doesn't work,
  copy&unlink.  This of course means that you may write the same
  object over network twice.

* Declare that keeping the memory usage near what it is today (the
  full output buffer plus a constant) is okay.

* Give up and stick with Dmitry's patch :-)

-- 
Thomas Rast
trast@{inf,student}.ethz.ch


* Re: [PATCH] don't use mmap() to hash files
  2010-02-14 11:14                         ` Thomas Rast
@ 2010-02-14 11:46                           ` Junio C Hamano
  0 siblings, 0 replies; 84+ messages in thread
From: Junio C Hamano @ 2010-02-14 11:46 UTC (permalink / raw)
  To: Thomas Rast
  Cc: Dmitry Potapov, Zygo Blaxell, Ilari Liusvaara, Jonathan Nieder, git

Thomas Rast <trast@student.ethz.ch> writes:

> * Give up and stick with Dmitry's patch :-)

You obviously didn't read the last line of my message before responding.

In any case, I have a suspicion that streaming to a single loose object
file would not buy us much (that is the only case the "cross directory
rename" could matter), exactly because we wouldn't want to leave an object
in loose form if it is so big that we do not want to slurp it in full into
memory anyway.  If we stream such a huge object directly to a new pack, on
the other hand, there won't be any cross directory rename issues.


* Re: [PATCH] don't use mmap() to hash files
  2010-02-14  2:42                     ` Dmitry Potapov
  2010-02-14 11:07                       ` Jakub Narebski
@ 2010-02-14 11:55                       ` Paolo Bonzini
  2010-02-14 18:10                       ` Johannes Schindelin
  2 siblings, 0 replies; 84+ messages in thread
From: Paolo Bonzini @ 2010-02-14 11:55 UTC (permalink / raw)
  To: Dmitry Potapov
  Cc: Johannes Schindelin, Zygo Blaxell, Ilari Liusvaara, Thomas Rast,
	Jonathan Nieder, git, Junio C Hamano

On 02/14/2010 03:42 AM, Dmitry Potapov wrote:
> In fact, we know mmap() also may fail for huge files, so can
> strbuf_read().

On a 64-bit machine mmap should fail pretty much never, as it is limited
by address space and not available memory.

That said I can reproduce the result here too, and it's actually quite 
understandable from your "time" results: mmap has a minor page fault for 
every 4k you read (or at least that's the order of magnitude), read has 
basically none.  Furthermore, it's true that with read you touch the 
whole memory twice from beginning to end, while with mmap you touch a 
page twice and move on; however, a file's contents will almost always 
fit in the L2 cache, so it's not too expensive.

I tried madvise(buf, size, MADV_SEQUENTIAL) and MADV_WILLNEED but it has 
no effect.

I suspect mmap will be faster only if the data is not in cache (so the 
cost of a page fault is negligible compared to going to disk) and the 
average file size is a few megabytes.

Paolo


* Re: [PATCH] don't use mmap() to hash files
  2010-02-14  2:42                     ` Dmitry Potapov
  2010-02-14 11:07                       ` Jakub Narebski
  2010-02-14 11:55                       ` Paolo Bonzini
@ 2010-02-14 18:10                       ` Johannes Schindelin
  2010-02-14 19:06                         ` Dmitry Potapov
  2 siblings, 1 reply; 84+ messages in thread
From: Johannes Schindelin @ 2010-02-14 18:10 UTC (permalink / raw)
  To: Dmitry Potapov
  Cc: Zygo Blaxell, Ilari Liusvaara, Thomas Rast, Jonathan Nieder, git

Hi,

On Sun, 14 Feb 2010, Dmitry Potapov wrote:

> On Sun, Feb 14, 2010 at 02:53:58AM +0100, Johannes Schindelin wrote:
> > On Sun, 14 Feb 2010, Dmitry Potapov wrote:
> > 
> > > +	if (strbuf_read(&sbuf, fd, 4096) >= 0)
> > 
> > How certain are you at this point that all of fd's contents fit into 
> > your memory?
> 
> You can't be sure... In fact, we know mmap() also may fail for huge
> files, so can strbuf_read().

That's comparing oranges to apples. In one case, the address space runs 
out, in the other the available memory. The latter is much more likely.

> > And even if you could be certain, a hint is missing that 
> > strbuf_read(), its name notwithstanding, does not read NUL-terminated 
> > strings. Oh, and the size is just a hint for the initial size, and it 
> > reads until EOF. That has to be said in the commit message.
> 
> I did not add _any_ new code, including the above line. It was there
> before my patch.

But that explanation does not answer my question, does it? And my question 
was not unreasonable to ask, was it?

Ciao,
Dscho


* Re: [PATCH] don't use mmap() to hash files
  2010-02-14 18:10                       ` Johannes Schindelin
@ 2010-02-14 19:06                         ` Dmitry Potapov
  2010-02-14 19:22                           ` Johannes Schindelin
  2010-02-14 23:13                           ` Avery Pennarun
  0 siblings, 2 replies; 84+ messages in thread
From: Dmitry Potapov @ 2010-02-14 19:06 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Zygo Blaxell, Ilari Liusvaara, Thomas Rast, Jonathan Nieder, git

On Sun, Feb 14, 2010 at 9:10 PM, Johannes Schindelin
<Johannes.Schindelin@gmx.de> wrote:
>
> On Sun, 14 Feb 2010, Dmitry Potapov wrote:
>
>> On Sun, Feb 14, 2010 at 02:53:58AM +0100, Johannes Schindelin wrote:
>> > On Sun, 14 Feb 2010, Dmitry Potapov wrote:
>> >
>> > > + if (strbuf_read(&sbuf, fd, 4096) >= 0)
>> >
>> > How certain are you at this point that all of fd's contents fit into
>> > your memory?
>>
>> You can't be sure... In fact, we know mmap() also may fail for huge
>> files, so can strbuf_read().
>
> That's comparing oranges to apples. In one case, the address space runs
> out, in the other the available memory. The latter is much more likely.

"much more likely" is not a very qualitative characteristic... I would
prefer to see numbers. My gut feeling is that it is not a problem in
real use cases where Git is used as VCS and not storage for huge media
files.  In fact, we do not use mmap() on Windows or MacOS at all, and
I have not heard that users of those platforms suffered much more from
inability to store huge files than those who use Linux.  I do not want
to say that there is no difference, but it may not as large as you try
to portray. In any case, my patch was to let people to test it and to
see what impact it has and come up with some numbers.

BTW, probably, it is not difficult to stream a large file in chunks (and
it may be even much faster, because we work on CPU cache), but I suspect
it will not resolve all issues with huge files, because eventually we
need to store them in a pack file. So we need to develop some strategy
how to deal with them.

One way to deal with them is to stream directly into a separate pack.
Still, it does not resolve all problems, because each pack file should
be mapped into memory, and this may be a problem for 32-bit systems
(or even 64-bit systems where a sysadmin sets a limit on the amount of
virtual memory available to a single program).

The other way to handle huge files is to split them into chunks.
http://article.gmane.org/gmane.comp.version-control.git/120112

Maybe there are other approaches. I heard some people tried to do
something about it, but I have never been interested enough in big files
to look at this issue closely.

>
>> > And even if you could be certain, a hint is missing that
>> > strbuf_read(), its name notwithstanding, does not read NUL-terminated
>> > strings. Oh, and the size is just a hint for the initial size, and it
>> > reads until EOF. That has to be said in the commit message.
>>
>> I did not add _any_ new code, including the above line. It was there
>> before my patch.
>
> But that explanation does not answer my question, does it?

I believe it did, or I did not understand your question.

Dmitry


* Re: [PATCH] don't use mmap() to hash files
  2010-02-14 19:06                         ` Dmitry Potapov
@ 2010-02-14 19:22                           ` Johannes Schindelin
  2010-02-14 19:28                             ` Johannes Schindelin
  2010-02-14 19:55                             ` Dmitry Potapov
  2010-02-14 23:13                           ` Avery Pennarun
  1 sibling, 2 replies; 84+ messages in thread
From: Johannes Schindelin @ 2010-02-14 19:22 UTC (permalink / raw)
  To: Dmitry Potapov
  Cc: Zygo Blaxell, Ilari Liusvaara, Thomas Rast, Jonathan Nieder, git

Hi,

On Sun, 14 Feb 2010, Dmitry Potapov wrote:

> On Sun, Feb 14, 2010 at 9:10 PM, Johannes Schindelin 
> <Johannes.Schindelin@gmx.de> wrote:
> >
> > On Sun, 14 Feb 2010, Dmitry Potapov wrote:
> >
> >> On Sun, Feb 14, 2010 at 02:53:58AM +0100, Johannes Schindelin wrote:
> >> > On Sun, 14 Feb 2010, Dmitry Potapov wrote:
> >> >
> >> > > + if (strbuf_read(&sbuf, fd, 4096) >= 0)
> >> >
> >> > How certain are you at this point that all of fd's contents fit 
> >> > into your memory?
> >>
> >> You can't be sure... In fact, we know mmap() also may fail for huge 
> >> files, so can strbuf_read().
> >
> > That's comparing oranges to apples. In one case, the address space 
> > runs out, in the other the available memory. The latter is much more 
> > likely.
> 
> "much more likely" is not a very qualitative characteristic...

Git was touted as a "content tracker". So I use it as such.

Concrete example: in one of my repositories, the average file size is well 
over 2 gigabytes.

Go figure,
Dscho


* Re: [PATCH] don't use mmap() to hash files
  2010-02-14 19:22                           ` Johannes Schindelin
@ 2010-02-14 19:28                             ` Johannes Schindelin
  2010-02-14 19:56                               ` Dmitry Potapov
  2010-02-14 19:55                             ` Dmitry Potapov
  1 sibling, 1 reply; 84+ messages in thread
From: Johannes Schindelin @ 2010-02-14 19:28 UTC (permalink / raw)
  To: Dmitry Potapov
  Cc: Zygo Blaxell, Ilari Liusvaara, Thomas Rast, Jonathan Nieder, git

Hi,

On Sun, 14 Feb 2010, Johannes Schindelin wrote:

> On Sun, 14 Feb 2010, Dmitry Potapov wrote:
> 
> > On Sun, Feb 14, 2010 at 9:10 PM, Johannes Schindelin 
> > <Johannes.Schindelin@gmx.de> wrote:
> > >
> > > On Sun, 14 Feb 2010, Dmitry Potapov wrote:
> > >
> > >> On Sun, Feb 14, 2010 at 02:53:58AM +0100, Johannes Schindelin 
> > >> wrote:
> > >> > On Sun, 14 Feb 2010, Dmitry Potapov wrote:
> > >> >
> > >> > > + if (strbuf_read(&sbuf, fd, 4096) >= 0)
> > >> >
> > >> > How certain are you at this point that all of fd's contents fit 
> > >> > into your memory?
> > >>
> > >> You can't be sure... In fact, we know mmap() also may fail for huge 
> > >> files, so can strbuf_read().
> > >
> > > That's comparing oranges to apples. In one case, the address space 
> > > runs out, in the other the available memory. The latter is much more 
> > > likely.
> > 
> > "much more likely" is not a very qualitative characteristic...
> 
> Git was touted as a "content tracker". So I use it as such.
> 
> Concrete example: in one of my repositories, the average file size is 
> well over 2 gigabytes.

Just to make extremely sure that you understand the issue: adding these
files on a computer with 512 megabyte RAM works at the moment. Can you 
guarantee that there is no regression in that respect _with_ your patch?

Git is in wide use. If you provide a patch, it is not good enough anymore 
if it Works For You(tm), when it Does Not Work Somewhere Else Anymore(tm).

Ciao,
Dscho


* Re: [PATCH] don't use mmap() to hash files
  2010-02-14 19:22                           ` Johannes Schindelin
  2010-02-14 19:28                             ` Johannes Schindelin
@ 2010-02-14 19:55                             ` Dmitry Potapov
  1 sibling, 0 replies; 84+ messages in thread
From: Dmitry Potapov @ 2010-02-14 19:55 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Zygo Blaxell, Ilari Liusvaara, Thomas Rast, Jonathan Nieder, git

On Sun, Feb 14, 2010 at 10:22 PM, Johannes Schindelin
<Johannes.Schindelin@gmx.de> wrote:
>
> Git was touted as a "content tracker". So I use it as such.
>

Well... those who got their Git repository corrupted also used Git as a
"content tracker". But I believe Git is a content tracker primarily for
source code, thus using it as a backup tool or to store your multi-media
collection is abuse of Git... Having said that, I have no objection to
Git being used in those areas too... So, I would like to hear any
specific suggestions on how to achieve those goals better...


Dmitry


* Re: [PATCH] don't use mmap() to hash files
  2010-02-14 19:28                             ` Johannes Schindelin
@ 2010-02-14 19:56                               ` Dmitry Potapov
  2010-02-14 23:52                                 ` Zygo Blaxell
                                                   ` (2 more replies)
  0 siblings, 3 replies; 84+ messages in thread
From: Dmitry Potapov @ 2010-02-14 19:56 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Zygo Blaxell, Ilari Liusvaara, Thomas Rast, Jonathan Nieder, git

On Sun, Feb 14, 2010 at 10:28 PM, Johannes Schindelin
<Johannes.Schindelin@gmx.de> wrote:
>>
>> Concrete example: in one of my repositories, the average file size is
>> well over 2 gigabytes.
>
> Just to make extremely sure that you understand the issue: adding these
> files on a computer with 512 megabyte RAM works at the moment. Can you
> guarantee that there is no regression in that respect _with_ your patch?

It may not work without enough swap space, and it will not be pretty anyway
due to swapping. So, I see the following options:

1. to introduce a configuration parameter that will define whether to use
mmap() to hash files or not. It is a trivial change, but the real question
is what the default value for this option should be (should we do some
heuristic based on filesize vs available memory?)

2. to stream files in chunks. It is better because it is faster, especially on
large files, as you calculate SHA-1 and zip data while they are in CPU
cache. However, it may be more difficult to implement, because we have
filters that should be applied to files that are put into the repository.

3. to improve Git to support huge files on computers with low memory.

I think #3 is a noble goal, but I do not have time for that. I can try to take
on #2, but it may take more time than I have now. As to #1, I am ready to
send the patch if we agree that is the right way to go...

I am open to any of your suggestions. Maybe there are some other options here...


Dmitry


* Re: [PATCH] don't use mmap() to hash files
  2010-02-14 19:06                         ` Dmitry Potapov
  2010-02-14 19:22                           ` Johannes Schindelin
@ 2010-02-14 23:13                           ` Avery Pennarun
  2010-02-15  4:16                             ` Nicolas Pitre
  1 sibling, 1 reply; 84+ messages in thread
From: Avery Pennarun @ 2010-02-14 23:13 UTC (permalink / raw)
  To: Dmitry Potapov
  Cc: Johannes Schindelin, Zygo Blaxell, Ilari Liusvaara, Thomas Rast,
	Jonathan Nieder, git

On Sun, Feb 14, 2010 at 2:06 PM, Dmitry Potapov <dpotapov@gmail.com> wrote:
> On Sun, Feb 14, 2010 at 9:10 PM, Johannes Schindelin
> <Johannes.Schindelin@gmx.de> wrote:
>> That's comparing oranges to apples. In one case, the address space runs
>> out, in the other the available memory. The latter is much more likely.
>
> "much more likely" is not a very qualitative characteristic... I would
> prefer to see numbers.

Well, the numbers are rather easy to calculate of course.  On a 32-bit
machine, your (ideal) maximum address space size is 4GB.  On a 64-bit
machine, it's a heck of a lot bigger.  And in either case, a single
process consuming it all doesn't matter since it won't hurt other
processes.  But the available RAM is frequently less than 4GB and that
has to be shared between *all* your processes.

> BTW, probably, it is not difficult to stream a large file in chunks (and
> it may be even much faster, because we work on CPU cache), but I suspect
> it will not resolve all issues with huge files, because eventually we
> need to store them in a pack file. So we need to develop some strategy
> for how to deal with them.

It definitely doesn't resolve all the issues.  There are different
ways of looking at this; one is to not bother making git-add work
smoothly with large files, because calculating the deltas will later
cause a disastrous meltdown anyway.  In fact, arguably you should
prevent git-add from adding large files at all, because at least then
you don't get the repository into a hard-to-recover-from state with
huge files.  (This happened at work a few months ago; most people have
no idea what to do in such a situation.)

The other way to look at it is that if we want git to *eventually*
work with huge files, we have to fix each bug one at a time, and we
can't go making things worse.

For my own situation, I think I'm more likely to (and I know people
who are more likely to) try storing huge files in git than I am likely
to modify a file *while* I'm trying to store it in git.

> One way to deal with them is to stream directly into a separate pack.
> Still, it does not resolve all problems, because each pack file should
> be mapped into memory, and this may be a problem for a 32-bit system
> (or even 64-bit systems where a sysadmin set a limit on the amount of
> virtual memory available to a single program).
>
> The other way to handle huge files is to split them into chunks.
> http://article.gmane.org/gmane.comp.version-control.git/120112

I have a bit of experience splitting files into chunks:
http://groups.google.com/group/bup-list/browse_thread/thread/812031efd4c5f7e4

It works.  Also note that the speed gain from mmap'ing packs appears
to be much less than the gain from mmap'ing indexes.  You could
probably sacrifice most or all of the former and never really notice.
Caching expanded deltas can be pretty valuable, though.  (bup
presently avoids that whole question by not using deltas.)

I can also confirm that streaming objects directly into packs is a
massive performance increase when dealing with big files.  However,
you then start to run into git's heuristics that often assume (for
example) that if an object is in a pack, it should never (or rarely)
be pruned.  This is normally a fine assumption, because if it was
likely to get pruned, it probably never would have been put into a
pack in the first place.

Have fun,

Avery

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] don't use mmap() to hash files
  2010-02-14 19:56                               ` Dmitry Potapov
@ 2010-02-14 23:52                                 ` Zygo Blaxell
  2010-02-15  5:05                                 ` Nicolas Pitre
  2010-02-15  7:48                                 ` Paolo Bonzini
  2 siblings, 0 replies; 84+ messages in thread
From: Zygo Blaxell @ 2010-02-14 23:52 UTC (permalink / raw)
  To: Dmitry Potapov
  Cc: Johannes Schindelin, Ilari Liusvaara, Thomas Rast, Jonathan Nieder, git

On Sun, Feb 14, 2010 at 10:56:46PM +0300, Dmitry Potapov wrote:
> It may not work without enough swap space, and it will not be pretty anyway
> due to swapping. So, I see the following options:
> 
> 1. to introduce a configuration parameter that will define whether to use
> mmap() to hash files or not. It is a trivial change, but the real question
> is what the default value for this option should be (should we do some
> heuristic based on filesize vs available memory?)

I'm a fan of having the option.  It lets users who know what they're
doing (and who might be doing something unusual, like files that are
large relative to memory size, or files that are being modified during
git add) decide how to make the performance-robustness trade-off.

"Large" is relative.  How common is it to add files to git that are
large relative to the RAM size of the machine?  I have a machine with 6GB
of RAM and a repo on that machine with half a dozen 0.5-1.5GB files in it.
Git add is painfully slow on that repo even with mmap().  I wouldn't dare
try to manipulate that repo on a machine with less than 1GB of RAM.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] don't use mmap() to hash files
  2010-02-14 23:13                           ` Avery Pennarun
@ 2010-02-15  4:16                             ` Nicolas Pitre
  2010-02-15  5:01                               ` Avery Pennarun
  0 siblings, 1 reply; 84+ messages in thread
From: Nicolas Pitre @ 2010-02-15  4:16 UTC (permalink / raw)
  To: Avery Pennarun
  Cc: Dmitry Potapov, Johannes Schindelin, Zygo Blaxell,
	Ilari Liusvaara, Thomas Rast, Jonathan Nieder, git

On Sun, 14 Feb 2010, Avery Pennarun wrote:

> On Sun, Feb 14, 2010 at 2:06 PM, Dmitry Potapov <dpotapov@gmail.com> wrote:
> > BTW, probably, it is not difficult to stream a large file in chunks (and
> > it may be even much faster, because we work on CPU cache), but I suspect
> > it will not resolve all issues with huge files, because eventually we
> > need to store them in a pack file. So we need to develop some strategy
> > for how to deal with them.
> 
> It definitely doesn't resolve all the issues.  There are different
> ways of looking at this; one is to not bother making git-add work
> smoothly with large files, because calculating the deltas will later
> cause a disastrous meltdown anyway.

We have that core.bigFileThreshold configuration variable now.  It is 
currently only used by fast-import.  But the idea is to apply it across 
the board when that makes sense.  And one of those places is to skip 
over those files when it comes to delta compression.
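
On the delta side that boils down to a size check in the candidate
selection, roughly like this (a sketch of the idea only, not the actual
pack-objects code; the variable name is just for illustration):

/* Assumed to hold the parsed core.bigFileThreshold value; 0 = no limit. */
static unsigned long big_file_threshold;

/* Objects at or above the threshold are never considered as delta
 * candidates; they get stored (and later reused) whole. */
static int want_delta(unsigned long size)
{
    return !big_file_threshold || size < big_file_threshold;
}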

> In fact, arguably you should prevent git-add from adding large files 
> at all, because at least then you don't get the repository into a 
> hard-to-recover-from state with huge files.  (This happened at work a 
> few months ago; most people have no idea what to do in such a 
> situation.)

Git needs to be fixed in that case, not be crippled.

> The other way to look at it is that if we want git to *eventually*
> work with huge files, we have to fix each bug one at a time, and we
> can't go making things worse.

Obviously.

> For my own situation, I think I'm more likely to (and I know people
> who are more likely to) try storing huge files in git than I am likely
> to modify a file *while* I'm trying to store it in git.

And fancy operations on huge files are pretty unlikely.  Blame, diff, 
etc, are suited for text files, which are by nature relatively small.  
And if your source code is all pasted in one single huge file that Git 
can't handle right now, then the compiler is unlikely to cope either.

> > One way to deal with them is to stream directly into a separate pack.
> > Still, it does not resolve all problems, because each pack file should
> > be mapped into memory, and this may be a problem for a 32-bit system
> > (or even 64-bit systems where a sysadmin set a limit on the amount of
> > virtual memory available to a single program).

This is not a problem at all.  Git already deals with packs bigger than 
the available memory by not mapping them whole -- see documentation for 
core.packedGitWindowSize.
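
The idea, very roughly, is to map only a window around the offset you
need and recycle it, instead of mapping the whole pack (this is a
simplified illustration, not the actual sliding-window code in
sha1_file.c):

#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define WINDOW_SIZE (32u * 1024 * 1024)    /* e.g. 32 MiB per window */

struct pack_window {
    unsigned char *base;    /* start of the mapping */
    off_t offset;           /* pack offset the mapping starts at */
    size_t len;             /* length of the mapping */
};

/* Map a window of the pack that contains 'offset'; the caller later
 * munmap()s base/len instead of keeping the whole pack mapped. */
static int map_window(int pack_fd, off_t pack_size, off_t offset,
                      struct pack_window *w)
{
    long page = sysconf(_SC_PAGESIZE);

    w->offset = (offset / page) * page;    /* mmap wants page alignment */
    w->len = WINDOW_SIZE;
    if (w->offset + (off_t)w->len > pack_size)
        w->len = pack_size - w->offset;

    w->base = mmap(NULL, w->len, PROT_READ, MAP_PRIVATE,
                   pack_fd, w->offset);
    return w->base == MAP_FAILED ? -1 : 0;
}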

> > The other way to handle huge files is to split them into chunks.
> > http://article.gmane.org/gmane.comp.version-control.git/120112

No.  The chunk idea doesn't fit the Git model well enough without many 
corner cases all over the place which is a major drawback.  I think that 
was discussed in that thread already.

> I have a bit of experience splitting files into chunks:
> http://groups.google.com/group/bup-list/browse_thread/thread/812031efd4c5f7e4
> 
> It works.  Also note that the speed gain from mmap'ing packs appears
> to be much less than the gain from mmap'ing indexes.  You could
> probably sacrifice most or all of the former and never really notice.
> Caching expanded deltas can be pretty valuable, though.  (bup
> presently avoids that whole question by not using deltas.)

We do have a cache of expanded deltas already.

> I can also confirm that streaming objects directly into packs is a
> massive performance increase when dealing with big files.  However,
> you then start to run into git's heuristics that often assume (for
> example) that if an object is in a pack, it should never (or rarely)
> be pruned.  This is normally a fine assumption, because if it was
> likely to get pruned, it probably never would have been put into a
> pack in the first place.

Would you please for my own sanity tell me where we do such a thing.  I 
thought I had a firm grip on the pack model but you're casting a shadow 
of doubt on some code I might have written myself.


Nicolas

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] don't use mmap() to hash files
  2010-02-15  4:16                             ` Nicolas Pitre
@ 2010-02-15  5:01                               ` Avery Pennarun
  2010-02-15  5:48                                 ` Nicolas Pitre
  0 siblings, 1 reply; 84+ messages in thread
From: Avery Pennarun @ 2010-02-15  5:01 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Dmitry Potapov, Johannes Schindelin, Zygo Blaxell,
	Ilari Liusvaara, Thomas Rast, Jonathan Nieder, git

On Sun, Feb 14, 2010 at 11:16 PM, Nicolas Pitre <nico@fluxnic.net> wrote:
> On Sun, 14 Feb 2010, Avery Pennarun wrote:
>> In fact, arguably you should prevent git-add from adding large files
>> at all, because at least then you don't get the repository into a
>> hard-to-recover-from state with huge files.  (This happened at work a
>> few months ago; most people have no idea what to do in such a
>> situation.)
>
> Git needs to be fixed in that case, not be crippled.

That would be ideal, but is more work than disabling imports for large
files by default (for example), which would be easy.  In any case, my
solution at work was to say "if it hurts, don't do that" and it seems
to have worked out okay for now.

>> For my own situation, I think I'm more likely to (and I know people
>> who are more likely to) try storing huge files in git than I am likely
>> to modify a file *while* I'm trying to store it in git.
>
> And fancy operations on huge files are pretty unlikely.  Blame, diff,
> etc, are suited for text files, which are by nature relatively small.
> And if your source code is all pasted in one single huge file that Git
> can't handle right now, then the compiler is unlikely to cope either.

Well, I'm thinking of things like textual database dumps, such as
those produced by mysqldump.  It would be nice to be able to diff
those efficiently, even if they're several gigs in size.  bup's
hierarchical chunking allows this.

>> > The other way to handle huge files is to split them into chunks.
>> > http://article.gmane.org/gmane.comp.version-control.git/120112
>
> No.  The chunk idea doesn't fit the Git model well enough without many
> corner cases all over the place which is a major drawback.  I think that
> was discussed in that thread already.
>
>> I have a bit of experience splitting files into chunks:
>> http://groups.google.com/group/bup-list/browse_thread/thread/812031efd4c5f7e4

Note that bup's rolling-checksum-based hierarchical chunking is not
the same as the chunking that was discussed in that thread, and it
resolves most of the problems.  Unless I'm missing something.
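
For reference, the rolling-checksum idea looks roughly like this (a toy
sketch only; bup's actual rollsum and parameters are different): because
split points depend only on the bytes in a small sliding window, a local
edit moves only the nearby boundaries, and unchanged chunks keep hashing
to the same blobs:

#include <stddef.h>
#include <stdint.h>

#define WINDOW     64        /* bytes in the rolling window */
#define SPLIT_MASK 0x1fff    /* gives roughly 8 KiB average chunks */

/* Return the length of the next chunk starting at buf[0]. */
static size_t next_chunk_len(const unsigned char *buf, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i < len; i++) {
        sum += buf[i];
        if (i >= WINDOW)
            sum -= buf[i - WINDOW];    /* slide the window forward */
        /* Split when the low bits of the checksum hit a fixed pattern;
         * the mask controls the expected chunk size. */
        if (i >= WINDOW && (sum & SPLIT_MASK) == SPLIT_MASK)
            return i + 1;
    }
    return len;    /* no boundary found: the rest is one chunk */
}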

Also note that bup just uses normal tree objects (for better or worse)
instead of introducing a new object type.

>> It works.  Also note that the speed gain from mmap'ing packs appears
>> to be much less than the gain from mmap'ing indexes.  You could
>> probably sacrifice most or all of the former and never really notice.
>> Caching expanded deltas can be pretty valuable, though.  (bup
>> presently avoids that whole question by not using deltas.)
>
> We do have a cache of expanded deltas already.

Yes, sorry to have implied otherwise.  I was just comparing the
performance advantage of the delta expansion cache (which should be a
lot) with that of mmaping packfiles (which probably isn't much since
the packfile data is typically needed in expanded form anyway).

>> I can also confirm that streaming objects directly into packs is a
>> massive performance increase when dealing with big files.  However,
>> you then start to run into git's heuristics that often assume (for
>> example) that if an object is in a pack, it should never (or rarely)
>> be pruned.  This is normally a fine assumption, because if it was
>> likely to get pruned, it probably never would have been put into a
>> pack in the first place.
>
> Would you please for my own sanity tell me where we do such a thing.  I
> thought I had a firm grip on the pack model but you're casting a shadow
> of doubt on some code I might have written myself.

Sorry, I didn't hunt down the code, but I ran into it while
experimenting before.  The rules are something like:

- git-prune only prunes unpacked objects

- git-repack claims to be willing to explode unreachable objects back
into loose objects with -A, but I'm not quite sure if its definition
of "unreachable" is the same as mine.  And I'm not sure rewriting a
pack with -A makes the old pack reliably unreachable according to -d.
It's possible I was just being dense.

- there seems to be no documented situation in which you can ever
delete unused objects from a pack without using repack -a or -A, which
can be amazingly slow if your packs are huge.  (Ideally you'd only
repack the particular packs that you want to shrink.)  For example, my
bup repo is currently 200 GB.

Anyway, I didn't have much luck when playing with it earlier, but
didn't investigate since I assumed it's just a workflow that nobody
much cares about.  Which I think is a reasonable position for git
developers to take anyway.

Have fun,

Avery

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] don't use mmap() to hash files
  2010-02-14 19:56                               ` Dmitry Potapov
  2010-02-14 23:52                                 ` Zygo Blaxell
@ 2010-02-15  5:05                                 ` Nicolas Pitre
  2010-02-15 12:23                                   ` Dmitry Potapov
  2010-02-15  7:48                                 ` Paolo Bonzini
  2 siblings, 1 reply; 84+ messages in thread
From: Nicolas Pitre @ 2010-02-15  5:05 UTC (permalink / raw)
  To: Dmitry Potapov
  Cc: Johannes Schindelin, Zygo Blaxell, Ilari Liusvaara, Thomas Rast,
	Jonathan Nieder, git

On Sun, 14 Feb 2010, Dmitry Potapov wrote:

> 1. to introduce a configuration parameter that will define whether to use
> mmap() to hash files or not. It is a trivial change, but the real question
> is what the default value for this option should be (should we do some
> heuristic based on filesize vs available memory?)

I don't like that kind of heuristic.  They're almost always wrong, and 
any issue is damn hard to reproduce. I tend to believe that mmap() works 
better by letting the OS page memory in and out as needed, while 
reading data into allocated memory is only going to force the system 
into swap.

> 2. to stream files in chunks. It is better because it is faster, especially on
> large files, as you calculate SHA-1 and zip data while they are in the CPU
> cache. However, it may be more difficult to implement, because we have
> filters that should be applied to files that are put into the repository.

So?  "More difficult" when it is the right thing to do is no excuse not 
to do it and satisfy ourselves with a half solution.  Merely replacing 
mmap() with read() has drawbacks while the advantages aren't that many.  
Gaining a few percent of speed while making it less robust when memory 
is tight isn't such a great compromise to me.  BUT if you were to 
replace mmap() with read() and make the process chunked then you do 
improve both speed _and_ memory usage.

As to huge files: we have that core.bigFileThreshold variable now, and 
anything that crosses it should be considered "stream in / stream out" 
without further considerations.  That means no diff, no rename 
similarity estimates, no delta, no filter, no blame, no fancies.  If you 
have source code files that big then you do have a bigger problem 
already anyway.  Typical huge files are rarely manipulated, and when 
they are, they are pretty unlikely to be compared with other versions 
using diff; having such files also means that you have the storage 
capacity and network bandwidth to deal with them.  Hence repository 
tightness is not your top concern in that case, but repack/checkout 
speed most likely is.

So big files should be streamed to a pack of their own at "git add" 
time.  Then repack will simply "reuse pack data" without delta 
compression attempts, meaning that they will be streamed into a 
single huge pack with no issue (this particular case is already 
supported in the code).
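
For what it's worth, the per-object header such a streaming writer has
to emit is tiny and easy to produce up front once the size is known; a
sketch of my reading of the pack entry format (not code taken from git):

#include <stdint.h>

#define OBJ_BLOB 3    /* pack type code for a blob */

/* Emit the per-object pack entry header: 3 type bits plus the size as a
 * little-endian base-128 varint with continuation bits. */
static int encode_pack_entry_header(unsigned char *hdr, int type,
                                    uintmax_t size)
{
    int n = 0;
    unsigned char c = (type << 4) | (size & 0x0f);

    size >>= 4;
    while (size) {
        hdr[n++] = c | 0x80;    /* more size bytes follow */
        c = size & 0x7f;
        size >>= 7;
    }
    hdr[n++] = c;
    return n;    /* number of header bytes written */
}

After that header comes the deflated blob data, and the pack as a whole
still gets its usual trailing SHA-1 checksum.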

> 3. to improve Git to support huge files on computers with low memory.

That comes for free with #2.


Nicolas

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] don't use mmap() to hash files
  2010-02-15  5:01                               ` Avery Pennarun
@ 2010-02-15  5:48                                 ` Nicolas Pitre
  2010-02-15 19:19                                   ` Avery Pennarun
  0 siblings, 1 reply; 84+ messages in thread
From: Nicolas Pitre @ 2010-02-15  5:48 UTC (permalink / raw)
  To: Avery Pennarun
  Cc: Dmitry Potapov, Johannes Schindelin, Zygo Blaxell,
	Ilari Liusvaara, Thomas Rast, Jonathan Nieder, git

On Mon, 15 Feb 2010, Avery Pennarun wrote:

> - git-prune only prunes unpacked objects
> 
> - git-repack claims to be willing to explode unreachable objects back
> into loose objects with -A, but I'm not quite sure if its definition
> of "unreachable" is the same as mine.  

Unreachable means not referenced by the specified rev-list 
specification.  So if you give it --all --reflog then it means any 
object that is not reachable through either your branches, tags or 
reflog entries.

> And I'm not sure rewriting a
> pack with -A makes the old pack reliably unreachable according to -d.

Reachability doesn't apply to packs.  That applies to objects.  And 
unreachable objects may be copied to loose objects with -A, or simply 
forgotten about with -a.  Then -d will literally delete the old pack 
file.

> - there seems to be no documented situation in which you can ever
> delete unused objects from a pack without using repack -a or -A, which
> can be amazingly slow if your packs are huge.  (Ideally you'd only
> repack the particular packs that you want to shrink.)  For example, my
> bup repo is currently 200 GB.

Ideally you don't keep volatile objects in huge packs.  That's why we 
have .keep to flag those packs that are huge and pure so as not to touch 
them anymore.

Incremental repacking is there to gather only those _reachable_ loose 
objects into a new pack.  The objects that you're likely to make 
unreachable are probably going to come from a temporary branch that you 
deleted, which is likely to affect objects only from that latest and 
small pack.

And repacking can be done unattended and in parallel to normal Git 
operations with no issues.  So even if it is slow to repack huge packs, 
it is something that you might do during the night and only once in a 
while.

But if you really want to shrink only one pack without touching the 
other packs, and you do know which objects have to be removed from that 
pack, then it is trivial to write a small script using git-show-index, 
sorting the output by offset, filtering out the unwanted objects, keeping 
only the SHA1 column, and feeding the result into git-pack-objects.  Oh 
and delete the original pack when done of course.  It is also trivial to 
generate the list of all packed objects, compare it to the list of all 
reachable objects, and prune objects from the packs that contain those 
objects which are not to be found in the reachable object list.


Nicolas

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] don't use mmap() to hash files
  2010-02-14 19:56                               ` Dmitry Potapov
  2010-02-14 23:52                                 ` Zygo Blaxell
  2010-02-15  5:05                                 ` Nicolas Pitre
@ 2010-02-15  7:48                                 ` Paolo Bonzini
  2010-02-15 12:25                                   ` Dmitry Potapov
  2 siblings, 1 reply; 84+ messages in thread
From: Paolo Bonzini @ 2010-02-15  7:48 UTC (permalink / raw)
  To: Dmitry Potapov
  Cc: Johannes Schindelin, Zygo Blaxell, Ilari Liusvaara, Thomas Rast,
	Jonathan Nieder, git

On 02/14/2010 08:56 PM, Dmitry Potapov wrote:
> It may not work without enough swap space

No, the kernel will use the file itself as "swap" if it has to page out 
the memory (since the mmap is PROT_READ).

Paolo

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] don't use mmap() to hash files
  2010-02-15  5:05                                 ` Nicolas Pitre
@ 2010-02-15 12:23                                   ` Dmitry Potapov
  0 siblings, 0 replies; 84+ messages in thread
From: Dmitry Potapov @ 2010-02-15 12:23 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Johannes Schindelin, Zygo Blaxell, Ilari Liusvaara, Thomas Rast,
	Jonathan Nieder, git

On Mon, Feb 15, 2010 at 8:05 AM, Nicolas Pitre <nico@fluxnic.net> wrote:
> On Sun, 14 Feb 2010, Dmitry Potapov wrote:
>
>> 1. to introduce a configuration parameter that will define whether to use
>> mmap() to hash files or not. It is a trivial change, but the real question
>> is what the default value for this option should be (should we do some
>> heuristic based on filesize vs available memory?)
>
> I don't like that kind of heuristic.  They're almost always wrong, and
> any issue is damn hard to reproduce. I tend to believe that mmap() works
> better by letting the OS page memory in and out as needed, while
> reading data into allocated memory is only going to force the system
> into swap.

Probably, you are right. A heuristic is a bad idea. Still, we may want to
add an option to disable mmap() during hash calculation if we preserve
mmap() here. Though, I don't like keeping mmap() there if we go for #2...
See below...

>
>> 2. to stream files in chunks. It is better because it is faster, especially on
>> large files, as you calculate SHA-1 and zip data while they are in the CPU
>> cache. However, it may be more difficult to implement, because we have
>> filters that should be applied to files that are put into the repository.
>
> So?  "More difficult" when it is the right thing to do is no excuse not
> to do it and satisfy ourselves with a half solution.  Merely replacing
> mmap() with read() has drawbacks while the advantages aren't that many.
> Gaining a few percent of speed while making it less robust when memory
> is tight isn't such a great compromise to me.  BUT if you were to
> replace mmap() with read() and make the process chunked then you do
> improve both speed _and_ memory usage.

I have not had time to look closely at this, but there is one problem
that I noticed -- the header of any git object contains the blob length.
We know this length in advance (without reading all data) only for
regular files and only if they do not have any filter to be applied.
In all other cases, it seems we cannot do much better than we do now,
assuming that we do not want to change the storage format...
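
To illustrate why: for filtered content the whole converted result has
to be collected somewhere (memory or a temporary file) just to learn its
length before the object header can be formed; a minimal sketch, with a
made-up helper name:

#include <stdlib.h>
#include <unistd.h>

/* Collect everything from 'fd' (say, a clean filter's output) into a
 * malloc'd buffer, because the "<type> <length>" header cannot be
 * formed, let alone hashed, before the final length is known. */
static unsigned char *slurp_fd(int fd, size_t *lenp)
{
    size_t len = 0, alloc = 8192;
    unsigned char *buf = malloc(alloc);
    ssize_t n;

    if (!buf)
        return NULL;
    while ((n = read(fd, buf + len, alloc - len)) > 0) {
        len += n;
        if (len == alloc) {                /* grow to keep reading */
            unsigned char *bigger = realloc(buf, alloc *= 2);
            if (!bigger) {
                free(buf);
                return NULL;
            }
            buf = bigger;
        }
    }
    if (n < 0) {
        free(buf);
        return NULL;
    }
    *lenp = len;
    return buf;
}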

If so, the question remains what to do about regular files with some
filter. Currently, we use mmap() for the original data but store the
processed data in memory anyway. The question is whether we want to keep
this use of mmap() here. Considering that it is a potential source of
repository corruption and these filters should not be used for big files
because they take a lot of memory anyway, I think we should get rid of
mmap() in file hashing completely, once we can process regular files
without filters in chunks.


Dmitry

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] don't use mmap() to hash files
  2010-02-15  7:48                                 ` Paolo Bonzini
@ 2010-02-15 12:25                                   ` Dmitry Potapov
  0 siblings, 0 replies; 84+ messages in thread
From: Dmitry Potapov @ 2010-02-15 12:25 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Johannes Schindelin, Zygo Blaxell, Ilari Liusvaara, Thomas Rast,
	Jonathan Nieder, git

On Mon, Feb 15, 2010 at 10:48 AM, Paolo Bonzini <bonzini@gnu.org> wrote:
> On 02/14/2010 08:56 PM, Dmitry Potapov wrote:
>>
>> It may not work without enough swap space
>
> No, the kernel will use the file itself as "swap" if it has to page out the
> memory (since the mmap is PROT_READ).

I was speaking about read(), when the whole file is read into memory. In
this case, you are going to use a lot of "swap" and it is not pretty...
That's why I said maybe we should have an option here...


Dmitry

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] don't use mmap() to hash files
  2010-02-15  5:48                                 ` Nicolas Pitre
@ 2010-02-15 19:19                                   ` Avery Pennarun
  2010-02-15 19:29                                     ` Nicolas Pitre
  0 siblings, 1 reply; 84+ messages in thread
From: Avery Pennarun @ 2010-02-15 19:19 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Dmitry Potapov, Johannes Schindelin, Zygo Blaxell,
	Ilari Liusvaara, Thomas Rast, Jonathan Nieder, git

On Mon, Feb 15, 2010 at 12:48 AM, Nicolas Pitre <nico@fluxnic.net> wrote:
> Ideally you don't keep volatile objects in huge packs.  That's why we
> have .keep to flag those packs that are huge and pure so as not to touch
> them anymore.

Of course the problem here is that as soon as you import a single
(possibly volatile) 2GB file, your pack becomes "huge."  So these
heuristics stop working very well and start to need some revision.

Thanks for the clarification on packing and repacking.  I might be
able to use this method to come up with a better repacking/pruning
algorithm in bup, and from there to perhaps forward it on to git.

Have fun,

Avery

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] don't use mmap() to hash files
  2010-02-15 19:19                                   ` Avery Pennarun
@ 2010-02-15 19:29                                     ` Nicolas Pitre
  0 siblings, 0 replies; 84+ messages in thread
From: Nicolas Pitre @ 2010-02-15 19:29 UTC (permalink / raw)
  To: Avery Pennarun
  Cc: Dmitry Potapov, Johannes Schindelin, Zygo Blaxell,
	Ilari Liusvaara, Thomas Rast, Jonathan Nieder, git

[-- Attachment #1: Type: TEXT/PLAIN, Size: 827 bytes --]

On Mon, 15 Feb 2010, Avery Pennarun wrote:

> On Mon, Feb 15, 2010 at 12:48 AM, Nicolas Pitre <nico@fluxnic.net> wrote:
> > Ideally you don't keep volatile objects in huge packs.  That's why we
> > have .keep to flag those packs that are huge and pure so as not to touch
> > them anymore.
> 
> Of course the problem here is that as soon as you import a single
> (possibly volatile) 2GB file, your pack becomes "huge."  So these
> heuristics stop working very well and start to need some revision.

You don't have to repack that often though.  In which case the 
single-object 2GB pack might be discarded before the next repack.  And 
loose objects are packed into a pack of their own more often than 
multiple packs being repacked into a single pack.  So I think the 
current heuristics should still work pretty well.


Nicolas

^ permalink raw reply	[flat|nested] 84+ messages in thread

* [PATCH] Teach "git add" and friends to be paranoid
  2010-02-14  1:18                 ` [PATCH] don't use mmap() to hash files Dmitry Potapov
                                     ` (2 preceding siblings ...)
  2010-02-14  3:05                   ` [PATCH v2] " Dmitry Potapov
@ 2010-02-18  1:16                   ` Junio C Hamano
  2010-02-18  1:20                     ` Junio C Hamano
                                       ` (3 more replies)
  3 siblings, 4 replies; 84+ messages in thread
From: Junio C Hamano @ 2010-02-18  1:16 UTC (permalink / raw)
  To: Dmitry Potapov
  Cc: Zygo Blaxell, Ilari Liusvaara, Thomas Rast, Jonathan Nieder, git

When creating a loose object, we normally mmap(2) the entire file, and
hash and then compress to write it out in two separate steps for
efficiency.

This is perfectly good for the intended use of git---nobody is supposed to
be insane enough to expect that it won't break anything to muck with the
contents of a file after telling git to index it and before getting the
control back from git.

But the nature of breakage caused by such an abuse is rather bad.  We will
end up with loose object files, whose names do not match what is stored
and recovered when uncompressed.

This teaches the index_mem() codepath to be paranoid and hash and compress
the data after reading it in core.  The contents hashed may not match the
contents of the file in an insane use case, but at least this way the
result will be internally consistent.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 sha1_file.c |   81 ++++++++++++++++++++++++++++++++++++++++++++++++-----------
 1 files changed, 66 insertions(+), 15 deletions(-)

diff --git a/sha1_file.c b/sha1_file.c
index 657825e..d8a7722 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2278,7 +2278,8 @@ static int create_tmpfile(char *buffer, size_t bufsiz, const char *filename)
 }
 
 static int write_loose_object(const unsigned char *sha1, char *hdr, int hdrlen,
-			      void *buf, unsigned long len, time_t mtime)
+			      void *buf, unsigned long len, time_t mtime,
+			      int paranoid)
 {
 	int fd, ret;
 	size_t size;
@@ -2286,6 +2287,7 @@ static int write_loose_object(const unsigned char *sha1, char *hdr, int hdrlen,
 	z_stream stream;
 	char *filename;
 	static char tmpfile[PATH_MAX];
+	git_SHA_CTX ctx;
 
 	filename = sha1_file_name(sha1);
 	fd = create_tmpfile(tmpfile, sizeof(tmpfile), filename);
@@ -2312,12 +2314,41 @@ static int write_loose_object(const unsigned char *sha1, char *hdr, int hdrlen,
 	stream.next_in = (unsigned char *)hdr;
 	stream.avail_in = hdrlen;
 	while (deflate(&stream, 0) == Z_OK)
-		/* nothing */;
+		; /* nothing */
 
 	/* Then the data itself.. */
-	stream.next_in = buf;
-	stream.avail_in = len;
-	ret = deflate(&stream, Z_FINISH);
+	if (paranoid) {
+		unsigned char stablebuf[262144];
+		char *bufptr = buf;
+		unsigned long remainder = len;
+
+		git_SHA1_Init(&ctx);
+		git_SHA1_Update(&ctx, hdr, hdrlen);
+
+		ret = Z_OK;
+		while (remainder) {
+			unsigned long chunklen = remainder;
+
+			if (sizeof(stablebuf) <= chunklen)
+				chunklen = sizeof(stablebuf);
+			memcpy(stablebuf, bufptr, chunklen);
+			git_SHA1_Update(&ctx, stablebuf, chunklen);
+			stream.next_in = stablebuf;
+			stream.avail_in = chunklen;
+			do {
+				ret = deflate(&stream, Z_NO_FLUSH);
+			} while (ret == Z_OK);
+			bufptr += chunklen;
+			remainder -= chunklen;
+		}
+		if (ret != Z_STREAM_END)
+			ret = deflate(&stream, Z_FINISH);
+	} else {
+		stream.next_in = buf;
+		stream.avail_in = len;
+		ret = deflate(&stream, Z_FINISH);
+	}
+
 	if (ret != Z_STREAM_END)
 		die("unable to deflate new object %s (%d)", sha1_to_hex(sha1), ret);
 
@@ -2327,6 +2358,12 @@ static int write_loose_object(const unsigned char *sha1, char *hdr, int hdrlen,
 
 	size = stream.total_out;
 
+	if (paranoid) {
+		unsigned char paranoid_sha1[20];
+		git_SHA1_Final(paranoid_sha1, &ctx);
+		if (hashcmp(paranoid_sha1, sha1))
+			die("hashed file is volatile");
+	}
 	if (write_buffer(fd, compressed, size) < 0)
 		die("unable to write sha1 file");
 	close_sha1_file(fd);
@@ -2344,7 +2381,7 @@ static int write_loose_object(const unsigned char *sha1, char *hdr, int hdrlen,
 	return move_temp_to_file(tmpfile, filename);
 }
 
-int write_sha1_file(void *buf, unsigned long len, const char *type, unsigned char *returnsha1)
+static int write_sha1_file_paranoid(void *buf, unsigned long len, const char *type, unsigned char *returnsha1, int paranoid)
 {
 	unsigned char sha1[20];
 	char hdr[32];
@@ -2358,7 +2395,12 @@ int write_sha1_file(void *buf, unsigned long len, const char *type, unsigned cha
 		hashcpy(returnsha1, sha1);
 	if (has_sha1_file(sha1))
 		return 0;
-	return write_loose_object(sha1, hdr, hdrlen, buf, len, 0);
+	return write_loose_object(sha1, hdr, hdrlen, buf, len, 0, paranoid);
+}
+
+int write_sha1_file(void *buf, unsigned long len, const char *type, unsigned char *returnsha1)
+{
+	return write_sha1_file_paranoid(buf, len, type, returnsha1, 0);
 }
 
 int force_object_loose(const unsigned char *sha1, time_t mtime)
@@ -2376,7 +2418,7 @@ int force_object_loose(const unsigned char *sha1, time_t mtime)
 	if (!buf)
 		return error("cannot read sha1_file for %s", sha1_to_hex(sha1));
 	hdrlen = sprintf(hdr, "%s %lu", typename(type), len) + 1;
-	ret = write_loose_object(sha1, hdr, hdrlen, buf, len, mtime);
+	ret = write_loose_object(sha1, hdr, hdrlen, buf, len, mtime, 0);
 	free(buf);
 
 	return ret;
@@ -2405,10 +2447,15 @@ int has_sha1_file(const unsigned char *sha1)
 	return has_loose_object(sha1);
 }
 
+#define INDEX_MEM_WRITE_OBJECT  01
+#define INDEX_MEM_PARANOID      02
+
 static int index_mem(unsigned char *sha1, void *buf, size_t size,
-		     int write_object, enum object_type type, const char *path)
+		     enum object_type type, const char *path, int flag)
 {
 	int ret, re_allocated = 0;
+	int write_object = flag & INDEX_MEM_WRITE_OBJECT;
+	int paranoid = flag & INDEX_MEM_PARANOID;
 
 	if (!type)
 		type = OBJ_BLOB;
@@ -2426,9 +2473,11 @@ static int index_mem(unsigned char *sha1, void *buf, size_t size,
 	}
 
 	if (write_object)
-		ret = write_sha1_file(buf, size, typename(type), sha1);
+		ret = write_sha1_file_paranoid(buf, size, typename(type),
+					       sha1, paranoid);
 	else
 		ret = hash_sha1_file(buf, size, typename(type), sha1);
+
 	if (re_allocated)
 		free(buf);
 	return ret;
@@ -2437,23 +2486,25 @@ static int index_mem(unsigned char *sha1, void *buf, size_t size,
 int index_fd(unsigned char *sha1, int fd, struct stat *st, int write_object,
 	     enum object_type type, const char *path)
 {
-	int ret;
+	int ret, flag;
 	size_t size = xsize_t(st->st_size);
 
+	flag = write_object ? INDEX_MEM_WRITE_OBJECT : 0;
 	if (!S_ISREG(st->st_mode)) {
 		struct strbuf sbuf = STRBUF_INIT;
 		if (strbuf_read(&sbuf, fd, 4096) >= 0)
-			ret = index_mem(sha1, sbuf.buf, sbuf.len, write_object,
-					type, path);
+			ret = index_mem(sha1, sbuf.buf, sbuf.len,
+					type, path, flag);
 		else
 			ret = -1;
 		strbuf_release(&sbuf);
 	} else if (size) {
 		void *buf = xmmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
-		ret = index_mem(sha1, buf, size, write_object, type, path);
+		flag |= INDEX_MEM_PARANOID;
+		ret = index_mem(sha1, buf, size, type, path, flag);
 		munmap(buf, size);
 	} else
-		ret = index_mem(sha1, NULL, size, write_object, type, path);
+		ret = index_mem(sha1, NULL, size, type, path, flag);
 	close(fd);
 	return ret;
 }
-- 
1.7.0.81.g58679

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* Re: [PATCH] Teach "git add" and friends to be paranoid
  2010-02-18  1:16                   ` [PATCH] Teach "git add" and friends to be paranoid Junio C Hamano
@ 2010-02-18  1:20                     ` Junio C Hamano
  2010-02-18 15:32                       ` Zygo Blaxell
  2010-02-18  1:38                     ` Jeff King
                                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 84+ messages in thread
From: Junio C Hamano @ 2010-02-18  1:20 UTC (permalink / raw)
  To: Dmitry Potapov
  Cc: Zygo Blaxell, Ilari Liusvaara, Thomas Rast, Jonathan Nieder, git

Junio C Hamano <gitster@pobox.com> writes:

> But the nature of breakage caused by such an abuse is rather bad.  We will
> end up with loose object files, whose names do not match what is stored
> and recovered when uncompressed.
>
> This teaches the index_mem() codepath to be paranoid and hash and compress
> the data after reading it in core.  The contents hashed may not match the
> contents of the file in an insane use case, but at least this way the
> result will be internally consistent.

With a small fix to the test program earlier in the thread, this seems to
protect the repository; I didn't bother to assess the performance impact
of the patch, though.

Here is the corrected test.

-- >8 --
#!/bin/sh
set -e

# Create an empty git repo in /tmp/git-test
rm -fr /tmp/git-test
mkdir /tmp/git-test
cd /tmp/git-test
git init

# Create a file named foo and add it to the repo
touch foo
git add foo

# Thread 1:  continuously modify foo:
while echo -n .; do
	dd if=/dev/urandom of=foo count=1024 bs=1k conv=notrunc >/dev/null 2>&1
done &

# Thread 2:  loop until the repo is corrupted
while git fsck; do
	# Note the implied 'git add' in 'commit -a'
	# It will do the same with explicit 'git add'
	git commit -a -m'Test' || break
done

# Kill thread 1, we don't need it any more
kill $!

# Success!  Well, sort of.
if git fsck
then
	echo Repository is corrupted.  Have a nice day.
else
	echo Repository is still healthy.  You are stupid.
fi

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Teach "git add" and friends to be paranoid
  2010-02-18  1:16                   ` [PATCH] Teach "git add" and friends to be paranoid Junio C Hamano
  2010-02-18  1:20                     ` Junio C Hamano
@ 2010-02-18  1:38                     ` Jeff King
  2010-02-18  4:55                       ` Nicolas Pitre
  2010-02-18 10:14                     ` Thomas Rast
  2010-02-19  8:28                     ` Dmitry Potapov
  3 siblings, 1 reply; 84+ messages in thread
From: Jeff King @ 2010-02-18  1:38 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Dmitry Potapov, Zygo Blaxell, Ilari Liusvaara, Thomas Rast,
	Jonathan Nieder, git

On Wed, Feb 17, 2010 at 05:16:23PM -0800, Junio C Hamano wrote:

> +	if (paranoid) {
> +		unsigned char stablebuf[262144];

Is 256K a bit big for allocating on the stack? Modern OS's seem to give
us at least a couple of megabytes (my Linux boxen all have 8M, and
even Solaris 8 seems to have that much). But PTHREAD_STACK_MIN is only
16K (I don't think it is possible to hit this code path in a thread
right now, but I'm not sure). And I have no idea what the situation is
on Windows.

I dunno if it is worth worrying about, but maybe somebody more clueful
than me can comment.

-Peff

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Teach "git add" and friends to be paranoid
  2010-02-18  1:38                     ` Jeff King
@ 2010-02-18  4:55                       ` Nicolas Pitre
  2010-02-18  5:36                         ` Junio C Hamano
  0 siblings, 1 reply; 84+ messages in thread
From: Nicolas Pitre @ 2010-02-18  4:55 UTC (permalink / raw)
  To: Jeff King
  Cc: Junio C Hamano, Dmitry Potapov, Zygo Blaxell, Ilari Liusvaara,
	Thomas Rast, Jonathan Nieder, git

On Wed, 17 Feb 2010, Jeff King wrote:

> On Wed, Feb 17, 2010 at 05:16:23PM -0800, Junio C Hamano wrote:
> 
> > +	if (paranoid) {
> > +		unsigned char stablebuf[262144];
> 
> Is 256K a bit big for allocating on the stack? Modern OS's seem to give
> us at least a couple of megabytes (my Linux boxen all have 8M, and
> even Solaris 8 seems to have that much). But PTHREAD_STACK_MIN is only
> 16K (I don't think it is possible to hit this code path in a thread
> right now, but I'm not sure). And I have no idea what the situation is
> on Windows.
> 
> I dunno if it is worth worrying about, but maybe somebody more clueful
> than me can comment.

It is likely to have better performance if the buffer is small enough to 
fit in the CPU L1 cache.  There are two sequential passes over the 
buffer: one for the SHA1 computation, and another for the compression, 
and currently they're sure to trash the L1 cache on each pass.

Of course that requires big enough objects to matter.


Nicolas

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Teach "git add" and friends to be paranoid
  2010-02-18  4:55                       ` Nicolas Pitre
@ 2010-02-18  5:36                         ` Junio C Hamano
  2010-02-18  7:27                           ` Wincent Colaiuta
  2010-02-22 12:59                           ` Paolo Bonzini
  0 siblings, 2 replies; 84+ messages in thread
From: Junio C Hamano @ 2010-02-18  5:36 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Jeff King, Dmitry Potapov, Zygo Blaxell, Ilari Liusvaara,
	Thomas Rast, Jonathan Nieder, git

Nicolas Pitre <nico@fluxnic.net> writes:

> It is likely to have better performance if the buffer is small enough to 
> fit in the CPU L1 cache.  There are two sequential passes over the 
> buffer: one for the SHA1 computation, and another for the compression, 
> and currently they're sure to trash the L1 cache on each pass.

I did a very unscientific test to hash about 14k paths (arch/ and fs/ from
the kernel source) using "git-hash-object -w --stdin-paths" into an empty
repository with varying sizes of paranoia buffer (quarter, 1, 4, 8 and
256kB) and saw 8-30% overhead.  256kB did hurt and around 4kB seemed to be
optimal for this small sample load.

In any case, with any size of paranoia, this hurts the sane use case, so
I'd introduce an expert switch to disable it, like this.

-- >8 --
When creating a loose object, we normally mmap(2) the entire file, and
hash and then compress to write it out in two separate steps for
efficiency.

This is perfectly good for the intended use of git---nobody is supposed to
be insane enough to expect that it won't break anything to muck with the
contents of a file after telling git to index it and before getting the
control back from git.

But the nature of breakage caused by such an abuse is rather bad.  We will
end up with loose object files, whose names do not match what is stored
and recovered when uncompressed.

This teaches the index_mem() codepath to be paranoid and hash and compress
the data after reading it in core.  The contents hashed may not match the
contents of the file in an insane use case, but at least this way the
result will be internally consistent.

People with saner use of git can regain performance by setting a new
configuration variable 'core.volatilefiles' to false to disable this
check.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 Documentation/config.txt |    7 ++++
 cache.h                  |    1 +
 config.c                 |    6 +++
 environment.c            |    1 +
 sha1_file.c              |   83 +++++++++++++++++++++++++++++++++++++--------
 5 files changed, 83 insertions(+), 15 deletions(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 52786c7..0295aee 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -117,6 +117,14 @@ core.fileMode::
 	the working copy are ignored; useful on broken filesystems like FAT.
 	See linkgit:git-update-index[1]. True by default.
 
+core.volatilefiles::
+	If you modify a file after telling git to record it (e.g. with
+	"git add") but before git finishes the request and gives the
+	control back to you, you may create a broken object (and of course
+	you can keep both halves ;-).  Setting this option to true will
+	tell git to be extra careful to detect the situation and abort.
+	Defaults to true.
+
 core.ignoreCygwinFSTricks::
 	This option is only used by Cygwin implementation of Git. If false,
 	the Cygwin stat() and lstat() functions are used. This may be useful
diff --git a/cache.h b/cache.h
index 231c06d..e5a87cf 100644
--- a/cache.h
+++ b/cache.h
@@ -497,6 +497,7 @@ extern int trust_ctime;
 extern int quote_path_fully;
 extern int has_symlinks;
 extern int ignore_case;
+extern int worktree_files_are_volatile;
 extern int assume_unchanged;
 extern int prefer_symlink_refs;
 extern int log_all_ref_updates;
diff --git a/config.c b/config.c
index 790405a..9898041 100644
--- a/config.c
+++ b/config.c
@@ -360,6 +360,12 @@ static int git_default_core_config(const char *var, const char *value)
 		trust_executable_bit = git_config_bool(var, value);
 		return 0;
 	}
+
+	if (!strcmp(var, "core.volatilefiles")) {
+		worktree_files_are_volatile = git_config_bool(var, value);
+		return 0;
+	}
+
 	if (!strcmp(var, "core.trustctime")) {
 		trust_ctime = git_config_bool(var, value);
 		return 0;
diff --git a/environment.c b/environment.c
index e278bce..5d0faf3 100644
--- a/environment.c
+++ b/environment.c
@@ -22,6 +22,7 @@ int is_bare_repository_cfg = -1; /* unspecified */
 int log_all_ref_updates = -1; /* unspecified */
 int warn_ambiguous_refs = 1;
 int repository_format_version;
+int worktree_files_are_volatile = 1; /* yuck */
 const char *git_commit_encoding;
 const char *git_log_output_encoding;
 int shared_repository = PERM_UMASK;
diff --git a/sha1_file.c b/sha1_file.c
index 52d1ead..e126179 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2335,14 +2335,18 @@ static int create_tmpfile(char *buffer, size_t bufsiz, const char *filename)
 }
 
 static int write_loose_object(const unsigned char *sha1, char *hdr, int hdrlen,
-			      void *buf, unsigned long len, time_t mtime)
+			      void *buf, unsigned long len, time_t mtime,
+			      int paranoid)
 {
 	int fd, size, ret;
 	unsigned char *compressed;
 	z_stream stream;
 	char *filename;
 	static char tmpfile[PATH_MAX];
+	git_SHA_CTX ctx;
 
+	if (!worktree_files_are_volatile)
+		paranoid = 0;
 	filename = sha1_file_name(sha1);
 	fd = create_tmpfile(tmpfile, sizeof(tmpfile), filename);
 	if (fd < 0) {
@@ -2366,12 +2370,41 @@ static int write_loose_object(const unsigned char *sha1, char *hdr, int hdrlen,
 	stream.next_in = (unsigned char *)hdr;
 	stream.avail_in = hdrlen;
 	while (deflate(&stream, 0) == Z_OK)
-		/* nothing */;
+		; /* nothing */
 
 	/* Then the data itself.. */
-	stream.next_in = buf;
-	stream.avail_in = len;
-	ret = deflate(&stream, Z_FINISH);
+	if (paranoid) {
+		unsigned char stablebuf[4096];
+		char *bufptr = buf;
+		unsigned long remainder = len;
+
+		git_SHA1_Init(&ctx);
+		git_SHA1_Update(&ctx, hdr, hdrlen);
+
+		ret = Z_OK;
+		while (remainder) {
+			unsigned long chunklen = remainder;
+
+			if (sizeof(stablebuf) <= chunklen)
+				chunklen = sizeof(stablebuf);
+			memcpy(stablebuf, bufptr, chunklen);
+			git_SHA1_Update(&ctx, stablebuf, chunklen);
+			stream.next_in = stablebuf;
+			stream.avail_in = chunklen;
+			do {
+				ret = deflate(&stream, Z_NO_FLUSH);
+			} while (ret == Z_OK);
+			bufptr += chunklen;
+			remainder -= chunklen;
+		}
+		if (ret != Z_STREAM_END)
+			ret = deflate(&stream, Z_FINISH);
+	} else {
+		stream.next_in = buf;
+		stream.avail_in = len;
+		ret = deflate(&stream, Z_FINISH);
+	}
+
 	if (ret != Z_STREAM_END)
 		die("unable to deflate new object %s (%d)", sha1_to_hex(sha1), ret);
 
@@ -2381,6 +2414,12 @@ static int write_loose_object(const unsigned char *sha1, char *hdr, int hdrlen,
 
 	size = stream.total_out;
 
+	if (paranoid) {
+		unsigned char paranoid_sha1[20];
+		git_SHA1_Final(paranoid_sha1, &ctx);
+		if (hashcmp(paranoid_sha1, sha1))
+			die("hashed file is volatile");
+	}
 	if (write_buffer(fd, compressed, size) < 0)
 		die("unable to write sha1 file");
 	close_sha1_file(fd);
@@ -2398,7 +2437,7 @@ static int write_loose_object(const unsigned char *sha1, char *hdr, int hdrlen,
 	return move_temp_to_file(tmpfile, filename);
 }
 
-int write_sha1_file(void *buf, unsigned long len, const char *type, unsigned char *returnsha1)
+static int write_sha1_file_paranoid(void *buf, unsigned long len, const char *type, unsigned char *returnsha1, int paranoid)
 {
 	unsigned char sha1[20];
 	char hdr[32];
@@ -2412,7 +2451,12 @@ int write_sha1_file(void *buf, unsigned long len, const char *type, unsigned cha
 		hashcpy(returnsha1, sha1);
 	if (has_sha1_file(sha1))
 		return 0;
-	return write_loose_object(sha1, hdr, hdrlen, buf, len, 0);
+	return write_loose_object(sha1, hdr, hdrlen, buf, len, 0, paranoid);
+}
+
+int write_sha1_file(void *buf, unsigned long len, const char *type, unsigned char *returnsha1)
+{
+	return write_sha1_file_paranoid(buf, len, type, returnsha1, 0);
 }
 
 int force_object_loose(const unsigned char *sha1, time_t mtime)
@@ -2430,7 +2474,7 @@ int force_object_loose(const unsigned char *sha1, time_t mtime)
 	if (!buf)
 		return error("cannot read sha1_file for %s", sha1_to_hex(sha1));
 	hdrlen = sprintf(hdr, "%s %lu", typename(type), len) + 1;
-	ret = write_loose_object(sha1, hdr, hdrlen, buf, len, mtime);
+	ret = write_loose_object(sha1, hdr, hdrlen, buf, len, mtime, 0);
 	free(buf);
 
 	return ret;
@@ -2467,10 +2511,15 @@ int has_sha1_file(const unsigned char *sha1)
 	return has_loose_object(sha1);
 }
 
+#define INDEX_MEM_WRITE_OBJECT  01
+#define INDEX_MEM_PARANOID      02
+
 static int index_mem(unsigned char *sha1, void *buf, size_t size,
-		     int write_object, enum object_type type, const char *path)
+		     enum object_type type, const char *path, int flag)
 {
 	int ret, re_allocated = 0;
+	int write_object = flag & INDEX_MEM_WRITE_OBJECT;
+	int paranoid = flag & INDEX_MEM_PARANOID;
 
 	if (!type)
 		type = OBJ_BLOB;
@@ -2488,9 +2537,11 @@ static int index_mem(unsigned char *sha1, void *buf, size_t size,
 	}
 
 	if (write_object)
-		ret = write_sha1_file(buf, size, typename(type), sha1);
+		ret = write_sha1_file_paranoid(buf, size, typename(type),
+					       sha1, paranoid);
 	else
 		ret = hash_sha1_file(buf, size, typename(type), sha1);
+
 	if (re_allocated)
 		free(buf);
 	return ret;
@@ -2499,23 +2550,25 @@ static int index_mem(unsigned char *sha1, void *buf, size_t size,
 int index_fd(unsigned char *sha1, int fd, struct stat *st, int write_object,
 	     enum object_type type, const char *path)
 {
-	int ret;
+	int ret, flag;
 	size_t size = xsize_t(st->st_size);
 
+	flag = write_object ? INDEX_MEM_WRITE_OBJECT : 0;
 	if (!S_ISREG(st->st_mode)) {
 		struct strbuf sbuf = STRBUF_INIT;
 		if (strbuf_read(&sbuf, fd, 4096) >= 0)
-			ret = index_mem(sha1, sbuf.buf, sbuf.len, write_object,
-					type, path);
+			ret = index_mem(sha1, sbuf.buf, sbuf.len,
+					type, path, flag);
 		else
 			ret = -1;
 		strbuf_release(&sbuf);
 	} else if (size) {
 		void *buf = xmmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
-		ret = index_mem(sha1, buf, size, write_object, type, path);
+		flag |= INDEX_MEM_PARANOID;
+		ret = index_mem(sha1, buf, size, type, path, flag);
 		munmap(buf, size);
 	} else
-		ret = index_mem(sha1, NULL, size, write_object, type, path);
+		ret = index_mem(sha1, NULL, size, type, path, flag);
 	close(fd);
 	return ret;
 }
-- 
1.7.0.81.g58679

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* Re: [PATCH] Teach "git add" and friends to be paranoid
  2010-02-18  5:36                         ` Junio C Hamano
@ 2010-02-18  7:27                           ` Wincent Colaiuta
  2010-02-18 16:18                             ` Zygo Blaxell
  2010-02-22 12:59                           ` Paolo Bonzini
  1 sibling, 1 reply; 84+ messages in thread
From: Wincent Colaiuta @ 2010-02-18  7:27 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Nicolas Pitre, Jeff King, Dmitry Potapov, Zygo Blaxell,
	Ilari Liusvaara, Thomas Rast, Jonathan Nieder, git

On 18/02/2010, at 06:36, Junio C Hamano wrote:

> Nicolas Pitre <nico@fluxnic.net> writes:
> 
>> It is likely to have better performance if the buffer is small enough to 
>> fit in the CPU L1 cache.  There are two sequential passes over the 
>> buffer: one for the SHA1 computation, and another for the compression, 
>> and currently they're sure to trash the L1 cache on each pass.
> 
> I did a very unscientific test to hash about 14k paths (arch/ and fs/ from
> the kernel source) using "git-hash-object -w --stdin-paths" into an empty
> repository with varying sizes of paranoia buffer (quarter, 1, 4, 8 and
> 256kB) and saw 8-30% overhead.  256kB did hurt and around 4kB seemed to be
> optimal for this small sample load.
> 
> In any case, with any size of paranoia, this hurts the sane use case, so
> I'd introduce an expert switch to disable it, like this.

Shouldn't a switch that hurts performance and is only needed for insane use cases default to off rather than on?

Cheers,
Wincent

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Teach "git add" and friends to be paranoid
  2010-02-18  1:16                   ` [PATCH] Teach "git add" and friends to be paranoid Junio C Hamano
  2010-02-18  1:20                     ` Junio C Hamano
  2010-02-18  1:38                     ` Jeff King
@ 2010-02-18 10:14                     ` Thomas Rast
  2010-02-18 18:16                       ` Junio C Hamano
  2010-02-19  8:28                     ` Dmitry Potapov
  3 siblings, 1 reply; 84+ messages in thread
From: Thomas Rast @ 2010-02-18 10:14 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Dmitry Potapov, Zygo Blaxell, Ilari Liusvaara, Jonathan Nieder, git

On Thursday 18 February 2010 02:16:23 Junio C Hamano wrote:
> When creating a loose object, we normally mmap(2) the entire file, and
> hash and then compress to write it out in two separate steps for
> efficiency.
> 
> This is perfectly good for the intended use of git---nobody is supposed to
> be insane enough to expect that it won't break anything to muck with the
> contents of a file after telling git to index it and before getting the
> control back from git.

This makes it sound as if the user is to blame, but IMHO we're just
not checking the input well enough.  The user should never be able to
corrupt the repository (without git noticing!) just by running a git
command and manipulating the worktree in parallel.  The file data at
any given time is just user input, and you also cannot (I hope;
otherwise let's fix it!) corrupt the repository merely by typoing some
command arguments.

(Mucking around in .git is an entirely different matter, but that is
off limits.)

> This teaches the index_mem() codepath to be paranoid and hash and compress
> the data after reading it in core.  The contents hashed may not match the
> contents of the file in an insane use case, but at least this way the
> result will be internally consistent.

Doesn't that trigger on windows, where xmmap() already makes a copy?

-- 
Thomas Rast
trast@{inf,student}.ethz.ch

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Teach "git add" and friends to be paranoid
  2010-02-18  1:20                     ` Junio C Hamano
@ 2010-02-18 15:32                       ` Zygo Blaxell
  2010-02-19 17:51                         ` Junio C Hamano
  0 siblings, 1 reply; 84+ messages in thread
From: Zygo Blaxell @ 2010-02-18 15:32 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Dmitry Potapov, Ilari Liusvaara, Thomas Rast, Jonathan Nieder, git

On Wed, Feb 17, 2010 at 05:20:00PM -0800, Junio C Hamano wrote:
> Junio C Hamano <gitster@pobox.com> writes:
> With a small fix to the test program earlier in the thread, this seems to
> protect the repository; I didn't bother to assess the performance impact
> of the patch, though.
> 
> Here is the corrected test.

Depends on what you mean by "corrected" I suppose.

> # Thread 2:  loop until the repo is corrupted
> while git fsck; do
> 	# Note the implied 'git add' in 'commit -a'
> 	# It will do the same with explicit 'git add'
> 	git commit -a -m'Test' || break

This is at least partly wrong--it will terminate prematurely if Thread
1 gets stalled and fails to modify 'foo' during the loop (git commit
normally refuses to commit a tree with no changes).  This can cause
the test for the corruption bug to return false success results.

If you add '--allow-empty' to the git commit command you will fix that
case, but there might be others.  

If git commit runs out of disk space, for example, the commit should
fail, but the repository should still not be corrupt.  Future commits
(for example after freeing some disk space) should eventually succeed.
Really, the original loop was correct, and this new one isn't.

> else
> 	echo Repository is still healthy.  You are stupid.

If git is working, you should never reach this line, because git
fsck should not fail after executing any sequence of git porcelain
operations--and this particular sequence is nothing but 'git commit'
in a single thread.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Teach "git add" and friends to be paranoid
  2010-02-18  7:27                           ` Wincent Colaiuta
@ 2010-02-18 16:18                             ` Zygo Blaxell
  2010-02-18 18:12                               ` Jonathan Nieder
  0 siblings, 1 reply; 84+ messages in thread
From: Zygo Blaxell @ 2010-02-18 16:18 UTC (permalink / raw)
  To: Wincent Colaiuta
  Cc: Junio C Hamano, Nicolas Pitre, Jeff King, Dmitry Potapov,
	Ilari Liusvaara, Thomas Rast, Jonathan Nieder, git

On Thu, Feb 18, 2010 at 08:27:28AM +0100, Wincent Colaiuta wrote:
> Shouldn't a switch that hurts performance and is only needed for insane use cases default to off rather than on?

While I don't disagree that default off might(*) be a good idea,
I do object to the categorization of this use case as 'insane'.

Neither the documentation for 'git add' nor its various aliases (e.g. 'git
commit' with paths or -a, etc) mentions that any use of 'git add'
might cause repository corruption under any circumstances.  Contrast with
examples of repository-corrupting pitfalls in the man pages of tools
such as 'git clone' and 'git gc'.

In fact, the language in the git add man page seems to suggest the
opposite, using words like "snapshot" and pointing out several times
that the index is intentionally immune to changes interleaved between
'git add' and 'git commit' commands.

Common sense (for Unix users) is that the index is not immune to changes
*during* git add, but nowhere in my wildest nightmares would I conceive
that changes in file contents during git add would *corrupt the
repository* and git would *fail to notice or give useful diagnostics*
until *days or weeks later* after the corruption has already *spread to
multiple repositories* through *git push with default options*.

Now, if you want to put that text in the man pages of 'git add' and
friends, and point out the paranoia switch there, I have nothing to
object to.

I also see nothing prohibiting concurrent file modification in some
reasonable revision-control use cases.  What happens if I do a 'git
commit -a' on, say, proprietary EDA tool data files or Microsoft Office
documents, and those tools choose an unfortunate moment to automatically
update files under revision control?  Granted, I can't really expect the
repo to contain usable data, but what I do expect is another commit, or
a rebased/amended commit, that fixes the mangled file's contents--not to
be required to rebase on the commit's parent everything that comes
after it, then purge my reflogs so 'git gc' will work again.

Working directories on network filesystems might do all kinds of strange
things, most of which aren't intentional.  It's one thing to commit a
useless tree, and quite another to unintentionally commit an irretrievable
one.

(*) "might" be a good idea because there's been some evidence to suggest
that a paranoid implementation of git add might perform better than the
mmap-based one in all cases, if more work was done than anyone seems
willing to do.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Teach "git add" and friends to be paranoid
  2010-02-18 16:18                             ` Zygo Blaxell
@ 2010-02-18 18:12                               ` Jonathan Nieder
  2010-02-18 18:35                                 ` Junio C Hamano
  0 siblings, 1 reply; 84+ messages in thread
From: Jonathan Nieder @ 2010-02-18 18:12 UTC (permalink / raw)
  To: Zygo Blaxell
  Cc: Wincent Colaiuta, Junio C Hamano, Nicolas Pitre, Jeff King,
	Dmitry Potapov, Ilari Liusvaara, Thomas Rast, git

Zygo Blaxell wrote:
> On Thu, Feb 18, 2010 at 08:27:28AM +0100, Wincent Colaiuta wrote:
>> Shouldn't a switch that hurts performance and is only needed for
>> insane use cases default to off rather than on?
>
> While I don't disagree that default off might(*) be a good idea,
> I do object to the categorization of this use case as 'insane'.

FWIW I think default off would not be a good idea.  This talk of
insane uses started from the idea that git is not so great for taking
automatic snapshots, but as you pointed out, other situations can
trigger this and the failure mode is pretty bad.

> (*) "might" be a good idea because there's been some evidence to suggest
> that a paranoid implementation of git add might perform better than the
> mmap-based one in all cases, if more work was done than anyone seems
> willing to do.

What you are saying here seems a bit handwavy.  If you have some
concrete ideas about what this paranoid implementation should look
like, I encourage you to write a rough patch.  The two patches so far
have indicated the relevant parts of sha1_file.c (index_fd at the
beginning and write_sha1_file at the end of the pipeline,
respectively).  Special cases include:

 - The blob being added to the repository is a special file (e.g.,
   pipe) specified on the 'git hash-object' command line: I think it’s
   fine if this is slow, but it should keep working.

 - The blob was generated in memory (e.g. 'git apply --cached').

 - autocrlf conversion is on.  This means scanning through the file to
   collect statistics on the dominant line ending, then scanning
   through again to convert the file.

 - some other filter is on.  This means sending the file as input to
   a command, then slurping it up somewhere until its length has been
   determined for the beginning of the blob header.

 - The blob being added to the repository is already in the repository,
   so it would be a waste of time to compress and write it again.

Some of these already don’t have great performance for large files
(autocrlf and filters), and I suspect there is room for improvement
for many of them.

Jonathan

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Teach "git add" and friends to be paranoid
  2010-02-18 10:14                     ` Thomas Rast
@ 2010-02-18 18:16                       ` Junio C Hamano
  2010-02-18 19:58                         ` Nicolas Pitre
  0 siblings, 1 reply; 84+ messages in thread
From: Junio C Hamano @ 2010-02-18 18:16 UTC (permalink / raw)
  To: Thomas Rast
  Cc: Dmitry Potapov, Zygo Blaxell, Ilari Liusvaara, Jonathan Nieder, git

Thomas Rast <trast@student.ethz.ch> writes:

> This makes it sound as if the user is to blame, but IMHO we're just
> not checking the input well enough.

Honesty is very good.  An alternative implementation that does not hurt
performance as much as the "paranoia" would, and checks "the input well
enough" would be very welcome.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Teach "git add" and friends to be paranoid
  2010-02-18 18:12                               ` Jonathan Nieder
@ 2010-02-18 18:35                                 ` Junio C Hamano
  0 siblings, 0 replies; 84+ messages in thread
From: Junio C Hamano @ 2010-02-18 18:35 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Zygo Blaxell, Wincent Colaiuta, Junio C Hamano, Nicolas Pitre,
	Jeff King, Dmitry Potapov, Ilari Liusvaara, Thomas Rast, git

Jonathan Nieder <jrnieder@gmail.com> writes:

> Zygo Blaxell wrote:
>> On Thu, Feb 18, 2010 at 08:27:28AM +0100, Wincent Colaiuta wrote:
>>> Shouldn't a switch that hurts performance and is only needed for
>>> insane use cases default to off rather than on?
>>
>> While I don't disagree that default off might(*) be a good idea,
>> I do object to the categorization of this use case as 'insane'.
>
> FWIW I think default off would not be a good idea.  This talk of
> insane uses started from the idea that git is not so great for taking
> automatic snapshots,...

But git is not so great for taking automatic snapshots, and that is a
fact.  You shouldn't be expecting such a thing, but more importantly, we
shouldn't be dishonest about it either.  git fanboys who spread "you can
use it to snapshot automatically" without thinking are actively doing a
disservice to the users by making them even more confused.

If we make this "safety" an opt-in feature, it would give people an excuse
to claim that git _by default_ stores a corrupt object, and when they make
such a claim, they may not reveal that it happens only when they abuse git
in a way it was not designed to be used to begin with.  And it may not
be because they are malicious, but merely because they are uninformed.

The approach of using paranoia by default regards "safety" not as
protecting the users from such abuse of their own making, but primarily
as a way to protect us from potential FUD.

What Wincent suggested would work very well if there are only honest and
informed people around in the world.  People who use git as intended would
not have to do anything special.  People who abuse git for their special
use case would be very aware of the fact that they are abusing git, and
more importantly, would also be honest about it.  They would not complain
that "git will store corrupt objects by default", and just flip the option
that is designed to support their use case, and they would get their
desired result.  Everybody is happy.

But such a happy ending would happen only in an ideal world, which
sadly we do not live in.  It is not 2005 anymore, and the risk of FUD
arising from uninformed abuses is very real.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Teach "git add" and friends to be paranoid
  2010-02-18 18:16                       ` Junio C Hamano
@ 2010-02-18 19:58                         ` Nicolas Pitre
  2010-02-18 20:11                           ` 16 gig, 350,000 file repository Bill Lear
                                             ` (2 more replies)
  0 siblings, 3 replies; 84+ messages in thread
From: Nicolas Pitre @ 2010-02-18 19:58 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Thomas Rast, Dmitry Potapov, Zygo Blaxell, Ilari Liusvaara,
	Jonathan Nieder, git

On Thu, 18 Feb 2010, Junio C Hamano wrote:

> Thomas Rast <trast@student.ethz.ch> writes:
> 
> > This makes it sound as if the user is to blame, but IMHO we're just
> > not checking the input well enough.
> 
> Honesty is very good.  An alternative implementation that does not hurt
> performance as much as the "paranoia" would, and checks "the input well
> enough" would be very welcome.

Can't we rely on the mtime of the source file?  Sample it before 
starting hashing it, then make sure it didn't change when done.
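
A rough sketch of that, outside of git's own code and with a
hypothetical hash_file() helper standing in for the real hashing path:

	#include <stdio.h>
	#include <sys/stat.h>

	/* Hypothetical helper: hash the file at 'path' into 'sha1'. */
	extern int hash_file(const char *path, unsigned char sha1[20]);

	static int stat_changed(const struct stat *a, const struct stat *b)
	{
		return a->st_mtime != b->st_mtime ||
		       a->st_size != b->st_size ||
		       a->st_ino != b->st_ino;
	}

	int hash_file_checked(const char *path, unsigned char sha1[20])
	{
		struct stat before, after;

		if (stat(path, &before) < 0)
			return -1;
		if (hash_file(path, sha1) < 0)
			return -1;
		if (stat(path, &after) < 0 || stat_changed(&before, &after)) {
			fprintf(stderr, "%s changed while being hashed\n", path);
			return -1;
		}
		return 0;
	}

This only catches changes that the filesystem timestamps (or size) can
express, so it inherits whatever granularity the filesystem offers.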


Nicolas

^ permalink raw reply	[flat|nested] 84+ messages in thread

* 16 gig, 350,000 file repository
  2010-02-18 19:58                         ` Nicolas Pitre
@ 2010-02-18 20:11                           ` Bill Lear
  2010-02-18 20:58                             ` Nicolas Pitre
  2010-02-18 20:14                           ` [PATCH] Teach "git add" and friends to be paranoid Peter Harris
  2010-02-18 20:17                           ` Junio C Hamano
  2 siblings, 1 reply; 84+ messages in thread
From: Bill Lear @ 2010-02-18 20:11 UTC (permalink / raw)
  To: git

I'm starting a new, large project and would like a quick bit of advice.

Bringing in a set of test cases and other files from a ClearCase
repository resulted in a 350,000 file git repo of about 16 gigabytes.

The time to clone over a fast network was about 250 minutes.  I could
not verify if the repo had been packed properly, etc.

However, we are thinking of using submodules or subtrees, to allow a
person to selectively clone only a part of the repo they need for
their work.  Is there a way to do this without submodules/subtrees?

We also need to be able to branch the entire repo, which I think would
make submodules kind of a pain, but don't know...

What is the current thinking on these issues in the git community?


Bill

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Teach "git add" and friends to be paranoid
  2010-02-18 19:58                         ` Nicolas Pitre
  2010-02-18 20:11                           ` 16 gig, 350,000 file repository Bill Lear
@ 2010-02-18 20:14                           ` Peter Harris
  2010-02-18 20:17                           ` Junio C Hamano
  2 siblings, 0 replies; 84+ messages in thread
From: Peter Harris @ 2010-02-18 20:14 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Junio C Hamano, Thomas Rast, Dmitry Potapov, Zygo Blaxell,
	Ilari Liusvaara, Jonathan Nieder, git

On Thu, Feb 18, 2010 at 2:58 PM, Nicolas Pitre wrote:
>
> Can't we rely on the mtime of the source file?  Sample it before
> starting hashing it, then make sure it didn't change when done.

Some filesystems (eg FAT) have an abysmal mtime granularity (2
seconds). The same race exists on other filesystems, albeit with a
much narrower window of opportunity.

Peter Harris

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Teach "git add" and friends to be paranoid
  2010-02-18 19:58                         ` Nicolas Pitre
  2010-02-18 20:11                           ` 16 gig, 350,000 file repository Bill Lear
  2010-02-18 20:14                           ` [PATCH] Teach "git add" and friends to be paranoid Peter Harris
@ 2010-02-18 20:17                           ` Junio C Hamano
  2010-02-18 21:30                             ` Nicolas Pitre
  2 siblings, 1 reply; 84+ messages in thread
From: Junio C Hamano @ 2010-02-18 20:17 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Thomas Rast, Dmitry Potapov, Zygo Blaxell, Ilari Liusvaara,
	Jonathan Nieder, git

Nicolas Pitre <nico@fluxnic.net> writes:

>> Honesty is very good.  An alternative implementation that does not hurt
>> performance as much as the "paranoia" would, and checks "the input well
>> enough" would be very welcome.
>
> Can't we rely on the mtime of the source file?  Sample it before 
> starting hashing it, then make sure it didn't change when done.

I suspect that opening to mmap(2), hashing once to compute the object
name, and deflating it to write it out, will all happen within the same
second, unless you are talking about a really huge file, or you started at
very near a second boundary.

I am perfectly Ok that it will have false negatives that way, but if the
probability of it catching problems is too low because git is too fast
relative to the file timestamp granularity, then it doesn't sound very
useful in practice, unless of course you are on better filesystems.

It won't have false positives and that is a very good thing, though.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: 16 gig, 350,000 file repository
  2010-02-18 20:11                           ` 16 gig, 350,000 file repository Bill Lear
@ 2010-02-18 20:58                             ` Nicolas Pitre
  2010-02-19  9:27                               ` Erik Faye-Lund
  2010-02-22 22:20                               ` Bill Lear
  0 siblings, 2 replies; 84+ messages in thread
From: Nicolas Pitre @ 2010-02-18 20:58 UTC (permalink / raw)
  To: Bill Lear; +Cc: git

On Thu, 18 Feb 2010, Bill Lear wrote:

> I'm starting a new, large project and would like a quick bit of advice.
> 
> Bringing in a set of test cases and other files from a ClearCase
> repository resulted in a 350,000 file git repo of about 16 gigabytes.
> 
> The time to clone over a fast network was about 250 minutes.  I could
> not verify if the repo had been packed properly, etc.

I'd start from there.  If you didn't do a 'git gc --aggressive' after 
the import then it is quite likely that your repo isn't well packed.

Of course you'll need a big machine to repack this.  But that should be 
needed only once.


Nicolas

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Teach "git add" and friends to be paranoid
  2010-02-18 20:17                           ` Junio C Hamano
@ 2010-02-18 21:30                             ` Nicolas Pitre
  2010-02-19  1:04                               ` Jonathan Nieder
  0 siblings, 1 reply; 84+ messages in thread
From: Nicolas Pitre @ 2010-02-18 21:30 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Thomas Rast, Dmitry Potapov, Zygo Blaxell, Ilari Liusvaara,
	Jonathan Nieder, git

On Thu, 18 Feb 2010, Junio C Hamano wrote:

> Nicolas Pitre <nico@fluxnic.net> writes:
> 
> >> Honesty is very good.  An alternative implementation that does not hurt
> >> performance as much as the "paranoia" would, and checks "the input well
> >> enough" would be very welcome.
> >
> > Can't we rely on the mtime of the source file?  Sample it before 
> > starting hashing it, then make sure it didn't change when done.
> 
> I suspect that opening to mmap(2), hashing once to compute the object
> name, and deflating it to write it out, will all happen within the same
> second, unless you are talking about a really huge file, or you started at
> very near a second boundary.

How is the index dealing with this?  Surely if a file is added to the 
index and modified within the same second then 'git status' will fail to 
notice the changes.  I'm not familiar enough with that part of Git.

Alternatively, you could use the initial mtime sample to determine the 
filesystem's time granularity by noticing how many LSBs are zero.  
Let's say FAT should have a granularity of one second.  Then, if the
mtime of the file is less than one second in the past when we start to
hash, just wait for one second.  If one second later the mtime has
changed and is still less than a second away, abort.  If after the hash
the mtime has changed, abort.

On a recent filesystem, it is likely that the mtime granularity is a
nanosecond.  Nevertheless the above algorithm should work just the same,
and since it is unlikely that the mtime will be within the current
nanosecond, the probability of having to do an initial wait is almost
zero.  On kernels without hires timers the granularity will be more
like 10 ms.

Of course you might be unlucky and the initial mtime sample happens to 
be right on a whole second even on a high resolution mtime filesystem, 
in which case the delay test will consider one second instead of 10 ms 
or whatever.  But the probability is rather small that you'll end up
with all the sub-second bits being zero, causing a longer delay than
actually necessary, and this would matter only for files that would have
been modified within that second.  I don't think there is a reliable way
to query a filesystem+OS timestamping granularity.
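
A sketch of just the granularity guess (assuming the POSIX.1-2008
st_mtim timespec field; as noted above, a single sample can guess wrong
when a high-resolution stamp happens to land exactly on a whole second):

	#include <sys/stat.h>
	#include <time.h>

	/* Infer a plausible mtime granularity from one timespec sample
	 * by looking at how many of its low-order bits are zero. */
	static long mtime_granularity_ns(const struct timespec *ts)
	{
		if (ts->tv_nsec == 0)
			return 1000000000L;	/* whole-second stamps, e.g. ext3 */
		if (ts->tv_nsec % 1000000 == 0)
			return 1000000L;	/* millisecond stamps */
		return 1L;			/* treat as high resolution */
	}

The result would then drive the "wait out the window before hashing"
logic described above.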


Nicolas

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Teach "git add" and friends to be paranoid
  2010-02-18 21:30                             ` Nicolas Pitre
@ 2010-02-19  1:04                               ` Jonathan Nieder
  2010-02-19 15:26                                 ` Zygo Blaxell
  0 siblings, 1 reply; 84+ messages in thread
From: Jonathan Nieder @ 2010-02-19  1:04 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Junio C Hamano, Thomas Rast, Dmitry Potapov, Zygo Blaxell,
	Ilari Liusvaara, git

Nicolas Pitre wrote:
> On Thu, 18 Feb 2010, Junio C Hamano wrote:

>> I suspect that opening to mmap(2), hashing once to compute the object
>> name, and deflating it to write it out, will all happen within the same
>> second, unless you are talking about a really huge file, or you started at
>> very near a second boundary.
>
> How is the index dealing with this?  Surely if a file is added to the 
> index and modified within the same second then 'git status' will fail to 
> notice the changes.  I'm not familiar enough with that part of Git.

See Documentation/technical/racy-git.txt and t/t0010-racy-git.sh.

Short version: in the awful case, the timestamp of the index is the
same as (or before) the timestamp of the file.  Git will notice this
and re-hash the tracked file.
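
The gist of that check, with made-up names and ignoring the other stat
fields git actually compares, is roughly:

	#include <time.h>

	/* An entry whose mtime is not older than the index itself is
	 * suspect ("racily clean") and gets re-hashed even though its
	 * cached stat data still matches. */
	static int racily_clean(time_t entry_mtime, time_t index_mtime)
	{
		return entry_mtime >= index_mtime;
	}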

> Alternatively, you could use the initial mtime sample to determine the 
> filesystem's time granularity by noticing how many LSBs are zero.

Yuck.

If such detection is going to happen, I would prefer to see it used
once to determine the initial value of a per-repository configuration
variable asking to speed up ‘git add’ and friends.

Note that we are currently not using the nsec timestamps to make any
important decisions, probably because in some filesystems they are
unreliable when inode cache entries are evicted (not sure about the
current status; does this work in NFS, for example?).  Within the
short runtime of ‘git add’, I guess this would not be as much of a
problem.

Jonathan

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Teach "git add" and friends to be paranoid
  2010-02-18  1:16                   ` [PATCH] Teach "git add" and friends to be paranoid Junio C Hamano
                                       ` (2 preceding siblings ...)
  2010-02-18 10:14                     ` Thomas Rast
@ 2010-02-19  8:28                     ` Dmitry Potapov
  2010-02-19 17:52                       ` Junio C Hamano
  3 siblings, 1 reply; 84+ messages in thread
From: Dmitry Potapov @ 2010-02-19  8:28 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Zygo Blaxell, Ilari Liusvaara, Thomas Rast, Jonathan Nieder, git

Hi Junio,

I am sorry I have not had time to reply earlier. I think it is possible
to avoid the overhead of being on the safe side in a few common cases.
Here is a patch. I have not had time to test it, but the changes appear
to be trivial.

-- >8 --
From 3e53610a41c4aad458dff13135a73bb4944f456b Mon Sep 17 00:00:00 2001
From: Dmitry Potapov <dpotapov@gmail.com>
Date: Fri, 19 Feb 2010 11:00:51 +0300
Subject: [PATCH] speed up "git add" by avoiding the paranoid mode

While the paranoid mode protects the git repository from corruption in the
case when the added file is changed simultaneously with running "git add",
it has some overhead. However, in a few common cases, it is possible to
avoid this mode and still be on the safe side:

1. If mmap() is implemented as reading the whole file in memory.

2. If the whole file was read in memory as result of applying some filter.

3. If the added file is small, it is faster to use read() than mmap().

Signed-off-by: Dmitry Potapov <dpotapov@gmail.com>
---
 sha1_file.c |    5 ++++-
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/sha1_file.c b/sha1_file.c
index d8a7722..4efeb21 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2469,6 +2469,7 @@ static int index_mem(unsigned char *sha1, void *buf, size_t size,
 		                   write_object ? safe_crlf : 0)) {
 			buf = strbuf_detach(&nbuf, &size);
 			re_allocated = 1;
+			paranoid = 0;
 		}
 	}
 
@@ -2490,7 +2491,7 @@ int index_fd(unsigned char *sha1, int fd, struct stat *st, int write_object,
 	size_t size = xsize_t(st->st_size);
 
 	flag = write_object ? INDEX_MEM_WRITE_OBJECT : 0;
-	if (!S_ISREG(st->st_mode)) {
+	if (!S_ISREG(st->st_mode) || size < 262144) {
 		struct strbuf sbuf = STRBUF_INIT;
 		if (strbuf_read(&sbuf, fd, 4096) >= 0)
 			ret = index_mem(sha1, sbuf.buf, sbuf.len,
@@ -2500,7 +2501,9 @@ int index_fd(unsigned char *sha1, int fd, struct stat *st, int write_object,
 		strbuf_release(&sbuf);
 	} else if (size) {
 		void *buf = xmmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
+#ifndef NO_MMAP
 		flag |= INDEX_MEM_PARANOID;
+#endif
 		ret = index_mem(sha1, buf, size, type, path, flag);
 		munmap(buf, size);
 	} else
-- 
1.7.0

-- >8 --

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* Re: 16 gig, 350,000 file repository
  2010-02-18 20:58                             ` Nicolas Pitre
@ 2010-02-19  9:27                               ` Erik Faye-Lund
  2010-02-22 22:20                               ` Bill Lear
  1 sibling, 0 replies; 84+ messages in thread
From: Erik Faye-Lund @ 2010-02-19  9:27 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Bill Lear, git

On Thu, Feb 18, 2010 at 9:58 PM, Nicolas Pitre <nico@fluxnic.net> wrote:
> On Thu, 18 Feb 2010, Bill Lear wrote:
>
>> I'm starting a new, large project and would like a quick bit of advice.
>>
>> Bringing in a set of test cases and other files from a ClearCase
>> repository resulted in a 350,000 file git repo of about 16 gigabytes.
>>
>> The time to clone over a fast network was about 250 minutes.  I could
>> not verify if the repo had been packed properly, etc.
>
> I'd start from there.  If you didn't do a 'git gc --aggressive' after
> the import then it is quite likely that your repo isn't well packed.
>
> Of course you'll need a big machine to repack this.

Something like this? http://www.gadgetopia.com/images/big_machine.jpg

-- 
Erik "kusma" Faye-Lund

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Teach "git add" and friends to be paranoid
  2010-02-19  1:04                               ` Jonathan Nieder
@ 2010-02-19 15:26                                 ` Zygo Blaxell
  2010-02-19 17:52                                   ` Junio C Hamano
  0 siblings, 1 reply; 84+ messages in thread
From: Zygo Blaxell @ 2010-02-19 15:26 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Nicolas Pitre, Junio C Hamano, Thomas Rast, Dmitry Potapov,
	Ilari Liusvaara, git

On Thu, Feb 18, 2010 at 07:04:56PM -0600, Jonathan Nieder wrote:
> Nicolas Pitre wrote:
> > On Thu, 18 Feb 2010, Junio C Hamano wrote:
> >> I suspect that opening to mmap(2), hashing once to compute the object
> >> name, and deflating it to write it out, will all happen within the same
> >> second, unless you are talking about a really huge file, or you started at
> >> very near a second boundary.
> >
> > How is the index dealing with this?  Surely if a file is added to the 
> > index and modified within the same second then 'git status' will fail to 
> > notice the changes.  I'm not familiar enough with that part of Git.
> 
> See Documentation/technical/racy-git.txt and t/t0010-racy-git.sh.
> 
> Short version: in the awful case, the timestamp of the index is the
> same as (or before) the timestamp of the file.  Git will notice this
> and re-hash the tracked file.

As far as I can tell, the index doesn't handle this case at all.

Suppose the file is modified during git add near the beginning of the
file, after git add has read that part of the file, but the modifications
finish before git add does.  Now the mtime of the file is earlier
than the index timestamp, but the file contents don't match the index.
This holds even if the objects git adds to the index aren't corrupted.
Actually right now you can have all four combinations:  index up to date
or not, and object matching its sha1 hash or not, depending on where and
when you modify data during an index update.

racy-git.txt doesn't discuss concurrent modification of files with the
index.  It only discusses low-resolution file timestamps and modifications
at times that are close to, but not concurrent with, index modifications.

Git probably also doesn't handle things like NTP time corrections
(especially those where time moves backward by sub-second intervals) and
mismatched server/client clocks on remote filesystems either (mind you,
I know of no SCM that currently handles that case, and CVS in particular
is unusually bad at it).

Personally, I find the combination of nanosecond-precision timestamps
and network file systems amusing.  At nanosecond precision, relativistic
effects start to matter across a volume of space the size of my laptop.
I'm not sure how timestamps at any resolution could be a reliable metric
for detecting changes to file contents in the general case.  A valuable
hint in many cases, but not authoritative (unless they all come from a
single monotonic high-resolution clock guaranteed to increment faster than
git--but they don't).

rsync solves this sort of problem with a 'modification window' parameter,
which is a time interval that is "close enough" to consider two timestamps
to be equal.  Some of rsync's use cases set that window to six months.
Git would use a modification window for the opposite reason rsync
does--rsync uses the window to avoid unnecessarily examining files that
have different timestamps, while git would use it to re-examine files
even when it appears to be unnecessary.

Git probably wants the modification window to be the maximum clock
offset between a network filesystem client and server plus the minimum
representable interval in the filesystem's timestamp data type--which
is a value git couldn't possibly know for some cases, so it needs input
from the user.
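
A sketch of the comparison itself (the window value would have to come
from user configuration as argued above, and the names here are made
up):

	#include <time.h>

	/* Slack, in seconds, covering clock skew plus timestamp
	 * granularity; supplied by the user in this sketch. */
	static time_t modification_window;

	/* Re-examine a file unless its mtime is older than the index
	 * timestamp by more than the modification window. */
	static int needs_reexamination(time_t file_mtime, time_t index_mtime)
	{
		return file_mtime + modification_window >= index_mtime;
	}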

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Teach "git add" and friends to be paranoid
  2010-02-18 15:32                       ` Zygo Blaxell
@ 2010-02-19 17:51                         ` Junio C Hamano
  0 siblings, 0 replies; 84+ messages in thread
From: Junio C Hamano @ 2010-02-19 17:51 UTC (permalink / raw)
  To: Zygo Blaxell
  Cc: Dmitry Potapov, Ilari Liusvaara, Thomas Rast, Jonathan Nieder, git

Zygo Blaxell <zblaxell@gibbs.hungrycats.org> writes:

> If git commit runs out of disk space, for example, the commit should
> fail, but the repository should still not be corrupt.  Future commits
> (for example after freeing some disk space) should eventually succeed.

That is true.  It is Ok to create a corrupt object as long as we run an
equivalent of fsck immediately after the object creation to catch the
breakage and prevent it from propagating further.

That is essentially what the "paranoid" switch does, but it adds overhead
that is unnecessary for the use case we primarily target in git.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Teach "git add" and friends to be paranoid
  2010-02-19  8:28                     ` Dmitry Potapov
@ 2010-02-19 17:52                       ` Junio C Hamano
  2010-02-20 19:23                         ` Junio C Hamano
  0 siblings, 1 reply; 84+ messages in thread
From: Junio C Hamano @ 2010-02-19 17:52 UTC (permalink / raw)
  To: Dmitry Potapov
  Cc: Zygo Blaxell, Ilari Liusvaara, Thomas Rast, Jonathan Nieder, git

Dmitry Potapov <dpotapov@gmail.com> writes:

> ... I think it is possible
> to avoid the overhead of being on the safe side in a few common cases.
> Here is a patch. I have not had time to test it, but the changes appear
> to be trivial.

Yeah, these are obvious "paranoia not needed" cases.

How much "speed-up" are we talking about, though?  Can we quantify?  I
personally think it is not even worth quantifying; we can instead simply
say "avoid unnecessary computation" without saying "speed up", though ;-)

Thanks.

> -- >8 --
> From 3e53610a41c4aad458dff13135a73bb4944f456b Mon Sep 17 00:00:00 2001
> From: Dmitry Potapov <dpotapov@gmail.com>
> Date: Fri, 19 Feb 2010 11:00:51 +0300
> Subject: [PATCH] speed up "git add" by avoiding the paranoid mode
>
> While the paranoid mode protects the git repository from corruption in the
> case when the added file is changed simultaneously with running "git add",
> it has some overhead. However, in a few common cases, it is possible to
> avoid this mode and still be on the safe side:
>
> 1. If mmap() is implemented as reading the whole file in memory.
>
> 2. If the whole file was read in memory as result of applying some filter.
>
> 3. If the added file is small, it is faster to use read() than mmap().
>
> Signed-off-by: Dmitry Potapov <dpotapov@gmail.com>
> ---
>  sha1_file.c |    5 ++++-
>  1 files changed, 4 insertions(+), 1 deletions(-)
>
> diff --git a/sha1_file.c b/sha1_file.c
> index d8a7722..4efeb21 100644
> --- a/sha1_file.c
> +++ b/sha1_file.c
> @@ -2469,6 +2469,7 @@ static int index_mem(unsigned char *sha1, void *buf, size_t size,
>  		                   write_object ? safe_crlf : 0)) {
>  			buf = strbuf_detach(&nbuf, &size);
>  			re_allocated = 1;
> +			paranoid = 0;
>  		}
>  	}
>  
> @@ -2490,7 +2491,7 @@ int index_fd(unsigned char *sha1, int fd, struct stat *st, int write_object,
>  	size_t size = xsize_t(st->st_size);
>  
>  	flag = write_object ? INDEX_MEM_WRITE_OBJECT : 0;
> -	if (!S_ISREG(st->st_mode)) {
> +	if (!S_ISREG(st->st_mode) || size < 262144) {
>  		struct strbuf sbuf = STRBUF_INIT;
>  		if (strbuf_read(&sbuf, fd, 4096) >= 0)
>  			ret = index_mem(sha1, sbuf.buf, sbuf.len,
> @@ -2500,7 +2501,9 @@ int index_fd(unsigned char *sha1, int fd, struct stat *st, int write_object,
>  		strbuf_release(&sbuf);
>  	} else if (size) {
>  		void *buf = xmmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
> +#ifndef NO_MMAP
>  		flag |= INDEX_MEM_PARANOID;
> +#endif
>  		ret = index_mem(sha1, buf, size, type, path, flag);
>  		munmap(buf, size);
>  	} else
> -- 
> 1.7.0
>
> -- >8 --

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Teach "git add" and friends to be paranoid
  2010-02-19 15:26                                 ` Zygo Blaxell
@ 2010-02-19 17:52                                   ` Junio C Hamano
  2010-02-19 19:08                                     ` Zygo Blaxell
  0 siblings, 1 reply; 84+ messages in thread
From: Junio C Hamano @ 2010-02-19 17:52 UTC (permalink / raw)
  To: Zygo Blaxell
  Cc: Jonathan Nieder, Nicolas Pitre, Thomas Rast, Dmitry Potapov,
	Ilari Liusvaara, git

Zygo Blaxell <zblaxell@gibbs.hungrycats.org> writes:

> As far as I can tell, the index doesn't handle this case at all.
> ...
> racy-git.txt doesn't discuss concurrent modification of files with the
> index.  It only discusses low-resolution file timestamps and modifications
> at times that are close to, but not concurrent with, index modifications.

Correct.  As I said a few times in this thread, a use case with concurrent
modifications is outside of the original design scope of git.

As you may have realized, the racy-git solution actually _relies_ on the
lack of concurrent modifications.  The document does not even _talk_ about
this assumption, exactly because at least back then it was common
knowledge shared by everybody that users are not supposed to muck with
files in the work tree until they get control back from git, and that
they get to keep both halves if they end up with a broken loose object by
doing so ;-).

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Teach "git add" and friends to be paranoid
  2010-02-19 17:52                                   ` Junio C Hamano
@ 2010-02-19 19:08                                     ` Zygo Blaxell
  0 siblings, 0 replies; 84+ messages in thread
From: Zygo Blaxell @ 2010-02-19 19:08 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jonathan Nieder, Nicolas Pitre, Thomas Rast, Dmitry Potapov,
	Ilari Liusvaara, git

On Fri, Feb 19, 2010 at 09:52:10AM -0800, Junio C Hamano wrote:
> As you may have realized, racy-git solution actually _relies_ on lack of
> concurrent modifications.  

It relies on 1) no concurrent modifications, 2) strictly increasing
timestamps, and 3) consistent timestamps between the working directory
and index.  Believe it or not, out of those three I actually think the
first assumption is the most reasonable, because it's something a
user could prevent without using administrative privileges or changing
filesystems.

For performance reasons I frequently work with GIT_DIR on ext3 and working
directory on tmpfs (though a mix of ext3 and ext4 has the same issues).
One has second resolution, the other nanosecond.  If two event timestamps
with different precisions are compared as values at maximum precision,
the later event might appear to occur before the earlier one.

I try to avoid using anything that relies on mtime on network filesystems
because a lot more than just Git breaks in those cases.

NTP breakage is rarer, mostly because NTP step events are rare, and
negative ones even rarer.  I've seen negative steps on laptops when they
lose time accuracy during suspend or when Internet access is unavailable
(or asymmetrically laggy), then get it back later.  On the other hand,
I don't actually test for the effects of this event anywhere, so I can't
say it hasn't caused breakage I don't know about, although I can say it
hasn't caused breakage I do know about either.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Teach "git add" and friends to be paranoid
  2010-02-19 17:52                       ` Junio C Hamano
@ 2010-02-20 19:23                         ` Junio C Hamano
  2010-02-21  7:21                           ` Dmitry Potapov
  0 siblings, 1 reply; 84+ messages in thread
From: Junio C Hamano @ 2010-02-20 19:23 UTC (permalink / raw)
  To: Dmitry Potapov
  Cc: Zygo Blaxell, Ilari Liusvaara, Thomas Rast, Jonathan Nieder, git

Junio C Hamano <gitster@pobox.com> writes:

> Dmitry Potapov <dpotapov@gmail.com> writes:
>
>> ... I think it is possible
>> to avoid the overhead of being on the safe side in a few common cases.
>> Here is a patch. I have not had time to test it, but the changes appear
>> to be trivial.
>
> Yeah, these are obvious "paranoia not needed" cases.
>

Actually the "if it is smaller than 256k" part is not quite obvious.

>> @@ -2490,7 +2491,7 @@ int index_fd(unsigned char *sha1, int fd, struct stat *st, int write_object,
>>  	size_t size = xsize_t(st->st_size);
>>  
>>  	flag = write_object ? INDEX_MEM_WRITE_OBJECT : 0;
>> -	if (!S_ISREG(st->st_mode)) {
>> +	if (!S_ISREG(st->st_mode) || size < 262144) {
>>  		struct strbuf sbuf = STRBUF_INIT;
>>  		if (strbuf_read(&sbuf, fd, 4096) >= 0)
>>  			ret = index_mem(sha1, sbuf.buf, sbuf.len,

INDEX_MEM_PARANOID is never given to index_mem() in this codepath, so the
trade-offs look like this:

 * In non-paranoia mode, your conjecture is that between

   - malloc, read, SHA-1, deflate, and then free; and
   - mmap, SHA-1, deflate and then munmap

   the former is faster for small files that can fit in core without
   thrashing.

 * In paranoia mode, your conjecture is that between

   - malloc, read, SHA-1, deflate, and then free; and
   - mmap, SHA-1, SHA-1 and deflate in chunks, and then munmap

   the former is faster for small files that can fit in core without
   thrashing.

The "mmap" strategy has larger cost in paranoia mode compared to its cost
in non-paranoia mode.  The "read" strategy on the other hand has the same
cost in both modes.  If this "read small files" is good for non-paranoia
mode, it is obvious that it is also good (better) for paranoia mode.

Which means that this hunk addresses an unrelated issue.  "paranoid
avoidance" falls naturally as a side effect of doing this, but that is not
the primary effect of this change.

Some benchmarking is needed to justify it, I think.

So I'd split this hunk out when queuing.

Thanks.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Teach "git add" and friends to be paranoid
  2010-02-20 19:23                         ` Junio C Hamano
@ 2010-02-21  7:21                           ` Dmitry Potapov
  2010-02-21 19:32                             ` Junio C Hamano
  0 siblings, 1 reply; 84+ messages in thread
From: Dmitry Potapov @ 2010-02-21  7:21 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Zygo Blaxell, Ilari Liusvaara, Thomas Rast, Jonathan Nieder, git

On Sat, Feb 20, 2010 at 11:23:11AM -0800, Junio C Hamano wrote:
> 
> Some benchmarking is needed to justify it, I think.
> 
> So I'd split this hunk out when queuing.

I completely agree with your reasoning that it is a separate issue and
it needs some benchmarking to prove its usefulness. So, I have done it
today.

For that, I have created a repository of up to 512Mb in size containing
up to 100,000 files (for small files the file count was the limiting
factor, for big files the total size limit was).  I intentionally limited
the total size to 512Mb to make sure that the files are in the FS cache
and there are no disk-related effects.  The contents of the files were
generated randomly from /dev/urandom and then reused in all tests for a
given file size.  I made 5 runs using mmap() (with git 1.7.0) and 5 runs
using read() (after applying the patch below).

Below is the best result of 5 runs for each size. "Before" marks the
original version, which uses mmap(). "After" marks the modified version
using read(). The command used to measure time was:

cat list | time git hash-object --stdin-paths >/dev/null

So it is just calculating SHA-1, without deflating and writing the
result to disk, which depends significantly on fsync() speed.

Tested on: Intel(R) Core(TM)2 Quad  CPU   Q9300  @ 2.50GHz

Here are results:

file size = 1Kb; Hashing 100000 files
Before:
0.63user 0.86system 0:01.49elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+100428minor)pagefaults 0swaps
After:
0.54user 0.53system 0:01.07elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+421minor)pagefaults 0swaps

file size = 2Kb; Hashing 100000 files
Before:
1.04user 0.79system 0:01.82elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+100428minor)pagefaults 0swaps
After:
0.95user 0.48system 0:01.43elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+422minor)pagefaults 0swaps

file size = 4Kb; Hashing 100000 files
Before:
1.73user 0.74system 0:02.47elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+100428minor)pagefaults 0swaps
After:
1.54user 0.57system 0:02.11elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+424minor)pagefaults 0swaps

file size = 8Kb; Hashing 64000 files
Before:
1.86user 0.63system 0:02.48elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+128428minor)pagefaults 0swaps
After:
1.75user 0.50system 0:02.23elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+424minor)pagefaults 0swaps

file size = 16Kb; Hashing 32000 files
Before:
1.73user 0.41system 0:02.14elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+128428minor)pagefaults 0swaps
After:
1.68user 0.32system 0:02.00elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+425minor)pagefaults 0swaps

file size = 32Kb; Hashing 16000 files
Before:
1.65user 0.32system 0:01.96elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+128428minor)pagefaults 0swaps
After:
1.64user 0.24system 0:01.87elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+431minor)pagefaults 0swaps

file size = 64Kb; Hashing 8000 files
Before:
1.71user 0.17system 0:01.87elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+128428minor)pagefaults 0swaps
After:
1.63user 0.18system 0:01.81elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+438minor)pagefaults 0swaps

file size = 128Kb; Hashing 4000 files
Before:
1.60user 0.20system 0:01.79elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+128428minor)pagefaults 0swaps
After:
1.55user 0.20system 0:01.75elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+454minor)pagefaults 0swaps

file size = 256Kb; Hashing 2000 files
Before:
1.62user 0.15system 0:01.77elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+128429minor)pagefaults 0swaps
After:
1.56user 0.16system 0:01.71elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+551minor)pagefaults 0swaps

file size = 512Kb; Hashing 1000 files
Before:
1.59user 0.17system 0:01.76elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+128428minor)pagefaults 0swaps
After:
1.56user 0.15system 0:01.71elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+679minor)pagefaults 0swaps

file size = 1024Kb; Hashing 500 files
Before:
1.64user 0.15system 0:01.78elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+128429minor)pagefaults 0swaps
After:
1.61user 0.15system 0:01.76elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+934minor)pagefaults 0swaps


As you can see, in all tests the read() version performed better than
the mmap() version, though the difference decreases as the file size
increases.  While for 1Kb files the speed-up is 39% (based on the
elapsed time), it is a mere 1% for the 1Mb file size.

Note: I do not use strbuf_read(), because it is suboptimal for this
case, where we know the size ahead of time.  (In fact, strbuf_read() is
not so good even for an unknown size, as it does a redundant
strbuf_grow() in almost every use case, which probably needs to be
fixed.)

-- >8 --
From 6b3f8335dece7c9b9f810b1ab08f1bcb090e4d5e Mon Sep 17 00:00:00 2001
From: Dmitry Potapov <dpotapov@gmail.com>
Date: Sun, 21 Feb 2010 09:32:19 +0300
Subject: [PATCH] hash-object: don't use mmap() for small files

Using read() instead of mmap() gives a 39% speed-up for 1Kb files and a
1% speed-up for 1Mb files.  For larger files, it is better to use mmap(),
because the difference between the two is not significant, and when there
is not enough memory, mmap() performs much better, because it avoids
swapping.

Signed-off-by: Dmitry Potapov <dpotapov@gmail.com>
---
 sha1_file.c |   10 ++++++++++
 1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/sha1_file.c b/sha1_file.c
index 657825e..8a83e56 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2434,6 +2434,8 @@ static int index_mem(unsigned char *sha1, void *buf, size_t size,
 	return ret;
 }
 
+#define SMALL_FILE_SIZE (1024*1024)
+
 int index_fd(unsigned char *sha1, int fd, struct stat *st, int write_object,
 	     enum object_type type, const char *path)
 {
@@ -2448,6 +2450,14 @@ int index_fd(unsigned char *sha1, int fd, struct stat *st, int write_object,
 		else
 			ret = -1;
 		strbuf_release(&sbuf);
+	} else if (size <= SMALL_FILE_SIZE) {
+		char *buf = xmalloc(size);
+		if (size == read_in_full(fd, buf, size))
+			ret = index_mem(sha1, buf, size, write_object, type,
+					path);
+		else
+			ret = error("short read %s", strerror(errno));
+		free(buf);
 	} else if (size) {
 		void *buf = xmmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
 		ret = index_mem(sha1, buf, size, write_object, type, path);
-- 
1.7.0

-- >8 --


Thanks,
Dmitry

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* Re: [PATCH] Teach "git add" and friends to be paranoid
  2010-02-21  7:21                           ` Dmitry Potapov
@ 2010-02-21 19:32                             ` Junio C Hamano
  2010-02-22  3:35                               ` Dmitry Potapov
  0 siblings, 1 reply; 84+ messages in thread
From: Junio C Hamano @ 2010-02-21 19:32 UTC (permalink / raw)
  To: Dmitry Potapov
  Cc: Zygo Blaxell, Ilari Liusvaara, Thomas Rast, Jonathan Nieder, git

Dmitry Potapov <dpotapov@gmail.com> writes:

> file size = 1Kb; Hashing 100000 files
> Before:
> 0.63user 0.86system 0:01.49elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (0major+100428minor)pagefaults 0swaps
> After:
> 0.54user 0.53system 0:01.07elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (0major+421minor)pagefaults 0swaps
>
> As you can see, in all tests the read() version performed better than
> mmap() though the difference decreases with increase of the file size.
> While for 1Kb files, the speed up is 39% (based on the elapsed time),
> it is mere 1% for 1Mb file size.

Sounds good.  Summarizing your numbers,

       1Kb 39.25%
       2Kb 27.27%
       4Kb 17.06%
       8Kb 11.21%
      16Kb 7.00%
      32Kb 4.81%
      64Kb 3.31%
     128Kb 2.29%
     256Kb 3.51%
     512Kb 2.92%
    1024Kb 1.14%

32*1024 sounds like a better cut-off to me.  After that, doubling the size
does not get comparable gain, and numbers get unstable (notice the glitch
around 256kB).

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Teach "git add" and friends to be paranoid
  2010-02-21 19:32                             ` Junio C Hamano
@ 2010-02-22  3:35                               ` Dmitry Potapov
  2010-02-22  6:59                                 ` Junio C Hamano
  0 siblings, 1 reply; 84+ messages in thread
From: Dmitry Potapov @ 2010-02-22  3:35 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Zygo Blaxell, Ilari Liusvaara, Thomas Rast, Jonathan Nieder, git

On Sun, Feb 21, 2010 at 11:32:10AM -0800, Junio C Hamano wrote:
> 
> 32*1024 sounds like a better cut-off to me.  After that, doubling the size
> does not get comparable gain, and numbers get unstable (notice the glitch
> around 256kB).

The reduction of the speed-up after 32Kb is most likely due to the L1
cache size, which is 32Kb of data per core on Core 2, while the L2 cache
is shared among cores and is considerably slower.  I have run my test a
few more times, and here are the results:

   1 - 39.25%
   2 - 30.00%
   4 - 17.79%
   8 - 11.76%
  16 - 7.58%
  32 - 5.38%
  64 - 3.89%
 128 - 2.87%
 256 - 2.31%
 512 - 2.92%
1024 - 1.14%

and here is one more re-run starting with 32Kb:

  32 - 5.38%
  64 - 3.89%
 128 - 2.29%
 256 - 2.91%
 512 - 2.92%
1024 - 1.14%

If you look at the speed-up numbers, you might think that the numbers
are unstable, but in fact the best time in 5 runs does not differ by
more than 0.01s between those trials.  But because the difference
for >=128Kb is 0.05s or less, the above percentages have an error of
up to 25%.
But overall the outcome is clear -- read() is always a winner.

It would be interesting to see what difference Nehalem would make, as it
has a smaller but much faster L2 cache than Core 2. It may perform better
at larger sizes up to 256Kb.

Anyway, based on the above data, I believe that the proper cut-off should
be at least 64Kb, because the additional 32Kb (going from 32Kb to 64Kb) is
only about 2.5% of the total memory that git consumes anyway, and it gives
you a speed-up of around 3.5%...



Dmitry

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Teach "git add" and friends to be paranoid
  2010-02-22  3:35                               ` Dmitry Potapov
@ 2010-02-22  6:59                                 ` Junio C Hamano
  2010-02-22 12:25                                   ` Dmitry Potapov
  2010-02-22 15:40                                   ` Nicolas Pitre
  0 siblings, 2 replies; 84+ messages in thread
From: Junio C Hamano @ 2010-02-22  6:59 UTC (permalink / raw)
  To: Dmitry Potapov
  Cc: Zygo Blaxell, Ilari Liusvaara, Thomas Rast, Jonathan Nieder, git

Dmitry Potapov <dpotapov@gmail.com> writes:

> If you look at the speed-up numbers, you might think that the numbers
> are unstable, but in fact the best time in 5 runs does not differ by
> more than 0.01s between those trials.  But because the difference
> for >=128Kb is 0.05s or less, the above percentages have an error of
> up to 25%.

Then wouldn't it make the following statement...

> But overall the outcome is clear -- read() is always a winner.

"... a winner, below 128kB; above that the difference is within noise and
measurement error"?

> It would be interesting to see what difference Nehalem would make, as it
> has a smaller but much faster L2 cache than Core 2. It may perform better
> at larger sizes up to 256Kb.

Interesting.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Teach "git add" and friends to be paranoid
  2010-02-22  6:59                                 ` Junio C Hamano
@ 2010-02-22 12:25                                   ` Dmitry Potapov
  2010-02-22 15:40                                   ` Nicolas Pitre
  1 sibling, 0 replies; 84+ messages in thread
From: Dmitry Potapov @ 2010-02-22 12:25 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Zygo Blaxell, Ilari Liusvaara, Thomas Rast, Jonathan Nieder, git

On Sun, Feb 21, 2010 at 10:59:51PM -0800, Junio C Hamano wrote:
> 
> "... a winner, below 128kB; above that the difference is within noise and
> measurement error"?

What I was trying to say is that if you consistently see a four or five
point win (based on many trials), it is clearly a win; but you have a 25%
error bar on the gain number.


Dmitry

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Teach "git add" and friends to be paranoid
  2010-02-18  5:36                         ` Junio C Hamano
  2010-02-18  7:27                           ` Wincent Colaiuta
@ 2010-02-22 12:59                           ` Paolo Bonzini
  2010-02-22 13:33                             ` Dmitry Potapov
  1 sibling, 1 reply; 84+ messages in thread
From: Paolo Bonzini @ 2010-02-22 12:59 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Nicolas Pitre, Jeff King, Dmitry Potapov, Zygo Blaxell,
	Ilari Liusvaara, Thomas Rast, Jonathan Nieder, git

On 02/18/2010 06:36 AM, Junio C Hamano wrote:
> Nicolas Pitre<nico@fluxnic.net>  writes:
>
>> It is likely to have better performance if the buffer is small enough to
>> fit in the CPU L1 cache.  There are two sequencial passes over the
>> buffer: one for the SHA1 computation, and another for the compression,
>> and currently they're sure to trash the L1 cache on each pass.
>
> I did a very unscientific test to hash about 14k paths (arch/ and fs/ from
> the kernel source) using "git-hash-object -w --stdin-paths" into an empty
> repository with varying sizes of paranoia buffer (quarter, 1, 4, 8 and
> 256kB) and saw 8-30% overhead.  256kB did hurt and around 4kB seemed to be
> optimal for this small sample load.
>
> In any case, with any size of paranoia, this hurts the sane use case

Because by mmapping + memcpying you are getting the worst of both cases:
you get a page fault per page like with mmap, and touch memory twice 
like with read.

Paolo

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Teach "git add" and friends to be paranoid
  2010-02-22 12:59                           ` Paolo Bonzini
@ 2010-02-22 13:33                             ` Dmitry Potapov
  0 siblings, 0 replies; 84+ messages in thread
From: Dmitry Potapov @ 2010-02-22 13:33 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Junio C Hamano, Nicolas Pitre, Jeff King, Zygo Blaxell,
	Ilari Liusvaara, Thomas Rast, Jonathan Nieder, git

On Mon, Feb 22, 2010 at 01:59:50PM +0100, Paolo Bonzini wrote:
> On 02/18/2010 06:36 AM, Junio C Hamano wrote:
> >
> >In any case, with any size of paranoia, this hurts the sane use case
> 
> Because by mmapping + memcpying you are getting the worst of both
> cases: you get a page fault per page like with mmap, and touch
> memory twice like with read.

and there is also an extra round of SHA-1 calculation, which I believe
is more expensive than memcpy().


Dmitry

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Teach "git add" and friends to be paranoid
  2010-02-22  6:59                                 ` Junio C Hamano
  2010-02-22 12:25                                   ` Dmitry Potapov
@ 2010-02-22 15:40                                   ` Nicolas Pitre
  2010-02-22 16:01                                     ` Dmitry Potapov
  2010-02-22 17:31                                     ` Zygo Blaxell
  1 sibling, 2 replies; 84+ messages in thread
From: Nicolas Pitre @ 2010-02-22 15:40 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Dmitry Potapov, Zygo Blaxell, Ilari Liusvaara, Thomas Rast,
	Jonathan Nieder, git

On Sun, 21 Feb 2010, Junio C Hamano wrote:

> Dmitry Potapov <dpotapov@gmail.com> writes:
> 
> > If you look at the speed-up numbers, you might think that the numbers
> > are unstable, but in fact the best time in 5 runs does not differ by
> > more than 0.01s between those trials.  But because the difference
> > for >=128Kb is 0.05s or less, the above percentages have an error of
> > up to 25%.
> 
> Then wouldn't it make the following statement...
> 
> > But overall the outcome is clear -- read() is always a winner.
> 
> "... a winner, below 128kB; above that the difference is within noise and
> measurement error"?

read() is not always a winner.  A read() call will always have the data 
duplicated in memory.  Especially with large files, it is more efficient 
on the system as a whole to mmap() a 50 MB file rather than allocating 
an extra 50 MB of anonymous memory that cannot be paged out (except to 
the swap file which would be yet another data duplication).  With mmap() 
when there is memory pressure the read-only mapped memory is simply 
dropped with no extra IO.

So when read() is not _significantly_ faster than mmap() then it should 
not be used.


Nicolas

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Teach "git add" and friends to be paranoid
  2010-02-22 15:40                                   ` Nicolas Pitre
@ 2010-02-22 16:01                                     ` Dmitry Potapov
  2010-02-22 17:31                                     ` Zygo Blaxell
  1 sibling, 0 replies; 84+ messages in thread
From: Dmitry Potapov @ 2010-02-22 16:01 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Junio C Hamano, Zygo Blaxell, Ilari Liusvaara, Thomas Rast,
	Jonathan Nieder, git

On Mon, Feb 22, 2010 at 6:40 PM, Nicolas Pitre <nico@fluxnic.net> wrote:
>
> read() is not always a winner.  A read() call will always have the data
> duplicated in memory.  Especially with large files, it is more efficient
> on the system as a whole to mmap() a 50 MB file rather than allocating

Agreed. I meant it was faster on data that I measured, which was <=1Mb.
IMHO, read() should be used up to 64Kb.

Dmitry

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Teach "git add" and friends to be paranoid
  2010-02-22 15:40                                   ` Nicolas Pitre
  2010-02-22 16:01                                     ` Dmitry Potapov
@ 2010-02-22 17:31                                     ` Zygo Blaxell
  2010-02-22 18:01                                       ` Nicolas Pitre
  2010-02-22 18:05                                       ` Dmitry Potapov
  1 sibling, 2 replies; 84+ messages in thread
From: Zygo Blaxell @ 2010-02-22 17:31 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Junio C Hamano, Dmitry Potapov, Ilari Liusvaara, Thomas Rast,
	Jonathan Nieder, git

On Mon, Feb 22, 2010 at 10:40:59AM -0500, Nicolas Pitre wrote:
> On Sun, 21 Feb 2010, Junio C Hamano wrote:
> > Dmitry Potapov <dpotapov@gmail.com> writes:
> > > But overall the outcome is clear -- read() is always a winner.
> > 
> > "... a winner, below 128kB; above that the difference is within noise and
> > measurement error"?
> 
> read() is not always a winner.  A read() call will always have the data 
> duplicated in memory.  Especially with large files, it is more efficient 
> on the system as a whole to mmap() a 50 MB file rather than allocating 
> an extra 50 MB of anonymous memory that cannot be paged out (except to 
> the swap file which would be yet another data duplication).  With mmap() 
> when there is memory pressure the read-only mapped memory is simply 
> dropped with no extra IO.

That holds if you're comparing read() and mmap() of the entire file as a
single chunk, instead of in fixed-size chunks at the sweet spot between
syscall overhead and CPU cache size.

If you're read()ing a chunk at a time into a fixed size buffer, and
doing sha1 and deflate in chunks, the data should be copied once into CPU
cache, processed with both algorithms, and replaced with new data from
the next chunk.  The data will be copied from the page cache instead
of directly mapped, which is a small overhead, but setting up the page
map in mmap() is also a small overhead, so you have to use benchmarks to
know which of the overheads is smaller.  It might be that there's no
one answer that applies to all CPU configurations.
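
Roughly like this (illustration only: OpenSSL's SHA1 and zlib stand in
for git's own wrappers, and the 64kB chunk size is just a guess at that
sweet spot):

	#include <string.h>
	#include <unistd.h>
	#include <openssl/sha.h>
	#include <zlib.h>

	#define CHUNK (64 * 1024)

	static int write_all(int fd, const unsigned char *buf, size_t len)
	{
		while (len) {
			ssize_t w = write(fd, buf, len);
			if (w < 0)
				return -1;
			buf += w;
			len -= w;
		}
		return 0;
	}

	/*
	 * Read in_fd a chunk at a time and feed each chunk to both
	 * SHA-1 and deflate while it is still hot in the cache; the
	 * deflated stream goes to out_fd.
	 */
	int hash_and_deflate(int in_fd, int out_fd,
			     unsigned char sha1[SHA_DIGEST_LENGTH])
	{
		unsigned char in[CHUNK], out[CHUNK];
		SHA_CTX c;
		z_stream z;
		ssize_t n;
		int flush = Z_NO_FLUSH, ret = Z_OK;

		SHA1_Init(&c);
		memset(&z, 0, sizeof(z));
		if (deflateInit(&z, Z_DEFAULT_COMPRESSION) != Z_OK)
			return -1;

		do {
			n = read(in_fd, in, sizeof(in));
			if (n < 0)
				goto fail;
			if (n == 0)
				flush = Z_FINISH;  /* EOF: let zlib wrap up */
			SHA1_Update(&c, in, n);    /* same chunk, both algorithms */
			z.next_in = in;
			z.avail_in = n;
			do {                       /* drain deflate's output */
				z.next_out = out;
				z.avail_out = sizeof(out);
				ret = deflate(&z, flush);
				if (ret == Z_STREAM_ERROR)
					goto fail;
				if (write_all(out_fd, out, sizeof(out) - z.avail_out))
					goto fail;
			} while (z.avail_out == 0 && ret != Z_STREAM_END);
		} while (flush != Z_FINISH);

		deflateEnd(&z);
		SHA1_Final(sha1, &c);
		return ret == Z_STREAM_END ? 0 : -1;
	fail:
		deflateEnd(&z);
		return -1;
	}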

If you're doing mmap() and sha1 and deflate of a 50MB file in two
separate passes that are the same size as the file, you load 50MB of
data into CPU cache at least twice, you get two sets of associated
things like TLB misses, and if the file is very large, you page it from
disk twice.  So it might make sense to process in chunks regardless
of read() vs mmap() fetching the data.

If you're malloc()ing 50MB, you're wasting memory and CPU bandwidth
making up pages full of zeros before you've even processed the first byte.
I don't see how that could ever be faster for large file cases.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Teach "git add" and friends to be paranoid
  2010-02-22 17:31                                     ` Zygo Blaxell
@ 2010-02-22 18:01                                       ` Nicolas Pitre
  2010-02-22 19:56                                         ` Junio C Hamano
  2010-02-22 18:05                                       ` Dmitry Potapov
  1 sibling, 1 reply; 84+ messages in thread
From: Nicolas Pitre @ 2010-02-22 18:01 UTC (permalink / raw)
  To: Zygo Blaxell
  Cc: Junio C Hamano, Dmitry Potapov, Ilari Liusvaara, Thomas Rast,
	Jonathan Nieder, git

On Mon, 22 Feb 2010, Zygo Blaxell wrote:

> On Mon, Feb 22, 2010 at 10:40:59AM -0500, Nicolas Pitre wrote:
> > On Sun, 21 Feb 2010, Junio C Hamano wrote:
> > > Dmitry Potapov <dpotapov@gmail.com> writes:
> > > > But overall the outcome is clear -- read() is always a winner.
> > > 
> > > "... a winner, below 128kB; above that the difference is within noise and
> > > measurement error"?
> > 
> > read() is not always a winner.  A read() call will always have the data 
> > duplicated in memory.  Especially with large files, it is more efficient 
> > on the system as a whole to mmap() a 50 MB file rather than allocating 
> > an extra 50 MB of anonymous memory that cannot be paged out (except to 
> > the swap file which would be yet another data duplication).  With mmap() 
> > when there is memory pressure the read-only mapped memory is simply 
> > dropped with no extra IO.
> 
> That holds if you're comparing read() and mmap() of the entire file as a
> single chunk, instead of in fixed-size chunks at the sweet spot between
> syscall overhead and CPU cache size.

Obviously.  But we currently don't have the infrastructure to do chunked 
read of the input data.  I think we should do that eventually, by 
applying the pack windowing code to input files as well.  That would 
make memory usage constant even for huge files, but this is much more 
complicated to support, especially for data fed through stdin.

> If you're read()ing a chunk at a time into a fixed size buffer, and
> doing sha1 and deflate in chunks, the data should be copied once into CPU
> cache, processed with both algorithms, and replaced with new data from
> the next chunk.  The data will be copied from the page cache instead
> of directly mapped, which is a small overhead, but setting up the page
> map in mmap() also a small overhead, so you have to use benchmarks to
> know which of the overheads is smaller.  It might be that there's no
> one answer that applies to all CPU configurations.

Normally mmap() has more overhead than read().  However mmap() provides 
much nicer properties than read() by simplifying the code a lot, and by 
letting the OS manage memory pressure much more gracefully.

> If you're doing mmap() and sha1 and deflate of a 50MB file in two
> separate passes that are the same size as the file, you load 50MB of
> data into CPU cache at least twice, you get two sets of associated
> things like TLB misses, and if the file is very large, you page it from
> disk twice.  So it might make sense to process in chunks regardless
> of read() vs mmap() fetching the data.

We do have to make two separate passes anyway.  The first pass is to 
hash the data only, and if that hash already exists in the object store 
then we call it done and skip over the deflate process which is still 
the dominant cost.  And that happens quite often.

However, with a really large file, then it becomes advantageous to 
simply do the hash and deflate in parallel one chunk at a time, and 
simply discard the newly created object if it happens to already 
exist.  That's the whole idea behind the newly introduced 
core.bigFileThreshold config variable (but the code to honor it in 
sha1_file.c doesn't exist yet).
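
In control-flow terms, something like this sketch (the helpers are
hypothetical stand-ins, not git's real API, and the real object name
also covers a "blob <size>" header that this ignores):

	#include <limits.h>
	#include <stdio.h>
	#include <unistd.h>
	#include <openssl/sha.h>

	/* hypothetical helpers, not git's actual functions */
	int hash_mapped_file(const char *path,
			     unsigned char sha1[SHA_DIGEST_LENGTH]);
	int deflate_and_write_loose(const char *path,
				    const unsigned char sha1[SHA_DIGEST_LENGTH]);

	static void loose_object_path(char *buf, size_t len,
				      const unsigned char sha1[SHA_DIGEST_LENGTH])
	{
		static const char hex[] = "0123456789abcdef";
		char name[2 * SHA_DIGEST_LENGTH + 1];
		int i;

		for (i = 0; i < SHA_DIGEST_LENGTH; i++) {
			name[2 * i]     = hex[sha1[i] >> 4];
			name[2 * i + 1] = hex[sha1[i] & 0xf];
		}
		name[2 * SHA_DIGEST_LENGTH] = '\0';
		/* objects/xx/<remaining 38 hex digits> */
		snprintf(buf, len, ".git/objects/%.2s/%s", name, name + 2);
	}

	int store_if_new(const char *path)
	{
		unsigned char sha1[SHA_DIGEST_LENGTH];
		char obj[PATH_MAX];

		if (hash_mapped_file(path, sha1) < 0)	/* pass 1: hash only */
			return -1;
		loose_object_path(obj, sizeof(obj), sha1);
		if (access(obj, F_OK) == 0)
			return 0;	/* already stored: skip deflate entirely */
		return deflate_and_write_loose(path, sha1);	/* pass 2 */
	}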

> If you're malloc()ing 50MB, you're wasting memory and CPU bandwidth
> making up pages full of zeros before you've even processed the first byte.
> I don't see how that could ever be faster for large file cases.

It can't.  This is why read() is not much better than mmap() in those 
cases.


Nicolas

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Teach "git add" and friends to be paranoid
  2010-02-22 17:31                                     ` Zygo Blaxell
  2010-02-22 18:01                                       ` Nicolas Pitre
@ 2010-02-22 18:05                                       ` Dmitry Potapov
  2010-02-22 18:14                                         ` Nicolas Pitre
  1 sibling, 1 reply; 84+ messages in thread
From: Dmitry Potapov @ 2010-02-22 18:05 UTC (permalink / raw)
  To: Zygo Blaxell
  Cc: Nicolas Pitre, Junio C Hamano, Ilari Liusvaara, Thomas Rast,
	Jonathan Nieder, git

On Mon, Feb 22, 2010 at 8:31 PM, Zygo Blaxell
<zblaxell@gibbs.hungrycats.org> wrote:
>
> If you're read()ing a chunk at a time into a fixed size buffer, and
> doing sha1 and deflate in chunks, the data should be copied once into CPU
> cache, processed with both algorithms, and replaced with new data from
> the next chunk.

Currently, we calculate SHA-1, then lookup whether the object with this
SHA-1 exists, and if it does not, then deflate and write it to the
object storage. So, we avoid deflate and write cost if this object
already exists. Moreover, when we deflate data, we create the temporary
file in the same directory where the target object will be stored, thus
avoiding cross-directory rename (which is important for some reason, but
I don't remember why).  So, creating the temporary file requires knowing
the first two hex digits of the SHA-1, which you cannot know without
calculating the SHA-1.
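
For reference, a simplified sketch of that temp-file dance (hypothetical
code; the real thing also writes the object header, sets permissions and
handles errors more carefully):

	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <sys/stat.h>
	#include <unistd.h>

	/*
	 * Write already-deflated object data under .git/objects/.
	 * The fan-out directory comes from the first two hex digits of
	 * the SHA-1, which is why the hash must be known before the
	 * temporary file can even be created; the final step is then a
	 * cheap same-directory rename().
	 */
	int write_loose_object(const char *hex40,
			       const void *zdata, size_t zlen)
	{
		char dir[64], tmp[128], dst[128];
		int fd;

		snprintf(dir, sizeof(dir), ".git/objects/%.2s", hex40);
		mkdir(dir, 0777);	/* EEXIST is fine, ignored for brevity */
		snprintf(tmp, sizeof(tmp), "%s/tmp_obj_XXXXXX", dir);
		snprintf(dst, sizeof(dst), "%s/%s", dir, hex40 + 2);

		fd = mkstemp(tmp);
		if (fd < 0)
			return -1;
		if (write(fd, zdata, zlen) != (ssize_t)zlen) {
			close(fd);
			unlink(tmp);
			return -1;
		}
		close(fd);
		return rename(tmp, dst);	/* same directory, never cross-dir */
	}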

So, the idea of processing the file in chunks is very attractive, but it has
two drawbacks:
1. extra cost (deflating+writing) when the object is already stored
2. some issues with cross-directory renaming


Dmitry

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Teach "git add" and friends to be paranoid
  2010-02-22 18:05                                       ` Dmitry Potapov
@ 2010-02-22 18:14                                         ` Nicolas Pitre
  0 siblings, 0 replies; 84+ messages in thread
From: Nicolas Pitre @ 2010-02-22 18:14 UTC (permalink / raw)
  To: Dmitry Potapov
  Cc: Zygo Blaxell, Junio C Hamano, Ilari Liusvaara, Thomas Rast,
	Jonathan Nieder, git

On Mon, 22 Feb 2010, Dmitry Potapov wrote:

> Currently, we calculate SHA-1, then lookup whether the object with this
> SHA-1 exists, and if it does not, then deflate and write it to the
> object storage. So, we avoid deflate and write cost if this object
> already exists. Moreover, when we deflate data, we create the temporary
> file in the same directory where the target object will be stored, thus
> avoiding cross-directory rename (which is important for some reason, but
> I don't remember why).  So, creating the temporary file requires the
> knowledge first two digits of SHA-1, which you cannot know without
> calculation SHA-1.

Even that initial SHA1 calculation can be done in chunks.  Worth doing 
for large enough files only though.


Nicolas

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Teach "git add" and friends to be paranoid
  2010-02-22 18:01                                       ` Nicolas Pitre
@ 2010-02-22 19:56                                         ` Junio C Hamano
  2010-02-22 20:52                                           ` Nicolas Pitre
  0 siblings, 1 reply; 84+ messages in thread
From: Junio C Hamano @ 2010-02-22 19:56 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Zygo Blaxell, Dmitry Potapov, Ilari Liusvaara, Thomas Rast,
	Jonathan Nieder, git

Nicolas Pitre <nico@fluxnic.net> writes:

> We do have to make two separate passes anyway.  The first pass is to 
> hash the data only, and if that hash already exists in the object store 
> then we call it done and skip over the deflate process which is still 
> the dominant cost.  And that happens quite often.
>
> However, with a really large file, then it becomes advantageous to 
> simply do the hash and deflate in parallel one chunk at a time, and 
> simply discard the newly created objects if it happens to already 
> exists.  That's the whole idea behind the newly introduced 
> core.bigFileThreshold config variable (but the code to honor it in 
> sha1_file.c doesn't exist yet).

The core.bigFileThreshold could be used in sha1_file.c to decide to write
straight into a new packfile; that would avoid both the later repacking
cost and the cross-directory rename issue for loose object files.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Teach "git add" and friends to be paranoid
  2010-02-22 19:56                                         ` Junio C Hamano
@ 2010-02-22 20:52                                           ` Nicolas Pitre
  0 siblings, 0 replies; 84+ messages in thread
From: Nicolas Pitre @ 2010-02-22 20:52 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Zygo Blaxell, Dmitry Potapov, Ilari Liusvaara, Thomas Rast,
	Jonathan Nieder, git

On Mon, 22 Feb 2010, Junio C Hamano wrote:

> Nicolas Pitre <nico@fluxnic.net> writes:
> 
> > We do have to make two separate passes anyway.  The first pass is to 
> > hash the data only, and if that hash already exists in the object store 
> > then we call it done and skip over the deflate process which is still 
> > the dominant cost.  And that happens quite often.
> >
> > However, with a really large file, then it becomes advantageous to 
> > simply do the hash and deflate in parallel one chunk at a time, and 
> > simply discard the newly created objects if it happens to already 
> > exists.  That's the whole idea behind the newly introduced 
> > core.bigFileThreshold config variable (but the code to honor it in 
> > sha1_file.c doesn't exist yet).
> 
> The core.bigFileThreshold could be used in sha1_file.c to decide writing
> straight into a new packfile; that would avoid both the later repacking
> cost and the cross directory rename issue for loose object files.

Yep, that's the plan.  Let's find the time now.


Nicolas

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: 16 gig, 350,000 file repository
  2010-02-18 20:58                             ` Nicolas Pitre
  2010-02-19  9:27                               ` Erik Faye-Lund
@ 2010-02-22 22:20                               ` Bill Lear
  2010-02-22 22:31                                 ` Nicolas Pitre
  1 sibling, 1 reply; 84+ messages in thread
From: Bill Lear @ 2010-02-22 22:20 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: git

On Thursday, February 18, 2010 at 15:58:42 (-0500) Nicolas Pitre writes:
>On Thu, 18 Feb 2010, Bill Lear wrote:
>
>> I'm starting a new, large project and would like a quick bit of advice.
>> 
>> Bringing in a set of test cases and other files from a ClearCase
>> repository resulted in a 350,000 file git repo of about 16 gigabytes.
>> 
>> The time to clone over a fast network was about 250 minutes.  I could
>> not verify if the repo had been packed properly, etc.
>
>I'd start from there.  If you didn't do a 'git gc --aggressive' after 
>the import then it is quite likely that your repo isn't well packed.
>
>Of course you'll need a big machine to repack this.  But that should be 
>needed only once.

Ok, well they have a "big machine", but not big enough.  It's running
out of memory on the gc.  I believe they have a fair amount of memory:

% free
             total       used       free     shared    buffers     cached
Mem:      16629680   16051444     578236          0      28332   14385948
-/+ buffers/cache:    1637164   14992516
Swap:      8289500       1704    8287796

and they are using git 1.6.6.

Assuming we can figure out how to gc this puppy (is there any way on a
machine without 64 gigabytes?), there is still a question that
remains: how to organize a project that has a very large number of
test cases (and test data) that we might not want to pull across the
wire each time.  Instead of a shallow clone, is there some sort of
slicing clone operation?

We thought of using submodules.  That is, code (say) goes in a separate
repo 'src' and functional tests go in another, called 'ftests'.  Then,
we add 'ftests' as a submodule to 'src'.  Great.  However, we need to
be able to branch 'src' and 'ftests' together.  Example: I am working on
a new feature in a branch "GLX-473_incremental_compression".  I would like
to be able to create the branch in both the 'src' repo and the 'ftests'
repo at the same time, make changes, commit, and push to that branch for
both.  When developers check out the repo, they move to that branch, but
do NOT want the cloned ftests.  However, the QA team wants both the source
and the tests that I have checked in and pushed.

Is there an easy way to support this?


Bill

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: 16 gig, 350,000 file repository
  2010-02-22 22:20                               ` Bill Lear
@ 2010-02-22 22:31                                 ` Nicolas Pitre
  0 siblings, 0 replies; 84+ messages in thread
From: Nicolas Pitre @ 2010-02-22 22:31 UTC (permalink / raw)
  To: Bill Lear; +Cc: git

On Mon, 22 Feb 2010, Bill Lear wrote:

> On Thursday, February 18, 2010 at 15:58:42 (-0500) Nicolas Pitre writes:
> >On Thu, 18 Feb 2010, Bill Lear wrote:
> >
> >> I'm starting a new, large project and would like a quick bit of advice.
> >> 
> >> Bringing in a set of test cases and other files from a ClearCase
> >> repository resulted in a 350,000 file git repo of about 16 gigabytes.
> >> 
> >> The time to clone over a fast network was about 250 minutes.  I could
> >> not verify if the repo had been packed properly, etc.
> >
> >I'd start from there.  If you didn't do a 'git gc --aggressive' after 
> >the import then it is quite likely that your repo isn't well packed.
> >
> >Of course you'll need a big machine to repack this.  But that should be 
> >needed only once.
> 
> Ok, well they have a "big machine", but not big enough.  It's running
> out of memory on the gc.  I believe they have a fair amount of memory:
> 
> % free
>              total       used       free     shared    buffers     cached
> Mem:      16629680   16051444     578236          0      28332   14385948
> -/+ buffers/cache:    1637164   14992516
> Swap:      8289500       1704    8287796
> 
> and they are using git 1.6.6.

Hmmm. OK.

You might try:

	git repack -a -f -d --depth=200 --window=100 --window-memory=1g


Nicolas

^ permalink raw reply	[flat|nested] 84+ messages in thread

end of thread, other threads:[~2010-02-22 22:31 UTC | newest]

Thread overview: 84+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20100211234753.22574.48799.reportbug@gibbs.hungrycats.org>
2010-02-12  0:27 ` Bug#569505: git-core: 'git add' corrupts repository if the working directory is modified as it runs Jonathan Nieder
2010-02-12  1:23   ` Zygo Blaxell
2010-02-13 12:12     ` Jonathan Nieder
2010-02-13 13:39       ` Ilari Liusvaara
2010-02-13 14:39         ` Thomas Rast
2010-02-13 16:29           ` Ilari Liusvaara
2010-02-13 22:09             ` Dmitry Potapov
2010-02-13 22:37               ` Zygo Blaxell
2010-02-14  1:18                 ` [PATCH] don't use mmap() to hash files Dmitry Potapov
2010-02-14  1:37                   ` Junio C Hamano
2010-02-14  2:18                     ` Dmitry Potapov
2010-02-14  3:14                       ` Junio C Hamano
2010-02-14 11:14                         ` Thomas Rast
2010-02-14 11:46                           ` Junio C Hamano
2010-02-14  1:53                   ` Johannes Schindelin
2010-02-14  2:00                     ` Junio C Hamano
2010-02-14  2:42                     ` Dmitry Potapov
2010-02-14 11:07                       ` Jakub Narebski
2010-02-14 11:55                       ` Paolo Bonzini
2010-02-14 18:10                       ` Johannes Schindelin
2010-02-14 19:06                         ` Dmitry Potapov
2010-02-14 19:22                           ` Johannes Schindelin
2010-02-14 19:28                             ` Johannes Schindelin
2010-02-14 19:56                               ` Dmitry Potapov
2010-02-14 23:52                                 ` Zygo Blaxell
2010-02-15  5:05                                 ` Nicolas Pitre
2010-02-15 12:23                                   ` Dmitry Potapov
2010-02-15  7:48                                 ` Paolo Bonzini
2010-02-15 12:25                                   ` Dmitry Potapov
2010-02-14 19:55                             ` Dmitry Potapov
2010-02-14 23:13                           ` Avery Pennarun
2010-02-15  4:16                             ` Nicolas Pitre
2010-02-15  5:01                               ` Avery Pennarun
2010-02-15  5:48                                 ` Nicolas Pitre
2010-02-15 19:19                                   ` Avery Pennarun
2010-02-15 19:29                                     ` Nicolas Pitre
2010-02-14  3:05                   ` [PATCH v2] " Dmitry Potapov
2010-02-18  1:16                   ` [PATCH] Teach "git add" and friends to be paranoid Junio C Hamano
2010-02-18  1:20                     ` Junio C Hamano
2010-02-18 15:32                       ` Zygo Blaxell
2010-02-19 17:51                         ` Junio C Hamano
2010-02-18  1:38                     ` Jeff King
2010-02-18  4:55                       ` Nicolas Pitre
2010-02-18  5:36                         ` Junio C Hamano
2010-02-18  7:27                           ` Wincent Colaiuta
2010-02-18 16:18                             ` Zygo Blaxell
2010-02-18 18:12                               ` Jonathan Nieder
2010-02-18 18:35                                 ` Junio C Hamano
2010-02-22 12:59                           ` Paolo Bonzini
2010-02-22 13:33                             ` Dmitry Potapov
2010-02-18 10:14                     ` Thomas Rast
2010-02-18 18:16                       ` Junio C Hamano
2010-02-18 19:58                         ` Nicolas Pitre
2010-02-18 20:11                           ` 16 gig, 350,000 file repository Bill Lear
2010-02-18 20:58                             ` Nicolas Pitre
2010-02-19  9:27                               ` Erik Faye-Lund
2010-02-22 22:20                               ` Bill Lear
2010-02-22 22:31                                 ` Nicolas Pitre
2010-02-18 20:14                           ` [PATCH] Teach "git add" and friends to be paranoid Peter Harris
2010-02-18 20:17                           ` Junio C Hamano
2010-02-18 21:30                             ` Nicolas Pitre
2010-02-19  1:04                               ` Jonathan Nieder
2010-02-19 15:26                                 ` Zygo Blaxell
2010-02-19 17:52                                   ` Junio C Hamano
2010-02-19 19:08                                     ` Zygo Blaxell
2010-02-19  8:28                     ` Dmitry Potapov
2010-02-19 17:52                       ` Junio C Hamano
2010-02-20 19:23                         ` Junio C Hamano
2010-02-21  7:21                           ` Dmitry Potapov
2010-02-21 19:32                             ` Junio C Hamano
2010-02-22  3:35                               ` Dmitry Potapov
2010-02-22  6:59                                 ` Junio C Hamano
2010-02-22 12:25                                   ` Dmitry Potapov
2010-02-22 15:40                                   ` Nicolas Pitre
2010-02-22 16:01                                     ` Dmitry Potapov
2010-02-22 17:31                                     ` Zygo Blaxell
2010-02-22 18:01                                       ` Nicolas Pitre
2010-02-22 19:56                                         ` Junio C Hamano
2010-02-22 20:52                                           ` Nicolas Pitre
2010-02-22 18:05                                       ` Dmitry Potapov
2010-02-22 18:14                                         ` Nicolas Pitre
2010-02-14  1:36   ` mmap with MAP_PRIVATE is useless (was Re: Bug#569505: git-core: 'git add' corrupts repository if the working directory is modified as it runs) Paolo Bonzini
2010-02-14  1:53     ` mmap with MAP_PRIVATE is useless Junio C Hamano
2010-02-14  2:11       ` Paolo Bonzini
