* [PATCH] sha1_file: don't malloc the whole compressed result when writing out objects
From: Nicolas Pitre @ 2010-02-21  4:27 UTC
  To: Junio C Hamano; +Cc: git

There is no real advantage to mallocing the whole output buffer and
deflating the data in a single pass when writing loose objects.  Doing so
is only about 1% faster while using more memory, especially with large
files where the memory overhead is much larger.  It is better to deflate
and write the data out in small chunks, reusing the same memory instead.

For example, using 'git add' on a few large files averaging 40 MB ...

Before:
21.45user 1.10system 0:22.57elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+828040outputs (0major+142640minor)pagefaults 0swaps

After:
21.50user 1.25system 0:22.76elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+828040outputs (0major+104408minor)pagefaults 0swaps

While the runtime stayed about the same, the number of minor page faults
went down significantly: the small buffer is simply reused, while every
page of the large freshly malloced buffer had to be faulted in on first
touch.

Signed-off-by: Nicolas Pitre <nico@fluxnic.net>
---

I think this is worth doing independently of the paranoid mode being 
discussed.
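
To illustrate the pattern outside of the git context, a minimal
standalone version of the chunked deflate-and-write loop could look
like this (plain zlib and write(2); error handling and short writes
are glossed over, and the function name is made up for the example):

	#include <string.h>
	#include <unistd.h>
	#include <zlib.h>

	/*
	 * Deflate buf[0..len) to fd in fixed-size chunks, reusing one
	 * small output buffer instead of a deflateBound()-sized malloc.
	 */
	static int deflate_to_fd(int fd, void *buf, unsigned long len)
	{
		unsigned char out[4096];
		z_stream s;
		int ret;

		memset(&s, 0, sizeof(s));
		if (deflateInit(&s, Z_DEFAULT_COMPRESSION) != Z_OK)
			return -1;
		s.next_in = buf;
		s.avail_in = len;
		do {
			s.next_out = out;
			s.avail_out = sizeof(out);
			/* returns Z_OK until all output is flushed */
			ret = deflate(&s, Z_FINISH);
			if (write(fd, out, sizeof(out) - s.avail_out) < 0) {
				deflateEnd(&s);
				return -1;
			}
		} while (ret == Z_OK);
		deflateEnd(&s);
		return ret == Z_STREAM_END ? 0 : -1;
	}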

diff --git a/sha1_file.c b/sha1_file.c
index 657825e..9196b57 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2281,8 +2281,7 @@ static int write_loose_object(const unsigned char *sha1, char *hdr, int hdrlen,
 			      void *buf, unsigned long len, time_t mtime)
 {
 	int fd, ret;
-	size_t size;
-	unsigned char *compressed;
+	unsigned char compressed[4096];
 	z_stream stream;
 	char *filename;
 	static char tmpfile[PATH_MAX];
@@ -2301,12 +2300,8 @@ static int write_loose_object(const unsigned char *sha1, char *hdr, int hdrlen,
 	/* Set it up */
 	memset(&stream, 0, sizeof(stream));
 	deflateInit(&stream, zlib_compression_level);
-	size = 8 + deflateBound(&stream, len+hdrlen);
-	compressed = xmalloc(size);
-
-	/* Compress it */
 	stream.next_out = compressed;
-	stream.avail_out = size;
+	stream.avail_out = sizeof(compressed);
 
 	/* First header.. */
 	stream.next_in = (unsigned char *)hdr;
@@ -2317,20 +2312,21 @@ static int write_loose_object(const unsigned char *sha1, char *hdr, int hdrlen,
 	/* Then the data itself.. */
 	stream.next_in = buf;
 	stream.avail_in = len;
-	ret = deflate(&stream, Z_FINISH);
+	do {
+		ret = deflate(&stream, Z_FINISH);
+		if (write_buffer(fd, compressed, stream.next_out - compressed) < 0)
+			die("unable to write sha1 file");
+		stream.next_out = compressed;
+		stream.avail_out = sizeof(compressed);
+	} while (ret == Z_OK);
+
 	if (ret != Z_STREAM_END)
 		die("unable to deflate new object %s (%d)", sha1_to_hex(sha1), ret);
-
 	ret = deflateEnd(&stream);
 	if (ret != Z_OK)
 		die("deflateEnd on object %s failed (%d)", sha1_to_hex(sha1), ret);
 
-	size = stream.total_out;
-
-	if (write_buffer(fd, compressed, size) < 0)
-		die("unable to write sha1 file");
 	close_sha1_file(fd);
-	free(compressed);
 
 	if (mtime) {
 		struct utimbuf utb;


* Re: [PATCH] sha1_file: don't malloc the whole compressed result when writing out objects
From: Junio C Hamano @ 2010-02-21 19:45 UTC
  To: Nicolas Pitre; +Cc: git

Nicolas Pitre <nico@fluxnic.net> writes:

> I think this is worth doing independently of the paranoid mode being 
> discussed.

While I agree it might be worth doing, I can see that you really hate
"paranoia".  Now that your loop lets deflate() decide how much it
happens to consume in a given round, it is much trickier to plug the
paranoia in without majorly rewriting the loop this patch introduces.



* Re: [PATCH] sha1_file: don't malloc the whole compressed result when writing out objects
From: Nicolas Pitre @ 2010-02-21 21:26 UTC
  To: Junio C Hamano; +Cc: git

On Sun, 21 Feb 2010, Junio C Hamano wrote:

> Nicolas Pitre <nico@fluxnic.net> writes:
> 
> > I think this is worth doing independently of the paranoid mode being 
> > discussed.
> 
> While I agree it might be worth doing, I can see that you really hate
> "paranoia".  Now your loop is letting deflate() decide how much it happens
> to like to consume in a given round, it is much trickier to plug the
> paranoia in without majorly rewriting the loop this patch introduces.

I disagree.

Here's my take on the paranoia issue.  Now the question is whether or 
not this should really be optional.  I would think not.

FWIW, we already have that double SHA1 protection when dealing with pack 
files with fixup_pack_header_footer() (see commit abeb40e5aa).

---------- >8
From: Nicolas Pitre <nico@fluxnic.net>
Date: Sun, 21 Feb 2010 15:48:06 -0500
Subject: [PATCH] sha1_file: be paranoid when creating loose objects

We don't want the data being deflated and stored into loose objects
to be different from what we expect.  While the deflated data is
protected by a CRC which is good enough for safe data retrieval
operations, we still want to be doubly sure that the source data used
at object creation time is still what we expected once that data has
been deflated and its CRC32 computed.

The most plausible data corruption may occur if the source file is
modified while Git is deflating and writing it out into a loose object.
Or Git itself could have a bug causing memory corruption.  Or even bad
RAM could cause trouble.  So it is best to make sure everything is
coherent and checksum protected from beginning to end.

To do so we compute the SHA1 of the data being deflated _after_ the
deflate operation has consumed that data, and make sure it matches
with the expected SHA1.  This way we can rely on the CRC32 checked by
the inflate operation to provide a good indication that the data is still
coherent with its SHA1 hash.

There is some overhead of course. Using 'git add' on a set of large files:

Before:

	real    0m25.210s
	user    0m23.783s
	sys     0m1.408s

After:

	real    0m26.537s
	user    0m25.175s
	sys     0m1.358s

The overhead is around 5% for a full data coherency guarantee.

Signed-off-by: Nicolas Pitre <nico@fluxnic.net>

diff --git a/sha1_file.c b/sha1_file.c
index 9196b57..c0214d7 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2283,6 +2283,8 @@ static int write_loose_object(const unsigned char *sha1, char *hdr, int hdrlen,
 	int fd, ret;
 	unsigned char compressed[4096];
 	z_stream stream;
+	git_SHA_CTX c;
+	unsigned char parano_sha1[20];
 	char *filename;
 	static char tmpfile[PATH_MAX];
 
@@ -2302,18 +2304,22 @@ static int write_loose_object(const unsigned char *sha1, char *hdr, int hdrlen,
 	deflateInit(&stream, zlib_compression_level);
 	stream.next_out = compressed;
 	stream.avail_out = sizeof(compressed);
+	git_SHA1_Init(&c);
 
 	/* First header.. */
 	stream.next_in = (unsigned char *)hdr;
 	stream.avail_in = hdrlen;
 	while (deflate(&stream, 0) == Z_OK)
 		/* nothing */;
+	git_SHA1_Update(&c, hdr, hdrlen);
 
 	/* Then the data itself.. */
 	stream.next_in = buf;
 	stream.avail_in = len;
 	do {
+		unsigned char *in0 = stream.next_in;
 		ret = deflate(&stream, Z_FINISH);
+		git_SHA1_Update(&c, in0, stream.next_in - in0);
 		if (write_buffer(fd, compressed, stream.next_out - compressed) < 0)
 			die("unable to write sha1 file");
 		stream.next_out = compressed;
@@ -2325,6 +2331,9 @@ static int write_loose_object(const unsigned char *sha1, char *hdr, int hdrlen,
 	ret = deflateEnd(&stream);
 	if (ret != Z_OK)
 		die("deflateEnd on object %s failed (%d)", sha1_to_hex(sha1), ret);
+	git_SHA1_Final(parano_sha1, &c);
+	if (hashcmp(sha1, parano_sha1) != 0)
+		die("confused by unstable object source data for %s", sha1_to_hex(sha1));
 
 	close_sha1_file(fd);
 


* Re: [PATCH] sha1_file: don't malloc the whole compressed result when writing out objects
From: Junio C Hamano @ 2010-02-21 22:22 UTC
  To: Nicolas Pitre; +Cc: git

Nicolas Pitre <nico@fluxnic.net> writes:

> I disagree.
>
> Here's my take on the paranoia issue.

Ahh, yes, of course.

You are always a better programmer than I am, and I keep getting reminded
of it.

Thanks, and I agree it is a sane thing to do unconditionally.


* Re: [PATCH] sha1_file: don't malloc the whole compressed result when writing out objects
From: Junio C Hamano @ 2010-02-21 22:30 UTC
  To: Nicolas Pitre; +Cc: git

Nicolas Pitre <nico@fluxnic.net> writes:

>  	/* Then the data itself.. */
>  	stream.next_in = buf;
>  	stream.avail_in = len;
>  	do {
> +		unsigned char *in0 = stream.next_in;
>  		ret = deflate(&stream, Z_FINISH);
> +		git_SHA1_Update(&c, in0, stream.next_in - in0);

Actually, I have to take my earlier comment back.  This is not "paranoia".

I do not see anything that protects the memory area between in0 and
stream.next_in from getting modified from the outside while deflate() or
SHA1_Update() runs.  Unless you copy the data away to somewhere stable at
the beginning of each iteration of this loop and run deflate() and
SHA1_Update() on that copy, you cannot have "paranoia".

My comment about "trickier" is about determining the size of that buffer
used as "somewhere stable".


* Re: [PATCH] sha1_file: don't malloc the whole compressed result when writing out objects
From: Nicolas Pitre @ 2010-02-22  1:35 UTC
  To: Junio C Hamano; +Cc: git

On Sun, 21 Feb 2010, Junio C Hamano wrote:

> Nicolas Pitre <nico@fluxnic.net> writes:
> 
> >  	/* Then the data itself.. */
> >  	stream.next_in = buf;
> >  	stream.avail_in = len;
> >  	do {
> > +		unsigned char *in0 = stream.next_in;
> >  		ret = deflate(&stream, Z_FINISH);
> > +		git_SHA1_Update(&c, in0, stream.next_in - in0);
> 
> Actually, I have to take my earlier comment back.  This is not "paranoia".
> 
> I do not see anything that protects the memory area between in0 and
> stream.next_in from getting modified from the outside while deflate() or
> SHA1_Update() runs.

So what?

> Unless you copy the data away to somewhere stable at
> the beginning of each iteration of this loop and run deflate() and
> SHA1_Update() on that copy, you cannot have "paranoia".

No.

The whole point is to detect data incoherencies.

So the current sequence of events is as follows:

T0	write_sha1_file_prepare() is called
T1	start initial SHA1 computation on data buffer
T2	in the middle of initial SHA1 computation
T3	end of initial SHA1 computation -> object name is determined
T4	write_loose_object() is called
...	enter the write loop
T5+n	deflate() called on buffer n
T6+n	git_SHA1_Update() called on the same buffer n
T7+n	deflated data written out
...
Tend	abort if result of T6+n doesn't match object name from T3

So... what can happen:

1) Data is externally modified before T5+n: deflated data and its CRC32 
   will be coherent with the SHA1 computed in T6+n, but incoherent with 
   the SHA1 used for the object name. Wrong data is written to the 
   object even if it will inflate OK. We really want to prevent that 
   from happening. The test in Tend will fail.

2) Data is externally modified between T5+n and T6+n: the deflated data 
   and CRC32 will be coherent with the object name but incoherent with 
   the parano_sha1.  Although written data will be OK, this is way too 
   close to being wrong, and the test in Tend will fail.  If there is 
   more than one round through the loop and the external modifications are 
   large enough then this becomes the same as case 1 above.

3) Data is externally modified in T2: again the test in Tend will fail.

So in all possible cases I can think of, the write will abort.  No copy 
buffer needed, no filesystem mtime required, etc.  If the whole data is 
not stable between T1 and Tend then the object is not added to the 
repository.  Of course it is possible for the data to be modified at the 
beginning of the file while the loop in T[5-7] is past that point.  
But still, there is no data inconsistency at that point.

> My comment about "trickier" is about determining the size of that buffer
> used as "somewhere stable".

We don't care about such a buffer.


Nicolas


* Re: [PATCH] sha1_file: don't malloc the whole compressed result when writing out objects
From: Junio C Hamano @ 2010-02-22  5:30 UTC
  To: Nicolas Pitre; +Cc: Junio C Hamano, git

Nicolas Pitre <nico@fluxnic.net> writes:

> The whole point is to detect data incoherencies.

Yes.  We want to make sure that the SHA-1 we compute is over what we fed
deflate().

> So the current sequence of events is as follows:
>
> T0	write_sha1_file_prepare() is called
> T1	start initial SHA1 computation on data buffer
> T2	in the middle of initial SHA1 computation
> T3	end of initial SHA1 computation -> object name is determined
> T4	write_loose_object() is called
> ...	enter the write loop
> T5+n	deflate() called on buffer n
> T6+n	git_SHA1_Update() called on the same buffer n
> T7+n	deflated data written out
> ...
> Tend	abort if result of T6+n doesn't match object name from T3
>
> So... what can happen:
>
> 1) Data is externally modified before T5+n: deflated data and its CRC32 
>    will be coherent with the SHA1 computed in T6+n, but incoherent with 
>    the SHA1 used for the object name. Wrong data is written to the 
>    object even if it will inflate OK. We really want to prevent that 
>    from happening. The test in Tend will fail.
>
> 2) Data is externally modified between T5+n and T6+n: the deflated data 
>    and CRC32 will be coherent with the object name but incoherent with 
>    the parano_sha1.  Although written data will be OK, this is way too 
>    close to being wrong, and the test in Tend will fail.  If there is 
>    more than one round through the loop and the external modifications are 
>    large enough then this becomes the same as case 1 above.
>
> 3) Data is externally modified in T2: again the test in Tend will fail.
>
> So in all possible cases I can think of, the write will abort.

There is one pathological case.

Immediately before T5+n (or between T5+n and T6+n), the external process
changes the data deflate() is working on, but before T6+n, the external
process changes the data back.  The two SHA-1s computed may match, but
neither is a hash over what was actually deflated; you won't be able to
abort.
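
In timeline form (a contrived schedule, with an external writer flipping
one byte between its old value A and a new value B):

	T5+n - epsilon   external writer sets the byte to B
	T5+n             deflate() consumes B into the compressed stream
	T5+n + epsilon   external writer sets the byte back to A
	T6+n             git_SHA1_Update() hashes A

The paranoia SHA-1 then agrees with the object name (both are computed
over A), while the loose object actually contains B.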


* Re: [PATCH] sha1_file: don't malloc the whole compressed result when writing out objects
From: Nicolas Pitre @ 2010-02-22  5:50 UTC
  To: Junio C Hamano; +Cc: git

On Sun, 21 Feb 2010, Junio C Hamano wrote:

> Nicolas Pitre <nico@fluxnic.net> writes:
> 
> > The whole point is to detect data incoherencies.
> 
> Yes.  We want to make sure that the SHA-1 we compute is over what we fed
> deflate().
> 
> > So the current sequence of events is as follows:
> >
> > T0	write_sha1_file_prepare() is called
> > T1	start initial SHA1 computation on data buffer
> > T2	in the middle of initial SHA1 computation
> > T3	end of initial SHA1 computation -> object name is determined
> > T4	write_loose_object() is called
> > ...	enter the write loop
> > T5+n	deflate() called on buffer n
> > T6+n	git_SHA1_Update() called on the same buffer n
> > T7+n	deflated data written out
> > ...
> > Tend	abort if result of T6+n doesn't match object name from T3
> >
> > So... what can happen:
> >
> > 1) Data is externally modified before T5+n: deflated data and its CRC32 
> >    will be coherent with the SHA1 computed in T6+n, but incoherent with 
> >    the SHA1 used for the object name. Wrong data is written to the 
> >    object even if it will inflate OK. We really want to prevent that 
> >    from happening. The test in Tend will fail.
> >
> > 2) Data is externally modified between T5+n and T6+n: the deflated data 
> >    and CRC32 will be coherent with the object name but incoherent with 
> >    the parano_sha1.  Although written data will be OK, this is way too 
> >    close to being wrong, and the test in Tend will fail.  If there is 
> >    more than one round through the loop and the external modifications are 
> >    large enough then this becomes the same as case 1 above.
> >
> > 3) Data is externally modified in T2: again the test in Tend will fail.
> >
> > So in all possible cases I can think of, the write will abort.
> 
> There is one pathological case.
> 
> Immediately before T5+n (or between T5+n and T6+n), the external process
> changes the data deflate() is working on, but before T6+n, the external
> process changes the data back.  The two SHA-1s computed may match, but
> neither is a hash over what was actually deflated; you won't be able to
> abort.

And what real life case would trigger this?  Given the size of the 
window for this to happen, what are your chances?

Of course the odds for me to be struck by lightning also exist.  And if 
I work really really hard at it then I might be able to trigger that 
pathological case above even before the next thunderstorm.  But in 
practice I'm hardly concerned by either of those possibilities.


Nicolas


* Re: [PATCH] sha1_file: don't malloc the whole compressed result when writing out objects
From: Junio C Hamano @ 2010-02-22  6:17 UTC
  To: Nicolas Pitre; +Cc: git

Nicolas Pitre <nico@fluxnic.net> writes:

> And what real life case would trigger this?  Given the size of the 
> window for this to happen, what are your chances?

> Of course the odds for me to be struck by lightning also exist.  And if 
> I work really really hard at it then I might be able to trigger that 
> pathological case above even before the next thunderstorm.  But in 
> practice I'm hardly concerned by either of those possibilities.

The chance that any of this triggers in real life is zero for me, as I
won't be mistreating git as a continuous & asynchronous back-up tool.

But then that would make the whole discussion moot.  There are people who
file "bug reports" with an artificial reproduction recipe built around a
loop that runs dd continuously overwriting a file while "git add" is asked
to add it.


* Re: [PATCH] sha1_file: don't malloc the whole compressed result when writing out objects
From: Dmitry Potapov @ 2010-02-22  6:27 UTC
  To: Nicolas Pitre; +Cc: Junio C Hamano, git

On Mon, Feb 22, 2010 at 12:50:18AM -0500, Nicolas Pitre wrote:
> 
> And what real life case would trigger this?  Given the size of the 
> window for this to happen, what are your chances?

If some process changes just one byte (or one word) back and forth,
then the probability of this is 25%.  Whether such a process can
exist, I don't know...  I cannot imagine that anyone would want
to change a file while we are adding it to the repository.  In any case,
it is wrong to call it a _paranoid_ mode if there is even a small
(but practically feasible) chance of this happening.
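
To spell that 25% out (a simplified model, assuming the object name was
computed over the old value, and that deflate() and git_SHA1_Update()
each sample the flipping byte once, independently, with equal odds):

	deflate sees old, hash sees old  -> correct object written
	deflate sees old, hash sees new  -> SHA1 mismatch, write aborts
	deflate sees new, hash sees new  -> mismatch with the object name,
	                                    write aborts
	deflate sees new, hash sees old  -> both SHA1s agree, the corrupted
	                                    deflated data goes undetected

Only the last of the four equally likely cases slips through, hence 25%.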


Dmitry


* Re: [PATCH] sha1_file: don't malloc the whole compressed result when writing out objects
From: Junio C Hamano @ 2010-02-22  6:31 UTC
  To: Nicolas Pitre; +Cc: git

Junio C Hamano <gitster@pobox.com> writes:

> Nicolas Pitre <nico@fluxnic.net> writes:
>
>> And what real life case would trigger this?  Given the size of the 
>> window for this to happen, what are your chances?
>
>> Of course the odds for me to be struck by lightning also exist.  And if 
>> I work really really hard at it then I might be able to trigger that 
>> pathological case above even before the next thunderstorm.  But in 
>> practice I'm hardly concerned by either of those possibilities.
>
> The chance that any of this triggers in real life is zero for me, as I
> won't be mistreating git as a continuous & asynchronous back-up tool.
>
> But then that would make the whole discussion moot.  There are people who
> file "bug reports" with an artificial reproduction recipe built around a
> loop that runs dd continuously overwriting a file while "git add" is asked
> to add it.

Having said all that, I like your approach better.  It is not worth paying
the price of an unnecessary memcpy(3) that would _only_ help catch the
insanely artificial test case, but your patch strikes a good balance of
small overhead to catch the easier-to-trigger (either by stupidity, malice
or mistake) cases.

So I am tempted to discard the "paranoia" patch, and replace it with your
two patches, with the following caveat in the log message.

--- /var/tmp/2	2010-02-21 22:23:30.000000000 -0800
+++ /var/tmp/1	2010-02-21 22:23:22.000000000 -0800
@@ -21,7 +21,9 @@
     deflate operation has consumed that data, and make sure it matches
     with the expected SHA1.  This way we can rely on the CRC32 checked by
     the inflate operation to provide a good indication that the data is still
-    coherent with its SHA1 hash.
+    coherent with its SHA1 hash.  One pathological case we ignore is when
+    the data is modified before (or during) the deflate call, but changed back
+    before it is hashed.
     
     There is some overhead of course. Using 'git add' on a set of large files:
     


* Re: [PATCH] sha1_file: don't malloc the whole compressed result when writing out objects
From: Nicolas Pitre @ 2010-02-22 17:36 UTC
  To: Junio C Hamano; +Cc: git

On Sun, 21 Feb 2010, Junio C Hamano wrote:

> Junio C Hamano <gitster@pobox.com> writes:
> 
> > Nicolas Pitre <nico@fluxnic.net> writes:
> >
> >> And what real life case would trigger this?  Given the size of the 
> >> window for this to happen, what are your chances?
> >
> >> Of course the odds for me to be struck by lightning also exist.  And if 
> >> I work really really hard at it then I might be able to trigger that 
> >> pathological case above even before the next thunderstorm.  But in 
> >> practice I'm hardly concerned by either of those possibilities.
> >
> > The chance that any of this triggers in real life is zero for me, as I
> > won't be mistreating git as a continuous & asynchronous back-up tool.
> >
> > But then that would make the whole discussion moot.  There are people who
> > file "bug reports" with an artificial reproduction recipe built around a
> > loop that runs dd continuously overwriting a file while "git add" is asked
> > to add it.
> 
> Having said all that, I like your approach better.  It is not worth paying
> the price of an unnecessary memcpy(3) that would _only_ help catch the
> insanely artificial test case, but your patch strikes a good balance of
> small overhead to catch the easier-to-trigger (either by stupidity, malice
> or mistake) cases.

I think it also catches the bad RAM case, which is probably more common 
too.

> So I am tempted to discard the "paranoia" patch, and replace it with your
> two patches, with the following caveat in the log message.
> 
> --- /var/tmp/2	2010-02-21 22:23:30.000000000 -0800
> +++ /var/tmp/1	2010-02-21 22:23:22.000000000 -0800
> @@ -21,7 +21,9 @@
>      deflate operation has consumed that data, and make sure it matches
>      with the expected SHA1.  This way we can rely on the CRC32 checked by
>      the inflate operation to provide a good indication that the data is still
> -    coherent with its SHA1 hash.
> +    coherent with its SHA1 hash.  One pathological case we ignore is when
> +    the data is modified before (or during) the deflate call, but changed back
> +    before it is hashed.

ACK.


Nicolas


* Re: [PATCH] sha1_file: don't malloc the whole compressed result when writing out objects
From: Junio C Hamano @ 2010-02-22 19:55 UTC
  To: Nicolas Pitre; +Cc: git

Nicolas Pitre <nico@fluxnic.net> writes:

>> Having said all that, I like your approach better.  It is not worth paying
>> the price of an unnecessary memcpy(3) that would _only_ help catch the
>> insanely artificial test case, but your patch strikes a good balance of
>> small overhead to catch the easier-to-trigger (either by stupidity, malice
>> or mistake) cases.
>
> I think it also catches the bad RAM case, which is probably more common
> too.

That is true; broken RAM that yields different values between the time
the first hash runs and the time the deflate loop runs will trigger the
safety.

