All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 1/2] t/t8006: Demonstrate blame is broken when cachetextconv is on
@ 2010-12-18 14:54 Kirill Smelkov
  2010-12-18 14:54 ` [PATCH 2/2] fill_textconv(): Don't get/put cache if sha1 is not valid Kirill Smelkov
  0 siblings, 1 reply; 13+ messages in thread
From: Kirill Smelkov @ 2010-12-18 14:54 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Kirill Smelkov, Axel Bonnet, Clément Poulain,
	Diane Gasselin, Jeff King

I have a git repository with lots of .doc and .pdf files. There diff
works ok, but blaming is painfully slow without textconv cache, and with
textconv cache, blame says lots of lines are 'Not Yet Committed' which
is wrong.

Here is a test that demonstrates the problem.

Cc: Axel Bonnet <axel.bonnet@ensimag.imag.fr>
Cc: Clément Poulain <clement.poulain@ensimag.imag.fr>
Cc: Diane Gasselin <diane.gasselin@ensimag.imag.fr>
Cc: Jeff King <peff@peff.net>
Signed-off-by: Kirill Smelkov <kirr@landau.phys.spbu.ru>
---
 t/t8006-blame-textconv.sh |   22 ++++++++++++++++++++++
 1 files changed, 22 insertions(+), 0 deletions(-)

diff --git a/t/t8006-blame-textconv.sh b/t/t8006-blame-textconv.sh
index dbf623b..fe90541 100755
--- a/t/t8006-blame-textconv.sh
+++ b/t/t8006-blame-textconv.sh
@@ -73,6 +73,28 @@ test_expect_success 'blame --textconv going through revisions' '
 	test_cmp expected result
 '
 
+test_expect_success 'setup +cachetextconv' '
+	git config diff.test.cachetextconv true
+'
+
+cat >expected_one <<EOF
+(Number2 2010-01-01 20:00:00 +0000 1) converted: test 1 version 2
+EOF
+
+# one.bin is blamed as 'Not Committed yet'
+test_expect_failure 'blame --textconv works with textconvcache' '
+	git blame --textconv two.bin >blame &&
+	find_blame <blame >result &&
+	test_cmp expected result &&
+	git blame --textconv one.bin >blame &&
+	find_blame  <blame >result &&
+	test_cmp expected_one result
+'
+
+test_expect_success 'setup -cachetextconv' '
+	git config diff.test.cachetextconv false
+'
+
 test_expect_success 'make a new commit' '
 	echo "bin: test number 2 version 3" >>two.bin &&
 	GIT_AUTHOR_NAME=Number3 git commit -a -m Third --date="2010-01-01 22:00:00"
-- 
1.7.3.4.570.g14308

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 2/2] fill_textconv(): Don't get/put cache if sha1 is not valid
  2010-12-18 14:54 [PATCH 1/2] t/t8006: Demonstrate blame is broken when cachetextconv is on Kirill Smelkov
@ 2010-12-18 14:54 ` Kirill Smelkov
  2010-12-18 16:13   ` Jeff King
  2010-12-20  2:32   ` Junio C Hamano
  0 siblings, 2 replies; 13+ messages in thread
From: Kirill Smelkov @ 2010-12-18 14:54 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Kirill Smelkov, Axel Bonnet, Clément Poulain,
	Diane Gasselin, Jeff King

It turned out, under blame there are requests to fill_textconv() with
sha1=0000000000000000000000000000000000000000 and sha1_valid=0.

As the code did not analyzed sha1 validity, we ended up putting 000000
into textconv cache which was fooling later blames to discover lots of
lines in 'Not Yet Committed' state.

Fix it.

Cc: Axel Bonnet <axel.bonnet@ensimag.imag.fr>
Cc: Clément Poulain <clement.poulain@ensimag.imag.fr>
Cc: Diane Gasselin <diane.gasselin@ensimag.imag.fr>
Cc: Jeff King <peff@peff.net>
Signed-off-by: Kirill Smelkov <kirr@landau.phys.spbu.ru>
---
 diff.c                    |    4 ++--
 t/t8006-blame-textconv.sh |    3 +--
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/diff.c b/diff.c
index 0a43869..5422c43 100644
--- a/diff.c
+++ b/diff.c
@@ -4412,7 +4412,7 @@ size_t fill_textconv(struct userdiff_driver *driver,
 		return df->size;
 	}
 
-	if (driver->textconv_cache) {
+	if (driver->textconv_cache && df->sha1_valid) {
 		*outbuf = notes_cache_get(driver->textconv_cache, df->sha1,
 					  &size);
 		if (*outbuf)
@@ -4423,7 +4423,7 @@ size_t fill_textconv(struct userdiff_driver *driver,
 	if (!*outbuf)
 		die("unable to read files to diff");
 
-	if (driver->textconv_cache) {
+	if (driver->textconv_cache && df->sha1_valid) {
 		/* ignore errors, as we might be in a readonly repository */
 		notes_cache_put(driver->textconv_cache, df->sha1, *outbuf,
 				size);
diff --git a/t/t8006-blame-textconv.sh b/t/t8006-blame-textconv.sh
index fe90541..ea64cd8 100755
--- a/t/t8006-blame-textconv.sh
+++ b/t/t8006-blame-textconv.sh
@@ -81,8 +81,7 @@ cat >expected_one <<EOF
 (Number2 2010-01-01 20:00:00 +0000 1) converted: test 1 version 2
 EOF
 
-# one.bin is blamed as 'Not Committed yet'
-test_expect_failure 'blame --textconv works with textconvcache' '
+test_expect_success 'blame --textconv works with textconvcache' '
 	git blame --textconv two.bin >blame &&
 	find_blame <blame >result &&
 	test_cmp expected result &&
-- 
1.7.3.4.570.g14308

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] fill_textconv(): Don't get/put cache if sha1 is not valid
  2010-12-18 14:54 ` [PATCH 2/2] fill_textconv(): Don't get/put cache if sha1 is not valid Kirill Smelkov
@ 2010-12-18 16:13   ` Jeff King
  2010-12-18 20:55     ` Kirill Smelkov
  2010-12-20  2:26     ` Junio C Hamano
  2010-12-20  2:32   ` Junio C Hamano
  1 sibling, 2 replies; 13+ messages in thread
From: Jeff King @ 2010-12-18 16:13 UTC (permalink / raw)
  To: Kirill Smelkov
  Cc: Junio C Hamano, git, Axel Bonnet, Clément Poulain, Diane Gasselin

On Sat, Dec 18, 2010 at 05:54:12PM +0300, Kirill Smelkov wrote:

> It turned out, under blame there are requests to fill_textconv() with
> sha1=0000000000000000000000000000000000000000 and sha1_valid=0.
> 
> As the code did not analyzed sha1 validity, we ended up putting 000000
> into textconv cache which was fooling later blames to discover lots of
> lines in 'Not Yet Committed' state.
> [...]
> -	if (driver->textconv_cache) {
> +	if (driver->textconv_cache && df->sha1_valid) {
>  		*outbuf = notes_cache_get(driver->textconv_cache, df->sha1,
>  					  &size);

In short:

  Acked-by: Jeff King <peff@peff.net>

But it took some thinking to convince myself, so the long answer is
below if anyone cares.

I was dubious at first that this could be the right solution. We still
end up putting the filespec through run_textconv, which didn't seem
right if it is not valid.

But reading into it more, there are two levels of invalidity:

  1. !DIFF_FILE_VALID(df) - we are not a valid file at all. I.e., we are
     /dev/null.

  2. !df->sha1_valid - we are pointing to a working tree file whose sha1
     we don't know

I think level (2) never happens at all in the regular diff code, which
is why this case was completely unhandled. But it is OK in that case
(required, even) to put the contents through run_textconv.

In theory we could actually calculate the sha1 in case (2) and cache
under that, but I don't know how much it would buy us in practice. It
saves us running the textconv filter at the expense of computing the
sha1. Which is probably a win for most filters, but on the other hand,
it is the wrong place to compute such a sha1. If it is a working tree
file, we should ideally update our stat info in the index so that the
info can be reused.

-Peff

PS It is a little disturbing that in fill_textconv, we handle
case (1), !DIFF_FILE_VALID for the non-textconv case, but not so for the
textconv case. I think we are OK, as get_textconv will never load a
textconv driver for a !DIFF_FILE_VALID filespec, so we always follow the
non-textconv codepath in that case. But I am tempted to do this just to
be more defensive:

diff --git a/diff.c b/diff.c
index b0ee213..5320849 100644
--- a/diff.c
+++ b/diff.c
@@ -4404,22 +4404,25 @@ size_t fill_textconv(struct userdiff_driver *driver,
 	if (!driver || !driver->textconv) {
 		if (!DIFF_FILE_VALID(df)) {
 			*outbuf = "";
 			return 0;
 		}
 		if (diff_populate_filespec(df, 0))
 			die("unable to read files to diff");
 		*outbuf = df->data;
 		return df->size;
 	}
 
+	if (!DIFF_FILE_VALID(df))
+		die("BUG: attempt to textconv an invalid filespec");
+
 	if (driver->textconv_cache) {
 		*outbuf = notes_cache_get(driver->textconv_cache, df->sha1,
 					  &size);
 		if (*outbuf)
 			return size;
 	}
 
 	*outbuf = run_textconv(driver->textconv, df, &size);
 	if (!*outbuf)
 		die("unable to read files to diff");
 

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] fill_textconv(): Don't get/put cache if sha1 is not valid
  2010-12-18 16:13   ` Jeff King
@ 2010-12-18 20:55     ` Kirill Smelkov
  2010-12-19  3:23       ` Junio C Hamano
  2010-12-20  2:26     ` Junio C Hamano
  1 sibling, 1 reply; 13+ messages in thread
From: Kirill Smelkov @ 2010-12-18 20:55 UTC (permalink / raw)
  To: Jeff King
  Cc: Junio C Hamano, git, Axel Bonnet, Cl??ment Poulain, Diane Gasselin

On Sat, Dec 18, 2010 at 11:13:37AM -0500, Jeff King wrote:
> On Sat, Dec 18, 2010 at 05:54:12PM +0300, Kirill Smelkov wrote:
> 
> > It turned out, under blame there are requests to fill_textconv() with
> > sha1=0000000000000000000000000000000000000000 and sha1_valid=0.
> > 
> > As the code did not analyzed sha1 validity, we ended up putting 000000
> > into textconv cache which was fooling later blames to discover lots of
> > lines in 'Not Yet Committed' state.
> > [...]
> > -	if (driver->textconv_cache) {
> > +	if (driver->textconv_cache && df->sha1_valid) {
> >  		*outbuf = notes_cache_get(driver->textconv_cache, df->sha1,
> >  					  &size);
> 
> In short:
> 
>   Acked-by: Jeff King <peff@peff.net>
> 
> But it took some thinking to convince myself, so the long answer is
> below if anyone cares.
> 
> I was dubious at first that this could be the right solution. We still
> end up putting the filespec through run_textconv, which didn't seem
> right if it is not valid.
> 
> But reading into it more, there are two levels of invalidity:
> 
>   1. !DIFF_FILE_VALID(df) - we are not a valid file at all. I.e., we are
>      /dev/null.
> 
>   2. !df->sha1_valid - we are pointing to a working tree file whose sha1
>      we don't know
> 
> I think level (2) never happens at all in the regular diff code, which
> is why this case was completely unhandled. But it is OK in that case
> (required, even) to put the contents through run_textconv.
> 
> In theory we could actually calculate the sha1 in case (2) and cache
> under that, but I don't know how much it would buy us in practice. It
> saves us running the textconv filter at the expense of computing the
> sha1. Which is probably a win for most filters, but on the other hand,
> it is the wrong place to compute such a sha1. If it is a working tree
> file, we should ideally update our stat info in the index so that the
> info can be reused.

Jeff,

Thanks for your ACK and for the explanation.

My last patches to git were blame related so semi-intuitively I knew
that invalid sha1 are coming from files in worktree. Your description
makes things much more clear and I'd put it into patch log as well.
What is the best practice for this? For me to re-roll, or for Junio to
merge texts?

Also experimenting shows that, as you say, regular diff does not have this
problem, and also that diff calculates sha1 for files in worktree and so
their textconv results are cached. So clearly, there are some behaviour
differences between diff and blame.


Thanks again,
Kirill

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] fill_textconv(): Don't get/put cache if sha1 is not valid
  2010-12-18 20:55     ` Kirill Smelkov
@ 2010-12-19  3:23       ` Junio C Hamano
  2010-12-19 12:10         ` Kirill Smelkov
  0 siblings, 1 reply; 13+ messages in thread
From: Junio C Hamano @ 2010-12-19  3:23 UTC (permalink / raw)
  To: Kirill Smelkov
  Cc: Jeff King, git, Axel Bonnet, Clément Poulain, Diane Gasselin

Kirill Smelkov <kirr@landau.phys.spbu.ru> writes:

> Thanks for your ACK and for the explanation.
>
> My last patches to git were blame related so semi-intuitively I knew
> that invalid sha1 are coming from files in worktree. Your description
> makes things much more clear and I'd put it into patch log as well.
> What is the best practice for this? For me to re-roll, or for Junio to
> merge texts?

Re-rolling to explain changes in your own words is preferred; thanks.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] fill_textconv(): Don't get/put cache if sha1 is not valid
  2010-12-19  3:23       ` Junio C Hamano
@ 2010-12-19 12:10         ` Kirill Smelkov
  2010-12-20  2:41           ` Junio C Hamano
  0 siblings, 1 reply; 13+ messages in thread
From: Kirill Smelkov @ 2010-12-19 12:10 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jeff King, git, Axel Bonnet, Cl??ment Poulain, Diane Gasselin

On Sat, Dec 18, 2010 at 07:23:29PM -0800, Junio C Hamano wrote:
> Kirill Smelkov <kirr@landau.phys.spbu.ru> writes:
> 
> > Thanks for your ACK and for the explanation.
> >
> > My last patches to git were blame related so semi-intuitively I knew
> > that invalid sha1 are coming from files in worktree. Your description
> > makes things much more clear and I'd put it into patch log as well.
> > What is the best practice for this? For me to re-roll, or for Junio to
> > merge texts?
> 
> Re-rolling to explain changes in your own words is preferred; thanks.

I see, thanks.

I'm not that familiar with git internals involved, so here is updated
patch with added paragraph about "df->sha1_valid=0 means files from
worktree with unknown sha1", and appropriate excerpt from Jeff's post.
That's the most reasonable I could come up with.

Thanks,
Kirill

P.S. please don't forget to pick patch 1 which is unchanged.



---- 8< ----

From: Kirill Smelkov <kirr@landau.phys.spbu.ru>
Date: Sat, 18 Dec 2010 16:27:28 +0300
Subject: [PATCH v2 2/2] fill_textconv(): Don't get/put cache if sha1 is not valid
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

It turned out, under blame there are requests to fill_textconv() with
sha1=0000000000000000000000000000000000000000 and sha1_valid=0.

As the code did not analyzed sha1 validity, we ended up putting 000000
into textconv cache which was fooling later blames to discover lots of
lines in 'Not Yet Committed' state.

In practice df->sha1_valid=0 means blame requests to run textconv on a
file in worktree whose sha1 is not know yet.

Fix it.

On Sat, Dec 18, 2010 at 11:13:37AM -0500, Jeff King wrote:
> 
> In short:
> 
>   Acked-by: Jeff King <peff@peff.net>
> 
> But it took some thinking to convince myself, so the long answer is
> below if anyone cares.
> 
> I was dubious at first that this could be the right solution. We still
> end up putting the filespec through run_textconv, which didn't seem
> right if it is not valid.
> 
> But reading into it more, there are two levels of invalidity:
> 
>   1. !DIFF_FILE_VALID(df) - we are not a valid file at all. I.e., we are
>      /dev/null.
> 
>   2. !df->sha1_valid - we are pointing to a working tree file whose sha1
>      we don't know
> 
> I think level (2) never happens at all in the regular diff code, which
> is why this case was completely unhandled. But it is OK in that case
> (required, even) to put the contents through run_textconv.

Cc: Axel Bonnet <axel.bonnet@ensimag.imag.fr>
Cc: Clц╘ment Poulain <clement.poulain@ensimag.imag.fr>
Cc: Diane Gasselin <diane.gasselin@ensimag.imag.fr>
Cc: Jeff King <peff@peff.net>
Signed-off-by: Kirill Smelkov <kirr@landau.phys.spbu.ru>
Acked-by: Jeff King <peff@peff.net>
---
 diff.c                    |    4 ++--
 t/t8006-blame-textconv.sh |    3 +--
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/diff.c b/diff.c
index 0a43869..5422c43 100644
--- a/diff.c
+++ b/diff.c
@@ -4412,7 +4412,7 @@ size_t fill_textconv(struct userdiff_driver *driver,
 		return df->size;
 	}
 
-	if (driver->textconv_cache) {
+	if (driver->textconv_cache && df->sha1_valid) {
 		*outbuf = notes_cache_get(driver->textconv_cache, df->sha1,
 					  &size);
 		if (*outbuf)
@@ -4423,7 +4423,7 @@ size_t fill_textconv(struct userdiff_driver *driver,
 	if (!*outbuf)
 		die("unable to read files to diff");
 
-	if (driver->textconv_cache) {
+	if (driver->textconv_cache && df->sha1_valid) {
 		/* ignore errors, as we might be in a readonly repository */
 		notes_cache_put(driver->textconv_cache, df->sha1, *outbuf,
 				size);
diff --git a/t/t8006-blame-textconv.sh b/t/t8006-blame-textconv.sh
index fe90541..ea64cd8 100755
--- a/t/t8006-blame-textconv.sh
+++ b/t/t8006-blame-textconv.sh
@@ -81,8 +81,7 @@ cat >expected_one <<EOF
 (Number2 2010-01-01 20:00:00 +0000 1) converted: test 1 version 2
 EOF
 
-# one.bin is blamed as 'Not Committed yet'
-test_expect_failure 'blame --textconv works with textconvcache' '
+test_expect_success 'blame --textconv works with textconvcache' '
 	git blame --textconv two.bin >blame &&
 	find_blame <blame >result &&
 	test_cmp expected result &&
-- 
1.7.3.4.570.g14308

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] fill_textconv(): Don't get/put cache if sha1 is not valid
  2010-12-18 16:13   ` Jeff King
  2010-12-18 20:55     ` Kirill Smelkov
@ 2010-12-20  2:26     ` Junio C Hamano
  2010-12-20  4:42       ` Jeff King
  1 sibling, 1 reply; 13+ messages in thread
From: Junio C Hamano @ 2010-12-20  2:26 UTC (permalink / raw)
  To: Jeff King
  Cc: Kirill Smelkov, git, Axel Bonnet, Clément Poulain, Diane Gasselin

Jeff King <peff@peff.net> writes:

> PS It is a little disturbing that in fill_textconv, we handle
> case (1), !DIFF_FILE_VALID for the non-textconv case, but not so for the
> textconv case. I think we are OK, as get_textconv will never load a
> textconv driver for a !DIFF_FILE_VALID filespec, so we always follow the
> non-textconv codepath in that case. But I am tempted to do this just to
> be more defensive:

FILE_VALID() is about "does that side have a blob there, or is this
create/delete diff?", so the caller should be handling this properly as
you said, but your fill_textconv() already prepares for the case where the
caller for some reason calls this function with "no blob on this side" and
returns an empty string (see the precontext of your patch).

I think it is fine to be defensive to prepare for such a case, but then
dying like this patch does is inconsistent.  Perhaps we should move the
new check higher and remove the *outbuf = "" while at it?

> diff --git a/diff.c b/diff.c
> index b0ee213..5320849 100644
> --- a/diff.c
> +++ b/diff.c
> @@ -4404,22 +4404,25 @@ size_t fill_textconv(struct userdiff_driver *driver,
>  	if (!driver || !driver->textconv) {
>  		if (!DIFF_FILE_VALID(df)) {
>  			*outbuf = "";
>  			return 0;
>  		}
>  		if (diff_populate_filespec(df, 0))
>  			die("unable to read files to diff");
>  		*outbuf = df->data;
>  		return df->size;
>  	}
>  
> +	if (!DIFF_FILE_VALID(df))
> +		die("BUG: attempt to textconv an invalid filespec");
> +

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] fill_textconv(): Don't get/put cache if sha1 is not valid
  2010-12-18 14:54 ` [PATCH 2/2] fill_textconv(): Don't get/put cache if sha1 is not valid Kirill Smelkov
  2010-12-18 16:13   ` Jeff King
@ 2010-12-20  2:32   ` Junio C Hamano
  1 sibling, 0 replies; 13+ messages in thread
From: Junio C Hamano @ 2010-12-20  2:32 UTC (permalink / raw)
  To: Kirill Smelkov
  Cc: git, Axel Bonnet, Clément Poulain, Diane Gasselin, Jeff King

Kirill Smelkov <kirr@landau.phys.spbu.ru> writes:

> It turned out, under blame there are requests to fill_textconv() with
> sha1=0000000000000000000000000000000000000000 and sha1_valid=0.

The code shouldn't even look at sha1[] if sha1_valid is false, as we do
not know the hash value for the blob (reading from the working tree).

Thanks.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] fill_textconv(): Don't get/put cache if sha1 is not valid
  2010-12-19 12:10         ` Kirill Smelkov
@ 2010-12-20  2:41           ` Junio C Hamano
  2010-12-20  4:46             ` Jeff King
  0 siblings, 1 reply; 13+ messages in thread
From: Junio C Hamano @ 2010-12-20  2:41 UTC (permalink / raw)
  To: Kirill Smelkov
  Cc: Junio C Hamano, Jeff King, git, Axel Bonnet, Cl??ment Poulain,
	Diane Gasselin

Kirill Smelkov <kirr@landau.phys.spbu.ru> writes:

> On Sat, Dec 18, 2010 at 07:23:29PM -0800, Junio C Hamano wrote:
>> Kirill Smelkov <kirr@landau.phys.spbu.ru> writes:
>> 
>> > Thanks for your ACK and for the explanation.
>> >
>> > My last patches to git were blame related so semi-intuitively I knew
>> > that invalid sha1 are coming from files in worktree. Your description
>> > makes things much more clear and I'd put it into patch log as well.
>> > What is the best practice for this? For me to re-roll, or for Junio to
>> > merge texts?
>> 
>> Re-rolling to explain changes in your own words is preferred; thanks.
>
> I see, thanks.
>
> I'm not that familiar with git internals involved, so here is updated
> patch with added paragraph about "df->sha1_valid=0 means files from
> worktree with unknown sha1", and appropriate excerpt from Jeff's post.
> That's the most reasonable I could come up with.
>
> Thanks,
> Kirill
>
> P.S. please don't forget to pick patch 1 which is unchanged.

Here is how I would describe it.

commit 87bb04bb760659dd33d7a173333329cd900620a9
Author: Kirill Smelkov <kirr@landau.phys.spbu.ru>
Date:   Sat Dec 18 17:54:12 2010 +0300

    fill_textconv(): Don't get/put cache if sha1 is not valid
    
    When blaming files in the working tree, the filespec is marked with
    !sha1_valid, as we have not given the contents an object name yet.  The
    function to cache textconv results (keyed on the object name), however,
    didn't check this condition, and ended up on storing the cached result
    under a random object name.
    
    Signed-off-by: Kirill Smelkov <kirr@landau.phys.spbu.ru>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] fill_textconv(): Don't get/put cache if sha1 is not valid
  2010-12-20  2:26     ` Junio C Hamano
@ 2010-12-20  4:42       ` Jeff King
  2010-12-20  8:42         ` Junio C Hamano
  0 siblings, 1 reply; 13+ messages in thread
From: Jeff King @ 2010-12-20  4:42 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Kirill Smelkov, git, Axel Bonnet, Clément Poulain, Diane Gasselin

On Sun, Dec 19, 2010 at 06:26:55PM -0800, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > PS It is a little disturbing that in fill_textconv, we handle
> > case (1), !DIFF_FILE_VALID for the non-textconv case, but not so for the
> > textconv case. I think we are OK, as get_textconv will never load a
> > textconv driver for a !DIFF_FILE_VALID filespec, so we always follow the
> > non-textconv codepath in that case. But I am tempted to do this just to
> > be more defensive:
> 
> FILE_VALID() is about "does that side have a blob there, or is this
> create/delete diff?", so the caller should be handling this properly as
> you said, but your fill_textconv() already prepares for the case where the
> caller for some reason calls this function with "no blob on this side" and
> returns an empty string (see the precontext of your patch).
> 
> I think it is fine to be defensive to prepare for such a case, but then
> dying like this patch does is inconsistent.  Perhaps we should move the
> new check higher and remove the *outbuf = "" while at it?

I'm not sure returning the empty string for a textconv is the right
solution. I am inclined to say that trying to textconv a
!DIFF_FILE_VALID is simply an error. More on that in a second.

If we were to do anything else, I would think it would be to feed "" to
the textconv filter, in case it wanted to do something magical for the
create/delete case. For example, imagine a textconv filter which turned
a string of bytes like "foo" into a length plus set of converted bytes,
like "3: FOO". You would want the /dev/null case to turn into "0: ".

Now, this is obviously a ridiculous toy case. I have no idea if anyone
would want to do something like that. So far most people have been happy
with /dev/null never being textconv'd, and always looking like the empty
string. Moreover, even if somebody did want this behavior, 99% of the
other filters _wouldn't_ want the behavior. Because programs like
odt2txt or exiftool that people _do_ use for textconv filters do not
want to be fed /dev/null; they will signal an error.

So that is my "if we wanted it to do something useful, this is what it
would do" case, and I don't see any real-world evidence that anybody
wants that. Now on to being defensive.

What we are defending against is a caller marking something as
to-be-textconv'd, even though it is !DIFF_FILE_VALID. But what did the
caller want? One sensible behavior is what I described above. Or maybe
they did just want the empty string. Or more likely, it is simply a bug
in the diff code. Since we haven't defined the semantics, I would much
rather loudly scream "BUG!" than paper over it by returning what we
guess they would have wanted (which may generate subtly wrong results).
And then we can think about that use case and decide what the semantics,
if any, should be.

So I stand by my thought that it should die(). But I don't think there
_are_ any such bugs currently, so it probably doesn't matter much either
way. I can live with "return 0", or even just leaving it alone.

-Peff

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] fill_textconv(): Don't get/put cache if sha1 is not valid
  2010-12-20  2:41           ` Junio C Hamano
@ 2010-12-20  4:46             ` Jeff King
  2010-12-20 19:28               ` Kirill Smelkov
  0 siblings, 1 reply; 13+ messages in thread
From: Jeff King @ 2010-12-20  4:46 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Kirill Smelkov, git, Axel Bonnet, Cl??ment Poulain, Diane Gasselin

On Sun, Dec 19, 2010 at 06:41:22PM -0800, Junio C Hamano wrote:

> > I'm not that familiar with git internals involved, so here is updated
> > patch with added paragraph about "df->sha1_valid=0 means files from
> > worktree with unknown sha1", and appropriate excerpt from Jeff's post.
> > That's the most reasonable I could come up with.
> [...]
> Here is how I would describe it.
> 
> commit 87bb04bb760659dd33d7a173333329cd900620a9
> Author: Kirill Smelkov <kirr@landau.phys.spbu.ru>
> Date:   Sat Dec 18 17:54:12 2010 +0300
> 
>     fill_textconv(): Don't get/put cache if sha1 is not valid
>     
>     When blaming files in the working tree, the filespec is marked with
>     !sha1_valid, as we have not given the contents an object name yet.  The
>     function to cache textconv results (keyed on the object name), however,
>     didn't check this condition, and ended up on storing the cached result
>     under a random object name.
>     
>     Signed-off-by: Kirill Smelkov <kirr@landau.phys.spbu.ru>

FWIW, I think that is a good description.

-Peff

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] fill_textconv(): Don't get/put cache if sha1 is not valid
  2010-12-20  4:42       ` Jeff King
@ 2010-12-20  8:42         ` Junio C Hamano
  0 siblings, 0 replies; 13+ messages in thread
From: Junio C Hamano @ 2010-12-20  8:42 UTC (permalink / raw)
  To: Jeff King
  Cc: Junio C Hamano, Kirill Smelkov, git, Axel Bonnet,
	Clément Poulain, Diane Gasselin

Jeff King <peff@peff.net> writes:

> On Sun, Dec 19, 2010 at 06:26:55PM -0800, Junio C Hamano wrote:
> ...
>> FILE_VALID() is about "does that side have a blob there, or is this
>> create/delete diff?", so the caller should be handling this properly as
>> you said, but your fill_textconv() already prepares for the case where the
>> caller for some reason calls this function with "no blob on this side" and
>> returns an empty string (see the precontext of your patch).
>> 
>> I think it is fine to be defensive to prepare for such a case, but then
>> dying like this patch does is inconsistent.  Perhaps we should move the
>> new check higher and remove the *outbuf = "" while at it?
>
> I'm not sure returning the empty string for a textconv is the right
> solution....
>
> So I stand by my thought that it should die(). But I don't think there
> _are_ any such bugs currently, so it probably doesn't matter much either
> way. I can live with "return 0", or even just leaving it alone.

I must have phrased it badly.  I am actually Ok either way (i.e. make this
function prepare for a future when we start passing the missing side to
the function, and have a special case for "if (!DIFF_FILE_VALID)" and
returning something like an empty string, or make this function refuse to
be fed the missing side by dying in "if (!DIFF_FILE_VALID)".  I was only
pointing out that the result of applying your patch does one in one case
and another in the other case, which is inconsistent.  And we do not know
in advance what is the reasonable fallback value for the missing side, so
we do not now "something like an empty string" is a reasonable thing to do
yet.  Hence "move the new check higher and remove ..." was my suggestion,
which would look like the attached, which would be consistent with your
message I am replying to.  IOW, I think we are on the same page.

 diff.c |   15 +++++++++++----
 1 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/diff.c b/diff.c
index 5422c43..a04ab2f 100644
--- a/diff.c
+++ b/diff.c
@@ -4401,11 +4401,18 @@ size_t fill_textconv(struct userdiff_driver *driver,
 {
 	size_t size;
 
+	/*
+	 * !DIFF_FILE_VALID(df) means this is a missing side of the
+	 * diff (preimage of creation, or postimage of deletion diff).
+	 * The caller should not try to textconv such a filespec, as
+	 * there is no such blob to begin with!
+	 */
+	
+	if (!DIFF_FILE_VALID(df))
+		die("Feeding missing side to fill_textconv?: '%s'",
+		    df->path);
+
 	if (!driver || !driver->textconv) {
-		if (!DIFF_FILE_VALID(df)) {
-			*outbuf = "";
-			return 0;
-		}
 		if (diff_populate_filespec(df, 0))
 			die("unable to read files to diff");
 		*outbuf = df->data;

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] fill_textconv(): Don't get/put cache if sha1 is not valid
  2010-12-20  4:46             ` Jeff King
@ 2010-12-20 19:28               ` Kirill Smelkov
  0 siblings, 0 replies; 13+ messages in thread
From: Kirill Smelkov @ 2010-12-20 19:28 UTC (permalink / raw)
  To: Jeff King
  Cc: Junio C Hamano, git, Axel Bonnet, Cl??ment Poulain, Diane Gasselin

On Sun, Dec 19, 2010 at 11:46:56PM -0500, Jeff King wrote:
> On Sun, Dec 19, 2010 at 06:41:22PM -0800, Junio C Hamano wrote:
> 
> > > I'm not that familiar with git internals involved, so here is updated
> > > patch with added paragraph about "df->sha1_valid=0 means files from
> > > worktree with unknown sha1", and appropriate excerpt from Jeff's post.
> > > That's the most reasonable I could come up with.
> > [...]
> > Here is how I would describe it.
> > 
> > commit 87bb04bb760659dd33d7a173333329cd900620a9
> > Author: Kirill Smelkov <kirr@landau.phys.spbu.ru>
> > Date:   Sat Dec 18 17:54:12 2010 +0300
> > 
> >     fill_textconv(): Don't get/put cache if sha1 is not valid
> >     
> >     When blaming files in the working tree, the filespec is marked with
> >     !sha1_valid, as we have not given the contents an object name yet.  The
> >     function to cache textconv results (keyed on the object name), however,
> >     didn't check this condition, and ended up on storing the cached result
> >     under a random object name.
> >     
> >     Signed-off-by: Kirill Smelkov <kirr@landau.phys.spbu.ru>
> 
> FWIW, I think that is a good description.

Junio, Jeff, thanks for re-wording it. Though I think my v2 text was
saying the same, only with more info + examples. My english is pretty
bad this days, so I kind of understand why it was tempting to be redone :)


Thanks anyway, and for picking this into next,
Kirill


P.S. somehow 'Acked-by: Jeff King <peff@peff.net>' was dropped.

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2010-12-20 19:28 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-12-18 14:54 [PATCH 1/2] t/t8006: Demonstrate blame is broken when cachetextconv is on Kirill Smelkov
2010-12-18 14:54 ` [PATCH 2/2] fill_textconv(): Don't get/put cache if sha1 is not valid Kirill Smelkov
2010-12-18 16:13   ` Jeff King
2010-12-18 20:55     ` Kirill Smelkov
2010-12-19  3:23       ` Junio C Hamano
2010-12-19 12:10         ` Kirill Smelkov
2010-12-20  2:41           ` Junio C Hamano
2010-12-20  4:46             ` Jeff King
2010-12-20 19:28               ` Kirill Smelkov
2010-12-20  2:26     ` Junio C Hamano
2010-12-20  4:42       ` Jeff King
2010-12-20  8:42         ` Junio C Hamano
2010-12-20  2:32   ` Junio C Hamano

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.