git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Junio C Hamano <gitster@pobox.com>
To: Taylor Blau <me@ttaylorr.com>
Cc: Thomas Guyot-Sionnest <tguyot@gmail.com>,
	git@vger.kernel.org, dermoth@aei.ca, peff@peff.net
Subject: Re: [PATCH v2] diff: Fix modified lines stats with --stat and --numstat
Date: Sun, 20 Sep 2020 12:11:20 -0700	[thread overview]
Message-ID: <xmqqlfh4gt5z.fsf@gitster.c.googlers.com> (raw)
In-Reply-To: <20200920153915.GB2726066@nand.local> (Taylor Blau's message of "Sun, 20 Sep 2020 11:39:15 -0400")

Taylor Blau <me@ttaylorr.com> writes:

> On Sun, Sep 20, 2020 at 09:09:46AM -0400, Thomas Guyot-Sionnest wrote:
>> In builtin_diffstat(), when both files are coming from "stdin" (which
>> could be better described as the file's content being written directly
>> into the file object), oideq() compares two null hashes and ignores the
>> actual differences for the statistics.
>>
>> This patch checks if is_stdin flag is set on both sides and compare
>> contents directly.
>>
>> Signed-off-by: Thomas Guyot-Sionnest <tguyot@gmail.com>
>> ---
>> Range-diff:
>> 1:  479c2835fc ! 1:  1f25713d44 diff: Fix modified lines stats with --stat and --numstat
>>     @@ -20,8 +20,12 @@
>>       	}
>>
>>      -	same_contents = oideq(&one->oid, &two->oid);
>>     ++	/* What is_stdin really means is that the file's content is only
>>     ++	 * in the filespec's buffer and its oid is zero. We can't compare
>>     ++	 * oid's if both are null and we can just diff the buffers */
>>      +	if (one->is_stdin && two->is_stdin)
>>     -+		same_contents = !strcmp(one->data, two->data);
>>     ++		same_contents = (one->size == two->size ?
>>     ++			!memcmp(one->data, two->data, one->size) : 0);
>>      +	else
>>      +		same_contents = oideq(&one->oid, &two->oid);
>
> After reading your explanation in [1], this version makes more sense to
> me.

These oid fields are prepared by calling diff_fill_oid_info(), and
even for paths that are dirty (hence no "free" oid available from
index or tree entry), an appropriate oid is computed.

But there are paths for which oid cannot be computed without
destroying their contents.  Such paths are marked by the function
with null_oid.

It happens to be that stdin is the only class of paths that are
treated that way _right_ _now_, but future code may support
different kind of paths that share the same trait.

When we want to know "is comparing the oid sufficient?", we
shouldn't inspect the is_stdin flag ourselves in a caller of
diff_fill_oid_info(), because the helper _is_ responsible for
knowing what kind of paths are special, and signals that "assume
this would not be equal to anything else" by giving null_oid back.

The caller should use the info left by diff_fill_oid_info(), namely,
"even if the oid on both sides are the same, if it is null_oid, then
we know diff_fill_oid_info() didn't actually compute the oid, and we
need to compare the blob ourselves".

And there is no point in doing memcmp() here, I think.  

The same_contents() check is done as an optimization to avoid xdl.
Even if the two sides were thought to be different at the oid level,
xdl comparison may find that there is no difference after all
(e.g. think of whitespace ignoring comparison), so we should assume
and rely on that the downstream code MUST BE prepared to handle
false negatives (i.e. same_contents says they are different, but
they actually produce no diffstat).  Running memcmp() over contents
in potentially a large buffer to find that they are different, and
then have xdl process that large buffer again, would be a waste.

Summarizing the above, I think the second best fix is this (which
means that the posted patch is the third best):

	/*
	 * diff_fill_oid_info() marked one/two->oid with null_oid
	 * for a path whose oid is not available.  Disable early
	 * return optimization for them.
	 */
	if (oideq(&one->oid, &null_oid) || oideq(&two->oid, &null_oid))
		same_contents = 0; /* could be different */
	else if (oideq(&one->oid, &two->oid))
		same_contents = 1; /* definitely the same */
	else
		same_contents = 0; /* definitely different */

But I suspect that the best fix is to teach diff_fill_oid_info() to
hash the in-memory data to compute the oid, instead of punting and
filling the oid field with null_oid.  If function builtin_diffstat()
is allowed to look at the contents and run memcmp() here, the 'data'
field should have been filled and valid when diff_fill_oid_info()
looked at it already.

The "best" fix will have wider consequences, so we may not want to
jump to it right away without careful consideration.

For example, the "best" fix will fix another bug.  The 'index'
header shows a NULL object name in normal "diff --patch" output for
these paths in the current code, which means they cannot be used
with "apply --3way".  That way, this codepath does not have to know
anything about the "null means we don't know" convention.  

Try:

    $ (cat COPYING; echo) >RENAMING
    $ git diff --no-index COPYING - <RENAMING | grep '^index '
    index 536e55524d..0000000000 100644

and notice that the stdin side has a null object name in the current
code.  I think we will show the right object name if we fix the
diff_fill_oid_info().

Thanks.



  parent reply	other threads:[~2020-09-20 19:11 UTC|newest]

Thread overview: 50+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-09-18 11:32 Allow passing pipes to diff --no-index + bugfix Thomas Guyot-Sionnest
2020-09-18 11:32 ` [PATCH 1/2] diff: Fix modified lines stats with --stat and --numstat Thomas Guyot-Sionnest
2020-09-18 14:46   ` Taylor Blau
2020-09-18 15:10     ` Thomas Guyot-Sionnest
2020-09-18 17:37       ` Jeff King
2020-09-18 18:00         ` Thomas Guyot-Sionnest
2020-09-20  4:53       ` Thomas Guyot
2020-09-18 17:27   ` Jeff King
2020-09-18 17:52     ` Thomas Guyot-Sionnest
2020-09-18 18:06       ` Junio C Hamano
2020-09-23 19:16         ` Johannes Schindelin
2020-09-23 19:23           ` Junio C Hamano
2020-09-23 20:44             ` Johannes Schindelin
2020-09-24  4:49               ` Thomas Guyot
2020-09-24  5:24                 ` [PATCH v3] " Thomas Guyot-Sionnest
2020-09-24  7:41                   ` [PATCH v4] " Thomas Guyot-Sionnest
2020-09-24  6:40                 ` [PATCH 1/2] " Junio C Hamano
2020-09-24  7:13                   ` Thomas Guyot
2020-09-24 17:19                     ` Junio C Hamano
2020-09-24 17:38                       ` Junio C Hamano
2020-09-23 15:05     ` Johannes Schindelin
2020-09-20 13:09   ` [PATCH v2] " Thomas Guyot-Sionnest
2020-09-20 15:39     ` Taylor Blau
2020-09-20 16:38       ` Thomas Guyot
2020-09-20 19:11       ` Junio C Hamano [this message]
2020-09-20 20:08         ` Junio C Hamano
2020-09-20 20:36         ` Junio C Hamano
2020-09-20 22:15           ` Junio C Hamano
2020-09-21 19:26         ` Jeff King
2020-09-21 21:51           ` Junio C Hamano
2020-09-21 22:20             ` Jeff King
2020-09-21 22:37               ` Junio C Hamano
2020-09-18 11:32 ` [PATCH 2/2] Allow passing pipes for input pipes to diff --no-index Thomas Guyot-Sionnest
2020-09-18 14:36   ` Taylor Blau
2020-09-18 16:34     ` Thomas Guyot-Sionnest
2020-09-18 17:19       ` Jeff King
2020-09-18 17:21         ` Jeff King
2020-09-18 17:39         ` Thomas Guyot-Sionnest
2020-09-18 17:48         ` Junio C Hamano
2020-09-18 18:02           ` Jeff King
2020-09-20 12:54             ` Thomas Guyot
2020-09-21 19:31               ` Jeff King
2020-09-21 20:14                 ` Junio C Hamano
2020-09-18 17:58       ` Taylor Blau
2020-09-18 18:05         ` Jeff King
2020-09-18 17:20     ` Jeff King
2020-09-18 18:00       ` Taylor Blau
2020-09-18 21:56   ` brian m. carlson
2020-09-18 17:51 ` Allow passing pipes to diff --no-index + bugfix Junio C Hamano
2020-09-18 18:24   ` Thomas Guyot-Sionnest

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=xmqqlfh4gt5z.fsf@gitster.c.googlers.com \
    --to=gitster@pobox.com \
    --cc=dermoth@aei.ca \
    --cc=git@vger.kernel.org \
    --cc=me@ttaylorr.com \
    --cc=peff@peff.net \
    --cc=tguyot@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).