git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Elijah Newren <newren@gmail.com>
To: Jeff King <peff@peff.net>
Cc: phillip.wood@dunelm.org.uk, "John Cai" <johncai86@gmail.com>,
	"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
	"John Cai via GitGitGadget" <gitgitgadget@gmail.com>,
	git@vger.kernel.org, "Patrick Steinhardt" <ps@pks.im>
Subject: Re: [PATCH 2/2] diff: teach diff to read gitattribute diff-algorithm
Date: Tue, 14 Feb 2023 18:35:00 -0800	[thread overview]
Message-ID: <CABPp-BFnCzWH6Aai0ZYv1fR7GMfXqiAE3n8q1Gcrhh-Zv_wTjA@mail.gmail.com> (raw)
In-Reply-To: <Y+b2l4Le2gTxGwO8@coredump.intra.peff.net>

On Fri, Feb 10, 2023 at 5:59 PM Jeff King <peff@peff.net> wrote:
>
> On Thu, Feb 09, 2023 at 02:44:15PM +0000, Phillip Wood wrote:
>
> > To see the differences between the output of patience and histogram
> > algorithms I diffed the output of "git log -p --no-merges
> > --diff-algorithm=patience" and "git log -p --no-merges
> > --diff-algorithm=histogram". The first three differences are
> >
> > - 6c065f72b8 (http: support CURLOPT_PROTOCOLS_STR, 2023-01-16)
> >   In get_curl_allowed_protocols() the patience algorithm shows the
> >   change in the return statement more clearly
> >
> > - 47cfc9bd7d (attr: add flag `--source` to work with tree-ish, 2023-01-14)
> >    The histogram algorithm shows read_attr_from_index() being moved
> >    whereas the patience algorithm does not making the diff easier to
> >    follow.
> >
> > - b0226007f0 (fsmonitor: eliminate call to deprecated FSEventStream
> > function, 2022-12-14)
> >   In fsm_listen__stop_async() the histogram algorithm shows
> >   data->shutdown_style = SHUTDOWN_EVENT;
> >   being moved, which is not as clear as the patience output which
> >   shows it as a context line.
>
> Just a small counter-point, since I happened to be looking at myers vs
> patience for something elsewhere in the thread, but:
>
>   git show 35bd13fcd2caa4185bf3729655ca20b6a5fe9b6f builtin/add.c

"fatal: bad object 35bd13fcd2caa4185bf3729655ca20b6a5fe9b6f"

Is that a local commit of yours?

> looks slightly better to me with myers, even though it is 2 lines
> longer. The issue is that patience and histogram are very eager to use
> blank lines as anchor points, so a diff like:
>
>   -some words
>   -
>   -and some more
>   +unrelated content
>   +
>   +but it happens to also be two paragraphs
>
> in myers becomes:
>
>   -some words
>   +unrelated content
>
>   -and some more
>   +but it happens to also be two paragraphs
>
> in patience (here I'm using single lines, but in practice these may be
> paragraphs, or stanzas of code). I think that's also the _strength_ of
> patience in many cases, but it really depends on the content. Replacing
> a multi-stanza block with another one may be the best explanation for
> what happened. Or the two stanzas may be independent, and showing the
> change for each one may be better.
>
> I'm not sure which one happens more often. And you'd probably want to
> weight it by how good/bad the change is. In the example I showed I don't
> find patience very much worse, since it's already a pretty ugly diff.
> But in cases where patience shines, it may be making things
> significantly more readable.
>
> I don't have a super strong opinion, but I just wanted to chime in that
> it is not clear to me that patience/histogram is always a win over myers
> (yes, I know your examples were comparing patience vs histogram, but the
> larger thread is discussing the other).

Oh, I agree histogram is not always a win over myers.  I just feel it
is the majority of the time.  But if you want more than "feels",
here's some solid data to back that up...

I found a study on the subject over at
https://link.springer.com/article/10.1007/s10664-019-09772-z.  They
were particularly interested in whether other academic studies could
have been affected by git's different diff algorithms, and came away
with the answer that it did.  They looked at a few hundred thousand
commits across two dozen different repositories and found (note that
they only looked at myers and histogram, ignoring patience and
minimal):

   * 92.4% - 98.6% of the diffs (depending on repo) are identical
whether you use myers or histogram
   * 93.8% - 99.2% of the diffs (depending on repo) have the same
number of added/deleted lines with myers and histogram
   * Of the >20k diffs that were not the identical, they selected a
random sample of 377 diffs (taking care to make sure they were
statistically representative)
   * They divided the 377 diffs into "code" and "non-code" diffs, i.e.
those modifying source code and those modifying other textual files
   * They had two people annotating the diffs and independently
scoring them, and then checked for agreement between their answers
afterwards.  (No, they didn't always agree, but they did have
substantial agreement.)

For the (again, non-identical) diffs modifying non-code, they found
(see table 11) that:
   * 14.9% of the myers diffs are better
   * 13.4% of the histogram diffs are better
   * 71.6% of the diffs have equal quality

For the (non-identical) diffs modifying code, they found (again, see
table 11) that:
   * 16.9% of the myers diffs are better
   * 62.6% of the histogram diffs are better
   * 20.6% of the diffs have equal quality

A ratio of 4 to 1 for histogram being better on code diffs is pretty
weighty to me.

It's possible these results would have been even better were it not
for a couple of bugs in the histogram code (ported from the original
in jgit).  Phillip pointed me to a problematic testcase that Stefan
Beller found, and in attempting to fix it (I'm on fix #4 or so), I
believe I found another issue.  However, I don't want to go into too
much detail yet, as I found problems with some of my previous fixes
and already invalidated things I told Phillip just last week.

  reply	other threads:[~2023-02-15  2:36 UTC|newest]

Thread overview: 78+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-02-05  3:46 [PATCH 0/2] Teach diff to honor diff algorithms set through git attributes John Cai via GitGitGadget
2023-02-05  3:46 ` [PATCH 1/2] diff: consolidate diff algorithm option parsing John Cai via GitGitGadget
2023-02-06 16:20   ` Phillip Wood
2023-02-05  3:46 ` [PATCH 2/2] diff: teach diff to read gitattribute diff-algorithm John Cai via GitGitGadget
2023-02-05 17:50   ` Eric Sunshine
2023-02-06 13:10     ` John Cai
2023-02-06 16:27   ` Phillip Wood
2023-02-06 18:14     ` Eric Sunshine
2023-02-06 19:50     ` John Cai
2023-02-09  8:26       ` Elijah Newren
2023-02-09 10:31         ` "bad" diffs (was: [PATCH 2/2] diff: teach diff to read gitattribute diff-algorithm) Ævar Arnfjörð Bjarmason
2023-02-09 16:37         ` [PATCH 2/2] diff: teach diff to read gitattribute diff-algorithm John Cai
2023-02-06 16:39   ` Ævar Arnfjörð Bjarmason
2023-02-06 20:37     ` John Cai
2023-02-07 14:55       ` Phillip Wood
2023-02-07 17:00         ` John Cai
2023-02-09  9:09           ` Elijah Newren
2023-02-09 14:44             ` Phillip Wood
2023-02-10  9:57               ` Elijah Newren
2023-02-11 17:39                 ` Phillip Wood
2023-02-11  1:59               ` Jeff King
2023-02-15  2:35                 ` Elijah Newren [this message]
2023-02-15  4:21                   ` Jeff King
2023-02-15  5:20                     ` Junio C Hamano
2023-02-15 14:44                 ` Phillip Wood
2023-02-15 15:00                   ` Jeff King
2023-02-07 17:27         ` Ævar Arnfjörð Bjarmason
2023-02-15 14:47           ` Phillip Wood
2023-02-09  8:44       ` Elijah Newren
2023-02-14 21:16         ` John Cai
2023-02-15  3:41           ` Elijah Newren
2023-02-09  7:50     ` Elijah Newren
2023-02-09  9:41       ` Ævar Arnfjörð Bjarmason
2023-02-11  2:04         ` Jeff King
2023-02-07 17:56   ` Jeff King
2023-02-07 20:18     ` Ævar Arnfjörð Bjarmason
2023-02-07 20:47       ` Junio C Hamano
2023-02-07 21:05         ` Ævar Arnfjörð Bjarmason
2023-02-07 21:28           ` Junio C Hamano
2023-02-07 21:44             ` Ævar Arnfjörð Bjarmason
2023-02-09 16:34     ` John Cai
2023-02-11  1:39       ` Jeff King
2023-02-14 21:40 ` [PATCH v2 0/2] Teach diff to honor diff algorithms set through git attributes John Cai via GitGitGadget
2023-02-14 21:40   ` [PATCH v2 1/2] diff: consolidate diff algorithm option parsing John Cai via GitGitGadget
2023-02-15  2:38     ` Junio C Hamano
2023-02-15 23:34       ` John Cai
2023-02-15 23:42         ` Junio C Hamano
2023-02-16  2:14           ` Jeff King
2023-02-16  2:57             ` Junio C Hamano
2023-02-16 20:34               ` John Cai
2023-02-14 21:40   ` [PATCH v2 2/2] diff: teach diff to read gitattribute diff-algorithm John Cai via GitGitGadget
2023-02-15  2:56     ` Junio C Hamano
2023-02-15  3:20       ` Junio C Hamano
2023-02-16 20:37         ` John Cai
2023-02-17 20:21   ` [PATCH v3 0/2] Teach diff to honor diff algorithms set through git attributes John Cai via GitGitGadget
2023-02-17 20:21     ` [PATCH v3 1/2] diff: consolidate diff algorithm option parsing John Cai via GitGitGadget
2023-02-17 21:27       ` Junio C Hamano
2023-02-18  1:36       ` Elijah Newren
2023-02-17 20:21     ` [PATCH v3 2/2] diff: teach diff to read algorithm from diff driver John Cai via GitGitGadget
2023-02-17 21:50       ` Junio C Hamano
2023-02-18  2:56       ` Elijah Newren
2023-02-20 15:32         ` John Cai
2023-02-20 16:21           ` Elijah Newren
2023-02-20 16:49             ` John Cai
2023-02-20 17:32               ` Elijah Newren
2023-02-20 20:53                 ` John Cai
2023-02-22 19:47                 ` Jeff King
2023-02-24 17:44                   ` John Cai
2023-02-18  1:16     ` [PATCH v3 0/2] Teach diff to honor diff algorithms set through git attributes Elijah Newren
2023-02-20 13:37       ` John Cai
2023-02-20 21:04     ` [PATCH v4 " John Cai via GitGitGadget
2023-02-20 21:04       ` [PATCH v4 1/2] diff: consolidate diff algorithm option parsing John Cai via GitGitGadget
2023-02-20 21:04       ` [PATCH v4 2/2] diff: teach diff to read algorithm from diff driver John Cai via GitGitGadget
2023-02-21 17:34       ` [PATCH v4 0/2] Teach diff to honor diff algorithms set through git attributes Junio C Hamano
2023-02-21 18:05         ` Elijah Newren
2023-02-21 18:51           ` Junio C Hamano
2023-02-21 19:36             ` John Cai
2023-02-21 20:16               ` Elijah Newren

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CABPp-BFnCzWH6Aai0ZYv1fR7GMfXqiAE3n8q1Gcrhh-Zv_wTjA@mail.gmail.com \
    --to=newren@gmail.com \
    --cc=avarab@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=johncai86@gmail.com \
    --cc=peff@peff.net \
    --cc=phillip.wood@dunelm.org.uk \
    --cc=ps@pks.im \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).