git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: Elijah Newren <newren@gmail.com>
Cc: John Cai <johncai86@gmail.com>,
	phillip.wood@dunelm.org.uk,
	John Cai via GitGitGadget <gitgitgadget@gmail.com>,
	git@vger.kernel.org
Subject: "bad" diffs (was: [PATCH 2/2] diff: teach diff to read gitattribute diff-algorithm)
Date: Thu, 09 Feb 2023 11:31:03 +0100	[thread overview]
Message-ID: <230209.868rh7xg14.gmgdl@evledraar.gmail.com> (raw)
In-Reply-To: <CABPp-BGhuTyq_hrpMc+Ky3yt1UgO7DcAsgcYH15FK--QLdCsQw@mail.gmail.com>


On Thu, Feb 09 2023, Elijah Newren wrote:

> [...] I
> think "myers" is default for historical reasons and histogram is
> better not just for special Salesforce xml files, but code files too.
> The output makes more sense to users.  So much so that even though my
> simple testing suggested it had a 2% performance penalty compared to
> myers, I forced ort to use it[1] even though I designed  everything
> else in that algorithm around eking out maximum performance.  Others
> who have tested the diff algorithms have also found histogram has very
> similar performance to myers, and oftentimes even beats it[2][3].
> Also, worries about invalidating rerere caches[4] was real, but we
> already paid that price when we switched to ort.

FWIW as someone who went through that one-time pain for my many git.git
topics it wasn't even a price, it was paying me!

Maybe it's just confirmation bias, or looking at the conflicts with
fresh eyes, but I found I mis-solved some of them seemingly because the
output from the old conflicts was so confusing, but much better with
"ort".

> And if performance
> is still a worry, [3] gives me reason to believe we can make our
> histogram implementation faster.  Finally, for the period of time when
> Palantir was letting me make an internal git distribution (mostly for
> testing ort), I also carried a patch that changed the default diff
> algorithm to histogram (not just for ort, but for diff/log/etc. as
> well).  Never had any complaints from the users from it.  Perhaps you
> could do the same in your local version of git used by gitaly?

I think that might be worth considering for GitLab, John? Although the
bias has definitely been to go with vanilla git semantics, but it sounds
like you might be in favor of endorsing a patch to change the default,
so if that happens to solve the problem...

> I also think the user request got converted to a particular
> solution without looking at the wider problem space:  The idea seemed
> to assume "myers" is default for a good reason, and thus asked for an
> option to use something else.  I'm not sure the assumption is valid; 

Just on the "wider problem", I've also looked at a lot of "bad diffs"
and there's interesting cases where histogram does equally bad as the
others *from most user's POV*, but from a "let's make a small diff" for
a computer it's perfect.

I've seen this most commonly with repetative file formats, e.g JSON-like
ones are a good example. You've probably looked at this exact thing N
times, but in case it's useful consider:
	
	$ cat a
	{
	        thing => 'foo',
	        a => 'b',
	},
	{
	        c => 'd',
	        thing => 'bar',
	}
	$ cat b
	{
	        thing => 'foo',
	        a => 'b',
	},
	{
	        c => 'd',
	},
	{
	        e => 'f',
	        thing => 'bar',
	}

Here all of our diff algorithms generate this equally terrible diff, or
a great diff, depending on your POV :):
	
	diff --git a/a b/b
	index 186017c1f38..5afadde1800 100644
	--- a/a
	+++ b/b
	@@ -4,5 +4,8 @@
	 },
	 {
	        c => 'd',
	+},
	+{
	+       e => 'f',
	        thing => 'bar',
	 }

The problem here is that from the user's POV they didn't add a closing
bracket, comma, open bracket etc. They added a whole new copy/pasted
block, and then modified a key-value in the subsequent one.

But the diff engine doesn't know about any of that, and will "helpfully"
proceed to "steal" parts of the previous block.

All of the cases where users have asked me (in person) about some bad
diffs have pretty much come down to this sort of thing.

In those cases one of the algorithms sometimes *happened* to find a
"better" diff, but in my experience it's been a
wrong-clock-is-right-twice-a-day sort of thing.

I've wondered if we couldn't have some much more stupid but effective
solution to these common cases. All of the ones I remember could
basically be expressed as a hypothetical:

	[diff "c"]
        balanceBrackets = true

Where we'd try as hard as we could not to produce diffs that had
un-balanced brackets. I.e. in this case (I manually produced this by
converting the brackets[1] to []'s, then changing them back:
	
	diff --git a/a b/b
	index 186017c1f38..05cdd03bfa4 100644
	--- a/a
	+++ b/b
	@@ -2,7 +2,10 @@
	        thing => 'foo',
	        a => 'b',
	 },
	+{
	+       x => 'y',
	+},
	 {
	-       c => 'd',
	+       c => 'f',
	        thing => 'bar',
	 }

That's a much worse diff to a computer (now 11 lines, v.s. 8 lines
before), but I'd think to most users that's *much* more understandable,
just by knowing just a bit about the language (although I'd argue we
could go as far as assuming this in general, with how common balanced
brackets[1] are across languages).

P.S.: Funny story: at <pastjob> I once helped a user who'd been
      struggling for hours to turn their already working change into
      something that made more sense with "git diff".

      We eventually managed to come up with something that looked
      "right", I can't remember how, probably some mixture of -U<n>,
      diff algorithm etc.

      Their next question was "Ok, so how do I commit this?", referring
      to "this particular version of the diff".

      Which, having already spent more time than I'd like to admit in
      trying to "help" them was a good reminder to first ask what
      problem we're trying to solve :)

1. By "balanced bracket" I'm referring not just to "{}", but what's
   considered a "mirrored" character in Unicode. I.e. not just ()[]{}<>
   etc., but also ∈∋ and the like (see
   e.g. https://www.compart.com/en/unicode/U+2208)

   For better or worse the Perl 6 language has this as part of its
   grammar, see e.g.:
   https://andrewshitov.com/2018/01/23/embedded-comment-delimiters-in-perl-6/

  reply	other threads:[~2023-02-09 10:58 UTC|newest]

Thread overview: 78+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-02-05  3:46 [PATCH 0/2] Teach diff to honor diff algorithms set through git attributes John Cai via GitGitGadget
2023-02-05  3:46 ` [PATCH 1/2] diff: consolidate diff algorithm option parsing John Cai via GitGitGadget
2023-02-06 16:20   ` Phillip Wood
2023-02-05  3:46 ` [PATCH 2/2] diff: teach diff to read gitattribute diff-algorithm John Cai via GitGitGadget
2023-02-05 17:50   ` Eric Sunshine
2023-02-06 13:10     ` John Cai
2023-02-06 16:27   ` Phillip Wood
2023-02-06 18:14     ` Eric Sunshine
2023-02-06 19:50     ` John Cai
2023-02-09  8:26       ` Elijah Newren
2023-02-09 10:31         ` Ævar Arnfjörð Bjarmason [this message]
2023-02-09 16:37         ` John Cai
2023-02-06 16:39   ` Ævar Arnfjörð Bjarmason
2023-02-06 20:37     ` John Cai
2023-02-07 14:55       ` Phillip Wood
2023-02-07 17:00         ` John Cai
2023-02-09  9:09           ` Elijah Newren
2023-02-09 14:44             ` Phillip Wood
2023-02-10  9:57               ` Elijah Newren
2023-02-11 17:39                 ` Phillip Wood
2023-02-11  1:59               ` Jeff King
2023-02-15  2:35                 ` Elijah Newren
2023-02-15  4:21                   ` Jeff King
2023-02-15  5:20                     ` Junio C Hamano
2023-02-15 14:44                 ` Phillip Wood
2023-02-15 15:00                   ` Jeff King
2023-02-07 17:27         ` Ævar Arnfjörð Bjarmason
2023-02-15 14:47           ` Phillip Wood
2023-02-09  8:44       ` Elijah Newren
2023-02-14 21:16         ` John Cai
2023-02-15  3:41           ` Elijah Newren
2023-02-09  7:50     ` Elijah Newren
2023-02-09  9:41       ` Ævar Arnfjörð Bjarmason
2023-02-11  2:04         ` Jeff King
2023-02-07 17:56   ` Jeff King
2023-02-07 20:18     ` Ævar Arnfjörð Bjarmason
2023-02-07 20:47       ` Junio C Hamano
2023-02-07 21:05         ` Ævar Arnfjörð Bjarmason
2023-02-07 21:28           ` Junio C Hamano
2023-02-07 21:44             ` Ævar Arnfjörð Bjarmason
2023-02-09 16:34     ` John Cai
2023-02-11  1:39       ` Jeff King
2023-02-14 21:40 ` [PATCH v2 0/2] Teach diff to honor diff algorithms set through git attributes John Cai via GitGitGadget
2023-02-14 21:40   ` [PATCH v2 1/2] diff: consolidate diff algorithm option parsing John Cai via GitGitGadget
2023-02-15  2:38     ` Junio C Hamano
2023-02-15 23:34       ` John Cai
2023-02-15 23:42         ` Junio C Hamano
2023-02-16  2:14           ` Jeff King
2023-02-16  2:57             ` Junio C Hamano
2023-02-16 20:34               ` John Cai
2023-02-14 21:40   ` [PATCH v2 2/2] diff: teach diff to read gitattribute diff-algorithm John Cai via GitGitGadget
2023-02-15  2:56     ` Junio C Hamano
2023-02-15  3:20       ` Junio C Hamano
2023-02-16 20:37         ` John Cai
2023-02-17 20:21   ` [PATCH v3 0/2] Teach diff to honor diff algorithms set through git attributes John Cai via GitGitGadget
2023-02-17 20:21     ` [PATCH v3 1/2] diff: consolidate diff algorithm option parsing John Cai via GitGitGadget
2023-02-17 21:27       ` Junio C Hamano
2023-02-18  1:36       ` Elijah Newren
2023-02-17 20:21     ` [PATCH v3 2/2] diff: teach diff to read algorithm from diff driver John Cai via GitGitGadget
2023-02-17 21:50       ` Junio C Hamano
2023-02-18  2:56       ` Elijah Newren
2023-02-20 15:32         ` John Cai
2023-02-20 16:21           ` Elijah Newren
2023-02-20 16:49             ` John Cai
2023-02-20 17:32               ` Elijah Newren
2023-02-20 20:53                 ` John Cai
2023-02-22 19:47                 ` Jeff King
2023-02-24 17:44                   ` John Cai
2023-02-18  1:16     ` [PATCH v3 0/2] Teach diff to honor diff algorithms set through git attributes Elijah Newren
2023-02-20 13:37       ` John Cai
2023-02-20 21:04     ` [PATCH v4 " John Cai via GitGitGadget
2023-02-20 21:04       ` [PATCH v4 1/2] diff: consolidate diff algorithm option parsing John Cai via GitGitGadget
2023-02-20 21:04       ` [PATCH v4 2/2] diff: teach diff to read algorithm from diff driver John Cai via GitGitGadget
2023-02-21 17:34       ` [PATCH v4 0/2] Teach diff to honor diff algorithms set through git attributes Junio C Hamano
2023-02-21 18:05         ` Elijah Newren
2023-02-21 18:51           ` Junio C Hamano
2023-02-21 19:36             ` John Cai
2023-02-21 20:16               ` Elijah Newren

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=230209.868rh7xg14.gmgdl@evledraar.gmail.com \
    --to=avarab@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=johncai86@gmail.com \
    --cc=newren@gmail.com \
    --cc=phillip.wood@dunelm.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).