Re: weird diff output?

From: Stefan Beller <sbeller@google.com>
To: Jeff King <peff@peff.net>
Cc: Jacob Keller <jacob.keller@gmail.com>,
	Junio C Hamano <gitster@pobox.com>,
	Git mailing list <git@vger.kernel.org>,
	Jens Lehmann <Jens.Lehmann@web.de>
Subject: Re: weird diff output?
Date: Tue, 29 Mar 2016 23:05:09 -0700	[thread overview]
Message-ID: <CAGZ79kZgyFVzTRpP1k7kj8GVCPBukS_muhd=Xs4gxcA3maW0fw@mail.gmail.com> (raw)
In-Reply-To: <20160330045554.GA11007@sigill.intra.peff.net>

On Tue, Mar 29, 2016 at 9:55 PM, Jeff King <peff@peff.net> wrote:
> On Tue, Mar 29, 2016 at 04:05:57PM -0700, Jacob Keller wrote:
>
>> > This is what we want in both cases.
>> > And I would argue it would appease many other kinds of text as well, because
>> > an empty line is usually a strong indicator for any text that a
>> > different thing comes along.
>> > (Other programming languages, such as Java, C++ and any other C like
>> > language behaves
>> > that way; even when writing latex figures you'd rather want to break
>> > at new lines?)
>> >
>> > Thanks,
>> > Stefan
>>
>> This seems like a good heuristic. Can we think of any examples where
>> it would produce wildly confusing diffs? I don't think it necessarily
>> needs to be default but just a possible option when formatting diffs,
>> much like we already have today.
>
> One thing I like to do when playing with new diff ideas is to pipe all
> of "log -p" for a real project through it and see what differences it
> produces.
>
> Below is a perl script that implements Stefan's heuristic. I checked its
> output on git.git with:
>
>   git log --format='commit %H' -p >old
>   perl /path/to/script <old >new
>   diff -F ^commit -u old new | less

Wow, that's amazing! I'll toy around with it tomorrow. :)

>
> which shows the differences, with the commit id in the hunk header
> (which makes it easy to "git show $commit | perl /path/to/script" to
> see the new diff with more context.
>
> In addition to the cases discussed, it seems to improve C comments by
> turning:
>
>    /*
>   + * new function
>   + */
>   +void foo(void);
>   +
>   +/*
>     * old function
>     ...
>
> into:
>
>   +/*
>   + * my function
>   + */
>   +void foo(void);
>   +
>    /*
>     * old function
>     ...
>
> See 47fe3f6e for an example.
>
> It also seems to do OK with shell scripts. Commit e6bb5f78 is an example
> where it improves a here-doc, as in the motivating example from this
> thread. Similarly, the headers in 4df1e79 are much improved (though I'm
> confused why the final one in that diff doesn't seem to have been
> caught).
>
> I also ran into an interesting case in 86d26f24, where we have:
>
>   + test_expect_success '
>   +   foo
>   +
>   +'
>   +

That's an interesting case :)
I was trying to generalize my thoughts on it. (How is an empty line special?)

Instead of empty line we could go with the line with the least amount
of characters
in the lines which can be shifted up or down instead as well. Why so?

  The more characters are in a line, the more interesting the line is.
  (the more information is in there). Assuming one patch carries information
  that is highly relevant in itself, but may not be relevant to the surrounding
  (think adding a new function to a C file. The surrounding functions are not
  interesting for the diff, but rather you want to have all "relevant"
information
  bundled into that one diff reasonably.

  Going by the rule of splitting at the shortest line instead of just
at empty lines,
  is a generalization of

So instead of looking at

>   + test_expect_success '
>   +   foo
>   +
>   +'
>   +

we rather want to look at the string lengths of each line:

>   21
>   5
>   0
>   1
>   0

and then take the minimum (so instead of only acting on the 'last 0'
as shown by Jeff,
we'd go to the minimum of those numbers.)

Now on tie breaking (i.e two empty lines):

We need to understand the "pattern" of whether the lonely 1 char line
belongs above
or below the chunk. I do not think we can do that just from the diff alone.

* We either need to check the file ("Does the file start with the 0 1 0 pattern
or does it end with that?" That would be a strong hint on whether to
put the 1 line
above or below the chunk.) However a typical file has noise at the top
and bottom,
so this heuristic is not often applicable. With noise I mean license headers or
a java class or namespace ending with another brace or such. So probably
this second order heuristic on which of the empty lines to pick for
breaking needs more thoughts.

* Go through the history of the file and check for occurrences (how
was such a pattern
added in the past? Ideally we want to find the first time such a
pattern is added and
then decide based on that whether to break at the first or second empty line)

I guess both ways are expensive. Probably too expensive.

So for now we can just go with "take first or last empty line (shortest line) of
overlapping lines" and inspect that further.

>
> and there are _two_ blank lines to choose from. It looks really terrible
> if you use the first one, but the second one looks good (and the script
> below chooses the second, as it's closest to the hunk boundary). There
> may be cases where that's bad, though.
>
> This is just a proof of concept. I guess we'd want to somehow integrate
> the heuristic into git.
>
> -- >8 --
> #!/usr/bin/perl
>
> use strict;
> use warnings 'all';
>
> use constant {
>   STATE_NONE => 0,
>   STATE_LEADING_CONTEXT => 1,
>   STATE_IN_CHUNK => 2,
> };
> my $state = STATE_NONE;
> my @hunk;
> while(<>) {
>   if ($state == STATE_NONE) {
>     print;
>     if (/^@/) {
>       $state = STATE_LEADING_CONTEXT;
>     }
>   } else {
>     if (/^ /) {
>       flush_hunk() if $state != STATE_LEADING_CONTEXT;
>       push @hunk, $_;
>     } elsif(/^[-+]/) {
>       push @hunk, $_;
>       $state = STATE_IN_CHUNK;
>     } else {
>       flush_hunk();
>       $state = STATE_NONE;
>       print;
>     }
>   }
> }
> flush_hunk();
>
> sub flush_hunk {
>   my $context_len = 0;
>   while ($context_len < @hunk && $hunk[$context_len] =~ /^ /) {
>     $context_len++;
>   }
>
>   # Find the length of the ambiguous portion.
>   # Assumes our hunks have context first, and ambiguous additions at the end,
>   # which is how git generates them
>   my $ambig_len = 0;
>   while ($ambig_len < $context_len) {
>     my $i = $context_len - $ambig_len - 1;
>     my $j = @hunk - $ambig_len - 1;
>     if ($hunk[$j] =~ /^\+/ && substr($hunk[$i], 1) eq substr($hunk[$j], 1)) {
>       $ambig_len++;
>     } else {
>       last;
>     }
>   }
>
>   # Now look for an empty line in the ambiguous portion (we can just look in
>   # the context side, as it is equivalent to the addition side at the end).
>   # We count down, though, as we prefer to use the line closest to the
>   # hunk as the cutoff.
>   my $empty;
>   for (my $i = $context_len - 1; $i >= $context_len - $ambig_len; $i--) {
>     if (length($hunk[$i]) == 2) {
>       $empty = $i;
>       last;
>     }
>   }
>
>   if (defined $empty) {
>     # move empty lines after the chunk to be part of it
>     for (my $i = $empty + 1; $i < $context_len; $i++) {
>       $hunk[$i] =~ s/^ /+/;
>       $hunk[@hunk - $context_len + $i] =~ s/^\+/ /;
>     }
>   }
>
>   print @hunk;
>   @hunk = ();
> }