[PATCH] Improve contrib/diff-highlight to highlight unevenly-sized hunks

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH] Improve contrib/diff-highlight to highlight unevenly-sized hunks
@ 2015-06-15 17:20 Patrick Palka
  2015-06-18 15:50 ` Junio C Hamano
  0 siblings, 1 reply; 20+ messages in thread
From: Patrick Palka @ 2015-06-15 17:20 UTC (permalink / raw)
  To: git; +Cc: Patrick Palka

Currently the diff-highlight script does not try to highlight hunks that
have different numbers of removed/added lines.  But we can be a little
smarter than that, without introducing much magic and complexity.

In the case of unevenly-sized hunks, we could still highlight the first
few (lexicographical) add/remove pairs.  It is not uncommon for hunks to
have common "prefixes", and in such a case this change is very useful
for spotting differences.

Signed-off-by: Patrick Palka <patrick@parcs.ath.cx>
---
 contrib/diff-highlight/diff-highlight | 26 +++++++++++++++++---------
 1 file changed, 17 insertions(+), 9 deletions(-)

diff --git a/contrib/diff-highlight/diff-highlight b/contrib/diff-highlight/diff-highlight
index ffefc31..0dfbebd 100755
--- a/contrib/diff-highlight/diff-highlight
+++ b/contrib/diff-highlight/diff-highlight
@@ -88,22 +88,30 @@ sub show_hunk {
 		return;
 	}
 
-	# If we have mismatched numbers of lines on each side, we could try to
-	# be clever and match up similar lines. But for now we are simple and
-	# stupid, and only handle multi-line hunks that remove and add the same
-	# number of lines.
-	if (@$a != @$b) {
-		print @$a, @$b;
-		return;
-	}
+	# We match up the first MIN(a, b) lines on each side.
+	my $c = @$a < @$b ? @$a : @$b;
 
+	# Highlight each pair, and print each removed line of that pair.
 	my @queue;
-	for (my $i = 0; $i < @$a; $i++) {
+	for (my $i = 0; $i < $c; $i++) {
 		my ($rm, $add) = highlight_pair($a->[$i], $b->[$i]);
 		print $rm;
 		push @queue, $add;
 	}
+
+	# Print the remaining unmatched removed lines of the hunk.
+	for (my $i = $c; $i < @$a; $i++) {
+		print $a->[$i];
+	}
+
+	# Print the added lines of each highlighted pair.
 	print @queue;
+
+	# Print the remaining unmatched added lines of the hunk.
+	for (my $i = $c; $i < @$b; $i++) {
+		print $b->[$i];
+	}
+
 }
 
 sub highlight_pair {
-- 
2.4.3.413.ga5fe668

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH] Improve contrib/diff-highlight to highlight unevenly-sized hunks
  2015-06-15 17:20 [PATCH] Improve contrib/diff-highlight to highlight unevenly-sized hunks Patrick Palka
@ 2015-06-18 15:50 ` Junio C Hamano
  2015-06-18 16:28   ` Patrick Palka
  0 siblings, 1 reply; 20+ messages in thread
From: Junio C Hamano @ 2015-06-18 15:50 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Patrick Palka

Patrick Palka <patrick@parcs.ath.cx> writes:

> Currently the diff-highlight script does not try to highlight hunks that
> have different numbers of removed/added lines.  But we can be a little
> smarter than that, without introducing much magic and complexity.
>
> In the case of unevenly-sized hunks, we could still highlight the first
> few (lexicographical) add/remove pairs.  It is not uncommon for hunks to
> have common "prefixes", and in such a case this change is very useful
> for spotting differences.
>
> Signed-off-by: Patrick Palka <patrick@parcs.ath.cx>
> ---

Patrick, "git shortlog --no-merges contrib/diff-highlight/" is your
friend to see who may be able to give you a good feedback.

Jeff, what do you think?

I have this nagging feeling that it is just as likely that two
uneven hunks align at the top as they align at the bottom, so while
this might not hurt it may not be the right approach for a better
solution, in the sense that when somebody really wants to do a
better solution, this change and the original code may need to be
ripped out and redone from scratch.

>  contrib/diff-highlight/diff-highlight | 26 +++++++++++++++++---------
>  1 file changed, 17 insertions(+), 9 deletions(-)
>
> diff --git a/contrib/diff-highlight/diff-highlight b/contrib/diff-highlight/diff-highlight
> index ffefc31..0dfbebd 100755
> --- a/contrib/diff-highlight/diff-highlight
> +++ b/contrib/diff-highlight/diff-highlight
> @@ -88,22 +88,30 @@ sub show_hunk {
>  		return;
>  	}
>  
> -	# If we have mismatched numbers of lines on each side, we could try to
> -	# be clever and match up similar lines. But for now we are simple and
> -	# stupid, and only handle multi-line hunks that remove and add the same
> -	# number of lines.
> -	if (@$a != @$b) {
> -		print @$a, @$b;
> -		return;
> -	}
> +	# We match up the first MIN(a, b) lines on each side.
> +	my $c = @$a < @$b ? @$a : @$b;
>  
> +	# Highlight each pair, and print each removed line of that pair.
>  	my @queue;
> -	for (my $i = 0; $i < @$a; $i++) {
> +	for (my $i = 0; $i < $c; $i++) {
>  		my ($rm, $add) = highlight_pair($a->[$i], $b->[$i]);
>  		print $rm;
>  		push @queue, $add;
>  	}
> +
> +	# Print the remaining unmatched removed lines of the hunk.
> +	for (my $i = $c; $i < @$a; $i++) {
> +		print $a->[$i];
> +	}
> +
> +	# Print the added lines of each highlighted pair.
>  	print @queue;
> +
> +	# Print the remaining unmatched added lines of the hunk.
> +	for (my $i = $c; $i < @$b; $i++) {
> +		print $b->[$i];
> +	}
> +
>  }
>  
>  sub highlight_pair {

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] Improve contrib/diff-highlight to highlight unevenly-sized hunks
  2015-06-18 15:50 ` Junio C Hamano
@ 2015-06-18 16:28   ` Patrick Palka
  2015-06-18 18:08     ` Junio C Hamano
  2015-06-18 19:08     ` Jeff King
  0 siblings, 2 replies; 20+ messages in thread
From: Patrick Palka @ 2015-06-18 16:28 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, git

On Thu, Jun 18, 2015 at 11:50 AM, Junio C Hamano <gitster@pobox.com> wrote:
> Patrick Palka <patrick@parcs.ath.cx> writes:
>
>> Currently the diff-highlight script does not try to highlight hunks that
>> have different numbers of removed/added lines.  But we can be a little
>> smarter than that, without introducing much magic and complexity.
>>
>> In the case of unevenly-sized hunks, we could still highlight the first
>> few (lexicographical) add/remove pairs.  It is not uncommon for hunks to
>> have common "prefixes", and in such a case this change is very useful
>> for spotting differences.
>>
>> Signed-off-by: Patrick Palka <patrick@parcs.ath.cx>
>> ---
>
> Patrick, "git shortlog --no-merges contrib/diff-highlight/" is your
> friend to see who may be able to give you a good feedback.

Sorry about that.  I admit the sending of this patch was rushed for no
good reason.

>
> Jeff, what do you think?
>
> I have this nagging feeling that it is just as likely that two
> uneven hunks align at the top as they align at the bottom, so while
> this might not hurt it may not be the right approach for a better
> solution, in the sense that when somebody really wants to do a
> better solution, this change and the original code may need to be
> ripped out and redone from scratch.

Hmm, maybe. I stuck with assuming hunks are top-aligned because it
required less code to implement :)

The benefits of a simple dumb solution like assuming hunks align at
the top or bottom is that it remains very easy to visually match up
each highlighted deleted slice with its corresponding highlighted
added slice. If we start matching up similar lines or something like
that then it seems we would have to mostly forsake this benefit.  A
stupid algorithm in this case is nice because its output is
predictable.

A direct improvement upon this patch that would not require redoing
the whole script from scratch would be to first to calculate the
highlighting assuming the hunk aligns at the top, then to calculate
the highlighting assuming the hunk aligns at the bottom, and to pick
out of the two the highlighting with the least "noise".  Though we
would still be out of luck if the hunk is more complicated than being
top-aligned or bottom-aligned.

By the way, what would it take to get something like this script into
git proper?  It is IMHO immensely useful even in its current form, yet
because it's not baked into the application hardly anybody knows about
it.

>
>>  contrib/diff-highlight/diff-highlight | 26 +++++++++++++++++---------
>>  1 file changed, 17 insertions(+), 9 deletions(-)
>>
>> diff --git a/contrib/diff-highlight/diff-highlight b/contrib/diff-highlight/diff-highlight
>> index ffefc31..0dfbebd 100755
>> --- a/contrib/diff-highlight/diff-highlight
>> +++ b/contrib/diff-highlight/diff-highlight
>> @@ -88,22 +88,30 @@ sub show_hunk {
>>               return;
>>       }
>>
>> -     # If we have mismatched numbers of lines on each side, we could try to
>> -     # be clever and match up similar lines. But for now we are simple and
>> -     # stupid, and only handle multi-line hunks that remove and add the same
>> -     # number of lines.
>> -     if (@$a != @$b) {
>> -             print @$a, @$b;
>> -             return;
>> -     }
>> +     # We match up the first MIN(a, b) lines on each side.
>> +     my $c = @$a < @$b ? @$a : @$b;
>>
>> +     # Highlight each pair, and print each removed line of that pair.
>>       my @queue;
>> -     for (my $i = 0; $i < @$a; $i++) {
>> +     for (my $i = 0; $i < $c; $i++) {
>>               my ($rm, $add) = highlight_pair($a->[$i], $b->[$i]);
>>               print $rm;
>>               push @queue, $add;
>>       }
>> +
>> +     # Print the remaining unmatched removed lines of the hunk.
>> +     for (my $i = $c; $i < @$a; $i++) {
>> +             print $a->[$i];
>> +     }
>> +
>> +     # Print the added lines of each highlighted pair.
>>       print @queue;
>> +
>> +     # Print the remaining unmatched added lines of the hunk.
>> +     for (my $i = $c; $i < @$b; $i++) {
>> +             print $b->[$i];
>> +     }
>> +
>>  }
>>
>>  sub highlight_pair {

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] Improve contrib/diff-highlight to highlight unevenly-sized hunks
  2015-06-18 16:28   ` Patrick Palka
@ 2015-06-18 18:08     ` Junio C Hamano
  2015-06-18 19:04       ` Jeff King
  2015-06-18 20:23       ` Patrick Palka
  2015-06-18 19:08     ` Jeff King
  1 sibling, 2 replies; 20+ messages in thread
From: Junio C Hamano @ 2015-06-18 18:08 UTC (permalink / raw)
  To: Patrick Palka; +Cc: Jeff King, git

Patrick Palka <patrick@parcs.ath.cx> writes:

>> I have this nagging feeling that it is just as likely that two
>> uneven hunks align at the top as they align at the bottom, so while
>> this might not hurt it may not be the right approach for a better
>> solution, in the sense that when somebody really wants to do a
>> better solution, this change and the original code may need to be
>> ripped out and redone from scratch.
>
> Hmm, maybe. I stuck with assuming hunks are top-aligned because it
> required less code to implement :)

Yeah, I understand that.

If we will need to rip out only this change but keep the original in
order to implement a future better solution, then we might be better
off not having this change (if we anticipate such a better solution
to come reasonably soon), because it would make it more work for the
final improved solution.  But if we need to rip out the original as
well as this change while we do so, then having this patch would not
make it more work, either.

So as I said, I do not think it would hurt to have this as an
incremental improvement (albeit going in a possibly wrong
direction).

Of course, it is a separate question if this change makes the output
worse, by comparing unmatched early parts of two hunks and making
nonsense highlight by calling highlight_pair() more often.  As long
as that is not an issue, I am not opposed to this change, which was
what I meant to say by "this might not hurt".

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] Improve contrib/diff-highlight to highlight unevenly-sized hunks
  2015-06-18 18:08     ` Junio C Hamano
@ 2015-06-18 19:04       ` Jeff King
  2015-06-18 20:14         ` Patrick Palka
  2015-06-18 20:23       ` Patrick Palka
  1 sibling, 1 reply; 20+ messages in thread
From: Jeff King @ 2015-06-18 19:04 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Patrick Palka, git

On Thu, Jun 18, 2015 at 11:08:16AM -0700, Junio C Hamano wrote:

> So as I said, I do not think it would hurt to have this as an
> incremental improvement (albeit going in a possibly wrong
> direction).
> 
> Of course, it is a separate question if this change makes the output
> worse, by comparing unmatched early parts of two hunks and making
> nonsense highlight by calling highlight_pair() more often.  As long
> as that is not an issue, I am not opposed to this change, which was
> what I meant to say by "this might not hurt".

Yes, that is my big concern, and why I punted on mismatched-size hunks
in the first place. Now that we have a patch, it is easy enough to "git
log -p | diff-highlight" with the old and new versions to compare the
results.

It certainly does improve some cases. E.g.:

  -foo
  +foo &&
  +bar

in a test script becomes more clear. But some of the output is not so
great. For instance, the very commit under discussion has a
confusing and useless highlight. Or take a documentation patch like
5c31acfb, where I find the highlights actively distracting. We are saved
a little by the "if the whole line is different, do not highlight at
all" behavior of 097128d1bc.

So I dunno. IMHO this does more harm than good, and I would not want to
use it myself. But it is somewhat a matter of taste; I am not opposed to
making it a configurable option.

-Peff

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] Improve contrib/diff-highlight to highlight unevenly-sized hunks
  2015-06-18 19:04       ` Jeff King
@ 2015-06-18 20:14         ` Patrick Palka
  2015-06-18 20:45           ` Jeff King
  0 siblings, 1 reply; 20+ messages in thread
From: Patrick Palka @ 2015-06-18 20:14 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Patrick Palka, git

On Thu, 18 Jun 2015, Jeff King wrote:

> On Thu, Jun 18, 2015 at 11:08:16AM -0700, Junio C Hamano wrote:
>
>> So as I said, I do not think it would hurt to have this as an
>> incremental improvement (albeit going in a possibly wrong
>> direction).
>>
>> Of course, it is a separate question if this change makes the output
>> worse, by comparing unmatched early parts of two hunks and making
>> nonsense highlight by calling highlight_pair() more often.  As long
>> as that is not an issue, I am not opposed to this change, which was
>> what I meant to say by "this might not hurt".
>
> Yes, that is my big concern, and why I punted on mismatched-size hunks
> in the first place. Now that we have a patch, it is easy enough to "git
> log -p | diff-highlight" with the old and new versions to compare the
> results.
>
> It certainly does improve some cases. E.g.:
>
>  -foo
>  +foo &&
>  +bar
>
> in a test script becomes more clear. But some of the output is not so
> great. For instance, the very commit under discussion has a
> confusing and useless highlight. Or take a documentation patch like
> 5c31acfb, where I find the highlights actively distracting. We are saved
> a little by the "if the whole line is different, do not highlight at
> all" behavior of 097128d1bc.

To fix the useless highlights for both evenly and unevenly sized hunks
(like when all but a semicolon on a line changes), one can loosen the
criterion for not highlighting from "do not highlight if 0% of the
before and after lines are common between them" to, say, "do not
highlight if less than 10% of the before and after lines are common
between them".  Then most of these useless highlights are gone for both
evenly and unevenly sized hunks.

Here is a patch that changes the criterion as mentioned.  Testing this
change on the documentation patch 5c31acfb, only two pairs of lines are
highlighted instead of six.  On my original patch, the useless highlight
is gone.  The useless semicolon-related highlights on e.g. commit
99a2cfb are gone.

Ten percent is a modest threshold, and perhaps it should be increased
when highlighting unevenly sized hunks and decreased when highlighting
evenly sized hunks.

Of course, these patches are both hacks but they seem to be surprisingly
effective hacks especially when paired together.

>
> So I dunno. IMHO this does more harm than good, and I would not want to
> use it myself. But it is somewhat a matter of taste; I am not opposed to
> making it a configurable option.

That is something I can do :)

>
> -Peff
>

-- >8 --

Subject: [PATCH] diff-highlight: don't highlight lines that have little in
  common

---
  contrib/diff-highlight/diff-highlight | 13 +++++++++----
  1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/contrib/diff-highlight/diff-highlight b/contrib/diff-highlight/diff-highlight
index 85d2eb0..e4829ec 100755
--- a/contrib/diff-highlight/diff-highlight
+++ b/contrib/diff-highlight/diff-highlight
@@ -218,8 +218,13 @@ sub is_pair_interesting {
  	my $suffix_a = join('', @$a[($sa+1)..$#$a]);
  	my $suffix_b = join('', @$b[($sb+1)..$#$b]);

-	return $prefix_a !~ /^$COLOR*-$BORING*$/ ||
-	       $prefix_b !~ /^$COLOR*\+$BORING*$/ ||
-	       $suffix_a !~ /^$BORING*$/ ||
-	       $suffix_b !~ /^$BORING*$/;
+	$prefix_a =~ s/^$COLOR*-$BORING*//;
+	$prefix_b =~ s/^$COLOR*\+$BORING*//;
+	$suffix_a =~ s/$BORING*$//;
+	$suffix_b =~ s/$BORING*$//;
+
+	# Only bother highlighting if at least 10% of each line is common among
+	# the lines.
+	return ((length($prefix_a)+length($suffix_a))*100 >= @$a*10) &&
+	       ((length($prefix_b)+length($suffix_b))*100 >= @$b*10);
  }
-- 
2.4.4.410.g43ed522.dirty

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH] Improve contrib/diff-highlight to highlight unevenly-sized hunks
  2015-06-18 20:14         ` Patrick Palka
@ 2015-06-18 20:45           ` Jeff King
  2015-06-18 21:23             ` Jeff King
  2015-06-18 23:06             ` Patrick Palka
  0 siblings, 2 replies; 20+ messages in thread
From: Jeff King @ 2015-06-18 20:45 UTC (permalink / raw)
  To: Patrick Palka; +Cc: Junio C Hamano, git

On Thu, Jun 18, 2015 at 04:14:19PM -0400, Patrick Palka wrote:

> >in a test script becomes more clear. But some of the output is not so
> >great. For instance, the very commit under discussion has a
> >confusing and useless highlight. Or take a documentation patch like
> >5c31acfb, where I find the highlights actively distracting. We are saved
> >a little by the "if the whole line is different, do not highlight at
> >all" behavior of 097128d1bc.
> 
> To fix the useless highlights for both evenly and unevenly sized hunks
> (like when all but a semicolon on a line changes), one can loosen the
> criterion for not highlighting from "do not highlight if 0% of the
> before and after lines are common between them" to, say, "do not
> highlight if less than 10% of the before and after lines are common
> between them".  Then most of these useless highlights are gone for both
> evenly and unevenly sized hunks.

Yeah, this is an idea I had considered but never actually experimented
with. It does make some things better, but it also makes some a little
worse. For example, in 8dbf3eb, the hunk:

-               const char *plain = diff_get_color(ecb->color_diff,
-                                                  DIFF_PLAIN);
+               const char *context = diff_get_color(ecb->color_diff,
+                                                    DIFF_CONTEXT);

currently gets the plain/context change in the first line highlighted,
as well as the DIFF_PLAIN/DIFF_CONTEXT in the second line. With a 10%
limit, the second line isn't highlighted. That's correct by the
heuristic, but it's a bit harder to read, because the highlight draws
your eye to the first change, and it is easy to miss the second.

Still, I think this is probably a minority case, and it may be
outweighed by the improvements. The "real" solution is to consider the
hunk as a whole and do an LCS diff on it, which would show that yes,
it's worth highlighting both of those spots, as they are a small
percentage of the total hunk.

> Here is a patch that changes the criterion as mentioned.  Testing this
> change on the documentation patch 5c31acfb, only two pairs of lines are
> highlighted instead of six.  On my original patch, the useless highlight
> is gone.  The useless semicolon-related highlights on e.g. commit
> 99a2cfb are gone.

Nice, the ones like 99a2cfb are definitely wrong (I had though to fix
them eventually by treating some punctuation as uninteresting, but I
suspect the percentage heuristic covers that reasonably well in
practice).

> Of course, these patches are both hacks but they seem to be surprisingly
> effective hacks especially when paired together.

The whole script is a (surprisingly effective) hack. ;)

> >So I dunno. IMHO this does more harm than good, and I would not want to
> >use it myself. But it is somewhat a matter of taste; I am not opposed to
> >making it a configurable option.
> 
> That is something I can do :)

Coupled with the 10%-threshold patch, I think it would be OK to include
it unconditionally. So far we've just been diffing the two outputs and
micro-analyzing them. The real test to me will be using it in practice
and seeing if it's helpful or annoying.

-Peff

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] Improve contrib/diff-highlight to highlight unevenly-sized hunks
  2015-06-18 20:45           ` Jeff King
@ 2015-06-18 21:23             ` Jeff King
  2015-06-18 21:39               ` Junio C Hamano
                                 ` (2 more replies)
  2015-06-18 23:06             ` Patrick Palka
  1 sibling, 3 replies; 20+ messages in thread
From: Jeff King @ 2015-06-18 21:23 UTC (permalink / raw)
  To: Patrick Palka; +Cc: Junio C Hamano, git

On Thu, Jun 18, 2015 at 04:45:05PM -0400, Jeff King wrote:

> Still, I think this is probably a minority case, and it may be
> outweighed by the improvements. The "real" solution is to consider the
> hunk as a whole and do an LCS diff on it, which would show that yes,
> it's worth highlighting both of those spots, as they are a small
> percentage of the total hunk.

I've been meaning to play with this for years, so I took the opportunity
to spend a little time on it. :)

Below is a (slightly hacky) patch I came up with. It seems to work, and
produces really great output in some cases. For instance, in 99a2cfb, it
produces (I put highlighted bits in angle brackets):

  -               <hash>cpy(peeled, <sha1>);
  +               <oid>cpy(<&>peeled, <oid>);

It also produces nonsense like:

  -       <un>s<ign>ed <char >peeled<[20]>;
  +       s<truct obj>e<ct_i>d peeled;

but I think that is simply because my splitting function is terrible (it
splits each byte, whereas we'd probably want to use whitespace and
punctuation, or something content-specific).

---
diff --git a/contrib/diff-highlight/diff-highlight b/contrib/diff-highlight/diff-highlight
index ffefc31..7165518 100755
--- a/contrib/diff-highlight/diff-highlight
+++ b/contrib/diff-highlight/diff-highlight
@@ -3,6 +3,7 @@
 use 5.008;
 use warnings FATAL => 'all';
 use strict;
+use Algorithm::Diff;
 
 # Highlight by reversing foreground and background. You could do
 # other things like bold or underline if you prefer.
@@ -88,131 +89,54 @@ sub show_hunk {
 		return;
 	}
 
-	# If we have mismatched numbers of lines on each side, we could try to
-	# be clever and match up similar lines. But for now we are simple and
-	# stupid, and only handle multi-line hunks that remove and add the same
-	# number of lines.
-	if (@$a != @$b) {
-		print @$a, @$b;
-		return;
-	}
-
-	my @queue;
-	for (my $i = 0; $i < @$a; $i++) {
-		my ($rm, $add) = highlight_pair($a->[$i], $b->[$i]);
-		print $rm;
-		push @queue, $add;
-	}
-	print @queue;
-}
-
-sub highlight_pair {
-	my @a = split_line(shift);
-	my @b = split_line(shift);
+	my ($prefix_a, $suffix_a, @hunk_a) = split_hunk(@$a);
+	my ($prefix_b, $suffix_b, @hunk_b) = split_hunk(@$b);
 
-	# Find common prefix, taking care to skip any ansi
-	# color codes.
-	my $seen_plusminus;
-	my ($pa, $pb) = (0, 0);
-	while ($pa < @a && $pb < @b) {
-		if ($a[$pa] =~ /$COLOR/) {
-			$pa++;
-		}
-		elsif ($b[$pb] =~ /$COLOR/) {
-			$pb++;
-		}
-		elsif ($a[$pa] eq $b[$pb]) {
-			$pa++;
-			$pb++;
-		}
-		elsif (!$seen_plusminus && $a[$pa] eq '-' && $b[$pb] eq '+') {
-			$seen_plusminus = 1;
-			$pa++;
-			$pb++;
-		}
-		else {
-			last;
-		}
-	}
+	my $diff = Algorithm::Diff->new(\@hunk_a, \@hunk_b);
+	my (@out_a, @out_b);
+	while ($diff->Next()) {
+		my $bits = $diff->Diff();
 
-	# Find common suffix, ignoring colors.
-	my ($sa, $sb) = ($#a, $#b);
-	while ($sa >= $pa && $sb >= $pb) {
-		if ($a[$sa] =~ /$COLOR/) {
-			$sa--;
-		}
-		elsif ($b[$sb] =~ /$COLOR/) {
-			$sb--;
-		}
-		elsif ($a[$sa] eq $b[$sb]) {
-			$sa--;
-			$sb--;
-		}
-		else {
-			last;
-		}
-	}
+		push @out_a, $OLD_HIGHLIGHT[1] if $bits & 1;
+		push @out_a, $diff->Items(1);
+		push @out_a, $OLD_HIGHLIGHT[2] if $bits & 1;
 
-	if (is_pair_interesting(\@a, $pa, $sa, \@b, $pb, $sb)) {
-		return highlight_line(\@a, $pa, $sa, \@OLD_HIGHLIGHT),
-		       highlight_line(\@b, $pb, $sb, \@NEW_HIGHLIGHT);
-	}
-	else {
-		return join('', @a),
-		       join('', @b);
+		push @out_b, $NEW_HIGHLIGHT[1] if $bits & 2;
+		push @out_b, $diff->Items(2);
+		push @out_b, $NEW_HIGHLIGHT[2] if $bits & 2;
 	}
-}
 
-sub split_line {
-	local $_ = shift;
-	return utf8::decode($_) ?
-		map { utf8::encode($_); $_ }
-			map { /$COLOR/ ? $_ : (split //) }
-			split /($COLOR+)/ :
-		map { /$COLOR/ ? $_ : (split //) }
-		split /($COLOR+)/;
+	output_split_hunk($prefix_a, $suffix_a, @out_a);
+	output_split_hunk($prefix_b, $suffix_b, @out_b);
 }
 
-sub highlight_line {
-	my ($line, $prefix, $suffix, $theme) = @_;
-
-	my $start = join('', @{$line}[0..($prefix-1)]);
-	my $mid = join('', @{$line}[$prefix..$suffix]);
-	my $end = join('', @{$line}[($suffix+1)..$#$line]);
-
-	# If we have a "normal" color specified, then take over the whole line.
-	# Otherwise, we try to just manipulate the highlighted bits.
-	if (defined $theme->[0]) {
-		s/$COLOR//g for ($start, $mid, $end);
-		chomp $end;
-		return join('',
-			$theme->[0], $start, $RESET,
-			$theme->[1], $mid, $RESET,
-			$theme->[0], $end, $RESET,
-			"\n"
-		);
-	} else {
-		return join('',
-			$start,
-			$theme->[1], $mid, $theme->[2],
-			$end
-		);
+# Return the individual diff-able items of the hunk, but with any
+# diff or color prefix/suffix for each line split out (we assume that the
+# prefix/suffix for each line will be the same).
+sub split_hunk {
+	my ($prefix, $suffix, @r);
+	foreach my $line (@_) {
+		$line =~ /^($COLOR*[+-]$COLOR*)(.*)($COLOR*)/
+			or die "eh, this is supposed to match everything!";
+
+		# overwrite the old values; we assume they're all the same
+		# anyway
+		$prefix = $1;
+		$suffix = $3;
+
+		# do a straight character split. This almost certainly isn't
+		# ideal, but it's a good starting point. We should at the very
+		# least be utf8-aware, and probably use color-words regexes.
+		push @r, split(//, $2), "\n";
 	}
+	return ($prefix, $suffix, @r);
 }
 
-# Pairs are interesting to highlight only if we are going to end up
-# highlighting a subset (i.e., not the whole line). Otherwise, the highlighting
-# is just useless noise. We can detect this by finding either a matching prefix
-# or suffix (disregarding boring bits like whitespace and colorization).
-sub is_pair_interesting {
-	my ($a, $pa, $sa, $b, $pb, $sb) = @_;
-	my $prefix_a = join('', @$a[0..($pa-1)]);
-	my $prefix_b = join('', @$b[0..($pb-1)]);
-	my $suffix_a = join('', @$a[($sa+1)..$#$a]);
-	my $suffix_b = join('', @$b[($sb+1)..$#$b]);
-
-	return $prefix_a !~ /^$COLOR*-$BORING*$/ ||
-	       $prefix_b !~ /^$COLOR*\+$BORING*$/ ||
-	       $suffix_a !~ /^$BORING*$/ ||
-	       $suffix_b !~ /^$BORING*$/;
+sub output_split_hunk {
+	my $prefix = shift;
+	my $suffix = shift;
+	my $str = join('', @_);
+	$str =~ s/^/$prefix/mg;
+	$str =~ s/$/$suffix/mg;
+	print $str;
 }

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH] Improve contrib/diff-highlight to highlight unevenly-sized hunks
  2015-06-18 21:23             ` Jeff King
@ 2015-06-18 21:39               ` Junio C Hamano
  2015-06-18 22:25               ` Patrick Palka
  2015-06-19  3:54               ` Jeff King
  2 siblings, 0 replies; 20+ messages in thread
From: Junio C Hamano @ 2015-06-18 21:39 UTC (permalink / raw)
  To: Jeff King; +Cc: Patrick Palka, git

Jeff King <peff@peff.net> writes:

> On Thu, Jun 18, 2015 at 04:45:05PM -0400, Jeff King wrote:
>
>> Still, I think this is probably a minority case, and it may be
>> outweighed by the improvements. The "real" solution is to consider the
>> hunk as a whole and do an LCS diff on it, which would show that yes,
>> it's worth highlighting both of those spots, as they are a small
>> percentage of the total hunk.
>
> I've been meaning to play with this for years, so I took the opportunity
> to spend a little time on it. :)

I envy you ;-)  It certainly looks like a fun side project.

> Below is a (slightly hacky) patch I came up with. It seems to work, and
> produces really great output in some cases. For instance, in 99a2cfb, it
> produces (I put highlighted bits in angle brackets):
>
>   -               <hash>cpy(peeled, <sha1>);
>   +               <oid>cpy(<&>peeled, <oid>);
>
> It also produces nonsense like:
>
>   -       <un>s<ign>ed <char >peeled<[20]>;
>   +       s<truct obj>e<ct_i>d peeled;

ROTFL ;-)

And the change does not look bad at all.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] Improve contrib/diff-highlight to highlight unevenly-sized hunks
  2015-06-18 21:23             ` Jeff King
  2015-06-18 21:39               ` Junio C Hamano
@ 2015-06-18 22:25               ` Patrick Palka
  2015-06-19  3:54               ` Jeff King
  2 siblings, 0 replies; 20+ messages in thread
From: Patrick Palka @ 2015-06-18 22:25 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, git

On Thu, Jun 18, 2015 at 5:23 PM, Jeff King <peff@peff.net> wrote:
> On Thu, Jun 18, 2015 at 04:45:05PM -0400, Jeff King wrote:
>
>> Still, I think this is probably a minority case, and it may be
>> outweighed by the improvements. The "real" solution is to consider the
>> hunk as a whole and do an LCS diff on it, which would show that yes,
>> it's worth highlighting both of those spots, as they are a small
>> percentage of the total hunk.
>
> I've been meaning to play with this for years, so I took the opportunity
> to spend a little time on it. :)

Cool!

>
> Below is a (slightly hacky) patch I came up with. It seems to work, and
> produces really great output in some cases. For instance, in 99a2cfb, it
> produces (I put highlighted bits in angle brackets):
>
>   -               <hash>cpy(peeled, <sha1>);
>   +               <oid>cpy(<&>peeled, <oid>);
>
> It also produces nonsense like:
>
>   -       <un>s<ign>ed <char >peeled<[20]>;
>   +       s<truct obj>e<ct_i>d peeled;

That's not even so bad.  The diff of the change itself is... interesting.

>
> but I think that is simply because my splitting function is terrible (it
> splits each byte, whereas we'd probably want to use whitespace and
> punctuation, or something content-specific).

I hope you can polish this.  It definitely has potential.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] Improve contrib/diff-highlight to highlight unevenly-sized hunks
  2015-06-18 21:23             ` Jeff King
  2015-06-18 21:39               ` Junio C Hamano
  2015-06-18 22:25               ` Patrick Palka
@ 2015-06-19  3:54               ` Jeff King
  2015-06-19  4:49                 ` Junio C Hamano
  2 siblings, 1 reply; 20+ messages in thread
From: Jeff King @ 2015-06-19  3:54 UTC (permalink / raw)
  To: Patrick Palka; +Cc: Junio C Hamano, git

On Thu, Jun 18, 2015 at 05:23:56PM -0400, Jeff King wrote:

> +# Return the individual diff-able items of the hunk, but with any
> +# diff or color prefix/suffix for each line split out (we assume that the
> +# prefix/suffix for each line will be the same).
> +sub split_hunk {
> +	my ($prefix, $suffix, @r);
> +	foreach my $line (@_) {
> +		$line =~ /^($COLOR*[+-]$COLOR*)(.*)($COLOR*)/
> +			or die "eh, this is supposed to match everything!";

This isn't quite right. We'll never match the suffix, because it gets
sucked up by the greedy (.*). But making that non-greedy matches
nothing, unless we also add "$" at the end.

_But_, there is still something else weird going on. I replaced this
split:

> +		push @r, split(//, $2), "\n";

with:

  push @r, split(/([[:space:][:punct:]])/, $2), "\n";

which behaves much better. With that, 99a2cfb shows:

  -       <if> (!peel_ref(path, peeled)) {
  -               <is>_annotated = !!<hashcmp>(<sha1>, peeled);
  +       <if> (!peel_ref(path, peeled<.hash>)) {
  +               <is>_annotated = !!<oidcmp>(<oid>, <&>peeled);

The latter half of both lines looks perfect. But what's that weird
highlighting of the initial "if" and "is" on those lines?

It turns out that the colored output we produce is quite odd:

  $ git show --color 99a2cfb | grep peel_ref | cat -A
  ^[[31m-^Iif (!peel_ref(path, peeled)) {^[[m$
  ^[[32m+^[[m^I^[[32mif (!peel_ref(path, peeled.hash)) {^[[m$

For the pre-image, we print the color, the "-", and then the line data.
Makes sense.

For the post-image, we print the color, "+", a reset, then the initial
whitespace, then the color again!

So of course the diff algorithm says "hey, there's this weird color in
here". The original implementation of diff-highlight didn't care,
because it skipped leading whitespace and colors as "boring". But this
one cannot do that. It pulls the strict prefix out of each line (and it
must, because it must get the same prefix for each line of a hunk).

I think git is actually wrong here; it's mingling the ANSI colors and
the actual content. Nobody ever noticed because it looks OK to a human,
and who would be foolish enough to try machine-parsing colorized diff
output?

The fix is:

diff --git a/diff.c b/diff.c
index 87b16d5..a80b5b4 100644
--- a/diff.c
+++ b/diff.c
@@ -501,9 +501,9 @@ static void emit_line_checked(const char *reset,
 		emit_line_0(ecbdata->opt, ws, reset, sign, line, len);
 	else {
 		/* Emit just the prefix, then the rest. */
-		emit_line_0(ecbdata->opt, set, reset, sign, "", 0);
+		emit_line_0(ecbdata->opt, set, "", sign, "", 0);
 		ws_check_emit(line, len, ecbdata->ws_rule,
-			      ecbdata->opt->file, set, reset, ws);
+			      ecbdata->opt->file, "", reset, ws);
 	}
 }

But I'm a little worried it may interfere with the way the
whitespace-checker emits colors (e.g., it may emit its own colors for
the leading spaces, and we would need to re-assert our color before
showing the rest of the line). So I think you could also argue that
because of whitespace-highlighting, colorized diffs are fundamentally
going to have colors intermingled with the content and should not be
parsed this way.

All the more reason to try to move this inside diff.c, I guess.

-Peff

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH] Improve contrib/diff-highlight to highlight unevenly-sized hunks
  2015-06-19  3:54               ` Jeff King
@ 2015-06-19  4:49                 ` Junio C Hamano
  2015-06-19  5:32                   ` Jeff King
  0 siblings, 1 reply; 20+ messages in thread
From: Junio C Hamano @ 2015-06-19  4:49 UTC (permalink / raw)
  To: Jeff King; +Cc: Patrick Palka, git

Jeff King <peff@peff.net> writes:

> ... I think you could also argue that
> because of whitespace-highlighting, colorized diffs are fundamentally
> going to have colors intermingled with the content and should not be
> parsed this way.

Painting of whitespace breakages is asymmetric [*1*].  If you change
something on a badly indented line without fixing the indentation,
e.g.

	-<SP><TAB>hello-world
        +<SP><TAB>hello-people

only the space-before-tab on the latter line is painted.

For the purpose of your diff highlighting, however, you would want
to treat the leading "<SP><TAB>hello-" on preimage and postimage
lines as unchanged.

Which means that you shouldn't be reading a colored output without
ignoring the color-control sequences.

So I think you arrived at the correct conclusion.

> All the more reason to try to move this inside diff.c, I guess.

That too, probably.

If we were to do this, I think it may make sense to separate the
logic to compute which span of the string need to be painted in what
color and the routine to actually emit the colored output.  That is,
instead of letting ws-check-emit to decide which part should be in
what color _and_ emitting the result, we should have a helper that
reads <line, len>, and give us an array of spans to flag as
whitespace violation.  Then an optional diff-highlight code would
scan the same <line, len> (or a collection of <line, len>) without
getting confused by the whitespace errors and annotate the spans to
be highlighted.  After all that happens, the output routine would
coalesce the spans and produce colored output.

Or something like that ;-)

[Footnote]

*1* We recently added a knob to allow us paint them on preimage and
common lines, too, but that is not the default.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] Improve contrib/diff-highlight to highlight unevenly-sized hunks
  2015-06-19  4:49                 ` Junio C Hamano
@ 2015-06-19  5:32                   ` Jeff King
  2015-06-19  7:34                     ` Jeff King
  2015-06-19 17:20                     ` Junio C Hamano
  0 siblings, 2 replies; 20+ messages in thread
From: Jeff King @ 2015-06-19  5:32 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Patrick Palka, git

On Thu, Jun 18, 2015 at 09:49:19PM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > ... I think you could also argue that
> > because of whitespace-highlighting, colorized diffs are fundamentally
> > going to have colors intermingled with the content and should not be
> > parsed this way.
> 
> Painting of whitespace breakages is asymmetric [*1*].  If you change
> something on a badly indented line without fixing the indentation,
> e.g.
> 
> 	-<SP><TAB>hello-world
>         +<SP><TAB>hello-people
> 
> only the space-before-tab on the latter line is painted.
> 
> For the purpose of your diff highlighting, however, you would want
> to treat the leading "<SP><TAB>hello-" on preimage and postimage
> lines as unchanged.

I do strip it off, so it is OK for it to be different in both the
pre-image and post-image. But what I can't tolerate is the
intermingling with actual data:

  +\t\t\x1b[32m;foo
  +\t\x1b[32m;bar

Those are both post-image lines. I can strip off the "+" from each side
to compare their inner parts to the pre-image. But the intermingled
color gets in my way. I can simply strip all colors from the whole line,
but then information is lost; how do I know where to put them back
again? It is not just "add back the color at the beginning" (which is
what I do with the prefix).

I think the answer is that I must strip them out, retaining the colors
found in each line along with their offset into the line, and then as I
write out the lines, re-add them at the appropriate spots (taking care
to use the original offsets, not the ones with the highlighting added
in).

> > All the more reason to try to move this inside diff.c, I guess.
> 
> That too, probably.

Hmm, I thought that would solve all my problems by operating on the
pre-color version without much more work. But...

> If we were to do this, I think it may make sense to separate the
> logic to compute which span of the string need to be painted in what
> color and the routine to actually emit the colored output.  That is,
> instead of letting ws-check-emit to decide which part should be in
> what color _and_ emitting the result, we should have a helper that
> reads <line, len>, and give us an array of spans to flag as
> whitespace violation.  Then an optional diff-highlight code would
> scan the same <line, len> (or a collection of <line, len>) without
> getting confused by the whitespace errors and annotate the spans to
> be highlighted.  After all that happens, the output routine would
> coalesce the spans and produce colored output.
> 
> Or something like that ;-)

I think this "array of spans" is the only way to go. Otherwise whichever
markup scheme processes the hunk first ruins the data for the next
processor.

So it is a lot more work to make the two work together. The --word-diff
code would have the same issue, except that I imagine it just skips
whitespace-highlighting altogether.

The least-work thing may actually be teaching the separate
diff-highlight script to strip out the colorizing and re-add it by
offset.

-Peff

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] Improve contrib/diff-highlight to highlight unevenly-sized hunks
  2015-06-19  5:32                   ` Jeff King
@ 2015-06-19  7:34                     ` Jeff King
  2015-06-19 11:38                       ` Jeff King
  2015-06-19 17:20                     ` Junio C Hamano
  1 sibling, 1 reply; 20+ messages in thread
From: Jeff King @ 2015-06-19  7:34 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Patrick Palka, git

On Fri, Jun 19, 2015 at 01:32:23AM -0400, Jeff King wrote:

> The least-work thing may actually be teaching the separate
> diff-highlight script to strip out the colorizing and re-add it by
> offset.

OK, here is that patch. It seems to work OK, and should preserve
existing colors produced by git anywhere within the input.

It should also serve as a stepping stone to using the same technique
inside git itself. I am not planning to do that anytime soon, though. I
took a look at the word-diff code earlier today, and my brain partially
melted. I'm sure it's possible, but I've already spent enough time and
brain cells on this for now.

I think this patch does need some more polishing, or more heuristics,
though. Some of the output is really fantastic. Here's a simple one from
69f9a6e54:

  -* link:v2.4.<2>/git.html[documentation for release 2.4.<2>]
  +* link:v2.4.<3>/git.html[documentation for release 2.4.<3>]

The angle brackets show what is highlighted, but that really doesn't do
it justice. I encourage people to apply the patch and

  git show --color $commit |
  contrib/diff-highlight/diff-highlight

to see for yourselves (I'm hesitant to assume that sending ANSI color
codes via email will end up readable for anyone).

What's interesting about that example (that didn't work before) is that
we can find multiple changes and highlight them independently (whereas
before, we would have highlighted everything between them, too).

Here's a better code example from 9cc2b07a7:

  -int parse_commit(struct commit *item)
  +int parse_commit<_gently>(struct commit *item<, int quiet_on_missing>)

It highlights the change in the function name and the matching change to
the parameter list. And from a few lines below in the same diff:

  -               return error("Could not read %s",
  +               return <quiet_on_missing ? -1 :>
  +                       error("Could not read %s",

This demonstrates it working on multiple lines, and in particular one
where the matching content actually crosses a line boundary.

So that's all the good news. Here's some bad news, from f0e7f11d054:

  -       return offset1 - offset2;
  +       return offset1 < offset2 ? -1 :
  +              offset1 > offset2 ?  1 :
  +              0;

I don't think adding angle brackets to this one would do it justice, but
you can look for yourself. What happens is that we match offset1, but
then the "-" from the pre-image matches the first character of "-1" from
the post-image. And all of "< offset2 ? " is highlighted, then not "-",
then "1 :", and so on. The result is rather confusing to read.

I think that can probably be fixed by smarter tokenizing of what we feed
to the per-hunk diff (i.e., "-1" should be a full token).

It also does a weird thing where the "offset1" from the second
post-image line is _not_ highlighted, and it should be (even though it's
ugly, it is the correct thing to do). I think what is happening is that
git's existing colorizing is getting in the way. It does not say "turn
on green; OK, now turn off green". It says "turn on green; OK, now
reset all colors". So at the beginning of each post-image line, we will
accidentally turn off any highlighting carried over from the previous
line. That's probably OK in general, because highlighting across lines
is a good indication that it's not helpful. But it is a little hacky.

And here's some more bad news. If you look at the diff for this
patch itself, it's terribly unreadable (the regular diff already is
pretty bad, but the highlights make it much worse). There are big chunks
where we take away 5 or 10 lines from the old code, and replace them
with totally unrelated lines. We end up highlighting almost the entire
thing, except for spaces and punctuation.

We might be able to solve this with a percentage heuristic similar to
the one Patrick proposed. It's not really interesting to highlight
unless we're doing it on probably 20% or less of the diff (where 20% is
a number I just made up).

I also have a feeling that we might get better results if we don't feed
the delimiters into the diff algorithm. I think the word-diff code
actually tokenizes "foo bar baz" into the list (foo, bar, baz) and diffs
only that. I think it then _loses_ the original whitespace, but here we
would need to record and restore it. We might be able to use the same
mechanism as for stripping and restoring the colors. I'm not sure.

Anyway, here's the patch. It's rather messy, so I'd recommend just
reading the resulting code; I wish there was a good way to convince the
diff engine to produce larger hunks.

---
 contrib/diff-highlight/diff-highlight | 202 +++++++++++++++-------------------
 1 file changed, 86 insertions(+), 116 deletions(-)

diff --git a/contrib/diff-highlight/diff-highlight b/contrib/diff-highlight/diff-highlight
index ffefc31..1525ccc 100755
--- a/contrib/diff-highlight/diff-highlight
+++ b/contrib/diff-highlight/diff-highlight
@@ -3,6 +3,7 @@
 use 5.008;
 use warnings FATAL => 'all';
 use strict;
+use Algorithm::Diff;

 # Highlight by reversing foreground and background. You could do
 # other things like bold or underline if you prefer.
@@ -80,139 +81,108 @@ sub color_config {
 }

 sub show_hunk {
-	my ($a, $b) = @_;
+	my ($lines_a, $lines_b) = @_;

 	# If one side is empty, then there is nothing to compare or highlight.
-	if (!@$a || !@$b) {
-		print @$a, @$b;
+	if (!@$lines_a || !@$lines_b) {
+		print @$lines_a, @$lines_b;
 		return;
 	}

-	# If we have mismatched numbers of lines on each side, we could try to
-	# be clever and match up similar lines. But for now we are simple and
-	# stupid, and only handle multi-line hunks that remove and add the same
-	# number of lines.
-	if (@$a != @$b) {
-		print @$a, @$b;
-		return;
+	# Strip out any cruft so we can do the real diff on $a and $b.
+	my ($a, @stripped_a) = strip_image(@$lines_a);
+	my ($b, @stripped_b) = strip_image(@$lines_b);
+
+	# Now we do the actual diff. Our highlight list is in the same
+	# annotation format as the @stripped data.
+	my $diff = Algorithm::Diff->new([split_image($a)], [split_image($b)]);
+	my ($offset_a, $offset_b) = (0, 0);
+	my (@highlight_a, @highlight_b);
+	while ($diff->Next()) {
+		my $bits = $diff->Diff();
+
+		push @highlight_a, [$offset_a, $OLD_HIGHLIGHT[1]]
+			if $bits & 1;
+		$offset_a += length($_) for $diff->Items(1);
+		push @highlight_a, [$offset_a, $OLD_HIGHLIGHT[2]]
+			if $bits & 1;
+
+		push @highlight_b, [$offset_b, $NEW_HIGHLIGHT[1]]
+			if $bits & 2;
+		$offset_b += length($_) for $diff->Items(2);
+		push @highlight_b, [$offset_b, $NEW_HIGHLIGHT[2]]
+			if $bits & 2;
 	}

-	my @queue;
-	for (my $i = 0; $i < @$a; $i++) {
-		my ($rm, $add) = highlight_pair($a->[$i], $b->[$i]);
-		print $rm;
-		push @queue, $add;
-	}
-	print @queue;
+	# And now show the output both with the original stripped annotations,
+	# as well as our new highlights.
+	show_image($a, [merge_annotations(\@stripped_a, \@highlight_a)]);
+	show_image($b, [merge_annotations(\@stripped_b, \@highlight_b)]);
 }

-sub highlight_pair {
-	my @a = split_line(shift);
-	my @b = split_line(shift);
-
-	# Find common prefix, taking care to skip any ansi
-	# color codes.
-	my $seen_plusminus;
-	my ($pa, $pb) = (0, 0);
-	while ($pa < @a && $pb < @b) {
-		if ($a[$pa] =~ /$COLOR/) {
-			$pa++;
-		}
-		elsif ($b[$pb] =~ /$COLOR/) {
-			$pb++;
-		}
-		elsif ($a[$pa] eq $b[$pb]) {
-			$pa++;
-			$pb++;
-		}
-		elsif (!$seen_plusminus && $a[$pa] eq '-' && $b[$pb] eq '+') {
-			$seen_plusminus = 1;
-			$pa++;
-			$pb++;
-		}
-		else {
-			last;
+# Strip out any diff syntax (i.e., leading +/-), along with any ANSI color
+# codes from the pre- or post-image of a hunk. The result is a string of text
+# suitable for diffing against the other side of the hunk.
+#
+# In addition to returning the hunk itself, we also return an arrayref that
+# contains the stripped data.  Each element is itself an arrayref containing
+# the offset into the stripped hunk, along with the stripped data that belongs
+# there.
+sub strip_image {
+	my $image = '';
+	my @stripped;
+	foreach my $line (@_) {
+		$line =~ s/^$COLOR*[+-]$COLOR*//
+			or die "BUG: line was not +/-: $line";
+		push @stripped, [length($image), $&];
+
+		while (length($line)) {
+			if ($line =~ s/^$COLOR+//) {
+				push @stripped, [length($image), $&];
+			} elsif ($line =~ s/^(.+?)($COLOR|$)/$2/s) {
+				$image .= $1;
+			} else {
+				die "BUG: we should have matched _something_";
+			}
 		}
 	}

-	# Find common suffix, ignoring colors.
-	my ($sa, $sb) = ($#a, $#b);
-	while ($sa >= $pa && $sb >= $pb) {
-		if ($a[$sa] =~ /$COLOR/) {
-			$sa--;
-		}
-		elsif ($b[$sb] =~ /$COLOR/) {
-			$sb--;
-		}
-		elsif ($a[$sa] eq $b[$sb]) {
-			$sa--;
-			$sb--;
-		}
-		else {
-			last;
-		}
-	}
-
-	if (is_pair_interesting(\@a, $pa, $sa, \@b, $pb, $sb)) {
-		return highlight_line(\@a, $pa, $sa, \@OLD_HIGHLIGHT),
-		       highlight_line(\@b, $pb, $sb, \@NEW_HIGHLIGHT);
-	}
-	else {
-		return join('', @a),
-		       join('', @b);
-	}
+	return $image, @stripped;
 }

-sub split_line {
-	local $_ = shift;
-	return utf8::decode($_) ?
-		map { utf8::encode($_); $_ }
-			map { /$COLOR/ ? $_ : (split //) }
-			split /($COLOR+)/ :
-		map { /$COLOR/ ? $_ : (split //) }
-		split /($COLOR+)/;
+# Split the pre- or post-image into diffable elements. Returns
+sub split_image {
+	return split(/([[:space:]]+|[[:punct:]]+)/, shift);
 }

-sub highlight_line {
-	my ($line, $prefix, $suffix, $theme) = @_;
-
-	my $start = join('', @{$line}[0..($prefix-1)]);
-	my $mid = join('', @{$line}[$prefix..$suffix]);
-	my $end = join('', @{$line}[($suffix+1)..$#$line]);
-
-	# If we have a "normal" color specified, then take over the whole line.
-	# Otherwise, we try to just manipulate the highlighted bits.
-	if (defined $theme->[0]) {
-		s/$COLOR//g for ($start, $mid, $end);
-		chomp $end;
-		return join('',
-			$theme->[0], $start, $RESET,
-			$theme->[1], $mid, $RESET,
-			$theme->[0], $end, $RESET,
-			"\n"
-		);
-	} else {
-		return join('',
-			$start,
-			$theme->[1], $mid, $theme->[2],
-			$end
-		);
+sub merge_annotations {
+	my ($a, $b) = @_;
+	my @r;
+	while (@$a && @$b) {
+		if ($a->[0]->[0] <= $b->[0]->[0]) {
+			push @r, shift @$a;
+		} else {
+			push @r, shift @$b;
+		}
 	}
+	push @r, @$a;
+	push @r, @$b;
+	return @r;
 }

-# Pairs are interesting to highlight only if we are going to end up
-# highlighting a subset (i.e., not the whole line). Otherwise, the highlighting
-# is just useless noise. We can detect this by finding either a matching prefix
-# or suffix (disregarding boring bits like whitespace and colorization).
-sub is_pair_interesting {
-	my ($a, $pa, $sa, $b, $pb, $sb) = @_;
-	my $prefix_a = join('', @$a[0..($pa-1)]);
-	my $prefix_b = join('', @$b[0..($pb-1)]);
-	my $suffix_a = join('', @$a[($sa+1)..$#$a]);
-	my $suffix_b = join('', @$b[($sb+1)..$#$b]);
-
-	return $prefix_a !~ /^$COLOR*-$BORING*$/ ||
-	       $prefix_b !~ /^$COLOR*\+$BORING*$/ ||
-	       $suffix_a !~ /^$BORING*$/ ||
-	       $suffix_b !~ /^$BORING*$/;
+sub show_image {
+	my ($image, $annotations) = @_;
+	my $pos = 0;
+
+	foreach my $an (@$annotations) {
+		if ($pos < $an->[0]) {
+			print substr($image, $pos, $an->[0] - $pos);
+			$pos = $an->[0];
+		}
+		print $an->[1];
+	}
+
+	if ($pos < length($image)) {
+		print substr($image, $pos);
+	}
 }
-- 
2.4.4.719.g3984bc6

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH] Improve contrib/diff-highlight to highlight unevenly-sized hunks
  2015-06-19  7:34                     ` Jeff King
@ 2015-06-19 11:38                       ` Jeff King
  0 siblings, 0 replies; 20+ messages in thread
From: Jeff King @ 2015-06-19 11:38 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Patrick Palka, git

On Fri, Jun 19, 2015 at 03:34:55AM -0400, Jeff King wrote:

> And here's some more bad news. If you look at the diff for this
> patch itself, it's terribly unreadable (the regular diff already is
> pretty bad, but the highlights make it much worse). There are big chunks
> where we take away 5 or 10 lines from the old code, and replace them
> with totally unrelated lines. We end up highlighting almost the entire
> thing, except for spaces and punctuation.
> 
> We might be able to solve this with a percentage heuristic similar to
> the one Patrick proposed. It's not really interesting to highlight
> unless we're doing it on probably 20% or less of the diff (where 20% is
> a number I just made up).

That turned out to be pretty easy; patch is below (on top of what I sent
earlier). I set the percentage at 50% based on eyeballing "git log -p"
in git.git, and it seems to give good results.

So I think the big remaining issue is improved tokenizing. Maybe Patrick
will want to take a stab at it.

---
diff --git a/contrib/diff-highlight/diff-highlight b/contrib/diff-highlight/diff-highlight
index 1525ccc..9454446 100755
--- a/contrib/diff-highlight/diff-highlight
+++ b/contrib/diff-highlight/diff-highlight
@@ -114,12 +114,32 @@ sub show_hunk {
 			if $bits & 2;
 	}
 
+	my $highlighted = count_highlight(@highlight_a) +
+			  count_highlight(@highlight_b);
+	my $total = length($a) + length($b);
+	my $pct = $highlighted / $total;
+
+	if ($pct > 0.5) {
+		@highlight_a = ();
+		@highlight_b = ();
+	}
+
 	# And now show the output both with the original stripped annotations,
 	# as well as our new highlights.
 	show_image($a, [merge_annotations(\@stripped_a, \@highlight_a)]);
 	show_image($b, [merge_annotations(\@stripped_b, \@highlight_b)]);
 }
 
+sub count_highlight {
+	my $total = 0;
+	while (@_) {
+		my $from = shift;
+		my $to = shift;
+		$total += $to->[0] - $from->[0];
+	}
+	return $total;
+}
+
 # Strip out any diff syntax (i.e., leading +/-), along with any ANSI color
 # codes from the pre- or post-image of a hunk. The result is a string of text
 # suitable for diffing against the other side of the hunk.

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH] Improve contrib/diff-highlight to highlight unevenly-sized hunks
  2015-06-19  5:32                   ` Jeff King
  2015-06-19  7:34                     ` Jeff King
@ 2015-06-19 17:20                     ` Junio C Hamano
  1 sibling, 0 replies; 20+ messages in thread
From: Junio C Hamano @ 2015-06-19 17:20 UTC (permalink / raw)
  To: Jeff King; +Cc: Patrick Palka, git

Jeff King <peff@peff.net> writes:

> I do strip it off, so it is OK for it to be different in both the
> pre-image and post-image. But what I can't tolerate is the
> intermingling with actual data:
>
>   +\t\t\x1b[32m;foo
>   +\t\x1b[32m;bar

I think that depends on the definition of "strip it off" ;-)  What I
meant was that whitespace coloring can appear anywhere on the line,
e.g.

    echo COPYING whitespace=tab-in-indent,-space-before-tab >.gitattributes
    printf " \tHeh\n" >>COPYING
    git diff

will give you

    <new> + <reset> SP <wserror> HT <reset> <new> Heh <reset> LF

Obviously, stripping "+" painted in "new" is not sufficient.

	Side note: hmm, shouldn't that SP painted in <new>, though?

> I think this "array of spans" is the only way to go. Otherwise whichever
> markup scheme processes the hunk first ruins the data for the next
> processor.

Yes, I agree with that 100%.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] Improve contrib/diff-highlight to highlight unevenly-sized hunks
  2015-06-18 20:45           ` Jeff King
  2015-06-18 21:23             ` Jeff King
@ 2015-06-18 23:06             ` Patrick Palka
  1 sibling, 0 replies; 20+ messages in thread
From: Patrick Palka @ 2015-06-18 23:06 UTC (permalink / raw)
  To: Jeff King; +Cc: Patrick Palka, Junio C Hamano, git

On Thu, 18 Jun 2015, Jeff King wrote:

> On Thu, Jun 18, 2015 at 04:14:19PM -0400, Patrick Palka wrote:
>
>>> in a test script becomes more clear. But some of the output is not so
>>> great. For instance, the very commit under discussion has a
>>> confusing and useless highlight. Or take a documentation patch like
>>> 5c31acfb, where I find the highlights actively distracting. We are saved
>>> a little by the "if the whole line is different, do not highlight at
>>> all" behavior of 097128d1bc.
>>
>> To fix the useless highlights for both evenly and unevenly sized hunks
>> (like when all but a semicolon on a line changes), one can loosen the
>> criterion for not highlighting from "do not highlight if 0% of the
>> before and after lines are common between them" to, say, "do not
>> highlight if less than 10% of the before and after lines are common
>> between them".  Then most of these useless highlights are gone for both
>> evenly and unevenly sized hunks.
>
> Yeah, this is an idea I had considered but never actually experimented
> with. It does make some things better, but it also makes some a little
> worse. For example, in 8dbf3eb, the hunk:
>
> -               const char *plain = diff_get_color(ecb->color_diff,
> -                                                  DIFF_PLAIN);
> +               const char *context = diff_get_color(ecb->color_diff,
> +                                                    DIFF_CONTEXT);
>
> currently gets the plain/context change in the first line highlighted,
> as well as the DIFF_PLAIN/DIFF_CONTEXT in the second line. With a 10%
> limit, the second line isn't highlighted. That's correct by the
> heuristic, but it's a bit harder to read, because the highlight draws
> your eye to the first change, and it is easy to miss the second.

Good example, this actually exposes a "bug" in the heuristic.  Each line
is around 15 characters long and the common affix ");" is 2 characters
long, which is about 15% of 15.  So the DIFF_PLAIN/DIFF_CONTEXT pair
ought to be highlighted.

The patch was unintentionally comparing the lengths of the common
affixes against the length of the entire line, whitespace and color
codes and all.  In effect this meant that the 10% threshold was much
higher.  We should compare against the length of the non-boring parts of
the whole line when determining the percentage in common.  Attached is a
revised patch.

>
> Still, I think this is probably a minority case, and it may be
> outweighed by the improvements. The "real" solution is to consider the
> hunk as a whole and do an LCS diff on it, which would show that yes,
> it's worth highlighting both of those spots, as they are a small
> percentage of the total hunk.
>
>> Here is a patch that changes the criterion as mentioned.  Testing this
>> change on the documentation patch 5c31acfb, only two pairs of lines are
>> highlighted instead of six.  On my original patch, the useless highlight
>> is gone.  The useless semicolon-related highlights on e.g. commit
>> 99a2cfb are gone.
>
> Nice, the ones like 99a2cfb are definitely wrong (I had though to fix
> them eventually by treating some punctuation as uninteresting, but I
> suspect the percentage heuristic covers that reasonably well in
> practice).
>
>> Of course, these patches are both hacks but they seem to be surprisingly
>> effective hacks especially when paired together.
>
> The whole script is a (surprisingly effective) hack. ;)
>
>>> So I dunno. IMHO this does more harm than good, and I would not want to
>>> use it myself. But it is somewhat a matter of taste; I am not opposed to
>>> making it a configurable option.
>>
>> That is something I can do :)
>
> Coupled with the 10%-threshold patch, I think it would be OK to include
> it unconditionally. So far we've just been diffing the two outputs and
> micro-analyzing them. The real test to me will be using it in practice
> and seeing if it's helpful or annoying.

-- >8 --

Subject: [PATCH] diff-highlight: don't highlight lines that have little in
  common

---
  contrib/diff-highlight/diff-highlight | 21 +++++++++++++++++----
  1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/contrib/diff-highlight/diff-highlight b/contrib/diff-highlight/diff-highlight
index 85d2eb0..0cc2b60 100755
--- a/contrib/diff-highlight/diff-highlight
+++ b/contrib/diff-highlight/diff-highlight
@@ -218,8 +218,21 @@ sub is_pair_interesting {
  	my $suffix_a = join('', @$a[($sa+1)..$#$a]);
  	my $suffix_b = join('', @$b[($sb+1)..$#$b]);

-	return $prefix_a !~ /^$COLOR*-$BORING*$/ ||
-	       $prefix_b !~ /^$COLOR*\+$BORING*$/ ||
-	       $suffix_a !~ /^$BORING*$/ ||
-	       $suffix_b !~ /^$BORING*$/;
+	$prefix_a =~ s/^$COLOR*-$BORING*//;
+	$prefix_b =~ s/^$COLOR*\+$BORING*//;
+	$suffix_a =~ s/$BORING*$//;
+	$suffix_b =~ s/$BORING*$//;
+
+	my $whole_a = join ('', @$a);
+	$whole_a =~ s/^$COLOR*-$BORING*//;
+	$whole_a =~ s/$BORING*$//;
+
+	my $whole_b = join ('', @$b);
+	$whole_b =~ s/^$COLOR*\+$BORING*//;
+	$whole_b =~ s/$BORING*$//;
+
+	# Only bother highlighting if at least 10% of each line is common among
+	# the lines.
+	return ((length($prefix_a)+length($suffix_a))*100 >= length($whole_a)*10) &&
+	       ((length($prefix_b)+length($suffix_b))*100 >= length($whole_b)*10);
  }
-- 
2.4.4.410.g43ed522.dirty

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH] Improve contrib/diff-highlight to highlight unevenly-sized hunks
  2015-06-18 18:08     ` Junio C Hamano
  2015-06-18 19:04       ` Jeff King
@ 2015-06-18 20:23       ` Patrick Palka
  1 sibling, 0 replies; 20+ messages in thread
From: Patrick Palka @ 2015-06-18 20:23 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, git

On Thu, Jun 18, 2015 at 2:08 PM, Junio C Hamano <gitster@pobox.com> wrote:
> Patrick Palka <patrick@parcs.ath.cx> writes:
>
>>> I have this nagging feeling that it is just as likely that two
>>> uneven hunks align at the top as they align at the bottom, so while
>>> this might not hurt it may not be the right approach for a better
>>> solution, in the sense that when somebody really wants to do a
>>> better solution, this change and the original code may need to be
>>> ripped out and redone from scratch.
>>
>> Hmm, maybe. I stuck with assuming hunks are top-aligned because it
>> required less code to implement :)
>
> Yeah, I understand that.
>
> If we will need to rip out only this change but keep the original in
> order to implement a future better solution, then we might be better
> off not having this change (if we anticipate such a better solution
> to come reasonably soon), because it would make it more work for the
> final improved solution.  But if we need to rip out the original as
> well as this change while we do so, then having this patch would not
> make it more work, either.
>
> So as I said, I do not think it would hurt to have this as an
> incremental improvement (albeit going in a possibly wrong
> direction).
>
> Of course, it is a separate question if this change makes the output
> worse, by comparing unmatched early parts of two hunks and making
> nonsense highlight by calling highlight_pair() more often.  As long
> as that is not an issue, I am not opposed to this change, which was
> what I meant to say by "this might not hurt".
>

That makes sense.  The extra useless highlighting indeed is currently
a problem but it may yet be worked around.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] Improve contrib/diff-highlight to highlight unevenly-sized hunks
  2015-06-18 16:28   ` Patrick Palka
  2015-06-18 18:08     ` Junio C Hamano
@ 2015-06-18 19:08     ` Jeff King
  2015-06-18 20:27       ` Patrick Palka
  1 sibling, 1 reply; 20+ messages in thread
From: Jeff King @ 2015-06-18 19:08 UTC (permalink / raw)
  To: Patrick Palka; +Cc: Junio C Hamano, git

On Thu, Jun 18, 2015 at 12:28:58PM -0400, Patrick Palka wrote:

> By the way, what would it take to get something like this script into
> git proper?  It is IMHO immensely useful even in its current form, yet
> because it's not baked into the application hardly anybody knows about
> it.

I think if we were going to make it more official, it would make sense
to do it inside the diff code itself (i.e., not as a separate script),
and it might be reasonable at that point to actually do a "real"
character-based diff rather than the hacky prefix/suffix thing (or
possibly even integrate with the color-words patterns to find
syntactically interesting breaks). There is some discussion in the
"Bugs" section of contrib/diff-highlight/README.

-Peff

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] Improve contrib/diff-highlight to highlight unevenly-sized hunks
  2015-06-18 19:08     ` Jeff King
@ 2015-06-18 20:27       ` Patrick Palka
  0 siblings, 0 replies; 20+ messages in thread
From: Patrick Palka @ 2015-06-18 20:27 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, git

On Thu, Jun 18, 2015 at 3:08 PM, Jeff King <peff@peff.net> wrote:
> On Thu, Jun 18, 2015 at 12:28:58PM -0400, Patrick Palka wrote:
>
>> By the way, what would it take to get something like this script into
>> git proper?  It is IMHO immensely useful even in its current form, yet
>> because it's not baked into the application hardly anybody knows about
>> it.
>
> I think if we were going to make it more official, it would make sense
> to do it inside the diff code itself (i.e., not as a separate script),
> and it might be reasonable at that point to actually do a "real"
> character-based diff rather than the hacky prefix/suffix thing (or
> possibly even integrate with the color-words patterns to find
> syntactically interesting breaks). There is some discussion in the
> "Bugs" section of contrib/diff-highlight/README.
>
> -Peff

Thanks for the pointers.  This is something I am interested in
implementing (though not any time soon).  I was actually in the
process of familiarizing myself with the diff code before I discovered
the existence of diff-highlight by accident.

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2015-06-19 17:20 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-06-15 17:20 [PATCH] Improve contrib/diff-highlight to highlight unevenly-sized hunks Patrick Palka
2015-06-18 15:50 ` Junio C Hamano
2015-06-18 16:28   ` Patrick Palka
2015-06-18 18:08     ` Junio C Hamano
2015-06-18 19:04       ` Jeff King
2015-06-18 20:14         ` Patrick Palka
2015-06-18 20:45           ` Jeff King
2015-06-18 21:23             ` Jeff King
2015-06-18 21:39               ` Junio C Hamano
2015-06-18 22:25               ` Patrick Palka
2015-06-19  3:54               ` Jeff King
2015-06-19  4:49                 ` Junio C Hamano
2015-06-19  5:32                   ` Jeff King
2015-06-19  7:34                     ` Jeff King
2015-06-19 11:38                       ` Jeff King
2015-06-19 17:20                     ` Junio C Hamano
2015-06-18 23:06             ` Patrick Palka
2015-06-18 20:23       ` Patrick Palka
2015-06-18 19:08     ` Jeff King
2015-06-18 20:27       ` Patrick Palka

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.