* [PATCH] Improve contrib/diff-highlight to highlight unevenly-sized hunks @ 2015-06-15 17:20 Patrick Palka 2015-06-18 15:50 ` Junio C Hamano 0 siblings, 1 reply; 20+ messages in thread From: Patrick Palka @ 2015-06-15 17:20 UTC (permalink / raw) To: git; +Cc: Patrick Palka Currently the diff-highlight script does not try to highlight hunks that have different numbers of removed/added lines. But we can be a little smarter than that, without introducing much magic and complexity. In the case of unevenly-sized hunks, we could still highlight the first few (lexicographical) add/remove pairs. It is not uncommon for hunks to have common "prefixes", and in such a case this change is very useful for spotting differences. Signed-off-by: Patrick Palka <patrick@parcs.ath.cx> --- contrib/diff-highlight/diff-highlight | 26 +++++++++++++++++--------- 1 file changed, 17 insertions(+), 9 deletions(-) diff --git a/contrib/diff-highlight/diff-highlight b/contrib/diff-highlight/diff-highlight index ffefc31..0dfbebd 100755 --- a/contrib/diff-highlight/diff-highlight +++ b/contrib/diff-highlight/diff-highlight @@ -88,22 +88,30 @@ sub show_hunk { return; } - # If we have mismatched numbers of lines on each side, we could try to - # be clever and match up similar lines. But for now we are simple and - # stupid, and only handle multi-line hunks that remove and add the same - # number of lines. - if (@$a != @$b) { - print @$a, @$b; - return; - } + # We match up the first MIN(a, b) lines on each side. + my $c = @$a < @$b ? @$a : @$b; + # Highlight each pair, and print each removed line of that pair. my @queue; - for (my $i = 0; $i < @$a; $i++) { + for (my $i = 0; $i < $c; $i++) { my ($rm, $add) = highlight_pair($a->[$i], $b->[$i]); print $rm; push @queue, $add; } + + # Print the remaining unmatched removed lines of the hunk. + for (my $i = $c; $i < @$a; $i++) { + print $a->[$i]; + } + + # Print the added lines of each highlighted pair. print @queue; + + # Print the remaining unmatched added lines of the hunk. + for (my $i = $c; $i < @$b; $i++) { + print $b->[$i]; + } + } sub highlight_pair { -- 2.4.3.413.ga5fe668 ^ permalink raw reply related [flat|nested] 20+ messages in thread
* Re: [PATCH] Improve contrib/diff-highlight to highlight unevenly-sized hunks 2015-06-15 17:20 [PATCH] Improve contrib/diff-highlight to highlight unevenly-sized hunks Patrick Palka @ 2015-06-18 15:50 ` Junio C Hamano 2015-06-18 16:28 ` Patrick Palka 0 siblings, 1 reply; 20+ messages in thread From: Junio C Hamano @ 2015-06-18 15:50 UTC (permalink / raw) To: Jeff King; +Cc: git, Patrick Palka Patrick Palka <patrick@parcs.ath.cx> writes: > Currently the diff-highlight script does not try to highlight hunks that > have different numbers of removed/added lines. But we can be a little > smarter than that, without introducing much magic and complexity. > > In the case of unevenly-sized hunks, we could still highlight the first > few (lexicographical) add/remove pairs. It is not uncommon for hunks to > have common "prefixes", and in such a case this change is very useful > for spotting differences. > > Signed-off-by: Patrick Palka <patrick@parcs.ath.cx> > --- Patrick, "git shortlog --no-merges contrib/diff-highlight/" is your friend to see who may be able to give you a good feedback. Jeff, what do you think? I have this nagging feeling that it is just as likely that two uneven hunks align at the top as they align at the bottom, so while this might not hurt it may not be the right approach for a better solution, in the sense that when somebody really wants to do a better solution, this change and the original code may need to be ripped out and redone from scratch. > contrib/diff-highlight/diff-highlight | 26 +++++++++++++++++--------- > 1 file changed, 17 insertions(+), 9 deletions(-) > > diff --git a/contrib/diff-highlight/diff-highlight b/contrib/diff-highlight/diff-highlight > index ffefc31..0dfbebd 100755 > --- a/contrib/diff-highlight/diff-highlight > +++ b/contrib/diff-highlight/diff-highlight > @@ -88,22 +88,30 @@ sub show_hunk { > return; > } > > - # If we have mismatched numbers of lines on each side, we could try to > - # be clever and match up similar lines. But for now we are simple and > - # stupid, and only handle multi-line hunks that remove and add the same > - # number of lines. > - if (@$a != @$b) { > - print @$a, @$b; > - return; > - } > + # We match up the first MIN(a, b) lines on each side. > + my $c = @$a < @$b ? @$a : @$b; > > + # Highlight each pair, and print each removed line of that pair. > my @queue; > - for (my $i = 0; $i < @$a; $i++) { > + for (my $i = 0; $i < $c; $i++) { > my ($rm, $add) = highlight_pair($a->[$i], $b->[$i]); > print $rm; > push @queue, $add; > } > + > + # Print the remaining unmatched removed lines of the hunk. > + for (my $i = $c; $i < @$a; $i++) { > + print $a->[$i]; > + } > + > + # Print the added lines of each highlighted pair. > print @queue; > + > + # Print the remaining unmatched added lines of the hunk. > + for (my $i = $c; $i < @$b; $i++) { > + print $b->[$i]; > + } > + > } > > sub highlight_pair { ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] Improve contrib/diff-highlight to highlight unevenly-sized hunks 2015-06-18 15:50 ` Junio C Hamano @ 2015-06-18 16:28 ` Patrick Palka 2015-06-18 18:08 ` Junio C Hamano 2015-06-18 19:08 ` Jeff King 0 siblings, 2 replies; 20+ messages in thread From: Patrick Palka @ 2015-06-18 16:28 UTC (permalink / raw) To: Junio C Hamano; +Cc: Jeff King, git On Thu, Jun 18, 2015 at 11:50 AM, Junio C Hamano <gitster@pobox.com> wrote: > Patrick Palka <patrick@parcs.ath.cx> writes: > >> Currently the diff-highlight script does not try to highlight hunks that >> have different numbers of removed/added lines. But we can be a little >> smarter than that, without introducing much magic and complexity. >> >> In the case of unevenly-sized hunks, we could still highlight the first >> few (lexicographical) add/remove pairs. It is not uncommon for hunks to >> have common "prefixes", and in such a case this change is very useful >> for spotting differences. >> >> Signed-off-by: Patrick Palka <patrick@parcs.ath.cx> >> --- > > Patrick, "git shortlog --no-merges contrib/diff-highlight/" is your > friend to see who may be able to give you a good feedback. Sorry about that. I admit the sending of this patch was rushed for no good reason. > > Jeff, what do you think? > > I have this nagging feeling that it is just as likely that two > uneven hunks align at the top as they align at the bottom, so while > this might not hurt it may not be the right approach for a better > solution, in the sense that when somebody really wants to do a > better solution, this change and the original code may need to be > ripped out and redone from scratch. Hmm, maybe. I stuck with assuming hunks are top-aligned because it required less code to implement :) The benefits of a simple dumb solution like assuming hunks align at the top or bottom is that it remains very easy to visually match up each highlighted deleted slice with its corresponding highlighted added slice. If we start matching up similar lines or something like that then it seems we would have to mostly forsake this benefit. A stupid algorithm in this case is nice because its output is predictable. A direct improvement upon this patch that would not require redoing the whole script from scratch would be to first to calculate the highlighting assuming the hunk aligns at the top, then to calculate the highlighting assuming the hunk aligns at the bottom, and to pick out of the two the highlighting with the least "noise". Though we would still be out of luck if the hunk is more complicated than being top-aligned or bottom-aligned. By the way, what would it take to get something like this script into git proper? It is IMHO immensely useful even in its current form, yet because it's not baked into the application hardly anybody knows about it. > >> contrib/diff-highlight/diff-highlight | 26 +++++++++++++++++--------- >> 1 file changed, 17 insertions(+), 9 deletions(-) >> >> diff --git a/contrib/diff-highlight/diff-highlight b/contrib/diff-highlight/diff-highlight >> index ffefc31..0dfbebd 100755 >> --- a/contrib/diff-highlight/diff-highlight >> +++ b/contrib/diff-highlight/diff-highlight >> @@ -88,22 +88,30 @@ sub show_hunk { >> return; >> } >> >> - # If we have mismatched numbers of lines on each side, we could try to >> - # be clever and match up similar lines. But for now we are simple and >> - # stupid, and only handle multi-line hunks that remove and add the same >> - # number of lines. >> - if (@$a != @$b) { >> - print @$a, @$b; >> - return; >> - } >> + # We match up the first MIN(a, b) lines on each side. >> + my $c = @$a < @$b ? @$a : @$b; >> >> + # Highlight each pair, and print each removed line of that pair. >> my @queue; >> - for (my $i = 0; $i < @$a; $i++) { >> + for (my $i = 0; $i < $c; $i++) { >> my ($rm, $add) = highlight_pair($a->[$i], $b->[$i]); >> print $rm; >> push @queue, $add; >> } >> + >> + # Print the remaining unmatched removed lines of the hunk. >> + for (my $i = $c; $i < @$a; $i++) { >> + print $a->[$i]; >> + } >> + >> + # Print the added lines of each highlighted pair. >> print @queue; >> + >> + # Print the remaining unmatched added lines of the hunk. >> + for (my $i = $c; $i < @$b; $i++) { >> + print $b->[$i]; >> + } >> + >> } >> >> sub highlight_pair { ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] Improve contrib/diff-highlight to highlight unevenly-sized hunks 2015-06-18 16:28 ` Patrick Palka @ 2015-06-18 18:08 ` Junio C Hamano 2015-06-18 19:04 ` Jeff King 2015-06-18 20:23 ` Patrick Palka 2015-06-18 19:08 ` Jeff King 1 sibling, 2 replies; 20+ messages in thread From: Junio C Hamano @ 2015-06-18 18:08 UTC (permalink / raw) To: Patrick Palka; +Cc: Jeff King, git Patrick Palka <patrick@parcs.ath.cx> writes: >> I have this nagging feeling that it is just as likely that two >> uneven hunks align at the top as they align at the bottom, so while >> this might not hurt it may not be the right approach for a better >> solution, in the sense that when somebody really wants to do a >> better solution, this change and the original code may need to be >> ripped out and redone from scratch. > > Hmm, maybe. I stuck with assuming hunks are top-aligned because it > required less code to implement :) Yeah, I understand that. If we will need to rip out only this change but keep the original in order to implement a future better solution, then we might be better off not having this change (if we anticipate such a better solution to come reasonably soon), because it would make it more work for the final improved solution. But if we need to rip out the original as well as this change while we do so, then having this patch would not make it more work, either. So as I said, I do not think it would hurt to have this as an incremental improvement (albeit going in a possibly wrong direction). Of course, it is a separate question if this change makes the output worse, by comparing unmatched early parts of two hunks and making nonsense highlight by calling highlight_pair() more often. As long as that is not an issue, I am not opposed to this change, which was what I meant to say by "this might not hurt". ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] Improve contrib/diff-highlight to highlight unevenly-sized hunks 2015-06-18 18:08 ` Junio C Hamano @ 2015-06-18 19:04 ` Jeff King 2015-06-18 20:14 ` Patrick Palka 2015-06-18 20:23 ` Patrick Palka 1 sibling, 1 reply; 20+ messages in thread From: Jeff King @ 2015-06-18 19:04 UTC (permalink / raw) To: Junio C Hamano; +Cc: Patrick Palka, git On Thu, Jun 18, 2015 at 11:08:16AM -0700, Junio C Hamano wrote: > So as I said, I do not think it would hurt to have this as an > incremental improvement (albeit going in a possibly wrong > direction). > > Of course, it is a separate question if this change makes the output > worse, by comparing unmatched early parts of two hunks and making > nonsense highlight by calling highlight_pair() more often. As long > as that is not an issue, I am not opposed to this change, which was > what I meant to say by "this might not hurt". Yes, that is my big concern, and why I punted on mismatched-size hunks in the first place. Now that we have a patch, it is easy enough to "git log -p | diff-highlight" with the old and new versions to compare the results. It certainly does improve some cases. E.g.: -foo +foo && +bar in a test script becomes more clear. But some of the output is not so great. For instance, the very commit under discussion has a confusing and useless highlight. Or take a documentation patch like 5c31acfb, where I find the highlights actively distracting. We are saved a little by the "if the whole line is different, do not highlight at all" behavior of 097128d1bc. So I dunno. IMHO this does more harm than good, and I would not want to use it myself. But it is somewhat a matter of taste; I am not opposed to making it a configurable option. -Peff ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] Improve contrib/diff-highlight to highlight unevenly-sized hunks 2015-06-18 19:04 ` Jeff King @ 2015-06-18 20:14 ` Patrick Palka 2015-06-18 20:45 ` Jeff King 0 siblings, 1 reply; 20+ messages in thread From: Patrick Palka @ 2015-06-18 20:14 UTC (permalink / raw) To: Jeff King; +Cc: Junio C Hamano, Patrick Palka, git On Thu, 18 Jun 2015, Jeff King wrote: > On Thu, Jun 18, 2015 at 11:08:16AM -0700, Junio C Hamano wrote: > >> So as I said, I do not think it would hurt to have this as an >> incremental improvement (albeit going in a possibly wrong >> direction). >> >> Of course, it is a separate question if this change makes the output >> worse, by comparing unmatched early parts of two hunks and making >> nonsense highlight by calling highlight_pair() more often. As long >> as that is not an issue, I am not opposed to this change, which was >> what I meant to say by "this might not hurt". > > Yes, that is my big concern, and why I punted on mismatched-size hunks > in the first place. Now that we have a patch, it is easy enough to "git > log -p | diff-highlight" with the old and new versions to compare the > results. > > It certainly does improve some cases. E.g.: > > -foo > +foo && > +bar > > in a test script becomes more clear. But some of the output is not so > great. For instance, the very commit under discussion has a > confusing and useless highlight. Or take a documentation patch like > 5c31acfb, where I find the highlights actively distracting. We are saved > a little by the "if the whole line is different, do not highlight at > all" behavior of 097128d1bc. To fix the useless highlights for both evenly and unevenly sized hunks (like when all but a semicolon on a line changes), one can loosen the criterion for not highlighting from "do not highlight if 0% of the before and after lines are common between them" to, say, "do not highlight if less than 10% of the before and after lines are common between them". Then most of these useless highlights are gone for both evenly and unevenly sized hunks. Here is a patch that changes the criterion as mentioned. Testing this change on the documentation patch 5c31acfb, only two pairs of lines are highlighted instead of six. On my original patch, the useless highlight is gone. The useless semicolon-related highlights on e.g. commit 99a2cfb are gone. Ten percent is a modest threshold, and perhaps it should be increased when highlighting unevenly sized hunks and decreased when highlighting evenly sized hunks. Of course, these patches are both hacks but they seem to be surprisingly effective hacks especially when paired together. > > So I dunno. IMHO this does more harm than good, and I would not want to > use it myself. But it is somewhat a matter of taste; I am not opposed to > making it a configurable option. That is something I can do :) > > -Peff > -- >8 -- Subject: [PATCH] diff-highlight: don't highlight lines that have little in common --- contrib/diff-highlight/diff-highlight | 13 +++++++++---- 1 file changed, 9 insertions(+), 4 deletions(-) diff --git a/contrib/diff-highlight/diff-highlight b/contrib/diff-highlight/diff-highlight index 85d2eb0..e4829ec 100755 --- a/contrib/diff-highlight/diff-highlight +++ b/contrib/diff-highlight/diff-highlight @@ -218,8 +218,13 @@ sub is_pair_interesting { my $suffix_a = join('', @$a[($sa+1)..$#$a]); my $suffix_b = join('', @$b[($sb+1)..$#$b]); - return $prefix_a !~ /^$COLOR*-$BORING*$/ || - $prefix_b !~ /^$COLOR*\+$BORING*$/ || - $suffix_a !~ /^$BORING*$/ || - $suffix_b !~ /^$BORING*$/; + $prefix_a =~ s/^$COLOR*-$BORING*//; + $prefix_b =~ s/^$COLOR*\+$BORING*//; + $suffix_a =~ s/$BORING*$//; + $suffix_b =~ s/$BORING*$//; + + # Only bother highlighting if at least 10% of each line is common among + # the lines. + return ((length($prefix_a)+length($suffix_a))*100 >= @$a*10) && + ((length($prefix_b)+length($suffix_b))*100 >= @$b*10); } -- 2.4.4.410.g43ed522.dirty ^ permalink raw reply related [flat|nested] 20+ messages in thread
* Re: [PATCH] Improve contrib/diff-highlight to highlight unevenly-sized hunks 2015-06-18 20:14 ` Patrick Palka @ 2015-06-18 20:45 ` Jeff King 2015-06-18 21:23 ` Jeff King 2015-06-18 23:06 ` Patrick Palka 0 siblings, 2 replies; 20+ messages in thread From: Jeff King @ 2015-06-18 20:45 UTC (permalink / raw) To: Patrick Palka; +Cc: Junio C Hamano, git On Thu, Jun 18, 2015 at 04:14:19PM -0400, Patrick Palka wrote: > >in a test script becomes more clear. But some of the output is not so > >great. For instance, the very commit under discussion has a > >confusing and useless highlight. Or take a documentation patch like > >5c31acfb, where I find the highlights actively distracting. We are saved > >a little by the "if the whole line is different, do not highlight at > >all" behavior of 097128d1bc. > > To fix the useless highlights for both evenly and unevenly sized hunks > (like when all but a semicolon on a line changes), one can loosen the > criterion for not highlighting from "do not highlight if 0% of the > before and after lines are common between them" to, say, "do not > highlight if less than 10% of the before and after lines are common > between them". Then most of these useless highlights are gone for both > evenly and unevenly sized hunks. Yeah, this is an idea I had considered but never actually experimented with. It does make some things better, but it also makes some a little worse. For example, in 8dbf3eb, the hunk: - const char *plain = diff_get_color(ecb->color_diff, - DIFF_PLAIN); + const char *context = diff_get_color(ecb->color_diff, + DIFF_CONTEXT); currently gets the plain/context change in the first line highlighted, as well as the DIFF_PLAIN/DIFF_CONTEXT in the second line. With a 10% limit, the second line isn't highlighted. That's correct by the heuristic, but it's a bit harder to read, because the highlight draws your eye to the first change, and it is easy to miss the second. Still, I think this is probably a minority case, and it may be outweighed by the improvements. The "real" solution is to consider the hunk as a whole and do an LCS diff on it, which would show that yes, it's worth highlighting both of those spots, as they are a small percentage of the total hunk. > Here is a patch that changes the criterion as mentioned. Testing this > change on the documentation patch 5c31acfb, only two pairs of lines are > highlighted instead of six. On my original patch, the useless highlight > is gone. The useless semicolon-related highlights on e.g. commit > 99a2cfb are gone. Nice, the ones like 99a2cfb are definitely wrong (I had though to fix them eventually by treating some punctuation as uninteresting, but I suspect the percentage heuristic covers that reasonably well in practice). > Of course, these patches are both hacks but they seem to be surprisingly > effective hacks especially when paired together. The whole script is a (surprisingly effective) hack. ;) > >So I dunno. IMHO this does more harm than good, and I would not want to > >use it myself. But it is somewhat a matter of taste; I am not opposed to > >making it a configurable option. > > That is something I can do :) Coupled with the 10%-threshold patch, I think it would be OK to include it unconditionally. So far we've just been diffing the two outputs and micro-analyzing them. The real test to me will be using it in practice and seeing if it's helpful or annoying. -Peff ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] Improve contrib/diff-highlight to highlight unevenly-sized hunks 2015-06-18 20:45 ` Jeff King @ 2015-06-18 21:23 ` Jeff King 2015-06-18 21:39 ` Junio C Hamano ` (2 more replies) 2015-06-18 23:06 ` Patrick Palka 1 sibling, 3 replies; 20+ messages in thread From: Jeff King @ 2015-06-18 21:23 UTC (permalink / raw) To: Patrick Palka; +Cc: Junio C Hamano, git On Thu, Jun 18, 2015 at 04:45:05PM -0400, Jeff King wrote: > Still, I think this is probably a minority case, and it may be > outweighed by the improvements. The "real" solution is to consider the > hunk as a whole and do an LCS diff on it, which would show that yes, > it's worth highlighting both of those spots, as they are a small > percentage of the total hunk. I've been meaning to play with this for years, so I took the opportunity to spend a little time on it. :) Below is a (slightly hacky) patch I came up with. It seems to work, and produces really great output in some cases. For instance, in 99a2cfb, it produces (I put highlighted bits in angle brackets): - <hash>cpy(peeled, <sha1>); + <oid>cpy(<&>peeled, <oid>); It also produces nonsense like: - <un>s<ign>ed <char >peeled<[20]>; + s<truct obj>e<ct_i>d peeled; but I think that is simply because my splitting function is terrible (it splits each byte, whereas we'd probably want to use whitespace and punctuation, or something content-specific). --- diff --git a/contrib/diff-highlight/diff-highlight b/contrib/diff-highlight/diff-highlight index ffefc31..7165518 100755 --- a/contrib/diff-highlight/diff-highlight +++ b/contrib/diff-highlight/diff-highlight @@ -3,6 +3,7 @@ use 5.008; use warnings FATAL => 'all'; use strict; +use Algorithm::Diff; # Highlight by reversing foreground and background. You could do # other things like bold or underline if you prefer. @@ -88,131 +89,54 @@ sub show_hunk { return; } - # If we have mismatched numbers of lines on each side, we could try to - # be clever and match up similar lines. But for now we are simple and - # stupid, and only handle multi-line hunks that remove and add the same - # number of lines. - if (@$a != @$b) { - print @$a, @$b; - return; - } - - my @queue; - for (my $i = 0; $i < @$a; $i++) { - my ($rm, $add) = highlight_pair($a->[$i], $b->[$i]); - print $rm; - push @queue, $add; - } - print @queue; -} - -sub highlight_pair { - my @a = split_line(shift); - my @b = split_line(shift); + my ($prefix_a, $suffix_a, @hunk_a) = split_hunk(@$a); + my ($prefix_b, $suffix_b, @hunk_b) = split_hunk(@$b); - # Find common prefix, taking care to skip any ansi - # color codes. - my $seen_plusminus; - my ($pa, $pb) = (0, 0); - while ($pa < @a && $pb < @b) { - if ($a[$pa] =~ /$COLOR/) { - $pa++; - } - elsif ($b[$pb] =~ /$COLOR/) { - $pb++; - } - elsif ($a[$pa] eq $b[$pb]) { - $pa++; - $pb++; - } - elsif (!$seen_plusminus && $a[$pa] eq '-' && $b[$pb] eq '+') { - $seen_plusminus = 1; - $pa++; - $pb++; - } - else { - last; - } - } + my $diff = Algorithm::Diff->new(\@hunk_a, \@hunk_b); + my (@out_a, @out_b); + while ($diff->Next()) { + my $bits = $diff->Diff(); - # Find common suffix, ignoring colors. - my ($sa, $sb) = ($#a, $#b); - while ($sa >= $pa && $sb >= $pb) { - if ($a[$sa] =~ /$COLOR/) { - $sa--; - } - elsif ($b[$sb] =~ /$COLOR/) { - $sb--; - } - elsif ($a[$sa] eq $b[$sb]) { - $sa--; - $sb--; - } - else { - last; - } - } + push @out_a, $OLD_HIGHLIGHT[1] if $bits & 1; + push @out_a, $diff->Items(1); + push @out_a, $OLD_HIGHLIGHT[2] if $bits & 1; - if (is_pair_interesting(\@a, $pa, $sa, \@b, $pb, $sb)) { - return highlight_line(\@a, $pa, $sa, \@OLD_HIGHLIGHT), - highlight_line(\@b, $pb, $sb, \@NEW_HIGHLIGHT); - } - else { - return join('', @a), - join('', @b); + push @out_b, $NEW_HIGHLIGHT[1] if $bits & 2; + push @out_b, $diff->Items(2); + push @out_b, $NEW_HIGHLIGHT[2] if $bits & 2; } -} -sub split_line { - local $_ = shift; - return utf8::decode($_) ? - map { utf8::encode($_); $_ } - map { /$COLOR/ ? $_ : (split //) } - split /($COLOR+)/ : - map { /$COLOR/ ? $_ : (split //) } - split /($COLOR+)/; + output_split_hunk($prefix_a, $suffix_a, @out_a); + output_split_hunk($prefix_b, $suffix_b, @out_b); } -sub highlight_line { - my ($line, $prefix, $suffix, $theme) = @_; - - my $start = join('', @{$line}[0..($prefix-1)]); - my $mid = join('', @{$line}[$prefix..$suffix]); - my $end = join('', @{$line}[($suffix+1)..$#$line]); - - # If we have a "normal" color specified, then take over the whole line. - # Otherwise, we try to just manipulate the highlighted bits. - if (defined $theme->[0]) { - s/$COLOR//g for ($start, $mid, $end); - chomp $end; - return join('', - $theme->[0], $start, $RESET, - $theme->[1], $mid, $RESET, - $theme->[0], $end, $RESET, - "\n" - ); - } else { - return join('', - $start, - $theme->[1], $mid, $theme->[2], - $end - ); +# Return the individual diff-able items of the hunk, but with any +# diff or color prefix/suffix for each line split out (we assume that the +# prefix/suffix for each line will be the same). +sub split_hunk { + my ($prefix, $suffix, @r); + foreach my $line (@_) { + $line =~ /^($COLOR*[+-]$COLOR*)(.*)($COLOR*)/ + or die "eh, this is supposed to match everything!"; + + # overwrite the old values; we assume they're all the same + # anyway + $prefix = $1; + $suffix = $3; + + # do a straight character split. This almost certainly isn't + # ideal, but it's a good starting point. We should at the very + # least be utf8-aware, and probably use color-words regexes. + push @r, split(//, $2), "\n"; } + return ($prefix, $suffix, @r); } -# Pairs are interesting to highlight only if we are going to end up -# highlighting a subset (i.e., not the whole line). Otherwise, the highlighting -# is just useless noise. We can detect this by finding either a matching prefix -# or suffix (disregarding boring bits like whitespace and colorization). -sub is_pair_interesting { - my ($a, $pa, $sa, $b, $pb, $sb) = @_; - my $prefix_a = join('', @$a[0..($pa-1)]); - my $prefix_b = join('', @$b[0..($pb-1)]); - my $suffix_a = join('', @$a[($sa+1)..$#$a]); - my $suffix_b = join('', @$b[($sb+1)..$#$b]); - - return $prefix_a !~ /^$COLOR*-$BORING*$/ || - $prefix_b !~ /^$COLOR*\+$BORING*$/ || - $suffix_a !~ /^$BORING*$/ || - $suffix_b !~ /^$BORING*$/; +sub output_split_hunk { + my $prefix = shift; + my $suffix = shift; + my $str = join('', @_); + $str =~ s/^/$prefix/mg; + $str =~ s/$/$suffix/mg; + print $str; } ^ permalink raw reply related [flat|nested] 20+ messages in thread
* Re: [PATCH] Improve contrib/diff-highlight to highlight unevenly-sized hunks 2015-06-18 21:23 ` Jeff King @ 2015-06-18 21:39 ` Junio C Hamano 2015-06-18 22:25 ` Patrick Palka 2015-06-19 3:54 ` Jeff King 2 siblings, 0 replies; 20+ messages in thread From: Junio C Hamano @ 2015-06-18 21:39 UTC (permalink / raw) To: Jeff King; +Cc: Patrick Palka, git Jeff King <peff@peff.net> writes: > On Thu, Jun 18, 2015 at 04:45:05PM -0400, Jeff King wrote: > >> Still, I think this is probably a minority case, and it may be >> outweighed by the improvements. The "real" solution is to consider the >> hunk as a whole and do an LCS diff on it, which would show that yes, >> it's worth highlighting both of those spots, as they are a small >> percentage of the total hunk. > > I've been meaning to play with this for years, so I took the opportunity > to spend a little time on it. :) I envy you ;-) It certainly looks like a fun side project. > Below is a (slightly hacky) patch I came up with. It seems to work, and > produces really great output in some cases. For instance, in 99a2cfb, it > produces (I put highlighted bits in angle brackets): > > - <hash>cpy(peeled, <sha1>); > + <oid>cpy(<&>peeled, <oid>); > > It also produces nonsense like: > > - <un>s<ign>ed <char >peeled<[20]>; > + s<truct obj>e<ct_i>d peeled; ROTFL ;-) And the change does not look bad at all. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] Improve contrib/diff-highlight to highlight unevenly-sized hunks 2015-06-18 21:23 ` Jeff King 2015-06-18 21:39 ` Junio C Hamano @ 2015-06-18 22:25 ` Patrick Palka 2015-06-19 3:54 ` Jeff King 2 siblings, 0 replies; 20+ messages in thread From: Patrick Palka @ 2015-06-18 22:25 UTC (permalink / raw) To: Jeff King; +Cc: Junio C Hamano, git On Thu, Jun 18, 2015 at 5:23 PM, Jeff King <peff@peff.net> wrote: > On Thu, Jun 18, 2015 at 04:45:05PM -0400, Jeff King wrote: > >> Still, I think this is probably a minority case, and it may be >> outweighed by the improvements. The "real" solution is to consider the >> hunk as a whole and do an LCS diff on it, which would show that yes, >> it's worth highlighting both of those spots, as they are a small >> percentage of the total hunk. > > I've been meaning to play with this for years, so I took the opportunity > to spend a little time on it. :) Cool! > > Below is a (slightly hacky) patch I came up with. It seems to work, and > produces really great output in some cases. For instance, in 99a2cfb, it > produces (I put highlighted bits in angle brackets): > > - <hash>cpy(peeled, <sha1>); > + <oid>cpy(<&>peeled, <oid>); > > It also produces nonsense like: > > - <un>s<ign>ed <char >peeled<[20]>; > + s<truct obj>e<ct_i>d peeled; That's not even so bad. The diff of the change itself is... interesting. > > but I think that is simply because my splitting function is terrible (it > splits each byte, whereas we'd probably want to use whitespace and > punctuation, or something content-specific). I hope you can polish this. It definitely has potential. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] Improve contrib/diff-highlight to highlight unevenly-sized hunks 2015-06-18 21:23 ` Jeff King 2015-06-18 21:39 ` Junio C Hamano 2015-06-18 22:25 ` Patrick Palka @ 2015-06-19 3:54 ` Jeff King 2015-06-19 4:49 ` Junio C Hamano 2 siblings, 1 reply; 20+ messages in thread From: Jeff King @ 2015-06-19 3:54 UTC (permalink / raw) To: Patrick Palka; +Cc: Junio C Hamano, git On Thu, Jun 18, 2015 at 05:23:56PM -0400, Jeff King wrote: > +# Return the individual diff-able items of the hunk, but with any > +# diff or color prefix/suffix for each line split out (we assume that the > +# prefix/suffix for each line will be the same). > +sub split_hunk { > + my ($prefix, $suffix, @r); > + foreach my $line (@_) { > + $line =~ /^($COLOR*[+-]$COLOR*)(.*)($COLOR*)/ > + or die "eh, this is supposed to match everything!"; This isn't quite right. We'll never match the suffix, because it gets sucked up by the greedy (.*). But making that non-greedy matches nothing, unless we also add "$" at the end. _But_, there is still something else weird going on. I replaced this split: > + push @r, split(//, $2), "\n"; with: push @r, split(/([[:space:][:punct:]])/, $2), "\n"; which behaves much better. With that, 99a2cfb shows: - <if> (!peel_ref(path, peeled)) { - <is>_annotated = !!<hashcmp>(<sha1>, peeled); + <if> (!peel_ref(path, peeled<.hash>)) { + <is>_annotated = !!<oidcmp>(<oid>, <&>peeled); The latter half of both lines looks perfect. But what's that weird highlighting of the initial "if" and "is" on those lines? It turns out that the colored output we produce is quite odd: $ git show --color 99a2cfb | grep peel_ref | cat -A ^[[31m-^Iif (!peel_ref(path, peeled)) {^[[m$ ^[[32m+^[[m^I^[[32mif (!peel_ref(path, peeled.hash)) {^[[m$ For the pre-image, we print the color, the "-", and then the line data. Makes sense. For the post-image, we print the color, "+", a reset, then the initial whitespace, then the color again! So of course the diff algorithm says "hey, there's this weird color in here". The original implementation of diff-highlight didn't care, because it skipped leading whitespace and colors as "boring". But this one cannot do that. It pulls the strict prefix out of each line (and it must, because it must get the same prefix for each line of a hunk). I think git is actually wrong here; it's mingling the ANSI colors and the actual content. Nobody ever noticed because it looks OK to a human, and who would be foolish enough to try machine-parsing colorized diff output? The fix is: diff --git a/diff.c b/diff.c index 87b16d5..a80b5b4 100644 --- a/diff.c +++ b/diff.c @@ -501,9 +501,9 @@ static void emit_line_checked(const char *reset, emit_line_0(ecbdata->opt, ws, reset, sign, line, len); else { /* Emit just the prefix, then the rest. */ - emit_line_0(ecbdata->opt, set, reset, sign, "", 0); + emit_line_0(ecbdata->opt, set, "", sign, "", 0); ws_check_emit(line, len, ecbdata->ws_rule, - ecbdata->opt->file, set, reset, ws); + ecbdata->opt->file, "", reset, ws); } } But I'm a little worried it may interfere with the way the whitespace-checker emits colors (e.g., it may emit its own colors for the leading spaces, and we would need to re-assert our color before showing the rest of the line). So I think you could also argue that because of whitespace-highlighting, colorized diffs are fundamentally going to have colors intermingled with the content and should not be parsed this way. All the more reason to try to move this inside diff.c, I guess. -Peff ^ permalink raw reply related [flat|nested] 20+ messages in thread
* Re: [PATCH] Improve contrib/diff-highlight to highlight unevenly-sized hunks 2015-06-19 3:54 ` Jeff King @ 2015-06-19 4:49 ` Junio C Hamano 2015-06-19 5:32 ` Jeff King 0 siblings, 1 reply; 20+ messages in thread From: Junio C Hamano @ 2015-06-19 4:49 UTC (permalink / raw) To: Jeff King; +Cc: Patrick Palka, git Jeff King <peff@peff.net> writes: > ... I think you could also argue that > because of whitespace-highlighting, colorized diffs are fundamentally > going to have colors intermingled with the content and should not be > parsed this way. Painting of whitespace breakages is asymmetric [*1*]. If you change something on a badly indented line without fixing the indentation, e.g. -<SP><TAB>hello-world +<SP><TAB>hello-people only the space-before-tab on the latter line is painted. For the purpose of your diff highlighting, however, you would want to treat the leading "<SP><TAB>hello-" on preimage and postimage lines as unchanged. Which means that you shouldn't be reading a colored output without ignoring the color-control sequences. So I think you arrived at the correct conclusion. > All the more reason to try to move this inside diff.c, I guess. That too, probably. If we were to do this, I think it may make sense to separate the logic to compute which span of the string need to be painted in what color and the routine to actually emit the colored output. That is, instead of letting ws-check-emit to decide which part should be in what color _and_ emitting the result, we should have a helper that reads <line, len>, and give us an array of spans to flag as whitespace violation. Then an optional diff-highlight code would scan the same <line, len> (or a collection of <line, len>) without getting confused by the whitespace errors and annotate the spans to be highlighted. After all that happens, the output routine would coalesce the spans and produce colored output. Or something like that ;-) [Footnote] *1* We recently added a knob to allow us paint them on preimage and common lines, too, but that is not the default. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] Improve contrib/diff-highlight to highlight unevenly-sized hunks 2015-06-19 4:49 ` Junio C Hamano @ 2015-06-19 5:32 ` Jeff King 2015-06-19 7:34 ` Jeff King 2015-06-19 17:20 ` Junio C Hamano 0 siblings, 2 replies; 20+ messages in thread From: Jeff King @ 2015-06-19 5:32 UTC (permalink / raw) To: Junio C Hamano; +Cc: Patrick Palka, git On Thu, Jun 18, 2015 at 09:49:19PM -0700, Junio C Hamano wrote: > Jeff King <peff@peff.net> writes: > > > ... I think you could also argue that > > because of whitespace-highlighting, colorized diffs are fundamentally > > going to have colors intermingled with the content and should not be > > parsed this way. > > Painting of whitespace breakages is asymmetric [*1*]. If you change > something on a badly indented line without fixing the indentation, > e.g. > > -<SP><TAB>hello-world > +<SP><TAB>hello-people > > only the space-before-tab on the latter line is painted. > > For the purpose of your diff highlighting, however, you would want > to treat the leading "<SP><TAB>hello-" on preimage and postimage > lines as unchanged. I do strip it off, so it is OK for it to be different in both the pre-image and post-image. But what I can't tolerate is the intermingling with actual data: +\t\t\x1b[32m;foo +\t\x1b[32m;bar Those are both post-image lines. I can strip off the "+" from each side to compare their inner parts to the pre-image. But the intermingled color gets in my way. I can simply strip all colors from the whole line, but then information is lost; how do I know where to put them back again? It is not just "add back the color at the beginning" (which is what I do with the prefix). I think the answer is that I must strip them out, retaining the colors found in each line along with their offset into the line, and then as I write out the lines, re-add them at the appropriate spots (taking care to use the original offsets, not the ones with the highlighting added in). > > All the more reason to try to move this inside diff.c, I guess. > > That too, probably. Hmm, I thought that would solve all my problems by operating on the pre-color version without much more work. But... > If we were to do this, I think it may make sense to separate the > logic to compute which span of the string need to be painted in what > color and the routine to actually emit the colored output. That is, > instead of letting ws-check-emit to decide which part should be in > what color _and_ emitting the result, we should have a helper that > reads <line, len>, and give us an array of spans to flag as > whitespace violation. Then an optional diff-highlight code would > scan the same <line, len> (or a collection of <line, len>) without > getting confused by the whitespace errors and annotate the spans to > be highlighted. After all that happens, the output routine would > coalesce the spans and produce colored output. > > Or something like that ;-) I think this "array of spans" is the only way to go. Otherwise whichever markup scheme processes the hunk first ruins the data for the next processor. So it is a lot more work to make the two work together. The --word-diff code would have the same issue, except that I imagine it just skips whitespace-highlighting altogether. The least-work thing may actually be teaching the separate diff-highlight script to strip out the colorizing and re-add it by offset. -Peff ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] Improve contrib/diff-highlight to highlight unevenly-sized hunks 2015-06-19 5:32 ` Jeff King @ 2015-06-19 7:34 ` Jeff King 2015-06-19 11:38 ` Jeff King 2015-06-19 17:20 ` Junio C Hamano 1 sibling, 1 reply; 20+ messages in thread From: Jeff King @ 2015-06-19 7:34 UTC (permalink / raw) To: Junio C Hamano; +Cc: Patrick Palka, git On Fri, Jun 19, 2015 at 01:32:23AM -0400, Jeff King wrote: > The least-work thing may actually be teaching the separate > diff-highlight script to strip out the colorizing and re-add it by > offset. OK, here is that patch. It seems to work OK, and should preserve existing colors produced by git anywhere within the input. It should also serve as a stepping stone to using the same technique inside git itself. I am not planning to do that anytime soon, though. I took a look at the word-diff code earlier today, and my brain partially melted. I'm sure it's possible, but I've already spent enough time and brain cells on this for now. I think this patch does need some more polishing, or more heuristics, though. Some of the output is really fantastic. Here's a simple one from 69f9a6e54: -* link:v2.4.<2>/git.html[documentation for release 2.4.<2>] +* link:v2.4.<3>/git.html[documentation for release 2.4.<3>] The angle brackets show what is highlighted, but that really doesn't do it justice. I encourage people to apply the patch and git show --color $commit | contrib/diff-highlight/diff-highlight to see for yourselves (I'm hesitant to assume that sending ANSI color codes via email will end up readable for anyone). What's interesting about that example (that didn't work before) is that we can find multiple changes and highlight them independently (whereas before, we would have highlighted everything between them, too). Here's a better code example from 9cc2b07a7: -int parse_commit(struct commit *item) +int parse_commit<_gently>(struct commit *item<, int quiet_on_missing>) It highlights the change in the function name and the matching change to the parameter list. And from a few lines below in the same diff: - return error("Could not read %s", + return <quiet_on_missing ? -1 :> + error("Could not read %s", This demonstrates it working on multiple lines, and in particular one where the matching content actually crosses a line boundary. So that's all the good news. Here's some bad news, from f0e7f11d054: - return offset1 - offset2; + return offset1 < offset2 ? -1 : + offset1 > offset2 ? 1 : + 0; I don't think adding angle brackets to this one would do it justice, but you can look for yourself. What happens is that we match offset1, but then the "-" from the pre-image matches the first character of "-1" from the post-image. And all of "< offset2 ? " is highlighted, then not "-", then "1 :", and so on. The result is rather confusing to read. I think that can probably be fixed by smarter tokenizing of what we feed to the per-hunk diff (i.e., "-1" should be a full token). It also does a weird thing where the "offset1" from the second post-image line is _not_ highlighted, and it should be (even though it's ugly, it is the correct thing to do). I think what is happening is that git's existing colorizing is getting in the way. It does not say "turn on green; OK, now turn off green". It says "turn on green; OK, now reset all colors". So at the beginning of each post-image line, we will accidentally turn off any highlighting carried over from the previous line. That's probably OK in general, because highlighting across lines is a good indication that it's not helpful. But it is a little hacky. And here's some more bad news. If you look at the diff for this patch itself, it's terribly unreadable (the regular diff already is pretty bad, but the highlights make it much worse). There are big chunks where we take away 5 or 10 lines from the old code, and replace them with totally unrelated lines. We end up highlighting almost the entire thing, except for spaces and punctuation. We might be able to solve this with a percentage heuristic similar to the one Patrick proposed. It's not really interesting to highlight unless we're doing it on probably 20% or less of the diff (where 20% is a number I just made up). I also have a feeling that we might get better results if we don't feed the delimiters into the diff algorithm. I think the word-diff code actually tokenizes "foo bar baz" into the list (foo, bar, baz) and diffs only that. I think it then _loses_ the original whitespace, but here we would need to record and restore it. We might be able to use the same mechanism as for stripping and restoring the colors. I'm not sure. Anyway, here's the patch. It's rather messy, so I'd recommend just reading the resulting code; I wish there was a good way to convince the diff engine to produce larger hunks. --- contrib/diff-highlight/diff-highlight | 202 +++++++++++++++------------------- 1 file changed, 86 insertions(+), 116 deletions(-) diff --git a/contrib/diff-highlight/diff-highlight b/contrib/diff-highlight/diff-highlight index ffefc31..1525ccc 100755 --- a/contrib/diff-highlight/diff-highlight +++ b/contrib/diff-highlight/diff-highlight @@ -3,6 +3,7 @@ use 5.008; use warnings FATAL => 'all'; use strict; +use Algorithm::Diff; # Highlight by reversing foreground and background. You could do # other things like bold or underline if you prefer. @@ -80,139 +81,108 @@ sub color_config { } sub show_hunk { - my ($a, $b) = @_; + my ($lines_a, $lines_b) = @_; # If one side is empty, then there is nothing to compare or highlight. - if (!@$a || !@$b) { - print @$a, @$b; + if (!@$lines_a || !@$lines_b) { + print @$lines_a, @$lines_b; return; } - # If we have mismatched numbers of lines on each side, we could try to - # be clever and match up similar lines. But for now we are simple and - # stupid, and only handle multi-line hunks that remove and add the same - # number of lines. - if (@$a != @$b) { - print @$a, @$b; - return; + # Strip out any cruft so we can do the real diff on $a and $b. + my ($a, @stripped_a) = strip_image(@$lines_a); + my ($b, @stripped_b) = strip_image(@$lines_b); + + # Now we do the actual diff. Our highlight list is in the same + # annotation format as the @stripped data. + my $diff = Algorithm::Diff->new([split_image($a)], [split_image($b)]); + my ($offset_a, $offset_b) = (0, 0); + my (@highlight_a, @highlight_b); + while ($diff->Next()) { + my $bits = $diff->Diff(); + + push @highlight_a, [$offset_a, $OLD_HIGHLIGHT[1]] + if $bits & 1; + $offset_a += length($_) for $diff->Items(1); + push @highlight_a, [$offset_a, $OLD_HIGHLIGHT[2]] + if $bits & 1; + + push @highlight_b, [$offset_b, $NEW_HIGHLIGHT[1]] + if $bits & 2; + $offset_b += length($_) for $diff->Items(2); + push @highlight_b, [$offset_b, $NEW_HIGHLIGHT[2]] + if $bits & 2; } - my @queue; - for (my $i = 0; $i < @$a; $i++) { - my ($rm, $add) = highlight_pair($a->[$i], $b->[$i]); - print $rm; - push @queue, $add; - } - print @queue; + # And now show the output both with the original stripped annotations, + # as well as our new highlights. + show_image($a, [merge_annotations(\@stripped_a, \@highlight_a)]); + show_image($b, [merge_annotations(\@stripped_b, \@highlight_b)]); } -sub highlight_pair { - my @a = split_line(shift); - my @b = split_line(shift); - - # Find common prefix, taking care to skip any ansi - # color codes. - my $seen_plusminus; - my ($pa, $pb) = (0, 0); - while ($pa < @a && $pb < @b) { - if ($a[$pa] =~ /$COLOR/) { - $pa++; - } - elsif ($b[$pb] =~ /$COLOR/) { - $pb++; - } - elsif ($a[$pa] eq $b[$pb]) { - $pa++; - $pb++; - } - elsif (!$seen_plusminus && $a[$pa] eq '-' && $b[$pb] eq '+') { - $seen_plusminus = 1; - $pa++; - $pb++; - } - else { - last; +# Strip out any diff syntax (i.e., leading +/-), along with any ANSI color +# codes from the pre- or post-image of a hunk. The result is a string of text +# suitable for diffing against the other side of the hunk. +# +# In addition to returning the hunk itself, we also return an arrayref that +# contains the stripped data. Each element is itself an arrayref containing +# the offset into the stripped hunk, along with the stripped data that belongs +# there. +sub strip_image { + my $image = ''; + my @stripped; + foreach my $line (@_) { + $line =~ s/^$COLOR*[+-]$COLOR*// + or die "BUG: line was not +/-: $line"; + push @stripped, [length($image), $&]; + + while (length($line)) { + if ($line =~ s/^$COLOR+//) { + push @stripped, [length($image), $&]; + } elsif ($line =~ s/^(.+?)($COLOR|$)/$2/s) { + $image .= $1; + } else { + die "BUG: we should have matched _something_"; + } } } - # Find common suffix, ignoring colors. - my ($sa, $sb) = ($#a, $#b); - while ($sa >= $pa && $sb >= $pb) { - if ($a[$sa] =~ /$COLOR/) { - $sa--; - } - elsif ($b[$sb] =~ /$COLOR/) { - $sb--; - } - elsif ($a[$sa] eq $b[$sb]) { - $sa--; - $sb--; - } - else { - last; - } - } - - if (is_pair_interesting(\@a, $pa, $sa, \@b, $pb, $sb)) { - return highlight_line(\@a, $pa, $sa, \@OLD_HIGHLIGHT), - highlight_line(\@b, $pb, $sb, \@NEW_HIGHLIGHT); - } - else { - return join('', @a), - join('', @b); - } + return $image, @stripped; } -sub split_line { - local $_ = shift; - return utf8::decode($_) ? - map { utf8::encode($_); $_ } - map { /$COLOR/ ? $_ : (split //) } - split /($COLOR+)/ : - map { /$COLOR/ ? $_ : (split //) } - split /($COLOR+)/; +# Split the pre- or post-image into diffable elements. Returns +sub split_image { + return split(/([[:space:]]+|[[:punct:]]+)/, shift); } -sub highlight_line { - my ($line, $prefix, $suffix, $theme) = @_; - - my $start = join('', @{$line}[0..($prefix-1)]); - my $mid = join('', @{$line}[$prefix..$suffix]); - my $end = join('', @{$line}[($suffix+1)..$#$line]); - - # If we have a "normal" color specified, then take over the whole line. - # Otherwise, we try to just manipulate the highlighted bits. - if (defined $theme->[0]) { - s/$COLOR//g for ($start, $mid, $end); - chomp $end; - return join('', - $theme->[0], $start, $RESET, - $theme->[1], $mid, $RESET, - $theme->[0], $end, $RESET, - "\n" - ); - } else { - return join('', - $start, - $theme->[1], $mid, $theme->[2], - $end - ); +sub merge_annotations { + my ($a, $b) = @_; + my @r; + while (@$a && @$b) { + if ($a->[0]->[0] <= $b->[0]->[0]) { + push @r, shift @$a; + } else { + push @r, shift @$b; + } } + push @r, @$a; + push @r, @$b; + return @r; } -# Pairs are interesting to highlight only if we are going to end up -# highlighting a subset (i.e., not the whole line). Otherwise, the highlighting -# is just useless noise. We can detect this by finding either a matching prefix -# or suffix (disregarding boring bits like whitespace and colorization). -sub is_pair_interesting { - my ($a, $pa, $sa, $b, $pb, $sb) = @_; - my $prefix_a = join('', @$a[0..($pa-1)]); - my $prefix_b = join('', @$b[0..($pb-1)]); - my $suffix_a = join('', @$a[($sa+1)..$#$a]); - my $suffix_b = join('', @$b[($sb+1)..$#$b]); - - return $prefix_a !~ /^$COLOR*-$BORING*$/ || - $prefix_b !~ /^$COLOR*\+$BORING*$/ || - $suffix_a !~ /^$BORING*$/ || - $suffix_b !~ /^$BORING*$/; +sub show_image { + my ($image, $annotations) = @_; + my $pos = 0; + + foreach my $an (@$annotations) { + if ($pos < $an->[0]) { + print substr($image, $pos, $an->[0] - $pos); + $pos = $an->[0]; + } + print $an->[1]; + } + + if ($pos < length($image)) { + print substr($image, $pos); + } } -- 2.4.4.719.g3984bc6 ^ permalink raw reply related [flat|nested] 20+ messages in thread
* Re: [PATCH] Improve contrib/diff-highlight to highlight unevenly-sized hunks 2015-06-19 7:34 ` Jeff King @ 2015-06-19 11:38 ` Jeff King 0 siblings, 0 replies; 20+ messages in thread From: Jeff King @ 2015-06-19 11:38 UTC (permalink / raw) To: Junio C Hamano; +Cc: Patrick Palka, git On Fri, Jun 19, 2015 at 03:34:55AM -0400, Jeff King wrote: > And here's some more bad news. If you look at the diff for this > patch itself, it's terribly unreadable (the regular diff already is > pretty bad, but the highlights make it much worse). There are big chunks > where we take away 5 or 10 lines from the old code, and replace them > with totally unrelated lines. We end up highlighting almost the entire > thing, except for spaces and punctuation. > > We might be able to solve this with a percentage heuristic similar to > the one Patrick proposed. It's not really interesting to highlight > unless we're doing it on probably 20% or less of the diff (where 20% is > a number I just made up). That turned out to be pretty easy; patch is below (on top of what I sent earlier). I set the percentage at 50% based on eyeballing "git log -p" in git.git, and it seems to give good results. So I think the big remaining issue is improved tokenizing. Maybe Patrick will want to take a stab at it. --- diff --git a/contrib/diff-highlight/diff-highlight b/contrib/diff-highlight/diff-highlight index 1525ccc..9454446 100755 --- a/contrib/diff-highlight/diff-highlight +++ b/contrib/diff-highlight/diff-highlight @@ -114,12 +114,32 @@ sub show_hunk { if $bits & 2; } + my $highlighted = count_highlight(@highlight_a) + + count_highlight(@highlight_b); + my $total = length($a) + length($b); + my $pct = $highlighted / $total; + + if ($pct > 0.5) { + @highlight_a = (); + @highlight_b = (); + } + # And now show the output both with the original stripped annotations, # as well as our new highlights. show_image($a, [merge_annotations(\@stripped_a, \@highlight_a)]); show_image($b, [merge_annotations(\@stripped_b, \@highlight_b)]); } +sub count_highlight { + my $total = 0; + while (@_) { + my $from = shift; + my $to = shift; + $total += $to->[0] - $from->[0]; + } + return $total; +} + # Strip out any diff syntax (i.e., leading +/-), along with any ANSI color # codes from the pre- or post-image of a hunk. The result is a string of text # suitable for diffing against the other side of the hunk. ^ permalink raw reply related [flat|nested] 20+ messages in thread
* Re: [PATCH] Improve contrib/diff-highlight to highlight unevenly-sized hunks 2015-06-19 5:32 ` Jeff King 2015-06-19 7:34 ` Jeff King @ 2015-06-19 17:20 ` Junio C Hamano 1 sibling, 0 replies; 20+ messages in thread From: Junio C Hamano @ 2015-06-19 17:20 UTC (permalink / raw) To: Jeff King; +Cc: Patrick Palka, git Jeff King <peff@peff.net> writes: > I do strip it off, so it is OK for it to be different in both the > pre-image and post-image. But what I can't tolerate is the > intermingling with actual data: > > +\t\t\x1b[32m;foo > +\t\x1b[32m;bar I think that depends on the definition of "strip it off" ;-) What I meant was that whitespace coloring can appear anywhere on the line, e.g. echo COPYING whitespace=tab-in-indent,-space-before-tab >.gitattributes printf " \tHeh\n" >>COPYING git diff will give you <new> + <reset> SP <wserror> HT <reset> <new> Heh <reset> LF Obviously, stripping "+" painted in "new" is not sufficient. Side note: hmm, shouldn't that SP painted in <new>, though? > I think this "array of spans" is the only way to go. Otherwise whichever > markup scheme processes the hunk first ruins the data for the next > processor. Yes, I agree with that 100%. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] Improve contrib/diff-highlight to highlight unevenly-sized hunks 2015-06-18 20:45 ` Jeff King 2015-06-18 21:23 ` Jeff King @ 2015-06-18 23:06 ` Patrick Palka 1 sibling, 0 replies; 20+ messages in thread From: Patrick Palka @ 2015-06-18 23:06 UTC (permalink / raw) To: Jeff King; +Cc: Patrick Palka, Junio C Hamano, git On Thu, 18 Jun 2015, Jeff King wrote: > On Thu, Jun 18, 2015 at 04:14:19PM -0400, Patrick Palka wrote: > >>> in a test script becomes more clear. But some of the output is not so >>> great. For instance, the very commit under discussion has a >>> confusing and useless highlight. Or take a documentation patch like >>> 5c31acfb, where I find the highlights actively distracting. We are saved >>> a little by the "if the whole line is different, do not highlight at >>> all" behavior of 097128d1bc. >> >> To fix the useless highlights for both evenly and unevenly sized hunks >> (like when all but a semicolon on a line changes), one can loosen the >> criterion for not highlighting from "do not highlight if 0% of the >> before and after lines are common between them" to, say, "do not >> highlight if less than 10% of the before and after lines are common >> between them". Then most of these useless highlights are gone for both >> evenly and unevenly sized hunks. > > Yeah, this is an idea I had considered but never actually experimented > with. It does make some things better, but it also makes some a little > worse. For example, in 8dbf3eb, the hunk: > > - const char *plain = diff_get_color(ecb->color_diff, > - DIFF_PLAIN); > + const char *context = diff_get_color(ecb->color_diff, > + DIFF_CONTEXT); > > currently gets the plain/context change in the first line highlighted, > as well as the DIFF_PLAIN/DIFF_CONTEXT in the second line. With a 10% > limit, the second line isn't highlighted. That's correct by the > heuristic, but it's a bit harder to read, because the highlight draws > your eye to the first change, and it is easy to miss the second. Good example, this actually exposes a "bug" in the heuristic. Each line is around 15 characters long and the common affix ");" is 2 characters long, which is about 15% of 15. So the DIFF_PLAIN/DIFF_CONTEXT pair ought to be highlighted. The patch was unintentionally comparing the lengths of the common affixes against the length of the entire line, whitespace and color codes and all. In effect this meant that the 10% threshold was much higher. We should compare against the length of the non-boring parts of the whole line when determining the percentage in common. Attached is a revised patch. > > Still, I think this is probably a minority case, and it may be > outweighed by the improvements. The "real" solution is to consider the > hunk as a whole and do an LCS diff on it, which would show that yes, > it's worth highlighting both of those spots, as they are a small > percentage of the total hunk. > >> Here is a patch that changes the criterion as mentioned. Testing this >> change on the documentation patch 5c31acfb, only two pairs of lines are >> highlighted instead of six. On my original patch, the useless highlight >> is gone. The useless semicolon-related highlights on e.g. commit >> 99a2cfb are gone. > > Nice, the ones like 99a2cfb are definitely wrong (I had though to fix > them eventually by treating some punctuation as uninteresting, but I > suspect the percentage heuristic covers that reasonably well in > practice). > >> Of course, these patches are both hacks but they seem to be surprisingly >> effective hacks especially when paired together. > > The whole script is a (surprisingly effective) hack. ;) > >>> So I dunno. IMHO this does more harm than good, and I would not want to >>> use it myself. But it is somewhat a matter of taste; I am not opposed to >>> making it a configurable option. >> >> That is something I can do :) > > Coupled with the 10%-threshold patch, I think it would be OK to include > it unconditionally. So far we've just been diffing the two outputs and > micro-analyzing them. The real test to me will be using it in practice > and seeing if it's helpful or annoying. -- >8 -- Subject: [PATCH] diff-highlight: don't highlight lines that have little in common --- contrib/diff-highlight/diff-highlight | 21 +++++++++++++++++---- 1 file changed, 17 insertions(+), 4 deletions(-) diff --git a/contrib/diff-highlight/diff-highlight b/contrib/diff-highlight/diff-highlight index 85d2eb0..0cc2b60 100755 --- a/contrib/diff-highlight/diff-highlight +++ b/contrib/diff-highlight/diff-highlight @@ -218,8 +218,21 @@ sub is_pair_interesting { my $suffix_a = join('', @$a[($sa+1)..$#$a]); my $suffix_b = join('', @$b[($sb+1)..$#$b]); - return $prefix_a !~ /^$COLOR*-$BORING*$/ || - $prefix_b !~ /^$COLOR*\+$BORING*$/ || - $suffix_a !~ /^$BORING*$/ || - $suffix_b !~ /^$BORING*$/; + $prefix_a =~ s/^$COLOR*-$BORING*//; + $prefix_b =~ s/^$COLOR*\+$BORING*//; + $suffix_a =~ s/$BORING*$//; + $suffix_b =~ s/$BORING*$//; + + my $whole_a = join ('', @$a); + $whole_a =~ s/^$COLOR*-$BORING*//; + $whole_a =~ s/$BORING*$//; + + my $whole_b = join ('', @$b); + $whole_b =~ s/^$COLOR*\+$BORING*//; + $whole_b =~ s/$BORING*$//; + + # Only bother highlighting if at least 10% of each line is common among + # the lines. + return ((length($prefix_a)+length($suffix_a))*100 >= length($whole_a)*10) && + ((length($prefix_b)+length($suffix_b))*100 >= length($whole_b)*10); } -- 2.4.4.410.g43ed522.dirty ^ permalink raw reply related [flat|nested] 20+ messages in thread
* Re: [PATCH] Improve contrib/diff-highlight to highlight unevenly-sized hunks 2015-06-18 18:08 ` Junio C Hamano 2015-06-18 19:04 ` Jeff King @ 2015-06-18 20:23 ` Patrick Palka 1 sibling, 0 replies; 20+ messages in thread From: Patrick Palka @ 2015-06-18 20:23 UTC (permalink / raw) To: Junio C Hamano; +Cc: Jeff King, git On Thu, Jun 18, 2015 at 2:08 PM, Junio C Hamano <gitster@pobox.com> wrote: > Patrick Palka <patrick@parcs.ath.cx> writes: > >>> I have this nagging feeling that it is just as likely that two >>> uneven hunks align at the top as they align at the bottom, so while >>> this might not hurt it may not be the right approach for a better >>> solution, in the sense that when somebody really wants to do a >>> better solution, this change and the original code may need to be >>> ripped out and redone from scratch. >> >> Hmm, maybe. I stuck with assuming hunks are top-aligned because it >> required less code to implement :) > > Yeah, I understand that. > > If we will need to rip out only this change but keep the original in > order to implement a future better solution, then we might be better > off not having this change (if we anticipate such a better solution > to come reasonably soon), because it would make it more work for the > final improved solution. But if we need to rip out the original as > well as this change while we do so, then having this patch would not > make it more work, either. > > So as I said, I do not think it would hurt to have this as an > incremental improvement (albeit going in a possibly wrong > direction). > > Of course, it is a separate question if this change makes the output > worse, by comparing unmatched early parts of two hunks and making > nonsense highlight by calling highlight_pair() more often. As long > as that is not an issue, I am not opposed to this change, which was > what I meant to say by "this might not hurt". > That makes sense. The extra useless highlighting indeed is currently a problem but it may yet be worked around. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] Improve contrib/diff-highlight to highlight unevenly-sized hunks 2015-06-18 16:28 ` Patrick Palka 2015-06-18 18:08 ` Junio C Hamano @ 2015-06-18 19:08 ` Jeff King 2015-06-18 20:27 ` Patrick Palka 1 sibling, 1 reply; 20+ messages in thread From: Jeff King @ 2015-06-18 19:08 UTC (permalink / raw) To: Patrick Palka; +Cc: Junio C Hamano, git On Thu, Jun 18, 2015 at 12:28:58PM -0400, Patrick Palka wrote: > By the way, what would it take to get something like this script into > git proper? It is IMHO immensely useful even in its current form, yet > because it's not baked into the application hardly anybody knows about > it. I think if we were going to make it more official, it would make sense to do it inside the diff code itself (i.e., not as a separate script), and it might be reasonable at that point to actually do a "real" character-based diff rather than the hacky prefix/suffix thing (or possibly even integrate with the color-words patterns to find syntactically interesting breaks). There is some discussion in the "Bugs" section of contrib/diff-highlight/README. -Peff ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] Improve contrib/diff-highlight to highlight unevenly-sized hunks 2015-06-18 19:08 ` Jeff King @ 2015-06-18 20:27 ` Patrick Palka 0 siblings, 0 replies; 20+ messages in thread From: Patrick Palka @ 2015-06-18 20:27 UTC (permalink / raw) To: Jeff King; +Cc: Junio C Hamano, git On Thu, Jun 18, 2015 at 3:08 PM, Jeff King <peff@peff.net> wrote: > On Thu, Jun 18, 2015 at 12:28:58PM -0400, Patrick Palka wrote: > >> By the way, what would it take to get something like this script into >> git proper? It is IMHO immensely useful even in its current form, yet >> because it's not baked into the application hardly anybody knows about >> it. > > I think if we were going to make it more official, it would make sense > to do it inside the diff code itself (i.e., not as a separate script), > and it might be reasonable at that point to actually do a "real" > character-based diff rather than the hacky prefix/suffix thing (or > possibly even integrate with the color-words patterns to find > syntactically interesting breaks). There is some discussion in the > "Bugs" section of contrib/diff-highlight/README. > > -Peff Thanks for the pointers. This is something I am interested in implementing (though not any time soon). I was actually in the process of familiarizing myself with the diff code before I discovered the existence of diff-highlight by accident. ^ permalink raw reply [flat|nested] 20+ messages in thread
end of thread, other threads:[~2015-06-19 17:20 UTC | newest] Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2015-06-15 17:20 [PATCH] Improve contrib/diff-highlight to highlight unevenly-sized hunks Patrick Palka 2015-06-18 15:50 ` Junio C Hamano 2015-06-18 16:28 ` Patrick Palka 2015-06-18 18:08 ` Junio C Hamano 2015-06-18 19:04 ` Jeff King 2015-06-18 20:14 ` Patrick Palka 2015-06-18 20:45 ` Jeff King 2015-06-18 21:23 ` Jeff King 2015-06-18 21:39 ` Junio C Hamano 2015-06-18 22:25 ` Patrick Palka 2015-06-19 3:54 ` Jeff King 2015-06-19 4:49 ` Junio C Hamano 2015-06-19 5:32 ` Jeff King 2015-06-19 7:34 ` Jeff King 2015-06-19 11:38 ` Jeff King 2015-06-19 17:20 ` Junio C Hamano 2015-06-18 23:06 ` Patrick Palka 2015-06-18 20:23 ` Patrick Palka 2015-06-18 19:08 ` Jeff King 2015-06-18 20:27 ` Patrick Palka
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.