All of lore.kernel.org
 help / color / mirror / Atom feed
* weird diff output?
@ 2016-03-29  0:26 Jacob Keller
  2016-03-29 17:37 ` Stefan Beller
  0 siblings, 1 reply; 27+ messages in thread
From: Jacob Keller @ 2016-03-29  0:26 UTC (permalink / raw)
  To: Stefan Beller; +Cc: Git mailing list, Junio C Hamano, Jens Lehmann

On Mon, Mar 28, 2016 at 4:28 PM, Stefan Beller <sbeller@google.com> wrote:
>  cat > expect <<EOF
> +Entering '../nested1'
> +Entering '../nested1/nested2'
> +Entering '../nested1/nested2/nested3'
> +Entering '../nested1/nested2/nested3/submodule'
> +Entering '../sub1'
> +Entering '../sub2'
> +Entering '../sub3'
> +EOF
> +
> +test_expect_failure 'test messages from "foreach --recursive" from subdirectory' '
> +       (
> +               cd clone2 &&
> +               mkdir untracked &&
> +               cd untracked &&
> +               git submodule foreach --recursive >../../actual
> +       ) &&
> +       test_i18ncmp expect actual
> +'
> +
> +cat > expect <<EOF
>  nested1-nested1
>  nested2-nested2
>  nested3-nested3

Complete tangent here. The diff above looks like

<old-line>
+
+
+
+
+<old-line>

is it possible to get diff output that would look more like

+<old-line>
+
+
+
+
+
<old-line>

instead? This is one of those huge readability issues with diff
formatting that seems like both are completely correct, but the second
way is much easier in general to read what was added.

I don't understand why diff algorithms result in the former instead of
the latter, and am curious if anyone knows whether this has ever been
thought about or solved by someone.

I've tried using various diffing algorithms (histogram, etc) and they
always produce the same result above, and never what I would prefer.

Regards,
Jake

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: weird diff output?
  2016-03-29  0:26 weird diff output? Jacob Keller
@ 2016-03-29 17:37 ` Stefan Beller
  2016-03-29 17:54   ` Junio C Hamano
  0 siblings, 1 reply; 27+ messages in thread
From: Stefan Beller @ 2016-03-29 17:37 UTC (permalink / raw)
  To: Jacob Keller; +Cc: Git mailing list, Junio C Hamano, Jens Lehmann

On Mon, Mar 28, 2016 at 5:26 PM, Jacob Keller <jacob.keller@gmail.com> wrote:
> On Mon, Mar 28, 2016 at 4:28 PM, Stefan Beller <sbeller@google.com> wrote:
>>  cat > expect <<EOF
>> +Entering '../nested1'
>> +Entering '../nested1/nested2'
>> +Entering '../nested1/nested2/nested3'
>> +Entering '../nested1/nested2/nested3/submodule'
>> +Entering '../sub1'
>> +Entering '../sub2'
>> +Entering '../sub3'
>> +EOF
>> +
>> +test_expect_failure 'test messages from "foreach --recursive" from subdirectory' '
>> +       (
>> +               cd clone2 &&
>> +               mkdir untracked &&
>> +               cd untracked &&
>> +               git submodule foreach --recursive >../../actual
>> +       ) &&
>> +       test_i18ncmp expect actual
>> +'
>> +
>> +cat > expect <<EOF
>>  nested1-nested1
>>  nested2-nested2
>>  nested3-nested3
>
> Complete tangent here. The diff above looks like
>
> <old-line>
> +
> +
> +
> +
> +<old-line>
>
> is it possible to get diff output that would look more like
>
> +<old-line>
> +
> +
> +
> +
> +
> <old-line>
>
> instead? This is one of those huge readability issues with diff
> formatting that seems like both are completely correct, but the second
> way is much easier in general to read what was added.
>
> I don't understand why diff algorithms result in the former instead of
> the latter, and am curious if anyone knows whether this has ever been
> thought about or solved by someone.

I thought this is an optimization for C code where you have a diff like:

    int existingStuff1(..) {
    ...
    }
    +
    + int foo(..) {
    +...
    +}

    int existingStuff2(...) {
    ...

Note that the closing '}' could be taken from the method existingStuff1 instead
of correctly closing foo. So the correct heuristic really depends on
what kind of text
we are diffing.

Maybe we need the opposite of the 'patience' algorithm in format-patch?

Another heuristic would be to check for empty lines and use that as a
strong hint,
whether to use the first or last line. (Rule: Try to use that last or
first line such that
the lines at the edges of the diff are empty lines, it needs to be
formalized a bit more)

>
> I've tried using various diffing algorithms (histogram, etc) and they
> always produce the same result above, and never what I would prefer.
>
> Regards,
> Jake

Thanks,
Stefan

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: weird diff output?
  2016-03-29 17:37 ` Stefan Beller
@ 2016-03-29 17:54   ` Junio C Hamano
  2016-03-29 18:16     ` Stefan Beller
  0 siblings, 1 reply; 27+ messages in thread
From: Junio C Hamano @ 2016-03-29 17:54 UTC (permalink / raw)
  To: Stefan Beller; +Cc: Jacob Keller, Git mailing list, Jens Lehmann

Stefan Beller <sbeller@google.com> writes:

> I thought this is an optimization for C code where you have a diff like:
>
>     int existingStuff1(..) {
>     ...
>     }
>     +
>     + int foo(..) {
>     +...
>     +}
>
>     int existingStuff2(...) {
>     ...
>
> Note that the closing '}' could be taken from the method existingStuff1 instead
> of correctly closing foo.

That is a less optimal output.  Another possible output would be
like so:

      int existingStuff1(..) {
      ...
      }
     
     + int foo(..) {
     +...
     +}
     +
      int existingStuff2(...) {

All three are valid output, and ...

> So the correct heuristic really depends on what kind of text we
> are diffing.

... this realization is correct.

I have a feeling that any heuristic would be correct half of the
time, including the ehuristic implemented in the current code.  The
readers of patches have inherent bias.  They do not notice when the
hunk is formed to match their expectation, but they will notice and
remember when they see something less optimal.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: weird diff output?
  2016-03-29 17:54   ` Junio C Hamano
@ 2016-03-29 18:16     ` Stefan Beller
  2016-03-29 23:05       ` Jacob Keller
  0 siblings, 1 reply; 27+ messages in thread
From: Stefan Beller @ 2016-03-29 18:16 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jacob Keller, Git mailing list, Jens Lehmann

On Tue, Mar 29, 2016 at 10:54 AM, Junio C Hamano <gitster@pobox.com> wrote:
> Stefan Beller <sbeller@google.com> writes:
>
>> I thought this is an optimization for C code where you have a diff like:
>>
>>     int existingStuff1(..) {
>>     ...
>>     }
>>     +
>>     + int foo(..) {
>>     +...
>>     +}
>>
>>     int existingStuff2(...) {
>>     ...
>>
>> Note that the closing '}' could be taken from the method existingStuff1 instead
>> of correctly closing foo.
>
> That is a less optimal output.  Another possible output would be
> like so:
>
>       int existingStuff1(..) {
>       ...
>       }
>
>      + int foo(..) {
>      +...
>      +}
>      +
>       int existingStuff2(...) {
>
> All three are valid output, and ...
>
>> So the correct heuristic really depends on what kind of text we
>> are diffing.
>
> ... this realization is correct.
>
> I have a feeling that any heuristic would be correct half of the
> time, including the ehuristic implemented in the current code.  The
> readers of patches have inherent bias.  They do not notice when the
> hunk is formed to match their expectation, but they will notice and
> remember when they see something less optimal.
>

We have 3 possible diffs:
1) closing brace and newline before the chunk
2) newline before, closing brace after the chunk
3) closing brace and newline after the chunk

For C code we may want to conclude that 3) is best. (appeals the bias of
most people) 2 is slightly worse, whereas 1) is absolutely worst.

Now looking at the code Jacob found strange:

>  cat > expect <<EOF
> + expected results ...
> + EOF
> +test_expect_failure  ... '
> + ...
> + '
> +
> +cat > expect <<EOF

This can be written in two ways:

1) "cat > expect <<EOF" before the diff chunk
2) "cat > expect <<EOF" after the diff chunk

We claim 1) is better than 2).
This is different from the C code as now we want to have the
same lines before not after.

To find a heuristic, which appeals both the C code
and the shell code, we could take the empty line
as a strong hint for the divider:

1) determine the amount of diff which is ambiguous, i.e. can
   go before or after the chunk.
2) Does the ambiguous part contain an empty line?
3) If not, I have no offer for you, stop.
4) divide the ambiguous chunk by the empty line,
5) put the lines *after* the empty line in front of the chunk
6) put the part before (including) the empty line after the
   chunk
7) Observe output:

>       }
>
>      + int foo(..) {
>      +...
>      +}
>      +
>       int existingStuff2(...) {

> test_expect_failure ... '
> existing test ...
> '
>
> + cat > expect <<EOF
> + expected results ...
> + EOF
> +test_expect_failure  ... '
> + ...
> + '
> +
> cat > expect <<EOF

This is what we want in both cases.
And I would argue it would appease many other kinds of text as well, because
an empty line is usually a strong indicator for any text that a
different thing comes along.
(Other programming languages, such as Java, C++ and any other C like
language behaves
that way; even when writing latex figures you'd rather want to break
at new lines?)

Thanks,
Stefan

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: weird diff output?
  2016-03-29 18:16     ` Stefan Beller
@ 2016-03-29 23:05       ` Jacob Keller
  2016-03-30  0:04         ` Junio C Hamano
  2016-03-30  4:55         ` Jeff King
  0 siblings, 2 replies; 27+ messages in thread
From: Jacob Keller @ 2016-03-29 23:05 UTC (permalink / raw)
  To: Stefan Beller; +Cc: Junio C Hamano, Git mailing list, Jens Lehmann

On Tue, Mar 29, 2016 at 11:16 AM, Stefan Beller <sbeller@google.com> wrote:
> On Tue, Mar 29, 2016 at 10:54 AM, Junio C Hamano <gitster@pobox.com> wrote:
>> Stefan Beller <sbeller@google.com> writes:
>>
>>> I thought this is an optimization for C code where you have a diff like:
>>>
>>>     int existingStuff1(..) {
>>>     ...
>>>     }
>>>     +
>>>     + int foo(..) {
>>>     +...
>>>     +}
>>>
>>>     int existingStuff2(...) {
>>>     ...
>>>
>>> Note that the closing '}' could be taken from the method existingStuff1 instead
>>> of correctly closing foo.
>>
>> That is a less optimal output.  Another possible output would be
>> like so:
>>
>>       int existingStuff1(..) {
>>       ...
>>       }
>>
>>      + int foo(..) {
>>      +...
>>      +}
>>      +
>>       int existingStuff2(...) {
>>
>> All three are valid output, and ...
>>
>>> So the correct heuristic really depends on what kind of text we
>>> are diffing.
>>
>> ... this realization is correct.
>>
>> I have a feeling that any heuristic would be correct half of the
>> time, including the ehuristic implemented in the current code.  The
>> readers of patches have inherent bias.  They do not notice when the
>> hunk is formed to match their expectation, but they will notice and
>> remember when they see something less optimal.
>>
>
> We have 3 possible diffs:
> 1) closing brace and newline before the chunk
> 2) newline before, closing brace after the chunk
> 3) closing brace and newline after the chunk
>
> For C code we may want to conclude that 3) is best. (appeals the bias of
> most people) 2 is slightly worse, whereas 1) is absolutely worst.
>
> Now looking at the code Jacob found strange:
>
>>  cat > expect <<EOF
>> + expected results ...
>> + EOF
>> +test_expect_failure  ... '
>> + ...
>> + '
>> +
>> +cat > expect <<EOF
>
> This can be written in two ways:
>
> 1) "cat > expect <<EOF" before the diff chunk
> 2) "cat > expect <<EOF" after the diff chunk
>
> We claim 1) is better than 2).
> This is different from the C code as now we want to have the
> same lines before not after.
>
> To find a heuristic, which appeals both the C code
> and the shell code, we could take the empty line
> as a strong hint for the divider:
>
> 1) determine the amount of diff which is ambiguous, i.e. can
>    go before or after the chunk.
> 2) Does the ambiguous part contain an empty line?
> 3) If not, I have no offer for you, stop.
> 4) divide the ambiguous chunk by the empty line,
> 5) put the lines *after* the empty line in front of the chunk
> 6) put the part before (including) the empty line after the
>    chunk
> 7) Observe output:
>
>>       }
>>
>>      + int foo(..) {
>>      +...
>>      +}
>>      +
>>       int existingStuff2(...) {
>
>> test_expect_failure ... '
>> existing test ...
>> '
>>
>> + cat > expect <<EOF
>> + expected results ...
>> + EOF
>> +test_expect_failure  ... '
>> + ...
>> + '
>> +
>> cat > expect <<EOF
>
> This is what we want in both cases.
> And I would argue it would appease many other kinds of text as well, because
> an empty line is usually a strong indicator for any text that a
> different thing comes along.
> (Other programming languages, such as Java, C++ and any other C like
> language behaves
> that way; even when writing latex figures you'd rather want to break
> at new lines?)
>
> Thanks,
> Stefan

This seems like a good heuristic. Can we think of any examples where
it would produce wildly confusing diffs? I don't think it necessarily
needs to be default but just a possible option when formatting diffs,
much like we already have today.

Thanks,
Jake

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: weird diff output?
  2016-03-29 23:05       ` Jacob Keller
@ 2016-03-30  0:04         ` Junio C Hamano
  2016-03-30  4:55         ` Jeff King
  1 sibling, 0 replies; 27+ messages in thread
From: Junio C Hamano @ 2016-03-30  0:04 UTC (permalink / raw)
  To: Jacob Keller; +Cc: Stefan Beller, Git mailing list, Jens Lehmann

Jacob Keller <jacob.keller@gmail.com> writes:

> On Tue, Mar 29, 2016 at 11:16 AM, Stefan Beller <sbeller@google.com> wrote:
>> ...
>> To find a heuristic, which appeals both the C code
>> and the shell code, we could take the empty line
>> as a strong hint for the divider:
>
> This seems like a good heuristic. Can we think of any examples where
> it would produce wildly confusing diffs? I don't think it necessarily
> needs to be default but just a possible option when formatting diffs,
> much like we already have today.

I earlier said "50% of the time it is correct, you just do not
remember", but such an option with configuration variable would let
somebody interested set it permanently for his daily use of Git, and
it would help him to find out (1) if he sees a "Huh?" division less
(or more) often than he used to, and (2) if it gives a better
division for the same change to view the diff with the plain-vanilla
heuristic.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: weird diff output?
  2016-03-29 23:05       ` Jacob Keller
  2016-03-30  0:04         ` Junio C Hamano
@ 2016-03-30  4:55         ` Jeff King
  2016-03-30  6:05           ` Stefan Beller
  2016-03-30  6:05           ` Jacob Keller
  1 sibling, 2 replies; 27+ messages in thread
From: Jeff King @ 2016-03-30  4:55 UTC (permalink / raw)
  To: Jacob Keller
  Cc: Stefan Beller, Junio C Hamano, Git mailing list, Jens Lehmann

On Tue, Mar 29, 2016 at 04:05:57PM -0700, Jacob Keller wrote:

> > This is what we want in both cases.
> > And I would argue it would appease many other kinds of text as well, because
> > an empty line is usually a strong indicator for any text that a
> > different thing comes along.
> > (Other programming languages, such as Java, C++ and any other C like
> > language behaves
> > that way; even when writing latex figures you'd rather want to break
> > at new lines?)
> >
> > Thanks,
> > Stefan
> 
> This seems like a good heuristic. Can we think of any examples where
> it would produce wildly confusing diffs? I don't think it necessarily
> needs to be default but just a possible option when formatting diffs,
> much like we already have today.

One thing I like to do when playing with new diff ideas is to pipe all
of "log -p" for a real project through it and see what differences it
produces.

Below is a perl script that implements Stefan's heuristic. I checked its
output on git.git with:

  git log --format='commit %H' -p >old
  perl /path/to/script <old >new
  diff -F ^commit -u old new | less

which shows the differences, with the commit id in the hunk header
(which makes it easy to "git show $commit | perl /path/to/script" to
see the new diff with more context.

In addition to the cases discussed, it seems to improve C comments by
turning:

   /*
  + * new function
  + */
  +void foo(void);
  +
  +/*
    * old function
    ...

into:

  +/*
  + * my function
  + */
  +void foo(void);
  +
   /*
    * old function
    ...

See 47fe3f6e for an example.

It also seems to do OK with shell scripts. Commit e6bb5f78 is an example
where it improves a here-doc, as in the motivating example from this
thread. Similarly, the headers in 4df1e79 are much improved (though I'm
confused why the final one in that diff doesn't seem to have been
caught).

I also ran into an interesting case in 86d26f24, where we have:

  + test_expect_success '
  +   foo
  +
  +'
  +

and there are _two_ blank lines to choose from. It looks really terrible
if you use the first one, but the second one looks good (and the script
below chooses the second, as it's closest to the hunk boundary). There
may be cases where that's bad, though.

This is just a proof of concept. I guess we'd want to somehow integrate
the heuristic into git.

-- >8 --
#!/usr/bin/perl

use strict;
use warnings 'all';

use constant {
  STATE_NONE => 0,
  STATE_LEADING_CONTEXT => 1,
  STATE_IN_CHUNK => 2,
};
my $state = STATE_NONE;
my @hunk;
while(<>) {
  if ($state == STATE_NONE) {
    print;
    if (/^@/) {
      $state = STATE_LEADING_CONTEXT;
    }
  } else {
    if (/^ /) {
      flush_hunk() if $state != STATE_LEADING_CONTEXT;
      push @hunk, $_;
    } elsif(/^[-+]/) {
      push @hunk, $_;
      $state = STATE_IN_CHUNK;
    } else {
      flush_hunk();
      $state = STATE_NONE;
      print;
    }
  }
}
flush_hunk();

sub flush_hunk {
  my $context_len = 0;
  while ($context_len < @hunk && $hunk[$context_len] =~ /^ /) {
    $context_len++;
  }

  # Find the length of the ambiguous portion.
  # Assumes our hunks have context first, and ambiguous additions at the end,
  # which is how git generates them
  my $ambig_len = 0;
  while ($ambig_len < $context_len) {
    my $i = $context_len - $ambig_len - 1;
    my $j = @hunk - $ambig_len - 1;
    if ($hunk[$j] =~ /^\+/ && substr($hunk[$i], 1) eq substr($hunk[$j], 1)) {
      $ambig_len++;
    } else {
      last;
    }
  }

  # Now look for an empty line in the ambiguous portion (we can just look in
  # the context side, as it is equivalent to the addition side at the end).
  # We count down, though, as we prefer to use the line closest to the
  # hunk as the cutoff.
  my $empty;
  for (my $i = $context_len - 1; $i >= $context_len - $ambig_len; $i--) {
    if (length($hunk[$i]) == 2) {
      $empty = $i;
      last;
    }
  }

  if (defined $empty) {
    # move empty lines after the chunk to be part of it
    for (my $i = $empty + 1; $i < $context_len; $i++) {
      $hunk[$i] =~ s/^ /+/;
      $hunk[@hunk - $context_len + $i] =~ s/^\+/ /;
    }
  }

  print @hunk;
  @hunk = ();
}

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: weird diff output?
  2016-03-30  4:55         ` Jeff King
@ 2016-03-30  6:05           ` Stefan Beller
  2016-03-30  6:05           ` Jacob Keller
  1 sibling, 0 replies; 27+ messages in thread
From: Stefan Beller @ 2016-03-30  6:05 UTC (permalink / raw)
  To: Jeff King; +Cc: Jacob Keller, Junio C Hamano, Git mailing list, Jens Lehmann

On Tue, Mar 29, 2016 at 9:55 PM, Jeff King <peff@peff.net> wrote:
> On Tue, Mar 29, 2016 at 04:05:57PM -0700, Jacob Keller wrote:
>
>> > This is what we want in both cases.
>> > And I would argue it would appease many other kinds of text as well, because
>> > an empty line is usually a strong indicator for any text that a
>> > different thing comes along.
>> > (Other programming languages, such as Java, C++ and any other C like
>> > language behaves
>> > that way; even when writing latex figures you'd rather want to break
>> > at new lines?)
>> >
>> > Thanks,
>> > Stefan
>>
>> This seems like a good heuristic. Can we think of any examples where
>> it would produce wildly confusing diffs? I don't think it necessarily
>> needs to be default but just a possible option when formatting diffs,
>> much like we already have today.
>
> One thing I like to do when playing with new diff ideas is to pipe all
> of "log -p" for a real project through it and see what differences it
> produces.
>
> Below is a perl script that implements Stefan's heuristic. I checked its
> output on git.git with:
>
>   git log --format='commit %H' -p >old
>   perl /path/to/script <old >new
>   diff -F ^commit -u old new | less

Wow, that's amazing! I'll toy around with it tomorrow. :)

>
> which shows the differences, with the commit id in the hunk header
> (which makes it easy to "git show $commit | perl /path/to/script" to
> see the new diff with more context.
>
> In addition to the cases discussed, it seems to improve C comments by
> turning:
>
>    /*
>   + * new function
>   + */
>   +void foo(void);
>   +
>   +/*
>     * old function
>     ...
>
> into:
>
>   +/*
>   + * my function
>   + */
>   +void foo(void);
>   +
>    /*
>     * old function
>     ...
>
> See 47fe3f6e for an example.
>
> It also seems to do OK with shell scripts. Commit e6bb5f78 is an example
> where it improves a here-doc, as in the motivating example from this
> thread. Similarly, the headers in 4df1e79 are much improved (though I'm
> confused why the final one in that diff doesn't seem to have been
> caught).
>
> I also ran into an interesting case in 86d26f24, where we have:
>
>   + test_expect_success '
>   +   foo
>   +
>   +'
>   +

That's an interesting case :)
I was trying to generalize my thoughts on it. (How is an empty line special?)

Instead of empty line we could go with the line with the least amount
of characters
in the lines which can be shifted up or down instead as well. Why so?

  The more characters are in a line, the more interesting the line is.
  (the more information is in there). Assuming one patch carries information
  that is highly relevant in itself, but may not be relevant to the surrounding
  (think adding a new function to a C file. The surrounding functions are not
  interesting for the diff, but rather you want to have all "relevant"
information
  bundled into that one diff reasonably.

  Going by the rule of splitting at the shortest line instead of just
at empty lines,
  is a generalization of

So instead of looking at

>   + test_expect_success '
>   +   foo
>   +
>   +'
>   +

we rather want to look at the string lengths of each line:

>   21
>   5
>   0
>   1
>   0

and then take the minimum (so instead of only acting on the 'last 0'
as shown by Jeff,
we'd go to the minimum of those numbers.)

Now on tie breaking (i.e two empty lines):

We need to understand the "pattern" of whether the lonely 1 char line
belongs above
or below the chunk. I do not think we can do that just from the diff alone.

* We either need to check the file ("Does the file start with the 0 1 0 pattern
or does it end with that?" That would be a strong hint on whether to
put the 1 line
above or below the chunk.) However a typical file has noise at the top
and bottom,
so this heuristic is not often applicable. With noise I mean license headers or
a java class or namespace ending with another brace or such. So probably
this second order heuristic on which of the empty lines to pick for
breaking needs more thoughts.

* Go through the history of the file and check for occurrences (how
was such a pattern
added in the past? Ideally we want to find the first time such a
pattern is added and
then decide based on that whether to break at the first or second empty line)

I guess both ways are expensive. Probably too expensive.

So for now we can just go with "take first or last empty line (shortest line) of
overlapping lines" and inspect that further.






>
> and there are _two_ blank lines to choose from. It looks really terrible
> if you use the first one, but the second one looks good (and the script
> below chooses the second, as it's closest to the hunk boundary). There
> may be cases where that's bad, though.
>
> This is just a proof of concept. I guess we'd want to somehow integrate
> the heuristic into git.
>
> -- >8 --
> #!/usr/bin/perl
>
> use strict;
> use warnings 'all';
>
> use constant {
>   STATE_NONE => 0,
>   STATE_LEADING_CONTEXT => 1,
>   STATE_IN_CHUNK => 2,
> };
> my $state = STATE_NONE;
> my @hunk;
> while(<>) {
>   if ($state == STATE_NONE) {
>     print;
>     if (/^@/) {
>       $state = STATE_LEADING_CONTEXT;
>     }
>   } else {
>     if (/^ /) {
>       flush_hunk() if $state != STATE_LEADING_CONTEXT;
>       push @hunk, $_;
>     } elsif(/^[-+]/) {
>       push @hunk, $_;
>       $state = STATE_IN_CHUNK;
>     } else {
>       flush_hunk();
>       $state = STATE_NONE;
>       print;
>     }
>   }
> }
> flush_hunk();
>
> sub flush_hunk {
>   my $context_len = 0;
>   while ($context_len < @hunk && $hunk[$context_len] =~ /^ /) {
>     $context_len++;
>   }
>
>   # Find the length of the ambiguous portion.
>   # Assumes our hunks have context first, and ambiguous additions at the end,
>   # which is how git generates them
>   my $ambig_len = 0;
>   while ($ambig_len < $context_len) {
>     my $i = $context_len - $ambig_len - 1;
>     my $j = @hunk - $ambig_len - 1;
>     if ($hunk[$j] =~ /^\+/ && substr($hunk[$i], 1) eq substr($hunk[$j], 1)) {
>       $ambig_len++;
>     } else {
>       last;
>     }
>   }
>
>   # Now look for an empty line in the ambiguous portion (we can just look in
>   # the context side, as it is equivalent to the addition side at the end).
>   # We count down, though, as we prefer to use the line closest to the
>   # hunk as the cutoff.
>   my $empty;
>   for (my $i = $context_len - 1; $i >= $context_len - $ambig_len; $i--) {
>     if (length($hunk[$i]) == 2) {
>       $empty = $i;
>       last;
>     }
>   }
>
>   if (defined $empty) {
>     # move empty lines after the chunk to be part of it
>     for (my $i = $empty + 1; $i < $context_len; $i++) {
>       $hunk[$i] =~ s/^ /+/;
>       $hunk[@hunk - $context_len + $i] =~ s/^\+/ /;
>     }
>   }
>
>   print @hunk;
>   @hunk = ();
> }

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: weird diff output?
  2016-03-30  4:55         ` Jeff King
  2016-03-30  6:05           ` Stefan Beller
@ 2016-03-30  6:05           ` Jacob Keller
  2016-03-30 19:14             ` Jacob Keller
  1 sibling, 1 reply; 27+ messages in thread
From: Jacob Keller @ 2016-03-30  6:05 UTC (permalink / raw)
  To: Jeff King; +Cc: Stefan Beller, Junio C Hamano, Git mailing list, Jens Lehmann

On Tue, Mar 29, 2016 at 9:55 PM, Jeff King <peff@peff.net> wrote:
> One thing I like to do when playing with new diff ideas is to pipe all
> of "log -p" for a real project through it and see what differences it
> produces.
>

Great idea!

> Below is a perl script that implements Stefan's heuristic. I checked its
> output on git.git with:
>
>   git log --format='commit %H' -p >old
>   perl /path/to/script <old >new
>   diff -F ^commit -u old new | less
>
> which shows the differences, with the commit id in the hunk header
> (which makes it easy to "git show $commit | perl /path/to/script" to
> see the new diff with more context.
>

I'll try to run this  against my projects and see what it looks like
to see if I can spot (m)any counter examples, which would indicate
it's a bad idea. I may have some time in the next few days to see how
hard it would be to fully integrate it into the diff machinery too.

Thanks for the help!

Regards,
Jake

> In addition to the cases discussed, it seems to improve C comments by
> turning:
>
>    /*
>   + * new function
>   + */
>   +void foo(void);
>   +
>   +/*
>     * old function
>     ...
>
> into:
>
>   +/*
>   + * my function
>   + */
>   +void foo(void);
>   +
>    /*
>     * old function
>     ...
>
> See 47fe3f6e for an example.
>
> It also seems to do OK with shell scripts. Commit e6bb5f78 is an example
> where it improves a here-doc, as in the motivating example from this
> thread. Similarly, the headers in 4df1e79 are much improved (though I'm
> confused why the final one in that diff doesn't seem to have been
> caught).
>
> I also ran into an interesting case in 86d26f24, where we have:
>
>   + test_expect_success '
>   +   foo
>   +
>   +'
>   +
>
> and there are _two_ blank lines to choose from. It looks really terrible
> if you use the first one, but the second one looks good (and the script
> below chooses the second, as it's closest to the hunk boundary). There
> may be cases where that's bad, though.
>
> This is just a proof of concept. I guess we'd want to somehow integrate
> the heuristic into git.
>
> -- >8 --
> #!/usr/bin/perl
>
> use strict;
> use warnings 'all';
>
> use constant {
>   STATE_NONE => 0,
>   STATE_LEADING_CONTEXT => 1,
>   STATE_IN_CHUNK => 2,
> };
> my $state = STATE_NONE;
> my @hunk;
> while(<>) {
>   if ($state == STATE_NONE) {
>     print;
>     if (/^@/) {
>       $state = STATE_LEADING_CONTEXT;
>     }
>   } else {
>     if (/^ /) {
>       flush_hunk() if $state != STATE_LEADING_CONTEXT;
>       push @hunk, $_;
>     } elsif(/^[-+]/) {
>       push @hunk, $_;
>       $state = STATE_IN_CHUNK;
>     } else {
>       flush_hunk();
>       $state = STATE_NONE;
>       print;
>     }
>   }
> }
> flush_hunk();
>
> sub flush_hunk {
>   my $context_len = 0;
>   while ($context_len < @hunk && $hunk[$context_len] =~ /^ /) {
>     $context_len++;
>   }
>
>   # Find the length of the ambiguous portion.
>   # Assumes our hunks have context first, and ambiguous additions at the end,
>   # which is how git generates them
>   my $ambig_len = 0;
>   while ($ambig_len < $context_len) {
>     my $i = $context_len - $ambig_len - 1;
>     my $j = @hunk - $ambig_len - 1;
>     if ($hunk[$j] =~ /^\+/ && substr($hunk[$i], 1) eq substr($hunk[$j], 1)) {
>       $ambig_len++;
>     } else {
>       last;
>     }
>   }
>
>   # Now look for an empty line in the ambiguous portion (we can just look in
>   # the context side, as it is equivalent to the addition side at the end).
>   # We count down, though, as we prefer to use the line closest to the
>   # hunk as the cutoff.
>   my $empty;
>   for (my $i = $context_len - 1; $i >= $context_len - $ambig_len; $i--) {
>     if (length($hunk[$i]) == 2) {
>       $empty = $i;
>       last;
>     }
>   }
>
>   if (defined $empty) {
>     # move empty lines after the chunk to be part of it
>     for (my $i = $empty + 1; $i < $context_len; $i++) {
>       $hunk[$i] =~ s/^ /+/;
>       $hunk[@hunk - $context_len + $i] =~ s/^\+/ /;
>     }
>   }
>
>   print @hunk;
>   @hunk = ();
> }

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: weird diff output?
  2016-03-30  6:05           ` Jacob Keller
@ 2016-03-30 19:14             ` Jacob Keller
  2016-03-30 19:31               ` Jacob Keller
  0 siblings, 1 reply; 27+ messages in thread
From: Jacob Keller @ 2016-03-30 19:14 UTC (permalink / raw)
  To: Jeff King; +Cc: Stefan Beller, Junio C Hamano, Git mailing list, Jens Lehmann

On Tue, Mar 29, 2016 at 11:05 PM, Jacob Keller <jacob.keller@gmail.com> wrote:
> On Tue, Mar 29, 2016 at 9:55 PM, Jeff King <peff@peff.net> wrote:
>> One thing I like to do when playing with new diff ideas is to pipe all
>> of "log -p" for a real project through it and see what differences it
>> produces.
>>
>
> Great idea!
>
>> Below is a perl script that implements Stefan's heuristic. I checked its
>> output on git.git with:
>>
>>   git log --format='commit %H' -p >old
>>   perl /path/to/script <old >new
>>   diff -F ^commit -u old new | less
>>
>> which shows the differences, with the commit id in the hunk header
>> (which makes it easy to "git show $commit | perl /path/to/script" to
>> see the new diff with more context.
>>
>
> I'll try to run this  against my projects and see what it looks like
> to see if I can spot (m)any counter examples, which would indicate
> it's a bad idea. I may have some time in the next few days to see how
> hard it would be to fully integrate it into the diff machinery too.
>
> Thanks for the help!
>
> Regards,
> Jake
>

I ran this on a few of my local projects and it doesn't seem to
produce any false positives so far. Everything looks good. Of course
this is with just traditional C code. I am currently trying to run
this against the history of Linux as well and see if I can find
anything that seems bad there.

Thanks,
Jake

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: weird diff output?
  2016-03-30 19:14             ` Jacob Keller
@ 2016-03-30 19:31               ` Jacob Keller
  2016-03-30 19:40                 ` Stefan Beller
  2016-03-31 13:47                 ` Jeff King
  0 siblings, 2 replies; 27+ messages in thread
From: Jacob Keller @ 2016-03-30 19:31 UTC (permalink / raw)
  To: Jeff King; +Cc: Stefan Beller, Junio C Hamano, Git mailing list, Jens Lehmann

On Wed, Mar 30, 2016 at 12:14 PM, Jacob Keller <jacob.keller@gmail.com> wrote:
> I ran this on a few of my local projects and it doesn't seem to
> produce any false positives so far. Everything looks good. Of course
> this is with just traditional C code. I am currently trying to run
> this against the history of Linux as well and see if I can find
> anything that seems bad there.
>
> Thanks,
> Jake

So far I've only found a single location that ends up looking worse
within the Linux kernel. Diffs of some Kbuild settings result in
something like

before:

          If unsure, say Y.
+
+config RMI4_I2C
+       tristate "RMI4 I2C Support"
+       depends on RMI4_CORE && I2C
+       help
+         Say Y here if you want to support RMI4 devices connected to an I2C
+         bus.
+
+         If unsure, say Y.

after:

          required for all RMI4 device support.

+         If unsure, say Y.
+
+config RMI4_I2C
+       tristate "RMI4 I2C Support"
+       depends on RMI4_CORE && I2C
+       help
+         Say Y here if you want to support RMI4 devices connected to an I2C
+         bus.
+
          If unsure, say Y.

So in this particular instance which has multiple blank lines and is a
similar issue as with Stefan's note above, this is where the heuristic
falls apart. At least for C code this is basically vanishingly small
compared to the number of comment header fix ups.

I think it may be that Stefan's suggestions above may be on the right
track to resolve that too.

Regards,
Jake

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: weird diff output?
  2016-03-30 19:31               ` Jacob Keller
@ 2016-03-30 19:40                 ` Stefan Beller
  2016-04-01 19:04                   ` Junio C Hamano
  2016-03-31 13:47                 ` Jeff King
  1 sibling, 1 reply; 27+ messages in thread
From: Stefan Beller @ 2016-03-30 19:40 UTC (permalink / raw)
  To: Jacob Keller; +Cc: Jeff King, Junio C Hamano, Git mailing list, Jens Lehmann

On Wed, Mar 30, 2016 at 12:31 PM, Jacob Keller <jacob.keller@gmail.com> wrote:
>
>           If unsure, say Y.
> +
> +config RMI4_I2C
> +       tristate "RMI4 I2C Support"
> +       depends on RMI4_CORE && I2C
> +       help
> +         Say Y here if you want to support RMI4 devices connected to an I2C
> +         bus.
> +
> +         If unsure, say Y.
>
> after:
>
>           required for all RMI4 device support.
>
> +         If unsure, say Y.
> +
> +config RMI4_I2C
> +       tristate "RMI4 I2C Support"
> +       depends on RMI4_CORE && I2C
> +       help
> +         Say Y here if you want to support RMI4 devices connected to an I2C
> +         bus.
> +
>           If unsure, say Y.

The optimum would be:

  >
  >           If unsure, say Y.
  >
  > +config RMI4_I2C
  > +       tristate "RMI4 I2C Support"
  > +       depends on RMI4_CORE && I2C
  > +       help
  > +         Say Y here if you want to support RMI4 devices connected to an I2C
  > +         bus.
  > +
  > +         If unsure, say Y.
  > +
  >  config BLA_I2C

The overlapping lines:
  > +
  > +         If unsure, say Y.
  > +

However that broke the lines at the first empty line, not the last
as Jeff claimed it. (Could there be a problem in the perl script when
empty lines are at the first or last overlapping line?)

Thanks for going through examples!
(I would, too. But fixing a submodule regression is more important
now; I only develop new features when there are no known regressions
caused by me)

Thanks,
Stefan

>
> So in this particular instance which has multiple blank lines and is a
> similar issue as with Stefan's note above, this is where the heuristic
> falls apart. At least for C code this is basically vanishingly small
> compared to the number of comment header fix ups.
>
> I think it may be that Stefan's suggestions above may be on the right
> track to resolve that too.
>
> Regards,
> Jake

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: weird diff output?
  2016-03-30 19:31               ` Jacob Keller
  2016-03-30 19:40                 ` Stefan Beller
@ 2016-03-31 13:47                 ` Jeff King
  2016-04-06 17:47                   ` Jacob Keller
  1 sibling, 1 reply; 27+ messages in thread
From: Jeff King @ 2016-03-31 13:47 UTC (permalink / raw)
  To: Jacob Keller
  Cc: Stefan Beller, Junio C Hamano, Git mailing list, Jens Lehmann

On Wed, Mar 30, 2016 at 12:31:30PM -0700, Jacob Keller wrote:

> So far I've only found a single location that ends up looking worse
> within the Linux kernel. Diffs of some Kbuild settings result in
> something like
> 
> before:
> 
>           If unsure, say Y.
> +
> +config RMI4_I2C
> +       tristate "RMI4 I2C Support"
> +       depends on RMI4_CORE && I2C
> +       help
> +         Say Y here if you want to support RMI4 devices connected to an I2C
> +         bus.
> +
> +         If unsure, say Y.
> 
> after:
> 
>           required for all RMI4 device support.
> 
> +         If unsure, say Y.
> +
> +config RMI4_I2C
> +       tristate "RMI4 I2C Support"
> +       depends on RMI4_CORE && I2C
> +       help
> +         Say Y here if you want to support RMI4 devices connected to an I2C
> +         bus.
> +
>           If unsure, say Y.
> 
> So in this particular instance which has multiple blank lines and is a
> similar issue as with Stefan's note above, this is where the heuristic
> falls apart. At least for C code this is basically vanishingly small
> compared to the number of comment header fix ups.
> 
> I think it may be that Stefan's suggestions above may be on the right
> track to resolve that too.

This is a tricky one. There _aren't_ actually multiple blank lines in
the ambiguous area, because this particular example comes at the very
end of the file. Try:

  git show 8d99758dee3 drivers/input/rmi4/Kconfig

which adds a block in the middle of the file. It looks good both before
and after running through the script. Now look at this example:

  git show fdf51604f10 drivers/input/rmi4/Kconfig

which looks like:

diff --git a/drivers/input/rmi4/Kconfig b/drivers/input/rmi4/Kconfig
index 5ea60e3..cc3f7c5 100644
--- a/drivers/input/rmi4/Kconfig
+++ b/drivers/input/rmi4/Kconfig
@@ -8,3 +8,12 @@ config RMI4_CORE
          required for all RMI4 device support.
 
          If unsure, say Y.
+
+config RMI4_I2C
+       tristate "RMI4 I2C Support"
+       depends on RMI4_CORE && I2C
+       help
+         Say Y here if you want to support RMI4 devices connected to an I2C
+         bus.
+
+         If unsure, say Y.


Note that there is no trailing context, as we're adding at the end of
the file. So the ambiguous portion consists of only two lines: an empty
line, and "If unsure...". And we bump the latter to the top, per the
heuristic (it's the exact opposite of every other case, where the blank
line is a true delimiter).

As a human, I think the indentation here is the real syntactic clue. But
getting into indentation heuristics is probably insane.

The script could probably make this work by disabling itself if the hunk
is at the end of the diffed file (i.e., we don't see more context lines
afterward). That covers any case like this where newline _is_ a
delimiter, but we just have some internal newlines, too. It wouldn't
cover the case where we had internal newlines but used some other
paragraph delimiter, but based on the results so far, that seems rather
rare.

Something like this:

--- foo.pl.orig	2016-03-31 09:44:44.281232230 -0400
+++ foo.pl	2016-03-31 09:44:34.901232632 -0400
@@ -24,13 +24,15 @@
       push @hunk, $_;
       $state = STATE_IN_CHUNK;
     } else {
-      flush_hunk();
+      print @hunk;
+      @hunk = ();
       $state = STATE_NONE;
       print;
     }
   }
 }
-flush_hunk();
+print @hunk;
+@hunk = ();
 
 sub flush_hunk {
   my $context_len = 0;

-Peff

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: weird diff output?
  2016-03-30 19:40                 ` Stefan Beller
@ 2016-04-01 19:04                   ` Junio C Hamano
  0 siblings, 0 replies; 27+ messages in thread
From: Junio C Hamano @ 2016-04-01 19:04 UTC (permalink / raw)
  To: Stefan Beller; +Cc: Jacob Keller, Jeff King, Git mailing list, Jens Lehmann

Stefan Beller <sbeller@google.com> writes:

> Thanks for going through examples!
> (I would, too. But fixing a submodule regression is more important
> now; I only develop new features when there are no known regressions
> caused by me)

This is a tangent but perhaps as an experiment perhaps we can try it
as the rule for everybody to adopt for one cycle, and see if that
improves the quality of the end-user experience?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: weird diff output?
  2016-03-31 13:47                 ` Jeff King
@ 2016-04-06 17:47                   ` Jacob Keller
  2016-04-12 19:34                     ` Stefan Beller
  0 siblings, 1 reply; 27+ messages in thread
From: Jacob Keller @ 2016-04-06 17:47 UTC (permalink / raw)
  To: Jeff King; +Cc: Stefan Beller, Junio C Hamano, Git mailing list, Jens Lehmann

On Thu, Mar 31, 2016 at 6:47 AM, Jeff King <peff@peff.net> wrote:
> On Wed, Mar 30, 2016 at 12:31:30PM -0700, Jacob Keller wrote:
>
>> So far I've only found a single location that ends up looking worse
>> within the Linux kernel. Diffs of some Kbuild settings result in
>> something like
>>
>> before:
>>
>>           If unsure, say Y.
>> +
>> +config RMI4_I2C
>> +       tristate "RMI4 I2C Support"
>> +       depends on RMI4_CORE && I2C
>> +       help
>> +         Say Y here if you want to support RMI4 devices connected to an I2C
>> +         bus.
>> +
>> +         If unsure, say Y.
>>
>> after:
>>
>>           required for all RMI4 device support.
>>
>> +         If unsure, say Y.
>> +
>> +config RMI4_I2C
>> +       tristate "RMI4 I2C Support"
>> +       depends on RMI4_CORE && I2C
>> +       help
>> +         Say Y here if you want to support RMI4 devices connected to an I2C
>> +         bus.
>> +
>>           If unsure, say Y.
>>
>> So in this particular instance which has multiple blank lines and is a
>> similar issue as with Stefan's note above, this is where the heuristic
>> falls apart. At least for C code this is basically vanishingly small
>> compared to the number of comment header fix ups.
>>
>> I think it may be that Stefan's suggestions above may be on the right
>> track to resolve that too.
>
> This is a tricky one. There _aren't_ actually multiple blank lines in
> the ambiguous area, because this particular example comes at the very
> end of the file. Try:
>
>   git show 8d99758dee3 drivers/input/rmi4/Kconfig
>
> which adds a block in the middle of the file. It looks good both before
> and after running through the script. Now look at this example:
>
>   git show fdf51604f10 drivers/input/rmi4/Kconfig
>
> which looks like:
>
> diff --git a/drivers/input/rmi4/Kconfig b/drivers/input/rmi4/Kconfig
> index 5ea60e3..cc3f7c5 100644
> --- a/drivers/input/rmi4/Kconfig
> +++ b/drivers/input/rmi4/Kconfig
> @@ -8,3 +8,12 @@ config RMI4_CORE
>           required for all RMI4 device support.
>
>           If unsure, say Y.
> +
> +config RMI4_I2C
> +       tristate "RMI4 I2C Support"
> +       depends on RMI4_CORE && I2C
> +       help
> +         Say Y here if you want to support RMI4 devices connected to an I2C
> +         bus.
> +
> +         If unsure, say Y.
>
>
> Note that there is no trailing context, as we're adding at the end of
> the file. So the ambiguous portion consists of only two lines: an empty
> line, and "If unsure...". And we bump the latter to the top, per the
> heuristic (it's the exact opposite of every other case, where the blank
> line is a true delimiter).
>
> As a human, I think the indentation here is the real syntactic clue. But
> getting into indentation heuristics is probably insane.
>
> The script could probably make this work by disabling itself if the hunk
> is at the end of the diffed file (i.e., we don't see more context lines
> afterward). That covers any case like this where newline _is_ a
> delimiter, but we just have some internal newlines, too. It wouldn't
> cover the case where we had internal newlines but used some other
> paragraph delimiter, but based on the results so far, that seems rather
> rare.
>
> Something like this:
>
> --- foo.pl.orig 2016-03-31 09:44:44.281232230 -0400
> +++ foo.pl      2016-03-31 09:44:34.901232632 -0400
> @@ -24,13 +24,15 @@
>        push @hunk, $_;
>        $state = STATE_IN_CHUNK;
>      } else {
> -      flush_hunk();
> +      print @hunk;
> +      @hunk = ();
>        $state = STATE_NONE;
>        print;
>      }
>    }
>  }
> -flush_hunk();
> +print @hunk;
> +@hunk = ();
>
>  sub flush_hunk {
>    my $context_len = 0;
>
> -Peff

I started attempting to implement this heuristic within xdiff, but I
am at a loss as to how xdiff actually works. I suspect this would go
in xdi_change_compact or after it, but I really don't understand how
xdiff represents the diffs at all...

Thanks,
Jake

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: weird diff output?
  2016-04-06 17:47                   ` Jacob Keller
@ 2016-04-12 19:34                     ` Stefan Beller
  2016-04-14 13:56                       ` Davide Libenzi
  0 siblings, 1 reply; 27+ messages in thread
From: Stefan Beller @ 2016-04-12 19:34 UTC (permalink / raw)
  To: Jacob Keller, davidel
  Cc: Jeff King, Junio C Hamano, Git mailing list, Jens Lehmann

On Wed, Apr 6, 2016 at 10:47 AM, Jacob Keller <jacob.keller@gmail.com> wrote:
>
> I started attempting to implement this heuristic within xdiff, but I
> am at a loss as to how xdiff actually works. I suspect this would go
> in xdi_change_compact or after it, but I really don't understand how
> xdiff represents the diffs at all...

I agree that this seems like the right place.

On the off chance that David, the author of xdiff remembers that
part, I cc'd him. (The whole discussion on better diffs is found at
http://thread.gmane.org/gmane.comp.version-control.git/290093)

Thanks,
Stefan



>
> Thanks,
> Jake

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: weird diff output?
  2016-04-12 19:34                     ` Stefan Beller
@ 2016-04-14 13:56                       ` Davide Libenzi
  2016-04-14 18:34                         ` Jeff King
  0 siblings, 1 reply; 27+ messages in thread
From: Davide Libenzi @ 2016-04-14 13:56 UTC (permalink / raw)
  To: Stefan Beller
  Cc: Jacob Keller, davidel, Jeff King, Junio C Hamano,
	Git mailing list, Jens Lehmann

On Tue, 12 Apr 2016, Stefan Beller wrote:

> On Wed, Apr 6, 2016 at 10:47 AM, Jacob Keller <jacob.keller@gmail.com> wrote:
> >
> > I started attempting to implement this heuristic within xdiff, but I
> > am at a loss as to how xdiff actually works. I suspect this would go
> > in xdi_change_compact or after it, but I really don't understand how
> > xdiff represents the diffs at all...
> 
> I agree that this seems like the right place.
> 
> On the off chance that David, the author of xdiff remembers that
> part, I cc'd him. (The whole discussion on better diffs is found at
> http://thread.gmane.org/gmane.comp.version-control.git/290093)

That was a zillions of years ago :) , but from a quick look at email 
thread, if you want to do it within xdiff, xdi_change_compact would be the 
place.
The issue is knowing in which situations one diff look better than 
another, and embedding an if-tis-do-tat logic deep into the core diff 
machinery.
In theory one could implement the same thing higher up, working with the 
unified diff text format, where maybe a user can provide its own diff 
post-process hook script.
In any case, that still leaves open the issue on what to shift in the diff 
chunks, and in which cases. Which is likely going to be language/format 
dependent. IMHO, it gets nasty pretty quickly.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: weird diff output?
  2016-04-14 13:56                       ` Davide Libenzi
@ 2016-04-14 18:34                         ` Jeff King
  2016-04-14 21:05                           ` Stefan Beller
  0 siblings, 1 reply; 27+ messages in thread
From: Jeff King @ 2016-04-14 18:34 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Stefan Beller, Jacob Keller, Junio C Hamano, Git mailing list,
	Jens Lehmann

On Thu, Apr 14, 2016 at 06:56:39AM -0700, Davide Libenzi wrote:

> That was a zillions of years ago :) , but from a quick look at email
> thread, if you want to do it within xdiff, xdi_change_compact would be
> the place.  The issue is knowing in which situations one diff look
> better than another, and embedding an if-tis-do-tat logic deep into
> the core diff machinery.  In theory one could implement the same thing
> higher up, working with the unified diff text format, where maybe a
> user can provide its own diff post-process hook script.  In any case,
> that still leaves open the issue on what to shift in the diff chunks,
> and in which cases. Which is likely going to be language/format
> dependent. IMHO, it gets nasty pretty quickly.

Thanks, that's helpful. Stefan already came up with a heuristic that I
implemented as a post-processing script in perl. It _seems_ to work
pretty well in practice across multiple languages, so our next step was
to implement it in an actual usable and efficient way. :)

Looking over the code, I agree that xdl_change_compact() is the place we
would want to put it. We'd probably tie it to a command-line option and
let people play around with it, and then consider making it the default
if there's widespread approval.

-Peff

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: weird diff output?
  2016-04-14 18:34                         ` Jeff King
@ 2016-04-14 21:05                           ` Stefan Beller
  2016-04-15  0:07                             ` [RFC PATCH, WAS: "weird diff output?"] Implement better chunk heuristics Stefan Beller
                                               ` (2 more replies)
  0 siblings, 3 replies; 27+ messages in thread
From: Stefan Beller @ 2016-04-14 21:05 UTC (permalink / raw)
  To: Jeff King
  Cc: Davide Libenzi, Jacob Keller, Junio C Hamano, Git mailing list,
	Jens Lehmann

On Thu, Apr 14, 2016 at 11:34 AM, Jeff King <peff@peff.net> wrote:
> On Thu, Apr 14, 2016 at 06:56:39AM -0700, Davide Libenzi wrote:
>
>> That was a zillions of years ago :) , but from a quick look at email
>> thread, if you want to do it within xdiff, xdi_change_compact would be
>> the place.  The issue is knowing in which situations one diff look
>> better than another, and embedding an if-tis-do-tat logic deep into
>> the core diff machinery.  In theory one could implement the same thing
>> higher up, working with the unified diff text format, where maybe a
>> user can provide its own diff post-process hook script.  In any case,
>> that still leaves open the issue on what to shift in the diff chunks,
>> and in which cases. Which is likely going to be language/format
>> dependent. IMHO, it gets nasty pretty quickly.
>
> Thanks, that's helpful. Stefan already came up with a heuristic that I
> implemented as a post-processing script in perl. It _seems_ to work
> pretty well in practice across multiple languages, so our next step was
> to implement it in an actual usable and efficient way. :)

To reiterate the heuristic for Davide (so you can avoid reading the
whole thread):

    If there are diff chunks, which can be shifted around, shift it such that
    the last empty line is below the chunk and the rest above.

Example:
(indented, shiftable part marked with Xs)

        diff --git a/test.c b/test.c
        index 2d7f343..2a14d36 100644
        --- a/test.c
        +++ b/test.c
        @@ -8,6 +8,14 @@ void A()
         }

         /**
        + * This is text.
        + */
        +void B()
        +{
        +  text text
X1      +}
X2      +
X3      +/**
          * This does 'foo foo'.
          */
         void C()

The last empty line is X2, so that's where we wrap:
(X2 is the last line of the diff)

        diff --git a/test.c b/test.c
        index 2d7f343..2a14d36 100644
        --- a/test.c
        +++ b/test.c
        @@ -8,6 +8,14 @@ void A()
         }

X3      +/**
        + * This is text.
        + */
        +void B()
        +{
        +  text text
X1      +}
X2      +
         /**
          * This does 'foo foo'.
          */
         void C()


>
> Looking over the code, I agree that xdl_change_compact() is the place we
> would want to put it. We'd probably tie it to a command-line option and
> let people play around with it, and then consider making it the default
> if there's widespread approval.

I just stumbled upon
http://blog.scoutapp.com/articles/2016/04/12/3-git-productivity-hacks
which advertises git config --global pager.diff "diff-so-fancy | less
--tabs=4 -RFX"

Would you consider your perl script good enough to put that instead of
diff-so-fancy?

>
> -Peff

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [RFC PATCH, WAS: "weird diff output?"] Implement better chunk heuristics.
  2016-04-14 21:05                           ` Stefan Beller
@ 2016-04-15  0:07                             ` Stefan Beller
  2016-04-15  0:26                               ` Jacob Keller
  2016-04-15  2:09                               ` Junio C Hamano
  2016-04-15  0:21                             ` weird diff output? Jacob Keller
  2016-04-15  2:18                             ` Jeff King
  2 siblings, 2 replies; 27+ messages in thread
From: Stefan Beller @ 2016-04-15  0:07 UTC (permalink / raw)
  Cc: peff, davidel, jacob.keller, gitster, git, Jens.Lehmann, Stefan Beller

TODO(sbeller):
* describe the discussion on why this is better
* see if this can be tested?

Signed-off-by: Stefan Beller <sbeller@google.com>
---
 xdiff/xdiffi.c | 39 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
index 2358a2d..24eb9a0 100644
--- a/xdiff/xdiffi.c
+++ b/xdiff/xdiffi.c
@@ -400,9 +400,16 @@ static xdchange_t *xdl_add_change(xdchange_t *xscr, long i1, long i2, long chg1,
 }
 
 
+static int starts_with_emptyline(const char *recs)
+{
+	return recs[0] == '\n'; /* CRLF not covered here */
+}
+
+
 int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
 	long ix, ixo, ixs, ixref, grpsiz, nrec = xdf->nrec;
 	char *rchg = xdf->rchg, *rchgo = xdfo->rchg;
+	unsigned char has_emptyline;
 	xrecord_t **recs = xdf->recs;
 
 	/*
@@ -436,6 +443,7 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
 
 		do {
 			grpsiz = ix - ixs;
+			has_emptyline = 0;
 
 			/*
 			 * If the line before the current change group, is equal to
@@ -447,6 +455,8 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
 				rchg[--ixs] = 1;
 				rchg[--ix] = 0;
 
+				has_emptyline |=
+					starts_with_emptyline(recs[ix]->ptr);
 				/*
 				 * This change might have joined two change groups,
 				 * so we try to take this scenario in account by moving
@@ -475,6 +485,9 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
 				rchg[ixs++] = 0;
 				rchg[ix++] = 1;
 
+				has_emptyline |=
+					starts_with_emptyline(recs[ix]->ptr);
+
 				/*
 				 * This change might have joined two change groups,
 				 * so we try to take this scenario in account by moving
@@ -498,6 +511,32 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
 			rchg[--ix] = 0;
 			while (rchgo[--ixo]);
 		}
+
+		/*
+		 * If a group can be moved back and forth, see if there is an
+		 * empty line in the moving space. If there is an empty line,
+		 * make sure the last empty line is the end of the group.
+		 *
+		 * As we shifted the group forward as far as possible, we only
+		 * need to shift it back if at all.
+		 */
+		if (has_emptyline) {
+			while (ixs > 0 && recs[ixs - 1]->ha == recs[ix - 1]->ha &&
+			       xdl_recmatch(recs[ixs - 1]->ptr, recs[ixs - 1]->size, recs[ix - 1]->ptr, recs[ix - 1]->size, flags) &&
+			       !starts_with_emptyline(recs[ix - 1]->ptr)) {
+				rchg[--ixs] = 1;
+				rchg[--ix] = 0;
+
+				/*
+				 * This change did not join two change groups,
+				 * as we did that before already, so there is no
+				 * need to adapt the other-file, i.e.
+				 * running
+				 *     for (; rchg[ixs - 1]; ixs--);
+				 *     while (rchgo[--ixo]);
+				 */
+			}
+		}
 	}
 
 	return 0;
-- 
2.8.1.474.gffdc890.dirty

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: weird diff output?
  2016-04-14 21:05                           ` Stefan Beller
  2016-04-15  0:07                             ` [RFC PATCH, WAS: "weird diff output?"] Implement better chunk heuristics Stefan Beller
@ 2016-04-15  0:21                             ` Jacob Keller
  2016-04-15  2:18                             ` Jeff King
  2 siblings, 0 replies; 27+ messages in thread
From: Jacob Keller @ 2016-04-15  0:21 UTC (permalink / raw)
  To: Stefan Beller
  Cc: Jeff King, Davide Libenzi, Junio C Hamano, Git mailing list,
	Jens Lehmann

On Thu, Apr 14, 2016 at 2:05 PM, Stefan Beller <sbeller@google.com> wrote:
> On Thu, Apr 14, 2016 at 11:34 AM, Jeff King <peff@peff.net> wrote:
>> On Thu, Apr 14, 2016 at 06:56:39AM -0700, Davide Libenzi wrote:
>>
>>> That was a zillions of years ago :) , but from a quick look at email
>>> thread, if you want to do it within xdiff, xdi_change_compact would be
>>> the place.  The issue is knowing in which situations one diff look
>>> better than another, and embedding an if-tis-do-tat logic deep into
>>> the core diff machinery.  In theory one could implement the same thing
>>> higher up, working with the unified diff text format, where maybe a
>>> user can provide its own diff post-process hook script.  In any case,
>>> that still leaves open the issue on what to shift in the diff chunks,
>>> and in which cases. Which is likely going to be language/format
>>> dependent. IMHO, it gets nasty pretty quickly.
>>
>> Thanks, that's helpful. Stefan already came up with a heuristic that I
>> implemented as a post-processing script in perl. It _seems_ to work
>> pretty well in practice across multiple languages, so our next step was
>> to implement it in an actual usable and efficient way. :)
>
> To reiterate the heuristic for Davide (so you can avoid reading the
> whole thread):
>
>     If there are diff chunks, which can be shifted around, shift it such that
>     the last empty line is below the chunk and the rest above.
>
> Example:
> (indented, shiftable part marked with Xs)
>
>         diff --git a/test.c b/test.c
>         index 2d7f343..2a14d36 100644
>         --- a/test.c
>         +++ b/test.c
>         @@ -8,6 +8,14 @@ void A()
>          }
>
>          /**
>         + * This is text.
>         + */
>         +void B()
>         +{
>         +  text text
> X1      +}
> X2      +
> X3      +/**
>           * This does 'foo foo'.
>           */
>          void C()
>
> The last empty line is X2, so that's where we wrap:
> (X2 is the last line of the diff)
>
>         diff --git a/test.c b/test.c
>         index 2d7f343..2a14d36 100644
>         --- a/test.c
>         +++ b/test.c
>         @@ -8,6 +8,14 @@ void A()
>          }
>
> X3      +/**
>         + * This is text.
>         + */
>         +void B()
>         +{
>         +  text text
> X1      +}
> X2      +
>          /**
>           * This does 'foo foo'.
>           */
>          void C()
>
>
>>
>> Looking over the code, I agree that xdl_change_compact() is the place we
>> would want to put it. We'd probably tie it to a command-line option and
>> let people play around with it, and then consider making it the default
>> if there's widespread approval.
>
> I just stumbled upon
> http://blog.scoutapp.com/articles/2016/04/12/3-git-productivity-hacks
> which advertises git config --global pager.diff "diff-so-fancy | less
> --tabs=4 -RFX"
>
> Would you consider your perl script good enough to put that instead of
> diff-so-fancy?
>

Interesting. I'll play around with that for a bit and see how fast it appears.

Thanks,
Jake

>>
>> -Peff

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH, WAS: "weird diff output?"] Implement better chunk heuristics.
  2016-04-15  0:07                             ` [RFC PATCH, WAS: "weird diff output?"] Implement better chunk heuristics Stefan Beller
@ 2016-04-15  0:26                               ` Jacob Keller
  2016-04-15  0:43                                 ` Stefan Beller
  2016-04-15  2:09                               ` Junio C Hamano
  1 sibling, 1 reply; 27+ messages in thread
From: Jacob Keller @ 2016-04-15  0:26 UTC (permalink / raw)
  To: Stefan Beller
  Cc: Jeff King, Davide Libenzi, Junio C Hamano, Git mailing list,
	Jens Lehmann

On Thu, Apr 14, 2016 at 5:07 PM, Stefan Beller <sbeller@google.com> wrote:
> TODO(sbeller):
> * describe the discussion on why this is better
> * see if this can be tested?
>

Thanks for taking time to do this! It looks like a few things are
still missing, CRLF obviously, and making it a configuration option.

> Signed-off-by: Stefan Beller <sbeller@google.com>
> ---
>  xdiff/xdiffi.c | 39 +++++++++++++++++++++++++++++++++++++++
>  1 file changed, 39 insertions(+)
>
> diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
> index 2358a2d..24eb9a0 100644
> --- a/xdiff/xdiffi.c
> +++ b/xdiff/xdiffi.c
> @@ -400,9 +400,16 @@ static xdchange_t *xdl_add_change(xdchange_t *xscr, long i1, long i2, long chg1,
>  }
>
>
> +static int starts_with_emptyline(const char *recs)
> +{
> +       return recs[0] == '\n'; /* CRLF not covered here */
> +}
> +
> +
>  int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
>         long ix, ixo, ixs, ixref, grpsiz, nrec = xdf->nrec;
>         char *rchg = xdf->rchg, *rchgo = xdfo->rchg;
> +       unsigned char has_emptyline;
>         xrecord_t **recs = xdf->recs;
>
>         /*
> @@ -436,6 +443,7 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
>
>                 do {
>                         grpsiz = ix - ixs;
> +                       has_emptyline = 0;
>
>                         /*
>                          * If the line before the current change group, is equal to
> @@ -447,6 +455,8 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
>                                 rchg[--ixs] = 1;
>                                 rchg[--ix] = 0;
>
> +                               has_emptyline |=
> +                                       starts_with_emptyline(recs[ix]->ptr);

I assume you're doing |= so that we don't overwrite the empty line
setting each loop here to 0 when it's false? That's a bit subtle, and
it took me a moment to figure out, since I am used to thinking of it
as bitwise | and not a boolean or like we're intending here (though
obviously we're using bitwise to perform that intended behavior).

>                                 /*
>                                  * This change might have joined two change groups,
>                                  * so we try to take this scenario in account by moving
> @@ -475,6 +485,9 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
>                                 rchg[ixs++] = 0;
>                                 rchg[ix++] = 1;
>
> +                               has_emptyline |=
> +                                       starts_with_emptyline(recs[ix]->ptr);
> +
>                                 /*
>                                  * This change might have joined two change groups,
>                                  * so we try to take this scenario in account by moving
> @@ -498,6 +511,32 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
>                         rchg[--ix] = 0;
>                         while (rchgo[--ixo]);
>                 }
> +
> +               /*
> +                * If a group can be moved back and forth, see if there is an
> +                * empty line in the moving space. If there is an empty line,
> +                * make sure the last empty line is the end of the group.
> +                *
> +                * As we shifted the group forward as far as possible, we only
> +                * need to shift it back if at all.
> +                */
> +               if (has_emptyline) {
> +                       while (ixs > 0 && recs[ixs - 1]->ha == recs[ix - 1]->ha &&
> +                              xdl_recmatch(recs[ixs - 1]->ptr, recs[ixs - 1]->size, recs[ix - 1]->ptr, recs[ix - 1]->size, flags) &&
> +                              !starts_with_emptyline(recs[ix - 1]->ptr)) {
> +                               rchg[--ixs] = 1;
> +                               rchg[--ix] = 0;
> +
> +                               /*
> +                                * This change did not join two change groups,
> +                                * as we did that before already, so there is no
> +                                * need to adapt the other-file, i.e.
> +                                * running
> +                                *     for (; rchg[ixs - 1]; ixs--);
> +                                *     while (rchgo[--ixo]);
> +                                */
> +                       }
> +               }
>         }

And this was the more difficult part which I wasn't able to fully
understand how to do. It seems pretty reasonable. I think we can make
it configurable by using a new XDIFF flag similar to how we handle
various diff options like the different diff algorithms, and then we
could add tests specific to ensure that the flag enables the behavior
we want on some known test cases.

I am not really sure how to thoroughly test it beyond that though.

Regards,
Jake

>
>         return 0;
> --
> 2.8.1.474.gffdc890.dirty
>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH, WAS: "weird diff output?"] Implement better chunk heuristics.
  2016-04-15  0:26                               ` Jacob Keller
@ 2016-04-15  0:43                                 ` Stefan Beller
  2016-04-15  2:07                                   ` Jacob Keller
  0 siblings, 1 reply; 27+ messages in thread
From: Stefan Beller @ 2016-04-15  0:43 UTC (permalink / raw)
  To: Jacob Keller
  Cc: Jeff King, Davide Libenzi, Junio C Hamano, Git mailing list,
	Jens Lehmann

On Thu, Apr 14, 2016 at 5:26 PM, Jacob Keller <jacob.keller@gmail.com> wrote:
> On Thu, Apr 14, 2016 at 5:07 PM, Stefan Beller <sbeller@google.com> wrote:
>> TODO(sbeller):
>> * describe the discussion on why this is better
>> * see if this can be tested?
>>
>
> Thanks for taking time to do this! It looks like a few things are
> still missing, CRLF obviously, and making it a configuration option.

I mainly wanted to get this out in the world quickly as it took me a while to
understand the code. Do you know the feeling when you stare at code
for hours and debug it and read headers to make sense of these
cryptic variables and then after intensive thinking a clear image emerges?

I put comments into the loop to convey my thought process on why that
is enough code doing the job. So I'd be happy about a critical review. :)

>
>> Signed-off-by: Stefan Beller <sbeller@google.com>
>> ---
>>  xdiff/xdiffi.c | 39 +++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 39 insertions(+)
>>
>> diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
>> index 2358a2d..24eb9a0 100644
>> --- a/xdiff/xdiffi.c
>> +++ b/xdiff/xdiffi.c
>> @@ -400,9 +400,16 @@ static xdchange_t *xdl_add_change(xdchange_t *xscr, long i1, long i2, long chg1,
>>  }
>>
>>
>> +static int starts_with_emptyline(const char *recs)
>> +{
>> +       return recs[0] == '\n'; /* CRLF not covered here */
>> +}
>> +
>> +
>>  int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
>>         long ix, ixo, ixs, ixref, grpsiz, nrec = xdf->nrec;
>>         char *rchg = xdf->rchg, *rchgo = xdfo->rchg;
>> +       unsigned char has_emptyline;
>>         xrecord_t **recs = xdf->recs;
>>
>>         /*
>> @@ -436,6 +443,7 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
>>
>>                 do {
>>                         grpsiz = ix - ixs;
>> +                       has_emptyline = 0;
>>
>>                         /*
>>                          * If the line before the current change group, is equal to
>> @@ -447,6 +455,8 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
>>                                 rchg[--ixs] = 1;
>>                                 rchg[--ix] = 0;
>>
>> +                               has_emptyline |=
>> +                                       starts_with_emptyline(recs[ix]->ptr);
>
> I assume you're doing |= so that we don't overwrite the empty line
> setting each loop here to 0 when it's false? That's a bit subtle, and
> it took me a moment to figure out, since I am used to thinking of it
> as bitwise | and not a boolean or like we're intending here (though
> obviously we're using bitwise to perform that intended behavior).

Here are my thoughts:
* this loop shifts the group back and forth, "collecting" adjacent groups
  until no more groups are eaten.
* That is why the last iteration of the loop will shift around most
and completely
   cover the relevant area. we could do this in the last iteration
only of this loop.
   But we do not know when the last iteration will be, so do it every time.

   We could also have an extra loop after this loop to do a back and
forth once to look
   for empty lines.

* Yes, the |= should convey:

    if (starts_with_emptyline(...)
        has_emptyline = 1;

We could do += as well. (Then we'd get the count which is still good enough.)

* We do not want to overwrite the has_emptyline for non empty lines in
the inner loop.

* The outer loop doesn't matter as we reset has_emptyline to 0 each
time as explained
   above. Technically we could "has_emptyline = 0;" before the do{ }
while loop, to save
   a little bit of instructions.

* I assumed starts_with_emptyline returns a boolean (though it is int)
  In this code I use unsigned char, which should probably be int as well?

>
>>                                 /*
>>                                  * This change might have joined two change groups,
>>                                  * so we try to take this scenario in account by moving
>> @@ -475,6 +485,9 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
>>                                 rchg[ixs++] = 0;
>>                                 rchg[ix++] = 1;
>>
>> +                               has_emptyline |=
>> +                                       starts_with_emptyline(recs[ix]->ptr);
>> +
>>                                 /*
>>                                  * This change might have joined two change groups,
>>                                  * so we try to take this scenario in account by moving
>> @@ -498,6 +511,32 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
>>                         rchg[--ix] = 0;
>>                         while (rchgo[--ixo]);
>>                 }
>> +
>> +               /*
>> +                * If a group can be moved back and forth, see if there is an
>> +                * empty line in the moving space. If there is an empty line,
>> +                * make sure the last empty line is the end of the group.
>> +                *
>> +                * As we shifted the group forward as far as possible, we only
>> +                * need to shift it back if at all.
>> +                */
>> +               if (has_emptyline) {
>> +                       while (ixs > 0 && recs[ixs - 1]->ha == recs[ix - 1]->ha &&
>> +                              xdl_recmatch(recs[ixs - 1]->ptr, recs[ixs - 1]->size, recs[ix - 1]->ptr, recs[ix - 1]->size, flags) &&
>> +                              !starts_with_emptyline(recs[ix - 1]->ptr)) {
>> +                               rchg[--ixs] = 1;
>> +                               rchg[--ix] = 0;
>> +
>> +                               /*
>> +                                * This change did not join two change groups,
>> +                                * as we did that before already, so there is no
>> +                                * need to adapt the other-file, i.e.
>> +                                * running
>> +                                *     for (; rchg[ixs - 1]; ixs--);
>> +                                *     while (rchgo[--ixo]);
>> +                                */
>> +                       }
>> +               }
>>         }
>
> And this was the more difficult part which I wasn't able to fully
> understand how to do. It seems pretty reasonable. I think we can make
> it configurable by using a new XDIFF flag similar to how we handle
> various diff options like the different diff algorithms, and then we
> could add tests specific to ensure that the flag enables the behavior
> we want on some known test cases.

Ok I'll look into adding a flag for that.

I have no idea what the "recs->ha" is, though (short for hash?),
so I am not quite sure about the condition in the while loop. It was mainly
copied from above where we shift the group backward.

>
> I am not really sure how to thoroughly test it beyond that though.

Thanks for the review!
In case you want to pick it up (partially), feel free to do so. :)

Stefan

>
> Regards,
> Jake
>
>>
>>         return 0;
>> --
>> 2.8.1.474.gffdc890.dirty
>>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH, WAS: "weird diff output?"] Implement better chunk heuristics.
  2016-04-15  0:43                                 ` Stefan Beller
@ 2016-04-15  2:07                                   ` Jacob Keller
  0 siblings, 0 replies; 27+ messages in thread
From: Jacob Keller @ 2016-04-15  2:07 UTC (permalink / raw)
  To: Stefan Beller
  Cc: Jeff King, Davide Libenzi, Junio C Hamano, Git mailing list,
	Jens Lehmann

On Thu, Apr 14, 2016 at 5:43 PM, Stefan Beller <sbeller@google.com> wrote:
> On Thu, Apr 14, 2016 at 5:26 PM, Jacob Keller <jacob.keller@gmail.com> wrote:
>> On Thu, Apr 14, 2016 at 5:07 PM, Stefan Beller <sbeller@google.com> wrote:
>>> TODO(sbeller):
>>> * describe the discussion on why this is better
>>> * see if this can be tested?
>>>
>>
>> Thanks for taking time to do this! It looks like a few things are
>> still missing, CRLF obviously, and making it a configuration option.
>
> I mainly wanted to get this out in the world quickly as it took me a while to
> understand the code. Do you know the feeling when you stare at code
> for hours and debug it and read headers to make sense of these
> cryptic variables and then after intensive thinking a clear image emerges?
>
> I put comments into the loop to convey my thought process on why that
> is enough code doing the job. So I'd be happy about a critical review. :)
>

Yes, I am glad you got it out here in the world. I'll do my best to
review it sometime early tomorrow. I know that feeling, and I tried to
do that for this code and started getting a headache so I stopped for
a bit.

I like the comments they help understand the process.

>>
>>> Signed-off-by: Stefan Beller <sbeller@google.com>
>>> ---
>>>  xdiff/xdiffi.c | 39 +++++++++++++++++++++++++++++++++++++++
>>>  1 file changed, 39 insertions(+)
>>>
>>> diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
>>> index 2358a2d..24eb9a0 100644
>>> --- a/xdiff/xdiffi.c
>>> +++ b/xdiff/xdiffi.c
>>> @@ -400,9 +400,16 @@ static xdchange_t *xdl_add_change(xdchange_t *xscr, long i1, long i2, long chg1,
>>>  }
>>>
>>>
>>> +static int starts_with_emptyline(const char *recs)
>>> +{
>>> +       return recs[0] == '\n'; /* CRLF not covered here */
>>> +}
>>> +
>>> +
>>>  int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
>>>         long ix, ixo, ixs, ixref, grpsiz, nrec = xdf->nrec;
>>>         char *rchg = xdf->rchg, *rchgo = xdfo->rchg;
>>> +       unsigned char has_emptyline;
>>>         xrecord_t **recs = xdf->recs;
>>>
>>>         /*
>>> @@ -436,6 +443,7 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
>>>
>>>                 do {
>>>                         grpsiz = ix - ixs;
>>> +                       has_emptyline = 0;
>>>
>>>                         /*
>>>                          * If the line before the current change group, is equal to
>>> @@ -447,6 +455,8 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
>>>                                 rchg[--ixs] = 1;
>>>                                 rchg[--ix] = 0;
>>>
>>> +                               has_emptyline |=
>>> +                                       starts_with_emptyline(recs[ix]->ptr);
>>
>> I assume you're doing |= so that we don't overwrite the empty line
>> setting each loop here to 0 when it's false? That's a bit subtle, and
>> it took me a moment to figure out, since I am used to thinking of it
>> as bitwise | and not a boolean or like we're intending here (though
>> obviously we're using bitwise to perform that intended behavior).
>
> Here are my thoughts:
> * this loop shifts the group back and forth, "collecting" adjacent groups
>   until no more groups are eaten.
> * That is why the last iteration of the loop will shift around most
> and completely
>    cover the relevant area. we could do this in the last iteration
> only of this loop.
>    But we do not know when the last iteration will be, so do it every time.
>

Ya I think this makes sense, and I think it's better to do it here
than having to do it as a separate loop after the fact.

>    We could also have an extra loop after this loop to do a back and
> forth once to look
>    for empty lines.
>

I think it's better to do it here.

> * Yes, the |= should convey:
>
>     if (starts_with_emptyline(...)
>         has_emptyline = 1;
>
> We could do += as well. (Then we'd get the count which is still good enough.)
>

We might do a += and rename the variable or something so that it's a
bit more clear what wer'e doing.

> * We do not want to overwrite the has_emptyline for non empty lines in
> the inner loop.
>

Right.

> * The outer loop doesn't matter as we reset has_emptyline to 0 each
> time as explained

Yes.

>    above. Technically we could "has_emptyline = 0;" before the do{ }
> while loop, to save
>    a little bit of instructions.
>
> * I assumed starts_with_emptyline returns a boolean (though it is int)
>   In this code I use unsigned char, which should probably be int as well?
>

I think the int is better, ya.

> Ok I'll look into adding a flag for that.
>
> I have no idea what the "recs->ha" is, though (short for hash?),
> so I am not quite sure about the condition in the while loop. It was mainly
> copied from above where we shift the group backward.
>

I don't really know either.

>>
>> I am not really sure how to thoroughly test it beyond that though.
>
> Thanks for the review!
> In case you want to pick it up (partially), feel free to do so. :)
>

I'll probably pick it up sometime tomorrow and try to see how it works and see

> Stefan
>
>>

Thanks again!
Jake

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH, WAS: "weird diff output?"] Implement better chunk heuristics.
  2016-04-15  0:07                             ` [RFC PATCH, WAS: "weird diff output?"] Implement better chunk heuristics Stefan Beller
  2016-04-15  0:26                               ` Jacob Keller
@ 2016-04-15  2:09                               ` Junio C Hamano
  2016-04-15  3:33                                 ` Stefan Beller
  1 sibling, 1 reply; 27+ messages in thread
From: Junio C Hamano @ 2016-04-15  2:09 UTC (permalink / raw)
  To: Stefan Beller

Stefan Beller <sbeller@google.com> writes:

>  
> +static int starts_with_emptyline(const char *recs)
> +{
> +	return recs[0] == '\n'; /* CRLF not covered here */
> +}
> +
> +

That's "is-empty-line", not "starts-with" ;-)

> +
> +		/*
> +		 * If a group can be moved back and forth, see if there is an
> +		 * empty line in the moving space. If there is an empty line,
> +		 * make sure the last empty line is the end of the group.
> +		 *
> +		 * As we shifted the group forward as far as possible, we only
> +		 * need to shift it back if at all.
> +		 */

Sounds sensible.

> +		if (has_emptyline) {
> +			while (ixs > 0 && recs[ixs - 1]->ha == recs[ix - 1]->ha &&
> +			       xdl_recmatch(recs[ixs - 1]->ptr, recs[ixs - 1]->size, recs[ix - 1]->ptr, recs[ix - 1]->size, flags) &&
> +			       !starts_with_emptyline(recs[ix - 1]->ptr)) {

You probably want to wrap the "hash compares equal and recmatch does
say they are the same" into a helper function (to be automatically
inlined by the compiler) to make it more readable here.  I think
is-empty is a lot cheaper than the recmatch so that should probably
be done earlier in the && chain.

> +				rchg[--ixs] = 1;
> +				rchg[--ix] = 0;
> +
> +				/*
> +				 * This change did not join two change groups,
> +				 * as we did that before already, so there is no

Sorry, cannot quite parse the part before "already".

> +				 * need to adapt the other-file, i.e.
> +				 * running
> +				 *     for (; rchg[ixs - 1]; ixs--);
> +				 *     while (rchgo[--ixo]);
> +				 */
> +			}
> +		}
>  	}
>  
>  	return 0;

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: weird diff output?
  2016-04-14 21:05                           ` Stefan Beller
  2016-04-15  0:07                             ` [RFC PATCH, WAS: "weird diff output?"] Implement better chunk heuristics Stefan Beller
  2016-04-15  0:21                             ` weird diff output? Jacob Keller
@ 2016-04-15  2:18                             ` Jeff King
  2 siblings, 0 replies; 27+ messages in thread
From: Jeff King @ 2016-04-15  2:18 UTC (permalink / raw)
  To: Stefan Beller
  Cc: Davide Libenzi, Jacob Keller, Junio C Hamano, Git mailing list,
	Jens Lehmann

On Thu, Apr 14, 2016 at 02:05:03PM -0700, Stefan Beller wrote:

> > Looking over the code, I agree that xdl_change_compact() is the place we
> > would want to put it. We'd probably tie it to a command-line option and
> > let people play around with it, and then consider making it the default
> > if there's widespread approval.
> 
> I just stumbled upon
> http://blog.scoutapp.com/articles/2016/04/12/3-git-productivity-hacks
> which advertises git config --global pager.diff "diff-so-fancy | less
> --tabs=4 -RFX"
> 
> Would you consider your perl script good enough to put that instead of
> diff-so-fancy?

For some definition of "good enough". I don't plan to run it myself. And
I don't use diff-so-fancy. But I think diff-so-fancy folks also tend to
run contrib/diff-highlight, which is written in perl and quite similar
in structure to what I posted earlier (unsurprisingly, since I wrote
it).

So I think it works, and the performance hit from piping through perl
generally isn't bad enough to be a problem (and by definition it's only
running when you would run an interactive pager in the first place).

I don't think that this particular heuristic is in quite the same class
as diff-highlight and diff-so-fancy, though. Those ones transform the
diff away from something that can be applied, so you really do just want
them for human viewing. But this new heuristic is something that you'd
probably want as part of format-patch, for example, and we don't
generally kick in the pager there. So I think it would be much more
natural inside of the diff generation.

-Peff

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH, WAS: "weird diff output?"] Implement better chunk heuristics.
  2016-04-15  2:09                               ` Junio C Hamano
@ 2016-04-15  3:33                                 ` Stefan Beller
  0 siblings, 0 replies; 27+ messages in thread
From: Stefan Beller @ 2016-04-15  3:33 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, davidel, Jacob Keller, git, Jens Lehmann

On Thu, Apr 14, 2016 at 7:09 PM, Junio C Hamano <gitster@pobox.com> wrote:
> Stefan Beller <sbeller@google.com> writes:
>
>>
>> +static int starts_with_emptyline(const char *recs)
>> +{
>> +     return recs[0] == '\n'; /* CRLF not covered here */
>> +}
>> +
>> +
>
> That's "is-empty-line", not "starts-with" ;-)

heh, ok.
To understand the code, I was debugging it and looking at the
pointers `recs[ix - 1]->ptr` which is pointing into the text file,
i.e. when printing it in the debugger it would read

    '\n\tbla\n\nfoo\n ...'

so I found that a proper description at the time.


>
>> +
>> +             /*
>> +              * If a group can be moved back and forth, see if there is an
>> +              * empty line in the moving space. If there is an empty line,
>> +              * make sure the last empty line is the end of the group.
>> +              *
>> +              * As we shifted the group forward as far as possible, we only
>> +              * need to shift it back if at all.
>> +              */
>
> Sounds sensible.
>
>> +             if (has_emptyline) {
>> +                     while (ixs > 0 && recs[ixs - 1]->ha == recs[ix - 1]->ha &&
>> +                            xdl_recmatch(recs[ixs - 1]->ptr, recs[ixs - 1]->size, recs[ix - 1]->ptr, recs[ix - 1]->size, flags) &&
>> +                            !starts_with_emptyline(recs[ix - 1]->ptr)) {
>
> You probably want to wrap the "hash compares equal and recmatch does
> say they are the same" into a helper function (to be automatically
> inlined by the compiler) to make it more readable here.  I think
> is-empty is a lot cheaper than the recmatch so that should probably
> be done earlier in the && chain.

ok, will do. Given that xdiff upstream and our code diverged over the years,
I could apply this helper function at other places in the code as well.


>
>> +                             rchg[--ixs] = 1;
>> +                             rchg[--ix] = 0;
>> +
>> +                             /*
>> +                              * This change did not join two change groups,
>> +                              * as we did that before already, so there is no
>
> Sorry, cannot quite parse the part before "already".

I think to drop this comment in the final version of this patch.
As I `wrote` this loop by copying it from above, I tried justifying
each change to it. (More to prove to myself I understood the code)

will drop this comment.

>
>> +                              * need to adapt the other-file, i.e.
>> +                              * running
>> +                              *     for (; rchg[ixs - 1]; ixs--);
>> +                              *     while (rchgo[--ixo]);
>> +                              */
>> +                     }
>> +             }
>>       }
>>
>>       return 0;

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2016-04-15  3:34 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-03-29  0:26 weird diff output? Jacob Keller
2016-03-29 17:37 ` Stefan Beller
2016-03-29 17:54   ` Junio C Hamano
2016-03-29 18:16     ` Stefan Beller
2016-03-29 23:05       ` Jacob Keller
2016-03-30  0:04         ` Junio C Hamano
2016-03-30  4:55         ` Jeff King
2016-03-30  6:05           ` Stefan Beller
2016-03-30  6:05           ` Jacob Keller
2016-03-30 19:14             ` Jacob Keller
2016-03-30 19:31               ` Jacob Keller
2016-03-30 19:40                 ` Stefan Beller
2016-04-01 19:04                   ` Junio C Hamano
2016-03-31 13:47                 ` Jeff King
2016-04-06 17:47                   ` Jacob Keller
2016-04-12 19:34                     ` Stefan Beller
2016-04-14 13:56                       ` Davide Libenzi
2016-04-14 18:34                         ` Jeff King
2016-04-14 21:05                           ` Stefan Beller
2016-04-15  0:07                             ` [RFC PATCH, WAS: "weird diff output?"] Implement better chunk heuristics Stefan Beller
2016-04-15  0:26                               ` Jacob Keller
2016-04-15  0:43                                 ` Stefan Beller
2016-04-15  2:07                                   ` Jacob Keller
2016-04-15  2:09                               ` Junio C Hamano
2016-04-15  3:33                                 ` Stefan Beller
2016-04-15  0:21                             ` weird diff output? Jacob Keller
2016-04-15  2:18                             ` Jeff King

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.