git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* git diff word diff bug??
@ 2021-04-20 16:38 Count of San Francisco
  2021-04-25  8:54 ` Atharva Raykar
  0 siblings, 1 reply; 7+ messages in thread
From: Count of San Francisco @ 2021-04-20 16:38 UTC (permalink / raw)
  To: git

Hi All,

Here is my "git bugreport":

Thank you for filling out a Git bug report!
Please answer the following questions to help us understand your issue.

What did you do before the bug happened? (Steps to reproduce your issue)
   git diff --word-diff=porcelain file0.txt file1.txt
     or
   git diff --word-diff file0.txt file1.txt

What did you expect to happen? (Expected behavior)

   I expected the diff for porcelain or default word-diff to be clear on 
which lines got removed and which changes belong to which line. I 
explain more in details below.

What happened instead? (Actual behavior)

   The diff was not clear.

What's different between what you expected and what actually happened?

   The diff made it looked like all the removed text were on one line 
and a later change in a line look like it was meant for a different 
line. When in fact, the later changes were for the same line (i.e. the 
first line). More details below.

Anything else you want to add:

Here are the details to reproduce and more details on how I interpreted 
the diff. If I am writing a script to highlight changes or to do extra 
processing for my specific use case, my script would get confused as to 
what really changed.

file0.txt content:
*** Begin Content *** --> this line is not in the actual file but just a 
marker here for clarity.
The fox jumped over the wall.
Blah1e32
q432423
qe23234
  233
253
345235

53243
afsfffas
*** End Content ****

file1.txt content:
*** Begin Content ***
The fox jumped over the river.
   He made it over.
*** End Content ****

git diff --word-diff file0.txt file1.txt produced this:
diff --git a/file0.txt b/file1.txt
index c8756ba..3413f10 100644
--- a/file0.txt
+++ b/file1.txt
@@ -1,11 +1,2 @@
The fox jumped over the [-wall.-]
[-Blah1e32-]
[-q432423-]
[-qe23234-]
[- 233-]
[-253-]
[-345235-]

[-53243-]
[-afsfffas-]{+river.+}
{+  He made it over.+}

The diff above does not make it clear that the "{+river+}" is really to 
be appended (or related) to the first line.
I expected the first diff line to look like this:
The fox jumped over the [-wall.-]{+river+} and the rest of the lines are 
delete lines.

git diff --word-diff=porcelain file0.txt file1.txt produced this:
diff --git a/file0.txt b/file1.txt
index c8756ba..3413f10 100644
--- a/file0.txt
+++ b/file1.txt
@@ -1,11 +1,2 @@
  The fox jumped over the
-wall.
~
-Blah1e32
~
-q432423
~
-qe23234
~
- 233
~
-253
~
-345235
~
~
-53243
~
-afsfffas
+river.
~
+  He made it over.
~

This is more non-discernable. The git diff --help documentation says 
that "Newlines in the input are represented by a tilde ~ on a line of 
its own". So a script would see the '~' character and interpret that as 
a new line. The script would have mistaken the "+river" for a different 
line. The git diff --help documentation does not explain what to do in 
this scenario.

I expected this:
  The fox jumped over the
-wall.
+river.
~

Is this a bug? If not, how do I make the distinction that the {+river+} 
(in the first case) and the +river (in the 2nd case) is really for the 
first line?

Please review the rest of the bug report below.
You can delete any lines you don't wish to share.


[System Info]
git version:
git version 2.30.0
cpu: x86_64
no commit associated with this build
sizeof-long: 8
sizeof-size_t: 8
shell-path: /bin/sh
uname: Darwin 20.3.0 Darwin Kernel Version 20.3.0: Thu Jan 21 00:07:06 
PST 2021; root:xnu-7195.81.3~1/RELEASE_X86_64 x86_64
compiler info: clang: 12.0.0 (clang-1200.0.32.28)
libc info: no libc information available
$SHELL (typically, interactive shell): /usr/local/bin/bash


[Enabled Hooks]
not run from a git repository - no hooks to show


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: git diff word diff bug??
  2021-04-20 16:38 git diff word diff bug?? Count of San Francisco
@ 2021-04-25  8:54 ` Atharva Raykar
  2021-04-26  9:45   ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 7+ messages in thread
From: Atharva Raykar @ 2021-04-25  8:54 UTC (permalink / raw)
  To: Count of San Francisco; +Cc: git

On 20-Apr-2021, at 22:08, Count of San Francisco <countofsanfrancisco@gmail.com> wrote:
> 
> Hi All,
> 
> Here is my "git bugreport":
> 
> Thank you for filling out a Git bug report!
> Please answer the following questions to help us understand your issue.
> 
> What did you do before the bug happened? (Steps to reproduce your issue)
>   git diff --word-diff=porcelain file0.txt file1.txt
>     or
>   git diff --word-diff file0.txt file1.txt
> 
> What did you expect to happen? (Expected behavior)
> 
>   I expected the diff for porcelain or default word-diff to be clear on which lines got removed and which changes belong to which line. I explain more in details below.
> 
> What happened instead? (Actual behavior)
> 
>   The diff was not clear.
> 
> What's different between what you expected and what actually happened?
> 
>   The diff made it looked like all the removed text were on one line and a later change in a line look like it was meant for a different line. When in fact, the later changes were for the same line (i.e. the first line). More details below.
> 
> Anything else you want to add:
> 
> Here are the details to reproduce and more details on how I interpreted the diff. If I am writing a script to highlight changes or to do extra processing for my specific use case, my script would get confused as to what really changed.
> 
> file0.txt content:
> *** Begin Content *** --> this line is not in the actual file but just a marker here for clarity.
> The fox jumped over the wall.
> Blah1e32
> q432423
> qe23234
>  233
> 253
> 345235
> 
> 53243
> afsfffas
> *** End Content ****
> 
> file1.txt content:
> *** Begin Content ***
> The fox jumped over the river.
>   He made it over.
> *** End Content ****
> 
> git diff --word-diff file0.txt file1.txt produced this:
> diff --git a/file0.txt b/file1.txt
> index c8756ba..3413f10 100644
> --- a/file0.txt
> +++ b/file1.txt
> @@ -1,11 +1,2 @@
> The fox jumped over the [-wall.-]
> [-Blah1e32-]
> [-q432423-]
> [-qe23234-]
> [- 233-]
> [-253-]
> [-345235-]
> 
> [-53243-]
> [-afsfffas-]{+river.+}
> {+  He made it over.+}

From my experience, git diff prefers to bundle up a series of
deletions or additions into a group if they all have the same
word delimiter. The way I would interpret this diff is the steps
needed to be taken when moving left to right in file0 to get to
the state of file1, while minimising the number of times file1
has to be consulted to know what needs to be done next.

Here it would be:
"Delete all the words from 'wall' upto 'afsfffas', and then add
'river.' and ' He made it over'".

> The diff above does not make it clear that the "{+river+}" is really to be appended (or related) to the first line.
> I expected the first diff line to look like this:
> The fox jumped over the [-wall.-]{+river+} and the rest of the lines are delete lines.
> 
> git diff --word-diff=porcelain file0.txt file1.txt produced this:
> diff --git a/file0.txt b/file1.txt
> index c8756ba..3413f10 100644
> --- a/file0.txt
> +++ b/file1.txt
> @@ -1,11 +1,2 @@
>  The fox jumped over the
> -wall.
> ~
> -Blah1e32
> ~
> -q432423
> ~
> -qe23234
> ~
> - 233
> ~
> -253
> ~
> -345235
> ~
> ~
> -53243
> ~
> -afsfffas
> +river.
> ~
> +  He made it over.
> ~
> 
> This is more non-discernable. The git diff --help documentation says that "Newlines in the input are represented by a tilde ~ on a line of its own". So a script would see the '~' character and interpret that as a new line. The script would have mistaken the "+river" for a different line. The git diff --help documentation does not explain what to do in this scenario.
> 
> I expected this:
>  The fox jumped over the
> -wall.
> +river.
> ~

This is also consistent with the behaviour I mentioned above.
A script would need to interpret this as:
delete "wall"        (this starts the streak of deletions)
go to next line
delete "Blah1e32"
...

and as soon as it sees a '+', that is, an addition, it knows
the series of deletions are done with, so it will add "river"
to the last line that was common to both, that is,
"the fox jumped over the".

> Is this a bug? If not, how do I make the distinction that the {+river+} (in the first case) and the +river (in the 2nd case) is really for the first line?

I do not think this is a bug, because it does not really
deviate from any specified behaviour. But I do see the source
of confusion.

I hope I could explain that well enough.

> Please review the rest of the bug report below.
> You can delete any lines you don't wish to share.
> 
> 
> [System Info]
> git version:
> git version 2.30.0
> cpu: x86_64
> no commit associated with this build
> sizeof-long: 8
> sizeof-size_t: 8
> shell-path: /bin/sh
> uname: Darwin 20.3.0 Darwin Kernel Version 20.3.0: Thu Jan 21 00:07:06 PST 2021; root:xnu-7195.81.3~1/RELEASE_X86_64 x86_64
> compiler info: clang: 12.0.0 (clang-1200.0.32.28)
> libc info: no libc information available
> $SHELL (typically, interactive shell): /usr/local/bin/bash
> 
> 
> [Enabled Hooks]
> not run from a git repository - no hooks to show
> 

--
Atharva Raykar


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: git diff word diff bug??
  2021-04-25  8:54 ` Atharva Raykar
@ 2021-04-26  9:45   ` Ævar Arnfjörð Bjarmason
  2021-04-30  4:12     ` Count of San Francisco
  2021-05-02 18:00     ` Phillip Wood
  0 siblings, 2 replies; 7+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-04-26  9:45 UTC (permalink / raw)
  To: Atharva Raykar; +Cc: Count of San Francisco, git


On Sun, Apr 25 2021, Atharva Raykar wrote:

> On 20-Apr-2021, at 22:08, Count of San Francisco <countofsanfrancisco@gmail.com> wrote:
>> 
>> Hi All,
>> 
>> Here is my "git bugreport":
>> 
>> Thank you for filling out a Git bug report!
>> Please answer the following questions to help us understand your issue.
>> 
>> What did you do before the bug happened? (Steps to reproduce your issue)
>>   git diff --word-diff=porcelain file0.txt file1.txt
>>     or
>>   git diff --word-diff file0.txt file1.txt
>> 
>> What did you expect to happen? (Expected behavior)
>> 
>>   I expected the diff for porcelain or default word-diff to be clear on which lines got removed and which changes belong to which line. I explain more in details below.
>> 
>> What happened instead? (Actual behavior)
>> 
>>   The diff was not clear.
>> 
>> What's different between what you expected and what actually happened?
>> 
>>   The diff made it looked like all the removed text were on one line and a later change in a line look like it was meant for a different line. When in fact, the later changes were for the same line (i.e. the first line). More details below.
>> 
>> Anything else you want to add:
>> 
>> Here are the details to reproduce and more details on how I interpreted the diff. If I am writing a script to highlight changes or to do extra processing for my specific use case, my script would get confused as to what really changed.
>> 
>> file0.txt content:
>> *** Begin Content *** --> this line is not in the actual file but just a marker here for clarity.
>> The fox jumped over the wall.
>> Blah1e32
>> q432423
>> qe23234
>>  233
>> 253
>> 345235
>> 
>> 53243
>> afsfffas
>> *** End Content ****
>> 
>> file1.txt content:
>> *** Begin Content ***
>> The fox jumped over the river.
>>   He made it over.
>> *** End Content ****
>> 
>> git diff --word-diff file0.txt file1.txt produced this:
>> diff --git a/file0.txt b/file1.txt
>> index c8756ba..3413f10 100644
>> --- a/file0.txt
>> +++ b/file1.txt
>> @@ -1,11 +1,2 @@
>> The fox jumped over the [-wall.-]
>> [-Blah1e32-]
>> [-q432423-]
>> [-qe23234-]
>> [- 233-]
>> [-253-]
>> [-345235-]
>> 
>> [-53243-]
>> [-afsfffas-]{+river.+}
>> {+  He made it over.+}
>
> From my experience, git diff prefers to bundle up a series of
> deletions or additions into a group if they all have the same
> word delimiter. The way I would interpret this diff is the steps
> needed to be taken when moving left to right in file0 to get to
> the state of file1, while minimising the number of times file1
> has to be consulted to know what needs to be done next.
>
> Here it would be:
> "Delete all the words from 'wall' upto 'afsfffas', and then add
> 'river.' and ' He made it over'".
>
>> The diff above does not make it clear that the "{+river+}" is really to be appended (or related) to the first line.
>> I expected the first diff line to look like this:
>> The fox jumped over the [-wall.-]{+river+} and the rest of the lines are delete lines.
>> 
>> git diff --word-diff=porcelain file0.txt file1.txt produced this:
>> diff --git a/file0.txt b/file1.txt
>> index c8756ba..3413f10 100644
>> --- a/file0.txt
>> +++ b/file1.txt
>> @@ -1,11 +1,2 @@
>>  The fox jumped over the
>> -wall.
>> ~
>> -Blah1e32
>> ~
>> -q432423
>> ~
>> -qe23234
>> ~
>> - 233
>> ~
>> -253
>> ~
>> -345235
>> ~
>> ~
>> -53243
>> ~
>> -afsfffas
>> +river.
>> ~
>> +  He made it over.
>> ~
>> 
>> This is more non-discernable. The git diff --help documentation says
>> that "Newlines in the input are represented by a tilde ~ on a line
>> of its own". So a script would see the '~' character and interpret
>> that as a new line. The script would have mistaken the "+river" for
>> a different line. The git diff --help documentation does not explain
>> what to do in this scenario.
>> 
>> I expected this:
>>  The fox jumped over the
>> -wall.
>> +river.
>> ~
>
> This is also consistent with the behaviour I mentioned above.
> A script would need to interpret this as:
> delete "wall"        (this starts the streak of deletions)
> go to next line
> delete "Blah1e32"
> ...
>
> and as soon as it sees a '+', that is, an addition, it knows
> the series of deletions are done with, so it will add "river"
> to the last line that was common to both, that is,
> "the fox jumped over the".
>
>> Is this a bug? If not, how do I make the distinction that the {+river+} (in the first case) and the +river (in the 2nd case) is really for the first line?
>
> I do not think this is a bug, because it does not really
> deviate from any specified behaviour. But I do see the source
> of confusion.
>
> I hope I could explain that well enough.

I do think it's somewhere between a "bug" and a "missing feature"
though.

The core issue here is that in diff.c we "fake up" the whole word-diff
mode by essentially taking both sides, splitting them on whitespace, and
then diffing that. So for A/B like:

    A: foo bar
       baz
    B: foo baz
       boo

We diff:

    A: foo\nbar\nbaz
    B: foo\nbaz\nboo

If you insert an extra newline into that stream you'll get different
results, which is closer to what the Count is expecting here.

In this case using --word-diff-regex='[^\n]' will produce:

    The fox jumped over the [-wall-]{+river+}.

But the rest of the output isn't ideal, exercise for the user and all
that.

The bug is arguably that we should be doing this by default, i.e. we
split up our output into "\n" internally, and arguably confuse that with
what the user wanted from their input.

But see bf82940dbf1 (color-words: enable REG_NEWLINE to help user,
2009-01-17) which is probably the source of the "bug" here (if any) for
a trade-off related to that.


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: git diff word diff bug??
  2021-04-26  9:45   ` Ævar Arnfjörð Bjarmason
@ 2021-04-30  4:12     ` Count of San Francisco
  2021-05-02 18:00     ` Phillip Wood
  1 sibling, 0 replies; 7+ messages in thread
From: Count of San Francisco @ 2021-04-30  4:12 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason, Atharva Raykar; +Cc: git

Hi,

It would be nice to get the ability I was expecting as a new feature or 
as the default behavior. If I get to have a preference, I would prefer 
this be the new default behavior. Even as Atharva explained, when the 
addition is encountered you know it is the termination but this is 
really confusing and can lead to too many corner cases for a script to 
handle. It's just not a very natural understanding. For example, take 
the content of file1.txt and simply delete "river." (with the period).

You will now have this word diff:

The fox jumped over the
   [-wall.-]
[-Blah1e32-]
[-q432423-]
[-qe23234-]
[- 233-]
[-253-]
[-345235-]

[-53243-]
[-afsfffas-]{+He made it over.+}

The script would be confused and think that the "wall" from "[-wall-]" 
occurred on a new line in the original text (i.e. file0.txt). Forget 
about the script for a moment, a human reader would be confused by 
looking at a diff and if they were in a code review process, they may be 
confused by which line the original "wall" belonged to.

Also, adding an additional new line as suggested by Ævar sometimes work 
and sometimes does not as I experiment that idea with other diffs. I 
found that if you add an additional "fake" and non-empty new line, the 
word-diff gets what I wanted. For example, if I add "xxx fake new line" 
for every new line, the word diff produces something more intuitive and 
"natural" to understand what really changed.

So, can we get this new behavior added/supported?

Thanks,

Scott


On 4/26/21 2:45 AM, Ævar Arnfjörð Bjarmason wrote:
> On Sun, Apr 25 2021, Atharva Raykar wrote:
>
>> On 20-Apr-2021, at 22:08, Count of San Francisco<countofsanfrancisco@gmail.com>  wrote:
>>> Hi All,
>>>
>>> Here is my "git bugreport":
>>>
>>> Thank you for filling out a Git bug report!
>>> Please answer the following questions to help us understand your issue.
>>>
>>> What did you do before the bug happened? (Steps to reproduce your issue)
>>>    git diff --word-diff=porcelain file0.txt file1.txt
>>>      or
>>>    git diff --word-diff file0.txt file1.txt
>>>
>>> What did you expect to happen? (Expected behavior)
>>>
>>>    I expected the diff for porcelain or default word-diff to be clear on which lines got removed and which changes belong to which line. I explain more in details below.
>>>
>>> What happened instead? (Actual behavior)
>>>
>>>    The diff was not clear.
>>>
>>> What's different between what you expected and what actually happened?
>>>
>>>    The diff made it looked like all the removed text were on one line and a later change in a line look like it was meant for a different line. When in fact, the later changes were for the same line (i.e. the first line). More details below.
>>>
>>> Anything else you want to add:
>>>
>>> Here are the details to reproduce and more details on how I interpreted the diff. If I am writing a script to highlight changes or to do extra processing for my specific use case, my script would get confused as to what really changed.
>>>
>>> file0.txt content:
>>> *** Begin Content *** --> this line is not in the actual file but just a marker here for clarity.
>>> The fox jumped over the wall.
>>> Blah1e32
>>> q432423
>>> qe23234
>>>   233
>>> 253
>>> 345235
>>>
>>> 53243
>>> afsfffas
>>> *** End Content ****
>>>
>>> file1.txt content:
>>> *** Begin Content ***
>>> The fox jumped over the river.
>>>    He made it over.
>>> *** End Content ****
>>>
>>> git diff --word-diff file0.txt file1.txt produced this:
>>> diff --git a/file0.txt b/file1.txt
>>> index c8756ba..3413f10 100644
>>> --- a/file0.txt
>>> +++ b/file1.txt
>>> @@ -1,11 +1,2 @@
>>> The fox jumped over the [-wall.-]
>>> [-Blah1e32-]
>>> [-q432423-]
>>> [-qe23234-]
>>> [- 233-]
>>> [-253-]
>>> [-345235-]
>>>
>>> [-53243-]
>>> [-afsfffas-]{+river.+}
>>> {+  He made it over.+}
>> >From my experience, git diff prefers to bundle up a series of
>> deletions or additions into a group if they all have the same
>> word delimiter. The way I would interpret this diff is the steps
>> needed to be taken when moving left to right in file0 to get to
>> the state of file1, while minimising the number of times file1
>> has to be consulted to know what needs to be done next.
>>
>> Here it would be:
>> "Delete all the words from 'wall' upto 'afsfffas', and then add
>> 'river.' and ' He made it over'".
>>
>>> The diff above does not make it clear that the "{+river+}" is really to be appended (or related) to the first line.
>>> I expected the first diff line to look like this:
>>> The fox jumped over the [-wall.-]{+river+} and the rest of the lines are delete lines.
>>>
>>> git diff --word-diff=porcelain file0.txt file1.txt produced this:
>>> diff --git a/file0.txt b/file1.txt
>>> index c8756ba..3413f10 100644
>>> --- a/file0.txt
>>> +++ b/file1.txt
>>> @@ -1,11 +1,2 @@
>>>   The fox jumped over the
>>> -wall.
>>> ~
>>> -Blah1e32
>>> ~
>>> -q432423
>>> ~
>>> -qe23234
>>> ~
>>> - 233
>>> ~
>>> -253
>>> ~
>>> -345235
>>> ~
>>> ~
>>> -53243
>>> ~
>>> -afsfffas
>>> +river.
>>> ~
>>> +  He made it over.
>>> ~
>>>
>>> This is more non-discernable. The git diff --help documentation says
>>> that "Newlines in the input are represented by a tilde ~ on a line
>>> of its own". So a script would see the '~' character and interpret
>>> that as a new line. The script would have mistaken the "+river" for
>>> a different line. The git diff --help documentation does not explain
>>> what to do in this scenario.
>>>
>>> I expected this:
>>>   The fox jumped over the
>>> -wall.
>>> +river.
>>> ~
>> This is also consistent with the behaviour I mentioned above.
>> A script would need to interpret this as:
>> delete "wall"        (this starts the streak of deletions)
>> go to next line
>> delete "Blah1e32"
>> ...
>>
>> and as soon as it sees a '+', that is, an addition, it knows
>> the series of deletions are done with, so it will add "river"
>> to the last line that was common to both, that is,
>> "the fox jumped over the".
>>
>>> Is this a bug? If not, how do I make the distinction that the {+river+} (in the first case) and the +river (in the 2nd case) is really for the first line?
>> I do not think this is a bug, because it does not really
>> deviate from any specified behaviour. But I do see the source
>> of confusion.
>>
>> I hope I could explain that well enough.
> I do think it's somewhere between a "bug" and a "missing feature"
> though.
>
> The core issue here is that in diff.c we "fake up" the whole word-diff
> mode by essentially taking both sides, splitting them on whitespace, and
> then diffing that. So for A/B like:
>
>      A: foo bar
>         baz
>      B: foo baz
>         boo
>
> We diff:
>
>      A: foo\nbar\nbaz
>      B: foo\nbaz\nboo
>
> If you insert an extra newline into that stream you'll get different
> results, which is closer to what the Count is expecting here.
>
> In this case using --word-diff-regex='[^\n]' will produce:
>
>      The fox jumped over the [-wall-]{+river+}.
>
> But the rest of the output isn't ideal, exercise for the user and all
> that.
>
> The bug is arguably that we should be doing this by default, i.e. we
> split up our output into "\n" internally, and arguably confuse that with
> what the user wanted from their input.
>
> But see bf82940dbf1 (color-words: enable REG_NEWLINE to help user,
> 2009-01-17) which is probably the source of the "bug" here (if any) for
> a trade-off related to that.
>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: git diff word diff bug??
  2021-04-26  9:45   ` Ævar Arnfjörð Bjarmason
  2021-04-30  4:12     ` Count of San Francisco
@ 2021-05-02 18:00     ` Phillip Wood
  2021-05-03  9:47       ` Phillip Wood
  1 sibling, 1 reply; 7+ messages in thread
From: Phillip Wood @ 2021-05-02 18:00 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason, Atharva Raykar
  Cc: Count of San Francisco, git

On 26/04/2021 10:45, Ævar Arnfjörð Bjarmason wrote:
> 
> On Sun, Apr 25 2021, Atharva Raykar wrote:
> 
>> On 20-Apr-2021, at 22:08, Count of San Francisco <countofsanfrancisco@gmail.com> wrote:
>>>[...]
>>> The diff above does not make it clear that the "{+river+}" is really to be appended (or related) to the first line.
>>> I expected the first diff line to look like this:
>>> The fox jumped over the [-wall.-]{+river+} and the rest of the lines are delete lines.
>>>
>>> git diff --word-diff=porcelain file0.txt file1.txt produced this:
>>> diff --git a/file0.txt b/file1.txt
>>> index c8756ba..3413f10 100644
>>> --- a/file0.txt
>>> +++ b/file1.txt
>>> @@ -1,11 +1,2 @@
>>>   The fox jumped over the
>>> -wall.
>>> ~
>>> -Blah1e32
>>> ~
>>> -q432423
>>> ~
>>> -qe23234
>>> ~
>>> - 233
>>> ~
>>> -253
>>> ~
>>> -345235
>>> ~
>>> ~
>>> -53243
>>> ~
>>> -afsfffas
>>> +river.
>>> ~
>>> +  He made it over.
>>> ~
>>>
>>> This is more non-discernable. The git diff --help documentation says
>>> that "Newlines in the input are represented by a tilde ~ on a line
>>> of its own". So a script would see the '~' character and interpret
>>> that as a new line. The script would have mistaken the "+river" for
>>> a different line. The git diff --help documentation does not explain
>>> what to do in this scenario.
>>>
>>> I expected this:
>>>   The fox jumped over the
>>> -wall.
>>> +river.
>>> ~
>>
>> This is also consistent with the behaviour I mentioned above.
>> A script would need to interpret this as:
>> delete "wall"        (this starts the streak of deletions)
>> go to next line
>> delete "Blah1e32"
>> ...
>>
>> and as soon as it sees a '+', that is, an addition, it knows
>> the series of deletions are done with, so it will add "river"
>> to the last line that was common to both, that is,
>> "the fox jumped over the".
>>
>>> Is this a bug? If not, how do I make the distinction that the {+river+} (in the first case) and the +river (in the 2nd case) is really for the first line?
>>
>> I do not think this is a bug, because it does not really
>> deviate from any specified behaviour. But I do see the source
>> of confusion.
>>
>> I hope I could explain that well enough.
> 
> I do think it's somewhere between a "bug" and a "missing feature"
> though.
> 
> The core issue here is that in diff.c we "fake up" the whole word-diff
> mode by essentially taking both sides, splitting them on whitespace, and
> then diffing that. So for A/B like:
> 
>      A: foo bar
>         baz
>      B: foo baz
>         boo
> 
> We diff:
> 
>      A: foo\nbar\nbaz
>      B: foo\nbaz\nboo
> 
> If you insert an extra newline into that stream you'll get different
> results, which is closer to what the Count is expecting here.
> 
> In this case using --word-diff-regex='[^\n]' will produce: >
>      The fox jumped over the [-wall-]{+river+}.
> 
> But the rest of the output isn't ideal, exercise for the user and all
> that.

Yes it's a complete mess as we're matching a character at a time rather 
than words, the only reason it works for the first line is that there 
happens to be some context at the end of the line.

> The bug is arguably that we should be doing this by default, i.e. we
> split up our output into "\n" internally, and arguably confuse that with
> what the user wanted from their input.
> 
> But see bf82940dbf1 (color-words: enable REG_NEWLINE to help user,
> 2009-01-17) which is probably the source of the "bug" here (if any) for
> a trade-off related to that.

Reverting that commit has no effect on the output, looking at 
find_word_boundaries() it has always terminated the match at a newline 
so I'm not sure what the commit message for bf82940dbf1 means when it 
says "Of course newlines can still be matched explicitly.".

To me the problem is not with how the word diff is calculated but how it 
is presented. At the moment it just prints all the deletions followed by 
all the insertions. I think that if following some context it just 
printed the deletions and insertions up to the end of that line and then 
dumped the rest of the deletions and insertions following the line break 
that would give a more readable result. It would need to do something 
similar for deletions and insertions at the beginning of a line that are 
followed by some context before the end of the line.

In general I find it useful that --word-diff ignores the line structure 
of the input, this enables it to show changes to the words in re-flowed 
text without the clutter one sees in a line-by-line diff from moving the 
line breaks. I think the changes above would probably be compatible with 
that.

Best Wishes

Phillip


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: git diff word diff bug??
  2021-05-02 18:00     ` Phillip Wood
@ 2021-05-03  9:47       ` Phillip Wood
  2022-06-07 22:40         ` Scott Phuong
  0 siblings, 1 reply; 7+ messages in thread
From: Phillip Wood @ 2021-05-03  9:47 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason, Atharva Raykar
  Cc: Count of San Francisco, git

On 02/05/2021 19:00, Phillip Wood wrote:
> On 26/04/2021 10:45, Ævar Arnfjörð Bjarmason wrote:
>>
>> On Sun, Apr 25 2021, Atharva Raykar wrote:
>>
>>> On 20-Apr-2021, at 22:08, Count of San Francisco 
>>> <countofsanfrancisco@gmail.com> wrote:
>>>> [...]
>>>> The diff above does not make it clear that the "{+river+}" is really 
>>>> to be appended (or related) to the first line.
>>>> I expected the first diff line to look like this:
>>>> The fox jumped over the [-wall.-]{+river+} and the rest of the lines 
>>>> are delete lines.
>>>>
>>>> git diff --word-diff=porcelain file0.txt file1.txt produced this:
>>>> diff --git a/file0.txt b/file1.txt
>>>> index c8756ba..3413f10 100644
>>>> --- a/file0.txt
>>>> +++ b/file1.txt
>>>> @@ -1,11 +1,2 @@
>>>>   The fox jumped over the
>>>> -wall.
>>>> ~
>>>> -Blah1e32
>>>> ~
>>>> -q432423
>>>> ~
>>>> -qe23234
>>>> ~
>>>> - 233
>>>> ~
>>>> -253
>>>> ~
>>>> -345235
>>>> ~
>>>> ~
>>>> -53243
>>>> ~
>>>> -afsfffas
>>>> +river.
>>>> ~
>>>> +  He made it over.
>>>> ~
>>>>
>>>> This is more non-discernable. The git diff --help documentation says
>>>> that "Newlines in the input are represented by a tilde ~ on a line
>>>> of its own". So a script would see the '~' character and interpret
>>>> that as a new line. The script would have mistaken the "+river" for
>>>> a different line. The git diff --help documentation does not explain
>>>> what to do in this scenario.
>>>>
>>>> I expected this:
>>>>   The fox jumped over the
>>>> -wall.
>>>> +river.
>>>> ~
>>>
>>> This is also consistent with the behaviour I mentioned above.
>>> A script would need to interpret this as:
>>> delete "wall"        (this starts the streak of deletions)
>>> go to next line
>>> delete "Blah1e32"
>>> ...
>>>
>>> and as soon as it sees a '+', that is, an addition, it knows
>>> the series of deletions are done with, so it will add "river"
>>> to the last line that was common to both, that is,
>>> "the fox jumped over the".
>>>
>>>> Is this a bug? If not, how do I make the distinction that the 
>>>> {+river+} (in the first case) and the +river (in the 2nd case) is 
>>>> really for the first line?
>>>
>>> I do not think this is a bug, because it does not really
>>> deviate from any specified behaviour. But I do see the source
>>> of confusion.
>>>
>>> I hope I could explain that well enough.
>>
>> I do think it's somewhere between a "bug" and a "missing feature"
>> though.
>>
>> The core issue here is that in diff.c we "fake up" the whole word-diff
>> mode by essentially taking both sides, splitting them on whitespace, and
>> then diffing that. So for A/B like:
>>
>>      A: foo bar
>>         baz
>>      B: foo baz
>>         boo
>>
>> We diff:
>>
>>      A: foo\nbar\nbaz
>>      B: foo\nbaz\nboo
>>
>> If you insert an extra newline into that stream you'll get different
>> results, which is closer to what the Count is expecting here.
>>
>> In this case using --word-diff-regex='[^\n]' will produce: >
>>      The fox jumped over the [-wall-]{+river+}.
>>
>> But the rest of the output isn't ideal, exercise for the user and all
>> that.
> 
> Yes it's a complete mess as we're matching a character at a time rather 
> than words, the only reason it works for the first line is that there 
> happens to be some context at the end of the line.
> 
>> The bug is arguably that we should be doing this by default, i.e. we
>> split up our output into "\n" internally, and arguably confuse that with
>> what the user wanted from their input.
>>
>> But see bf82940dbf1 (color-words: enable REG_NEWLINE to help user,
>> 2009-01-17) which is probably the source of the "bug" here (if any) for
>> a trade-off related to that.
> 
> Reverting that commit has no effect on the output, looking at 
> find_word_boundaries() it has always terminated the match at a newline 
> so I'm not sure what the commit message for bf82940dbf1 means when it 
> says "Of course newlines can still be matched explicitly.".

Having thought about it some more I think it is saying "REG_NEWLINE is 
an incomplete fix for the problem of words being silently truncated at 
newlines as it does not stop words being truncated when the regex 
contains an explicit match for a newline"

> To me the problem is not with how the word diff is calculated but how it 
> is presented. At the moment it just prints all the deletions followed by 
> all the insertions. I think that if following some context it just 
> printed the deletions and insertions up to the end of that line and then 
> dumped the rest of the deletions and insertions following the line break 
> that would give a more readable result. It would need to do something 
> similar for deletions and insertions at the beginning of a line that are 
> followed by some context before the end of the line.
> 
> In general I find it useful that --word-diff ignores the line structure 
> of the input, this enables it to show changes to the words in re-flowed 
> text without the clutter one sees in a line-by-line diff from moving the 
> line breaks. I think the changes above would probably be compatible with 
> that.
> 
> Best Wishes
> 
> Phillip
> 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: git diff word diff bug??
  2021-05-03  9:47       ` Phillip Wood
@ 2022-06-07 22:40         ` Scott Phuong
  0 siblings, 0 replies; 7+ messages in thread
From: Scott Phuong @ 2022-06-07 22:40 UTC (permalink / raw)
  To: phillip.wood, Ævar Arnfjörð Bjarmason, Atharva Raykar; +Cc: git

Hi All,

Sorry for not replying sooner.

Here is another example I recently encountered:

First, let test-1.txt contain only these two lines:

> Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.
> Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage, and going through the cites of the word in classical literature, discovered the undoubtable source. Lorem Ipsum comes from sections 1.10.32 and 1.10.33 of "de Finibus Bonorum et Malorum" (The Extremes of Good and Evil) by Cicero, written in 45 BC. This book is a treatise on the theory of ethics, very popular during the Renaissance. The first line of Lorem Ipsum, "Lorem ipsum dolor sit amet..", comes from a line in section 1.10.32.


Next, let test-2.txt contain only these two lines:

> It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout. The point of using Lorem Ipsum is that it has a more-or-less normal distribution of letters, as opposed to using 'Content here, content here', making it look like readable English. Many desktop publishing packages and web page editors now use Lorem Ipsum as their default model text, and a search for 'lorem ipsum' will uncover many web sites still in their infancy. Various versions have evolved over the years, sometimes by accident, sometimes on purpose (injected humour and the like).
> There are many variations of passages of Lorem Ipsum available, but the majority have suffered alteration in some form, by injected humour, or randomised words which don't look even slightly believable. If you are going to use a passage of Lorem Ipsum, you need to be sure there isn't anything embarrassing hidden in the middle of text. All the Lorem Ipsum generators on the Internet tend to repeat predefined chunks as necessary, making this the first true generator on the Internet. It uses a dictionary of over 200 Latin words, combined with a handful of model sentence structures, to generate Lorem Ipsum which looks reasonable. The generated Lorem Ipsum is therefore always free from repetition, injected humour, or non-characteristic words etc.


Now, git diff --word-diff test-1.txt and test-2.txt show this:
  
> diff --git a/test-1.txt b/test-2.txt
> index d6eb12a..e34ce77 100644
> --- a/test-1.txt
> +++ b/test-2.txt
> @@ -1,2 +1,2 @@
> [-Lorem Ipsum-]{+It+} is [-simply dummy text of-]{+a long established fact that a reader will be distracted by+} the [-printing and typesetting industry.-]{+readable content of a page when looking at its layout. The point of using+} Lorem Ipsum {+is that it+} has[-been the industry's standard dummy text ever since the 1500s, when an unknown printer took-] a [-galley-]{+more-or-less normal distribution+} of [-type and scrambled it-]{+letters, as opposed+} to [-make-]{+using 'Content here, content here', making it look like readable English. Many desktop publishing packages and web page editors now use Lorem Ipsum as their default model text, and+} a [-type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised-]{+search for 'lorem ipsum' will uncover many web sites still+} in {+their infancy. Various versions have evolved over+} the [-1960s with-]{+years, sometimes by accident, sometimes on purpose (injected humour and+} the [-release-]{+like).+}
> {+There are many variations+} of [-Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions-]{+passages+} of Lorem[-Ipsum.-]
> [-Contrary to popular belief, Lorem-] Ipsum [-is not simply random text. It has roots-]{+available, but the majority have suffered alteration+} in {+some form, by injected humour, or randomised words which don't look even slightly believable. If you are going to use+} a [-piece-]{+passage+} of [-classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at Hampden-Sydney College-]{+Lorem Ipsum, you need to be sure there isn't anything embarrassing hidden+} in [-Virginia, looked up one-]{+the middle+} of {+text. All+} the[-more obscure Latin words, consectetur, from a-] Lorem Ipsum [-passage, and going through-]{+generators on+} the [-cites of-]{+Internet tend to repeat predefined chunks as necessary, making this+} the [-word in classical literature, discovered-]{+first true generator on+} the [-undoubtable source. Lorem Ipsum comes from sections 1.10.32 and 1.10.33 of "de Finibus Bonorum et Malorum" (The Extremes-]{+Internet. It uses a dictionary+} of [-Good and Evil) by Cicero, written in 45 BC. This book is-]{+over 200 Latin words, combined with+} a [-treatise on the theory-]{+handful+} of [-ethics, very popular during the Renaissance.-]{+model sentence structures, to generate Lorem Ipsum which looks reasonable.+} The [-first line of-]{+generated+} Lorem [-Ipsum, "Lorem ipsum dolor sit amet..", comes-]{+Ipsum is therefore always free+} from [-a line in section 1.10.32.-]{+repetition, injected humour, or non-characteristic words etc.+}

Notice how this diff is three lines and the two files themselves each two line. And the middle line (i.e. second line) of the diff seems to "blend" (lack of better word) changes in line 1 and line 2 of test-1.txt and test-2.txt.

Needless to say, this is a very strange diff. As a human trying to understand this diff (such as in a code or document reviews), this diff is very not intuitive as to what really changed. This is especially harder for text-based documents (such as markdown).

Intuitively, I would want all the changes to line 1 of both files to be all on the same line in the diff and all contained on the same line.

Thus, is there a way to force this behavior in current versions of diff? This email thread seem to suggest not. If not, how do I convince or ask git developers to implement this new behavior?

Thanks,

Scott

On 5/3/21 2:47 AM, Phillip Wood wrote:
> On 02/05/2021 19:00, Phillip Wood wrote:
>> On 26/04/2021 10:45, Ævar Arnfjörð Bjarmason wrote:
>>>
>>> On Sun, Apr 25 2021, Atharva Raykar wrote:
>>>
>>>> On 20-Apr-2021, at 22:08, Count of San Francisco <countofsanfrancisco@gmail.com> wrote:
>>>>> [...]
>>>>> The diff above does not make it clear that the "{+river+}" is really to be appended (or related) to the first line.
>>>>> I expected the first diff line to look like this:
>>>>> The fox jumped over the [-wall.-]{+river+} and the rest of the lines are delete lines.
>>>>>
>>>>> git diff --word-diff=porcelain file0.txt file1.txt produced this:
>>>>> diff --git a/file0.txt b/file1.txt
>>>>> index c8756ba..3413f10 100644
>>>>> --- a/file0.txt
>>>>> +++ b/file1.txt
>>>>> @@ -1,11 +1,2 @@
>>>>>   The fox jumped over the
>>>>> -wall.
>>>>> ~
>>>>> -Blah1e32
>>>>> ~
>>>>> -q432423
>>>>> ~
>>>>> -qe23234
>>>>> ~
>>>>> - 233
>>>>> ~
>>>>> -253
>>>>> ~
>>>>> -345235
>>>>> ~
>>>>> ~
>>>>> -53243
>>>>> ~
>>>>> -afsfffas
>>>>> +river.
>>>>> ~
>>>>> +  He made it over.
>>>>> ~
>>>>>
>>>>> This is more non-discernable. The git diff --help documentation says
>>>>> that "Newlines in the input are represented by a tilde ~ on a line
>>>>> of its own". So a script would see the '~' character and interpret
>>>>> that as a new line. The script would have mistaken the "+river" for
>>>>> a different line. The git diff --help documentation does not explain
>>>>> what to do in this scenario.
>>>>>
>>>>> I expected this:
>>>>>   The fox jumped over the
>>>>> -wall.
>>>>> +river.
>>>>> ~
>>>>
>>>> This is also consistent with the behaviour I mentioned above.
>>>> A script would need to interpret this as:
>>>> delete "wall"        (this starts the streak of deletions)
>>>> go to next line
>>>> delete "Blah1e32"
>>>> ...
>>>>
>>>> and as soon as it sees a '+', that is, an addition, it knows
>>>> the series of deletions are done with, so it will add "river"
>>>> to the last line that was common to both, that is,
>>>> "the fox jumped over the".
>>>>
>>>>> Is this a bug? If not, how do I make the distinction that the {+river+} (in the first case) and the +river (in the 2nd case) is really for the first line?
>>>>
>>>> I do not think this is a bug, because it does not really
>>>> deviate from any specified behaviour. But I do see the source
>>>> of confusion.
>>>>
>>>> I hope I could explain that well enough.
>>>
>>> I do think it's somewhere between a "bug" and a "missing feature"
>>> though.
>>>
>>> The core issue here is that in diff.c we "fake up" the whole word-diff
>>> mode by essentially taking both sides, splitting them on whitespace, and
>>> then diffing that. So for A/B like:
>>>
>>>      A: foo bar
>>>         baz
>>>      B: foo baz
>>>         boo
>>>
>>> We diff:
>>>
>>>      A: foo\nbar\nbaz
>>>      B: foo\nbaz\nboo
>>>
>>> If you insert an extra newline into that stream you'll get different
>>> results, which is closer to what the Count is expecting here.
>>>
>>> In this case using --word-diff-regex='[^\n]' will produce: >
>>>      The fox jumped over the [-wall-]{+river+}.
>>>
>>> But the rest of the output isn't ideal, exercise for the user and all
>>> that.
>>
>> Yes it's a complete mess as we're matching a character at a time rather than words, the only reason it works for the first line is that there happens to be some context at the end of the line.
>>
>>> The bug is arguably that we should be doing this by default, i.e. we
>>> split up our output into "\n" internally, and arguably confuse that with
>>> what the user wanted from their input.
>>>
>>> But see bf82940dbf1 (color-words: enable REG_NEWLINE to help user,
>>> 2009-01-17) which is probably the source of the "bug" here (if any) for
>>> a trade-off related to that.
>>
>> Reverting that commit has no effect on the output, looking at find_word_boundaries() it has always terminated the match at a newline so I'm not sure what the commit message for bf82940dbf1 means when it says "Of course newlines can still be matched explicitly.".
> 
> Having thought about it some more I think it is saying "REG_NEWLINE is an incomplete fix for the problem of words being silently truncated at newlines as it does not stop words being truncated when the regex contains an explicit match for a newline"
> 
>> To me the problem is not with how the word diff is calculated but how it is presented. At the moment it just prints all the deletions followed by all the insertions. I think that if following some context it just printed the deletions and insertions up to the end of that line and then dumped the rest of the deletions and insertions following the line break that would give a more readable result. It would need to do something similar for deletions and insertions at the beginning of a line that are followed by some context before the end of the line.
>>
>> In general I find it useful that --word-diff ignores the line structure of the input, this enables it to show changes to the words in re-flowed text without the clutter one sees in a line-by-line diff from moving the line breaks. I think the changes above would probably be compatible with that.
>>
>> Best Wishes
>>
>> Phillip
>>

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2022-06-08  0:33 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-20 16:38 git diff word diff bug?? Count of San Francisco
2021-04-25  8:54 ` Atharva Raykar
2021-04-26  9:45   ` Ævar Arnfjörð Bjarmason
2021-04-30  4:12     ` Count of San Francisco
2021-05-02 18:00     ` Phillip Wood
2021-05-03  9:47       ` Phillip Wood
2022-06-07 22:40         ` Scott Phuong

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).