Understanding and improving --word-diff

From: Matthijs Kooijman @ 2010-11-08 15:16 UTC (permalink / raw)
  To: git; +Cc: Thomas Rast, Johannes Schindelin

Hi folks,

I recently discovered --word-diff (or rather, --color-words and found
--word-diff when I started to hack on the git master version) and I had
hoped it would make the unified diffs generated by git-diff more
readable.

More specifically, I had expected to get a normal unified diff, with
colouring added to highlight the changes within the normal - and + lines
(so you don't have to review the entire changed line to see that just a
single word or character has changed).

E.g., I would like to see:

-a <r>b</r> c
+a <g>x</g> c

Unfortunately, all --word-diff types currently depart from line-based -
and + lines and show the new version of the file with the changed words
(both old and new versions) shown inline, marked with coloring or
{- ...  -} kind of syntax. E.g., with --word-diff=color, the above would
look like:

a <r>b</r><g>x</g> c

Personally, I think that the first example above is easier to read than
the second one (at least for diffs of code).
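
To make the first format concrete, here is a minimal Python sketch of
what I have in mind for a single changed line pair. It is only an
illustration: mark_line_pair() is a made-up name, the default word regex
\S+ is an assumption, and Python's difflib stands in for git's own diff
machinery, so the matching may differ in detail from what git would
produce.

```python
import difflib
import re


def split_tokens(text, word_re=r"\S+"):
    """Split text into alternating word and non-word tokens, dropping empties."""
    return [t for t in re.split(f"({word_re})", text) if t]


def mark_line_pair(old, new, word_re=r"\S+"):
    """Return ('-...', '+...') lines with changed parts wrapped in <r>/<g>."""
    a, b = split_tokens(old, word_re), split_tokens(new, word_re)
    out_old, out_new = [], []
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        old_part, new_part = "".join(a[i1:i2]), "".join(b[j1:j2])
        if tag == "equal":
            # Unchanged text appears on both the - and the + line.
            out_old.append(old_part)
            out_new.append(new_part)
        else:
            # Removed text is marked on the - line, added text on the + line.
            if old_part:
                out_old.append(f"<r>{old_part}</r>")
            if new_part:
                out_new.append(f"<g>{new_part}</g>")
    return "-" + "".join(out_old), "+" + "".join(out_new)


if __name__ == "__main__":
    for line in mark_line_pair("a b c", "a x c"):
        print(line)
```

For the lines "a b c" and "a x c" this yields the two lines from the
first example above.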

I was planning to accompany this mail with a patch, so I've
started hacking on this feature already. However, halfway through some
cleanups and a prototype implementation of the above (breaking some of
the other --word-diff formats in the process), I found that the current
generalization of the different styles as stored in diff_words_styles[]
does not apply cleanly enough to my intended output format. While trying
to extend this generalization to something that would fit, I found that
I don't actually understand the rationale behind --word-diff and the
formats well enough to find a proper implementation (see the link at the
bottom of this email for the unfinished code I hacked up until now).

So, here are some observations and questions about how --word-diff works
or should work. Comments are welcome, both in general terms as well as
in terms of the word-diff implementation (I know my way around there by
now).

Intended use
------------
First of all, it seems that the main intended use of --word-diff is for
LaTeX or HTML documents or similar, where blocks of running text might
be hard-wrapped (and thus rewrapped after a small change). In these
cases, a small change in wording could cause a lot of whitespace to
shift, resulting in a big normal diff. The current word-diff
implementation therefore simply does not show the whitespace (or rather,
non-word) changes, since they're usually not relevant to LaTeX anyway.

Is this indeed the main use case, or are there others I'm missing?

Inexact output
--------------
Secondly, the --word-diff output currently never displays any changes to
the non-word (whitespace) parts of a file. This makes sense for the
LaTeX case, but sometimes you might want to get exact diff output
instead. At first glance this seems possible by specifying a word-regex
of "." or something similar (i.e., make sure that the word regex matches
everything). But this is problematic for newlines. The documentation
states that stuff gets silently ignored if a newline ends up inside a
word. For the --word-diff=color format, this is probably a fixed
limitation of the output format: you can't give a color to a newline (or
a space, for that matter).

Including a newline inside a {- ... -} block should not be a problem
with the --word-diff=plain format, and something similar can be argued
for the porcelain format.

An alternative approach would be to add a --word-diff-exact flag, which
would cause the whitespace between two matches of the word regex to be
treated as a word as well and have it included in the generated
word-diff.
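
As a rough sketch of what such a flag would do to the tokenization (the
flag itself is only a proposal, and exact_tokens() and the \S+ default
below are names and values I made up for illustration), the text between
two regex matches simply becomes a token of its own, so that joining all
tokens gives back the original text:

```python
import re


def exact_tokens(text, word_re=r"\S+"):
    """Split text into words plus the gaps between them; nothing is dropped."""
    tokens, pos = [], 0
    for match in re.finditer(word_re, text):
        if match.start() == match.end():
            continue  # ignore empty matches from a degenerate regex
        if match.start() > pos:
            # The non-word text between two matches is kept as its own token.
            tokens.append(text[pos:match.start()])
        tokens.append(match.group())
        pos = match.end()
    if pos < len(text):
        tokens.append(text[pos:])  # trailing non-word text
    return tokens


if __name__ == "__main__":
    print(exact_tokens("a  b\tc"))
```

The invariant that matters is "".join(exact_tokens(text)) == text; with
that in place a diff over the tokens can describe whitespace changes
exactly.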

This still leaves an implementation problem: To generate the word-diff,
the current code looks at one patch hunk at a time, collecting all the
plus and minus lines. It then splits those lines into words and
generates two new "files" containing one word per line. It then applies
a diff to this new document to get the word-diff.
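
In Python terms, my understanding of that procedure is roughly the
following. This is a simplified model and not the actual C code:
word_diff_hunk() is a hypothetical name, \S+ is an assumed word regex,
and difflib replaces the diff that git runs on the two word "files".

```python
import difflib
import re


def words_one_per_line(lines, word_re=r"\S+"):
    """Build the word 'file' for one side of a hunk; non-word text is dropped."""
    return [word for line in lines for word in re.findall(word_re, line)]


def word_diff_hunk(minus_lines, plus_lines, word_re=r"\S+"):
    """Diff the two word 'files' and return (prefix, word) pairs."""
    old = words_one_per_line(minus_lines, word_re)
    new = words_one_per_line(plus_lines, word_re)
    changes = []
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            changes += [(" ", word) for word in old[i1:i2]]
        else:
            changes += [("-", word) for word in old[i1:i2]]
            changes += [("+", word) for word in new[j1:j2]]
    return changes


if __name__ == "__main__":
    # A rewrapped paragraph with one changed word.
    for prefix, word in word_diff_hunk(["the quick brown", "fox jumps"],
                                       ["the quick", "brown cat jumps"]):
        print(prefix + word)
```

Since line breaks and spacing never reach the word "files", the
rewrapping in the example is invisible and only fox/cat shows up as a
change.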

When a word would contain a newline, this would effectively mean the
word would be split into two words for the word-diff, which will
probably screw up the output. An obvious solution would be to use some
escape sequence (e.g. \n) for a newline, though that might get messy and
inefficient. An alternative that seems feasible is to use the empty word
(i.e., an empty line in the word-diff "files") to mean a newline.

This would mean that every newline always breaks a word into two,
regardless of what the word regex is set to (but I guess that makes
sense anyway?). I also think this would allow complete diff output wrt
whitespace and newlines, for output formats that support it: plain,
(modified) porcelain and my proposed format.
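
A small sketch of the empty-word encoding, to check that it round-trips.
The function names are invented for this mail, and the token list is
assumed to cover the whole text, whitespace included:

```python
def encode_words(tokens):
    """One token per line; an empty line stands for a newline in the text."""
    lines = []
    for token in tokens:
        for index, part in enumerate(token.split("\n")):
            if index > 0:
                lines.append("")  # the newline itself, as the empty word
            if part:
                lines.append(part)  # text before or after the newline
    return lines


def decode_words(lines):
    """Inverse of encode_words: rebuild the text from the word 'file'."""
    return "".join("\n" if line == "" else line for line in lines)


if __name__ == "__main__":
    tokens = ["a", " ", "b", "\n", "c"]
    print(encode_words(tokens))
    print(repr(decode_words(encode_words(tokens))))
```

A token such as "x\ny" comes out as the three lines "x", "" and "y",
which is the forced split at every newline mentioned above.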

Porcelain format
----------------
Lastly, the "porcelain" word-diff format seems a bit weird to me. Is
the format specified somewhere, or are there any programs that use it
currently? I couldn't find any users inside the git.git tree itself.

Looking at the format itself, it's a bit unclear to me what the ~ lines
mean exactly. Commit 882749, which introduced the format, says they mean
"newlines in the input", but I'm not sure if this means the old file,
new file or both.

In fact, it seems that this uncertainty makes the porcelain-format
ambiguous wrt newlines. For example, these two diff hunks:

@@ -1,3 +1,2 @@
 a
-b
 c

@@ -1,3 +1,3 @@
 a
-b
+
 c

both look the same in porcelain format, except for the hunk header:

@@ -1,3 +1,2 @@
 a
~
-b
~
 c
~
@@ -1,3 +1,3 @@
 a
~
-b
~
 c
~

This is somewhat expected, of course, since the --word-diff formats are
documented to show only changes to words, not to non-words/whitespace.
So I guess it is expected that the output is ambiguous wrt whitespace,
but if so, what is the use of this porcelain format? Wouldn't it make
a lot more sense to make the format unambiguous and make it do
word-based diff at the same time? I think this should be possible
because of the explicit notation used for the newline.

For example, specifying the ~ lines to mean a newline in the old, new or
both files depending on the previous +, - or space prefixed line is
probably enough for this. By generating empty +, - or space prefixed
lines when needed, every occurrence of ~ could be disambiguated.

For example, the above two diff hunks would then become the following.
The only difference is the near-empty line (just a space prefix) after
-b in the second hunk.

@@ -1,3 +1,2 @@
 a
~
-b
~
 c
~
@@ -1,3 +1,3 @@
 a
~
-b
 
~
 c
~
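
To check that this reading of ~ is really unambiguous, here is a small
Python sketch of a consumer for the modified format. It implements my
proposed interpretation only, not anything git documents today, and
rebuild() is a hypothetical helper name:

```python
def rebuild(porcelain_lines):
    """Reconstruct (old_text, new_text) from the proposed porcelain output."""
    old, new = [], []
    last = " "  # prefix of the most recent ' ', '-' or '+' line
    for line in porcelain_lines:
        if not line or line.startswith("@@"):
            continue  # skip hunk headers and stray empty lines
        if line == "~":
            # A newline belongs to whichever side(s) the previous line was on.
            if last in (" ", "-"):
                old.append("\n")
            if last in (" ", "+"):
                new.append("\n")
            continue
        last, text = line[0], line[1:]
        if last in (" ", "-"):
            old.append(text)
        if last in (" ", "+"):
            new.append(text)
    return "".join(old), "".join(new)


if __name__ == "__main__":
    first = ["@@ -1,3 +1,2 @@", " a", "~", "-b", "~", " c", "~"]
    second = ["@@ -1,3 +1,3 @@", " a", "~", "-b", " ", "~", " c", "~"]
    print(rebuild(first))
    print(rebuild(second))
```

Fed the two hunks above, the first gives a new file of "a\nc\n" and the
second "a\n\nc\n", while the old file is "a\nb\nc\n" in both cases, so
the two hunks can be told apart again.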

So, these are some thoughts I've had while hacking on the code. As said,
suggestions are welcome. I'd like my hacking to result in some useful
patches, but right now I'm unsure what direction(s) I should be
thinking/working in.

In case you're interested in the hacking I've done so far, I've put it
up here:

http://git.stderr.nl/gitweb?p=matthijs/upstream/git.git;a=shortlog;h=refs/heads/word-diff

Most of it is broken or not properly tested, but it gives an idea of what
kinds of cleanup I've been doing.

Gr.

Matthijs
