All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/2 v4] xdiff: implement empty line chunk heuristic
@ 2016-04-18 21:12 Stefan Beller
  2016-04-18 21:12 ` [PATCH 1/2] xdiff: add recs_match helper function Stefan Beller
                   ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Stefan Beller @ 2016-04-18 21:12 UTC (permalink / raw)
  To: gitster; +Cc: git, jacob.keller, Stefan Beller

> OK, so perhaps either of you two can do a final version people can
> start having fun with?

Here we go. I squashed in your patch, although with a minor change:

-               if ((flags & XDF_SHORTEST_LINE_HEURISTIC)) {
+               if ((flags & XDF_COMPACTION_HEURISTIC) && blank_lines) {

We did not need that in the "shortest line" heuristic as we know
a line with the shortest line length must exist. We do not know about
empty lines though.

Thanks,
Stefan

Jacob Keller (1):
  xdiff: add recs_match helper function

Stefan Beller (1):
  xdiff: implement empty line chunk heuristic

 Documentation/diff-config.txt  |  5 +++++
 Documentation/diff-options.txt |  6 ++++++
 diff.c                         | 11 +++++++++++
 xdiff/xdiff.h                  |  2 ++
 xdiff/xdiffi.c                 | 40 ++++++++++++++++++++++++++++++++++++----
 5 files changed, 60 insertions(+), 4 deletions(-)

-- 
2.8.0.26.gba39a1b.dirty

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 1/2] xdiff: add recs_match helper function
  2016-04-18 21:12 [PATCH 0/2 v4] xdiff: implement empty line chunk heuristic Stefan Beller
@ 2016-04-18 21:12 ` Stefan Beller
  2016-04-18 21:12 ` [PATCH 2/2] xdiff: implement empty line chunk heuristic Stefan Beller
  2016-04-18 21:22 ` [PATCH 0/2 v4] " Junio C Hamano
  2 siblings, 0 replies; 20+ messages in thread
From: Stefan Beller @ 2016-04-18 21:12 UTC (permalink / raw)
  To: gitster; +Cc: git, jacob.keller, Jacob Keller, Stefan Beller

From: Jacob Keller <jacob.keller@gmail.com>

It is a common pattern in xdl_change_compact to check that hashes and
strings match. The resulting code to perform this change causes very
long lines and makes it hard to follow the intention. Introduce a helper
function recs_match which performs both checks to increase
code readability.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Stefan Beller <sbeller@google.com>
---
 xdiff/xdiffi.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
index 2358a2d..748eeb9 100644
--- a/xdiff/xdiffi.c
+++ b/xdiff/xdiffi.c
@@ -400,6 +400,14 @@ static xdchange_t *xdl_add_change(xdchange_t *xscr, long i1, long i2, long chg1,
 }
 
 
+static int recs_match(xrecord_t **recs, long ixs, long ix, long flags)
+{
+	return (recs[ixs]->ha == recs[ix]->ha &&
+		xdl_recmatch(recs[ixs]->ptr, recs[ixs]->size,
+			     recs[ix]->ptr, recs[ix]->size,
+			     flags));
+}
+
 int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
 	long ix, ixo, ixs, ixref, grpsiz, nrec = xdf->nrec;
 	char *rchg = xdf->rchg, *rchgo = xdfo->rchg;
@@ -442,8 +450,7 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
 			 * the last line of the current change group, shift backward
 			 * the group.
 			 */
-			while (ixs > 0 && recs[ixs - 1]->ha == recs[ix - 1]->ha &&
-			       xdl_recmatch(recs[ixs - 1]->ptr, recs[ixs - 1]->size, recs[ix - 1]->ptr, recs[ix - 1]->size, flags)) {
+			while (ixs > 0 && recs_match(recs, ixs - 1, ix - 1, flags)) {
 				rchg[--ixs] = 1;
 				rchg[--ix] = 0;
 
@@ -470,8 +477,7 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
 			 * the line next of the current change group, shift forward
 			 * the group.
 			 */
-			while (ix < nrec && recs[ixs]->ha == recs[ix]->ha &&
-			       xdl_recmatch(recs[ixs]->ptr, recs[ixs]->size, recs[ix]->ptr, recs[ix]->size, flags)) {
+			while (ix < nrec && recs_match(recs, ixs, ix, flags)) {
 				rchg[ixs++] = 0;
 				rchg[ix++] = 1;
 
-- 
2.8.0.26.gba39a1b.dirty

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 2/2] xdiff: implement empty line chunk heuristic
  2016-04-18 21:12 [PATCH 0/2 v4] xdiff: implement empty line chunk heuristic Stefan Beller
  2016-04-18 21:12 ` [PATCH 1/2] xdiff: add recs_match helper function Stefan Beller
@ 2016-04-18 21:12 ` Stefan Beller
  2016-04-18 22:04   ` Jacob Keller
  2016-04-19  5:03   ` Jeff King
  2016-04-18 21:22 ` [PATCH 0/2 v4] " Junio C Hamano
  2 siblings, 2 replies; 20+ messages in thread
From: Stefan Beller @ 2016-04-18 21:12 UTC (permalink / raw)
  To: gitster; +Cc: git, jacob.keller, Stefan Beller, Jacob Keller

In order to produce the smallest possible diff and combine several diff
hunks together, we implement a heuristic from GNU Diff which moves diff
hunks forward as far as possible when we find common context above and
below a diff hunk. This sometimes produces less readable diffs when
writing C, Shell, or other programming languages, ie:

...
 /*
+ *
+ *
+ */
+
+/*
...

instead of the more readable equivalent of

...
+/*
+ *
+ *
+ */
+
 /*
...

Implement the following heuristic to (optionally) produce the desired
output.

  If there are diff chunks which can be shifted around, shift each hunk
  such that the last common empty line is below the chunk with the rest
  of the context above.

This heuristic appears to resolve the above example and several other
common issues without producing significantly weird results. However, as
with any heuristic it is not really known whether this will always be
more optimal. Thus, it can be disabled via diff.compactionHeuristic.

Signed-off-by: Stefan Beller <sbeller@google.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Stefan Beller <sbeller@google.com>
---
 Documentation/diff-config.txt  |  5 +++++
 Documentation/diff-options.txt |  6 ++++++
 diff.c                         | 11 +++++++++++
 xdiff/xdiff.h                  |  2 ++
 xdiff/xdiffi.c                 | 26 ++++++++++++++++++++++++++
 5 files changed, 50 insertions(+)

diff --git a/Documentation/diff-config.txt b/Documentation/diff-config.txt
index edba565..a9f4b57 100644
--- a/Documentation/diff-config.txt
+++ b/Documentation/diff-config.txt
@@ -170,6 +170,11 @@ diff.tool::
 
 include::mergetools-diff.txt[]
 
+diff.compactionHeuristic::
+	Set this option to enable an experimental heuristic that
+	shifts the hunk boundary in an attempt to make the resulting
+	patch easier to read.
+
 diff.algorithm::
 	Choose a diff algorithm.  The variants are as follows:
 +
diff --git a/Documentation/diff-options.txt b/Documentation/diff-options.txt
index 4b0318e..0993742 100644
--- a/Documentation/diff-options.txt
+++ b/Documentation/diff-options.txt
@@ -63,6 +63,12 @@ ifndef::git-format-patch[]
 	Synonym for `-p --raw`.
 endif::git-format-patch[]
 
+--compaction-heuristic::
+--no-compaction-heuristic::
+	These are to help debugging and tuning an experimental
+	heuristic that shifts the hunk boundary in an attempt to
+	make the resulting patch easier to read.
+
 --minimal::
 	Spend extra time to make sure the smallest possible
 	diff is produced.
diff --git a/diff.c b/diff.c
index 4dfe660..d3734d3 100644
--- a/diff.c
+++ b/diff.c
@@ -26,6 +26,7 @@
 #endif
 
 static int diff_detect_rename_default;
+static int diff_compaction_heuristic = 1;
 static int diff_rename_limit_default = 400;
 static int diff_suppress_blank_empty;
 static int diff_use_color_default = -1;
@@ -189,6 +190,10 @@ int git_diff_ui_config(const char *var, const char *value, void *cb)
 		diff_detect_rename_default = git_config_rename(var, value);
 		return 0;
 	}
+	if (!strcmp(var, "diff.compactionheuristic")) {
+		diff_compaction_heuristic = git_config_bool(var, value);
+		return 0;
+	}
 	if (!strcmp(var, "diff.autorefreshindex")) {
 		diff_auto_refresh_index = git_config_bool(var, value);
 		return 0;
@@ -3278,6 +3283,8 @@ void diff_setup(struct diff_options *options)
 	options->use_color = diff_use_color_default;
 	options->detect_rename = diff_detect_rename_default;
 	options->xdl_opts |= diff_algorithm;
+	if (diff_compaction_heuristic)
+		DIFF_XDL_SET(options, COMPACTION_HEURISTIC);
 
 	options->orderfile = diff_order_file_cfg;
 
@@ -3798,6 +3805,10 @@ int diff_opt_parse(struct diff_options *options,
 		DIFF_XDL_SET(options, IGNORE_WHITESPACE_AT_EOL);
 	else if (!strcmp(arg, "--ignore-blank-lines"))
 		DIFF_XDL_SET(options, IGNORE_BLANK_LINES);
+	else if (!strcmp(arg, "--compaction-heuristic"))
+		DIFF_XDL_SET(options, COMPACTION_HEURISTIC);
+	else if (!strcmp(arg, "--no-compaction-heuristic"))
+		DIFF_XDL_CLR(options, COMPACTION_HEURISTIC);
 	else if (!strcmp(arg, "--patience"))
 		options->xdl_opts = DIFF_WITH_ALG(options, PATIENCE_DIFF);
 	else if (!strcmp(arg, "--histogram"))
diff --git a/xdiff/xdiff.h b/xdiff/xdiff.h
index 4fb7e79..7423f77 100644
--- a/xdiff/xdiff.h
+++ b/xdiff/xdiff.h
@@ -41,6 +41,8 @@ extern "C" {
 
 #define XDF_IGNORE_BLANK_LINES (1 << 7)
 
+#define XDF_COMPACTION_HEURISTIC (1 << 8)
+
 #define XDL_EMIT_FUNCNAMES (1 << 0)
 #define XDL_EMIT_FUNCCONTEXT (1 << 2)
 
diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
index 748eeb9..5a02b15 100644
--- a/xdiff/xdiffi.c
+++ b/xdiff/xdiffi.c
@@ -400,6 +400,11 @@ static xdchange_t *xdl_add_change(xdchange_t *xscr, long i1, long i2, long chg1,
 }
 
 
+static int is_blank_line(xrecord_t **recs, long ix, long flags)
+{
+	return xdl_blankline(recs[ix]->ptr, recs[ix]->size, flags);
+}
+
 static int recs_match(xrecord_t **recs, long ixs, long ix, long flags)
 {
 	return (recs[ixs]->ha == recs[ix]->ha &&
@@ -411,6 +416,7 @@ static int recs_match(xrecord_t **recs, long ixs, long ix, long flags)
 int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
 	long ix, ixo, ixs, ixref, grpsiz, nrec = xdf->nrec;
 	char *rchg = xdf->rchg, *rchgo = xdfo->rchg;
+	unsigned int blank_lines;
 	xrecord_t **recs = xdf->recs;
 
 	/*
@@ -444,6 +450,7 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
 
 		do {
 			grpsiz = ix - ixs;
+			blank_lines = 0;
 
 			/*
 			 * If the line before the current change group, is equal to
@@ -478,6 +485,8 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
 			 * the group.
 			 */
 			while (ix < nrec && recs_match(recs, ixs, ix, flags)) {
+				blank_lines += is_blank_line(recs, ix, flags);
+
 				rchg[ixs++] = 0;
 				rchg[ix++] = 1;
 
@@ -504,6 +513,23 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
 			rchg[--ix] = 0;
 			while (rchgo[--ixo]);
 		}
+
+		/*
+		 * If a group can be moved back and forth, see if there is an
+		 * blank line in the moving space. If there is a blank line,
+		 * make sure the last blank line is the end of the group.
+		 *
+		 * As we shifted the group forward as far as possible, we only
+		 * need to shift it back if at all.
+		 */
+		if ((flags & XDF_COMPACTION_HEURISTIC) && blank_lines) {
+			while (ixs > 0 &&
+			       !is_blank_line(recs, ix - 1, flags) &&
+			       recs_match(recs, ixs - 1, ix - 1, flags)) {
+				rchg[--ixs] = 1;
+				rchg[--ix] = 0;
+			}
+		}
 	}
 
 	return 0;
-- 
2.8.0.26.gba39a1b.dirty

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/2 v4] xdiff: implement empty line chunk heuristic
  2016-04-18 21:12 [PATCH 0/2 v4] xdiff: implement empty line chunk heuristic Stefan Beller
  2016-04-18 21:12 ` [PATCH 1/2] xdiff: add recs_match helper function Stefan Beller
  2016-04-18 21:12 ` [PATCH 2/2] xdiff: implement empty line chunk heuristic Stefan Beller
@ 2016-04-18 21:22 ` Junio C Hamano
  2016-04-18 23:53   ` Stefan Beller
  2 siblings, 1 reply; 20+ messages in thread
From: Junio C Hamano @ 2016-04-18 21:22 UTC (permalink / raw)
  To: Stefan Beller; +Cc: git, jacob.keller

Stefan Beller <sbeller@google.com> writes:

>> OK, so perhaps either of you two can do a final version people can
>> start having fun with?
>
> Here we go. I squashed in your patch, although with a minor change:
>
> -               if ((flags & XDF_SHORTEST_LINE_HEURISTIC)) {
> +               if ((flags & XDF_COMPACTION_HEURISTIC) && blank_lines) {
>
> We did not need that in the "shortest line" heuristic as we know
> a line with the shortest line length must exist. We do not know about
> empty lines though.

Makes sense.  The last hunk of

$ git show 9614b8dcf -- update-cache.c

gives an unexpected result without "&& blank_lines" above.  Lack of
"&& blank_lines" happens to make the result slightly easier to read,
but at the cost of having an extra line in the hunk.

Thanks.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] xdiff: implement empty line chunk heuristic
  2016-04-18 21:12 ` [PATCH 2/2] xdiff: implement empty line chunk heuristic Stefan Beller
@ 2016-04-18 22:04   ` Jacob Keller
  2016-04-18 22:24     ` Junio C Hamano
  2016-04-19  5:03   ` Jeff King
  1 sibling, 1 reply; 20+ messages in thread
From: Jacob Keller @ 2016-04-18 22:04 UTC (permalink / raw)
  To: Stefan Beller; +Cc: Junio C Hamano, Git mailing list, Jacob Keller

On Mon, Apr 18, 2016 at 2:12 PM, Stefan Beller <sbeller@google.com> wrote:
> In order to produce the smallest possible diff and combine several diff
> hunks together, we implement a heuristic from GNU Diff which moves diff
> hunks forward as far as possible when we find common context above and
> below a diff hunk. This sometimes produces less readable diffs when
> writing C, Shell, or other programming languages, ie:
>
> ...
>  /*
> + *
> + *
> + */
> +
> +/*
> ...
>
> instead of the more readable equivalent of
>
> ...
> +/*
> + *
> + *
> + */
> +
>  /*
> ...
>
> Implement the following heuristic to (optionally) produce the desired
> output.
>
>   If there are diff chunks which can be shifted around, shift each hunk
>   such that the last common empty line is below the chunk with the rest
>   of the context above.
>
> This heuristic appears to resolve the above example and several other
> common issues without producing significantly weird results. However, as
> with any heuristic it is not really known whether this will always be
> more optimal. Thus, it can be disabled via diff.compactionHeuristic.
>
> Signed-off-by: Stefan Beller <sbeller@google.com>
> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
> Signed-off-by: Stefan Beller <sbeller@google.com>
> ---

Thanks Stephan and Junio, this looks pretty good. I think before it's
merged we'd probably want to implement some sort of attributes which
allows per-path configuration, incase it needs to be configured at
all.

I've got it applied to my local git, and I'm going to try to run a
diff between enabled vs disabled on a large section of the Linux
kernel history and a few other projects to see if I spot anything odd.

Thanks,
Jake

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] xdiff: implement empty line chunk heuristic
  2016-04-18 22:04   ` Jacob Keller
@ 2016-04-18 22:24     ` Junio C Hamano
  0 siblings, 0 replies; 20+ messages in thread
From: Junio C Hamano @ 2016-04-18 22:24 UTC (permalink / raw)
  To: Jacob Keller; +Cc: Stefan Beller, Git mailing list, Jacob Keller

Jacob Keller <jacob.keller@gmail.com> writes:

> Thanks Stephan and Junio, this looks pretty good. I think before it's
> merged we'd probably want to implement some sort of attributes which
> allows per-path configuration, incase it needs to be configured at
> all.

My take on it is that we'd want to make sure that the shift with
blank line heuristics is "good enough", i.e. there is no need for
end-user configuration or attributes, and then remove the tentative
option, configuration and its documentation, before this is merged.

If we really want to add knobs to handle different kind of payloads
in vastly different way, the right place to do so is to add a set of
bits "use this and that heuristics" to userdiff driver, I would say,
but in the compaction codepath it does not seem to be enough room to
have that many knobs to be tweaked in the first place to me.

> I've got it applied to my local git, and I'm going to try to run a
> diff between enabled vs disabled on a large section of the Linux
> kernel history and a few other projects to see if I spot anything odd.

Thanks.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/2 v4] xdiff: implement empty line chunk heuristic
  2016-04-18 21:22 ` [PATCH 0/2 v4] " Junio C Hamano
@ 2016-04-18 23:53   ` Stefan Beller
  0 siblings, 0 replies; 20+ messages in thread
From: Stefan Beller @ 2016-04-18 23:53 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Jacob Keller

On Mon, Apr 18, 2016 at 2:22 PM, Junio C Hamano <gitster@pobox.com> wrote:
> Stefan Beller <sbeller@google.com> writes:
>
>>> OK, so perhaps either of you two can do a final version people can
>>> start having fun with?
>>
>> Here we go. I squashed in your patch, although with a minor change:
>>
>> -               if ((flags & XDF_SHORTEST_LINE_HEURISTIC)) {
>> +               if ((flags & XDF_COMPACTION_HEURISTIC) && blank_lines) {
>>
>> We did not need that in the "shortest line" heuristic as we know
>> a line with the shortest line length must exist. We do not know about
>> empty lines though.
>
> Makes sense.  The last hunk of
>
> $ git show 9614b8dcf -- update-cache.c
>
> gives an unexpected result without "&& blank_lines" above.  Lack of
> "&& blank_lines" happens to make the result slightly easier to read,
> but at the cost of having an extra line in the hunk.

So without the blank_lines check you get  (A):
    @@ -271,15 +279,14 @@ int main(int argc, char **argv)
                     if (!verify_path(path)) {
                             fprintf(stderr, "Ignoring path %s\n", argv[i]);
                             continue;
    -                }
    -                if (add_file_to_cache(path)) {
    -                        fprintf(stderr, "Unable to add %s to
database\n", path);
    -                        goto out;
                     }
    +                if (add_file_to_cache(path))
    +                        usage("Unable to add %s to database", path);
             }
    ...

and with the heuristic you get (B):

@@ -272,14 +280,13 @@ int main(int argc, char **argv)
    @@ -272,14 +280,13 @@ int main(int argc, char **argv)
                             fprintf(stderr, "Ignoring path %s\n", argv[i]);
                             continue;
                     }
    -                if (add_file_to_cache(path)) {
    -                        fprintf(stderr, "Unable to add %s to
database\n", path);
    -                        goto out;
    -                }
    +                if (add_file_to_cache(path))
    +                        usage("Unable to add %s to database", path);
             }
    ...

In case of (A) the compaction heuristic tries to shift the hunk upwards,
stopping at the first empty line or when lines miss match.
As there is no blank line, it goes until the miss match.

Personally I'd find it less readable, because the intent was not to remove

    -                }
    -                if (add_file_to_cache(path)) {
    -                        fprintf(stderr, "Unable to add %s to
database\n", path);
    -                        goto out;

but rather remove

    -                if (add_file_to_cache(path)) {
    -                        fprintf(stderr, "Unable to add %s to
database\n", path);
    -                        goto out;
    -                }

as that is the logic unit I'd think.

Although you find this instance easier to read the behavior without the
blank_lines check would result in

    Shift hunk upward as much as possible, stop at the first empty line.

For hunks without empty line this just becomes

    Shift hunk upward as much as possible.

which is 50:50 for looking good, so we kept the old behavior as
that is just as good.

Thanks,
Stefan


>
> Thanks.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] xdiff: implement empty line chunk heuristic
  2016-04-18 21:12 ` [PATCH 2/2] xdiff: implement empty line chunk heuristic Stefan Beller
  2016-04-18 22:04   ` Jacob Keller
@ 2016-04-19  5:03   ` Jeff King
  2016-04-19  6:47     ` Stefan Beller
                       ` (2 more replies)
  1 sibling, 3 replies; 20+ messages in thread
From: Jeff King @ 2016-04-19  5:03 UTC (permalink / raw)
  To: Stefan Beller; +Cc: gitster, git, jacob.keller, Jacob Keller

On Mon, Apr 18, 2016 at 02:12:30PM -0700, Stefan Beller wrote:

> +
> +		/*
> +		 * If a group can be moved back and forth, see if there is an
> +		 * blank line in the moving space. If there is a blank line,
> +		 * make sure the last blank line is the end of the group.

s/an/a/ on the first line

> +		 * As we shifted the group forward as far as possible, we only
> +		 * need to shift it back if at all.

Maybe because I'm reading it as a diff that only contains this hunk and
not the whole rest of the function, but the "we" here confused me. You
mean the earlier, existing loop in xdl_change_compact, right?

Maybe something like:

  As we already shifted the group forward as far as possible in the
  earlier loop...

would help.

> +		if ((flags & XDF_COMPACTION_HEURISTIC) && blank_lines) {
> +			while (ixs > 0 &&
> +			       !is_blank_line(recs, ix - 1, flags) &&
> +			       recs_match(recs, ixs - 1, ix - 1, flags)) {
> +				rchg[--ixs] = 1;
> +				rchg[--ix] = 0;
> +			}
> +		}

This turned out to be delightfully simple (especially compared to the
perl monstrosity).

I tried comparing the output to the perl one, but it's not quite the
same. In that one we had to work with the existing hunks and context
lines, so any hunk that got shifted ended up with extra context on one
side, and too little on the other. But here, we can actually bump the
context lines to give the correct amount on both sides, which is good.

I guess this will invalidate old patch-ids, but there's not much to be
done about that.

-Peff

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] xdiff: implement empty line chunk heuristic
  2016-04-19  5:03   ` Jeff King
@ 2016-04-19  6:47     ` Stefan Beller
  2016-04-19  7:00       ` Jeff King
  2016-04-19 15:17     ` Stefan Beller
  2016-04-19 16:51     ` Junio C Hamano
  2 siblings, 1 reply; 20+ messages in thread
From: Stefan Beller @ 2016-04-19  6:47 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, git, Jacob Keller, Jacob Keller

On Mon, Apr 18, 2016 at 10:03 PM, Jeff King <peff@peff.net> wrote:
> On Mon, Apr 18, 2016 at 02:12:30PM -0700, Stefan Beller wrote:
>
>> +
>> +             /*
>> +              * If a group can be moved back and forth, see if there is an
>> +              * blank line in the moving space. If there is a blank line,
>> +              * make sure the last blank line is the end of the group.
>
> s/an/a/ on the first line

So it looks like I'll be resending another version for this series tomorrow.
Thanks for pointing this out!

>
>> +              * As we shifted the group forward as far as possible, we only
>> +              * need to shift it back if at all.
>
> Maybe because I'm reading it as a diff that only contains this hunk and
> not the whole rest of the function, but the "we" here confused me. You
> mean the earlier, existing loop in xdl_change_compact, right?
>
> Maybe something like:
>
>   As we already shifted the group forward as far as possible in the
>   earlier loop...
>
> would help.

I'll see to get rid of the 'we', otherwise I'll stick with your suggestion.

>
>> +             if ((flags & XDF_COMPACTION_HEURISTIC) && blank_lines) {
>> +                     while (ixs > 0 &&
>> +                            !is_blank_line(recs, ix - 1, flags) &&
>> +                            recs_match(recs, ixs - 1, ix - 1, flags)) {
>> +                             rchg[--ixs] = 1;
>> +                             rchg[--ix] = 0;
>> +                     }
>> +             }
>
> This turned out to be delightfully simple (especially compared to the
> perl monstrosity).
>
> I tried comparing the output to the perl one, but it's not quite the
> same. In that one we had to work with the existing hunks and context
> lines, so any hunk that got shifted ended up with extra context on one
> side, and too little on the other. But here, we can actually bump the
> context lines to give the correct amount on both sides, which is good.
>
> I guess this will invalidate old patch-ids, but there's not much to be
> done about that.

For the record:
I thought about "optimal hunk separation" for a while, specially during my
bike commute. And while this heuristic seems to be a good fit for most of
the cases inspected, we can do better (in the future).

I am convinced the better way to do it is like this:

    Calculate the entropy for each line and take the last line with the
    lowest entropy as the last line of the hunk.

That heuristic requires more compute though as it will be hard to compute
the entropy for the line. To do that I would imagine, we'd need to loop over
the whole file and count the occurrences for each char (byte) and then
take the negative log of (#number of that byte / #number of bytes in file) [1].

This would model our actual goal a bit more closely to split at parts, where
there is low information density (the definition of entropy).

One example Jacob pointed out was a thing like

/**
 * Comment here. Over
 * more lines.
 *
+ *  Add line here with a blank line
+ *
+ * in between and a trailing blank after.
+ *
 */

I think we had cases like this in the kernel tree and else where,
and for a human it is clear to break after the last "empty line"
(which for comments starts with " * "). To detect those we can use
the entropy as it doesn't convey lots of information.
(git show e1f7037167323461c0415447676262dcb)

It also keeps the false positives out, Jacob pointed at
85ed2f32064b82e541fc7dcf2b0049a05 IIRC, which was bad with
the shortest lines only, but I'd imagine the entropy based
heuristic will do better there.

[1] https://en.wikipedia.org/wiki/Entropy_(information_theory)

Thanks for the review,
Stefan

>
> -Peff

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] xdiff: implement empty line chunk heuristic
  2016-04-19  6:47     ` Stefan Beller
@ 2016-04-19  7:00       ` Jeff King
  2016-04-19  7:05         ` Stefan Beller
  0 siblings, 1 reply; 20+ messages in thread
From: Jeff King @ 2016-04-19  7:00 UTC (permalink / raw)
  To: Stefan Beller; +Cc: Junio C Hamano, git, Jacob Keller, Jacob Keller

On Mon, Apr 18, 2016 at 11:47:52PM -0700, Stefan Beller wrote:

> I am convinced the better way to do it is like this:
> 
>     Calculate the entropy for each line and take the last line with the
>     lowest entropy as the last line of the hunk.

I'll be curious to see the results, but I think sometimes predictable
and stupid may be the best route with these sorts of things. In
particular, I'd worry that a content-independent measure of entropy
might miss some subtleties of a particular language (e.g., that "*" is
more or less meaningful than some other character). But we'll see. :)

-Peff

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] xdiff: implement empty line chunk heuristic
  2016-04-19  7:00       ` Jeff King
@ 2016-04-19  7:05         ` Stefan Beller
  0 siblings, 0 replies; 20+ messages in thread
From: Stefan Beller @ 2016-04-19  7:05 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, git, Jacob Keller, Jacob Keller

On Tue, Apr 19, 2016 at 12:00 AM, Jeff King <peff@peff.net> wrote:
> On Mon, Apr 18, 2016 at 11:47:52PM -0700, Stefan Beller wrote:
>
>> I am convinced the better way to do it is like this:
>>
>>     Calculate the entropy for each line and take the last line with the
>>     lowest entropy as the last line of the hunk.
>
> I'll be curious to see the results, but I think sometimes predictable
> and stupid may be the best route with these sorts of things. In
> particular, I'd worry that a content-independent measure of entropy
> might miss some subtleties of a particular language (e.g., that "*" is
> more or less meaningful than some other character). But we'll see. :)

I would assume that the "*" would have little entropy when there are lots
of comments, i.e. it just "feels" like an empty line.
If there are no "*", then the entropy is high as it is unusual. And
unusual things
should not be at the border of a hunk I would assume.
So m prediction is that the  'subtleties of a particular language' correlate
highly with the actual use of characters.

Anyway, the experiment can be carried out later. :)

Thanks,
Stefan

>
> -Peff
> --
> To unsubscribe from this list: send the line "unsubscribe git" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] xdiff: implement empty line chunk heuristic
  2016-04-19  5:03   ` Jeff King
  2016-04-19  6:47     ` Stefan Beller
@ 2016-04-19 15:17     ` Stefan Beller
  2016-04-19 17:06       ` Jeff King
  2016-04-19 16:51     ` Junio C Hamano
  2 siblings, 1 reply; 20+ messages in thread
From: Stefan Beller @ 2016-04-19 15:17 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, git, Jacob Keller, Jacob Keller

On Mon, Apr 18, 2016 at 10:03 PM, Jeff King <peff@peff.net> wrote:

> I guess this will invalidate old patch-ids, but there's not much to be
> done about that.

What do you mean by that? (What consequences do you imagine?)
I think diffs with any kind of heuristic can still be applied, no?

Thanks,
Stefan

>
> -Peff

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] xdiff: implement empty line chunk heuristic
  2016-04-19  5:03   ` Jeff King
  2016-04-19  6:47     ` Stefan Beller
  2016-04-19 15:17     ` Stefan Beller
@ 2016-04-19 16:51     ` Junio C Hamano
  2 siblings, 0 replies; 20+ messages in thread
From: Junio C Hamano @ 2016-04-19 16:51 UTC (permalink / raw)
  To: Jeff King; +Cc: Stefan Beller, git, jacob.keller, Jacob Keller

Jeff King <peff@peff.net> writes:

> I guess this will invalidate old patch-ids, but there's not much to be
> done about that.

If we really cared, we could disable this (and any future) change to
the compaction logic to "patch-id --[un]stable" option.

I am not sure if it is worth the effort, though ;-)

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] xdiff: implement empty line chunk heuristic
  2016-04-19 15:17     ` Stefan Beller
@ 2016-04-19 17:06       ` Jeff King
  2016-04-19 23:02         ` Jacob Keller
  2016-04-20  6:00         ` Junio C Hamano
  0 siblings, 2 replies; 20+ messages in thread
From: Jeff King @ 2016-04-19 17:06 UTC (permalink / raw)
  To: Stefan Beller; +Cc: Junio C Hamano, git, Jacob Keller, Jacob Keller

On Tue, Apr 19, 2016 at 08:17:38AM -0700, Stefan Beller wrote:

> On Mon, Apr 18, 2016 at 10:03 PM, Jeff King <peff@peff.net> wrote:
> 
> > I guess this will invalidate old patch-ids, but there's not much to be
> > done about that.
> 
> What do you mean by that? (What consequences do you imagine?)
> I think diffs with any kind of heuristic can still be applied, no?

I mean that if you save any old patch-ids from "git patch-id", they
won't match up when compared with new versions of git. We can probably
ignore it, though. This isn't the first time that patch-ids might have
changed, and I think the advice is already that one should not count on
them to be stable in the long term.

-Peff

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] xdiff: implement empty line chunk heuristic
  2016-04-19 17:06       ` Jeff King
@ 2016-04-19 23:02         ` Jacob Keller
  2016-04-19 23:07           ` Junio C Hamano
  2016-04-20  6:00         ` Junio C Hamano
  1 sibling, 1 reply; 20+ messages in thread
From: Jacob Keller @ 2016-04-19 23:02 UTC (permalink / raw)
  To: Jeff King; +Cc: Stefan Beller, Junio C Hamano, git, Jacob Keller

On Tue, Apr 19, 2016 at 10:06 AM, Jeff King <peff@peff.net> wrote:
> On Tue, Apr 19, 2016 at 08:17:38AM -0700, Stefan Beller wrote:
>
>> On Mon, Apr 18, 2016 at 10:03 PM, Jeff King <peff@peff.net> wrote:
>>
>> > I guess this will invalidate old patch-ids, but there's not much to be
>> > done about that.
>>
>> What do you mean by that? (What consequences do you imagine?)
>> I think diffs with any kind of heuristic can still be applied, no?
>
> I mean that if you save any old patch-ids from "git patch-id", they
> won't match up when compared with new versions of git. We can probably
> ignore it, though. This isn't the first time that patch-ids might have
> changed, and I think the advice is already that one should not count on
> them to be stable in the long term.
>
> -Peff

Plus they'll be stable within a version of Git, it's only recorded
patch ids that change, which hopefully isn't done very much if at all.

Thanks,
Jake

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] xdiff: implement empty line chunk heuristic
  2016-04-19 23:02         ` Jacob Keller
@ 2016-04-19 23:07           ` Junio C Hamano
  2016-04-20 13:12             ` Michael S. Tsirkin
  0 siblings, 1 reply; 20+ messages in thread
From: Junio C Hamano @ 2016-04-19 23:07 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jacob Keller, Jeff King, Stefan Beller, git, Jacob Keller

Jacob Keller <jacob.keller@gmail.com> writes:

> On Tue, Apr 19, 2016 at 10:06 AM, Jeff King <peff@peff.net> wrote:
>> On Tue, Apr 19, 2016 at 08:17:38AM -0700, Stefan Beller wrote:
>>
>>> On Mon, Apr 18, 2016 at 10:03 PM, Jeff King <peff@peff.net> wrote:
>>>
>>> > I guess this will invalidate old patch-ids, but there's not much to be
>>> > done about that.
>>>
>>> What do you mean by that? (What consequences do you imagine?)
>>> I think diffs with any kind of heuristic can still be applied, no?
>>
>> I mean that if you save any old patch-ids from "git patch-id", they
>> won't match up when compared with new versions of git. We can probably
>> ignore it, though. This isn't the first time that patch-ids might have
>> changed, and I think the advice is already that one should not count on
>> them to be stable in the long term.
>>
>> -Peff
>
> Plus they'll be stable within a version of Git, it's only recorded
> patch ids that change, which hopefully isn't done very much if at all.
>
> Thanks,
> Jake

Some people, like those who did things like 30e12b92 (patch-id: make
it stable against hunk reordering, 2014-04-27), _may_ care.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] xdiff: implement empty line chunk heuristic
  2016-04-19 17:06       ` Jeff King
  2016-04-19 23:02         ` Jacob Keller
@ 2016-04-20  6:00         ` Junio C Hamano
  1 sibling, 0 replies; 20+ messages in thread
From: Junio C Hamano @ 2016-04-20  6:00 UTC (permalink / raw)
  To: Jeff King; +Cc: Stefan Beller, git, Jacob Keller, Jacob Keller

Jeff King <peff@peff.net> writes:

> I mean that if you save any old patch-ids from "git patch-id", they
> won't match up when compared with new versions of git. We can probably
> ignore it, though. This isn't the first time that patch-ids might have
> changed, and I think the advice is already that one should not count on
> them to be stable in the long term.

Another thing that this *will* break is the patch signature upload
protocol k.org uses to allow Linus, Greg, et al. on the road with
limited hotel wifi bandwidth to prepare patch-X-test1.gz and
patch-X-test1.sign file.  They can locally tag X-test1, prepare
"git diff X X-test1 | gzip -n >patch-X-test1.gz" and sign the
result, and upload _only_ the detached signature after pushing.

They can tell k.org, when uploading the detached signature, to
recreate the patchfile by running the same "git diff" to save the
bandwidth of sending the same thing twice (as they have to "push"
anyway, having to send the generated patch is a pure overhead).

Having said all that, kup(1) users are already warned that the
textual diff produced by "git diff-tree -p" (which is mentioned in
the documentation of the tool) varies across versions of Git and
the above "optimization" would not work unless both ends have the
same version of Git, so it may not be too big an issue for them.
They have already been burned once when we corrected "git archive"
output in the past (they obviously have the same optimization to
sign tarballs, and the kup(1) mechanism relies to have byte-for-byte
identical output).

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] xdiff: implement empty line chunk heuristic
  2016-04-19 23:07           ` Junio C Hamano
@ 2016-04-20 13:12             ` Michael S. Tsirkin
  2016-04-20 16:09               ` Junio C Hamano
  0 siblings, 1 reply; 20+ messages in thread
From: Michael S. Tsirkin @ 2016-04-20 13:12 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jacob Keller, Jeff King, Stefan Beller, git, Jacob Keller

On Tue, Apr 19, 2016 at 04:07:35PM -0700, Junio C Hamano wrote:
> Jacob Keller <jacob.keller@gmail.com> writes:
> 
> > On Tue, Apr 19, 2016 at 10:06 AM, Jeff King <peff@peff.net> wrote:
> >> On Tue, Apr 19, 2016 at 08:17:38AM -0700, Stefan Beller wrote:
> >>
> >>> On Mon, Apr 18, 2016 at 10:03 PM, Jeff King <peff@peff.net> wrote:
> >>>
> >>> > I guess this will invalidate old patch-ids, but there's not much to be
> >>> > done about that.
> >>>
> >>> What do you mean by that? (What consequences do you imagine?)
> >>> I think diffs with any kind of heuristic can still be applied, no?
> >>
> >> I mean that if you save any old patch-ids from "git patch-id", they
> >> won't match up when compared with new versions of git. We can probably
> >> ignore it, though. This isn't the first time that patch-ids might have
> >> changed, and I think the advice is already that one should not count on
> >> them to be stable in the long term.
> >>
> >> -Peff
> >
> > Plus they'll be stable within a version of Git, it's only recorded
> > patch ids that change, which hopefully isn't done very much if at all.
> >
> > Thanks,
> > Jake
> 
> Some people, like those who did things like 30e12b92 (patch-id: make
> it stable against hunk reordering, 2014-04-27), _may_ care.
> 

FWIW IIRC what that commit is about is ability to reorder the chunks in
a patch without changing patch-id. Not about keeping id stable across
git revisions.

-- 
MST

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] xdiff: implement empty line chunk heuristic
  2016-04-20 13:12             ` Michael S. Tsirkin
@ 2016-04-20 16:09               ` Junio C Hamano
  2016-04-20 16:17                 ` Jeff King
  0 siblings, 1 reply; 20+ messages in thread
From: Junio C Hamano @ 2016-04-20 16:09 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jacob Keller, Jeff King, Stefan Beller, git, Jacob Keller

"Michael S. Tsirkin" <mst@redhat.com> writes:

> FWIW IIRC what that commit is about is ability to reorder the chunks in
> a patch without changing patch-id. Not about keeping id stable across
> git revisions.

OK, but "reorder the chunks" is not meant to stay to be the _ONLY_
purpose for an option whose name is a broad "--[un]stable", but
merely one (and only) possible cause of patch-id instability that
happened to be noticed as an issue back then and was dealt with that
commit, no?  In other words, the intent of the "--stable" feature is
to give a stable ID that is not affected by random end-user settings
(e.g. diff.orderfile) and if somebody invents a new configurable knob
in the future, they are supposed to pay attention to the "--stable"
feature or existing users who do use "--stable" will be broken, no?

I can still buy "--stable is not about stability across versions of
Git"--it makes our job easier ;-)  I just want to make sure that
"--stable is about stability inside a single version of Git that
patch ID for the same commit will stay the same and unaffected by
random end-user configuration knobs".

Which in turn would mean that we won't have to worry about this
option in patch-id as long as we remove the diff.compactionheuristic
configuration and command line option once the developers are done
experimenting with their heuristics code.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] xdiff: implement empty line chunk heuristic
  2016-04-20 16:09               ` Junio C Hamano
@ 2016-04-20 16:17                 ` Jeff King
  0 siblings, 0 replies; 20+ messages in thread
From: Jeff King @ 2016-04-20 16:17 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Michael S. Tsirkin, Jacob Keller, Stefan Beller, git, Jacob Keller

On Wed, Apr 20, 2016 at 09:09:53AM -0700, Junio C Hamano wrote:

> "Michael S. Tsirkin" <mst@redhat.com> writes:
> 
> > FWIW IIRC what that commit is about is ability to reorder the chunks in
> > a patch without changing patch-id. Not about keeping id stable across
> > git revisions.
> 
> OK, but "reorder the chunks" is not meant to stay to be the _ONLY_
> purpose for an option whose name is a broad "--[un]stable", but
> merely one (and only) possible cause of patch-id instability that
> happened to be noticed as an issue back then and was dealt with that
> commit, no?  In other words, the intent of the "--stable" feature is
> to give a stable ID that is not affected by random end-user settings
> (e.g. diff.orderfile) and if somebody invents a new configurable knob
> in the future, they are supposed to pay attention to the "--stable"
> feature or existing users who do use "--stable" will be broken, no?

I forgot that we added "--stable". Evne if it is not meant to be about
stability across versions, is there any reason _not_ to turn off
this heuristic for --stable (or for patch-ids in general)?

I guess maybe that creates some inconsistency between generating a
patch-id directly, and making one from a diff given on stdin (though I
don't know that we can promise much about the latter in the general
case; we can fix file ordering, but we don't have enough information to
tweak other aspects).

-Peff

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2016-04-20 16:17 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-04-18 21:12 [PATCH 0/2 v4] xdiff: implement empty line chunk heuristic Stefan Beller
2016-04-18 21:12 ` [PATCH 1/2] xdiff: add recs_match helper function Stefan Beller
2016-04-18 21:12 ` [PATCH 2/2] xdiff: implement empty line chunk heuristic Stefan Beller
2016-04-18 22:04   ` Jacob Keller
2016-04-18 22:24     ` Junio C Hamano
2016-04-19  5:03   ` Jeff King
2016-04-19  6:47     ` Stefan Beller
2016-04-19  7:00       ` Jeff King
2016-04-19  7:05         ` Stefan Beller
2016-04-19 15:17     ` Stefan Beller
2016-04-19 17:06       ` Jeff King
2016-04-19 23:02         ` Jacob Keller
2016-04-19 23:07           ` Junio C Hamano
2016-04-20 13:12             ` Michael S. Tsirkin
2016-04-20 16:09               ` Junio C Hamano
2016-04-20 16:17                 ` Jeff King
2016-04-20  6:00         ` Junio C Hamano
2016-04-19 16:51     ` Junio C Hamano
2016-04-18 21:22 ` [PATCH 0/2 v4] " Junio C Hamano
2016-04-18 23:53   ` Stefan Beller

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.