* [PATCH] diff compaction heuristic: favor shortest neighboring blank lines @ 2016-06-16 17:46 Stefan Beller 2016-06-16 20:27 ` Junio C Hamano ` (2 more replies) 0 siblings, 3 replies; 15+ messages in thread From: Stefan Beller @ 2016-06-16 17:46 UTC (permalink / raw) To: peff; +Cc: git, jacob.keller, mhagger, Stefan Beller The compaction heuristic for diffs was deemed quite good, but in 5580b27 we have an example demonstrates a common case, that fails with the existing heuristic. That is why we disabled the heuristic in the v2.9.0 release. With this patch a diff looks like: def bar do_bar_stuff() common_ending() end + def bal + do_bal_stuff() + + common_ending() + end + def baz do_baz_stuff() common_ending() end whereas before we had: WIP (TODO: ask peff to provide an example that actually triggers, I seem to be unable to write one, though I thought the above was one) The way we do it, is by inspecting the neighboring lines and see how much indent there is and we choose the line that has the shortest amount of blanks in the neighboring lines. (TODO: update comments in the file) Signed-off-by: Stefan Beller <sbeller@google.com> --- Hi Jeff, hi Michael, thanks for pointing out the flaw in this ruby code base before the 2.9 release. I think this patch would fix your finding, though I cannot reproduce it. Can you provide an example repo/patch that looks ugly on current origin/master, so I can be sure this fixes the issue? This can also be a start of a discussion if this is a too short-sighted enhancement of the heuristic. Essentially we'd want to prefer + end + + def bal over: + do_bal_stuff() + + common_ending() because there is less space between line start and {end, def bal} than for {do_bal_stuff, common_ending}. Thanks, Stefan xdiff/xdiffi.c | 41 +++++++++++++++++++++++++++++++++++++++-- 1 file changed, 39 insertions(+), 2 deletions(-) diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c index b3c6848..58adc74 100644 --- a/xdiff/xdiffi.c +++ b/xdiff/xdiffi.c @@ -405,6 +405,35 @@ static int is_blank_line(xrecord_t **recs, long ix, long flags) return xdl_blankline(recs[ix]->ptr, recs[ix]->size, flags); } +static unsigned int leading_blank(const char *line) +{ + unsigned int ret = 0; + while (*line) { + if (*line == '\t') + ret += 8; + else if (*line == ' ') + ret ++; + else + break; + line++; + } + return ret; +} + +static unsigned int surrounding_leading_blank(xrecord_t **recs, long ix, + long flags, long nrec) +{ + unsigned int i, ret = UINT_MAX; + if (ix > 0) + ret = leading_blank(recs[ix - 1]->ptr); + if (ix < nrec - 1) { + i = leading_blank(recs[ix + 1]->ptr); + if (i < ret) + ret = i; + } + return ret; +} + static int recs_match(xrecord_t **recs, long ixs, long ix, long flags) { return (recs[ixs]->ha == recs[ix]->ha && @@ -416,7 +445,7 @@ static int recs_match(xrecord_t **recs, long ixs, long ix, long flags) int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) { long ix, ixo, ixs, ixref, grpsiz, nrec = xdf->nrec; char *rchg = xdf->rchg, *rchgo = xdfo->rchg; - unsigned int blank_lines; + unsigned int blank_lines, min_bl_neigh_indent; xrecord_t **recs = xdf->recs; /* @@ -451,6 +480,7 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) { do { grpsiz = ix - ixs; blank_lines = 0; + min_bl_neigh_indent = UINT_MAX; /* * If the line before the current change group, is equal to @@ -485,7 +515,13 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) { * the group. */ while (ix < nrec && recs_match(recs, ixs, ix, flags)) { - blank_lines += is_blank_line(recs, ix, flags); + if (is_blank_line(recs, ix, flags)) { + unsigned int bl_neigh_indent = + surrounding_leading_blank(recs, ix, flags, nrec); + if (min_bl_neigh_indent > bl_neigh_indent) + min_bl_neigh_indent = min_bl_neigh_indent; + blank_lines++; + } rchg[ixs++] = 0; rchg[ix++] = 1; @@ -525,6 +561,7 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) { if ((flags & XDF_COMPACTION_HEURISTIC) && blank_lines) { while (ixs > 0 && !is_blank_line(recs, ix - 1, flags) && + surrounding_leading_blank(recs, ix - 1, flags, nrec) > min_bl_neigh_indent && recs_match(recs, ixs - 1, ix - 1, flags)) { rchg[--ixs] = 1; rchg[--ix] = 0; -- 2.9.0.8.gb6cbe66 ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH] diff compaction heuristic: favor shortest neighboring blank lines 2016-06-16 17:46 [PATCH] diff compaction heuristic: favor shortest neighboring blank lines Stefan Beller @ 2016-06-16 20:27 ` Junio C Hamano 2016-06-16 21:06 ` Stefan Beller 2016-06-16 21:10 ` Michael Haggerty 2016-06-17 15:36 ` Jeff King 2 siblings, 1 reply; 15+ messages in thread From: Junio C Hamano @ 2016-06-16 20:27 UTC (permalink / raw) To: Stefan Beller; +Cc: peff, git, jacob.keller, mhagger Stefan Beller <sbeller@google.com> writes: > + def bal > + do_bal_stuff() > + > + common_ending() > + end > + > def baz > do_baz_stuff() > > common_ending() > end > > whereas before we had: > > WIP (TODO: ask peff to provide an example that actually triggers, I seem to be > unable to write one, though I thought the above was one) > > > The way we do it, is by inspecting the neighboring lines and see how > much indent there is and we choose the line that has the shortest amount > of blanks in the neighboring lines. > ... > because there is less space between line start and {end, def bal} > than for {do_bal_stuff, common_ending}. I haven't thought this carefully yet, but would this equally work well for Python, where it does not have the "end" or does the lack of "end" pose a problem? You'll still find "def bal" is a good boundary (but you cannot tell if it is the beginning or the end of a block, unless you understand the language), though. > +static unsigned int leading_blank(const char *line) > +{ > + unsigned int ret = 0; > + while (*line) { > + if (*line == '\t') > + ret += 8; This will be broken with a line with space-before-tab whitespace breakage, I suspect... > + else if (*line == ' ') > + ret ++; > + else > + break; > + line++; > + } > + return ret; > +} > + > +static unsigned int surrounding_leading_blank(xrecord_t **recs, long ix, > + long flags, long nrec) > +{ > + unsigned int i, ret = UINT_MAX; > + if (ix > 0) > + ret = leading_blank(recs[ix - 1]->ptr); > + if (ix < nrec - 1) { > + i = leading_blank(recs[ix + 1]->ptr); > + if (i < ret) > + ret = i; > + } > + return ret; > +} > + > static int recs_match(xrecord_t **recs, long ixs, long ix, long flags) > { > return (recs[ixs]->ha == recs[ix]->ha && > @@ -416,7 +445,7 @@ static int recs_match(xrecord_t **recs, long ixs, long ix, long flags) > int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) { > long ix, ixo, ixs, ixref, grpsiz, nrec = xdf->nrec; > char *rchg = xdf->rchg, *rchgo = xdfo->rchg; > - unsigned int blank_lines; > + unsigned int blank_lines, min_bl_neigh_indent; > xrecord_t **recs = xdf->recs; > > /* > @@ -451,6 +480,7 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) { > do { > grpsiz = ix - ixs; > blank_lines = 0; > + min_bl_neigh_indent = UINT_MAX; > > /* > * If the line before the current change group, is equal to > @@ -485,7 +515,13 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) { > * the group. > */ > while (ix < nrec && recs_match(recs, ixs, ix, flags)) { > - blank_lines += is_blank_line(recs, ix, flags); > + if (is_blank_line(recs, ix, flags)) { > + unsigned int bl_neigh_indent = > + surrounding_leading_blank(recs, ix, flags, nrec); > + if (min_bl_neigh_indent > bl_neigh_indent) > + min_bl_neigh_indent = min_bl_neigh_indent; > + blank_lines++; > + } > > rchg[ixs++] = 0; > rchg[ix++] = 1; > @@ -525,6 +561,7 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) { > if ((flags & XDF_COMPACTION_HEURISTIC) && blank_lines) { > while (ixs > 0 && > !is_blank_line(recs, ix - 1, flags) && > + surrounding_leading_blank(recs, ix - 1, flags, nrec) > min_bl_neigh_indent && > recs_match(recs, ixs - 1, ix - 1, flags)) { > rchg[--ixs] = 1; > rchg[--ix] = 0; ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] diff compaction heuristic: favor shortest neighboring blank lines 2016-06-16 20:27 ` Junio C Hamano @ 2016-06-16 21:06 ` Stefan Beller 0 siblings, 0 replies; 15+ messages in thread From: Stefan Beller @ 2016-06-16 21:06 UTC (permalink / raw) To: Junio C Hamano; +Cc: Jeff King, git, Jacob Keller, Michael Haggerty On Thu, Jun 16, 2016 at 1:27 PM, Junio C Hamano <gitster@pobox.com> wrote: >> ... >> because there is less space between line start and {end, def bal} >> than for {do_bal_stuff, common_ending}. > > I haven't thought this carefully yet, but would this equally work > well for Python, where it does not have the "end" or does the lack > of "end" pose a problem? You'll still find "def bal" is a good > boundary (but you cannot tell if it is the beginning or the end of a > block, unless you understand the language), though. Good point. I found a flaw in my implementation (as it doesn't match my mental model, not necessarily a bad thing) We take the minimum of the two neighbors, i.e. + do_bal_stuff() + + common_ending() is preferrable to + do_bal_stuff() + + common_ending() and in python the example would look like: def foo(): do_foo() common_thing() + def baz(): + do_baz() + + common_thing() + def bar(): do_bar() common_thing() and breaking between common_thing() def bar(): is more favorable than between do_baz() common_thing() because in the first former the count of white space in front of "def bar():" is smaller than for any of "do_baz()" and "common_thing()" > >> +static unsigned int leading_blank(const char *line) >> +{ >> + unsigned int ret = 0; >> + while (*line) { >> + if (*line == '\t') >> + ret += 8; > > This will be broken with a line with space-before-tab whitespace > breakage, I suspect... How so? We inspect each character on its own and then move on later by line++. (I am not seeing how this could cause trouble, so please help me?) Going back to python, this may become a problem when you have a code like: def baz(): do_baz() common_thing() def bar(): + do_bal() + + common_thing() + +def bar(): + do_bar() common_thing() but this was fabricated with a typo (the first definition of bar should have been bal), (Also it doesn't worsen the diff, as it is same without the heuristic) once that typo is fixed we get: (both with and without the heuristic) do_foo() common_thing() def baz(): do_baz() common_thing() +def bal(): + + do_bal() + + common_thing() + def bar(): do_bar() common_thing() Clearly it can also be intentional to have 2 methods with the same code for historical reasons, (even without the blank line after the function definition this produces the same result) When playing around with various diffs I could not find a thing that this patch makes worse, it only fixes the actual issue. (I realized Peff actually attached a script to produce a bad diff, which is gone with this patch) Thanks, Stefan ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] diff compaction heuristic: favor shortest neighboring blank lines 2016-06-16 17:46 [PATCH] diff compaction heuristic: favor shortest neighboring blank lines Stefan Beller 2016-06-16 20:27 ` Junio C Hamano @ 2016-06-16 21:10 ` Michael Haggerty 2016-06-16 21:36 ` Stefan Beller 2016-06-17 15:36 ` Jeff King 2 siblings, 1 reply; 15+ messages in thread From: Michael Haggerty @ 2016-06-16 21:10 UTC (permalink / raw) To: Stefan Beller, peff; +Cc: git, jacob.keller On 06/16/2016 07:46 PM, Stefan Beller wrote: > The compaction heuristic for diffs was deemed quite good, but in 5580b27 > we have an example demonstrates a common case, that fails with the existing > heuristic. That is why we disabled the heuristic in the v2.9.0 release. > > With this patch a diff looks like: > > def bar > do_bar_stuff() > > common_ending() > end > > + def bal > + do_bal_stuff() > + > + common_ending() > + end > + > def baz > do_baz_stuff() > > common_ending() > end > > whereas before we had: > > WIP (TODO: ask peff to provide an example that actually triggers, I seem to be > unable to write one, though I thought the above was one) > > > The way we do it, is by inspecting the neighboring lines and see how > much indent there is and we choose the line that has the shortest amount > of blanks in the neighboring lines. > > (TODO: update comments in the file) > > Signed-off-by: Stefan Beller <sbeller@google.com> > --- > > Hi Jeff, hi Michael, > > thanks for pointing out the flaw in this ruby code base before the 2.9 release. > I think this patch would fix your finding, though I cannot reproduce it. > Can you provide an example repo/patch that looks ugly on current origin/master, > so I can be sure this fixes the issue? > > This can also be a start of a discussion if this is a too short-sighted enhancement > of the heuristic. Essentially we'd want to prefer > > + end > + > + def bal > > over: > > + do_bal_stuff() > + > + common_ending() > > because there is less space between line start and {end, def bal} > than for {do_bal_stuff, common_ending}. It's really cool that you are trying to implement the indentation heuristic. I'm curious how it works in practice (something we can probably only determine by applying it to a corpus of diffs to see how often it outperforms/underperforms the other possible approaches). The heuristic I proposed was * Prefer to start and end hunks *following* lines with the least indentation. * Define the "indentation" of a blank line to be the indentation of the previous non-blank line minus epsilon. * In the case of a tie, prefer to slide the hunk down as far as possible. If that's what you are trying to implement, then notice that the first rule says that the hunk should start *following* the line with the least indentation. So the "score" for > + end > + > + def bal would be the indentation of the line preceding "end", which is larger than the indentation of "end". And the score for > + do_bal_stuff() > + > + common_ending() would come from the line preceding "do_bal_stuff()", namely "def bal()", which is indented the same as "end". So the latter would be preferred. But actually, I don't understand how your example is meaningful. I think this heuristic is only used to slide hunks around; i.e., when the line following the hunk is the same as the first lines of the hunk, or when the line preceding the hunk is the same as the last line of the hunk. Right? So your snippets would never compete against each other. Let's take the full example. The five possible placements for the `+` characters are marked with the digits 1 through 5 here: > def bar > do_bar_stuff() > 1 > 12 common_ending() > 123 end > 1234 > 12345 def bal > 12345 do_bal_stuff() > 2345 > 345 common_ending() > 45 end > 5 > def baz > do_baz_stuff() > > common_ending() > end Of the lines preceding the candidate hunks, the blank line preceding the hunk marked "5" has the smallest indent, namely "epsilon" less than the indentation of the "end" line preceding it. So placement "5" would be chosen. I don't know enough about this area of the code to review your patch in detail, but I did note below a typo that jumped out at me. Michael > xdiff/xdiffi.c | 41 +++++++++++++++++++++++++++++++++++++++-- > 1 file changed, 39 insertions(+), 2 deletions(-) > > diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c > index b3c6848..58adc74 100644 > --- a/xdiff/xdiffi.c > +++ b/xdiff/xdiffi.c > [...] > @@ -485,7 +515,13 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) { > * the group. > */ > while (ix < nrec && recs_match(recs, ixs, ix, flags)) { > - blank_lines += is_blank_line(recs, ix, flags); > + if (is_blank_line(recs, ix, flags)) { > + unsigned int bl_neigh_indent = > + surrounding_leading_blank(recs, ix, flags, nrec); > + if (min_bl_neigh_indent > bl_neigh_indent) > + min_bl_neigh_indent = min_bl_neigh_indent; The line above has no effect (same variable on both sides of the =). > + blank_lines++; > + } > > rchg[ixs++] = 0; > rchg[ix++] = 1; > @@ -525,6 +561,7 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) { > if ((flags & XDF_COMPACTION_HEURISTIC) && blank_lines) { > while (ixs > 0 && > !is_blank_line(recs, ix - 1, flags) && > + surrounding_leading_blank(recs, ix - 1, flags, nrec) > min_bl_neigh_indent && > recs_match(recs, ixs - 1, ix - 1, flags)) { > rchg[--ixs] = 1; > rchg[--ix] = 0; > ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] diff compaction heuristic: favor shortest neighboring blank lines 2016-06-16 21:10 ` Michael Haggerty @ 2016-06-16 21:36 ` Stefan Beller 0 siblings, 0 replies; 15+ messages in thread From: Stefan Beller @ 2016-06-16 21:36 UTC (permalink / raw) To: Michael Haggerty; +Cc: Jeff King, git, Jacob Keller > > The heuristic I proposed was > > * Prefer to start and end hunks *following* lines with the least > indentation. > > * Define the "indentation" of a blank line to be the indentation of > the previous non-blank line minus epsilon. > > * In the case of a tie, prefer to slide the hunk down as far as > possible. > > If that's what you are trying to implement, then notice that the first > rule says that the hunk should start *following* the line with the least > indentation. So the "score" for I did not try to implement that heuristic. > * Prefer to start and end hunks *following* lines with the least > indentation. We had this alone as a heuristic and it turned out badly. (More false positives than the current "empty line only" thing) It may however be a good starting block when combined with the other conditions, as blank lines are assigned a longer virtual length than what they have. > * In the case of a tie, prefer to slide the hunk down as far as > possible. > This is what the current implementation does as well. (i.e. if there is more than one blank line, we have a tie, so choose the last one as the end of the hunk) So what I am proposing in this patch: * Prefer to start and end hunks *following* lines with an empty line. (as implemented currently) * In the case of a tie, make use of a second tier heuristic. * Define the "length" of a blank line to be the minimum of the amount of white space in the line before and after * Choose that empty line (first tier) that has the shortest length (second tier) * In the case of a tie, prefer to slide the hunk down as far as possible. I would be really interested how these two heuristics compare in practice. > >> + end >> + >> + def bal > > would be the indentation of the line preceding "end", which is larger > than the indentation of "end". And the score for > >> + do_bal_stuff() >> + >> + common_ending() > > would come from the line preceding "do_bal_stuff()", namely "def bal()", > which is indented the same as "end". So the latter would be preferred. > > But actually, I don't understand how your example is meaningful. I think > this heuristic is only used to slide hunks around; i.e., when the line > following the hunk is the same as the first lines of the hunk, or when > the line preceding the hunk is the same as the last line of the hunk. Right. > Right? So your snippets would never compete against each other. right, I was using that to show why some parts are more favorable to put a break in (i.e. the transisiton from "+ lines" to surrounding lines. I was not clear enough there. :( > Let's > take the full example. The five possible placements for the `+` > characters are marked with the digits 1 through 5 here: > >> def bar >> do_bar_stuff() >> 1 >> 12 common_ending() >> 123 end >> 1234 >> 12345 def bal >> 12345 do_bal_stuff() >> 2345 >> 345 common_ending() >> 45 end >> 5 >> def baz >> do_baz_stuff() >> >> common_ending() >> end > > Of the lines preceding the candidate hunks, the blank line preceding the > hunk marked "5" has the smallest indent, namely "epsilon" less than the > indentation of the "end" line preceding it. So placement "5" would be > chosen. I agree 5 is the best here, but for other reasons. Imagine /s/end/long final boilerplate code/, and it may be more clear. What I am trying to say is, that you "indentation" of the blank line is not reaching until the end of the "end" line, but rather to the front of "end". (If you don't care of white spaces in the git history and have an editor, that makes fancy white space stuff, i.e. your text editor cursor is in the "end" line and you press "enter" to get to the next line, it is most likely indented to strlen ("<end line>") - strlen("end"), as that is how much indentation we had in the preceding line? > > I don't know enough about this area of the code to review your patch in > detail, but I did note below a typo that jumped out at me. I did not know the code either before writing the patch. :) >> + if (min_bl_neigh_indent > bl_neigh_indent) >> + min_bl_neigh_indent = min_bl_neigh_indent; > > The line above has no effect (same variable on both sides of the =). Thanks! I had the feeling something was off judging from behavior, but could not tell what exactly. This is it. It should be min_bl_neigh_indent = bl_neigh_indent; i.e. the minimum of the existing and newly computed surrounding line starting blanks. Junio writes: > Now, on what column does 'd' sit on the example below? > > def foo > > I typed SP SP HT 'd' 'e' 'f'; it still is indented by 8 places. I see, so for the TAB case we would need to do a ret += (8 - ret%8); Thanks! Stefan ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] diff compaction heuristic: favor shortest neighboring blank lines 2016-06-16 17:46 [PATCH] diff compaction heuristic: favor shortest neighboring blank lines Stefan Beller 2016-06-16 20:27 ` Junio C Hamano 2016-06-16 21:10 ` Michael Haggerty @ 2016-06-17 15:36 ` Jeff King 2016-06-17 16:09 ` Stefan Beller 2 siblings, 1 reply; 15+ messages in thread From: Jeff King @ 2016-06-17 15:36 UTC (permalink / raw) To: Stefan Beller; +Cc: git, jacob.keller, mhagger On Thu, Jun 16, 2016 at 10:46:20AM -0700, Stefan Beller wrote: > The compaction heuristic for diffs was deemed quite good, but in 5580b27 > we have an example demonstrates a common case, that fails with the existing > heuristic. That is why we disabled the heuristic in the v2.9.0 release. > > With this patch a diff looks like: > > def bar > do_bar_stuff() > > common_ending() > end > > + def bal > + do_bal_stuff() > + > + common_ending() > + end > + > def baz > do_baz_stuff() > > common_ending() > end > > whereas before we had: > > WIP (TODO: ask peff to provide an example that actually triggers, I seem to be > unable to write one, though I thought the above was one) Did you want something that triggers the "bad" case with the existing compaction heuristic? I gave one in: http://article.gmane.org/gmane.comp.version-control.git/296947 I think the difference is that in my example, the diff (before compaction) has the blank line at the top (because we are adding a new entry at the bottom), whereas in yours, the blank line is at the bottom. This patch does make my "bad" case look better. Unfortunately, it re-breaks another case: cat >file.rb <<\EOF ## # This is the foo function. def foo foo stuff end ## # This is the bar function. def bar bar stuff end ## # This is the baz function. def baz baz stuff end EOF git init git add file.rb sed -i 7,12d file.rb git diff That goes back to the original, pre-compaction-heuristic diff: diff --git a/file.rb b/file.rb index 0b31fb6..ee25d63 100644 --- a/file.rb +++ b/file.rb @@ -5,12 +5,6 @@ def foo end ## -# This is the bar function. -def bar - bar stuff -end - -## # This is the baz function. def baz baz stuff -Peff ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] diff compaction heuristic: favor shortest neighboring blank lines 2016-06-17 15:36 ` Jeff King @ 2016-06-17 16:09 ` Stefan Beller 2016-06-23 17:10 ` Michael Haggerty 0 siblings, 1 reply; 15+ messages in thread From: Stefan Beller @ 2016-06-17 16:09 UTC (permalink / raw) To: Jeff King; +Cc: git, Jacob Keller, Michael Haggerty On Fri, Jun 17, 2016 at 8:36 AM, Jeff King <peff@peff.net> wrote: > > Did you want something that triggers the "bad" case with the existing > compaction heuristic? > > I gave one in: > > http://article.gmane.org/gmane.comp.version-control.git/296947 I found that out after sending the initial patch. Thanks for pointing out! > > I think the difference is that in my example, the diff (before > compaction) has the blank line at the top (because we are adding a new > entry at the bottom), whereas in yours, the blank line is at the bottom. > > This patch does make my "bad" case look better. Unfortunately, it > re-breaks another case: I think before spending more time on discussing and implementing new (hopefully better) heuristics, I'd want to step back and try to be a bit more systematic, i.e. I'll want to collect lots of test cases in different languages and use cases. A mini test suite, maybe? I think I'll include these examples in there, too. Thanks, Stefan ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] diff compaction heuristic: favor shortest neighboring blank lines 2016-06-17 16:09 ` Stefan Beller @ 2016-06-23 17:10 ` Michael Haggerty 2016-06-23 17:25 ` Stefan Beller ` (2 more replies) 0 siblings, 3 replies; 15+ messages in thread From: Michael Haggerty @ 2016-06-23 17:10 UTC (permalink / raw) To: Stefan Beller, Jeff King; +Cc: git, Jacob Keller On 06/17/2016 06:09 PM, Stefan Beller wrote: > I think before spending more time on discussing and implementing new > (hopefully better) heuristics, I'd want to step back and try to be a bit more > systematic, i.e. I'll want to collect lots of test cases in different languages > and use cases. A mini test suite, maybe? Stefan, I've also been playing around with diff heuristics and also trying to find some good test data. Have you put together some test cases yet? In case you are interested, the current iteration of my heuristic considers six things to determine a "score" for breaking a diff above a particular line: * `blank` -- is the line blank? * `indent` -- the indentation of the line (if it is not blank) * `pre_blank` -- is the preceding line blank? * `pre_indent` -- the indentation of the closest preceding non-blank line * `post_blank` -- is the following line blank? * `post_indent` -- the indentation of the closest following non-blank line The meat of the scoring algorithm is below. I compute the score for the beginning and the end of the insert/delete block for each candidate shift of an insertion/deletion hunk, then choose the one that has the lowest score. There are some adjustable "bonus" values below. The values shown give quite good results for C, shell, Python, XML, Ruby, CSS, YAML, and HTML in the couple of projects that I've tried it on. But I haven't done enough systematic testing over a large enough corpus to know whether the parameters could be improved or whether there are types of code that it performs really terribly on. Right now my program doesn't output diffs in the usual format, but rather as a diff notated with various information. So it's not in a form that other people could use so easily. Here's an example: > vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv > * > * Default behaviour is to initialize the notes tree from the tree object > * specified by the given (or default) notes ref. > */ > #define NOTES_INIT_EMPTY 1 > | -4 > | -6 + /* > || 28 + + * By default, the notes tree is only readable, and the notes ref can be > || + + * any treeish. The notes tree can however be made writable with this flag, > || + + * in which case only strict ref names can be used. > || + + */ > || + +#define NOTES_INIT_WRITABLE 2 > | + + > | +/* > * Initialize the given notes_tree with the notes tree structure at the given > * ref. If given ref is NULL, the value of the $GIT_NOTES_REF environment > * variable is used, and if that is missing, the default notes ref is used > * ("refs/notes/commits"). > * > * If you need to re-initialize a notes_tree structure (e.g. when switching from > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The two columns of vertical bars show the highest and lowest range of lines that this insertion block can be shifted to. The numbers show the score for breaking the diff above that line. The first column of +/- is where this version of the heuristic would place the diff, and the second column is where the master version of Git would place it. My program can also search for and output "slideable" insertion/deletion blocks, so it can be used to find diffs that are theoretically capable of being optimized, even if they don't happen to be handled differently by two different algorithms. Let me know if you want to collaborate on this somehow. Michael Scoring heuristic: > # A place to accumulate bonus factors (positive makes this > # index more favored): > bonus = 0 > > # Bonuses based on the location of blank lines: > if pre_blank and not blank: > bonus += 3 > elif blank and not pre_blank: > bonus += 2 > elif blank and pre_blank: > bonus += 1 > > # Now fill in missing indent values: > if pre_indent is None: > pre_indent = 0 > > if post_indent is None: > post_indent = 0 > > if blank: > indent = post_indent > > if indent > pre_indent: > # The line is indented more than its predecessor. It is > # preferable to keep these lines together, so we score it > # based on the larger indent: > score = indent > bonus -= 4 > > elif indent < pre_indent: > # The line is indented less than its predecessor. It could > # be that this line is the start of a new block (e.g., of > # an "else" block, or of a block without a block > # terminator) or it could be the end of the previous > # block. > if indent < post_indent: > # The following line is indented more. So it is likely > # that this line is the start of a block. It's a > # pretty good place to split, so score it based on the > # smaller indent: > score = indent > bonus += 2 > else: > # This was probably the end of a block. We score based > # on the line's own indent: > score = indent > > else: > # The line has the same indentation level as its > # predecessor. We score it based on its own indent: > score = indent > > return 10 * score - bonus ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] diff compaction heuristic: favor shortest neighboring blank lines 2016-06-23 17:10 ` Michael Haggerty @ 2016-06-23 17:25 ` Stefan Beller 2016-06-23 17:37 ` Junio C Hamano 2016-06-30 13:54 ` Michael Haggerty 2 siblings, 0 replies; 15+ messages in thread From: Stefan Beller @ 2016-06-23 17:25 UTC (permalink / raw) To: Michael Haggerty; +Cc: Jeff King, git, Jacob Keller On Thu, Jun 23, 2016 at 10:10 AM, Michael Haggerty <mhagger@alum.mit.edu> wrote: > On 06/17/2016 06:09 PM, Stefan Beller wrote: >> I think before spending more time on discussing and implementing new >> (hopefully better) heuristics, I'd want to step back and try to be a bit more >> systematic, i.e. I'll want to collect lots of test cases in different languages >> and use cases. A mini test suite, maybe? > > Stefan, > > I've also been playing around with diff heuristics and also trying to > find some good test data. Have you put together some test cases yet? I started with that and had 2 or 3 test cases written out, but the diffs of diffs confused me and then I got distracted, so I am stalling on putting together a mini test suite. You'll find my attempt at [1] as an integration of our test suite. Maybe we don't need to have these tests in tree, but once we're done developing the heuristic, we can put the test suite in the commit message of that commit? (Assuming the tests are short and as for a heuristic there is no right and wrong, but only a better or worse) [1] https://github.com/stefanbeller/git/tree/diff_heuristic_improvements > > In case you are interested, the current iteration of my heuristic > considers six things to determine a "score" for breaking a diff above a > particular line: > > * `blank` -- is the line blank? > * `indent` -- the indentation of the line (if it is not blank) > * `pre_blank` -- is the preceding line blank? > * `pre_indent` -- the indentation of the closest preceding non-blank > line > * `post_blank` -- is the following line blank? > * `post_indent` -- the indentation of the closest following non-blank > line > > The meat of the scoring algorithm is below. I compute the score for the > beginning and the end of the insert/delete block for each candidate > shift of an insertion/deletion hunk, then choose the one that has the > lowest score. > > There are some adjustable "bonus" values below. The values shown give > quite good results for C, shell, Python, XML, Ruby, CSS, YAML, and HTML > in the couple of projects that I've tried it on. But I haven't done > enough systematic testing over a large enough corpus to know whether the > parameters could be improved or whether there are types of code that it > performs really terribly on. I think the corpus doesn't have to be large, but rather diverse. The current heuristic was tested by Jacob and Jeff over the history of the linux kernel, (and some other C projects), which I consider a large corpus -- But it's all well formed C, i.e a well known coding style. [2] [2] don't read this too seriously: http://codegolf.stackexchange.com/questions/42510/reverse-indentation > > Right now my program doesn't output diffs in the usual format, but > rather as a diff notated with various information. So it's not in a form > that other people could use so easily. Here's an example: > >> vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv >> * >> * Default behaviour is to initialize the notes tree from the tree object >> * specified by the given (or default) notes ref. >> */ >> #define NOTES_INIT_EMPTY 1 >> | -4 >> | -6 + /* >> || 28 + + * By default, the notes tree is only readable, and the notes ref can be >> || + + * any treeish. The notes tree can however be made writable with this flag, >> || + + * in which case only strict ref names can be used. >> || + + */ >> || + +#define NOTES_INIT_WRITABLE 2 >> | + + >> | +/* >> * Initialize the given notes_tree with the notes tree structure at the given >> * ref. If given ref is NULL, the value of the $GIT_NOTES_REF environment >> * variable is used, and if that is missing, the default notes ref is used >> * ("refs/notes/commits"). >> * >> * If you need to re-initialize a notes_tree structure (e.g. when switching from >> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > The two columns of vertical bars show the highest and lowest range of > lines that this insertion block can be shifted to. The numbers show the > score for breaking the diff above that line. The first column of +/- is > where this version of the heuristic would place the diff, and the second > column is where the master version of Git would place it. This is very valuable for understanding and testing different patterns! Great work! > > My program can also search for and output "slideable" insertion/deletion > blocks, so it can be used to find diffs that are theoretically capable > of being optimized, even if they don't happen to be handled differently > by two different algorithms. Oh right, that is what we can use to find good examples to test on. I wondered if we could "machine learn" the coefficients for these 6 parameters that you outline above to bring the best matches in most of the time. > > Let me know if you want to collaborate on this somehow. I am interested, but I dropped the ball last week. I'll pick it up again. :) Stefan > > Michael > > Scoring heuristic: > >> # A place to accumulate bonus factors (positive makes this >> # index more favored): >> bonus = 0 >> >> # Bonuses based on the location of blank lines: >> if pre_blank and not blank: >> bonus += 3 >> elif blank and not pre_blank: >> bonus += 2 >> elif blank and pre_blank: >> bonus += 1 >> >> # Now fill in missing indent values: >> if pre_indent is None: >> pre_indent = 0 >> >> if post_indent is None: >> post_indent = 0 >> >> if blank: >> indent = post_indent >> >> if indent > pre_indent: >> # The line is indented more than its predecessor. It is >> # preferable to keep these lines together, so we score it >> # based on the larger indent: >> score = indent >> bonus -= 4 >> >> elif indent < pre_indent: >> # The line is indented less than its predecessor. It could >> # be that this line is the start of a new block (e.g., of >> # an "else" block, or of a block without a block >> # terminator) or it could be the end of the previous >> # block. >> if indent < post_indent: >> # The following line is indented more. So it is likely >> # that this line is the start of a block. It's a >> # pretty good place to split, so score it based on the >> # smaller indent: >> score = indent >> bonus += 2 >> else: >> # This was probably the end of a block. We score based >> # on the line's own indent: >> score = indent >> >> else: >> # The line has the same indentation level as its >> # predecessor. We score it based on its own indent: >> score = indent >> >> return 10 * score - bonus > ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] diff compaction heuristic: favor shortest neighboring blank lines 2016-06-23 17:10 ` Michael Haggerty 2016-06-23 17:25 ` Stefan Beller @ 2016-06-23 17:37 ` Junio C Hamano 2016-06-23 20:13 ` Michael Haggerty 2016-06-30 13:54 ` Michael Haggerty 2 siblings, 1 reply; 15+ messages in thread From: Junio C Hamano @ 2016-06-23 17:37 UTC (permalink / raw) To: Michael Haggerty; +Cc: Stefan Beller, Jeff King, git, Jacob Keller Michael Haggerty <mhagger@alum.mit.edu> writes: > Scoring heuristic: > >> # A place to accumulate bonus factors (positive makes this >> # index more favored): >> bonus = 0 >> >> # Bonuses based on the location of blank lines: >> if pre_blank and not blank: >> bonus += 3 >> elif blank and not pre_blank: >> bonus += 2 >> elif blank and pre_blank: >> bonus += 1 >> >> # Now fill in missing indent values: >> if pre_indent is None: >> pre_indent = 0 >> >> if post_indent is None: >> post_indent = 0 These "missing" are to handle the lines at the boundary of shiftable block, or do you have pre_indent if a line is the earliest possible one that could be included in the hunk by shifting, as long as it is not at the beginning of the file? I think "this is indented deeper than the previous, and it wants to stay with the previous if possible" is an excellent heuristic, which you have below. I wonder if "however, the previous cannot be included in the hunk by shifting, hence we try to keep this together with it by favouring its exclusion from the hunk" could be part of the same heuristics, if you do have pre_indent value for a line at the boundary of shiftable block. Also I am wondering if 0 is a sensible thing to use as the fallback value. When pre_indent is not definable, turning it to 0 and declaring that our indented line is indented deeper than the previous line and penalizing with "bonus -= 4" (which you do) feels just as fishy as turning the undefined pre_indent to maxint and declaring that our indented line is indented shallower than the previous line (which you don't, but is just as justifiable, I would think). Thanks. >> if blank: >> indent = post_indent >> >> if indent > pre_indent: >> # The line is indented more than its predecessor. It is >> # preferable to keep these lines together, so we score it >> # based on the larger indent: >> score = indent >> bonus -= 4 >> >> elif indent < pre_indent: >> # The line is indented less than its predecessor. It could >> # be that this line is the start of a new block (e.g., of >> # an "else" block, or of a block without a block >> # terminator) or it could be the end of the previous >> # block. >> if indent < post_indent: >> # The following line is indented more. So it is likely >> # that this line is the start of a block. It's a >> # pretty good place to split, so score it based on the >> # smaller indent: >> score = indent >> bonus += 2 >> else: >> # This was probably the end of a block. We score based >> # on the line's own indent: >> score = indent >> >> else: >> # The line has the same indentation level as its >> # predecessor. We score it based on its own indent: >> score = indent >> >> return 10 * score - bonus ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] diff compaction heuristic: favor shortest neighboring blank lines 2016-06-23 17:37 ` Junio C Hamano @ 2016-06-23 20:13 ` Michael Haggerty 0 siblings, 0 replies; 15+ messages in thread From: Michael Haggerty @ 2016-06-23 20:13 UTC (permalink / raw) To: Junio C Hamano; +Cc: Stefan Beller, Jeff King, git, Jacob Keller On 06/23/2016 07:37 PM, Junio C Hamano wrote: > Michael Haggerty <mhagger@alum.mit.edu> writes: > >> Scoring heuristic: >> >>> # A place to accumulate bonus factors (positive makes this >>> # index more favored): >>> bonus = 0 >>> >>> # Bonuses based on the location of blank lines: >>> if pre_blank and not blank: >>> bonus += 3 >>> elif blank and not pre_blank: >>> bonus += 2 >>> elif blank and pre_blank: >>> bonus += 1 >>> >>> # Now fill in missing indent values: >>> if pre_indent is None: >>> pre_indent = 0 >>> >>> if post_indent is None: >>> post_indent = 0 > > These "missing" are to handle the lines at the boundary of shiftable > block, or do you have pre_indent if a line is the earliest possible > one that could be included in the hunk by shifting, as long as it is > not at the beginning of the file? The missing values are meant to handle the case where there are only blank lines between the add/delete block and the beginning/end of the file. Though currently (due to a limitation of the prototype) I only pass context up to the next add/delete line, so that fallback also kicks in if there are only blank lines between this add/delete block and the next added/deleted line. When that is fixed, I think it might help to give beginning/end of file a little bit more bonus. This isn't all implemented yet, but I think the best score for an add/delete block positioning would be the sum of three values. Let's take the following "complete file" diff as an example, and consider an add block with the following positioning: > diff --git a/file.txt b/file.txt > index 9405325..6a6b95c 100644 > --- a/file.txt > +++ b/file.txt > @@ -1,5 +1,7 @@ > a > b > c > +x > +c > d > e The score for splitting the old version of the file between "bc" and "de", plus the score for splitting the new version of the file between "bc" and "xc", plus the score for splitting the new version of the file between "xc" and "de". > I think "this is indented deeper than the previous, and it wants to > stay with the previous if possible" is an excellent heuristic, which > you have below. I wonder if "however, the previous cannot be > included in the hunk by shifting, hence we try to keep this together > with it by favouring its exclusion from the hunk" could be part of > the same heuristics, if you do have pre_indent value for a line at > the boundary of shiftable block. That's automatic, isn't it? If splitting between two lines is assigned a high cost, then the algorithm will avoid splitting there *either* by including both lines in the hunk *or* by omitting both lines. > Also I am wondering if 0 is a sensible thing to use as the fallback > value. When pre_indent is not definable, turning it to 0 and > declaring that our indented line is indented deeper than the > previous line and penalizing with "bonus -= 4" (which you do) feels > just as fishy as turning the undefined pre_indent to maxint and > declaring that our indented line is indented shallower than the > previous line (which you don't, but is just as justifiable, I would > think). Once the algorithm is fixed to consider the whole file, then the fallback value will only be used at the beginning and end of the file, where an indent of 0 seems like the natural thing to do. As I mentioned above, it might even help to give beginning/end of file a little more bonus than that. But really, testing against real-world diffs will be the best basis for adjusting the heuristic. Michael ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] diff compaction heuristic: favor shortest neighboring blank lines 2016-06-23 17:10 ` Michael Haggerty 2016-06-23 17:25 ` Stefan Beller 2016-06-23 17:37 ` Junio C Hamano @ 2016-06-30 13:54 ` Michael Haggerty 2016-07-01 17:04 ` diff heuristics dramatically improved by considering line indentation and " Michael Haggerty 2016-07-01 18:01 ` [PATCH] diff compaction heuristic: favor shortest neighboring " Junio C Hamano 2 siblings, 2 replies; 15+ messages in thread From: Michael Haggerty @ 2016-06-30 13:54 UTC (permalink / raw) To: Stefan Beller, Jeff King; +Cc: git, Jacob Keller, Junio C Hamano On 06/23/2016 07:10 PM, Michael Haggerty wrote: > On 06/17/2016 06:09 PM, Stefan Beller wrote: >> I think before spending more time on discussing and implementing new >> (hopefully better) heuristics, I'd want to step back and try to be a bit more >> systematic, i.e. I'll want to collect lots of test cases in different languages >> and use cases. A mini test suite, maybe? > > Stefan, > > I've also been playing around with diff heuristics and also trying to > find some good test data. Have you put together some test cases yet? I put quite of work into tooling to evaluate diff heuristics, and just published it on GitHub: https://github.com/mhagger/diff-slider-tools The README there is hopefully enough to let people get started using it to test their own favorite heuristics. !!! Please be careful! The scripts are not hardened against strangely-named files! !!! I've run a bunch of tests of the heuristic that I described in the previous email against a bunch of repositories (to try to get a sampling of popular languages): ant, bugzilla, couchdb, docker, git, ipython, junit, nodejs, phpmyadmin, test-more, test-unit, xmonad. I haven't done a lot of systematic analysis yet, nor tried to optimize the heuristic systematically, but by eye the results look very good. Some constructs (C preprocessor directives, Shell here documents) sometimes confuse it because of their weird indentation, and of course code with inconsistent indentation or strangely-placed blank lines doesn't get the best results. But in a large majority of cases where it disagrees with the standard Git or with `git --compaction-heuristic`, it does a better job. Below are a few examples to show off some typical results. I promise I tried not to cherry pick only flattering examples. In the output below, `|` mark the highest and lowest possible positioning of the "slider" (a group of added or deleted lines whose exact identity is ambiguous). Current Git, by default, always chooses the lowest positioning. "c" is where Git master with `--compaction-heuristic` chooses to position the slider, and `i` is the choice of my indentation-based algorithm. The numbers are just the shift relative to the standard shift. Michael The algorithm does pretty well with XML and HTML: > fef3ea39f8bd474add292bb6437df6cbd22e1ba7:contributors.xml a394a0bdf8e6240dc09023a8260059c57c158a00:contributors.xml + 1624 > vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv > > <last>Toni</last> > > </name> > > <name> > > <first>Vincent</first> > > <last>Legoll</last> > -2 | > </name> > -1 | i > <name> > 0 || ci > <first>Vincent</first> > || ci > <last>Privat</last> > | ci > </name> > | c > <name> > > <first>Vimil</first> > > <last>Saju</last> > > </name> > > <name> > > <first>Vitold</first> > > <last>Sedyshev</last> > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > 789422e131b6c2c003d94f394169a64297e986c6:manual/Tasks/signjar.html 7f51882300a8e40ff675867e055061867ba6c8bd:manual/Tasks/signjar.html + 151 > vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv > > </tr> > > <tr> > > <td valign="top">tsacert</td> > > <td valign="top">alias in the keystore for a timestamp authority for > > timestamped JAR files in Java1.5+</td> > -3 | > <td valign="top" align="center">No</td> > -2 | > </tr> > -1 | i > <tr> > 0 || ci > <td valign="top">tsaproxyhost</td> > || ci > <td valign="top">proxy host to be used when connecting to TSA server</td> > || ci > <td valign="top" align="center">No</td> > || ci > </tr> > || ci > <tr> > || ci > <td valign="top">getTsaproxyport</td> > || ci > <td valign="top">proxy port to be used when connecting to TSA server</td> > | ci > <td valign="top" align="center">No</td> > | ci > </tr> > | c > <tr> > > <td valign="top">executable</td> > > <td valign="top">Specify a particular <code>jarsigner</code> executable > > to use in place of the default binary (found in the same JDK as > > Apache Ant is running in).<br/> > > Must support the same command line options as the Sun JDK > > jarsigner command. > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Java: > de5b4058b9bb039cdb17082fc543098de598ece2:src/main/org/apache/tools/ant/taskdefs/optional/ssh/AbstractSshMessage.java 3fe363e2120342f6de2b219c3bd74efe9aea2803:src/main/org/apache/tools/ant/taskdefs/optional/ssh/AbstractSshMessage.java + 208 > vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv > > * @return true if the verbose attribute is set > > * @since Ant 1.6.2 > > */ > > protected final boolean getVerbose() { > > return verbose; > -3 | > } > -2 | > > -1 | ci > /** > 0 || ci > * Is the compressed attribute set. > || ci > * @return true if the compressed attribute is set > || ci > * @since Ant 1.9.8 > || ci > */ > || ci > protected final boolean getCompressed() { > || ci > return compressed; > | ci > } > | ci > > | > /** > > * Track progress every 10% if 100kb < filesize < 1mb. For larger > > * files track progress for every percent transmitted. > > * @param filesize the size of the file been transmitted > > * @param totalLength the total transmission size > > * @param percentTransmitted the current percent transmitted > > * @return the percent that the file is of the total > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > cd1ff3566ed915392041314eb583a638bcb1eb64:src/main/org/apache/tools/ant/taskdefs/optional/ejb/WeblogicDeploymentTool.java a8d6367ddc214b4956b2a7d6b779930df5e43515:src/main/org/apache/tools/ant/taskdefs/optional/ejb/WeblogicDeploymentTool.java - 888 > vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv > > // empty > > } > > } > > > > if (newJarStream != null) { > -1 | i > try { > 0 || ci > newJarStream.close(); > || ci > } catch (IOException closeException) { > || ci > // empty > || ci > } > || ci > > | c > try { > > FILE_UTILS.rename(newWLJarFile, weblogicJarFile); > > } catch (IOException renameException) { > > log(renameException.getMessage(), Project.MSG_WARN); > > rebuild = true; > > } > > } > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Perl: > bb18f8a6d560fa141f812dddcf6f6d8fe8b49dbb:Bugzilla/DB.pm 16f1833e572297edd89faddb69364e09efecdfdb:Bugzilla/DB.pm + 98 > vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv > ># PostgreSQL, and hopefully will get us as much Unicode support as possible, > ># depending on how well the regexp engines of the various databases support > ># Unicode. > >use constant WORD_START => '(^|[^[:alnum:]])'; > >use constant WORD_END => '($|[^[:alnum:]])'; > -2 | > > -1 | ci >##################################################################### > 0 || ci ># Overridden Superclass Methods > || ci >##################################################################### > || ci > > || ci >sub quote { > || ci > my $self = shift; > || ci > my $retval = $self->SUPER::quote(@_); > || ci > trick_taint($retval) if defined $retval; > || ci > return $retval; > || ci >} > | ci > > | >##################################################################### > ># Connection Methods > >##################################################################### > > > >sub connect_shadow { > > my $params = Bugzilla->params; > > die "Tried to connect to non-existent shadowdb" > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > dbfd6207290d1eee53fddec4c7c3b4aac0b2d47a:Bugzilla/Install/Requirements.pm 3c7ab8316f97fe84222ccacf0d1b4be9cd43e9f8:Bugzilla/Install/Requirements.pm - 338 > vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv > > feature => ['rest'] > > }, > > { > > package => 'HTTP-Response', > > module => 'HTTP::Response', > -4 | > version => 0, > -3 | > feature => ['rest'] > -2 | > }, > -1 | i > { > 0 || ci > package => 'URI-Escape', > || ci > module => 'URI::Escape', > | ci > version => 0, > | ci > feature => ['rest'] > | ci > }, > | c > { > > # We need the 'utf8_mode' method of HTML::Parser, for HTML::Scrubber. > > package => 'HTML-Parser', > > module => 'HTML::Parser', > > version => ($^V >= v5.13.3) ? '3.67' : '3.40', > > feature => ['html_desc'], > > }, > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Erlang: > 56f969b3c49375c5321eb3cceb0fe346f8535c22:src/couchdb/couch_httpd.erl b90e40212663474e873fde6cab343c31c1e635e7:src/couchdb/couch_httpd.erl + 332 > vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv > > case couch_httpd_cors:is_preflight_request(HttpReq) of > > #httpd{} -> > > case authenticate_request(HttpReq, AuthHandlers) of > > #httpd{} = Req -> > > HandlerFun(Req); > -2 | i > Response -> > -1 | i > Response > 0 || ci > end; > | c > Response -> > | c > Response > > end > > catch > > throw:{http_head_abort, Resp0} -> > > {ok, Resp0}; > > throw:{invalid_json, S} -> > > ?LOG_ERROR("attempted upload of invalid JSON (set log_level to debug to log it)", []), > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Go: > 7b2b4eb40cfd907f3aa20b630cc2feeb8460404a:opts/hosts.go fb3eb1c27ef5520571c599ead8a72b343748db39:opts/hosts.go + 132 > vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv > > u, err := url.Parse("tcp://" + addr) > > if err != nil { > > return "", err > > } > > host, port, err := net.SplitHostPort(u.Host) > -1 | i > if err != nil { > 0 || ci > // try port addition once > || ci > host, port, err = net.SplitHostPort(net.JoinHostPort(u.Host, defaultPort)) > || ci > } > | c > if err != nil { > > return "", fmt.Errorf("Invalid bind address format: %s", tryAddr) > > } > > > > if host == "" { > > host = defaultHost > > } > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ gettext source: > aef18cc6063106075afeb3a78ec72656b19c8bde:po/de.po b0e098ce46ce7f8ecd975720ccec8d0eb7787e50:po/de.po - 12954 > vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv > >#~ msgid "unmerged: %s" > >#~ msgstr "nicht zusammengeführt: %s" > > > >#~ msgid "input paths are terminated by a null character" > >#~ msgstr "Eingabepfade sind durch ein NUL Zeichen abgeschlossen" > -2 | > > -1 | ci >#~ msgid "" > 0 || ci >#~ "The following untracked files would NOT be saved but need to be removed " > || ci >#~ "by stash save:" > || ci >#~ msgstr "" > || ci >#~ "Die folgenden unbeobachteten Dateien würden NICHT gespeichert werden,\n" > || ci >#~ "müssen aber durch \"stash save\" entfernt werden:" > | ci > > | >#~ msgid "" > >#~ "Aborting. Consider using either the --force or --include-untracked option." > >#~ msgstr "" > >#~ "Abgebrochen. Benutzen Sie entweder die Option --force oder --include-" > >#~ "untracked." > > > >#~ msgid " (fix conflicts and then run \"git am --resolved\")" > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ C: > eeff891ac756fd97a05476446f15269b714ce4cc:grep.c 79a77109d3d0d364910ff7fa8c605c554dc4c3e0:grep.c + 1133 > vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv > > regmatch_t match; > > enum grep_context ctx = GREP_CONTEXT_BODY; > > int ch = *eol; > > int eflags = 0; > > > -1 | i > if (sign == ':') > 0 || ci > match_color = opt->color_match_selected; > || ci > else > || ci > match_color = opt->color_match_context; > | c > if (sign == ':') > > line_color = opt->color_selected; > > else if (sign == '-') > > line_color = opt->color_context; > > else if (sign == '=') > > line_color = opt->color_function; > > *eol = '\0'; > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Tcl: > a9ae14a1c50a9b5bd051be629dba0fb04b6fd67a:git-gui.sh 25476c63e7d3357f955b44dd6a7c825aba819987:git-gui.sh - 3158 > vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv > > > ># -- Diff Body Context Menu > ># > > > >proc create_common_diff_popup {ctxm} { > -1 | i > $ctxm add command \ > 0 || ci > -label [mc "Show Less Context"] \ > || ci > -command show_less_context > || ci > lappend diff_actions [list $ctxm entryconf [$ctxm index last] -state] > || ci > $ctxm add command \ > || ci > -label [mc "Show More Context"] \ > || ci > -command show_more_context > || ci > lappend diff_actions [list $ctxm entryconf [$ctxm index last] -state] > || ci > $ctxm add separator > | c > $ctxm add command \ > > -label [mc Refresh] \ > > -command reshow_diff > > lappend diff_actions [list $ctxm entryconf [$ctxm index last] -state] > > $ctxm add command \ > > -label [mc Copy] \ > > -command {tk_textCopy $ui_diff} > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Shell: > 34a82bda7766f000ef646130ed3f6af58ca23aa2:git-subtree.sh 942dce5578c8eb03fdc7f9109c8418d499e931ff:git-subtree.sh + 40 > vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv > >rejoin= > >ignore_joins= > >annotate= > > > >debug() > -1 | i >{ > 0 || ci > if [ -n "$debug" ]; then > || ci > echo "$@" >&2 > || ci > fi > || ci >} > || ci > > || ci >say() > | c >{ > > if [ -z "$quiet" ]; then > > echo "$@" >&2 > > fi > >} > > > >assert() > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Python: > 9ccc64b8ea94a9083038b079a8e2691ae94591fe:IPython/html/base/handlers.py db6ce120fc0bc0a4b29969f43af98e4841d5e1be:IPython/html/base/handlers.py - 90 > vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv > > def cookie_name(self): > > default_cookie_name = non_alphanum.sub('-', 'username-{}'.format( > > self.request.host > > )) > > return self.settings.get('cookie_name', default_cookie_name) > -2 | > > -1 | i > @property > 0 || ci > def password(self): > || ci > """our password""" > || ci > return self.settings.get('password', '') > | ci > > | c > @property > > def logged_in(self): > > """Is a user currently logged in? > > > > """ > > user = self.get_current_user() > > return (user and not user == 'anonymous') > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ javascript (and JSON): > d3c3a928c3a2f9e961881b47ef0796e57ae8d429:tools/eslint/lib/rules/no-arrow-condition.js d7aa8fa088f3b8a31c7d85c6d71824c8c60e7c17:tools/eslint/lib/rules/no-confusing-arrow.js - 58 > vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv > >//------------------------------------------------------------------------------ > >// Rule Definition > >//------------------------------------------------------------------------------ > > > >module.exports = function(context) { > -1 | i > /** > 0 || ci > * Reports if a conditional statement is an arrow function. > || ci > * @param {ASTNode} node - A node to check and report. > || ci > * @returns {void} > || ci > */ > || ci > function checkCondition(node) { > || ci > if (isArrowFunction(node)) { > || ci > context.report(node, "Arrow function `=>` used inside {{statementType}} instead of comparison operator.", {statementType: node.type}); > || ci > } > || ci > } > || ci > > | c > /** > > * Reports if an arrow function contains an ambiguous conditional. > > * @param {ASTNode} node - A node to check and report. > > * @returns {void} > > */ > > function checkArrowFunc(node) { > > if (isConditional(node)) { > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ PHP: > cc9550579c02d652686bdc574bc91fb982f7d95b:libraries/ThemeManager.php ca5eb903b04035f78819ae1e14ff888155bd3e32:libraries/ThemeManager.php + 20 > vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv > > * > > * @package PhpMyAdmin > > */ > >class ThemeManager > >{ > -1 | i > /** > 0 || ci > * ThemeManager instance > || ci > * > || ci > * @access private > || ci > * @static > || ci > * @var ThemeManager > || ci > */ > || ci > private static $_instance; > || ci > > | c > /** > > * @var string path to theme folder > > * @access protected > > */ > > private $_themes_path = './themes/'; > > > > /** > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Ruby: > 0ae3800be7ca632c47c089d527b53a8f66960c60:test/test_assertions.rb c35d9e2de60441cae592a300dae1460b45562cce:test/test_assertions.rb - 120 > vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv > > end > > > > def test_assert_equal > > check_nothing_fails { > > assert_equal("string1", "string1") > -2 | > } > -1 | i > check_nothing_fails { > 0 || ci > assert_equal( "string1", "string1", "successful assert_equal") > | ci > } > | c > check_nothing_fails { > > assert_equal("string1", "string1", "successful assert_equal") > > } > > > > message = <<-EOM.chomp > ><"string1"> expected but was > ><"string2">. > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ It often fails to get C preprocessor directives right: > a08595f76159b09d57553e37a5123f1091bb13e7:http.c aeff8a61216bf6e0d663c08c583bc8552fa3c344:http.c + 429 > vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv > > curl_easy_setopt(result, CURLOPT_SSLKEY, ssl_key); > >#endif > >#if LIBCURL_VERSION_NUM >= 0x070908 > > if (ssl_capath != NULL) > > curl_easy_setopt(result, CURLOPT_CAPATH, ssl_capath); > -1 | i >#endif > 0 || ci >#if LIBCURL_VERSION_NUM >= 0x072c00 > || ci > if (ssl_pinnedkey != NULL) > || ci > curl_easy_setopt(result, CURLOPT_PINNEDPUBLICKEY, ssl_pinnedkey); > | c >#endif > > if (ssl_cainfo != NULL) > > curl_easy_setopt(result, CURLOPT_CAINFO, ssl_cainfo); > > > > if (curl_low_speed_limit > 0 && curl_low_speed_time > 0) { > > curl_easy_setopt(result, CURLOPT_LOW_SPEED_LIMIT, > > curl_low_speed_limit); > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ And it gets confused by unusual blank line placement: > ed55169834a3ce16a271def9630c858626ded34d:tools/eslint/node_modules/doctrine/lib/doctrine.js 2d441493a4a46a511ba1bdf93e442c3288fbe92d:tools/eslint/node_modules/doctrine/lib/doctrine.js + 330 > vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv > > name === 'external' || > > name === 'event')) { > > name += advance(); > > name += scanIdentifier(last); > > > -1 | i > } > 0 || ci > if(source.charCodeAt(index) === 0x5B /* '[' */ && source.charCodeAt(index + 1) === 0x5D /* ']' */){ > || ci > name += advance(); > || ci > name += advance(); > | c > } > > while (source.charCodeAt(index) === 0x2E /* '.' */ || > > source.charCodeAt(index) === 0x23 /* '#' */ || > > source.charCodeAt(index) === 0x7E /* '~' */) { > > name += advance(); > > name += scanIdentifier(last); > > } > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ or inconsistent indentation: > 1768d08d029dc3bf1ab88e26df0a9b40ae61227f:src/main/java/org/junit/internal/runners/statements/FailOnTimeout.java 5ca9da987a7d4dc00e082aaf552cbd8ee8c7bd33:src/main/java/org/junit/internal/runners/statements/FailOnTimeout.java + 136 > vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv > > maxCpuTime = threadCpuTime; > > } > > } > > } > > return (stuckThread == mainThread) ? null : stuckThread; > -2 | i > } > -1 | i > > 0 || ci > /** > || ci > * Returns the CPU time used by a thread, if possible. > || ci > * @param thr The thread to query. > || ci > * @return The CPU time used by {@code thr}, or 0 if it cannot be determined. > || ci > */ > || ci > private long cpuTime (Thread thr) { > || ci > ThreadMXBean mxBean = ManagementFactory.getThreadMXBean(); > || ci > if (mxBean.isThreadCpuTimeSupported()) { > || ci > try { > || ci > return mxBean.getThreadCpuTime(thr.getId()); > || ci > } catch (UnsupportedOperationException e) { > || ci > } > || ci > } > || ci > return 0; > | c > } > | c > > > private class CallableStatement implements Callable<Throwable> { > > public Throwable call() throws Exception { > > try { > > fOriginalStatement.evaluate(); > > } catch (Exception e) { > > throw e; > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^ permalink raw reply [flat|nested] 15+ messages in thread
* diff heuristics dramatically improved by considering line indentation and blank lines 2016-06-30 13:54 ` Michael Haggerty @ 2016-07-01 17:04 ` Michael Haggerty 2016-07-01 18:01 ` [PATCH] diff compaction heuristic: favor shortest neighboring " Junio C Hamano 1 sibling, 0 replies; 15+ messages in thread From: Michael Haggerty @ 2016-07-01 17:04 UTC (permalink / raw) To: Stefan Beller, Jeff King; +Cc: git, Jacob Keller, Junio C Hamano On 06/30/2016 03:54 PM, Michael Haggerty wrote: > On 06/23/2016 07:10 PM, Michael Haggerty wrote: >> On 06/17/2016 06:09 PM, Stefan Beller wrote: >>> I think before spending more time on discussing and implementing new >>> (hopefully better) heuristics, I'd want to step back and try to be a bit more >>> systematic, i.e. I'll want to collect lots of test cases in different languages >>> and use cases. A mini test suite, maybe? >> >> Stefan, >> >> I've also been playing around with diff heuristics and also trying to >> find some good test data. Have you put together some test cases yet? > > I put quite of work into tooling to evaluate diff heuristics, and just > published it on GitHub: > > https://github.com/mhagger/diff-slider-tools Today I hand-optimized about 2700 diff sliders in 12 different repositories and used that as a reference corpus against which I tested the accuracy of four different algorithms. The tools and the hand-optimized values are now in the GitHub repository mentioned above. See also this pull request [1]. If I didn't make any mistakes, the number of errors (i.e., cases where the algorithm choose a different slider shift than the one I picked by hand) look like this: > | repository | default | compaction | indent | indent-favor-edges | > | -------------------- | ------- | ---------- | ------ | ------------------ | > | ant | 225 | 102 | 7 | 7 | > | bugzilla | 208 | 81 | 14 | 14 | > | couchdb | 44 | 24 | 13 | 10 | > | docker | 180 | 160 | 29 | 18 | > | git | 446 | 59 | 27 | 19 | > | ipython | 358 | 163 | 61 | 11 | > | junit | 146 | 67 | 5 | 5 | > | nodejs | 489 | 78 | 12 | 12 | > | phpmyadmin | 330 | 49 | 1 | 0 | > | test-more | 15 | 2 | 2 | 0 | > | test-unit | 33 | 13 | 4 | 4 | > | -------------------- | ------- | ---------- | ------ | ------------------ | > | totals | 2474 | 798 | 175 | 100 | The algorithms are: * default -- the default behavior `git diff` on the current master * compaction -- `git diff --compaction-heuristic` * indent -- the indent-based algorithm as reported earlier in this thread * indent-favor-edges -- the indent-based algorithm, with some improved heuristics regarding sliders near the begin/end of file and probably one or two fixes I encourage other people to run the tests against some code samples in their favorite programming languages so that the testing can cover a wider range of inputs, and submit your data to the GitHub project. Also feel free to tweak the heuristic (there's probably still room for improvement). All the tools and raw data for my experiments are already there. I'll be going on vacation for the next two weeks so I probably can't follow up on this for a while. But I think we should build an algorithm like this into Git! Michael [1] https://github.com/mhagger/diff-slider-tools/pull/1 ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] diff compaction heuristic: favor shortest neighboring blank lines 2016-06-30 13:54 ` Michael Haggerty 2016-07-01 17:04 ` diff heuristics dramatically improved by considering line indentation and " Michael Haggerty @ 2016-07-01 18:01 ` Junio C Hamano 2016-07-04 14:33 ` Jakub Narębski 1 sibling, 1 reply; 15+ messages in thread From: Junio C Hamano @ 2016-07-01 18:01 UTC (permalink / raw) To: Michael Haggerty; +Cc: Stefan Beller, Jeff King, git, Jacob Keller Michael Haggerty <mhagger@alum.mit.edu> writes: > I put quite of work into tooling to evaluate diff heuristics, and just > published it on GitHub: > > https://github.com/mhagger/diff-slider-tools > > The README there is hopefully enough to let people get started using it > to test their own favorite heuristics. Intereting. A fair TL;DR of this would be that what we write and want to diff have the structure where things often are ordered in an inherent hierarchy, and things at the higher level are highlighted by being less indented, and the indent-based heuristics exploit that well to choose a slide that breaks at the higher-level boundary, e.g. > The algorithm does pretty well with XML and HTML: > >> fef3ea39f8bd474add292bb6437df6cbd22e1ba7:contributors.xml a394a0bdf8e6240dc09023a8260059c57c158a00:contributors.xml + 1624 >> vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv >> > <last>Toni</last> >> > </name> >> > <name> >> > <first>Vincent</first> >> > <last>Legoll</last> >> -2 | > </name> >> -1 | i > <name> >> 0 || ci > <first>Vincent</first> >> || ci > <last>Privat</last> >> | ci > </name> >> | c > <name> And that trait is shared among common programming languages (Java, Perl, Go, C, etc.). The only case I can think of offhand that does not follow "higher level heading is less indented" pattern is an already typeset book, where chapter headers are often centered on the page while section headers may be at the left edge. But we are not likely to be interested to get that case right anyway, so it is OK. > gettext source: > >> aef18cc6063106075afeb3a78ec72656b19c8bde:po/de.po b0e098ce46ce7f8ecd975720ccec8d0eb7787e50:po/de.po - 12954 >> vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv >> >#~ msgid "unmerged: %s" >> >#~ msgstr "nicht zusammengeführt: %s" >> > >> >#~ msgid "input paths are terminated by a null character" >> >#~ msgstr "Eingabepfade sind durch ein NUL Zeichen abgeschlossen" >> -2 | > >> -1 | ci >#~ msgid "" >> 0 || ci >#~ "The following untracked files would NOT be saved but need to be removed " >> || ci >#~ "by stash save:" >> || ci >#~ msgstr "" >> || ci >#~ "Die folgenden unbeobachteten Dateien würden NICHT gespeichert werden,\n" >> || ci >#~ "müssen aber durch \"stash save\" entfernt werden:" >> | ci > This is an interesting example of "have only a single level hierarchy. A line is either in one block or a blank between blocks". > It often fails to get C preprocessor directives right: > >> a08595f76159b09d57553e37a5123f1091bb13e7:http.c aeff8a61216bf6e0d663c08c583bc8552fa3c344:http.c + 429 >> vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv >> > curl_easy_setopt(result, CURLOPT_SSLKEY, ssl_key); >> >#endif >> >#if LIBCURL_VERSION_NUM >= 0x070908 >> > if (ssl_capath != NULL) >> > curl_easy_setopt(result, CURLOPT_CAPATH, ssl_capath); >> -1 | i >#endif >> 0 || ci >#if LIBCURL_VERSION_NUM >= 0x072c00 >> || ci > if (ssl_pinnedkey != NULL) >> || ci > curl_easy_setopt(result, CURLOPT_PINNEDPUBLICKEY, ssl_pinnedkey); >> | c >#endif Yes, this is "non-human do not know 'end' is likely to be at the end of a logical block". > And it gets confused by unusual blank line placement: > >> ed55169834a3ce16a271def9630c858626ded34d:tools/eslint/node_modules/doctrine/lib/doctrine.js 2d441493a4a46a511ba1bdf93e442c3288fbe92d:tools/eslint/node_modules/doctrine/lib/doctrine.js + 330 >> vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv >> > name === 'external' || >> > name === 'event')) { >> > name += advance(); >> > name += scanIdentifier(last); >> > >> -1 | i > } >> 0 || ci > if(source.charCodeAt(index) === 0x5B /* '[' */ && source.charCodeAt(index + 1) === 0x5D /* ']' */){ >> || ci > name += advance(); >> || ci > name += advance(); >> | c > } Likewise, this is showing that a "non-human not knowing } is a closing and { is an opening token". By the way, I no longer remember what indent level your "indent-only" heuristics gives to a blank line. Does the closing brace we see above (marked with -1) count as increasing the indentation level from the previous empty-line at indent 0, or does it count as dedenting from the previous empty-line that has virtually at the same indent level as "name += scanIdentifier(last);" line? ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] diff compaction heuristic: favor shortest neighboring blank lines 2016-07-01 18:01 ` [PATCH] diff compaction heuristic: favor shortest neighboring " Junio C Hamano @ 2016-07-04 14:33 ` Jakub Narębski 0 siblings, 0 replies; 15+ messages in thread From: Jakub Narębski @ 2016-07-04 14:33 UTC (permalink / raw) To: git; +Cc: Stefan Beller, Jeff King, git, Jacob Keller W dniu 2016-07-01 o 20:01, Junio C Hamano pisze: > Michael Haggerty <mhagger@alum.mit.edu> writes: >> It often fails to get C preprocessor directives right: >> >>> a08595f76159b09d57553e37a5123f1091bb13e7:http.c aeff8a61216bf6e0d663c08c583bc8552fa3c344:http.c + 429 >>> vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv >>> > curl_easy_setopt(result, CURLOPT_SSLKEY, ssl_key); >>> >#endif >>> >#if LIBCURL_VERSION_NUM >= 0x070908 >>> > if (ssl_capath != NULL) >>> > curl_easy_setopt(result, CURLOPT_CAPATH, ssl_capath); >>> -1 | i >#endif >>> 0 || ci >#if LIBCURL_VERSION_NUM >= 0x072c00 >>> || ci > if (ssl_pinnedkey != NULL) >>> || ci > curl_easy_setopt(result, CURLOPT_PINNEDPUBLICKEY, ssl_pinnedkey); >>> | c >#endif > > Yes, this is "non-human do not know 'end' is likely to be at the end > of a logical block". I wonder if taking into account xfuncname match to adjust heuristics (or to change "slider") would help there. I think good heuristic would be to break before xfuncname match (before new section). >> And it gets confused by unusual blank line placement: >> >>> ed55169834a3ce16a271def9630c858626ded34d:tools/eslint/node_modules/doctrine/lib/doctrine.js 2d441493a4a46a511ba1bdf93e442c3288fbe92d:tools/eslint/node_modules/doctrine/lib/doctrine.js + 330 >>> vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv >>> > name === 'external' || >>> > name === 'event')) { >>> > name += advance(); >>> > name += scanIdentifier(last); >>> > >>> -1 | i > } >>> 0 || ci > if(source.charCodeAt(index) === 0x5B /* '[' */ && source.charCodeAt(index + 1) === 0x5D /* ']' */){ >>> || ci > name += advance(); >>> || ci > name += advance(); >>> | c > } > > Likewise, this is showing that a "non-human not knowing } is a closing > and { is an opening token". If not encoding heuristic that [,{,( are opening token, and ],},) are closing token into heuristics, perhaps length of the line could be a consideration about where to start a diff chunk? -- Jakub Narębski ^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2016-07-04 14:34 UTC | newest] Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2016-06-16 17:46 [PATCH] diff compaction heuristic: favor shortest neighboring blank lines Stefan Beller 2016-06-16 20:27 ` Junio C Hamano 2016-06-16 21:06 ` Stefan Beller 2016-06-16 21:10 ` Michael Haggerty 2016-06-16 21:36 ` Stefan Beller 2016-06-17 15:36 ` Jeff King 2016-06-17 16:09 ` Stefan Beller 2016-06-23 17:10 ` Michael Haggerty 2016-06-23 17:25 ` Stefan Beller 2016-06-23 17:37 ` Junio C Hamano 2016-06-23 20:13 ` Michael Haggerty 2016-06-30 13:54 ` Michael Haggerty 2016-07-01 17:04 ` diff heuristics dramatically improved by considering line indentation and " Michael Haggerty 2016-07-01 18:01 ` [PATCH] diff compaction heuristic: favor shortest neighboring " Junio C Hamano 2016-07-04 14:33 ` Jakub Narębski
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.