Re: [PATCH 2/4] line-log: adjust start/end of ranges individually

From: Johannes Schindelin <Johannes.Schindelin@gmx.de>
To: Eric Sunshine <sunshine@sunshineco.com>
Cc: gitgitgadget@gmail.com, Git List <git@vger.kernel.org>,
	Thomas Rast <tr@thomasrast.ch>,
	Junio C Hamano <gitster@pobox.com>
Subject: Re: [PATCH 2/4] line-log: adjust start/end of ranges individually
Date: Mon, 6 Aug 2018 14:52:06 +0200 (DST)	[thread overview]
Message-ID: <nycvar.QRO.7.76.6.1808061232120.71@tvgsbejvaqbjf.bet> (raw)
In-Reply-To: <CAPig+cRWcFVbA76_HT2iVD16bsUmbWdCgk_07rmiGneM5czdOQ@mail.gmail.com>

Hi Eric,

On Sun, 5 Aug 2018, Eric Sunshine wrote:

> On Sat, Aug 4, 2018 at 6:18 PM Johannes Schindelin via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
> > When traversing commits and adjusting the ranges, things can get really
> > tricky. For example, when the line range of interest encloses several
> > hunks of a commit, the line range can actually shrink.
> >
> > Currently, range_set_shift_diff() does not anticipate that scenario and
> > blindly adjusts start and end by the same offset ("shift" the range).
> > [...]
> > Let's fix this by adjusting the start and the end offsets individually.
> >
> > Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> > ---
> > diff --git a/line-log.c b/line-log.c
> > @@ -438,7 +438,13 @@ static void range_set_shift_diff(struct range_set *out,
> >                                 - (target[j].end-target[j].start);
> >                         j++;
> >                 }
> > -               range_set_append(out, src[i].start+offset, src[i].end+offset);
> > +               start_offset = offset;
> > +               while (j < diff->target.nr && src[i].end > target[j].end) {
> > +                       offset += (parent[j].end-parent[j].start)
> > +                               - (target[j].end-target[j].start);
> > +                       j++;
> > +               }
> > +               range_set_append(out, src[i].start+start_offset, src[i].end+offset);
> 
> I'm still trying to wrap my head around the original code, so I'm not
> even at the point of being able to say if this fix is correct. What
> happens if the "start_offset" loop consumes all of 'j' before it even
> gets to the new loop?

First of all, it is not really the `start_offset` loop, but more like the
"end_offset loop": we are still adjusting `offset`, and that is what we
use to adjust the end of the current line range, while we take pains to
keep the offset that needs to be used for the start of said line range.

> Why does the new loop use '>' whereas the existing uses '>='?

As a mathematician, I usually think of intervals as inclusive only on one
end. So I intuitively use the non-inclusive comparison on the other end.

In this instance, the first loop tries to make sure that the offset to be
used for the start offset has taken into account all of the relevant diff
hunks (meaning: the diff hunks whose target lines start before src[i],
i.e. the current line range to adjust).

The second loop, however, tries to make sure that the offset to adjust the
*end* of the current line range.

Hence what I wrote.

But now that I think about it again, I am no longer sure that the first
loop (the one I did not change) is sound, to begin with. Let's imagine a
very simple case, where we try to adjust a start line, say, 100, and the
diff has a hunk with header `@@ -90,20 +90,40 @@`. So the start line lies
somewhere around a quarter into the post-image.

The current line-log code adjusts this by a very crude method: since the
line number lies between the post-image's start, it is adjusted by adding
the length of the pre-image and subtracting the length of the post image,
i.e. 20 - 40. The result is that we pretend that it came from line number
80, which is squarely outside the pre-image range.

That method of adjustment is appropriate for lines *outside* the
post-image range, of course. If we had looked at line 140 (which is 10
lines after the post-image), then it would be totally correct to say that
it came from line 120.

But that method is pretty obviously broken for any line that lies *within*
a post-image.

So I think the entire loop is in dear need of adjustment.

The question is: how should it be adjusted? The safest way would be to
adjust start and end of the line range differently, where start is pulled
back to the pre-image's start (in the above example, 90), and end is
pushed up to the pre-image's end (in the above example, 110).

Of course, this has a slightly unintuitive consequence: imagine that some
commit made a block of, say, 20 lines a conditional one. Like,

	/* Some comment */
	do_something = 1;
	...
	print_it();
	...

could have become

	if (!dry_run) {
		/* Some comment */
		do_something = 1;
		...
		print_it();
		...
	}

If you use my proposed method to adjust the line ranges to figure out the
history of that `print_it()` line, this commit would blow up the line
range to the entire conditional block, which is probably not what you
wanted.

I don't really see a better way, though.

> Having said that, a much easier fix is to use range_set_append_unsafe()
> here, and then at the bottom of the loop, invoke
> 'sort_and_merge_range_set(out)' to restore range-set invariants and
> ensure that neighboring ranges are coalesced. Not only does that resolve
> the crash and other weird behavior, but it means you don't have to add a
> special-case to range_set_append(), thus the fix becomes simpler
> overall.

That would only paper over the bug, as the line range is adjusted
incorrectly. It really is.

When looking at a line range, say, 90--110, and traversing a commit that
added 5 lines from line 100-105, then the end and the start of that line
range really need to be adjusted independently. In this example, the start
would not even budge, but the end would need to be adjusted to 105.

The current code would adjust neither because the start of the line range
is outside of the post-image range.

No amount of restoring invariants can make up for that incorrect
adjustment.

> Aside from simplicity, I think the suggested use of
> range_set_append_unsafe() and sort_and_merge_range_set() _is_ the
> correct fix anyhow because this code isn't taking care to ensure that
> the range, after applying 'offset', doesn't abut or overlap with an
> earlier range, and sort_and_merge_range_set() is meant to be used
> exactly in cases like this when invariants may be broken.
> 
> So, while the suggested fix is simpler and "better" and fixes the
> crash, that doesn't necessarily mean that the values computed here are
> actually correct. As noted, I'm still trying to grok the computation
> of these values, but that's a separate issue from the crash itself.

I think sort_and_merge_range_set() is fine for "Postelizing user input"
(Postel's law: be stringent in your output and lenient in your input,
something we should also heed in general, not only in our code).

But I don't think our code should need it. If the line ranges need to
resort to re-sorting (pun intended), that smells a lot like a bug.

And the only way those ranges could be merged is when they start to touch
(like in the test case I provided, when the lines 1, 2, 3 and the lines 4,
5, become a single line range because a, b and then c, are removed from
the original 1, 2, 3, a, b, c, 4, 5). And my suggested fix makes that
change quite elegantly, while appending line ranges (which will work
correctly if done in the correct order, which should always be the case
because both our line ranges as well as the diff hunks are ordered).

What do you think about my proposed change to adjust start and end?

Ciao,
Dscho