jk/maint-merge-rename-create [Was: Re: What's cooking in git.git (Jun 2011, #02; Sat, 11)]

* jk/maint-merge-rename-create [Was: Re: What's cooking in git.git (Jun 2011, #02; Sat, 11)]
@ 2011-06-18 21:49 Elijah Newren
  2011-06-20 16:08 ` jk/maint-merge-rename-create Junio C Hamano
  0 siblings, 1 reply; 2+ messages in thread
From: Elijah Newren @ 2011-06-18 21:49 UTC (permalink / raw)
  To: Jeff King; +Cc: Jay Soffian, Junio C Hamano, git

Hi,

On Thu, Jun 16, 2011 at 3:05 PM, Jeff King <peff@peff.net> wrote:
> Thanks. The sticking point in my series that there is a weird regression
> it introduces, and I haven't quite figured out the cause.
>
> I'm cc'ing Jay Soffian, who found it. You can reproduce with this recipe
> (sorry, the chromium repo is huge, but I don't have a smaller test case
> yet):
>
>  git clone http://git.chromium.org/git/chromium.git &&
>  cd chromium &&
>  git config merge.renameLimit 0 &&
>  git checkout 0f6d00c &&
>  git cherry-pick d7081a74
...
> If you have any input, I'd appreciate it.

The length of my email may make you change your mind about that.  :-)

Turning on break detection looks like it opens a huge can of worms;
including one issue that I don't understand, namely why break
detection doesn't work when I expect it to.  Let me start with that:

--- Why doesn't break detection work as I expect? ---
$ git init -q
$ printf "1\n2\n3\n4\n5\n6\n7\n8\n9\n10\n" >a
$ printf "a\nb\nc\nd\ne\nf\ng\nh\ni\nj\n" >b
$ git add a b
$ git commit -q -mA

$ git mv a a2
$ git mv b b2
$ git diff --name-status -B -M HEAD
R100    a       a2
R100    b       b2
$ git mv a2 b
$ git mv b2 a
$ git diff --name-status -B -M HEAD
M       a
M       b

I would have expected the output from the last command above to be
R100    a       b
R100    b       a
Why are these renames not detected?  Is there a reason, or is this a bug?

--- Issues with merge-recursive and break detection ---
The chromium repository testcase Jay provided helps uncover a few
extra issues in merge-recursive when turning on break detection, but I
believe there are a number of others that need consideration as well.
Sorry for the lengthy email, but there are lots of little issues.

Whenever the read_tree logic can't handle a merge trivially at a given
path (as per Documentation/technical/trivial-merge.txt), it will load
the three relevant stages for that path into the index.  For
simplicity, let's ignore mode changes, and just consider sha1sums.
Let me name things so we can talk about the various cases:
  o_sha - sha1sum of given path in the merge-base
  a_sha - sha1sum of given path in HEAD
  b_sha - sha1sum of given path on other branch involved in merge

In the case of a rename, there are two files involved, and thus (up
to) 6 relevant stages that merge-recursive.c needs to consider.  Let
me use s_ as a prefix for the source path of a rename and d_ as a
prefix for the destination path of a rename.  That means for renames
the sha1sums we need to consider are:
  old_name -> new_name
  --------    --------
  s_o_sha     d_o_sha
  s_a_sha     d_a_sha
  s_b_sha     d_b_sha

Simple renames will have 3 of those 6 sha1sums be the null sha1; the 3
non-null sha1s will either be {s_o_sha, s_a_sha, d_b_sha} or {s_o_sha,
d_a_sha, s_b_sha}.  Git handles those just fine.

A slight complication is having both of d_[ab]_sha be non-null (while
d_o_sha remains null).  This case is either a rename/rename(2to1)
conflict or a rename/add conflict.  Either way, git has long had code
to handle both of those cases.

A slightly more complicated case is having all three of the s_*_sha be
non-null as well as exactly one of d_[ab]_sha.  Such cases would have
previously resulted in git ignoring old_name as a rename candidate,
since we were not previously using break detection.  Your series is
aimed at handling these cases, so I won't cover them in detail.  I
think you do so correctly for almost all such cases, though I'm
suspicious that there might still be issues with some corner cases
(e.g. with D/F conflicts or rename/rename(1to2) cases).

The problems the chromium repository is triggering is when all d_*_sha
are non-null.  That was impossible before your patch series, since
having all three be non-null meant it was ignored as a rename
candidate.  It also meant no code was ever needed for that case, so
none had been written.  There are a number of possibilites, I'll try
to highlight using example values, using d and e (and later f-i) as
variables standing for sha1sums:

Basic d_*_sha all non-null case:
  old_name    -> new_name
  -----------    -----------
  s_o_sha = d    d_o_sha = e
  s_a_sha = d    d_a_sha = e
  s_b_sha = 0    d_b_sha = d
This suggests that the content originally stored at new_name was
unmodified on one side of history and deleted (or renamed to something
else) on the other.  The content originally stored at old_name has
moved to new_name, so the correct resolution is to delete old_name and
write the contents of d to new_name.  git has no code for this; it
assumes d_o_sha is null and thus sees this case as a rename/add
conflict.

Slightly more complex; 3-way merge:
  s_o_sha = d1   d_o_sha = e
  s_a_sha = d2   d_a_sha = e
  s_b_sha = 0    d_b_sha = d3
This is like the last case, but now we need to do a three way merge of
d1,d2,d3 and store the results at new_name.  It appears that all of
the conflicts you saw with Jay's chromium testcase would be fixed if
there were code for this case.

Slightly more complex; 3-way merge + possible conflict:
  s_o_sha = d1   d_o_sha = e1
  s_a_sha = d2   d_a_sha = e2
  s_b_sha = 0    d_b_sha = d3
In addition to the three-way content merge of d1,d2,d3 that is
supposed to be stored at new_path, there is also a potential
modify/delete problem (e1, e2, -nothing-) that ALSO is supposed to be
stored at new_name.  (What do we call that -- rename+modify/delete
conflict?)  However, there is also a chance that it's not a
modify/delete, but rather another rename using new_path as the source
of the rename -- in which case there would be no conflict at new_path
after all (unless the 3-way merge of d1,d2,d3 had content conflicts).

Slightly more complex: 3-way merge + add conflict despite base presence:
  s_o_sha = d1   d_o_sha = e
  s_a_sha = d2   d_a_sha = f
  s_b_sha = 0    d_b_sha = d3
Here the original new_name was either rewritten, deleted, or renamed
on both sides of history.  e, f, and d3 all have nothing to do with
each other, which means that in addition to the three-way content
merge of d1,d2,d3 that is supposed to be stored at new_path, there is
also an added file that is unrelated.  In other words, we either have
a rename/add conflict or a rename/rename(2to1) conflict.

Just for fun:
  file_g         file_h         file_i
  ------------   ------------   ------------
  g_o_sha = g1   h_o_sha = h1   i_o_sha = i1
  g_a_sha = g2   h_a_sha = h2   i_a_sha = i2
  g_b_sha = i3   h_b_sha = g3   i_b_sha = h3
cyclic rename; the side of history being merged into head also had fun
with renames: (file_g, file_h, file_i) -> (file_h, file_i, file_g).
Also, each of file_g, file_h, and file_i were modified on both sides.
Correct resolution: 3-way merge of g1,g2,g3 stored in file_h, 3-way
merge of h1,h2,h3 stored in file_i, and 3-way merge of i1,i2,i3 stored
in file_g.

More complexity with more renames:
The rename/rename (either 1to2 or 2to1) conflicts will involve 9
sha1sums rather than six.  Also, as shown in the previous case, either
the source or the destination path of a rename could be involved (as
either source or destination) of yet another rename.

Don't forget the pesky D/F possibilities:
The possibility of D/F conflicts complicates things slightly -- for
example, both the source and destination paths of a rename could be
involved in a D/F conflict, and thus we may need to record the
conflicts at an alternate location.  Whether we need to do so or not
cannot be determined in process_renames(); therefore, the extra
handling we add for both source and destination entries (at least the
final index/working-directory changes) should be deferred until later.
 There may also be interesting ramifications for criss-cross merges
due to these break rewrite cases as well.

Stuff I'm still ignoring:
As Junio noted previously, break detection could be used to note that
a modify/delete conflict is really a delete/delete/add and thus there
should be no conflict.  Also, a rename/modify that has odd content
conflicts might be easier to handle if it was determined that it was
really a rename/delete/add-source case.  The only thing I'm
considering above is using break-detection as a way to get better
rename detection.

Elijah

^ permalink raw reply	[flat|nested] 2+ messages in thread