All of lore.kernel.org
 help / color / mirror / Atom feed
* Lost file after git merge
@ 2022-07-28  8:23 Laďa Tesařík
  2022-07-28 12:17 ` Ævar Arnfjörð Bjarmason
  2022-07-28 21:23 ` brian m. carlson
  0 siblings, 2 replies; 9+ messages in thread
From: Laďa Tesařík @ 2022-07-28  8:23 UTC (permalink / raw)
  To: git

Dear all,

recently we experienced strange behavior of git merge after deleting file in two branches and introducing it again in one of them (revision graph: https://i.stack.imgur.com/jNUZB.png):

1. I added a file called 'new_file' to a master branch.
2. Then I created branch feature/2 and deleted the file in master
3. Then I deleted the file in branch feature/2 as well.
4. I created 'new_file' on branch feature/2 again.
5. I merged feature/2 into master, merge introduced no changes, file 'new_file' is not present in master.

This could cause problems to our team in the future, fortunately we noticed it this time.
Please, does anybody have any explanation for this behavior?

Or is anybody aware of any Git or Source option that changes this behaviour?

Thank you very much,

Lada Tesarik

P. S.
In the picture the steps were done with git revert command, but result is the same if I delete file manually and perform git commit.
Tested with git versions 2.26 and 2.37

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Lost file after git merge
  2022-07-28  8:23 Lost file after git merge Laďa Tesařík
@ 2022-07-28 12:17 ` Ævar Arnfjörð Bjarmason
  2022-07-28 17:11   ` Junio C Hamano
  2022-07-28 21:23 ` brian m. carlson
  1 sibling, 1 reply; 9+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-07-28 12:17 UTC (permalink / raw)
  To: Laďa Tesařík; +Cc: git


On Thu, Jul 28 2022, Laďa Tesařík wrote:

> 1. I added a file called 'new_file' to a master branch.
> 2. Then I created branch feature/2 and deleted the file in master
> 3. Then I deleted the file in branch feature/2 as well.
> 4. I created 'new_file' on branch feature/2 again.
> 5. I merged feature/2 into master, merge introduced no changes, file 'new_file' is not present in master.
>
> This could cause problems to our team in the future, fortunately we noticed it this time.
> Please, does anybody have any explanation for this behavior?
>
> Or is anybody aware of any Git or Source option that changes this behaviour?
>
> Thank you very much,
>
> Lada Tesarik
>
> P. S.
> In the picture the steps were done with git revert command, but result is the same if I delete file manually and perform git commit.
> Tested with git versions 2.26 and 2.37

Those who know "merge" better will chime in, but I believe this is known
behavior, and you can see why if your "topic" branch doesn't
(re)introduce the file as it was on master, but adds a small change to
it.

I quickly experimented with this (with the fast-export stream at the
end). If you had a "hello" as the content on "master" and introduced
"hello2" on the topic you'd get:

	$ git merge topic
	CONFLICT (modify/delete): file.txt deleted in HEAD and modified in topic.  Version topic of file.txt left in tree.
	Automatic merge failed; fix conflicts and then commit the result.

So others will know better, but I vaguely recall this having to do with
this being the same in your version (not the below hello2 version, but
if you s/hello2/hello/):

	git rev-parse $(git merge-base topic master):
	git rev-parse topic:

I.e. the topic is the TREESAME as its divergence point with "master",
which IIRC is a shortcut the merge takes when deciding what changes to
keep, and whether they conflict.

--
blob
mark :1
data 6
hello

reset refs/heads/master
commit refs/heads/master
mark :2
author Ævar Arnfjörð Bjarmason <avarab@gmail.com> 1659010359 +0200
committer Ævar Arnfjörð Bjarmason <avarab@gmail.com> 1659010359 +0200
data 9
Add file
M 100644 :1 file.txt

commit refs/heads/master
mark :3
author Ævar Arnfjörð Bjarmason <avarab@gmail.com> 1659010437 +0200
committer Ævar Arnfjörð Bjarmason <avarab@gmail.com> 1659010437 +0200
data 81
Revert "Add file"

This reverts commit 75aa73c55b19219d47dfc7ec3883c88466678718.
from :2
D file.txt

commit refs/heads/topic
mark :4
author Ævar Arnfjörð Bjarmason <avarab@gmail.com> 1659010369 +0200
committer Ævar Arnfjörð Bjarmason <avarab@gmail.com> 1659010369 +0200
data 81
Revert "Add file"

This reverts commit 75aa73c55b19219d47dfc7ec3883c88466678718.
from :2
D file.txt

blob
mark :5
data 7
hello2

commit refs/heads/topic
mark :6
author Ævar Arnfjörð Bjarmason <avarab@gmail.com> 1659010359 +0200
committer Ævar Arnfjörð Bjarmason <avarab@gmail.com> 1659010644 +0200
data 9
Add file
from :4
M 100644 :5 file.txt


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Lost file after git merge
  2022-07-28 12:17 ` Ævar Arnfjörð Bjarmason
@ 2022-07-28 17:11   ` Junio C Hamano
  2022-07-29 20:23     ` René Scharfe
  0 siblings, 1 reply; 9+ messages in thread
From: Junio C Hamano @ 2022-07-28 17:11 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Laďa Tesařík, git

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> On Thu, Jul 28 2022, Laďa Tesařík wrote:
>
>> 1. I added a file called 'new_file' to a master branch.
>> 2. Then I created branch feature/2 and deleted the file in master
>> 3. Then I deleted the file in branch feature/2 as well.
>> 4. I created 'new_file' on branch feature/2 again.

It heavily depends on how this creation is done, i.e. what went into
the created file.  Imagine that a file existed with content A at
commit 0, both commits 1 and 2 removed it on their forked history,
and then commit 3 added exactly the same content A to the same path:

          1---3
         /     \
    ----0---2---4---->

When you are about to merge 2 and 3 to create 4, what would a
three-way merge see?

    0 had content A at path P
    2 said "no we do not want content A at path P"
    3 said "we are happy with content A at path P"

So the net result is that 0-->3 "one side did not touch A at P" and
0-->2 "one side removed A at P".  

Three-way merge between X and Y is all about taking what X did if Y
didn't have any opinion on what X touched.  This is exactly that
case.  The history 0--->3 didn't have any opinion on what should be
in P or whether P should exist, and that is why there is no change
between these two endpoints.  The history 0--->2 does care---it feels
that it is detrimental to the project to have P hence it removed.

So the end result will remove P, if 3 added identical content as
existed at 0 and removed at 1.

If 3 added something different, then the picture becomes entirely
different.  The history 0--->3 no longer has "no opinion".  It
strongly believes that P having content A at 0 was wrong, and it
should have content B, hence it changed it.  Now when that opinion
collides with the opinion of the history 0--->2 that says it is
wrong to have content A at path P, the person who is creating the
merge at 4 needs to think and resolve.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Lost file after git merge
  2022-07-28  8:23 Lost file after git merge Laďa Tesařík
  2022-07-28 12:17 ` Ævar Arnfjörð Bjarmason
@ 2022-07-28 21:23 ` brian m. carlson
  1 sibling, 0 replies; 9+ messages in thread
From: brian m. carlson @ 2022-07-28 21:23 UTC (permalink / raw)
  To: Laďa Tesařík; +Cc: git

[-- Attachment #1: Type: text/plain, Size: 2723 bytes --]

On 2022-07-28 at 08:23:24, Laďa Tesařík wrote:
> Dear all,
> 
> recently we experienced strange behavior of git merge after deleting file in two branches and introducing it again in one of them (revision graph: https://i.stack.imgur.com/jNUZB.png):
> 
> 1. I added a file called 'new_file' to a master branch.
> 2. Then I created branch feature/2 and deleted the file in master
> 3. Then I deleted the file in branch feature/2 as well.
> 4. I created 'new_file' on branch feature/2 again.
> 5. I merged feature/2 into master, merge introduced no changes, file 'new_file' is not present in master.
> 
> This could cause problems to our team in the future, fortunately we noticed it this time.
> Please, does anybody have any explanation for this behavior?
> 
> Or is anybody aware of any Git or Source option that changes this behaviour?

I think this is covered in the Git FAQ:

If I make a change on two branches but revert it on one, why does the merge of those branches include the change?
	By default, when Git does a merge, it uses a strategy called the `ort`
	strategy, which does a fancy three-way merge.  In such a case, when Git
	performs the merge, it considers exactly three points: the two heads and a
	third point, called the _merge base_, which is usually the common ancestor of
	those commits.  Git does not consider the history or the individual commits
	that have happened on those branches at all.

  As a result, if both sides have a change and one side has reverted that change,
  the result is to include the change.  This is because the code has changed on
  one side and there is no net change on the other, and in this scenario, Git
  adopts the change.

  If this is a problem for you, you can do a rebase instead, rebasing the branch
  with the revert onto the other branch.  A rebase in this scenario will revert
  the change, because a rebase applies each individual commit, including the
  revert.  Note that rebases rewrite history, so you should avoid rebasing
  published branches unless you're sure you're comfortable with that.  See the
  NOTES section in git-rebase(1) for more details.

In this case, your change is to delete the file and the revert is to
re-add it, which is why this is happening.

I realize this is tricky, which is why I added this FAQ entry, and,
before that, the mention in git-merge(1) in the sentence starting with
"With the strategies".  It is, however, the correct behaviour, since a
three-way merge must necessarily adopt a change, whatever that is, when
the other side is unchanged, and as a result, there's no configuration
option for this.
-- 
brian m. carlson (he/him or they/them)
Toronto, Ontario, CA

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 263 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Lost file after git merge
  2022-07-28 17:11   ` Junio C Hamano
@ 2022-07-29 20:23     ` René Scharfe
  2022-07-29 22:04       ` Junio C Hamano
  2022-07-30  2:16       ` Elijah Newren
  0 siblings, 2 replies; 9+ messages in thread
From: René Scharfe @ 2022-07-29 20:23 UTC (permalink / raw)
  To: Junio C Hamano, Ævar Arnfjörð Bjarmason
  Cc: Laďa Tesařík, git

Am 28.07.22 um 19:11 schrieb Junio C Hamano:
> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
>
>> On Thu, Jul 28 2022, Laďa Tesařík wrote:
>>
>>> 1. I added a file called 'new_file' to a master branch.
>>> 2. Then I created branch feature/2 and deleted the file in master
>>> 3. Then I deleted the file in branch feature/2 as well.
>>> 4. I created 'new_file' on branch feature/2 again.
>
> It heavily depends on how this creation is done, i.e. what went into
> the created file.  Imagine that a file existed with content A at
> commit 0, both commits 1 and 2 removed it on their forked history,
> and then commit 3 added exactly the same content A to the same path:
>
>           1---3
>          /     \
>     ----0---2---4---->
>
> When you are about to merge 2 and 3 to create 4, what would a
> three-way merge see?
>
>     0 had content A at path P
>     2 said "no we do not want content A at path P"
>     3 said "we are happy with content A at path P"
>
> So the net result is that 0-->3 "one side did not touch A at P" and
> 0-->2 "one side removed A at P".
>
> Three-way merge between X and Y is all about taking what X did if Y
> didn't have any opinion on what X touched.  This is exactly that
> case.  The history 0--->3 didn't have any opinion on what should be
> in P or whether P should exist, and that is why there is no change
> between these two endpoints.

The last sentence is not necessarily true.  You could also say that
0--->3 cared so much about path P having content A that it brought it
back from the void.  Determining whether a de-facto revert
- intended to return to an uncaring state of "take whatever main has" or
- meant to choose *that* specific content which incidentally is on main
is not possible from the snapshots at the merge point alone, I think.

Checking if 0...3 touched P and leaving that path unmerged out of
caution shouldn't be terribly expensive.

> The history 0--->2 does care---it feels
> that it is detrimental to the project to have P hence it removed.
>
> So the end result will remove P, if 3 added identical content as
> existed at 0 and removed at 1.
>
> If 3 added something different, then the picture becomes entirely
> different.  The history 0--->3 no longer has "no opinion".  It
> strongly believes that P having content A at 0 was wrong, and it
> should have content B, hence it changed it.  Now when that opinion
> collides with the opinion of the history 0--->2 that says it is
> wrong to have content A at path P, the person who is creating the
> merge at 4 needs to think and resolve.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Lost file after git merge
  2022-07-29 20:23     ` René Scharfe
@ 2022-07-29 22:04       ` Junio C Hamano
  2022-07-30  2:16       ` Elijah Newren
  1 sibling, 0 replies; 9+ messages in thread
From: Junio C Hamano @ 2022-07-29 22:04 UTC (permalink / raw)
  To: René Scharfe
  Cc: Ævar Arnfjörð Bjarmason, Laďa Tesařík, git

René Scharfe <l.s.r@web.de> writes:

>> Three-way merge between X and Y is all about taking what X did if Y
>> didn't have any opinion on what X touched.  This is exactly that
>> case.  The history 0--->3 didn't have any opinion on what should be
>> in P or whether P should exist, and that is why there is no change
>> between these two endpoints.
>
> The last sentence is not necessarily true.  You could also say that
> 0--->3 cared so much about path P having content A that it brought it
> back from the void.  Determining whether a de-facto revert
> - intended to return to an uncaring state of "take whatever main has" or
> - meant to choose *that* specific content which incidentally is on main
> is not possible from the snapshots at the merge point alone, I think.

Yes, it is doable.  It no longer is a three-way merge if we did so,
though ;-)

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Lost file after git merge
  2022-07-29 20:23     ` René Scharfe
  2022-07-29 22:04       ` Junio C Hamano
@ 2022-07-30  2:16       ` Elijah Newren
  2022-07-30 14:44         ` René Scharfe
  1 sibling, 1 reply; 9+ messages in thread
From: Elijah Newren @ 2022-07-30  2:16 UTC (permalink / raw)
  To: René Scharfe
  Cc: Junio C Hamano, Ævar Arnfjörð Bjarmason,
	Laďa Tesařík, git

On Fri, Jul 29, 2022 at 1:34 PM René Scharfe <l.s.r@web.de> wrote:
>
> Am 28.07.22 um 19:11 schrieb Junio C Hamano:
> > Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
> >
> >> On Thu, Jul 28 2022, Laďa Tesařík wrote:
> >>
> >>> 1. I added a file called 'new_file' to a master branch.
> >>> 2. Then I created branch feature/2 and deleted the file in master
> >>> 3. Then I deleted the file in branch feature/2 as well.
> >>> 4. I created 'new_file' on branch feature/2 again.
> >
> > It heavily depends on how this creation is done, i.e. what went into
> > the created file.  Imagine that a file existed with content A at
> > commit 0, both commits 1 and 2 removed it on their forked history,
> > and then commit 3 added exactly the same content A to the same path:
> >
> >           1---3
> >          /     \
> >     ----0---2---4---->
> >
> > When you are about to merge 2 and 3 to create 4, what would a
> > three-way merge see?
> >
> >     0 had content A at path P
> >     2 said "no we do not want content A at path P"
> >     3 said "we are happy with content A at path P"
> >
> > So the net result is that 0-->3 "one side did not touch A at P" and
> > 0-->2 "one side removed A at P".
> >
> > Three-way merge between X and Y is all about taking what X did if Y
> > didn't have any opinion on what X touched.  This is exactly that
> > case.  The history 0--->3 didn't have any opinion on what should be
> > in P or whether P should exist, and that is why there is no change
> > between these two endpoints.
>
> The last sentence is not necessarily true.  You could also say that
> 0--->3 cared so much about path P having content A that it brought it
> back from the void.  Determining whether a de-facto revert
> - intended to return to an uncaring state of "take whatever main has" or
> - meant to choose *that* specific content which incidentally is on main
> is not possible from the snapshots at the merge point alone, I think.
>
> Checking if 0...3 touched P and leaving that path unmerged out of
> caution shouldn't be terribly expensive.

I think it might be terribly expensive.

Walking history can easily be the slow part of such an operation, e.g.
can_fast_forward() taking roughly 100 times as long as doing the
merge_incore_recursive() portion that creates the new merged toplevel
tree[1].  (And can_fast_forward() is a form of history walk that
doesn't involve traversing into any trees, so I suspect it's a cheaper
history traversal than what is being suggested).

Focusing on the tree traversal side, this suggested change would
essentially disable the trivial directory resolution optimizations in
merge-ort[2].  (Note that the trivial directory resolution sped up a
rebase that didn't involve very many renames by a factor of 25).  The
whole point of that optimization was to avoid walking into trees that
were only changed on one side, where possible.  Your proposed change
would be saying we always have to walk into trees that either side
modified...and do so for every intermediate commit as well so that we
can fully enumerate all (temporarily) changed files.


[1] https://lore.kernel.org/git/CABPp-BE48=97k_3tnNqXPjSEfA163F8hoE+HY0Zvz1SWB2B8EA@mail.gmail.com/
[2] https://lore.kernel.org/git/pull.988.v4.git.1626841444.gitgitgadget@gmail.com/

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Lost file after git merge
  2022-07-30  2:16       ` Elijah Newren
@ 2022-07-30 14:44         ` René Scharfe
  2022-07-31  1:45           ` Elijah Newren
  0 siblings, 1 reply; 9+ messages in thread
From: René Scharfe @ 2022-07-30 14:44 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Junio C Hamano, Ævar Arnfjörð Bjarmason,
	Laďa Tesařík, git

Am 30.07.22 um 04:16 schrieb Elijah Newren:
> On Fri, Jul 29, 2022 at 1:34 PM René Scharfe <l.s.r@web.de> wrote:
>>
>> Am 28.07.22 um 19:11 schrieb Junio C Hamano:
>>> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
>>>
>>>> On Thu, Jul 28 2022, Laďa Tesařík wrote:
>>>>
>>>>> 1. I added a file called 'new_file' to a master branch.
>>>>> 2. Then I created branch feature/2 and deleted the file in master
>>>>> 3. Then I deleted the file in branch feature/2 as well.
>>>>> 4. I created 'new_file' on branch feature/2 again.
>>>
>>> It heavily depends on how this creation is done, i.e. what went into
>>> the created file.  Imagine that a file existed with content A at
>>> commit 0, both commits 1 and 2 removed it on their forked history,
>>> and then commit 3 added exactly the same content A to the same path:
>>>
>>>           1---3
>>>          /     \
>>>     ----0---2---4---->
>>>
>>> When you are about to merge 2 and 3 to create 4, what would a
>>> three-way merge see?
>>>
>>>     0 had content A at path P
>>>     2 said "no we do not want content A at path P"
>>>     3 said "we are happy with content A at path P"
>>>
>>> So the net result is that 0-->3 "one side did not touch A at P" and
>>> 0-->2 "one side removed A at P".
>>>
>>> Three-way merge between X and Y is all about taking what X did if Y
>>> didn't have any opinion on what X touched.  This is exactly that
>>> case.  The history 0--->3 didn't have any opinion on what should be
>>> in P or whether P should exist, and that is why there is no change
>>> between these two endpoints.
>>
>> The last sentence is not necessarily true.  You could also say that
>> 0--->3 cared so much about path P having content A that it brought it
>> back from the void.  Determining whether a de-facto revert
>> - intended to return to an uncaring state of "take whatever main has" or
>> - meant to choose *that* specific content which incidentally is on main
>> is not possible from the snapshots at the merge point alone, I think.
>>
>> Checking if 0...3 touched P and leaving that path unmerged out of
>> caution shouldn't be terribly expensive.
>
> I think it might be terribly expensive.
>
> Walking history can easily be the slow part of such an operation, e.g.
> can_fast_forward() taking roughly 100 times as long as doing the
> merge_incore_recursive() portion that creates the new merged toplevel
> tree[1].  (And can_fast_forward() is a form of history walk that
> doesn't involve traversing into any trees, so I suspect it's a cheaper
> history traversal than what is being suggested).
>
> Focusing on the tree traversal side, this suggested change would
> essentially disable the trivial directory resolution optimizations in
> merge-ort[2].  (Note that the trivial directory resolution sped up a
> rebase that didn't involve very many renames by a factor of 25).  The
> whole point of that optimization was to avoid walking into trees that
> were only changed on one side, where possible.  Your proposed change
> would be saying we always have to walk into trees that either side
> modified...and do so for every intermediate commit as well so that we
> can fully enumerate all (temporarily) changed files.

True: Compared to just checking if a path was touched by 3, a history
traversal can take arbitrarily long.  At least it's bounded by the merge
base and a specific path.  And renames complicate the picture, but only
full renames (same blob or tree ID) need to be considered.  That feels
doable in a reasonable amount of time, but it's not as cheap as ignoring
the history.

Assuming that one side doesn't care about a path because it has the same
content as the merge base is tempting.  And reverts that break this
assumption are probably quite rare.  Still it led to an unintended
outcome here.  Reminds me of a recent chess robot incident [3].  Speed
is nice and safety has a cost, but do we already make the best possible
tradeoff here?

> [1] https://lore.kernel.org/git/CABPp-BE48=97k_3tnNqXPjSEfA163F8hoE+HY0Zvz1SWB2B8EA@mail.gmail.com/
> [2] https://lore.kernel.org/git/pull.988.v4.git.1626841444.gitgitgadget@gmail.com/

[3] https://www.theguardian.com/sport/2022/jul/24/chess-robot-grabs-and-breaks-finger-of-seven-year-old-opponent-moscow

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Lost file after git merge
  2022-07-30 14:44         ` René Scharfe
@ 2022-07-31  1:45           ` Elijah Newren
  0 siblings, 0 replies; 9+ messages in thread
From: Elijah Newren @ 2022-07-31  1:45 UTC (permalink / raw)
  To: René Scharfe
  Cc: Junio C Hamano, Ævar Arnfjörð Bjarmason,
	Laďa Tesařík, git

Okay, you nerd-sniped me into responding why I'm super skeptical of this path...

On Sat, Jul 30, 2022 at 7:44 AM René Scharfe <l.s.r@web.de> wrote:
>
> Am 30.07.22 um 04:16 schrieb Elijah Newren:
> > On Fri, Jul 29, 2022 at 1:34 PM René Scharfe <l.s.r@web.de> wrote:
> >>
> >> Am 28.07.22 um 19:11 schrieb Junio C Hamano:
> >>> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
> >>>
> >>>> On Thu, Jul 28 2022, Laďa Tesařík wrote:
> >>>>
> >>>>> 1. I added a file called 'new_file' to a master branch.
> >>>>> 2. Then I created branch feature/2 and deleted the file in master
> >>>>> 3. Then I deleted the file in branch feature/2 as well.
> >>>>> 4. I created 'new_file' on branch feature/2 again.
> >>>
> >>> It heavily depends on how this creation is done, i.e. what went into
> >>> the created file.  Imagine that a file existed with content A at
> >>> commit 0, both commits 1 and 2 removed it on their forked history,
> >>> and then commit 3 added exactly the same content A to the same path:
> >>>
> >>>           1---3
> >>>          /     \
> >>>     ----0---2---4---->
> >>>
> >>> When you are about to merge 2 and 3 to create 4, what would a
> >>> three-way merge see?
> >>>
> >>>     0 had content A at path P
> >>>     2 said "no we do not want content A at path P"
> >>>     3 said "we are happy with content A at path P"
> >>>
> >>> So the net result is that 0-->3 "one side did not touch A at P" and
> >>> 0-->2 "one side removed A at P".
> >>>
> >>> Three-way merge between X and Y is all about taking what X did if Y
> >>> didn't have any opinion on what X touched.  This is exactly that
> >>> case.  The history 0--->3 didn't have any opinion on what should be
> >>> in P or whether P should exist, and that is why there is no change
> >>> between these two endpoints.
> >>
> >> The last sentence is not necessarily true.  You could also say that
> >> 0--->3 cared so much about path P having content A that it brought it
> >> back from the void.  Determining whether a de-facto revert
> >> - intended to return to an uncaring state of "take whatever main has" or
> >> - meant to choose *that* specific content which incidentally is on main
> >> is not possible from the snapshots at the merge point alone, I think.
> >>
> >> Checking if 0...3 touched P and leaving that path unmerged out of
> >> caution shouldn't be terribly expensive.
> >
> > I think it might be terribly expensive.
> >
> > Walking history can easily be the slow part of such an operation, e.g.
> > can_fast_forward() taking roughly 100 times as long as doing the
> > merge_incore_recursive() portion that creates the new merged toplevel
> > tree[1].  (And can_fast_forward() is a form of history walk that
> > doesn't involve traversing into any trees, so I suspect it's a cheaper
> > history traversal than what is being suggested).
> >
> > Focusing on the tree traversal side, this suggested change would
> > essentially disable the trivial directory resolution optimizations in
> > merge-ort[2].  (Note that the trivial directory resolution sped up a
> > rebase that didn't involve very many renames by a factor of 25).  The
> > whole point of that optimization was to avoid walking into trees that
> > were only changed on one side, where possible.  Your proposed change
> > would be saying we always have to walk into trees that either side
> > modified...and do so for every intermediate commit as well so that we
> > can fully enumerate all (temporarily) changed files.
>
> True: Compared to just checking if a path was touched by 3, a history
> traversal can take arbitrarily long.  At least it's bounded by the merge
> base and a specific path.

How is it bounded by a specific path?  How do you know which path to
check?  Can't this affect many paths?  I think to fix this issue, you
either have to walk all paths for all intermediate commits, OR:
  * get the list of paths modified between the merge base and one side
of history (just an endpoint diff, not a view of history along the
way)
  * get the list of paths modified between the merge base and the
other side of history (again, just an endpoint diff)
  * take the symmetric difference of those two lists above (let's name
the resulting set SD so we can refer to it later).
  * Walk all intermediate commits between the merge base and the two
endpoints being merged, and for each one investigate every path in it
corresponding to one of the paths in SD.
  * As we do the above walk into trees of all intermediate commits,
look for a sequence of commits that does some change and then later
reverts it

Also, I think using the phrase "the merge base" points out some
ill-posedness in this problem.  For actual merges, I think it's well
posed.  But...

What about cherry-picks, rebases, and reverts?  Those pass a funny
merge-base to the merge machinery.  Using that merge-base does allow
us to construct a valid commit range, but in that context, is that
range still correct to use with your rule?  Perhaps it does make sense
to use that range and apply your same rule, but...

What if we instead cherry-pick a different way -- using format-patch
and `git am -3`?  At the time git-am runs, it only has one commit
available to it.  It cannot determine any merge-base.  So what range
of commits should it walk to see if the upstream side has both made a
change to some file and then undone that change?  Do we walk all of
history?  Do we attempt to find some point in history that appears to
have the same versions of the files in the "from" versions of the
patch being applied?  What if such a commit does not exist?  What if
many such commits exist?  If many such commits exist, what if using
one of those commits would give us a different answer than using
another commit (meaning, what if we do see a change-and-then-unchange
if one of them is used as the merge-base, but do not see such a
sequence if the other one is)?  I'm certain I could construct examples
of each of those "What if" scenarios for git-am.

>  And renames complicate the picture, but only
> full renames (same blob or tree ID) need to be considered.  That feels
> doable in a reasonable amount of time, but it's not as cheap as ignoring
> the history.

Making things 25x slower and calling it "not as cheap"?  That's some
literary jiu-jitsu there.

Also, I think the 25x figure may well undercount the expense, perhaps
even dramatically, since I'm not sure things can be limited in the
form you envision...

> Assuming that one side doesn't care about a path because it has the same
> content as the merge base is tempting.  And reverts that break this
> assumption are probably quite rare.  Still it led to an unintended
> outcome here.  Reminds me of a recent chess robot incident [3].  Speed
> is nice and safety has a cost, but do we already make the best possible
> tradeoff here?

I mean, if you want to make an alternative merge strategy that isn't
based on a three-way merge, by all means go ahead.  We have pluggable
backends, after all.

It's not clear to me that your suggestion actually leads to a
situation with fewer unintended outcomes, though it is clear that
performance could be dramatically worse.  Let's explore this for a bit
(more than I already have above)...

Let's say you aim to fix this "unintended outcome" via paying
attention to the changes from individual commits along the way.  This
isn't the only such issue users report that is intrinsic to how
three-way content merges work.  Do you decide to also address the
others?  From the top of my head, others include:

A) generalize your logic here -- apply it to individual hunks rather
than only to whole files.  Using your exact same rationale would
suggest that if someone modifies a hunk on one side of history and
undid it later, then that region of code should not cleanly resolve to
what the other side of history has but should instead be marked as
conflicted.
B) handle commits cherry-picked to both sides "better".  If someone
applies the same cherry-pick to both sides, and they then further
modify the same lines on one side, there should be no "useless"
conflict, because both sides "built on top of" the same cherry-picked
state.
C) make rename detection match files based on changes in individual
commits rather than on the overall diff since the merge base

It turns out that others have explored part of this space before, e.g.
Bram Cohen's "Precise Codeville Merge"[4], which had names for
behaviors (A) and (B) (see [5] and [6], respectively; also note that
it points out that requirements (A) and (B) can come into conflict).
Perhaps if you are interested, you can port his code over and use his
algorithm?  However, it might be worth noting that he later gave up on
it and recommended that people use three-way merges instead[7].

Since Git seems to be the only system that does rename detection, (C)
is new territory.  But it's also relevant in that a user requested
that the merge algorithm incorporate information from changes in
individual commits along the history in order to solve a "bug" they
ran into[8].

Also, besides implementing Bram's "Precise Codeville Merge" in Git,
there may be another pre-existing solution available: using Michael
Haggerty's git-imerge[9] and making sure it fully fills out its matrix
of merges instead of opportunistically attempting to skip some.  That
would certainly make sure that changes from each individual commit was
considered.

But let's say you don't want to use either Precise Codeville Merge or
git-imerge, and want to make a copy of our default strategy
(merge-ort) and modify it.  Once you implement your new strategy, I'm
sure the other cases in (A), (B), and (C) will be brought up by people
and it's hard to see based on your rationale so far why your new
algorithm shouldn't attempt to solve those issues as well.  If you do,
your performance will really tank.  Not only did you kill the trivial
directory resolution optimization, you also kill the irrelevant
renames optimization (which, IMO, was the crowning jewel of the ort
optimizations).  But it's also worse than that because (A) and likely
(B) force you to compute pairwise diffs of every sequence of commits
since the merge base, and (C) forces those all to include rename
detection (both for full and partial renames).  But it gets worse.
Some of those commits in the history back to the merge base will be
merge commits, for which we have to be careful with the diff
computation.  We can't just ignore merge commits, because
change-and-then-unchanged code might have been done in merge commits
as part of an attempt to fix semantic conflicts.  So, to address merge
commits, we have to see the changes made by users in those merges
relative to the auto-merge state.  If we do that, the upshot is that
handling (A)-(C) implies that to merge two branches, we have to look
at all merge commits in the history of the two branches back to the
merge base, and remerge every one of them.

Maybe you're okay with all of the above, even with others giving up on
merge strategies other than three-way merge.  Maybe individually
merging all commits since the merge base doesn't phase you.  Maybe you
can make some kind of distinction where some of (A)-(C) don't matter.
Maybe you just capitulate and say you only care about the problem that
started this thread and ignore all three of those requests.  Maybe you
don't care about git-am and exclude it, overlooking the extra
difference we've added between it and the other commands (a bit sad
since merge-ort had removed one of those differences and I like
commands being more consistent).  Maybe the resulting "mere" 25x
slowdown is good enough for you for your repository sizes.  Or maybe
you'd be hiding this behind some non-default option that people could
select, allowing them to make the tradeoff.  Whether any or all of
that is true, I just don't see how to maintain it with the rationale
and description given, so I'm not comfortable with it affecting the
code I have to maintain.

So, I'm skeptical.  But feel free to create a new merge strategy and
prove me wrong.


[4] https://tonyg.github.io/revctrl.org/PreciseCodevilleMerge.html
[5] https://tonyg.github.io/revctrl.org/ImplicitUndo.html
[6] https://tonyg.github.io/revctrl.org/Convergence.html
[7] https://bramcohen.livejournal.com/52148.html
[8] https://lore.kernel.org/git/CABPp-BHF5j9XTrJOreHMaQ+TrZB1VgZvs-Bq2vFD1yhLBdkV3A@mail.gmail.com/
[9] https://github.com/mhagger/git-imerge

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2022-07-31  1:46 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-07-28  8:23 Lost file after git merge Laďa Tesařík
2022-07-28 12:17 ` Ævar Arnfjörð Bjarmason
2022-07-28 17:11   ` Junio C Hamano
2022-07-29 20:23     ` René Scharfe
2022-07-29 22:04       ` Junio C Hamano
2022-07-30  2:16       ` Elijah Newren
2022-07-30 14:44         ` René Scharfe
2022-07-31  1:45           ` Elijah Newren
2022-07-28 21:23 ` brian m. carlson

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.