git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Elijah Newren <newren@gmail.com>
To: Junio C Hamano <gitster@pobox.com>
Cc: Elijah Newren via GitGitGadget <gitgitgadget@gmail.com>,
	Git Mailing List <git@vger.kernel.org>,
	Derrick Stolee <dstolee@microsoft.com>,
	Jonathan Tan <jonathantanmy@google.com>,
	Taylor Blau <me@ttaylorr.com>, Jeff King <peff@peff.net>,
	Derrick Stolee <stolee@gmail.com>
Subject: Re: [PATCH v3 3/5] diffcore-rename: complete find_basename_matches()
Date: Sat, 13 Feb 2021 10:34:50 -0800	[thread overview]
Message-ID: <CABPp-BF_cj1GEYT75aj9funUk1mYvtM2OvKiNb1JAZzA6E5mgQ@mail.gmail.com> (raw)
In-Reply-To: <xmqqsg60u49i.fsf@gitster.c.googlers.com>

On Fri, Feb 12, 2021 at 5:48 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > +     /* Now look for basename matchups and do similarity estimation */
> > +     for (i = 0; i < num_src; ++i) {
> > +             char *filename = rename_src[i].p->one->path;
> > +             const char *base = NULL;
> > +             intptr_t src_index;
> > +             intptr_t dst_index;
> > +
> > +             /* Find out if this basename is unique among sources */
> > +             base = get_basename(filename);
> > +             src_index = strintmap_get(&sources, base);
> > +             if (src_index == -1)
> > +                     continue; /* not a unique basename; skip it */
> > +             assert(src_index == i);
> > +
> > +             if (strintmap_contains(&dests, base)) {
> > +                     struct diff_filespec *one, *two;
> > +                     int score;
> > +
> > +                     /* Find out if this basename is unique among dests */
> > +                     dst_index = strintmap_get(&dests, base);
> > +                     if (dst_index == -1)
> > +                             continue; /* not a unique basename; skip it */
>
> It would be a lot easier to read if "we must have the same singleton
> in dests" in a single if condition, I suspect.  I.e.
>
>                 if (strintmap_contains(&dests, base) &&
>                     0 <= (dst_index = (strintmap_get(&dests, base)))) {

I can change that.  I can also simplify it further to

        if (0 <= (dst_index = (strintmap_get(&dests, base)))) {

since dests uses a default value of -1.  That will decrease the number
of strmap lookups here from 2 to 1.

> It is a bit sad that we iterate over rename_src[] array, even though
> we now have a map that presumably have fewer number of entries than
> the original array, though.

Oh, interesting; I forgot all about that.  I just looked up my
original implementation from February of last year and indeed I had
done exactly that
(https://github.com/newren/git/commit/43eaec6007c92b6af05e0ef0fcc047c1d1ba1de8).
However, when I added a later optimization that pairs up non-unique
basenames, I had to switch to looping over rename_src.

For various reasons (mostly starting with the fact that I had lots of
experimental ideas that were tried and thrown out but with pieces kept
around for ideas), I wasn't even close to having a clean history in my
original implementation of merge-ort and the diffcore-rename
optimizations.  And it was far, far easier to achieve the goal of a
clean history by picking out chunks of code from the end-state and
creating entirely new commits than attempting to use my existing
history.  But, of course, that method made me lose this intermediate
state.

>
> > +                     /* Ignore this dest if already used in a rename */
> > +                     if (rename_dst[dst_index].is_rename)
> > +                             continue; /* already used previously */
>
> Since we will only be matching between unique entries in src and
> dst, this "this has been used, so we cannot use it" will not change
> during this loop.  I wonder if the preparation done in the previous
> step, i.e. [PATCH v3 2/5], can take advantage of this fact, i.e.  a
> dst that has already been used (in the previous "exact" step) would
> not even have to be in &dests map, so that the strintmap_contains()
> check can reject it much earlier.

Good, catch again.  The previous step (v4 2/5) actually did already
check this, so this if-condition will always be false at this point.
Looking at the link above, this if-condition check wasn't there in the
original, but again was added due to altered state introduced by a
later optimization.  So, I should pull this check out of this patch
and add it back in to the later patch.

> Stepping back a bit, it appears to me that [2/5] and [3/5] considers
> a source file having unique basename among the sources even if there
> are many such files with the same basename, as long as all the other
> files with the same basename have been matched in the previous
> "exact" phase.  It probably does the same thing for destination
> side.
>
> Intended?
>
> It feels incompatible with the spirit of these two steps aim for
> (i.e. only use this optimization on a pair of src/dst with UNIQUE
> basenames).  For the purpose of "we only handle unique ones", the
> paths that already have matched should participate in deciding if
> the files that survived "exact" phase have unique basename among
> the original inpu?

Yeah, I should have been more careful with my wording.  Stated a
different way, what confidence can we associate with an exact rename?
Obviously, the confidence is high as we mark them as renames.  But if
the confidence is less than 100%, and enough less than 100% that it
casts a doubt on "related" inexact renames, then yes the basenames of
the exact renames should also be computed so that we can determine
what basenames are truly unique.  By the exact same argument, you
could take this a step further and say that we should calculate the
basenames of *all* files in the tree, not just add/delete pairs, and
only match up the ones via basename that are *truly* unique.  After
all, break detection exists, so perhaps we don't have full confidence
that files with an unchanged fullname are actually related.

From my view, though, both are too cautious and throwing out valuable
heuristics for common cases.  Starting with break detection, it is off
for a reason: we think unchanged filename is a strong enough heuristic
to just match up those files and consider the confidence of the match
in effect 100%.  Similarly, we put a lot of confidence in exact rename
detection.  If there are multiple adds/deletes with the same basename,
and all but one on each side are paired up by exact rename detection,
aren't the remaining two files a (very) likely rename pair?  I think
so, and believe they're worth including in the basename-based rename
detection step.  We do require basename-based matches to meet a much
higher similarity scoring threshold now, which I feel already
adequately adjusts for not doing full content similarity against all
other files.

Also, in the next series, I find an additional way to match up files
by basename when basenames are not unique, and which doesn't involve
pairwise comparing all the files with the same basename.  I only pick
at most one other file to compare to (and the selection is not
random).  So, my overall strategy for these two series is "find which
basenames are likely matches" even if I didn't word it very well.

I do agree, though, that I should add some more careful wording about
this in the series.  I'll include it in a re-roll.

  reply	other threads:[~2021-02-13 18:36 UTC|newest]

Thread overview: 71+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-02-06 22:52 [PATCH 0/3] Optimization batch 7: use file basenames to guide rename detection Elijah Newren via GitGitGadget
2021-02-06 22:52 ` [PATCH 1/3] diffcore-rename: compute basenames of all source and dest candidates Elijah Newren via GitGitGadget
2021-02-06 22:52 ` [PATCH 2/3] diffcore-rename: complete find_basename_matches() Elijah Newren via GitGitGadget
2021-02-06 22:52 ` [PATCH 3/3] diffcore-rename: guide inexact rename detection based on basenames Elijah Newren via GitGitGadget
2021-02-07 14:38   ` Derrick Stolee
2021-02-07 19:51     ` Junio C Hamano
2021-02-08  8:38       ` Elijah Newren
2021-02-08 11:43         ` Derrick Stolee
2021-02-08 16:25           ` Elijah Newren
2021-02-08 17:37         ` Junio C Hamano
2021-02-08 22:00           ` Elijah Newren
2021-02-08 23:43             ` Junio C Hamano
2021-02-08 23:52               ` Elijah Newren
2021-02-08  8:27     ` Elijah Newren
2021-02-08 11:31       ` Derrick Stolee
2021-02-08 16:09         ` Elijah Newren
2021-02-07  5:19 ` [PATCH 0/3] Optimization batch 7: use file basenames to guide rename detection Junio C Hamano
2021-02-07  6:05   ` Elijah Newren
2021-02-09 11:32 ` [PATCH v2 0/4] " Elijah Newren via GitGitGadget
2021-02-09 11:32   ` [PATCH v2 1/4] diffcore-rename: compute basenames of all source and dest candidates Elijah Newren via GitGitGadget
2021-02-09 13:17     ` Derrick Stolee
2021-02-09 16:56       ` Elijah Newren
2021-02-09 17:02         ` Derrick Stolee
2021-02-09 17:42           ` Elijah Newren
2021-02-09 11:32   ` [PATCH v2 2/4] diffcore-rename: complete find_basename_matches() Elijah Newren via GitGitGadget
2021-02-09 13:25     ` Derrick Stolee
2021-02-09 17:17       ` Elijah Newren
2021-02-09 17:34         ` Derrick Stolee
2021-02-09 11:32   ` [PATCH v2 3/4] diffcore-rename: guide inexact rename detection based on basenames Elijah Newren via GitGitGadget
2021-02-09 13:33     ` Derrick Stolee
2021-02-09 17:41       ` Elijah Newren
2021-02-09 18:59         ` Junio C Hamano
2021-02-09 11:32   ` [PATCH v2 4/4] gitdiffcore doc: mention new preliminary step for rename detection Elijah Newren via GitGitGadget
2021-02-09 12:59     ` Derrick Stolee
2021-02-09 17:03       ` Junio C Hamano
2021-02-09 17:44         ` Elijah Newren
2021-02-10 15:15   ` [PATCH v3 0/5] Optimization batch 7: use file basenames to guide " Elijah Newren via GitGitGadget
2021-02-10 15:15     ` [PATCH v3 1/5] t4001: add a test comparing basename similarity and content similarity Elijah Newren via GitGitGadget
2021-02-13  1:15       ` Junio C Hamano
2021-02-13  4:50         ` Elijah Newren
2021-02-13 23:56           ` Junio C Hamano
2021-02-14  1:24             ` Elijah Newren
2021-02-14  1:32               ` Junio C Hamano
2021-02-14  3:14                 ` Elijah Newren
2021-02-10 15:15     ` [PATCH v3 2/5] diffcore-rename: compute basenames of all source and dest candidates Elijah Newren via GitGitGadget
2021-02-13  1:32       ` Junio C Hamano
2021-02-10 15:15     ` [PATCH v3 3/5] diffcore-rename: complete find_basename_matches() Elijah Newren via GitGitGadget
2021-02-13  1:48       ` Junio C Hamano
2021-02-13 18:34         ` Elijah Newren [this message]
2021-02-13 23:55           ` Junio C Hamano
2021-02-14  3:08             ` Elijah Newren
2021-02-10 15:15     ` [PATCH v3 4/5] diffcore-rename: guide inexact rename detection based on basenames Elijah Newren via GitGitGadget
2021-02-13  1:49       ` Junio C Hamano
2021-02-10 15:15     ` [PATCH v3 5/5] gitdiffcore doc: mention new preliminary step for rename detection Elijah Newren via GitGitGadget
2021-02-10 16:41       ` Junio C Hamano
2021-02-10 17:20         ` Elijah Newren
2021-02-11  8:15     ` [PATCH v4 0/6] Optimization batch 7: use file basenames to guide " Elijah Newren via GitGitGadget
2021-02-11  8:15       ` [PATCH v4 1/6] t4001: add a test comparing basename similarity and content similarity Elijah Newren via GitGitGadget
2021-02-11  8:15       ` [PATCH v4 2/6] diffcore-rename: compute basenames of all source and dest candidates Elijah Newren via GitGitGadget
2021-02-11  8:15       ` [PATCH v4 3/6] diffcore-rename: complete find_basename_matches() Elijah Newren via GitGitGadget
2021-02-11  8:15       ` [PATCH v4 4/6] diffcore-rename: guide inexact rename detection based on basenames Elijah Newren via GitGitGadget
2021-02-11  8:15       ` [PATCH v4 5/6] gitdiffcore doc: mention new preliminary step for rename detection Elijah Newren via GitGitGadget
2021-02-11  8:15       ` [PATCH v4 6/6] merge-ort: call diffcore_rename() directly Elijah Newren via GitGitGadget
2021-02-13  1:53       ` [PATCH v4 0/6] Optimization batch 7: use file basenames to guide rename detection Junio C Hamano
2021-02-14  7:51       ` [PATCH v5 " Elijah Newren via GitGitGadget
2021-02-14  7:51         ` [PATCH v5 1/6] t4001: add a test comparing basename similarity and content similarity Elijah Newren via GitGitGadget
2021-02-14  7:51         ` [PATCH v5 2/6] diffcore-rename: compute basenames of source and dest candidates Elijah Newren via GitGitGadget
2021-02-14  7:51         ` [PATCH v5 3/6] diffcore-rename: complete find_basename_matches() Elijah Newren via GitGitGadget
2021-02-14  7:51         ` [PATCH v5 4/6] diffcore-rename: guide inexact rename detection based on basenames Elijah Newren via GitGitGadget
2021-02-14  7:51         ` [PATCH v5 5/6] gitdiffcore doc: mention new preliminary step for rename detection Elijah Newren via GitGitGadget
2021-02-14  7:51         ` [PATCH v5 6/6] merge-ort: call diffcore_rename() directly Elijah Newren via GitGitGadget

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CABPp-BF_cj1GEYT75aj9funUk1mYvtM2OvKiNb1JAZzA6E5mgQ@mail.gmail.com \
    --to=newren@gmail.com \
    --cc=dstolee@microsoft.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=gitster@pobox.com \
    --cc=jonathantanmy@google.com \
    --cc=me@ttaylorr.com \
    --cc=peff@peff.net \
    --cc=stolee@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).