git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Elijah Newren <newren@gmail.com>
To: Derrick Stolee <stolee@gmail.com>
Cc: Elijah Newren via GitGitGadget <gitgitgadget@gmail.com>,
	Git Mailing List <git@vger.kernel.org>,
	Derrick Stolee <dstolee@microsoft.com>,
	Jonathan Tan <jonathantanmy@google.com>,
	Taylor Blau <me@ttaylorr.com>, Junio C Hamano <gitster@pobox.com>,
	Jeff King <peff@peff.net>
Subject: Re: [PATCH v2 1/4] diffcore-rename: compute basenames of all source and dest candidates
Date: Tue, 9 Feb 2021 08:56:42 -0800	[thread overview]
Message-ID: <CABPp-BEsuOiUyvbkwPC384eho8pgSWuRdcvw9t5gkXhf+_j-3A@mail.gmail.com> (raw)
In-Reply-To: <dfbffe97-51de-9e8b-37a4-417909358323@gmail.com>

Hi,

On Tue, Feb 9, 2021 at 5:17 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 2/9/2021 6:32 AM, Elijah Newren via GitGitGadget wrote:
> > From: Elijah Newren <newren@gmail.com>
> >
> > We want to make use of unique basenames to help inform rename detection,
> > so that more likely pairings can be checked first.  (src/moduleA/foo.txt
> > and source/module/A/foo.txt are likely related if there are no other
> > 'foo.txt' files among the deleted and added files.)  Add a new function,
> > not yet used, which creates a map of the unique basenames within
> > rename_src and another within rename_dst, together with the indices
> > within rename_src/rename_dst where those basenames show up.  Non-unique
> > basenames still show up in the map, but have an invalid index (-1).
> >
> > This function was inspired by the fact that in real world repositories,
> > most renames often do not involve a basename change.  Here are some
> > sample repositories and the percentage of their historical renames (as of
> > early 2020) that did not involve a basename change:
>
> I found this difficult to parse. Perhaps instead
>
>   "the percentage of their renames that preserved basenames".

Ooh, I like it; happy to make that change.

> We might also need something stronger, though: which percentage of renames
> preserved the basename but also had no other copy of that basename in the
> scope of all add/deletes?

I don't think it's useful to try to prove that this idea can save time
or how much time we can save before we try it; I think the only
purpose of these numbers should be to motivate the idea behind why it
was worth trying.  If we attempt to prove how much we'll save apriori,
then what you have is also too weak.  We would need "percentage of
total adds/deletes that are renames that preserved the basename but
also had no other copy of that basename in the scope of all
add/deletes".  But that is also wrong, actually; we need "for any
given two commits that we are likely to diff, what is the average
percentage of total adds/deletes between them that are renames that
preserved the basename but also had no other copy of that basename in
the scope of all add/deletes".  In particular, my script did not look
at the "any two given likely-to-be-diffed commits" viewpoint, I simply
added the number of renames within individual commits that preserved
renames, and divided by the total number of renames in individual
commits.  But even if we could calculate the "any two given
likely-to-be-diffed commits" viewpoint in some sane manner, it'd still
be misleading.  The next series is going to change the "no other copy
of that basename in the scope of all adds/deletes" caveat, by adding a
way to match up _some_ of those files (when it can find a way to
compare any given file to exactly one of the other files with the same
basename).  And even if you consider all the above and calculated it
in order to try to show how much could be saved, you might need to
start worrying about details like the fact that the first comparison
between files in diffcore-rename.c is _much_ more expensive than
subsequent comparisons (due to the fact that the spanhash is cached).

Trying to account for all these details and describe them fully is
completely beside the point, though; I didn't bother to check any of
this before implementing the algorithm -- I just looked up these very
rough numbers and felt they provided sufficient motivation that there
was an optimization worth trying.

> Is this reproducible from a shell command that could be documented here?

No, trying to parse log output with full handling of proper quoting in
the case of filenames with funny characters is too complex to attempt
in shell.  I was surprised by how long it turned out to be in python.
(And I dread attempting to calculate "something stronger" in any
accurate way given how involved just this rough calculation was.  That
idea seems harder to me than actually implementing this series.)

If you're curious, though, and don't care about
quickly-hacked-together-script-not-designed-for-reuse:
https://github.com/newren/git/blob/ort/rebase-testcase/count-renames.py

> > +MAYBE_UNUSED
> > +static int find_basename_matches(struct diff_options *options,
> > +                              int minimum_score,
> > +                              int num_src)
> > +{
> > +     int i;
> > +     struct strintmap sources;
> > +     struct strintmap dests;
> > +
> > +     /* Create maps of basename -> fullname(s) for sources and dests */
> > +     strintmap_init_with_options(&sources, -1, NULL, 0);
> > +     strintmap_init_with_options(&dests, -1, NULL, 0);
>
> Initially, I was wondering why we need the map for each side, but we will need
> to enforce uniqueness in each set, so OK.
>
>> > +     for (i = 0; i < num_src; ++i) {
> > +             char *filename = rename_src[i].p->one->path;
> > +             char *base;
> > +
> > +             /* exact renames removed in remove_unneeded_paths_from_src() */
> > +             assert(!rename_src[i].p->one->rename_used);
> > +
> > +             base = strrchr(filename, '/');
> > +             base = (base ? base+1 : filename);
>
> nit: "base + 1"

Will fix.

> Also, this is used here and below. Perhaps it's worth pulling out as a
> helper? I see similar code being duplicated in these existing spots:
>
> * diff-no-index.c:append_basename()
> * help.c:append_similar_ref()
> * packfile.c:pack_basename()
> * replace-object.c:register_replace_ref()
> * setup.c:read_gitfile_gently()
> * builtin/rebase.c:cmd_rebase()
> * builtin/stash.c:do_create_stash()
> * builtin/worktree.c:add_worktree()
> * contrib/credential/gnome-keyring/git-credential-gnome-keyring.c:usage()
> * contrib/credential/libsecret/git-credential-libsecret.c:usage()
> * trace2/tr2_dst.c:tr2_dst_try_auto_path()

Honestly asking: would anyone ever search for such a two-line helper
function?  I wouldn't have even thought to look, since it seems so
simple.

However, my real concern here is that this type of change would risk
introducing conflicts with unrelated series.  This series is the
second in what will be a 9-series deep dependency chain of
optimizations[1], and the later series are going to be longer than
these first two were (the latter ones are 6-11 patches each).  We've
already discussed previously whether we possibly want to hold the
first couple optimization series out of the upcoming git-2.31 release
in order to keep the optimizations all together, but that might
increase the risk of conflicts with unrelated patches if we try a
bigger tree refactor like this.  (Junio never commented on that,
though.)  It might be better to keep the series touching only
merge-ort.c & diffcore-rename.c, and then do cleanups like the one you
suggest here after the whole series.

That said, it's not a difficult initial change, so I'm mostly
expressing this concern out of making things harder for Junio.  It'd
be best to get his opinion -- Junio, your thoughts?

[1] https://github.com/gitgitgadget/git/pulls?q=is%3Apr+author%3Anewren+Optimization+batch

> There are other places that use strchr(_, '/') but they seem to be related
> to peeling basenames off of paths and using the leading portion of the path.
>
> > +             /* Record index within rename_src (i) if basename is unique */
> > +             if (strintmap_contains(&sources, base))
> > +                     strintmap_set(&sources, base, -1);
> > +             else
> > +                     strintmap_set(&sources, base, i);
> > +     }
> > +     for (i = 0; i < rename_dst_nr; ++i) {
> > +             char *filename = rename_dst[i].p->two->path;
> > +             char *base;
> > +
> > +             if (rename_dst[i].is_rename)
> > +                     continue; /* involved in exact match already. */
> > +
> > +             base = strrchr(filename, '/');
> > +             base = (base ? base+1 : filename);
> > +
> > +             /* Record index within rename_dst (i) if basename is unique */
> > +             if (strintmap_contains(&dests, base))
> > +                     strintmap_set(&dests, base, -1);
> > +             else
> > +                     strintmap_set(&dests, base, i);
> > +     }
> > +
> > +     /* TODO: Make use of basenames source and destination basenames */
> > +
> > +     strintmap_clear(&sources);
> > +     strintmap_clear(&dests);
> > +
> > +     return 0;
> > +}
>
> Thanks,
> -Stolee

Thanks for the review!

  reply	other threads:[~2021-02-09 16:57 UTC|newest]

Thread overview: 71+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-02-06 22:52 [PATCH 0/3] Optimization batch 7: use file basenames to guide rename detection Elijah Newren via GitGitGadget
2021-02-06 22:52 ` [PATCH 1/3] diffcore-rename: compute basenames of all source and dest candidates Elijah Newren via GitGitGadget
2021-02-06 22:52 ` [PATCH 2/3] diffcore-rename: complete find_basename_matches() Elijah Newren via GitGitGadget
2021-02-06 22:52 ` [PATCH 3/3] diffcore-rename: guide inexact rename detection based on basenames Elijah Newren via GitGitGadget
2021-02-07 14:38   ` Derrick Stolee
2021-02-07 19:51     ` Junio C Hamano
2021-02-08  8:38       ` Elijah Newren
2021-02-08 11:43         ` Derrick Stolee
2021-02-08 16:25           ` Elijah Newren
2021-02-08 17:37         ` Junio C Hamano
2021-02-08 22:00           ` Elijah Newren
2021-02-08 23:43             ` Junio C Hamano
2021-02-08 23:52               ` Elijah Newren
2021-02-08  8:27     ` Elijah Newren
2021-02-08 11:31       ` Derrick Stolee
2021-02-08 16:09         ` Elijah Newren
2021-02-07  5:19 ` [PATCH 0/3] Optimization batch 7: use file basenames to guide rename detection Junio C Hamano
2021-02-07  6:05   ` Elijah Newren
2021-02-09 11:32 ` [PATCH v2 0/4] " Elijah Newren via GitGitGadget
2021-02-09 11:32   ` [PATCH v2 1/4] diffcore-rename: compute basenames of all source and dest candidates Elijah Newren via GitGitGadget
2021-02-09 13:17     ` Derrick Stolee
2021-02-09 16:56       ` Elijah Newren [this message]
2021-02-09 17:02         ` Derrick Stolee
2021-02-09 17:42           ` Elijah Newren
2021-02-09 11:32   ` [PATCH v2 2/4] diffcore-rename: complete find_basename_matches() Elijah Newren via GitGitGadget
2021-02-09 13:25     ` Derrick Stolee
2021-02-09 17:17       ` Elijah Newren
2021-02-09 17:34         ` Derrick Stolee
2021-02-09 11:32   ` [PATCH v2 3/4] diffcore-rename: guide inexact rename detection based on basenames Elijah Newren via GitGitGadget
2021-02-09 13:33     ` Derrick Stolee
2021-02-09 17:41       ` Elijah Newren
2021-02-09 18:59         ` Junio C Hamano
2021-02-09 11:32   ` [PATCH v2 4/4] gitdiffcore doc: mention new preliminary step for rename detection Elijah Newren via GitGitGadget
2021-02-09 12:59     ` Derrick Stolee
2021-02-09 17:03       ` Junio C Hamano
2021-02-09 17:44         ` Elijah Newren
2021-02-10 15:15   ` [PATCH v3 0/5] Optimization batch 7: use file basenames to guide " Elijah Newren via GitGitGadget
2021-02-10 15:15     ` [PATCH v3 1/5] t4001: add a test comparing basename similarity and content similarity Elijah Newren via GitGitGadget
2021-02-13  1:15       ` Junio C Hamano
2021-02-13  4:50         ` Elijah Newren
2021-02-13 23:56           ` Junio C Hamano
2021-02-14  1:24             ` Elijah Newren
2021-02-14  1:32               ` Junio C Hamano
2021-02-14  3:14                 ` Elijah Newren
2021-02-10 15:15     ` [PATCH v3 2/5] diffcore-rename: compute basenames of all source and dest candidates Elijah Newren via GitGitGadget
2021-02-13  1:32       ` Junio C Hamano
2021-02-10 15:15     ` [PATCH v3 3/5] diffcore-rename: complete find_basename_matches() Elijah Newren via GitGitGadget
2021-02-13  1:48       ` Junio C Hamano
2021-02-13 18:34         ` Elijah Newren
2021-02-13 23:55           ` Junio C Hamano
2021-02-14  3:08             ` Elijah Newren
2021-02-10 15:15     ` [PATCH v3 4/5] diffcore-rename: guide inexact rename detection based on basenames Elijah Newren via GitGitGadget
2021-02-13  1:49       ` Junio C Hamano
2021-02-10 15:15     ` [PATCH v3 5/5] gitdiffcore doc: mention new preliminary step for rename detection Elijah Newren via GitGitGadget
2021-02-10 16:41       ` Junio C Hamano
2021-02-10 17:20         ` Elijah Newren
2021-02-11  8:15     ` [PATCH v4 0/6] Optimization batch 7: use file basenames to guide " Elijah Newren via GitGitGadget
2021-02-11  8:15       ` [PATCH v4 1/6] t4001: add a test comparing basename similarity and content similarity Elijah Newren via GitGitGadget
2021-02-11  8:15       ` [PATCH v4 2/6] diffcore-rename: compute basenames of all source and dest candidates Elijah Newren via GitGitGadget
2021-02-11  8:15       ` [PATCH v4 3/6] diffcore-rename: complete find_basename_matches() Elijah Newren via GitGitGadget
2021-02-11  8:15       ` [PATCH v4 4/6] diffcore-rename: guide inexact rename detection based on basenames Elijah Newren via GitGitGadget
2021-02-11  8:15       ` [PATCH v4 5/6] gitdiffcore doc: mention new preliminary step for rename detection Elijah Newren via GitGitGadget
2021-02-11  8:15       ` [PATCH v4 6/6] merge-ort: call diffcore_rename() directly Elijah Newren via GitGitGadget
2021-02-13  1:53       ` [PATCH v4 0/6] Optimization batch 7: use file basenames to guide rename detection Junio C Hamano
2021-02-14  7:51       ` [PATCH v5 " Elijah Newren via GitGitGadget
2021-02-14  7:51         ` [PATCH v5 1/6] t4001: add a test comparing basename similarity and content similarity Elijah Newren via GitGitGadget
2021-02-14  7:51         ` [PATCH v5 2/6] diffcore-rename: compute basenames of source and dest candidates Elijah Newren via GitGitGadget
2021-02-14  7:51         ` [PATCH v5 3/6] diffcore-rename: complete find_basename_matches() Elijah Newren via GitGitGadget
2021-02-14  7:51         ` [PATCH v5 4/6] diffcore-rename: guide inexact rename detection based on basenames Elijah Newren via GitGitGadget
2021-02-14  7:51         ` [PATCH v5 5/6] gitdiffcore doc: mention new preliminary step for rename detection Elijah Newren via GitGitGadget
2021-02-14  7:51         ` [PATCH v5 6/6] merge-ort: call diffcore_rename() directly Elijah Newren via GitGitGadget

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CABPp-BEsuOiUyvbkwPC384eho8pgSWuRdcvw9t5gkXhf+_j-3A@mail.gmail.com \
    --to=newren@gmail.com \
    --cc=dstolee@microsoft.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=gitster@pobox.com \
    --cc=jonathantanmy@google.com \
    --cc=me@ttaylorr.com \
    --cc=peff@peff.net \
    --cc=stolee@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).