From: "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com>
To: git@vger.kernel.org
Cc: Derrick Stolee <dstolee@microsoft.com>,
Jonathan Tan <jonathantanmy@google.com>,
Taylor Blau <me@ttaylorr.com>, Junio C Hamano <gitster@pobox.com>,
Jeff King <peff@peff.net>, Elijah Newren <newren@gmail.com>,
Derrick Stolee <stolee@gmail.com>,
Elijah Newren <newren@gmail.com>
Subject: [PATCH v2 0/4] Optimization batch 7: use file basenames to guide rename detection
Date: Tue, 09 Feb 2021 11:32:02 +0000 [thread overview]
Message-ID: <pull.843.v2.git.1612870326.gitgitgadget@gmail.com> (raw)
In-Reply-To: <pull.843.git.1612651937.gitgitgadget@gmail.com>
This series depends on ort-perf-batch-6[1].
This series uses file basenames in a basic fashion to guide rename
detection.
Changes since v1:
* Significant edits to the commit messages to make them explain the basic
idea better and (hopefully) prevent misunderstandings.
* Add a commit at the end that updates the documentation and includes a new
testcase
* Modify the code to make it clearer that it can already handle using a
different score for basename comparison similarity, even if we don't use
that ability yet (see below)
Changes not yet included -- I need input about what is wanted:
* Stolee suggested not creating a separate score for basename comparisons,
at least not yet. Junio suggested it may be prudent to use a higher score
for that than whatever -M option the user provided for normal
comparisons...but didn't suggest whether it should be a separate
user-specified option, or some kind of weighted average of the -M option
and MAX_SCORE (e.g. use 60% if -M is 50%, or use 80% if -M is 75%). I
tweaked the code to make it clearer that it already is able to handle
such a score difference, but I'm not sure what whether we want an
automatically computed higher value or a user-controlled possibly higher
value.
[1] https://lore.kernel.org/git/xmqqlfc4byt6.fsf@gitster.c.googlers.com/ [2]
https://github.com/newren/presentations/blob/pdfs/merge-performance/merge-performance-slides.pdf
Elijah Newren (4):
diffcore-rename: compute basenames of all source and dest candidates
diffcore-rename: complete find_basename_matches()
diffcore-rename: guide inexact rename detection based on basenames
gitdiffcore doc: mention new preliminary step for rename detection
Documentation/gitdiffcore.txt | 15 +++
diffcore-rename.c | 185 +++++++++++++++++++++++++++++++++-
t/t4001-diff-rename.sh | 24 +++++
3 files changed, 220 insertions(+), 4 deletions(-)
base-commit: 7ae9460d3dba84122c2674b46e4339b9d42bdedd
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-843%2Fnewren%2Fort-perf-batch-7-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-843/newren/ort-perf-batch-7-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/843
Range-diff vs v1:
1: 377a4a39fa86 ! 1: 381a45d239bb diffcore-rename: compute basenames of all source and dest candidates
@@ Commit message
diffcore-rename: compute basenames of all source and dest candidates
We want to make use of unique basenames to help inform rename detection,
- so that more likely pairings can be checked first. Add a new function,
+ so that more likely pairings can be checked first. (src/moduleA/foo.txt
+ and source/module/A/foo.txt are likely related if there are no other
+ 'foo.txt' files among the deleted and added files.) Add a new function,
not yet used, which creates a map of the unique basenames within
rename_src and another within rename_dst, together with the indices
within rename_src/rename_dst where those basenames show up. Non-unique
basenames still show up in the map, but have an invalid index (-1).
+ This function was inspired by the fact that in real world repositories,
+ most renames often do not involve a basename change. Here are some
+ sample repositories and the percentage of their historical renames (as of
+ early 2020) that did not involve a basename change:
+ * linux: 76%
+ * gcc: 64%
+ * gecko: 79%
+ * webkit: 89%
+
Signed-off-by: Elijah Newren <newren@gmail.com>
## diffcore-rename.c ##
2: 5fb4493247ff ! 2: dcd0175229aa diffcore-rename: complete find_basename_matches()
@@ Commit message
sufficiently similar, we record the rename; if not, we include those
files in the more exhaustive matrix comparison.
+ This means we are adding a set of preliminary additional comparisons,
+ but for each file we only compare it with at most one other file. For
+ example, if there was a include/media/device.h that was deleted and a
+ src/module/media/device.h that was added, and there were no other
+ device.h files added or deleted between the commits being compared,
+ then these two files would be compared in the preliminary step.
+
+ This commit does not yet actually employ this new optimization, it
+ merely adds a function which can be used for this purpose. The next
+ commit will do the necessary plumbing to make use of it.
+
Note that this optimization might give us different results than without
the optimization, because it's possible that despite files with the same
basename being sufficiently similar to be considered a rename, there's
an even better match between files without the same basename. I think
- that is okay for four reasons: (1) That seems somewhat unlikely in
- practice, (2) it's easy to explain to the users what happened if it does
- ever occur (or even for them to intuitively figure out), and (3) as the
- next patch will show it provides such a large performance boost that
- it's worth the tradeoff. Reason (4) takes a full paragraph to
+ that is okay for four reasons: (1) it's easy to explain to the users
+ what happened if it does ever occur (or even for them to intuitively
+ figure out), (2) as the next patch will show it provides such a large
+ performance boost that it's worth the tradeoff, and (3) it's somewhat
+ unlikely that despite having unique matching basenames that other files
+ serve as better matches. Reason (4) takes a full paragraph to
explain...
If the previous three reasons aren't enough, consider what rename
@@ Commit message
basename and are sufficiently similar to be considered a rename, mark
them as such without comparing the two to all other rename candidates.
- We do not use this heuristic together with either break or copy
- detection. The point of break detection is to say that filename
- similarity does not imply file content similarity, and we only want to
- know about file content similarity. The point of copy detection is to
- use more resources to check for additional similarities, while this is
- an optimization that uses far less resources but which might also result
- in finding slightly fewer similarities. So the idea behind this
- optimization goes against both of those features, and will be turned off
- for both.
-
- Note that this optimization is not yet effective in any situation, as
- the function is still unused. The next commit will hook it into the
- code so that it is used when rename detection is wanted, but neither
- copy nor break detection are.
-
Signed-off-by: Elijah Newren <newren@gmail.com>
## diffcore-rename.c ##
3: 1d941c35076e ! 3: ce2173aa1fb7 diffcore-rename: guide inexact rename detection based on basenames
@@ Commit message
Make use of the new find_basename_matches() function added in the last
two patches, to find renames more rapidly in cases where we can match up
- files based on basenames.
+ files based on basenames. As a quick reminder (see the last two commit
+ messages for more details), this means for example that
+ docs/extensions.txt and docs/config/extensions.txt are considered likely
+ renames if there are no 'extensions.txt' files elsewhere among the added
+ and deleted files, and if a similarity check confirms they are similar,
+ then they are marked as a rename without looking for a better similarity
+ match among other files. This is a behavioral change, as covered in
+ more detail in the previous commit message.
+
+ We do not use this heuristic together with either break or copy
+ detection. The point of break detection is to say that filename
+ similarity does not imply file content similarity, and we only want to
+ know about file content similarity. The point of copy detection is to
+ use more resources to check for additional similarities, while this is
+ an optimization that uses far less resources but which might also result
+ in finding slightly fewer similarities. So the idea behind this
+ optimization goes against both of those features, and will be turned off
+ for both.
For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28),
@@ diffcore-rename.c: void diffcore_rename(struct diff_options *options)
+ remove_unneeded_paths_from_src(want_copies);
+ trace2_region_leave("diff", "cull after exact", options->repo);
+ } else {
++ /* Determine minimum score to match basenames */
++ int min_basename_score = (int)(5*minimum_score + 0*MAX_SCORE)/5;
++
+ /*
+ * Cull sources:
+ * - remove ones involved in renames (found via exact match)
@@ diffcore-rename.c: void diffcore_rename(struct diff_options *options)
+
+ /* Utilize file basenames to quickly find renames. */
+ trace2_region_enter("diff", "basename matches", options->repo);
-+ rename_count += find_basename_matches(options, minimum_score,
++ rename_count += find_basename_matches(options,
++ min_basename_score,
+ rename_src_nr);
+ trace2_region_leave("diff", "basename matches", options->repo);
+
-: ------------ > 4: a0e75d8cd6bd gitdiffcore doc: mention new preliminary step for rename detection
--
gitgitgadget
next prev parent reply other threads:[~2021-02-09 11:46 UTC|newest]
Thread overview: 71+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-02-06 22:52 [PATCH 0/3] Optimization batch 7: use file basenames to guide rename detection Elijah Newren via GitGitGadget
2021-02-06 22:52 ` [PATCH 1/3] diffcore-rename: compute basenames of all source and dest candidates Elijah Newren via GitGitGadget
2021-02-06 22:52 ` [PATCH 2/3] diffcore-rename: complete find_basename_matches() Elijah Newren via GitGitGadget
2021-02-06 22:52 ` [PATCH 3/3] diffcore-rename: guide inexact rename detection based on basenames Elijah Newren via GitGitGadget
2021-02-07 14:38 ` Derrick Stolee
2021-02-07 19:51 ` Junio C Hamano
2021-02-08 8:38 ` Elijah Newren
2021-02-08 11:43 ` Derrick Stolee
2021-02-08 16:25 ` Elijah Newren
2021-02-08 17:37 ` Junio C Hamano
2021-02-08 22:00 ` Elijah Newren
2021-02-08 23:43 ` Junio C Hamano
2021-02-08 23:52 ` Elijah Newren
2021-02-08 8:27 ` Elijah Newren
2021-02-08 11:31 ` Derrick Stolee
2021-02-08 16:09 ` Elijah Newren
2021-02-07 5:19 ` [PATCH 0/3] Optimization batch 7: use file basenames to guide rename detection Junio C Hamano
2021-02-07 6:05 ` Elijah Newren
2021-02-09 11:32 ` Elijah Newren via GitGitGadget [this message]
2021-02-09 11:32 ` [PATCH v2 1/4] diffcore-rename: compute basenames of all source and dest candidates Elijah Newren via GitGitGadget
2021-02-09 13:17 ` Derrick Stolee
2021-02-09 16:56 ` Elijah Newren
2021-02-09 17:02 ` Derrick Stolee
2021-02-09 17:42 ` Elijah Newren
2021-02-09 11:32 ` [PATCH v2 2/4] diffcore-rename: complete find_basename_matches() Elijah Newren via GitGitGadget
2021-02-09 13:25 ` Derrick Stolee
2021-02-09 17:17 ` Elijah Newren
2021-02-09 17:34 ` Derrick Stolee
2021-02-09 11:32 ` [PATCH v2 3/4] diffcore-rename: guide inexact rename detection based on basenames Elijah Newren via GitGitGadget
2021-02-09 13:33 ` Derrick Stolee
2021-02-09 17:41 ` Elijah Newren
2021-02-09 18:59 ` Junio C Hamano
2021-02-09 11:32 ` [PATCH v2 4/4] gitdiffcore doc: mention new preliminary step for rename detection Elijah Newren via GitGitGadget
2021-02-09 12:59 ` Derrick Stolee
2021-02-09 17:03 ` Junio C Hamano
2021-02-09 17:44 ` Elijah Newren
2021-02-10 15:15 ` [PATCH v3 0/5] Optimization batch 7: use file basenames to guide " Elijah Newren via GitGitGadget
2021-02-10 15:15 ` [PATCH v3 1/5] t4001: add a test comparing basename similarity and content similarity Elijah Newren via GitGitGadget
2021-02-13 1:15 ` Junio C Hamano
2021-02-13 4:50 ` Elijah Newren
2021-02-13 23:56 ` Junio C Hamano
2021-02-14 1:24 ` Elijah Newren
2021-02-14 1:32 ` Junio C Hamano
2021-02-14 3:14 ` Elijah Newren
2021-02-10 15:15 ` [PATCH v3 2/5] diffcore-rename: compute basenames of all source and dest candidates Elijah Newren via GitGitGadget
2021-02-13 1:32 ` Junio C Hamano
2021-02-10 15:15 ` [PATCH v3 3/5] diffcore-rename: complete find_basename_matches() Elijah Newren via GitGitGadget
2021-02-13 1:48 ` Junio C Hamano
2021-02-13 18:34 ` Elijah Newren
2021-02-13 23:55 ` Junio C Hamano
2021-02-14 3:08 ` Elijah Newren
2021-02-10 15:15 ` [PATCH v3 4/5] diffcore-rename: guide inexact rename detection based on basenames Elijah Newren via GitGitGadget
2021-02-13 1:49 ` Junio C Hamano
2021-02-10 15:15 ` [PATCH v3 5/5] gitdiffcore doc: mention new preliminary step for rename detection Elijah Newren via GitGitGadget
2021-02-10 16:41 ` Junio C Hamano
2021-02-10 17:20 ` Elijah Newren
2021-02-11 8:15 ` [PATCH v4 0/6] Optimization batch 7: use file basenames to guide " Elijah Newren via GitGitGadget
2021-02-11 8:15 ` [PATCH v4 1/6] t4001: add a test comparing basename similarity and content similarity Elijah Newren via GitGitGadget
2021-02-11 8:15 ` [PATCH v4 2/6] diffcore-rename: compute basenames of all source and dest candidates Elijah Newren via GitGitGadget
2021-02-11 8:15 ` [PATCH v4 3/6] diffcore-rename: complete find_basename_matches() Elijah Newren via GitGitGadget
2021-02-11 8:15 ` [PATCH v4 4/6] diffcore-rename: guide inexact rename detection based on basenames Elijah Newren via GitGitGadget
2021-02-11 8:15 ` [PATCH v4 5/6] gitdiffcore doc: mention new preliminary step for rename detection Elijah Newren via GitGitGadget
2021-02-11 8:15 ` [PATCH v4 6/6] merge-ort: call diffcore_rename() directly Elijah Newren via GitGitGadget
2021-02-13 1:53 ` [PATCH v4 0/6] Optimization batch 7: use file basenames to guide rename detection Junio C Hamano
2021-02-14 7:51 ` [PATCH v5 " Elijah Newren via GitGitGadget
2021-02-14 7:51 ` [PATCH v5 1/6] t4001: add a test comparing basename similarity and content similarity Elijah Newren via GitGitGadget
2021-02-14 7:51 ` [PATCH v5 2/6] diffcore-rename: compute basenames of source and dest candidates Elijah Newren via GitGitGadget
2021-02-14 7:51 ` [PATCH v5 3/6] diffcore-rename: complete find_basename_matches() Elijah Newren via GitGitGadget
2021-02-14 7:51 ` [PATCH v5 4/6] diffcore-rename: guide inexact rename detection based on basenames Elijah Newren via GitGitGadget
2021-02-14 7:51 ` [PATCH v5 5/6] gitdiffcore doc: mention new preliminary step for rename detection Elijah Newren via GitGitGadget
2021-02-14 7:51 ` [PATCH v5 6/6] merge-ort: call diffcore_rename() directly Elijah Newren via GitGitGadget
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=pull.843.v2.git.1612870326.gitgitgadget@gmail.com \
--to=gitgitgadget@gmail.com \
--cc=dstolee@microsoft.com \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
--cc=jonathantanmy@google.com \
--cc=me@ttaylorr.com \
--cc=newren@gmail.com \
--cc=peff@peff.net \
--cc=stolee@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).