git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com>
To: git@vger.kernel.org
Cc: Derrick Stolee <dstolee@microsoft.com>,
	Jonathan Tan <jonathantanmy@google.com>,
	Taylor Blau <me@ttaylorr.com>, Junio C Hamano <gitster@pobox.com>,
	Jeff King <peff@peff.net>, Elijah Newren <newren@gmail.com>,
	Derrick Stolee <stolee@gmail.com>,
	Elijah Newren <newren@gmail.com>
Subject: [PATCH v2 0/4] Optimization batch 7: use file basenames to guide rename detection
Date: Tue, 09 Feb 2021 11:32:02 +0000	[thread overview]
Message-ID: <pull.843.v2.git.1612870326.gitgitgadget@gmail.com> (raw)
In-Reply-To: <pull.843.git.1612651937.gitgitgadget@gmail.com>

This series depends on ort-perf-batch-6[1].

This series uses file basenames in a basic fashion to guide rename
detection.

Changes since v1:

 * Significant edits to the commit messages to make them explain the basic
   idea better and (hopefully) prevent misunderstandings.
 * Add a commit at the end that updates the documentation and includes a new
   testcase
 * Modify the code to make it clearer that it can already handle using a
   different score for basename comparison similarity, even if we don't use
   that ability yet (see below)

Changes not yet included -- I need input about what is wanted:

 * Stolee suggested not creating a separate score for basename comparisons,
   at least not yet. Junio suggested it may be prudent to use a higher score
   for that than whatever -M option the user provided for normal
   comparisons...but didn't suggest whether it should be a separate
   user-specified option, or some kind of weighted average of the -M option
   and MAX_SCORE (e.g. use 60% if -M is 50%, or use 80% if -M is 75%). I
   tweaked the code to make it clearer that it already is able to handle
   such a score difference, but I'm not sure what whether we want an
   automatically computed higher value or a user-controlled possibly higher
   value.

[1] https://lore.kernel.org/git/xmqqlfc4byt6.fsf@gitster.c.googlers.com/ [2]
https://github.com/newren/presentations/blob/pdfs/merge-performance/merge-performance-slides.pdf

Elijah Newren (4):
  diffcore-rename: compute basenames of all source and dest candidates
  diffcore-rename: complete find_basename_matches()
  diffcore-rename: guide inexact rename detection based on basenames
  gitdiffcore doc: mention new preliminary step for rename detection

 Documentation/gitdiffcore.txt |  15 +++
 diffcore-rename.c             | 185 +++++++++++++++++++++++++++++++++-
 t/t4001-diff-rename.sh        |  24 +++++
 3 files changed, 220 insertions(+), 4 deletions(-)


base-commit: 7ae9460d3dba84122c2674b46e4339b9d42bdedd
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-843%2Fnewren%2Fort-perf-batch-7-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-843/newren/ort-perf-batch-7-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/843

Range-diff vs v1:

 1:  377a4a39fa86 ! 1:  381a45d239bb diffcore-rename: compute basenames of all source and dest candidates
     @@ Commit message
          diffcore-rename: compute basenames of all source and dest candidates
      
          We want to make use of unique basenames to help inform rename detection,
     -    so that more likely pairings can be checked first.  Add a new function,
     +    so that more likely pairings can be checked first.  (src/moduleA/foo.txt
     +    and source/module/A/foo.txt are likely related if there are no other
     +    'foo.txt' files among the deleted and added files.)  Add a new function,
          not yet used, which creates a map of the unique basenames within
          rename_src and another within rename_dst, together with the indices
          within rename_src/rename_dst where those basenames show up.  Non-unique
          basenames still show up in the map, but have an invalid index (-1).
      
     +    This function was inspired by the fact that in real world repositories,
     +    most renames often do not involve a basename change.  Here are some
     +    sample repositories and the percentage of their historical renames (as of
     +    early 2020) that did not involve a basename change:
     +      * linux: 76%
     +      * gcc: 64%
     +      * gecko: 79%
     +      * webkit: 89%
     +
          Signed-off-by: Elijah Newren <newren@gmail.com>
      
       ## diffcore-rename.c ##
 2:  5fb4493247ff ! 2:  dcd0175229aa diffcore-rename: complete find_basename_matches()
     @@ Commit message
          sufficiently similar, we record the rename; if not, we include those
          files in the more exhaustive matrix comparison.
      
     +    This means we are adding a set of preliminary additional comparisons,
     +    but for each file we only compare it with at most one other file.  For
     +    example, if there was a include/media/device.h that was deleted and a
     +    src/module/media/device.h that was added, and there were no other
     +    device.h files added or deleted between the commits being compared,
     +    then these two files would be compared in the preliminary step.
     +
     +    This commit does not yet actually employ this new optimization, it
     +    merely adds a function which can be used for this purpose.  The next
     +    commit will do the necessary plumbing to make use of it.
     +
          Note that this optimization might give us different results than without
          the optimization, because it's possible that despite files with the same
          basename being sufficiently similar to be considered a rename, there's
          an even better match between files without the same basename.  I think
     -    that is okay for four reasons: (1) That seems somewhat unlikely in
     -    practice, (2) it's easy to explain to the users what happened if it does
     -    ever occur (or even for them to intuitively figure out), and (3) as the
     -    next patch will show it provides such a large performance boost that
     -    it's worth the tradeoff.  Reason (4) takes a full paragraph to
     +    that is okay for four reasons: (1) it's easy to explain to the users
     +    what happened if it does ever occur (or even for them to intuitively
     +    figure out), (2) as the next patch will show it provides such a large
     +    performance boost that it's worth the tradeoff, and (3) it's somewhat
     +    unlikely that despite having unique matching basenames that other files
     +    serve as better matches.  Reason (4) takes a full paragraph to
          explain...
      
          If the previous three reasons aren't enough, consider what rename
     @@ Commit message
          basename and are sufficiently similar to be considered a rename, mark
          them as such without comparing the two to all other rename candidates.
      
     -    We do not use this heuristic together with either break or copy
     -    detection.  The point of break detection is to say that filename
     -    similarity does not imply file content similarity, and we only want to
     -    know about file content similarity.  The point of copy detection is to
     -    use more resources to check for additional similarities, while this is
     -    an optimization that uses far less resources but which might also result
     -    in finding slightly fewer similarities.  So the idea behind this
     -    optimization goes against both of those features, and will be turned off
     -    for both.
     -
     -    Note that this optimization is not yet effective in any situation, as
     -    the function is still unused.  The next commit will hook it into the
     -    code so that it is used when rename detection is wanted, but neither
     -    copy nor break detection are.
     -
          Signed-off-by: Elijah Newren <newren@gmail.com>
      
       ## diffcore-rename.c ##
 3:  1d941c35076e ! 3:  ce2173aa1fb7 diffcore-rename: guide inexact rename detection based on basenames
     @@ Commit message
      
          Make use of the new find_basename_matches() function added in the last
          two patches, to find renames more rapidly in cases where we can match up
     -    files based on basenames.
     +    files based on basenames.  As a quick reminder (see the last two commit
     +    messages for more details), this means for example that
     +    docs/extensions.txt and docs/config/extensions.txt are considered likely
     +    renames if there are no 'extensions.txt' files elsewhere among the added
     +    and deleted files, and if a similarity check confirms they are similar,
     +    then they are marked as a rename without looking for a better similarity
     +    match among other files.  This is a behavioral change, as covered in
     +    more detail in the previous commit message.
     +
     +    We do not use this heuristic together with either break or copy
     +    detection.  The point of break detection is to say that filename
     +    similarity does not imply file content similarity, and we only want to
     +    know about file content similarity.  The point of copy detection is to
     +    use more resources to check for additional similarities, while this is
     +    an optimization that uses far less resources but which might also result
     +    in finding slightly fewer similarities.  So the idea behind this
     +    optimization goes against both of those features, and will be turned off
     +    for both.
      
          For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
          performance work; instrument with trace2_region_* calls", 2020-10-28),
     @@ diffcore-rename.c: void diffcore_rename(struct diff_options *options)
      +		remove_unneeded_paths_from_src(want_copies);
      +		trace2_region_leave("diff", "cull after exact", options->repo);
      +	} else {
     ++		/* Determine minimum score to match basenames */
     ++		int min_basename_score = (int)(5*minimum_score + 0*MAX_SCORE)/5;
     ++
      +		/*
      +		 * Cull sources:
      +		 *   - remove ones involved in renames (found via exact match)
     @@ diffcore-rename.c: void diffcore_rename(struct diff_options *options)
      +
      +		/* Utilize file basenames to quickly find renames. */
      +		trace2_region_enter("diff", "basename matches", options->repo);
     -+		rename_count += find_basename_matches(options, minimum_score,
     ++		rename_count += find_basename_matches(options,
     ++						      min_basename_score,
      +						      rename_src_nr);
      +		trace2_region_leave("diff", "basename matches", options->repo);
      +
 -:  ------------ > 4:  a0e75d8cd6bd gitdiffcore doc: mention new preliminary step for rename detection

-- 
gitgitgadget

  parent reply	other threads:[~2021-02-09 11:46 UTC|newest]

Thread overview: 71+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-02-06 22:52 [PATCH 0/3] Optimization batch 7: use file basenames to guide rename detection Elijah Newren via GitGitGadget
2021-02-06 22:52 ` [PATCH 1/3] diffcore-rename: compute basenames of all source and dest candidates Elijah Newren via GitGitGadget
2021-02-06 22:52 ` [PATCH 2/3] diffcore-rename: complete find_basename_matches() Elijah Newren via GitGitGadget
2021-02-06 22:52 ` [PATCH 3/3] diffcore-rename: guide inexact rename detection based on basenames Elijah Newren via GitGitGadget
2021-02-07 14:38   ` Derrick Stolee
2021-02-07 19:51     ` Junio C Hamano
2021-02-08  8:38       ` Elijah Newren
2021-02-08 11:43         ` Derrick Stolee
2021-02-08 16:25           ` Elijah Newren
2021-02-08 17:37         ` Junio C Hamano
2021-02-08 22:00           ` Elijah Newren
2021-02-08 23:43             ` Junio C Hamano
2021-02-08 23:52               ` Elijah Newren
2021-02-08  8:27     ` Elijah Newren
2021-02-08 11:31       ` Derrick Stolee
2021-02-08 16:09         ` Elijah Newren
2021-02-07  5:19 ` [PATCH 0/3] Optimization batch 7: use file basenames to guide rename detection Junio C Hamano
2021-02-07  6:05   ` Elijah Newren
2021-02-09 11:32 ` Elijah Newren via GitGitGadget [this message]
2021-02-09 11:32   ` [PATCH v2 1/4] diffcore-rename: compute basenames of all source and dest candidates Elijah Newren via GitGitGadget
2021-02-09 13:17     ` Derrick Stolee
2021-02-09 16:56       ` Elijah Newren
2021-02-09 17:02         ` Derrick Stolee
2021-02-09 17:42           ` Elijah Newren
2021-02-09 11:32   ` [PATCH v2 2/4] diffcore-rename: complete find_basename_matches() Elijah Newren via GitGitGadget
2021-02-09 13:25     ` Derrick Stolee
2021-02-09 17:17       ` Elijah Newren
2021-02-09 17:34         ` Derrick Stolee
2021-02-09 11:32   ` [PATCH v2 3/4] diffcore-rename: guide inexact rename detection based on basenames Elijah Newren via GitGitGadget
2021-02-09 13:33     ` Derrick Stolee
2021-02-09 17:41       ` Elijah Newren
2021-02-09 18:59         ` Junio C Hamano
2021-02-09 11:32   ` [PATCH v2 4/4] gitdiffcore doc: mention new preliminary step for rename detection Elijah Newren via GitGitGadget
2021-02-09 12:59     ` Derrick Stolee
2021-02-09 17:03       ` Junio C Hamano
2021-02-09 17:44         ` Elijah Newren
2021-02-10 15:15   ` [PATCH v3 0/5] Optimization batch 7: use file basenames to guide " Elijah Newren via GitGitGadget
2021-02-10 15:15     ` [PATCH v3 1/5] t4001: add a test comparing basename similarity and content similarity Elijah Newren via GitGitGadget
2021-02-13  1:15       ` Junio C Hamano
2021-02-13  4:50         ` Elijah Newren
2021-02-13 23:56           ` Junio C Hamano
2021-02-14  1:24             ` Elijah Newren
2021-02-14  1:32               ` Junio C Hamano
2021-02-14  3:14                 ` Elijah Newren
2021-02-10 15:15     ` [PATCH v3 2/5] diffcore-rename: compute basenames of all source and dest candidates Elijah Newren via GitGitGadget
2021-02-13  1:32       ` Junio C Hamano
2021-02-10 15:15     ` [PATCH v3 3/5] diffcore-rename: complete find_basename_matches() Elijah Newren via GitGitGadget
2021-02-13  1:48       ` Junio C Hamano
2021-02-13 18:34         ` Elijah Newren
2021-02-13 23:55           ` Junio C Hamano
2021-02-14  3:08             ` Elijah Newren
2021-02-10 15:15     ` [PATCH v3 4/5] diffcore-rename: guide inexact rename detection based on basenames Elijah Newren via GitGitGadget
2021-02-13  1:49       ` Junio C Hamano
2021-02-10 15:15     ` [PATCH v3 5/5] gitdiffcore doc: mention new preliminary step for rename detection Elijah Newren via GitGitGadget
2021-02-10 16:41       ` Junio C Hamano
2021-02-10 17:20         ` Elijah Newren
2021-02-11  8:15     ` [PATCH v4 0/6] Optimization batch 7: use file basenames to guide " Elijah Newren via GitGitGadget
2021-02-11  8:15       ` [PATCH v4 1/6] t4001: add a test comparing basename similarity and content similarity Elijah Newren via GitGitGadget
2021-02-11  8:15       ` [PATCH v4 2/6] diffcore-rename: compute basenames of all source and dest candidates Elijah Newren via GitGitGadget
2021-02-11  8:15       ` [PATCH v4 3/6] diffcore-rename: complete find_basename_matches() Elijah Newren via GitGitGadget
2021-02-11  8:15       ` [PATCH v4 4/6] diffcore-rename: guide inexact rename detection based on basenames Elijah Newren via GitGitGadget
2021-02-11  8:15       ` [PATCH v4 5/6] gitdiffcore doc: mention new preliminary step for rename detection Elijah Newren via GitGitGadget
2021-02-11  8:15       ` [PATCH v4 6/6] merge-ort: call diffcore_rename() directly Elijah Newren via GitGitGadget
2021-02-13  1:53       ` [PATCH v4 0/6] Optimization batch 7: use file basenames to guide rename detection Junio C Hamano
2021-02-14  7:51       ` [PATCH v5 " Elijah Newren via GitGitGadget
2021-02-14  7:51         ` [PATCH v5 1/6] t4001: add a test comparing basename similarity and content similarity Elijah Newren via GitGitGadget
2021-02-14  7:51         ` [PATCH v5 2/6] diffcore-rename: compute basenames of source and dest candidates Elijah Newren via GitGitGadget
2021-02-14  7:51         ` [PATCH v5 3/6] diffcore-rename: complete find_basename_matches() Elijah Newren via GitGitGadget
2021-02-14  7:51         ` [PATCH v5 4/6] diffcore-rename: guide inexact rename detection based on basenames Elijah Newren via GitGitGadget
2021-02-14  7:51         ` [PATCH v5 5/6] gitdiffcore doc: mention new preliminary step for rename detection Elijah Newren via GitGitGadget
2021-02-14  7:51         ` [PATCH v5 6/6] merge-ort: call diffcore_rename() directly Elijah Newren via GitGitGadget

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=pull.843.v2.git.1612870326.gitgitgadget@gmail.com \
    --to=gitgitgadget@gmail.com \
    --cc=dstolee@microsoft.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=jonathantanmy@google.com \
    --cc=me@ttaylorr.com \
    --cc=newren@gmail.com \
    --cc=peff@peff.net \
    --cc=stolee@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).