git.vger.kernel.org archive mirror
From: "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com>
To: git@vger.kernel.org
Cc: Elijah Newren <newren@gmail.com>
Subject: [PATCH v2 00/10] Optimization batch 8: use file basenames even more
Date: Tue, 23 Feb 2021 23:43:57 +0000
Message-ID: <pull.844.v2.git.1614123848.gitgitgadget@gmail.com>
In-Reply-To: <pull.844.git.1613289544.gitgitgadget@gmail.com>

This series depends on en/diffcore-rename (a concatenation of what I was
calling ort-perf-batch-6 and ort-perf-batch-7).

There are no changes since v1; it's just a resend a week and a half later to
bump it so it isn't lost.

=== Optimization idea ===

This series uses file basenames (portions of the path after the last '/',
including file extension) in a more involved fashion to guide rename
detection. It's a follow-on improvement to "Optimization #3" from my Git
Merge 2020 talk[1]. The basic idea behind this series is the same as the
last series: people frequently move files across directories while keeping
the filenames the same, thus files with the same basename are likely rename
candidates. However, the previous optimization only applies when basenames
are unique among remaining adds and deletes after exact rename detection, so
we need to do something else to match up the remaining basenames. When there
are many files with the same basename (e.g. .gitignore, Makefile,
build.gradle, or maybe even setup.c, AbstractFactory.java, etc.), being able
to "guess" which directory a given file likely would have moved to can
provide us with a likely rename candidate if there is a file with the same
basename in that directory. Since exact rename detection is done first, we
can use nearby exact renames to help us guess where any given non-unique
basename file may have moved; it just means doing "directory rename
detection" limited to exact renames.
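
The directory guess described above can be sketched roughly as follows. This
is a Python illustration only, not the C code in diffcore-rename.c: the
function name guess_directory_renames and the input format (a list of
(old_path, new_path) pairs for exact renames) are assumptions made for the
example, and it tallies only the immediate parent directory of each file.

```python
from collections import Counter, defaultdict
import posixpath


def guess_directory_renames(exact_renames):
    """Map each old directory to the new directory that most of its
    exactly renamed files moved to (hypothetical helper)."""
    counts = defaultdict(Counter)
    for old_path, new_path in exact_renames:
        old_dir = posixpath.dirname(old_path)
        new_dir = posixpath.dirname(new_path)
        if old_dir != new_dir:
            # One vote for "old_dir was renamed to new_dir".
            counts[old_dir][new_dir] += 1
    # Keep only the single most popular destination per old directory.
    return {old: votes.most_common(1)[0][0] for old, votes in counts.items()}


exact = [
    ("src/lib/a.c", "lib/core/a.c"),
    ("src/lib/b.c", "lib/core/b.c"),
    ("src/lib/c.h", "include/c.h"),
]
print(guess_directory_renames(exact))  # {'src/lib': 'lib/core'}
```

With such a guess in hand, a non-unique basename such as src/lib/Makefile
would be compared only against lib/core/Makefile, if that file exists among
the unmatched adds.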

There are definitely cases when this strategy still won't help us: (1) We
only use this strategy when the directory in which the original file was
found has also been removed, (2) a lack of exact renames from the given
directory will prevent us from making a new directory prediction, (3) even
if we predict a new directory there may be no file with the given basename
in it, and (4) even if there is an unmatched add with the appropriate
basename in the predicted directory, it may not meet the higher
min_basename_score similarity threshold.
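
The four cases above can be read as successive bail-out checks. The Python
sketch below is an illustration under stated assumptions, not the code from
this series: possible_rename, removed_dirs, dir_guess (a mapping from an old
directory to its one predicted new directory), unmatched_adds, and the
similarity callback are all hypothetical names, and the 0.0 to 1.0 score scale
and the 0.75 threshold are chosen only for the example.

```python
import posixpath


def possible_rename(source, removed_dirs, dir_guess, unmatched_adds,
                    similarity, min_basename_score):
    """Return the single candidate destination for source, or None."""
    old_dir, base = posixpath.split(source)
    if old_dir not in removed_dirs:
        return None  # (1) the original directory still exists
    new_dir = dir_guess.get(old_dir)
    if new_dir is None:
        return None  # (2) no exact renames gave us a prediction
    candidate = posixpath.join(new_dir, base)
    if candidate not in unmatched_adds:
        return None  # (3) no file with that basename in the new directory
    if similarity(source, candidate) < min_basename_score:
        return None  # (4) the pair is not similar enough
    return candidate


# Stand-in for a real content comparison (illustrative scores only).
fake_scores = {("src/lib/Makefile", "lib/core/Makefile"): 0.8}


def similarity(a, b):
    return fake_scores.get((a, b), 0.0)


match = possible_rename(
    "src/lib/Makefile",
    removed_dirs={"src/lib"},
    dir_guess={"src/lib": "lib/core"},
    unmatched_adds={"lib/core/Makefile", "tools/Makefile"},
    similarity=similarity,
    min_basename_score=0.75,
)
print(match)  # lib/core/Makefile
```

Note that each source file triggers at most one call to similarity, whichever
branch is taken.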

It may be worth noting that directory rename detection predicts at most one
new directory, which we use to ensure that we only compare any given file
with at most one other file. That's important for compatibility with future
optimizations.

However, despite the caveats and limited applicability, this idea provides
some nice speedups.

=== Results ===

For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28), the
changes in just this series improve performance as follows:

                     Before Series           After Series
no-renames:       12.775 s ±  0.062 s    12.596 s ±  0.061 s
mega-renames:    188.754 s ±  0.284 s   130.465 s ±  0.259 s
just-one-mega:     5.599 s ±  0.019 s     3.958 s ±  0.010 s


As a reminder, before any merge-ort/diffcore-rename performance work, the
performance results we started with (as noted in the same commit message)
were:

no-renames-am:      6.940 s ±  0.485 s
no-renames:        18.912 s ±  0.174 s
mega-renames:    5964.031 s ± 10.459 s
just-one-mega:    149.583 s ±  0.751 s
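
For readers who want the ratios, the short Python snippet below derives
speedup factors from the mean timings in the two tables above. It ignores the
reported error bars, and the "overall" figure compares against the original
starting point, so it reflects all of the earlier performance work and not
just this series.

```python
# Mean timings in seconds, copied from the tables above.
before_series = {"no-renames": 12.775, "mega-renames": 188.754, "just-one-mega": 5.599}
after_series = {"no-renames": 12.596, "mega-renames": 130.465, "just-one-mega": 3.958}
original = {"no-renames": 18.912, "mega-renames": 5964.031, "just-one-mega": 149.583}


def speedup(old_seconds, new_seconds):
    """Ratio of old to new wall-clock time; above 1.0 means faster."""
    return old_seconds / new_seconds


for name in after_series:
    this_series = speedup(before_series[name], after_series[name])
    overall = speedup(original[name], after_series[name])
    print(f"{name}: {this_series:.2f}x from this series, {overall:.1f}x overall")
```

By this measure the series gives roughly a 1.4x improvement on the two
rename-heavy testcases and very little change on no-renames.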


=== Alternative, rejected idea ===

There was an alternative idea to the series presented here that I also
tried: instead of using directory rename detection based on exact renames to
predict where files would be renamed and then comparing to the file with the
same basename in the new directory, one could instead take all files with
the same basename -- both sources and destinations -- and then do a smaller
M x N comparison on all those files to find renames. Any non-matches after
that step could be combined with all other files for the big inexact rename
detection step.

There are two problems with such a strategy, though.

One is that in the worst case, you approximately double the cost of rename
detection (if most potential rename pairs all have the same basename but
they aren't actually matches, you end up comparing twice).

The second issue isn't clear until trying to combine this idea with later
performance optimizations. The next optimization will provide a way to
filter out several of the rename sources. If our inexact rename detection
matrix is sized 1 x 4000 because we can remove all but one source file, but
we have 100 files with the same basename, then a 100 x 100 comparison is
actually more costly than a 1 x 4000 comparison -- and we don't need most of
the renames from the 100 x 100 comparison. The advantage of using directory
rename detection to decide which basenames to match up is that the cost for
each file is linear (or, said another way, scales proportionally to doing a
diff on that file). As such, the costs for this preliminary optimization are
nicely controlled, and the worst case scenario is that it spends a little
extra time upfront but still has to do the full inexact rename detection.
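
The comparison counts in that example can be checked with a few lines of
Python. This is a back-of-the-envelope sketch that treats every file
comparison as having equal cost, which is a simplification; the helper name
pairwise_comparisons is made up for the illustration.

```python
def pairwise_comparisons(num_sources, num_destinations):
    """Number of file comparisons in a full sources x destinations matrix."""
    return num_sources * num_destinations


same_basename_matrix = pairwise_comparisons(100, 100)  # rejected idea
filtered_full_matrix = pairwise_comparisons(1, 4000)   # after filtering sources
directory_guided_max = 100 * 1  # each source compared with at most one file

print(same_basename_matrix)  # 10000
print(filtered_full_matrix)  # 4000
print(directory_guided_max)  # 100
```

Under that simplification, the same-basename matrix costs 2.5 times as much
as the filtered full matrix, while the directory-guided approach stays linear
in the number of source files.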

[1]
https://github.com/newren/presentations/blob/pdfs/merge-performance/merge-performance-slides.pdf

Elijah Newren (10):
  Move computation of dir_rename_count from merge-ort to diffcore-rename
  diffcore-rename: add functions for clearing dir_rename_count
  diffcore-rename: move dir_rename_counts into a dir_rename_info struct
  diffcore-rename: extend cleanup_dir_rename_info()
  diffcore-rename: compute dir_rename_counts in stages
  diffcore-rename: add a mapping of destination names to their indices
  diffcore-rename: add a dir_rename_guess field to dir_rename_info
  diffcore-rename: add a new idx_possible_rename function
  diffcore-rename: limit dir_rename_counts computation to relevant dirs
  diffcore-rename: use directory rename guided basename comparisons

 Documentation/gitdiffcore.txt |   2 +-
 diffcore-rename.c             | 439 ++++++++++++++++++++++++++++++++--
 diffcore.h                    |   7 +
 merge-ort.c                   | 144 +----------
 4 files changed, 439 insertions(+), 153 deletions(-)


base-commit: aeca14f748afc7fb5b65bca56ea2ebd970729814
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-844%2Fnewren%2Fort-perf-batch-8-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-844/newren/ort-perf-batch-8-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/844

Range-diff vs v1:

  1:  fec4f1d44c06 =  1:  fec4f1d44c06 Move computation of dir_rename_count from merge-ort to diffcore-rename
  2:  612da82f049c =  2:  612da82f049c diffcore-rename: add functions for clearing dir_rename_count
  3:  93f98fc0b264 =  3:  93f98fc0b264 diffcore-rename: move dir_rename_counts into a dir_rename_info struct
  4:  f7bdad78219d =  4:  f7bdad78219d diffcore-rename: extend cleanup_dir_rename_info()
  5:  3a29cf9e526f =  5:  3a29cf9e526f diffcore-rename: compute dir_rename_counts in stages
  6:  dffecc064dd3 =  6:  dffecc064dd3 diffcore-rename: add a mapping of destination names to their indices
  7:  4983a1c2f908 =  7:  4983a1c2f908 diffcore-rename: add a dir_rename_guess field to dir_rename_info
  8:  cbd055ab3399 =  8:  cbd055ab3399 diffcore-rename: add a new idx_possible_rename function
  9:  4e095ea7c439 =  9:  4e095ea7c439 diffcore-rename: limit dir_rename_counts computation to relevant dirs
 10:  1df498b3a2f0 = 10:  805c101cfd84 diffcore-rename: use directory rename guided basename comparisons

-- 
gitgitgadget

Thread overview: 61+ messages
2021-02-14  7:58 [PATCH 00/10] Optimization batch 8: use file basenames even more Elijah Newren via GitGitGadget
2021-02-14  7:58 ` [PATCH 01/10] Move computation of dir_rename_count from merge-ort to diffcore-rename Elijah Newren via GitGitGadget
2021-02-14  7:58 ` [PATCH 02/10] diffcore-rename: add functions for clearing dir_rename_count Elijah Newren via GitGitGadget
2021-02-14  7:58 ` [PATCH 03/10] diffcore-rename: move dir_rename_counts into a dir_rename_info struct Elijah Newren via GitGitGadget
2021-02-14  7:58 ` [PATCH 04/10] diffcore-rename: extend cleanup_dir_rename_info() Elijah Newren via GitGitGadget
2021-02-14  7:58 ` [PATCH 05/10] diffcore-rename: compute dir_rename_counts in stages Elijah Newren via GitGitGadget
2021-02-14  7:58 ` [PATCH 06/10] diffcore-rename: add a mapping of destination names to their indices Elijah Newren via GitGitGadget
2021-02-14  7:59 ` [PATCH 07/10] diffcore-rename: add a dir_rename_guess field to dir_rename_info Elijah Newren via GitGitGadget
2021-02-14  7:59 ` [PATCH 08/10] diffcore-rename: add a new idx_possible_rename function Elijah Newren via GitGitGadget
2021-02-14  7:59 ` [PATCH 09/10] diffcore-rename: limit dir_rename_counts computation to relevant dirs Elijah Newren via GitGitGadget
2021-02-14  7:59 ` [PATCH 10/10] diffcore-rename: use directory rename guided basename comparisons Elijah Newren via GitGitGadget
2021-02-23 23:43 ` Elijah Newren via GitGitGadget [this message]
2021-02-23 23:43   ` [PATCH v2 01/10] Move computation of dir_rename_count from merge-ort to diffcore-rename Elijah Newren via GitGitGadget
2021-02-24 15:25     ` Derrick Stolee
2021-02-24 18:50       ` Elijah Newren
2021-02-23 23:43   ` [PATCH v2 02/10] diffcore-rename: add functions for clearing dir_rename_count Elijah Newren via GitGitGadget
2021-02-23 23:44   ` [PATCH v2 03/10] diffcore-rename: move dir_rename_counts into a dir_rename_info struct Elijah Newren via GitGitGadget
2021-02-23 23:44   ` [PATCH v2 04/10] diffcore-rename: extend cleanup_dir_rename_info() Elijah Newren via GitGitGadget
2021-02-24 15:37     ` Derrick Stolee
2021-02-25  2:16     ` Ævar Arnfjörð Bjarmason
2021-02-25  2:26       ` Ævar Arnfjörð Bjarmason
2021-02-25  2:34       ` Junio C Hamano
2021-02-23 23:44   ` [PATCH v2 05/10] diffcore-rename: compute dir_rename_counts in stages Elijah Newren via GitGitGadget
2021-02-24 15:43     ` Derrick Stolee
2021-02-23 23:44   ` [PATCH v2 06/10] diffcore-rename: add a mapping of destination names to their indices Elijah Newren via GitGitGadget
2021-02-23 23:44   ` [PATCH v2 07/10] diffcore-rename: add a dir_rename_guess field to dir_rename_info Elijah Newren via GitGitGadget
2021-02-23 23:44   ` [PATCH v2 08/10] diffcore-rename: add a new idx_possible_rename function Elijah Newren via GitGitGadget
2021-02-24 17:35     ` Derrick Stolee
2021-02-25  1:13       ` Elijah Newren
2021-02-23 23:44   ` [PATCH v2 09/10] diffcore-rename: limit dir_rename_counts computation to relevant dirs Elijah Newren via GitGitGadget
2021-02-23 23:44   ` [PATCH v2 10/10] diffcore-rename: use directory rename guided basename comparisons Elijah Newren via GitGitGadget
2021-02-24 17:44     ` Derrick Stolee
2021-02-24 17:50   ` [PATCH v2 00/10] Optimization batch 8: use file basenames even more Derrick Stolee
2021-02-25  1:38     ` Elijah Newren
2021-02-26  1:58   ` [PATCH v3 " Elijah Newren via GitGitGadget
2021-02-26  1:58     ` [PATCH v3 01/10] diffcore-rename: use directory rename guided basename comparisons Elijah Newren via GitGitGadget
2021-02-26  1:58     ` [PATCH v3 02/10] diffcore-rename: add a new idx_possible_rename function Elijah Newren via GitGitGadget
2021-02-26 15:52       ` Derrick Stolee
2021-02-26  1:58     ` [PATCH v3 03/10] diffcore-rename: add a mapping of destination names to their indices Elijah Newren via GitGitGadget
2021-02-26  1:58     ` [PATCH v3 04/10] Move computation of dir_rename_count from merge-ort to diffcore-rename Elijah Newren via GitGitGadget
2021-02-26 15:55       ` Derrick Stolee
2021-02-26  1:58     ` [PATCH v3 05/10] diffcore-rename: add function for clearing dir_rename_count Elijah Newren via GitGitGadget
2021-02-26  1:58     ` [PATCH v3 06/10] diffcore-rename: move dir_rename_counts into dir_rename_info struct Elijah Newren via GitGitGadget
2021-02-26  1:58     ` [PATCH v3 07/10] diffcore-rename: extend cleanup_dir_rename_info() Elijah Newren via GitGitGadget
2021-02-26  1:58     ` [PATCH v3 08/10] diffcore-rename: compute dir_rename_counts in stages Elijah Newren via GitGitGadget
2021-02-26  1:58     ` [PATCH v3 09/10] diffcore-rename: limit dir_rename_counts computation to relevant dirs Elijah Newren via GitGitGadget
2021-02-26  1:58     ` [PATCH v3 10/10] diffcore-rename: compute dir_rename_guess from dir_rename_counts Elijah Newren via GitGitGadget
2021-02-26 16:34     ` [PATCH v3 00/10] Optimization batch 8: use file basenames even more Derrick Stolee
2021-02-26 19:28       ` Elijah Newren
2021-02-27  0:30     ` [PATCH v4 " Elijah Newren via GitGitGadget
2021-02-27  0:30       ` [PATCH v4 01/10] diffcore-rename: use directory rename guided basename comparisons Elijah Newren via GitGitGadget
2021-02-27  0:30       ` [PATCH v4 02/10] diffcore-rename: provide basic implementation of idx_possible_rename() Elijah Newren via GitGitGadget
2021-02-27  0:30       ` [PATCH v4 03/10] diffcore-rename: add a mapping of destination names to their indices Elijah Newren via GitGitGadget
2021-02-27  0:30       ` [PATCH v4 04/10] Move computation of dir_rename_count from merge-ort to diffcore-rename Elijah Newren via GitGitGadget
2021-02-27  0:30       ` [PATCH v4 05/10] diffcore-rename: add function for clearing dir_rename_count Elijah Newren via GitGitGadget
2021-02-27  0:30       ` [PATCH v4 06/10] diffcore-rename: move dir_rename_counts into dir_rename_info struct Elijah Newren via GitGitGadget
2021-02-27  0:30       ` [PATCH v4 07/10] diffcore-rename: extend cleanup_dir_rename_info() Elijah Newren via GitGitGadget
2021-02-27  0:30       ` [PATCH v4 08/10] diffcore-rename: compute dir_rename_counts in stages Elijah Newren via GitGitGadget
2021-02-27  0:30       ` [PATCH v4 09/10] diffcore-rename: limit dir_rename_counts computation to relevant dirs Elijah Newren via GitGitGadget
2021-02-27  0:30       ` [PATCH v4 10/10] diffcore-rename: compute dir_rename_guess from dir_rename_counts Elijah Newren via GitGitGadget
2021-03-09 21:52       ` [PATCH v4 00/10] Optimization batch 8: use file basenames even more Derrick Stolee
