* [PATCH 0/2] Optimization batch 6: make full use of exact renames @ 2021-02-03 5:49 Elijah Newren via GitGitGadget 2021-02-03 5:49 ` [PATCH 1/2] diffcore-rename: no point trying to find a match better than exact Elijah Newren via GitGitGadget ` (2 more replies) 0 siblings, 3 replies; 25+ messages in thread From: Elijah Newren via GitGitGadget @ 2021-02-03 5:49 UTC (permalink / raw) To: git Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano, Jeff King, Karsten Blees, Elijah Newren This series depends on en/merge-ort-perf. For the very curious who are wondering about the first five optimization batches; see the end of this email. This series makes full use of exact renames; see commit messages for details. It represents "Optimization #1" from my Git Merge 2020 talk[1]. For the testcases mentioned in commit 557ac0350d ("merge-ort: begin performance work; instrument with trace2_region_* calls", 2020-10-28), the changes in just this series improves the performance as follows: Before Series After Series no-renames: 14.263 s ± 0.053 s 13.815 s ± 0.062 s mega-renames: 5504.231 s ± 5.150 s 1799.937 s ± 0.493 s just-one-mega: 158.534 s ± 0.498 s 51.289 s ± 0.019 s As a reminder, before any merge-ort/diffcore-rename performance work, the performance results we started with (as noted in the same commit message) were: no-renames-am: 6.940 s ± 0.485 s no-renames: 18.912 s ± 0.174 s mega-renames: 5964.031 s ± 10.459 s just-one-mega: 149.583 s ± 0.751 s [1] https://github.com/newren/presentations/blob/pdfs/merge-performance/merge-performance-slides.pdf === Previous optimization batches === I'm labeling this as the "6th" batch, due to other optimizations submitted previously, and a number of optimizations baked into the design of fast-rebase and merge-ort. 1. Previously submitted hashmap/strmap optimizations 1a) 33f20d8217 (hashmap: introduce a new hashmap_partial_clear()) 1b) 6ccdfc2a20 (strmap: enable faster clearing and reusing of strmaps) 1c) a208ec1f0b (strmap: enable allocations to come from a mem_pool) 1d) 23a276a9c4 (strmap: take advantage of FLEXPTR_ALLOC_STR when relevant) 2. Previously submitted diffcore-rename optimizations 2a) b970b4ef62 (diffcore-rename: simplify and accelerate register_rename_src()) 2b) 9db2ac5616 (diffcore-rename: accelerate rename_dst setup) 2c) 350410f6b1 (diffcore-rename: remove unnecessary duplicate entry checks) 3. fast-rebase optimizations 3a) Avoid updating working-tree/index with every intermediate patch 3b) avoid reading/writing rebase metadata until conflict or completion 4. Small stuff baked into merge-ort design 4a) Using pahole to note I can reduce size of merged_info by 8 bytes 4b) Avoid recomparing hashes (due to use of match_masks) 4c) Avoid unconditional dropping and re-reading of the index 4d) avoid checking index matches HEAD with every patch; do it at start only 5. Big stuff baked into merge-ort design 5a) Avoid quadratic behavior with O(N) insertions/removals of index entries 5b) Avoid numerous expensive mini-tree traversals done by merge-recursive 5c) Avoid recursing into trees where both sides match merge base Elijah Newren (2): diffcore-rename: no point trying to find a match better than exact diffcore-rename: filter rename_src list when possible diffcore-rename.c | 69 ++++++++++++++++++++++++++++++++++++++++++----- 1 file changed, 62 insertions(+), 7 deletions(-) base-commit: 557ac0350d9efa1f59c708779ca3fb3aee121131 Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-842%2Fnewren%2Fort-perf-batch-6-v1 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-842/newren/ort-perf-batch-6-v1 Pull-Request: https://github.com/gitgitgadget/git/pull/842 -- gitgitgadget ^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH 1/2] diffcore-rename: no point trying to find a match better than exact 2021-02-03 5:49 [PATCH 0/2] Optimization batch 6: make full use of exact renames Elijah Newren via GitGitGadget @ 2021-02-03 5:49 ` Elijah Newren via GitGitGadget 2021-02-03 11:44 ` Derrick Stolee 2021-02-03 5:49 ` [PATCH 2/2] diffcore-rename: filter rename_src list when possible Elijah Newren via GitGitGadget 2021-02-03 20:03 ` [PATCH v2 0/2] Optimization batch 6: make full use of exact renames Elijah Newren via GitGitGadget 2 siblings, 1 reply; 25+ messages in thread From: Elijah Newren via GitGitGadget @ 2021-02-03 5:49 UTC (permalink / raw) To: git Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano, Jeff King, Karsten Blees, Elijah Newren, Elijah Newren From: Elijah Newren <newren@gmail.com> diffcore_rename() had some code to avoid having destination paths that already had an exact rename detected from being re-checked for other renames. Source paths, however, were re-checked because we wanted to allow the possibility of detecting copies. But if copy detection isn't turned on, then this merely amounts to attempting to find a better-than-exact match, which naturally ends up being an expensive no-op. In particular, copy detection is never turned on by the merge machinery. For the testcases mentioned in commit 557ac0350d ("merge-ort: begin performance work; instrument with trace2_region_* calls", 2020-10-28), this change improves the performance as follows: Before After no-renames: 14.263 s ± 0.053 s 14.119 s ± 0.101 s mega-renames: 5504.231 s ± 5.150 s 1802.044 s ± 0.828 s just-one-mega: 158.534 s ± 0.498 s 51.391 s ± 0.028 s Signed-off-by: Elijah Newren <newren@gmail.com> --- diffcore-rename.c | 16 ++++++++++++---- 1 file changed, 12 insertions(+), 4 deletions(-) diff --git a/diffcore-rename.c b/diffcore-rename.c index 8fe6c9384bc..e3047da3aaf 100644 --- a/diffcore-rename.c +++ b/diffcore-rename.c @@ -463,6 +463,7 @@ void diffcore_rename(struct diff_options *options) struct diff_score *mx; int i, j, rename_count, skip_unmodified = 0; int num_destinations, dst_cnt; + int num_sources; struct progress *progress = NULL; trace2_region_enter("diff", "setup", options->repo); @@ -532,12 +533,15 @@ void diffcore_rename(struct diff_options *options) * files still remain as options for rename/copies!) */ num_destinations = (rename_dst_nr - rename_count); + num_sources = rename_src_nr; + if (detect_rename != DIFF_DETECT_COPY) + num_sources -= rename_count; /* All done? */ - if (!num_destinations) + if (!num_destinations || !num_sources) goto cleanup; - switch (too_many_rename_candidates(num_destinations, rename_src_nr, + switch (too_many_rename_candidates(num_destinations, num_sources, options)) { case 1: goto cleanup; @@ -553,7 +557,7 @@ void diffcore_rename(struct diff_options *options) if (options->show_rename_progress) { progress = start_delayed_progress( _("Performing inexact rename detection"), - (uint64_t)num_destinations * (uint64_t)rename_src_nr); + (uint64_t)num_destinations * (uint64_t)num_sources); } mx = xcalloc(st_mult(NUM_CANDIDATE_PER_DST, num_destinations), @@ -573,6 +577,10 @@ void diffcore_rename(struct diff_options *options) struct diff_filespec *one = rename_src[j].p->one; struct diff_score this_src; + if (one->rename_used && + detect_rename != DIFF_DETECT_COPY) + continue; + if (skip_unmodified && diff_unmodified_pair(rename_src[j].p)) continue; @@ -594,7 +602,7 @@ void diffcore_rename(struct diff_options *options) } dst_cnt++; display_progress(progress, - (uint64_t)dst_cnt * (uint64_t)rename_src_nr); + (uint64_t)dst_cnt * (uint64_t)num_sources); } stop_progress(&progress); -- gitgitgadget ^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [PATCH 1/2] diffcore-rename: no point trying to find a match better than exact 2021-02-03 5:49 ` [PATCH 1/2] diffcore-rename: no point trying to find a match better than exact Elijah Newren via GitGitGadget @ 2021-02-03 11:44 ` Derrick Stolee 2021-02-03 16:31 ` Elijah Newren 2021-02-03 18:46 ` Junio C Hamano 0 siblings, 2 replies; 25+ messages in thread From: Derrick Stolee @ 2021-02-03 11:44 UTC (permalink / raw) To: Elijah Newren via GitGitGadget, git Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano, Jeff King, Karsten Blees, Elijah Newren On 2/3/2021 12:49 AM, Elijah Newren via GitGitGadget wrote: > From: Elijah Newren <newren@gmail.com> > > diffcore_rename() had some code to avoid having destination paths that > already had an exact rename detected from being re-checked for other > renames. Source paths, however, were re-checked because we wanted to > allow the possibility of detecting copies. But if copy detection isn't > turned on, then this merely amounts to attempting to find a > better-than-exact match, which naturally ends up being an expensive > no-op. In particular, copy detection is never turned on by the merge > machinery. ... > + num_sources = rename_src_nr; > + if (detect_rename != DIFF_DETECT_COPY) > + num_sources -= rename_count; Ok, delete the renamed files from the sources. Using a new variable because rename_src_nr is actually a static global to diffcore-rename.c, describing the number of entries in the rename_src table. This is scary, but I think your new local is a good way to change the local logic of this method without adjusting that global. > > /* All done? */ > - if (!num_destinations) > + if (!num_destinations || !num_sources) > goto cleanup; And add an extra quit condition which is very possible to hit. Is it only hit when every "delete" is actually a rename? > - switch (too_many_rename_candidates(num_destinations, rename_src_nr, > + switch (too_many_rename_candidates(num_destinations, num_sources, > options)) { This is all about checking if we need to skip inexact renames. Makes sense to use the new number. > + if (one->rename_used && > + detect_rename != DIFF_DETECT_COPY) > + continue; > + Have we "consumed" this input? Skip over it. Good. And this is inside a double-loop: for (dst_cnt = i = 0; i < rename_dst_nr; i++) { ... for (j = 0; j < rename_src_nr; j++) { Keeping rename_src_nr in the inner loop makes sense, but this new 'continue;' gives most of the speedup, I imagine. This is a nice speedup for such a simple optimization. Thanks, -Stolee ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 1/2] diffcore-rename: no point trying to find a match better than exact 2021-02-03 11:44 ` Derrick Stolee @ 2021-02-03 16:31 ` Elijah Newren 2021-02-03 18:46 ` Junio C Hamano 1 sibling, 0 replies; 25+ messages in thread From: Elijah Newren @ 2021-02-03 16:31 UTC (permalink / raw) To: Derrick Stolee Cc: Elijah Newren via GitGitGadget, Git Mailing List, Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano, Jeff King, Karsten Blees On Wed, Feb 3, 2021 at 3:44 AM Derrick Stolee <stolee@gmail.com> wrote: > > On 2/3/2021 12:49 AM, Elijah Newren via GitGitGadget wrote: > > From: Elijah Newren <newren@gmail.com> > > > > diffcore_rename() had some code to avoid having destination paths that > > already had an exact rename detected from being re-checked for other > > renames. Source paths, however, were re-checked because we wanted to > > allow the possibility of detecting copies. But if copy detection isn't > > turned on, then this merely amounts to attempting to find a > > better-than-exact match, which naturally ends up being an expensive > > no-op. In particular, copy detection is never turned on by the merge > > machinery. > ... > > + num_sources = rename_src_nr; > > + if (detect_rename != DIFF_DETECT_COPY) > > + num_sources -= rename_count; > > Ok, delete the renamed files from the sources. Using a new variable > because rename_src_nr is actually a static global to diffcore-rename.c, > describing the number of entries in the rename_src table. This is > scary, but I think your new local is a good way to change the local > logic of this method without adjusting that global. I thought about changing rename_src, rename_src_nr, rename_dst, and rename_dst_nr to all be in some struct and make one of those on the stack locally in diffcore_rename() and then pass that structure around. Would be nice to get rid of more global state. But I've got enough things in the queue that I never made the jump. > > > > /* All done? */ > > - if (!num_destinations) > > + if (!num_destinations || !num_sources) > > goto cleanup; > > And add an extra quit condition which is very possible to hit. > Is it only hit when every "delete" is actually a rename? Right, when every "delete" gets paired by exact rename detection to some "add" and is marked as a rename, meaning we have no more "deletes" to pair with anything. In later series, there will be additional reasons for num_sources to decrease and possibly hit 0. > > - switch (too_many_rename_candidates(num_destinations, rename_src_nr, > > + switch (too_many_rename_candidates(num_destinations, num_sources, > > options)) { > > This is all about checking if we need to skip inexact renames. Makes > sense to use the new number. > > > + if (one->rename_used && > > + detect_rename != DIFF_DETECT_COPY) > > + continue; > > + > > Have we "consumed" this input? Skip over it. Good. And this is inside > a double-loop: > > for (dst_cnt = i = 0; i < rename_dst_nr; i++) { > ... > for (j = 0; j < rename_src_nr; j++) { > > Keeping rename_src_nr in the inner loop makes sense, but this new > 'continue;' gives most of the speedup, I imagine. > > This is a nice speedup for such a simple optimization. > > Thanks, > -Stolee ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 1/2] diffcore-rename: no point trying to find a match better than exact 2021-02-03 11:44 ` Derrick Stolee 2021-02-03 16:31 ` Elijah Newren @ 2021-02-03 18:46 ` Junio C Hamano 2021-02-03 19:10 ` Elijah Newren 1 sibling, 1 reply; 25+ messages in thread From: Junio C Hamano @ 2021-02-03 18:46 UTC (permalink / raw) To: Derrick Stolee Cc: Elijah Newren via GitGitGadget, git, Derrick Stolee, Jonathan Tan, Taylor Blau, Jeff King, Karsten Blees, Elijah Newren Derrick Stolee <stolee@gmail.com> writes: >> /* All done? */ >> - if (!num_destinations) >> + if (!num_destinations || !num_sources) >> goto cleanup; > > And add an extra quit condition which is very possible to hit. > Is it only hit when every "delete" is actually a rename? When every delete is actually an unmodified exact rename, I think. If this simple change gives us good performance gain, that is superb. Nicely done. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 1/2] diffcore-rename: no point trying to find a match better than exact 2021-02-03 18:46 ` Junio C Hamano @ 2021-02-03 19:10 ` Elijah Newren 0 siblings, 0 replies; 25+ messages in thread From: Elijah Newren @ 2021-02-03 19:10 UTC (permalink / raw) To: Junio C Hamano Cc: Derrick Stolee, Elijah Newren via GitGitGadget, Git Mailing List, Derrick Stolee, Jonathan Tan, Taylor Blau, Jeff King, Karsten Blees On Wed, Feb 3, 2021 at 10:46 AM Junio C Hamano <gitster@pobox.com> wrote: > > Derrick Stolee <stolee@gmail.com> writes: > > >> /* All done? */ > >> - if (!num_destinations) > >> + if (!num_destinations || !num_sources) > >> goto cleanup; > > > > And add an extra quit condition which is very possible to hit. > > Is it only hit when every "delete" is actually a rename? > > When every delete is actually an unmodified exact rename, I think. Yes, exactly. > If this simple change gives us good performance gain, that is > superb. > > Nicely done. Thanks. I probably should have separated this from the speculative not-quite-working changes that weren't ready yet and pushed just it sooner.[1] But I kinda wanted to get all the performance changes together, and...well, all the other performance changes took a while to complete (especially given multiple starts and stops). [1] https://lore.kernel.org/git/20171110222156.23221-2-newren@gmail.com/ ^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH 2/2] diffcore-rename: filter rename_src list when possible 2021-02-03 5:49 [PATCH 0/2] Optimization batch 6: make full use of exact renames Elijah Newren via GitGitGadget 2021-02-03 5:49 ` [PATCH 1/2] diffcore-rename: no point trying to find a match better than exact Elijah Newren via GitGitGadget @ 2021-02-03 5:49 ` Elijah Newren via GitGitGadget [not found] ` <13feb106-c3a7-a26d-0e6e-013aa45c58d4@gmail.com> 2021-02-03 19:12 ` Junio C Hamano 2021-02-03 20:03 ` [PATCH v2 0/2] Optimization batch 6: make full use of exact renames Elijah Newren via GitGitGadget 2 siblings, 2 replies; 25+ messages in thread From: Elijah Newren via GitGitGadget @ 2021-02-03 5:49 UTC (permalink / raw) To: git Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano, Jeff King, Karsten Blees, Elijah Newren, Elijah Newren From: Elijah Newren <newren@gmail.com> We have to look at each entry in rename_src a total of rename_dst_nr times. When we're not detecting copies, any exact renames or ignorable rename paths will just be skipped over. While checking that these can be skipped over is a relatively cheap check, it's still a waste of time to do that check more than once, let alone rename_dst_nr times. When rename_src_nr is a few thousand times bigger than the number of relevant sources (such as when cherry-picking a commit that only touched a handful of files, but from a side of history that has different names for some high level directories), this time can add up. First make an initial pass over the rename_src array and move all the relevant entries to the front, so that we can iterate over just those relevant entries. For the testcases mentioned in commit 557ac0350d ("merge-ort: begin performance work; instrument with trace2_region_* calls", 2020-10-28), this change improves the performance as follows: Before After no-renames: 14.119 s ± 0.101 s 13.815 s ± 0.062 s mega-renames: 1802.044 s ± 0.828 s 1799.937 s ± 0.493 s just-one-mega: 51.391 s ± 0.028 s 51.289 s ± 0.019 s Signed-off-by: Elijah Newren <newren@gmail.com> --- diffcore-rename.c | 67 ++++++++++++++++++++++++++++++++++++++++------- 1 file changed, 57 insertions(+), 10 deletions(-) diff --git a/diffcore-rename.c b/diffcore-rename.c index e3047da3aaf..2a8e7b84b9c 100644 --- a/diffcore-rename.c +++ b/diffcore-rename.c @@ -454,6 +454,55 @@ static int find_renames(struct diff_score *mx, int dst_cnt, int minimum_score, i return count; } +static int remove_unneeded_paths_from_src(int num_src, + int detecting_copies) +{ + int i, new_num_src; + + /* + * Note on reasons why we cull unneeded sources but not destinations: + * 1) Pairings are stored in rename_dst (not rename_src), which we + * need to keep around. So, we just can't cull rename_dst even + * if we wanted to. But doing so wouldn't help because... + * + * 2) There is a matrix pairwise comparison that follows the + * "Performing inexact rename detection" progress message. + * Iterating over the destinations is done in the outer loop, + * hence we only iterate over each of those once and we can + * easily skip the outer loop early if the destination isn't + * relevant. That's only one check per destination path to + * skip. + * + * By contrast, the sources are iterated in the inner loop; if + * we check whether a source can be skipped, then we'll be + * checking it N separate times, once for each destination. + * We don't want to have to iterate over known-not-needed + * sources N times each, so avoid that by removing the sources + * from rename_src here. + */ + if (detecting_copies) + return num_src; /* nothing to remove */ + if (break_idx) + return num_src; /* culling incompatbile with break detection */ + + for (i = 0, new_num_src = 0; i < num_src; i++) { + /* + * renames are stored in rename_dst, so if a rename has + * already been detected using this source, we can just + * remove the source knowing rename_dst has its info. + */ + if (rename_src[i].p->one->rename_used) + continue; + + if (new_num_src < i) + memcpy(&rename_src[new_num_src], &rename_src[i], + sizeof(struct diff_rename_src)); + new_num_src++; + } + + return new_num_src; +} + void diffcore_rename(struct diff_options *options) { int detect_rename = options->detect_rename; @@ -463,10 +512,11 @@ void diffcore_rename(struct diff_options *options) struct diff_score *mx; int i, j, rename_count, skip_unmodified = 0; int num_destinations, dst_cnt; - int num_sources; + int num_sources, want_copies; struct progress *progress = NULL; trace2_region_enter("diff", "setup", options->repo); + want_copies = (detect_rename == DIFF_DETECT_COPY); if (!minimum_score) minimum_score = DEFAULT_RENAME_SCORE; @@ -529,13 +579,10 @@ void diffcore_rename(struct diff_options *options) goto cleanup; /* - * Calculate how many renames are left (but all the source - * files still remain as options for rename/copies!) + * Calculate how many renames are left */ num_destinations = (rename_dst_nr - rename_count); - num_sources = rename_src_nr; - if (detect_rename != DIFF_DETECT_COPY) - num_sources -= rename_count; + num_sources = remove_unneeded_paths_from_src(rename_src_nr, want_copies); /* All done? */ if (!num_destinations || !num_sources) @@ -573,13 +620,13 @@ void diffcore_rename(struct diff_options *options) for (j = 0; j < NUM_CANDIDATE_PER_DST; j++) m[j].dst = -1; - for (j = 0; j < rename_src_nr; j++) { + for (j = 0; j < num_sources; j++) { struct diff_filespec *one = rename_src[j].p->one; struct diff_score this_src; - if (one->rename_used && - detect_rename != DIFF_DETECT_COPY) - continue; + assert(!one->rename_used || + detect_rename == DIFF_DETECT_COPY || + break_idx); if (skip_unmodified && diff_unmodified_pair(rename_src[j].p)) -- gitgitgadget ^ permalink raw reply related [flat|nested] 25+ messages in thread
[parent not found: <13feb106-c3a7-a26d-0e6e-013aa45c58d4@gmail.com>]
* Re: [PATCH 2/2] diffcore-rename: filter rename_src list when possible [not found] ` <13feb106-c3a7-a26d-0e6e-013aa45c58d4@gmail.com> @ 2021-02-03 17:12 ` Elijah Newren 0 siblings, 0 replies; 25+ messages in thread From: Elijah Newren @ 2021-02-03 17:12 UTC (permalink / raw) To: Derrick Stolee Cc: Elijah Newren via GitGitGadget, Git Mailing List, Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano, Jeff King, Karsten Blees On Wed, Feb 3, 2021 at 4:00 AM Derrick Stolee <stolee@gmail.com> wrote: > > On 2/3/2021 12:49 AM, Elijah Newren via GitGitGadget wrote: > > From: Elijah Newren <newren@gmail.com> > > > > We have to look at each entry in rename_src a total of rename_dst_nr > > times. When we're not detecting copies, any exact renames or ignorable > > rename paths will just be skipped over. While checking that these can > > be skipped over is a relatively cheap check, it's still a waste of time > > to do that check more than once, let alone rename_dst_nr times. When > > rename_src_nr is a few thousand times bigger than the number of relevant > > sources (such as when cherry-picking a commit that only touched a > > handful of files, but from a side of history that has different names > > for some high level directories), this time can add up. > > > > First make an initial pass over the rename_src array and move all the > > relevant entries to the front, so that we can iterate over just those > > relevant entries. > > > > For the testcases mentioned in commit 557ac0350d ("merge-ort: begin > > performance work; instrument with trace2_region_* calls", 2020-10-28), > > this change improves the performance as follows: > > > > Before After > > no-renames: 14.119 s ± 0.101 s 13.815 s ± 0.062 s > > mega-renames: 1802.044 s ± 0.828 s 1799.937 s ± 0.493 s > > just-one-mega: 51.391 s ± 0.028 s 51.289 s ± 0.019 s > > > > Signed-off-by: Elijah Newren <newren@gmail.com> > > --- > > diffcore-rename.c | 67 ++++++++++++++++++++++++++++++++++++++++------- > > 1 file changed, 57 insertions(+), 10 deletions(-) > > > > diff --git a/diffcore-rename.c b/diffcore-rename.c > > index e3047da3aaf..2a8e7b84b9c 100644 > > --- a/diffcore-rename.c > > +++ b/diffcore-rename.c > > @@ -454,6 +454,55 @@ static int find_renames(struct diff_score *mx, int dst_cnt, int minimum_score, i > > return count; > > } > > > > +static int remove_unneeded_paths_from_src(int num_src, > > + int detecting_copies) > > +{ > > + int i, new_num_src; > > + > > + /* > > + * Note on reasons why we cull unneeded sources but not destinations: > > + * 1) Pairings are stored in rename_dst (not rename_src), which we > > + * need to keep around. So, we just can't cull rename_dst even > > + * if we wanted to. But doing so wouldn't help because... > > + * > > + * 2) There is a matrix pairwise comparison that follows the > > + * "Performing inexact rename detection" progress message. > > + * Iterating over the destinations is done in the outer loop, > > + * hence we only iterate over each of those once and we can > > + * easily skip the outer loop early if the destination isn't > > + * relevant. That's only one check per destination path to > > + * skip. > > + * > > + * By contrast, the sources are iterated in the inner loop; if > > + * we check whether a source can be skipped, then we'll be > > + * checking it N separate times, once for each destination. > > + * We don't want to have to iterate over known-not-needed > > + * sources N times each, so avoid that by removing the sources > > + * from rename_src here. > > + */ > > + if (detecting_copies) > > + return num_src; /* nothing to remove */ > > + if (break_idx) > > + return num_src; /* culling incompatbile with break detection */ > > While the comments here are certainly descriptive, they could probably > be sidelined to "return when culling is incompatible" with this before > the big comment: Yeah, I guess at this stage a bigger explanation isn't needed. But I think I'd end up just adding these comments back in later at an unrelated time if I were to take them out at this point. By the time merge-ort is finished, I'll have over a dozen additional optimizations, many of them in diffcore-rename.c. There will be files for which four different optimizations are all applicable, but they are mutually exclusive so we have to pick one of the four and exclude the other three. Best results come from knowing which one provides the best speedup and why (and which is second if the best doesn't apply, etc.). So what I'll present as four optimizations originally felt to me like 10 or so different ones, because I didn't start knowing the best, meaning I kept adding later additional optimizations that fed the appropriate data in all the right places so that I could tweak which one(s) took priority and get the best speedups. You'll get the cleaner version to review that pretends I knew the right answers all along. (Not to mention that I think I tried a lot more optimization ideas that failed than the number I tried that succeeded, and longer comments of mine sometimes highlight details that could have prevented the need to attempt some of the failed ideas.) Anyway, long story short: at some point along the way, the more detailed description felt very appropriate, but I don't have a clear breakpoint for where that is and the longer description most logically goes along with this early patch so I squashed it in here. *shrug* > if (detecting_copies || break_idx) > return num_src; > > This is a very minor nit pick, though. Moving these conditions above the comment makes sense, yes. I guess combining the two if conditions here could also be done, though the reasoning is different as noted by the comments so I would have to toss the comments. Since a later optimization will need to add an additional condition that is only relevant for one of the two reasons, I would just need to split again later and I think I would want the comments back at that time. So I think the split and keeping the comments seems more natural to me. > > + > > + for (i = 0, new_num_src = 0; i < num_src; i++) { > > + /* > > + * renames are stored in rename_dst, so if a rename has > > + * already been detected using this source, we can just > > + * remove the source knowing rename_dst has its info. > > + */ > > + if (rename_src[i].p->one->rename_used) > > + continue; > > + > > + if (new_num_src < i) > > + memcpy(&rename_src[new_num_src], &rename_src[i], > > + sizeof(struct diff_rename_src)); > > 'struct diff_rename_src' is currently just a pointer and a short, so > this is a very small amount of data, but this conditional move doesn't > hurt. > > > + new_num_src++; > > + } > > + > > + return new_num_src; > > +} > > + > > > > void diffcore_rename(struct diff_options *options) > > { > > int detect_rename = options->detect_rename; > > @@ -463,10 +512,11 @@ void diffcore_rename(struct diff_options *options) > > struct diff_score *mx; > > int i, j, rename_count, skip_unmodified = 0; > > int num_destinations, dst_cnt; > > - int num_sources; > > + int num_sources, want_copies; > > struct progress *progress = NULL; > > > > trace2_region_enter("diff", "setup", options->repo); > > + want_copies = (detect_rename == DIFF_DETECT_COPY); > > I was looking at this and thinking "why wasn't this in the previous > patch?" I see that you delete the "if (detect_rename != DIFF_DETECT_COPY)" > below in favor of the call to the helper. > > _If_ you were re-rolling, then maybe it would be slightly cleaner > to add 'want_copies' to the previous patch? Oh, yeah, I think that does make sense. And you highlight some good reasons to re-roll below, so I'll do that. > > /* > > - * Calculate how many renames are left (but all the source > > - * files still remain as options for rename/copies!) > > + * Calculate how many renames are left > > */ > > num_destinations = (rename_dst_nr - rename_count); > > - num_sources = rename_src_nr; > > - if (detect_rename != DIFF_DETECT_COPY) > > - num_sources -= rename_count; > > + num_sources = remove_unneeded_paths_from_src(rename_src_nr, want_copies); > > So here we update the local variable... > > > > > /* All done? */ > > if (!num_destinations || !num_sources) > > @@ -573,13 +620,13 @@ void diffcore_rename(struct diff_options *options) > > for (j = 0; j < NUM_CANDIDATE_PER_DST; j++) > > m[j].dst = -1; > > > > - for (j = 0; j < rename_src_nr; j++) { > > + for (j = 0; j < num_sources; j++) { > > ...and drop the use of the static global. However, don't we need to > update the global itself? There are other loops limited by > rename_src_nr that should only work on this smaller list, right? > > In particular, it seems that the prefetch() method would benefit > from looking at this smaller list. It fetches the blob contents in > a partial clone inside of estimate_similarity(). When there are a > lot of exact renames but even one possible inexact rename, it seems > to currently download the content for all of the blobs including the > exactly-matched ones. Ah, indeed. The prefetch() method didn't exist at the time I wrote this patch, but there's always the possibility of other reasons. It turns out I did make rename_src_nr match num_sources, but that diff hunk got placed in a later series. I agree with you that it should go here instead. > For interesting tests around this prefetch() call, I think the > 'diff with rename detection batches blobs' test in t4067 should help. > > > struct diff_filespec *one = rename_src[j].p->one; > > struct diff_score this_src; > > > > - if (one->rename_used && > > - detect_rename != DIFF_DETECT_COPY) > > - continue; > > + assert(!one->rename_used || > > + detect_rename == DIFF_DETECT_COPY || > > + break_idx); > > Looks like you could use 'want_copies' here (and in the previous > patch). Yep, good catch. I'll make the change. > Thanks, > -Stolee Thanks for reviewing! ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 2/2] diffcore-rename: filter rename_src list when possible 2021-02-03 5:49 ` [PATCH 2/2] diffcore-rename: filter rename_src list when possible Elijah Newren via GitGitGadget [not found] ` <13feb106-c3a7-a26d-0e6e-013aa45c58d4@gmail.com> @ 2021-02-03 19:12 ` Junio C Hamano 2021-02-03 19:19 ` Elijah Newren 1 sibling, 1 reply; 25+ messages in thread From: Junio C Hamano @ 2021-02-03 19:12 UTC (permalink / raw) To: Elijah Newren via GitGitGadget Cc: git, Derrick Stolee, Jonathan Tan, Taylor Blau, Jeff King, Karsten Blees, Elijah Newren "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes: > +static int remove_unneeded_paths_from_src(int num_src, > + int detecting_copies) > +{ > + int i, new_num_src; > + > + /* > + * Note on reasons why we cull unneeded sources but not destinations: > + * 1) Pairings are stored in rename_dst (not rename_src), which we > + * need to keep around. So, we just can't cull rename_dst even > + * if we wanted to. But doing so wouldn't help because... > + * > + * 2) There is a matrix pairwise comparison that follows the > + * "Performing inexact rename detection" progress message. > + * Iterating over the destinations is done in the outer loop, > + * hence we only iterate over each of those once and we can > + * easily skip the outer loop early if the destination isn't > + * relevant. That's only one check per destination path to > + * skip. > + * > + * By contrast, the sources are iterated in the inner loop; if > + * we check whether a source can be skipped, then we'll be > + * checking it N separate times, once for each destination. > + * We don't want to have to iterate over known-not-needed > + * sources N times each, so avoid that by removing the sources > + * from rename_src here. > + */ > + if (detecting_copies) > + return num_src; /* nothing to remove */ > + if (break_idx) > + return num_src; /* culling incompatbile with break detection */ > + > + for (i = 0, new_num_src = 0; i < num_src; i++) { > + /* > + * renames are stored in rename_dst, so if a rename has > + * already been detected using this source, we can just > + * remove the source knowing rename_dst has its info. > + */ > + if (rename_src[i].p->one->rename_used) > + continue; > + > + if (new_num_src < i) > + memcpy(&rename_src[new_num_src], &rename_src[i], > + sizeof(struct diff_rename_src)); > + new_num_src++; > + } > + > + return new_num_src; > +} Essentially we are compacting rename_src[] array from num_src elements down to new_num_src elements; we are losing pointers, but I presume these are all borrowed pointers that we do not own and we are not responsible for freeing? If we were to free them, the compaction would leave duplicates after the new tail (new_num_src) and we'd end up having to worry about double-freeing, so hopefully all we need to do is just to free the entire array of pointers, and not the pointees. Having to do this just once and being able to reduce the number of entries we need to iterate over does sound like a good simple optimization. > void diffcore_rename(struct diff_options *options) > { > int detect_rename = options->detect_rename; > @@ -463,10 +512,11 @@ void diffcore_rename(struct diff_options *options) > struct diff_score *mx; > int i, j, rename_count, skip_unmodified = 0; > int num_destinations, dst_cnt; > - int num_sources; > + int num_sources, want_copies; > struct progress *progress = NULL; > > trace2_region_enter("diff", "setup", options->repo); > + want_copies = (detect_rename == DIFF_DETECT_COPY); > if (!minimum_score) > minimum_score = DEFAULT_RENAME_SCORE; > > @@ -529,13 +579,10 @@ void diffcore_rename(struct diff_options *options) > goto cleanup; > > /* > - * Calculate how many renames are left (but all the source > - * files still remain as options for rename/copies!) > + * Calculate how many renames are left > */ > num_destinations = (rename_dst_nr - rename_count); > - num_sources = rename_src_nr; > - if (detect_rename != DIFF_DETECT_COPY) > - num_sources -= rename_count; > + num_sources = remove_unneeded_paths_from_src(rename_src_nr, want_copies); OK, this is in a sense an extended version of the previous step. I am not sure if rename_src_nr can be left out of sync with reality like this patch does, though. The reference to that variable in register_rename_src() and find_exact_renames() are OK as we are not going to call them after we futz with the rename_src[] array, but the reference in prefetch(), which does not actually happen early but only when we start running estimate_similarity(), which is after we compacted the rename_src[] array, would be affected, no? Thanks. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 2/2] diffcore-rename: filter rename_src list when possible 2021-02-03 19:12 ` Junio C Hamano @ 2021-02-03 19:19 ` Elijah Newren 0 siblings, 0 replies; 25+ messages in thread From: Elijah Newren @ 2021-02-03 19:19 UTC (permalink / raw) To: Junio C Hamano Cc: Elijah Newren via GitGitGadget, Git Mailing List, Derrick Stolee, Jonathan Tan, Taylor Blau, Jeff King, Karsten Blees On Wed, Feb 3, 2021 at 11:12 AM Junio C Hamano <gitster@pobox.com> wrote: > > "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes: > > > +static int remove_unneeded_paths_from_src(int num_src, > > + int detecting_copies) > > +{ > > + int i, new_num_src; > > + > > + /* > > + * Note on reasons why we cull unneeded sources but not destinations: > > + * 1) Pairings are stored in rename_dst (not rename_src), which we > > + * need to keep around. So, we just can't cull rename_dst even > > + * if we wanted to. But doing so wouldn't help because... > > + * > > + * 2) There is a matrix pairwise comparison that follows the > > + * "Performing inexact rename detection" progress message. > > + * Iterating over the destinations is done in the outer loop, > > + * hence we only iterate over each of those once and we can > > + * easily skip the outer loop early if the destination isn't > > + * relevant. That's only one check per destination path to > > + * skip. > > + * > > + * By contrast, the sources are iterated in the inner loop; if > > + * we check whether a source can be skipped, then we'll be > > + * checking it N separate times, once for each destination. > > + * We don't want to have to iterate over known-not-needed > > + * sources N times each, so avoid that by removing the sources > > + * from rename_src here. > > + */ > > + if (detecting_copies) > > + return num_src; /* nothing to remove */ > > + if (break_idx) > > + return num_src; /* culling incompatbile with break detection */ > > + > > + for (i = 0, new_num_src = 0; i < num_src; i++) { > > + /* > > + * renames are stored in rename_dst, so if a rename has > > + * already been detected using this source, we can just > > + * remove the source knowing rename_dst has its info. > > + */ > > + if (rename_src[i].p->one->rename_used) > > + continue; > > + > > + if (new_num_src < i) > > + memcpy(&rename_src[new_num_src], &rename_src[i], > > + sizeof(struct diff_rename_src)); > > + new_num_src++; > > + } > > + > > + return new_num_src; > > +} > > Essentially we are compacting rename_src[] array from num_src > elements down to new_num_src elements; we are losing pointers, but I > presume these are all borrowed pointers that we do not own and we > are not responsible for freeing? If we were to free them, the > compaction would leave duplicates after the new tail (new_num_src) > and we'd end up having to worry about double-freeing, so hopefully > all we need to do is just to free the entire array of pointers, and > not the pointees. > > Having to do this just once and being able to reduce the number of > entries we need to iterate over does sound like a good simple > optimization. Correct, they are all just borrowed pointers and we only need to free the array, not the pointers within the array. > > void diffcore_rename(struct diff_options *options) > > { > > int detect_rename = options->detect_rename; > > @@ -463,10 +512,11 @@ void diffcore_rename(struct diff_options *options) > > struct diff_score *mx; > > int i, j, rename_count, skip_unmodified = 0; > > int num_destinations, dst_cnt; > > - int num_sources; > > + int num_sources, want_copies; > > struct progress *progress = NULL; > > > > trace2_region_enter("diff", "setup", options->repo); > > + want_copies = (detect_rename == DIFF_DETECT_COPY); > > if (!minimum_score) > > minimum_score = DEFAULT_RENAME_SCORE; > > > > @@ -529,13 +579,10 @@ void diffcore_rename(struct diff_options *options) > > goto cleanup; > > > > /* > > - * Calculate how many renames are left (but all the source > > - * files still remain as options for rename/copies!) > > + * Calculate how many renames are left > > */ > > num_destinations = (rename_dst_nr - rename_count); > > - num_sources = rename_src_nr; > > - if (detect_rename != DIFF_DETECT_COPY) > > - num_sources -= rename_count; > > + num_sources = remove_unneeded_paths_from_src(rename_src_nr, want_copies); > > OK, this is in a sense an extended version of the previous step. > > I am not sure if rename_src_nr can be left out of sync with reality > like this patch does, though. The reference to that variable in > register_rename_src() and find_exact_renames() are OK as we are not > going to call them after we futz with the rename_src[] array, but > the reference in prefetch(), which does not actually happen early > but only when we start running estimate_similarity(), which is after > we compacted the rename_src[] array, would be affected, no? Yes, good catch. This is the same issue Stolee caught. I did set rename_src_nr to the new num_sources in my "ort" branch, but when breaking up the changes to upstream them, that line of code somehow got separated into a later patch (that I haven't submitted yet) and I just didn't notice it when reviewing this early series. It belongs in this patch as both of you pointed out. ^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH v2 0/2] Optimization batch 6: make full use of exact renames 2021-02-03 5:49 [PATCH 0/2] Optimization batch 6: make full use of exact renames Elijah Newren via GitGitGadget 2021-02-03 5:49 ` [PATCH 1/2] diffcore-rename: no point trying to find a match better than exact Elijah Newren via GitGitGadget 2021-02-03 5:49 ` [PATCH 2/2] diffcore-rename: filter rename_src list when possible Elijah Newren via GitGitGadget @ 2021-02-03 20:03 ` Elijah Newren via GitGitGadget 2021-02-03 20:03 ` [PATCH v2 1/2] diffcore-rename: no point trying to find a match better than exact Elijah Newren via GitGitGadget ` (3 more replies) 2 siblings, 4 replies; 25+ messages in thread From: Elijah Newren via GitGitGadget @ 2021-02-03 20:03 UTC (permalink / raw) To: git Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano, Jeff King, Karsten Blees, Derrick Stolee, Elijah Newren, Elijah Newren This series depends on en/merge-ort-perf and makes full use of exact renames; see commit messages for details. Thanks to Stolee and Junio for reviewing v1. Changes since v1: * Update rename_src_nr when updating rename_src * Introduce want_copies in the first patch and use it in a few more places * Move a comment below a few exit-early if-checks. Elijah Newren (2): diffcore-rename: no point trying to find a match better than exact diffcore-rename: filter rename_src list when possible diffcore-rename.c | 69 +++++++++++++++++++++++++++++++++++++++++------ 1 file changed, 61 insertions(+), 8 deletions(-) base-commit: 557ac0350d9efa1f59c708779ca3fb3aee121131 Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-842%2Fnewren%2Fort-perf-batch-6-v2 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-842/newren/ort-perf-batch-6-v2 Pull-Request: https://github.com/gitgitgadget/git/pull/842 Range-diff vs v1: 1: 3e69857f37e ! 1: 770e894b4ab diffcore-rename: no point trying to find a match better than exact @@ diffcore-rename.c: void diffcore_rename(struct diff_options *options) struct diff_score *mx; int i, j, rename_count, skip_unmodified = 0; int num_destinations, dst_cnt; -+ int num_sources; ++ int num_sources, want_copies; struct progress *progress = NULL; trace2_region_enter("diff", "setup", options->repo); ++ want_copies = (detect_rename == DIFF_DETECT_COPY); + if (!minimum_score) + minimum_score = DEFAULT_RENAME_SCORE; + +@@ diffcore-rename.c: void diffcore_rename(struct diff_options *options) + p->one->rename_used++; + register_rename_src(p); + } +- else if (detect_rename == DIFF_DETECT_COPY) { ++ else if (want_copies) { + /* + * Increment the "rename_used" score by + * one, to indicate ourselves as a user. @@ diffcore-rename.c: void diffcore_rename(struct diff_options *options) * files still remain as options for rename/copies!) */ num_destinations = (rename_dst_nr - rename_count); + num_sources = rename_src_nr; -+ if (detect_rename != DIFF_DETECT_COPY) ++ if (!want_copies) + num_sources -= rename_count; /* All done? */ @@ diffcore-rename.c: void diffcore_rename(struct diff_options *options) struct diff_filespec *one = rename_src[j].p->one; struct diff_score this_src; -+ if (one->rename_used && -+ detect_rename != DIFF_DETECT_COPY) ++ if (one->rename_used && !want_copies) + continue; + if (skip_unmodified && @@ diffcore-rename.c: void diffcore_rename(struct diff_options *options) } stop_progress(&progress); +@@ diffcore-rename.c: void diffcore_rename(struct diff_options *options) + STABLE_QSORT(mx, dst_cnt * NUM_CANDIDATE_PER_DST, score_compare); + + rename_count += find_renames(mx, dst_cnt, minimum_score, 0); +- if (detect_rename == DIFF_DETECT_COPY) ++ if (want_copies) + rename_count += find_renames(mx, dst_cnt, minimum_score, 1); + free(mx); + trace2_region_leave("diff", "inexact renames", options->repo); 2: 580ba9a10f5 ! 2: 7ae9460d3db diffcore-rename: filter rename_src list when possible @@ diffcore-rename.c: static int find_renames(struct diff_score *mx, int dst_cnt, i return count; } -+static int remove_unneeded_paths_from_src(int num_src, -+ int detecting_copies) ++static void remove_unneeded_paths_from_src(int detecting_copies) +{ + int i, new_num_src; + ++ if (detecting_copies) ++ return; /* nothing to remove */ ++ if (break_idx) ++ return; /* culling incompatbile with break detection */ ++ + /* + * Note on reasons why we cull unneeded sources but not destinations: + * 1) Pairings are stored in rename_dst (not rename_src), which we @@ diffcore-rename.c: static int find_renames(struct diff_score *mx, int dst_cnt, i + * sources N times each, so avoid that by removing the sources + * from rename_src here. + */ -+ if (detecting_copies) -+ return num_src; /* nothing to remove */ -+ if (break_idx) -+ return num_src; /* culling incompatbile with break detection */ -+ -+ for (i = 0, new_num_src = 0; i < num_src; i++) { ++ for (i = 0, new_num_src = 0; i < rename_src_nr; i++) { + /* + * renames are stored in rename_dst, so if a rename has + * already been detected using this source, we can just @@ diffcore-rename.c: static int find_renames(struct diff_score *mx, int dst_cnt, i + new_num_src++; + } + -+ return new_num_src; ++ rename_src_nr = new_num_src; +} + void diffcore_rename(struct diff_options *options) { int detect_rename = options->detect_rename; -@@ diffcore-rename.c: void diffcore_rename(struct diff_options *options) - struct diff_score *mx; - int i, j, rename_count, skip_unmodified = 0; - int num_destinations, dst_cnt; -- int num_sources; -+ int num_sources, want_copies; - struct progress *progress = NULL; - - trace2_region_enter("diff", "setup", options->repo); -+ want_copies = (detect_rename == DIFF_DETECT_COPY); - if (!minimum_score) - minimum_score = DEFAULT_RENAME_SCORE; - @@ diffcore-rename.c: void diffcore_rename(struct diff_options *options) goto cleanup; @@ diffcore-rename.c: void diffcore_rename(struct diff_options *options) + * Calculate how many renames are left */ num_destinations = (rename_dst_nr - rename_count); -- num_sources = rename_src_nr; -- if (detect_rename != DIFF_DETECT_COPY) ++ remove_unneeded_paths_from_src(want_copies); + num_sources = rename_src_nr; +- if (!want_copies) - num_sources -= rename_count; -+ num_sources = remove_unneeded_paths_from_src(rename_src_nr, want_copies); /* All done? */ if (!num_destinations || !num_sources) @@ diffcore-rename.c: void diffcore_rename(struct diff_options *options) - for (j = 0; j < NUM_CANDIDATE_PER_DST; j++) - m[j].dst = -1; - -- for (j = 0; j < rename_src_nr; j++) { -+ for (j = 0; j < num_sources; j++) { struct diff_filespec *one = rename_src[j].p->one; struct diff_score this_src; -- if (one->rename_used && -- detect_rename != DIFF_DETECT_COPY) +- if (one->rename_used && !want_copies) - continue; -+ assert(!one->rename_used || -+ detect_rename == DIFF_DETECT_COPY || -+ break_idx); ++ assert(!one->rename_used || want_copies || break_idx); if (skip_unmodified && diff_unmodified_pair(rename_src[j].p)) -- gitgitgadget ^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH v2 1/2] diffcore-rename: no point trying to find a match better than exact 2021-02-03 20:03 ` [PATCH v2 0/2] Optimization batch 6: make full use of exact renames Elijah Newren via GitGitGadget @ 2021-02-03 20:03 ` Elijah Newren via GitGitGadget 2021-02-03 20:03 ` [PATCH v2 2/2] diffcore-rename: filter rename_src list when possible Elijah Newren via GitGitGadget ` (2 subsequent siblings) 3 siblings, 0 replies; 25+ messages in thread From: Elijah Newren via GitGitGadget @ 2021-02-03 20:03 UTC (permalink / raw) To: git Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano, Jeff King, Karsten Blees, Derrick Stolee, Elijah Newren, Elijah Newren, Elijah Newren From: Elijah Newren <newren@gmail.com> diffcore_rename() had some code to avoid having destination paths that already had an exact rename detected from being re-checked for other renames. Source paths, however, were re-checked because we wanted to allow the possibility of detecting copies. But if copy detection isn't turned on, then this merely amounts to attempting to find a better-than-exact match, which naturally ends up being an expensive no-op. In particular, copy detection is never turned on by the merge machinery. For the testcases mentioned in commit 557ac0350d ("merge-ort: begin performance work; instrument with trace2_region_* calls", 2020-10-28), this change improves the performance as follows: Before After no-renames: 14.263 s ± 0.053 s 14.119 s ± 0.101 s mega-renames: 5504.231 s ± 5.150 s 1802.044 s ± 0.828 s just-one-mega: 158.534 s ± 0.498 s 51.391 s ± 0.028 s Signed-off-by: Elijah Newren <newren@gmail.com> --- diffcore-rename.c | 20 ++++++++++++++------ 1 file changed, 14 insertions(+), 6 deletions(-) diff --git a/diffcore-rename.c b/diffcore-rename.c index 8fe6c9384bc..8b118628b4e 100644 --- a/diffcore-rename.c +++ b/diffcore-rename.c @@ -463,9 +463,11 @@ void diffcore_rename(struct diff_options *options) struct diff_score *mx; int i, j, rename_count, skip_unmodified = 0; int num_destinations, dst_cnt; + int num_sources, want_copies; struct progress *progress = NULL; trace2_region_enter("diff", "setup", options->repo); + want_copies = (detect_rename == DIFF_DETECT_COPY); if (!minimum_score) minimum_score = DEFAULT_RENAME_SCORE; @@ -502,7 +504,7 @@ void diffcore_rename(struct diff_options *options) p->one->rename_used++; register_rename_src(p); } - else if (detect_rename == DIFF_DETECT_COPY) { + else if (want_copies) { /* * Increment the "rename_used" score by * one, to indicate ourselves as a user. @@ -532,12 +534,15 @@ void diffcore_rename(struct diff_options *options) * files still remain as options for rename/copies!) */ num_destinations = (rename_dst_nr - rename_count); + num_sources = rename_src_nr; + if (!want_copies) + num_sources -= rename_count; /* All done? */ - if (!num_destinations) + if (!num_destinations || !num_sources) goto cleanup; - switch (too_many_rename_candidates(num_destinations, rename_src_nr, + switch (too_many_rename_candidates(num_destinations, num_sources, options)) { case 1: goto cleanup; @@ -553,7 +558,7 @@ void diffcore_rename(struct diff_options *options) if (options->show_rename_progress) { progress = start_delayed_progress( _("Performing inexact rename detection"), - (uint64_t)num_destinations * (uint64_t)rename_src_nr); + (uint64_t)num_destinations * (uint64_t)num_sources); } mx = xcalloc(st_mult(NUM_CANDIDATE_PER_DST, num_destinations), @@ -573,6 +578,9 @@ void diffcore_rename(struct diff_options *options) struct diff_filespec *one = rename_src[j].p->one; struct diff_score this_src; + if (one->rename_used && !want_copies) + continue; + if (skip_unmodified && diff_unmodified_pair(rename_src[j].p)) continue; @@ -594,7 +602,7 @@ void diffcore_rename(struct diff_options *options) } dst_cnt++; display_progress(progress, - (uint64_t)dst_cnt * (uint64_t)rename_src_nr); + (uint64_t)dst_cnt * (uint64_t)num_sources); } stop_progress(&progress); @@ -602,7 +610,7 @@ void diffcore_rename(struct diff_options *options) STABLE_QSORT(mx, dst_cnt * NUM_CANDIDATE_PER_DST, score_compare); rename_count += find_renames(mx, dst_cnt, minimum_score, 0); - if (detect_rename == DIFF_DETECT_COPY) + if (want_copies) rename_count += find_renames(mx, dst_cnt, minimum_score, 1); free(mx); trace2_region_leave("diff", "inexact renames", options->repo); -- gitgitgadget ^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH v2 2/2] diffcore-rename: filter rename_src list when possible 2021-02-03 20:03 ` [PATCH v2 0/2] Optimization batch 6: make full use of exact renames Elijah Newren via GitGitGadget 2021-02-03 20:03 ` [PATCH v2 1/2] diffcore-rename: no point trying to find a match better than exact Elijah Newren via GitGitGadget @ 2021-02-03 20:03 ` Elijah Newren via GitGitGadget 2021-02-13 1:04 ` Junio C Hamano 2021-02-13 1:06 ` Junio C Hamano 2021-02-03 21:56 ` [PATCH v2 0/2] Optimization batch 6: make full use of exact renames Junio C Hamano 2021-02-14 7:34 ` [PATCH v3 " Elijah Newren via GitGitGadget 3 siblings, 2 replies; 25+ messages in thread From: Elijah Newren via GitGitGadget @ 2021-02-03 20:03 UTC (permalink / raw) To: git Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano, Jeff King, Karsten Blees, Derrick Stolee, Elijah Newren, Elijah Newren, Elijah Newren From: Elijah Newren <newren@gmail.com> We have to look at each entry in rename_src a total of rename_dst_nr times. When we're not detecting copies, any exact renames or ignorable rename paths will just be skipped over. While checking that these can be skipped over is a relatively cheap check, it's still a waste of time to do that check more than once, let alone rename_dst_nr times. When rename_src_nr is a few thousand times bigger than the number of relevant sources (such as when cherry-picking a commit that only touched a handful of files, but from a side of history that has different names for some high level directories), this time can add up. First make an initial pass over the rename_src array and move all the relevant entries to the front, so that we can iterate over just those relevant entries. For the testcases mentioned in commit 557ac0350d ("merge-ort: begin performance work; instrument with trace2_region_* calls", 2020-10-28), this change improves the performance as follows: Before After no-renames: 14.119 s ± 0.101 s 13.815 s ± 0.062 s mega-renames: 1802.044 s ± 0.828 s 1799.937 s ± 0.493 s just-one-mega: 51.391 s ± 0.028 s 51.289 s ± 0.019 s Signed-off-by: Elijah Newren <newren@gmail.com> --- diffcore-rename.c | 57 ++++++++++++++++++++++++++++++++++++++++++----- 1 file changed, 51 insertions(+), 6 deletions(-) diff --git a/diffcore-rename.c b/diffcore-rename.c index 8b118628b4e..74930716e70 100644 --- a/diffcore-rename.c +++ b/diffcore-rename.c @@ -454,6 +454,54 @@ static int find_renames(struct diff_score *mx, int dst_cnt, int minimum_score, i return count; } +static void remove_unneeded_paths_from_src(int detecting_copies) +{ + int i, new_num_src; + + if (detecting_copies) + return; /* nothing to remove */ + if (break_idx) + return; /* culling incompatbile with break detection */ + + /* + * Note on reasons why we cull unneeded sources but not destinations: + * 1) Pairings are stored in rename_dst (not rename_src), which we + * need to keep around. So, we just can't cull rename_dst even + * if we wanted to. But doing so wouldn't help because... + * + * 2) There is a matrix pairwise comparison that follows the + * "Performing inexact rename detection" progress message. + * Iterating over the destinations is done in the outer loop, + * hence we only iterate over each of those once and we can + * easily skip the outer loop early if the destination isn't + * relevant. That's only one check per destination path to + * skip. + * + * By contrast, the sources are iterated in the inner loop; if + * we check whether a source can be skipped, then we'll be + * checking it N separate times, once for each destination. + * We don't want to have to iterate over known-not-needed + * sources N times each, so avoid that by removing the sources + * from rename_src here. + */ + for (i = 0, new_num_src = 0; i < rename_src_nr; i++) { + /* + * renames are stored in rename_dst, so if a rename has + * already been detected using this source, we can just + * remove the source knowing rename_dst has its info. + */ + if (rename_src[i].p->one->rename_used) + continue; + + if (new_num_src < i) + memcpy(&rename_src[new_num_src], &rename_src[i], + sizeof(struct diff_rename_src)); + new_num_src++; + } + + rename_src_nr = new_num_src; +} + void diffcore_rename(struct diff_options *options) { int detect_rename = options->detect_rename; @@ -530,13 +578,11 @@ void diffcore_rename(struct diff_options *options) goto cleanup; /* - * Calculate how many renames are left (but all the source - * files still remain as options for rename/copies!) + * Calculate how many renames are left */ num_destinations = (rename_dst_nr - rename_count); + remove_unneeded_paths_from_src(want_copies); num_sources = rename_src_nr; - if (!want_copies) - num_sources -= rename_count; /* All done? */ if (!num_destinations || !num_sources) @@ -578,8 +624,7 @@ void diffcore_rename(struct diff_options *options) struct diff_filespec *one = rename_src[j].p->one; struct diff_score this_src; - if (one->rename_used && !want_copies) - continue; + assert(!one->rename_used || want_copies || break_idx); if (skip_unmodified && diff_unmodified_pair(rename_src[j].p)) -- gitgitgadget ^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [PATCH v2 2/2] diffcore-rename: filter rename_src list when possible 2021-02-03 20:03 ` [PATCH v2 2/2] diffcore-rename: filter rename_src list when possible Elijah Newren via GitGitGadget @ 2021-02-13 1:04 ` Junio C Hamano 2021-02-13 4:24 ` Elijah Newren 2021-02-13 1:06 ` Junio C Hamano 1 sibling, 1 reply; 25+ messages in thread From: Junio C Hamano @ 2021-02-13 1:04 UTC (permalink / raw) To: Elijah Newren via GitGitGadget Cc: git, Derrick Stolee, Jonathan Tan, Taylor Blau, Jeff King, Karsten Blees, Derrick Stolee, Elijah Newren "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes: > + return; /* culling incompatbile with break detection */ incompatible. > /* > - * Calculate how many renames are left (but all the source > - * files still remain as options for rename/copies!) > + * Calculate how many renames are left > */ This no longer needs to be a multi-line comment, does it? ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v2 2/2] diffcore-rename: filter rename_src list when possible 2021-02-13 1:04 ` Junio C Hamano @ 2021-02-13 4:24 ` Elijah Newren 0 siblings, 0 replies; 25+ messages in thread From: Elijah Newren @ 2021-02-13 4:24 UTC (permalink / raw) To: Junio C Hamano Cc: Elijah Newren via GitGitGadget, Git Mailing List, Derrick Stolee, Jonathan Tan, Taylor Blau, Jeff King, Karsten Blees, Derrick Stolee On Fri, Feb 12, 2021 at 5:04 PM Junio C Hamano <gitster@pobox.com> wrote: > > "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes: > > > + return; /* culling incompatbile with break detection */ > > incompatible. Thanks. > > /* > > - * Calculate how many renames are left (but all the source > > - * files still remain as options for rename/copies!) > > + * Calculate how many renames are left > > */ > > This no longer needs to be a multi-line comment, does it? Indeed. Will fix both. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v2 2/2] diffcore-rename: filter rename_src list when possible 2021-02-03 20:03 ` [PATCH v2 2/2] diffcore-rename: filter rename_src list when possible Elijah Newren via GitGitGadget 2021-02-13 1:04 ` Junio C Hamano @ 2021-02-13 1:06 ` Junio C Hamano 2021-02-13 4:43 ` Elijah Newren 1 sibling, 1 reply; 25+ messages in thread From: Junio C Hamano @ 2021-02-13 1:06 UTC (permalink / raw) To: Elijah Newren via GitGitGadget Cc: git, Derrick Stolee, Jonathan Tan, Taylor Blau, Jeff King, Karsten Blees, Derrick Stolee, Elijah Newren "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes: > @@ -578,8 +624,7 @@ void diffcore_rename(struct diff_options *options) > struct diff_filespec *one = rename_src[j].p->one; > struct diff_score this_src; > > - if (one->rename_used && !want_copies) > - continue; > + assert(!one->rename_used || want_copies || break_idx); Doesn't assert becomes a no-op in production builds? Shouldn't this be a BUG() instead? ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v2 2/2] diffcore-rename: filter rename_src list when possible 2021-02-13 1:06 ` Junio C Hamano @ 2021-02-13 4:43 ` Elijah Newren 0 siblings, 0 replies; 25+ messages in thread From: Elijah Newren @ 2021-02-13 4:43 UTC (permalink / raw) To: Junio C Hamano Cc: Elijah Newren via GitGitGadget, Git Mailing List, Derrick Stolee, Jonathan Tan, Taylor Blau, Jeff King, Karsten Blees, Derrick Stolee On Fri, Feb 12, 2021 at 5:06 PM Junio C Hamano <gitster@pobox.com> wrote: > > "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes: > > > @@ -578,8 +624,7 @@ void diffcore_rename(struct diff_options *options) > > struct diff_filespec *one = rename_src[j].p->one; > > struct diff_score this_src; > > > > - if (one->rename_used && !want_copies) > > - continue; > > + assert(!one->rename_used || want_copies || break_idx); > > Doesn't assert becomes a no-op in production builds? Shouldn't this > be a BUG() instead? I guess it depends on the reason for the line. If we're just trying to help others understand the code by documenting conditions that are true (and perhaps help them when they are refactoring), then comments like /* The following is true at this point: !one->rename_used || want_copies || break_idx */ could also be used, but it's kind of verbose. The form assert(!one->rename_used || want_copies || break_idx); is shorter, clearer, and is likely to be up-to-date because even if most builds and users turn off assertions, some folks will build and run with assertions on. If it's a refactoring-aid, then the latter form is also more likely to help the future developer (who may be future me). If, however, the purpose is to check for bad input or worries about programming logic possibly have edge cases, and we're afraid that such conditions might cause severe and hard to debug problems later in the code, then we absolutely would rather use a BUG(). Most of my uses of assert() fall in the former category. I use BUG() when it's a line meant for the latter category. There are a few that perhaps fall in between, where I just have to make a judgement call. I'd like to say that I'm more likely to use BUG() for those, but the lack of a BUG_ON() does tend to make it pretty annoying to use and certainly discourages it. Granted, that's the way I look at it. I'm curious if others have a different view. It certainly seems like the project likes to use both forms nearly equally: $ git grep assert\( origin/master | wc -l 468 $ git grep BUG\( origin/master | wc -l 506 so I'm curious if there are other factors that make folks pick one or the other. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v2 0/2] Optimization batch 6: make full use of exact renames 2021-02-03 20:03 ` [PATCH v2 0/2] Optimization batch 6: make full use of exact renames Elijah Newren via GitGitGadget 2021-02-03 20:03 ` [PATCH v2 1/2] diffcore-rename: no point trying to find a match better than exact Elijah Newren via GitGitGadget 2021-02-03 20:03 ` [PATCH v2 2/2] diffcore-rename: filter rename_src list when possible Elijah Newren via GitGitGadget @ 2021-02-03 21:56 ` Junio C Hamano 2021-02-03 23:06 ` Elijah Newren 2021-02-14 7:34 ` [PATCH v3 " Elijah Newren via GitGitGadget 3 siblings, 1 reply; 25+ messages in thread From: Junio C Hamano @ 2021-02-03 21:56 UTC (permalink / raw) To: Elijah Newren via GitGitGadget Cc: git, Derrick Stolee, Jonathan Tan, Taylor Blau, Jeff King, Karsten Blees, Derrick Stolee, Elijah Newren "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes: > This series depends on en/merge-ort-perf and makes full use of exact > renames; see commit messages for details. > > Thanks to Stolee and Junio for reviewing v1. > > Changes since v1: > > * Update rename_src_nr when updating rename_src > * Introduce want_copies in the first patch and use it in a few more places > * Move a comment below a few exit-early if-checks. > > Elijah Newren (2): > diffcore-rename: no point trying to find a match better than exact > diffcore-rename: filter rename_src list when possible > > diffcore-rename.c | 69 +++++++++++++++++++++++++++++++++++++++++------ > 1 file changed, 61 insertions(+), 8 deletions(-) Thanks, these look bettrer. With these changes, I guess there are only two things I find myself somewhat embarrassing in the rename machinery that is still there since I invented it. - We still need to go full matrix while finding the "best" pairing. I cannot think of a way to avoid it (that is what makes it embarrassing) but wish there were some way to. In an early attempt, I tried to retire rename_src[j], once rename_dst[i] has been found to be a "good enough" match for it, from the pool of rename src candidates to find a good match for rename_dst[k] for i < k, but naive implementation of it would not work well for obvious reasons---rename_src[j] may match a lot better with rename_dst[k] than rename_dst[i] but we do not know that until we try to estimate similarity with rename_dst[k]. - The .cnt_data member was designed to be a concise summary of the blob characteristics so that two .cnt_data can be "compared" fairly cheaply to see how "similar" two blobs are [*], but (1) it is rather big to be called a "concise summary", and (2) it was not chosen after real performance measurement, and we've been using it for the past 15 years without revisiting its design. Side note: In a very early prototype, the approach to assess similarity between two blobs was very different---there was no attempt to compute "concise summary" for each blob, but we just attempted to create delta (as in the pack data) between src and dst blobs and measured how small a delta we can use to transform from src to dst. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v2 0/2] Optimization batch 6: make full use of exact renames 2021-02-03 21:56 ` [PATCH v2 0/2] Optimization batch 6: make full use of exact renames Junio C Hamano @ 2021-02-03 23:06 ` Elijah Newren 2021-02-03 23:26 ` Junio C Hamano 2021-02-03 23:36 ` Jeff King 0 siblings, 2 replies; 25+ messages in thread From: Elijah Newren @ 2021-02-03 23:06 UTC (permalink / raw) To: Junio C Hamano Cc: Elijah Newren via GitGitGadget, Git Mailing List, Derrick Stolee, Jonathan Tan, Taylor Blau, Jeff King, Karsten Blees, Derrick Stolee On Wed, Feb 3, 2021 at 1:56 PM Junio C Hamano <gitster@pobox.com> wrote: > > "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes: > > > This series depends on en/merge-ort-perf and makes full use of exact > > renames; see commit messages for details. > > > > Thanks to Stolee and Junio for reviewing v1. > > > > Changes since v1: > > > > * Update rename_src_nr when updating rename_src > > * Introduce want_copies in the first patch and use it in a few more places > > * Move a comment below a few exit-early if-checks. > > > > Elijah Newren (2): > > diffcore-rename: no point trying to find a match better than exact > > diffcore-rename: filter rename_src list when possible > > > > diffcore-rename.c | 69 +++++++++++++++++++++++++++++++++++++++++------ > > 1 file changed, 61 insertions(+), 8 deletions(-) > > Thanks, these look bettrer. > > With these changes, I guess there are only two things I find myself > somewhat embarrassing in the rename machinery that is still there > since I invented it. > > - We still need to go full matrix while finding the "best" > pairing. I cannot think of a way to avoid it (that is what makes > it embarrassing) but wish there were some way to. It turns out that exact renames aren't the only thing that can reduce the size of the matrix. My next few series will add some more. The matrix does remain, even at the end of all my performance work, but multiple ways to shrink it certainly help a lot. > In an early attempt, I tried to retire rename_src[j], once > rename_dst[i] has been found to be a "good enough" match for it, > from the pool of rename src candidates to find a good match for > rename_dst[k] for i < k, but naive implementation of it would not > work well for obvious reasons---rename_src[j] may match a lot > better with rename_dst[k] than rename_dst[i] but we do not know > that until we try to estimate similarity with rename_dst[k]. You may really like the next two series I submit. I have a smarter way to find a "good enough" match (comparing to exactly one other file and often finding sufficient similarity), and one that'll make intuitive sense to users. Then I have other series after those that optimize in different ways. > - The .cnt_data member was designed to be a concise summary of the > blob characteristics so that two .cnt_data can be "compared" > fairly cheaply to see how "similar" two blobs are [*], but (1) it > is rather big to be called a "concise summary", and (2) it was > not chosen after real performance measurement, and we've been > using it for the past 15 years without revisiting its design. .cnt_data seemed kind of clever to me, though it did have a few things that seemed surprising: 1) Small "lines". The similarity works by mapping each "line" to an integer using some simple hash, and keeps track of the list of hashed integers for each file. Once you have a list of hashed integers for each file, similarity is determined by looking through the two lists for two files and seeing what percentage of integers is found in both (but see item #3 for how percentage is defined). One special consideration here was "lines". I think because of a desire to handle binary files, there was a decision to split at 64-bytes or LF, whichever comes first, so each real line of code might be broken into multiple "lines". The 64-bytes thing might make things a little weird, though. If you have a lot of lines that are just over 64-characters long, they'll be split into two "lines", with the second ones being very short. Really short lines are much more likely to look like other really short lines, which might pad your similarity count. I'm not sure what to do differently, though. 2) Size-agnostic hashing. The integer that each "line" is mapped to is strictly between 0 and HASHBASE-1, where HASHBASE is #define'd to 107927. By the pigeon-hole principle, since this hash size is fixed, any two sufficiently large files will be marked by this algorithm as similar. I created some testcases and verified that this was indeed true. I created two files at random using disjoint "alphabets" (e.g. one file only contained letters, the other file only contained digits), and found that at a certain size the algorithm would always return >50% similarity. The size was somewhere between 1MB and 8MB; I can't find the table where I recorded that anymore. So, basically, sufficiently large files that are sufficiently close in size (since the algorithm checks for size similarity as an optimization) will always be marked as a rename or copy. I think we haven't gotten reports of this "bug" because files that large are for all intents and purposes binaries that people likely won't modify. And if they do modify the files, then they'll conflict, but people won't look at the files to resolve the conflict; instead, they'll generate a new "good" binary externally and then stick it into place. I investigated this just a little bit, and even considered turning off rename detection for sufficiently large files, but sizes aren't available until the tight inner loop and adding any extra checking there actually costs enough to offset most any benefit. There might have been a way to avoid that, but ultimately I just dropped the issue and did nothing. 3) It uses a similarity measure that diverges from what researches used for MinHash and other fancy algorithms. In particular, size(A intersect B) / size(A union B) != size(A intersect B) / max(size(A), size(B)) The formula on the right hand side would mean that if file A is a subset of file B, say the first 10% of file B, then it will be treated as 100% similar when most humans would look at it and say it is only 10% similar. The MinHash definition for similarity measure seems like it makes more sense...though if people have been passing numbers like "60%" to -M, then this change of definition will force them to recalibrate what numbers they pass. I think it'd be really nice to use the MinHash definition and that it might even fix bugs to do so, but I was worried folks might perceive it as a backward incompatible change if they need to recalibrate their numberings. So, I ultimately didn't do anything here either...yet. But this is the one that I'd most like to do something about. Are others concerned about the possible need to recalibrate, or is that okay? Maybe the performance gains I'm adding elsewhere will offset possible grumpy users? > Side note: In a very early prototype, the approach to assess > similarity between two blobs was very different---there was no > attempt to compute "concise summary" for each blob, but we just > attempted to create delta (as in the pack data) between src and > dst blobs and measured how small a delta we can use to transform > from src to dst. Interesting; I didn't know that. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v2 0/2] Optimization batch 6: make full use of exact renames 2021-02-03 23:06 ` Elijah Newren @ 2021-02-03 23:26 ` Junio C Hamano 2021-02-03 23:36 ` Jeff King 1 sibling, 0 replies; 25+ messages in thread From: Junio C Hamano @ 2021-02-03 23:26 UTC (permalink / raw) To: Elijah Newren Cc: Elijah Newren via GitGitGadget, Git Mailing List, Derrick Stolee, Jonathan Tan, Taylor Blau, Jeff King, Karsten Blees, Derrick Stolee Elijah Newren <newren@gmail.com> writes: > 3) It uses a similarity measure that diverges from what researches > used for MinHash and other fancy algorithms. In particular, > > size(A intersect B) / size(A union B) != size(A intersect B) / > max(size(A), size(B)) > > The formula on the right hand side would mean that if file A is a > subset of file B, say the first 10% of file B, then it will be treated > as 100% similar when most humans would look at it and say it is only > 10% similar. If you are talking about "you start from 100 lines file and appended 900 lines of your own, then you still have 100% of the original material remaining in the file", it is quite deliberate that we used it as an indication that the original "100 lines" file is a good candidate to have been renamed to the resulting "1000 lines" file. It is "what you have kept from the original" measure. Of course, taken to the extreme, this means that rename does not have to be symmetrical. "diff A B" may find that the original 100-line file in A has grown into 1000-line file in B elsewhere, but "diff B A" or "diff -R A B" would not necessarily pair these two blobs as matching. > Maybe the performance gains I'm adding elsewhere will offset possible > grumpy users? Users, as they are, it would never happen. When they have something to complain about, they will, regardless of what else you do. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v2 0/2] Optimization batch 6: make full use of exact renames 2021-02-03 23:06 ` Elijah Newren 2021-02-03 23:26 ` Junio C Hamano @ 2021-02-03 23:36 ` Jeff King 2021-02-04 0:05 ` Elijah Newren 1 sibling, 1 reply; 25+ messages in thread From: Jeff King @ 2021-02-03 23:36 UTC (permalink / raw) To: Elijah Newren Cc: Junio C Hamano, Elijah Newren via GitGitGadget, Git Mailing List, Derrick Stolee, Jonathan Tan, Taylor Blau, Karsten Blees, Derrick Stolee On Wed, Feb 03, 2021 at 03:06:26PM -0800, Elijah Newren wrote: > > In an early attempt, I tried to retire rename_src[j], once > > rename_dst[i] has been found to be a "good enough" match for it, > > from the pool of rename src candidates to find a good match for > > rename_dst[k] for i < k, but naive implementation of it would not > > work well for obvious reasons---rename_src[j] may match a lot > > better with rename_dst[k] than rename_dst[i] but we do not know > > that until we try to estimate similarity with rename_dst[k]. > > You may really like the next two series I submit. I have a smarter > way to find a "good enough" match (comparing to exactly one other file > and often finding sufficient similarity), and one that'll make > intuitive sense to users. Here's a really old thread with an approach that may or may not be similar to what you're thinking of: https://lore.kernel.org/git/596909b30710220240g665054d8hc40bc5d2234ba9e1@mail.gmail.com/ Though maybe start with this summary message: https://lore.kernel.org/git/596909b30710220309l1a28e646r9fd47f967dc32574@mail.gmail.com/ I remember experimenting some with it at the time, but never making things faster. It's entirely possible (likely, even) that I was simply not doing it well enough. It's also been long enough since I looked at the rename code that I'm not sure how different it is in practice. It still has a quadratic matrix in the end. We basically do a similar strategy of rolling-hash-fingerprint-and-see-where-things-collide, but I think we may end up with more work during the quadratic part (again, it's been a while, so I may just be totally off-base). I've also wondered if something similar might be helpful for delta compression (which is now done with an O(m*n) sliding window). -Peff ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v2 0/2] Optimization batch 6: make full use of exact renames 2021-02-03 23:36 ` Jeff King @ 2021-02-04 0:05 ` Elijah Newren 0 siblings, 0 replies; 25+ messages in thread From: Elijah Newren @ 2021-02-04 0:05 UTC (permalink / raw) To: Jeff King Cc: Junio C Hamano, Elijah Newren via GitGitGadget, Git Mailing List, Derrick Stolee, Jonathan Tan, Taylor Blau, Karsten Blees, Derrick Stolee On Wed, Feb 3, 2021 at 3:37 PM Jeff King <peff@peff.net> wrote: > > On Wed, Feb 03, 2021 at 03:06:26PM -0800, Elijah Newren wrote: > > > > In an early attempt, I tried to retire rename_src[j], once > > > rename_dst[i] has been found to be a "good enough" match for it, > > > from the pool of rename src candidates to find a good match for > > > rename_dst[k] for i < k, but naive implementation of it would not > > > work well for obvious reasons---rename_src[j] may match a lot > > > better with rename_dst[k] than rename_dst[i] but we do not know > > > that until we try to estimate similarity with rename_dst[k]. > > > > You may really like the next two series I submit. I have a smarter > > way to find a "good enough" match (comparing to exactly one other file > > and often finding sufficient similarity), and one that'll make > > intuitive sense to users. > > Here's a really old thread with an approach that may or may not be > similar to what you're thinking of: > > https://lore.kernel.org/git/596909b30710220240g665054d8hc40bc5d2234ba9e1@mail.gmail.com/ > > Though maybe start with this summary message: > > https://lore.kernel.org/git/596909b30710220309l1a28e646r9fd47f967dc32574@mail.gmail.com/ Interesting thread; thanks for the link. It's not remotely similar to what I have done, but a brief glance through it reminds me of the ideas at https://github.com/gitgitgadget/git/issues/519. > I remember experimenting some with it at the time, but never making > things faster. It's entirely possible (likely, even) that I was simply > not doing it well enough. > > It's also been long enough since I looked at the rename code that I'm > not sure how different it is in practice. It still has a quadratic > matrix in the end. We basically do a similar strategy of > rolling-hash-fingerprint-and-see-where-things-collide, but I think we > may end up with more work during the quadratic part (again, it's been > a while, so I may just be totally off-base). I'm not sure if I should spoil the surprise for the patchsets I haven't yet submitted but... when I get done, for the testcases I have looked at, rename detection is no longer the slowest part of merging/rebasing/cherry-picking -- not even when there are tens of thousands of renames on one side of history. And I didn't achieve that by making other parts of the code slower. If someone can come up with a real-world testcase where rename detection is still really slow in a merge/rebase/cherry-pick with my implemented optimizations, I've got at least one more optimization that I hadn't bothered implementing because everything was fast enough for me already. And the idea you linked above and the ideas in the gitgitgadget issue linked above all have more possibilities that are complementary to what I have done that might help further. > I've also wondered if something similar might be helpful for delta > compression (which is now done with an O(m*n) sliding window). ^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH v3 0/2] Optimization batch 6: make full use of exact renames 2021-02-03 20:03 ` [PATCH v2 0/2] Optimization batch 6: make full use of exact renames Elijah Newren via GitGitGadget ` (2 preceding siblings ...) 2021-02-03 21:56 ` [PATCH v2 0/2] Optimization batch 6: make full use of exact renames Junio C Hamano @ 2021-02-14 7:34 ` Elijah Newren via GitGitGadget 2021-02-14 7:35 ` [PATCH v3 1/2] diffcore-rename: no point trying to find a match better than exact Elijah Newren via GitGitGadget 2021-02-14 7:35 ` [PATCH v3 2/2] diffcore-rename: filter rename_src list when possible Elijah Newren via GitGitGadget 3 siblings, 2 replies; 25+ messages in thread From: Elijah Newren via GitGitGadget @ 2021-02-14 7:34 UTC (permalink / raw) To: git Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano, Jeff King, Karsten Blees, Derrick Stolee, Elijah Newren, Elijah Newren This series makes full use of exact renames; removing not only a destination pair, but a source pair as well when an exact rename is found and copy detection is not turned on. Changes since v2: * Fix a comment typo, and fix a multi-line comment that didn't need to be a multi-line comment Elijah Newren (2): diffcore-rename: no point trying to find a match better than exact diffcore-rename: filter rename_src list when possible diffcore-rename.c | 71 ++++++++++++++++++++++++++++++++++++++++------- 1 file changed, 61 insertions(+), 10 deletions(-) base-commit: f0117958910fbc734457a83a9f8ecc3c62463417 Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-842%2Fnewren%2Fort-perf-batch-6-v3 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-842/newren/ort-perf-batch-6-v3 Pull-Request: https://github.com/gitgitgadget/git/pull/842 Range-diff vs v2: 1: 770e894b4abd = 1: a59c1960f614 diffcore-rename: no point trying to find a match better than exact 2: 7ae9460d3dba ! 2: dd6595b45640 diffcore-rename: filter rename_src list when possible @@ diffcore-rename.c: static int find_renames(struct diff_score *mx, int dst_cnt, i + if (detecting_copies) + return; /* nothing to remove */ + if (break_idx) -+ return; /* culling incompatbile with break detection */ ++ return; /* culling incompatible with break detection */ + + /* + * Note on reasons why we cull unneeded sources but not destinations: @@ diffcore-rename.c: static int find_renames(struct diff_score *mx, int dst_cnt, i { int detect_rename = options->detect_rename; @@ diffcore-rename.c: void diffcore_rename(struct diff_options *options) + if (minimum_score == MAX_SCORE) goto cleanup; - /* +- /* - * Calculate how many renames are left (but all the source - * files still remain as options for rename/copies!) -+ * Calculate how many renames are left - */ +- */ ++ /* Calculate how many renames are left */ num_destinations = (rename_dst_nr - rename_count); + remove_unneeded_paths_from_src(want_copies); num_sources = rename_src_nr; -- gitgitgadget ^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH v3 1/2] diffcore-rename: no point trying to find a match better than exact 2021-02-14 7:34 ` [PATCH v3 " Elijah Newren via GitGitGadget @ 2021-02-14 7:35 ` Elijah Newren via GitGitGadget 2021-02-14 7:35 ` [PATCH v3 2/2] diffcore-rename: filter rename_src list when possible Elijah Newren via GitGitGadget 1 sibling, 0 replies; 25+ messages in thread From: Elijah Newren via GitGitGadget @ 2021-02-14 7:35 UTC (permalink / raw) To: git Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano, Jeff King, Karsten Blees, Derrick Stolee, Elijah Newren, Elijah Newren, Elijah Newren From: Elijah Newren <newren@gmail.com> diffcore_rename() had some code to avoid having destination paths that already had an exact rename detected from being re-checked for other renames. Source paths, however, were re-checked because we wanted to allow the possibility of detecting copies. But if copy detection isn't turned on, then this merely amounts to attempting to find a better-than-exact match, which naturally ends up being an expensive no-op. In particular, copy detection is never turned on by the merge machinery. For the testcases mentioned in commit 557ac0350d ("merge-ort: begin performance work; instrument with trace2_region_* calls", 2020-10-28), this change improves the performance as follows: Before After no-renames: 14.263 s ± 0.053 s 14.119 s ± 0.101 s mega-renames: 5504.231 s ± 5.150 s 1802.044 s ± 0.828 s just-one-mega: 158.534 s ± 0.498 s 51.391 s ± 0.028 s Signed-off-by: Elijah Newren <newren@gmail.com> --- diffcore-rename.c | 20 ++++++++++++++------ 1 file changed, 14 insertions(+), 6 deletions(-) diff --git a/diffcore-rename.c b/diffcore-rename.c index 8fe6c9384bcb..8b118628b4ef 100644 --- a/diffcore-rename.c +++ b/diffcore-rename.c @@ -463,9 +463,11 @@ void diffcore_rename(struct diff_options *options) struct diff_score *mx; int i, j, rename_count, skip_unmodified = 0; int num_destinations, dst_cnt; + int num_sources, want_copies; struct progress *progress = NULL; trace2_region_enter("diff", "setup", options->repo); + want_copies = (detect_rename == DIFF_DETECT_COPY); if (!minimum_score) minimum_score = DEFAULT_RENAME_SCORE; @@ -502,7 +504,7 @@ void diffcore_rename(struct diff_options *options) p->one->rename_used++; register_rename_src(p); } - else if (detect_rename == DIFF_DETECT_COPY) { + else if (want_copies) { /* * Increment the "rename_used" score by * one, to indicate ourselves as a user. @@ -532,12 +534,15 @@ void diffcore_rename(struct diff_options *options) * files still remain as options for rename/copies!) */ num_destinations = (rename_dst_nr - rename_count); + num_sources = rename_src_nr; + if (!want_copies) + num_sources -= rename_count; /* All done? */ - if (!num_destinations) + if (!num_destinations || !num_sources) goto cleanup; - switch (too_many_rename_candidates(num_destinations, rename_src_nr, + switch (too_many_rename_candidates(num_destinations, num_sources, options)) { case 1: goto cleanup; @@ -553,7 +558,7 @@ void diffcore_rename(struct diff_options *options) if (options->show_rename_progress) { progress = start_delayed_progress( _("Performing inexact rename detection"), - (uint64_t)num_destinations * (uint64_t)rename_src_nr); + (uint64_t)num_destinations * (uint64_t)num_sources); } mx = xcalloc(st_mult(NUM_CANDIDATE_PER_DST, num_destinations), @@ -573,6 +578,9 @@ void diffcore_rename(struct diff_options *options) struct diff_filespec *one = rename_src[j].p->one; struct diff_score this_src; + if (one->rename_used && !want_copies) + continue; + if (skip_unmodified && diff_unmodified_pair(rename_src[j].p)) continue; @@ -594,7 +602,7 @@ void diffcore_rename(struct diff_options *options) } dst_cnt++; display_progress(progress, - (uint64_t)dst_cnt * (uint64_t)rename_src_nr); + (uint64_t)dst_cnt * (uint64_t)num_sources); } stop_progress(&progress); @@ -602,7 +610,7 @@ void diffcore_rename(struct diff_options *options) STABLE_QSORT(mx, dst_cnt * NUM_CANDIDATE_PER_DST, score_compare); rename_count += find_renames(mx, dst_cnt, minimum_score, 0); - if (detect_rename == DIFF_DETECT_COPY) + if (want_copies) rename_count += find_renames(mx, dst_cnt, minimum_score, 1); free(mx); trace2_region_leave("diff", "inexact renames", options->repo); -- gitgitgadget ^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH v3 2/2] diffcore-rename: filter rename_src list when possible 2021-02-14 7:34 ` [PATCH v3 " Elijah Newren via GitGitGadget 2021-02-14 7:35 ` [PATCH v3 1/2] diffcore-rename: no point trying to find a match better than exact Elijah Newren via GitGitGadget @ 2021-02-14 7:35 ` Elijah Newren via GitGitGadget 1 sibling, 0 replies; 25+ messages in thread From: Elijah Newren via GitGitGadget @ 2021-02-14 7:35 UTC (permalink / raw) To: git Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano, Jeff King, Karsten Blees, Derrick Stolee, Elijah Newren, Elijah Newren, Elijah Newren From: Elijah Newren <newren@gmail.com> We have to look at each entry in rename_src a total of rename_dst_nr times. When we're not detecting copies, any exact renames or ignorable rename paths will just be skipped over. While checking that these can be skipped over is a relatively cheap check, it's still a waste of time to do that check more than once, let alone rename_dst_nr times. When rename_src_nr is a few thousand times bigger than the number of relevant sources (such as when cherry-picking a commit that only touched a handful of files, but from a side of history that has different names for some high level directories), this time can add up. First make an initial pass over the rename_src array and move all the relevant entries to the front, so that we can iterate over just those relevant entries. For the testcases mentioned in commit 557ac0350d ("merge-ort: begin performance work; instrument with trace2_region_* calls", 2020-10-28), this change improves the performance as follows: Before After no-renames: 14.119 s ± 0.101 s 13.815 s ± 0.062 s mega-renames: 1802.044 s ± 0.828 s 1799.937 s ± 0.493 s just-one-mega: 51.391 s ± 0.028 s 51.289 s ± 0.019 s Signed-off-by: Elijah Newren <newren@gmail.com> --- diffcore-rename.c | 59 ++++++++++++++++++++++++++++++++++++++++------- 1 file changed, 51 insertions(+), 8 deletions(-) diff --git a/diffcore-rename.c b/diffcore-rename.c index 8b118628b4ef..6fd0c4a2f485 100644 --- a/diffcore-rename.c +++ b/diffcore-rename.c @@ -454,6 +454,54 @@ static int find_renames(struct diff_score *mx, int dst_cnt, int minimum_score, i return count; } +static void remove_unneeded_paths_from_src(int detecting_copies) +{ + int i, new_num_src; + + if (detecting_copies) + return; /* nothing to remove */ + if (break_idx) + return; /* culling incompatible with break detection */ + + /* + * Note on reasons why we cull unneeded sources but not destinations: + * 1) Pairings are stored in rename_dst (not rename_src), which we + * need to keep around. So, we just can't cull rename_dst even + * if we wanted to. But doing so wouldn't help because... + * + * 2) There is a matrix pairwise comparison that follows the + * "Performing inexact rename detection" progress message. + * Iterating over the destinations is done in the outer loop, + * hence we only iterate over each of those once and we can + * easily skip the outer loop early if the destination isn't + * relevant. That's only one check per destination path to + * skip. + * + * By contrast, the sources are iterated in the inner loop; if + * we check whether a source can be skipped, then we'll be + * checking it N separate times, once for each destination. + * We don't want to have to iterate over known-not-needed + * sources N times each, so avoid that by removing the sources + * from rename_src here. + */ + for (i = 0, new_num_src = 0; i < rename_src_nr; i++) { + /* + * renames are stored in rename_dst, so if a rename has + * already been detected using this source, we can just + * remove the source knowing rename_dst has its info. + */ + if (rename_src[i].p->one->rename_used) + continue; + + if (new_num_src < i) + memcpy(&rename_src[new_num_src], &rename_src[i], + sizeof(struct diff_rename_src)); + new_num_src++; + } + + rename_src_nr = new_num_src; +} + void diffcore_rename(struct diff_options *options) { int detect_rename = options->detect_rename; @@ -529,14 +577,10 @@ void diffcore_rename(struct diff_options *options) if (minimum_score == MAX_SCORE) goto cleanup; - /* - * Calculate how many renames are left (but all the source - * files still remain as options for rename/copies!) - */ + /* Calculate how many renames are left */ num_destinations = (rename_dst_nr - rename_count); + remove_unneeded_paths_from_src(want_copies); num_sources = rename_src_nr; - if (!want_copies) - num_sources -= rename_count; /* All done? */ if (!num_destinations || !num_sources) @@ -578,8 +622,7 @@ void diffcore_rename(struct diff_options *options) struct diff_filespec *one = rename_src[j].p->one; struct diff_score this_src; - if (one->rename_used && !want_copies) - continue; + assert(!one->rename_used || want_copies || break_idx); if (skip_unmodified && diff_unmodified_pair(rename_src[j].p)) -- gitgitgadget ^ permalink raw reply related [flat|nested] 25+ messages in thread
end of thread, other threads:[~2021-02-14 7:36 UTC | newest] Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2021-02-03 5:49 [PATCH 0/2] Optimization batch 6: make full use of exact renames Elijah Newren via GitGitGadget 2021-02-03 5:49 ` [PATCH 1/2] diffcore-rename: no point trying to find a match better than exact Elijah Newren via GitGitGadget 2021-02-03 11:44 ` Derrick Stolee 2021-02-03 16:31 ` Elijah Newren 2021-02-03 18:46 ` Junio C Hamano 2021-02-03 19:10 ` Elijah Newren 2021-02-03 5:49 ` [PATCH 2/2] diffcore-rename: filter rename_src list when possible Elijah Newren via GitGitGadget [not found] ` <13feb106-c3a7-a26d-0e6e-013aa45c58d4@gmail.com> 2021-02-03 17:12 ` Elijah Newren 2021-02-03 19:12 ` Junio C Hamano 2021-02-03 19:19 ` Elijah Newren 2021-02-03 20:03 ` [PATCH v2 0/2] Optimization batch 6: make full use of exact renames Elijah Newren via GitGitGadget 2021-02-03 20:03 ` [PATCH v2 1/2] diffcore-rename: no point trying to find a match better than exact Elijah Newren via GitGitGadget 2021-02-03 20:03 ` [PATCH v2 2/2] diffcore-rename: filter rename_src list when possible Elijah Newren via GitGitGadget 2021-02-13 1:04 ` Junio C Hamano 2021-02-13 4:24 ` Elijah Newren 2021-02-13 1:06 ` Junio C Hamano 2021-02-13 4:43 ` Elijah Newren 2021-02-03 21:56 ` [PATCH v2 0/2] Optimization batch 6: make full use of exact renames Junio C Hamano 2021-02-03 23:06 ` Elijah Newren 2021-02-03 23:26 ` Junio C Hamano 2021-02-03 23:36 ` Jeff King 2021-02-04 0:05 ` Elijah Newren 2021-02-14 7:34 ` [PATCH v3 " Elijah Newren via GitGitGadget 2021-02-14 7:35 ` [PATCH v3 1/2] diffcore-rename: no point trying to find a match better than exact Elijah Newren via GitGitGadget 2021-02-14 7:35 ` [PATCH v3 2/2] diffcore-rename: filter rename_src list when possible Elijah Newren via GitGitGadget
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.