All of lore.kernel.org
 help / color / mirror / Atom feed
From: Elijah Newren <newren@gmail.com>
To: Derrick Stolee <stolee@gmail.com>
Cc: "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com>,
	"Git Mailing List" <git@vger.kernel.org>,
	"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
Subject: Re: [PATCH v2 7/7] merge-ort: restart merge with cached renames to reduce process entry cost
Date: Thu, 15 Jul 2021 09:53:49 -0700	[thread overview]
Message-ID: <CABPp-BF+gR8WtpWt_DVDoWe16R4B65h-59zGOZ5j4vUJKp_Nuw@mail.gmail.com> (raw)
In-Reply-To: <d91ed8a0-b37b-7dfa-10bf-e068f30e9691@gmail.com>

On Thu, Jul 15, 2021 at 8:10 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 7/13/2021 3:33 PM, Elijah Newren via GitGitGadget wrote:
> > From: Elijah Newren <newren@gmail.com>
> ...
> > Often, I noticed that when the optimization does not apply, it is
> > because there are a handful of relevant sources -- maybe even only one.
> > It felt frustrating to need to recurse into potentially hundreds or even
> > thousands of directories just for a single rename, but it was needed for
> > correctness.
> >
> > However, staring at this list of functions and noticing that
> > process_entries() is the most expensive and knowing I could avoid it if
> > I had cached renames suggested a simple idea: change
> >    collect_merge_info()
> >    detect_and_process_renames()
> >    process_entries()
> > into
> >    collect_merge_info()
> >    detect_and_process_renames()
> >    <cache all the renames, and restart>
> >    collect_merge_info()
> >    detect_and_process_renames()
> >    process_entries()
> >
> > This may seem odd and look like more work.  However, note that although
> > we run collect_merge_info() twice, the second time we get to employ
> > trivial directory resolves, which makes it much faster, so the increased
> > time in collect_merge_info() is small.  While we run
> > detect_and_process_renames() again, all renames are cached so it's
> > nearly a no-op (we don't call into diffcore_rename_extended() but we do
> > have a little bit of data structure checking and fixing up).  And the
> > big payoff comes from the fact that process_entries(), will be much
> > faster due to having far fewer entries to process.
>
> I enjoyed the story you tell here.

:-)

> > +     if (path_count_after) {
> > +             /*
> > +              * Not sure were the right cut-off is for the optimization
> > +              * to redo collect_merge_info after we've cached the
> > +              * regular renames is.  Basically, collect_merge_info(),
> > +              * detect_regular_renames(), and process_entries() are
> > +              * similar costs and all big tent poles.  Caching the
> > +              * result of detect_regular_renames() means that redoing
> > +              * that one function will cost us virtually 0 extra, so it
> > +              * depends on the other two functions, which are both O(N)
> > +              * cost in the number of paths.  Thus, it makes sense that
> > +              * if we can cut the number of paths in half, then redoing
> > +              * collect_merge_info() at half cost in order to get
> > +              * process_entries() at half cost should be about equal
> > +              * cost.  If we can cut by more than half, then we would
> > +              * win.  The fact that process_entries() is about 10%-20%
> > +              * more expensive than collect_merge_info() suggests we
> > +              * could make the factor be less than two.  The fact that
> > +              * even when we have renames cached, we still have to
> > +              * traverse down to the individual (relevant) renames,
> > +              * which suggests we should perhaps use a bigger factor.
> > +              *
> > +              * The exact number isn't critical, since the code will
> > +              * work even if we get the factor wrong -- it just might be
> > +              * slightly slower if we're a bit off.  For now, just error
> > +              * on the side of a bigger fudge.  For the linux kernel
>
> super-nit: s/linux/Linux/

Yeah, I tend to refer to projects by the name of their repository
instead of their proper name.  (I do it with git too.)  Bad habit.
Will fix.  That is, I will fix this instance.  Not sure I can fix the
habit.

> > +              * testcases I was looking at with massive renames, the
> > +              * ratio came in around 50 to 250, which clearly would
> > +              * trigger this optimization and provided some *very* nice
> > +              * speedups.
>
> This bit of your testing might be more appropriate for your commit
> message. This discussion of a test made at a certain point in time
> is more likely to go stale than the description of how this factor
> does not change correctness, only performance.

The commit message does include discussion on how this factor only
changes performance, not correctness.  I left this comment in the code
because I figured it looked weird and magic and deserved an
explanation without resorting to git-log or git-blame.  Granted, I
wrote this comment and the commit message at much different times (I
wrote the comment first, then the commit message many months later)
and thus summarized a bit differently.  But I thought I had the same
relevant content in both places.

Are there pieces you feel are missing from the commit message?  Should
I trim this comment down in the code and just let people look for the
commit message for more details?

> > +              */
> > +             int wanted_factor = 3;
>
> and perhaps make it 'const'?

Sure, will do.

> > +
> > +             /* We should only redo collect_merge_info one time */
> > +             assert(renames->redo_after_renames == 0);
> > +
> > +             if (path_count_after / path_count_before > wanted_factor) {
>
> With truncation from integer division, this condition is equivalent* to
>
>         path_count_after >= 4 * path_count_before
>
> Or, do you want to change this to a ">=" so that the factor of 3 seems
> more accurate?
>
> *I understand the intention of using division to avoid (unlikely)
> overflow via multiplication. The truncation is causing some confusion.

Good catch; I'll fix it up to use ">=".

> > -test_expect_merge_algorithm failure failure '12f: Trivial directory resolve, caching, all kinds of fun' '
> > +test_expect_merge_algorithm failure success '12f: Trivial directory resolve, caching, all kinds of fun' '
> >       test_setup_12f &&
> >       (
> >               cd 12f &&
> >
>
> Oh, and a bonus test success! Excellent.

Yeah, this testcase was slightly weird in the order I sent it
upstream.  12f was written specifically with this optimization in mind
as a way of ensuring code coverage of the restart logic.  I would have
waited to submit the 12f testcase with this series, but the testcase
also demonstrated an important directory rename detection bug that I
found existed in both merge-recursive and merge-ort at the time.  See
commit 902c521a35 ("t6423: more involved directory rename test",
2020-10-15)

The merge-ort bug was fixed with commit 203c872c4f ("merge-ort: fix a
directory rename detection bug", 2021-01-19).  The merge-recursive bug
still exists.

So, this series just fixed up the final thing that 12f was testing for
-- namely that it included two collect_merge_info() region_enter
trace2 calls per commit instead of just one.

Perhaps I could have split the test, but both things require a
relatively big set of files which makes the test a bit more expensive
so I didn't want to duplicate it.  Besides, having both factors
involved makes it a better stress test of the restart logic.

  reply	other threads:[~2021-07-15 16:54 UTC|newest]

Thread overview: 52+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-07-01  3:46 [PATCH 0/7] Optimization batch 14: trivial directory resolution Elijah Newren via GitGitGadget
2021-07-01  3:46 ` [PATCH 1/7] merge-ort: resolve paths early when we have sufficient information Elijah Newren via GitGitGadget
2021-07-01  3:46 ` [PATCH 2/7] merge-ort: add some more explanations in collect_merge_info_callback() Elijah Newren via GitGitGadget
2021-07-01  3:46 ` [PATCH 3/7] merge-ort: add data structures for allowable trivial directory resolves Elijah Newren via GitGitGadget
2021-07-01  3:46 ` [PATCH 4/7] merge-ort: add a handle_deferred_entries() helper function Elijah Newren via GitGitGadget
2021-07-01  3:46 ` [PATCH 5/7] merge-ort: defer recursing into directories when merge base is matched Elijah Newren via GitGitGadget
2021-07-01  3:46 ` [PATCH 6/7] merge-ort: avoid recursing into directories when we don't need to Elijah Newren via GitGitGadget
2021-07-01  3:46 ` [PATCH 7/7] merge-ort: restart merge with cached renames to reduce process entry cost Elijah Newren via GitGitGadget
2021-07-01 13:21 ` [PATCH 0/7] Optimization batch 14: trivial directory resolution Ævar Arnfjörð Bjarmason
2021-07-01 15:04   ` Elijah Newren
2021-07-01 19:22     ` Elijah Newren
2021-07-13 19:32 ` [PATCH v2 " Elijah Newren via GitGitGadget
2021-07-13 19:32   ` [PATCH v2 1/7] merge-ort: resolve paths early when we have sufficient information Elijah Newren via GitGitGadget
2021-07-13 19:32   ` [PATCH v2 2/7] merge-ort: add some more explanations in collect_merge_info_callback() Elijah Newren via GitGitGadget
2021-07-13 23:34     ` Bagas Sanjaya
2021-07-14  0:19       ` Elijah Newren
2021-07-13 19:32   ` [PATCH v2 3/7] merge-ort: add data structures for allowable trivial directory resolves Elijah Newren via GitGitGadget
2021-07-15 13:54     ` Derrick Stolee
2021-07-15 15:54       ` Elijah Newren
2021-07-13 19:33   ` [PATCH v2 4/7] merge-ort: add a handle_deferred_entries() helper function Elijah Newren via GitGitGadget
2021-07-15 14:32     ` Derrick Stolee
2021-07-15 15:59       ` Elijah Newren
2021-07-13 19:33   ` [PATCH v2 5/7] merge-ort: defer recursing into directories when merge base is matched Elijah Newren via GitGitGadget
2021-07-15 14:43     ` Derrick Stolee
2021-07-15 16:03       ` Elijah Newren
2021-07-15 17:14         ` Derrick Stolee
2021-07-13 19:33   ` [PATCH v2 6/7] merge-ort: avoid recursing into directories when we don't need to Elijah Newren via GitGitGadget
2021-07-15 14:55     ` Derrick Stolee
2021-07-15 16:28       ` Elijah Newren
2021-07-13 19:33   ` [PATCH v2 7/7] merge-ort: restart merge with cached renames to reduce process entry cost Elijah Newren via GitGitGadget
2021-07-15 15:09     ` Derrick Stolee
2021-07-15 16:53       ` Elijah Newren [this message]
2021-07-15 17:19         ` Derrick Stolee
2021-07-15 17:32           ` Elijah Newren
2021-07-16  5:22   ` [PATCH v3 0/7] Optimization batch 14: trivial directory resolution Elijah Newren via GitGitGadget
2021-07-16  5:22     ` [PATCH v3 1/7] merge-ort: resolve paths early when we have sufficient information Elijah Newren via GitGitGadget
2021-07-16  5:22     ` [PATCH v3 2/7] merge-ort: add some more explanations in collect_merge_info_callback() Elijah Newren via GitGitGadget
2021-07-16  5:22     ` [PATCH v3 3/7] merge-ort: add data structures for allowable trivial directory resolves Elijah Newren via GitGitGadget
2021-07-16  5:22     ` [PATCH v3 4/7] merge-ort: add a handle_deferred_entries() helper function Elijah Newren via GitGitGadget
2021-07-16  5:22     ` [PATCH v3 5/7] merge-ort: defer recursing into directories when merge base is matched Elijah Newren via GitGitGadget
2021-07-16  5:22     ` [PATCH v3 6/7] merge-ort: avoid recursing into directories when we don't need to Elijah Newren via GitGitGadget
2021-07-16  5:22     ` [PATCH v3 7/7] merge-ort: restart merge with cached renames to reduce process entry cost Elijah Newren via GitGitGadget
2021-07-20 13:00     ` [PATCH v3 0/7] Optimization batch 14: trivial directory resolution Derrick Stolee
2021-07-20 21:43       ` Junio C Hamano
2021-07-21  4:23     ` [PATCH v4 " Elijah Newren via GitGitGadget
2021-07-21  4:23       ` [PATCH v4 1/7] merge-ort: resolve paths early when we have sufficient information Elijah Newren via GitGitGadget
2021-07-21  4:23       ` [PATCH v4 2/7] merge-ort: add some more explanations in collect_merge_info_callback() Elijah Newren via GitGitGadget
2021-07-21  4:24       ` [PATCH v4 3/7] merge-ort: add data structures for allowable trivial directory resolves Elijah Newren via GitGitGadget
2021-07-21  4:24       ` [PATCH v4 4/7] merge-ort: add a handle_deferred_entries() helper function Elijah Newren via GitGitGadget
2021-07-21  4:24       ` [PATCH v4 5/7] merge-ort: defer recursing into directories when merge base is matched Elijah Newren via GitGitGadget
2021-07-21  4:24       ` [PATCH v4 6/7] merge-ort: avoid recursing into directories when we don't need to Elijah Newren via GitGitGadget
2021-07-21  4:24       ` [PATCH v4 7/7] merge-ort: restart merge with cached renames to reduce process entry cost Elijah Newren via GitGitGadget

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CABPp-BF+gR8WtpWt_DVDoWe16R4B65h-59zGOZ5j4vUJKp_Nuw@mail.gmail.com \
    --to=newren@gmail.com \
    --cc=avarab@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=stolee@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.