From: Nicholas Piggin <npiggin@gmail.com>
To: Christian Borntraeger <borntraeger@de.ibm.com>,
Catalin Marinas <catalin.marinas@arm.com>,
Dave Hansen <dave.hansen@intel.com>,
Vasily Gorbik <gor@linux.ibm.com>,
Heiko Carstens <hca@linux.ibm.com>,
Andy Lutomirski <luto@kernel.org>, Will Deacon <will@kernel.org>
Cc: Anton Blanchard <anton@ozlabs.org>, Arnd Bergmann <arnd@arndb.de>,
linux-arch <linux-arch@vger.kernel.org>,
LKML <linux-kernel@vger.kernel.org>,
Linux-MM <linux-mm@kvack.org>,
linuxppc-dev <linuxppc-dev@lists.ozlabs.org>,
Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
Peter Zijlstra <peterz@infradead.org>, X86 ML <x86@kernel.org>
Subject: Re: [PATCH 6/8] lazy tlb: shoot lazies, a non-refcounting lazy tlb option
Date: Wed, 02 Dec 2020 13:47:40 +1000 [thread overview]
Message-ID: <1606879302.tdngvs3yq4.astroid@bobo.none> (raw)
In-Reply-To: <CALCETrXAR_9EGaOF8ymVkZycxgZkYk0dR+NjEpTfVzdcS3sOVw@mail.gmail.com>
Excerpts from Andy Lutomirski's message of December 1, 2020 4:31 am:
> other arch folk: there's some background here:
>
> https://lkml.kernel.org/r/CALCETrVXUbe8LfNn-Qs+DzrOQaiw+sFUg1J047yByV31SaTOZw@mail.gmail.com
>
> On Sun, Nov 29, 2020 at 12:16 PM Andy Lutomirski <luto@kernel.org> wrote:
>>
>> On Sat, Nov 28, 2020 at 7:54 PM Andy Lutomirski <luto@kernel.org> wrote:
>> >
>> > On Sat, Nov 28, 2020 at 8:02 AM Nicholas Piggin <npiggin@gmail.com> wrote:
>> > >
>> > > On big systems, the mm refcount can become highly contented when doing
>> > > a lot of context switching with threaded applications (particularly
>> > > switching between the idle thread and an application thread).
>> > >
>> > > Abandoning lazy tlb slows switching down quite a bit in the important
>> > > user->idle->user cases, so so instead implement a non-refcounted scheme
>> > > that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down
>> > > any remaining lazy ones.
>> > >
>> > > Shootdown IPIs are some concern, but they have not been observed to be
>> > > a big problem with this scheme (the powerpc implementation generated
>> > > 314 additional interrupts on a 144 CPU system during a kernel compile).
>> > > There are a number of strategies that could be employed to reduce IPIs
>> > > if they turn out to be a problem for some workload.
>> >
>> > I'm still wondering whether we can do even better.
>> >
>>
>> Hold on a sec.. __mmput() unmaps VMAs, frees pagetables, and flushes
>> the TLB. On x86, this will shoot down all lazies as long as even a
>> single pagetable was freed. (Or at least it will if we don't have a
>> serious bug, but the code seems okay. We'll hit pmd_free_tlb, which
>> sets tlb->freed_tables, which will trigger the IPI.) So, on
>> architectures like x86, the shootdown approach should be free. The
>> only way it ought to have any excess IPIs is if we have CPUs in
>> mm_cpumask() that don't need IPI to free pagetables, which could
>> happen on paravirt.
>
> Indeed, on x86, we do this:
>
> [ 11.558844] flush_tlb_mm_range.cold+0x18/0x1d
> [ 11.559905] tlb_finish_mmu+0x10e/0x1a0
> [ 11.561068] exit_mmap+0xc8/0x1a0
> [ 11.561932] mmput+0x29/0xd0
> [ 11.562688] do_exit+0x316/0xa90
> [ 11.563588] do_group_exit+0x34/0xb0
> [ 11.564476] __x64_sys_exit_group+0xf/0x10
> [ 11.565512] do_syscall_64+0x34/0x50
>
> and we have info->freed_tables set.
>
> What are the architectures that have large systems like?
>
> x86: we already zap lazies, so it should cost basically nothing to do
This is not zapping lazies, this is freeing the user page tables.
"lazy mm" is where a switch to a kernel thread takes on the
previous mm for its kernel mapping rather than switch to init_mm.
> a little loop at the end of __mmput() to make sure that no lazies are
> left. If we care about paravirt performance, we could implement one
> of the optimizations I mentioned above to fix up the refcounts instead
> of sending an IPI to any remaining lazies.
It might be possible x86's scheme you could scan mm_cpumask
carefully synchronized or something when the last user reference
gets dropped that frees the lazy at that point, but I don't know
what that would buy you because you're still having to maintain
the mm_cpumask on switches. powerpc's characteristics are just
different here so it makes sense whereas I don't know if it
would on x86.
>
> arm64: AFAICT arm64's flush uses magic arm64 hardware support for
> remote flushes, so any lazy mm references will still exist after
> exit_mmap(). (arm64 uses lazy TLB, right?) So this is kind of like
> the x86 paravirt case. Are there large enough arm64 systems that any
> of this matters?
>
> s390x: The code has too many acronyms for me to understand it fully,
> but I think it's more or less the same situation as arm64. How big do
> s390x systems come?
>
> power: Ridiculously complicated, seems to vary by system and kernel config.
>
> So, Nick, your unconditional IPI scheme is apparently a big
> improvement for power, and it should be an improvement and have low
> cost for x86.
As said, the tradeoffs are different, I'm not so sure. It was a big
improvement on a very big system with the powerpc mm_cpumask switching
model on a microbenchmark designed to stress this, which is about all
I can say for it.
> On arm64 and s390x it will add more IPIs on process
> exit but reduce contention on context switching depending on how lazy
> TLB works. I suppose we could try it for all architectures without
> any further optimizations.
It will remain opt-in but certainly try it out and see. There are some
requirements as documented in the config option text.
> Or we could try one of the perhaps
> excessively clever improvements I linked above. arm64, s390x people,
> what do you think?
>
I'm not against improvements to the scheme. e.g., from the patch
+ /*
+ * IPI overheads have not found to be expensive, but they could
+ * be reduced in a number of possible ways, for example (in
+ * roughly increasing order of complexity):
+ * - A batch of mms requiring IPIs could be gathered and freed
+ * at once.
+ * - CPUs could store their active mm somewhere that can be
+ * remotely checked without a lock, to filter out
+ * false-positives in the cpumask.
+ * - After mm_users or mm_count reaches zero, switching away
+ * from the mm could clear mm_cpumask to reduce some IPIs
+ * (some batching or delaying would help).
+ * - A delayed freeing and RCU-like quiescing sequence based on
+ * mm switching to avoid IPIs completely.
+ */
But would like to have numbers before being too clever.
Thanks,
Nick
next prev parent reply other threads:[~2020-12-02 3:47 UTC|newest]
Thread overview: 46+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-11-28 16:01 [PATCH 0/8] shoot lazy tlbs Nicholas Piggin
2020-11-28 16:01 ` [PATCH 1/8] lazy tlb: introduce exit_lazy_tlb Nicholas Piggin
2020-11-29 0:38 ` Andy Lutomirski
2020-12-02 2:49 ` Nicholas Piggin
2020-11-28 16:01 ` [PATCH 2/8] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode Nicholas Piggin
2020-11-28 17:55 ` Andy Lutomirski
2020-12-02 2:49 ` Nicholas Piggin
2020-12-03 5:09 ` Andy Lutomirski
2020-12-05 8:00 ` Nicholas Piggin
2020-12-05 16:11 ` Andy Lutomirski
2020-12-05 23:14 ` Nicholas Piggin
2020-12-06 0:36 ` Andy Lutomirski
2020-12-06 3:59 ` Nicholas Piggin
2020-12-11 0:11 ` Andy Lutomirski
2020-12-14 4:07 ` Nicholas Piggin
2020-12-14 5:53 ` Nicholas Piggin
2020-11-30 14:57 ` Mathieu Desnoyers
2020-11-28 16:01 ` [PATCH 3/8] x86: remove ARCH_HAS_SYNC_CORE_BEFORE_USERMODE Nicholas Piggin
2020-11-28 16:01 ` [PATCH 4/8] lazy tlb: introduce lazy mm refcount helper functions Nicholas Piggin
2020-11-28 16:01 ` [PATCH 5/8] lazy tlb: allow lazy tlb mm switching to be configurable Nicholas Piggin
2020-11-29 0:36 ` Andy Lutomirski
2020-12-02 2:49 ` Nicholas Piggin
2020-11-28 16:01 ` [PATCH 6/8] lazy tlb: shoot lazies, a non-refcounting lazy tlb option Nicholas Piggin
2020-11-29 3:54 ` Andy Lutomirski
2020-11-29 20:16 ` Andy Lutomirski
2020-11-30 9:25 ` Peter Zijlstra
2020-11-30 18:31 ` Andy Lutomirski
2020-12-01 21:27 ` Will Deacon
2020-12-01 21:50 ` Andy Lutomirski
2020-12-01 23:04 ` Will Deacon
2020-12-02 3:47 ` Nicholas Piggin [this message]
2020-12-03 5:05 ` Andy Lutomirski
2020-12-03 17:03 ` Alexander Gordeev
2020-12-03 17:14 ` Andy Lutomirski
2020-12-03 18:33 ` Alexander Gordeev
2020-11-30 9:26 ` Peter Zijlstra
2020-11-30 9:30 ` Peter Zijlstra
2020-11-30 9:34 ` Peter Zijlstra
2020-12-02 3:09 ` Nicholas Piggin
2020-12-02 11:17 ` Peter Zijlstra
2020-12-02 12:45 ` Peter Zijlstra
2020-12-02 14:19 ` Peter Zijlstra
2020-12-02 14:38 ` Andy Lutomirski
2020-12-02 16:29 ` Peter Zijlstra
2020-11-28 16:01 ` [PATCH 7/8] powerpc: use lazy mm refcount helper functions Nicholas Piggin
2020-11-28 16:01 ` [PATCH 8/8] powerpc/64s: enable MMU_LAZY_TLB_SHOOTDOWN Nicholas Piggin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1606879302.tdngvs3yq4.astroid@bobo.none \
--to=npiggin@gmail.com \
--cc=anton@ozlabs.org \
--cc=arnd@arndb.de \
--cc=borntraeger@de.ibm.com \
--cc=catalin.marinas@arm.com \
--cc=dave.hansen@intel.com \
--cc=gor@linux.ibm.com \
--cc=hca@linux.ibm.com \
--cc=linux-arch@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=linuxppc-dev@lists.ozlabs.org \
--cc=luto@kernel.org \
--cc=mathieu.desnoyers@efficios.com \
--cc=peterz@infradead.org \
--cc=will@kernel.org \
--cc=x86@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).