From: "Andy Lutomirski" <luto@kernel.org>
To: "Nicholas Piggin" <npiggin@gmail.com>,
	"Linus Torvalds" <torvalds@linux-foundation.org>
Cc: "Andrew Morton" <akpm@linux-foundation.org>,
	"Anton Blanchard" <anton@ozlabs.org>,
	"Benjamin Herrenschmidt" <benh@kernel.crashing.org>,
	"Catalin Marinas" <catalin.marinas@arm.com>,
	"Dave Hansen" <dave.hansen@intel.com>,
	linux-arch <linux-arch@vger.kernel.org>,
	Linux-MM <linux-mm@kvack.org>,
	"Mathieu Desnoyers" <mathieu.desnoyers@efficios.com>,
	"Nadav Amit" <nadav.amit@gmail.com>,
	"Paul Mackerras" <paulus@ozlabs.org>,
	"Peter Zijlstra (Intel)" <peterz@infradead.org>,
	"Randy Dunlap" <rdunlap@infradead.org>,
	"Rik van Riel" <riel@surriel.com>,
	"Will Deacon" <will@kernel.org>,
	"the arch/x86 maintainers" <x86@kernel.org>
Subject: Re: [PATCH 16/23] sched: Use lightweight hazard pointers to grab lazy mms
Date: Tue, 11 Jan 2022 07:39:39 -0800
Message-ID: <c35e696f-7463-49f6-ab89-793ba2dba6bf@www.fastmail.com>
In-Reply-To: <1641867901.xve7qy78q6.astroid@bobo.none>



On Mon, Jan 10, 2022, at 7:10 PM, Nicholas Piggin wrote:
> Excerpts from Andy Lutomirski's message of January 11, 2022 6:52 am:
>> 
>> 
>> On Sun, Jan 9, 2022, at 8:56 PM, Nicholas Piggin wrote:
>>> Excerpts from Linus Torvalds's message of January 10, 2022 7:51 am:
>>>> [ Ugh, I actually went back and looked at Nick's patches again, to
>>>> just verify my memory, and they weren't as pretty as I thought they
>>>> were ]
>>>> 
>>>> On Sun, Jan 9, 2022 at 12:48 PM Linus Torvalds
>>>> <torvalds@linux-foundation.org> wrote:
>>>>>
>>>>> I'd much rather have a *much* smaller patch that says "on x86 and
>>>>> powerpc, we don't need this overhead at all".
>>>> 
>>>> For some reason I thought Nick's patch worked at "last mmput" time and
>>>> the TLB flush IPIs that happen at that point anyway would then make
>>>> sure any lazy TLB is cleaned up.
>>>> 
>>>> But that's not actually what it does. It ties the
>>>> MMU_LAZY_TLB_REFCOUNT option to an explicit TLB shootdown triggered
>>>> by the last mmdrop() instead, because it really tied the whole logic
>>>> to the mm_count logic (and made lazy tlb not take mm_count
>>>> references) rather than the mm_users thing I mis-remembered it doing.
>>>
>>> It does this because on powerpc with hash MMU, we can't use IPIs for
>>> TLB shootdowns.
>> 
>> I know nothing about powerpc’s MMU. If you can’t do IPI shootdowns,
>
> The paravirtualised hash MMU environment doesn't because it has a single 
> level translation and the guest uses hypercalls to insert and remove 
> translations and the hypervisor flushes TLBs. The HV could flush TLBs
> with IPIs but obviously the guest can't use those to execute the lazy
> switch. In radix guests (and all bare metal) the OS flushes its own
> TLBs.
>
> We are moving over to radix, but powerpc also has a hardware broadcast
> flush instruction which can be a bit faster than IPIs and is usable by
> bare metal and radix guests, so those can also avoid the IPIs if they
> want. Part of the powerpc patch I just sent to combine the lazy switch
> with the final TLB flush is to force it to always take the IPI path and
> not use the TLBIE instruction on the final exit.
>
> So hazard pointers could avoid some IPIs there too.
>
>> it sounds like the hazard pointer scheme might actually be pretty good.
>
> Some IPIs in the exit path just aren't that big a concern. I measured,
> got numbers, tried to irritate it, and it just wasn't really a problem.
> Some archs use IPIs for all threaded TLB shootdowns, and exits aren't
> that frequent. Very fast, short-lived processes that do a lot of exits
> just don't tend to spread across a lot of CPUs leaving lazy tlb mms to
> shoot, and long-lived, multi-threaded ones that do spread don't exit at
> high rates.
>
> So from what I can see it's premature optimization. Actually, maybe not
> even optimization, because IIRC it adds complexity and even a barrier
> on powerpc in the context switch path, which is a lot more critical
> than exit() for us; we don't want slowdowns there.
>
> It's a pretty high-complexity, boutique kind of synchronization. Which,
> don't get me wrong, is the kind of thing I like; it is clever and may
> be perfectly bug-free, but it needs to prove itself over the simple,
> dumb shoot-lazies approach.
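
For anyone following along, the shoot-lazies approach being discussed is
roughly the following. This is just a sketch with invented helper names,
not Nick's actual patch:

/* Sketch only; helper names are made up for illustration. */
static void exit_lazy_tlb_ipi(void *info)
{
	struct mm_struct *mm = info;

	/* A CPU lazily using mm is running a task with mm == NULL. */
	if (!current->mm && current->active_mm == mm) {
		switch_mm_irqs_off(mm, &init_mm, current);
		current->active_mm = &init_mm;
	}
}

static void shoot_lazy_tlbs(struct mm_struct *mm)
{
	int cpu;

	/*
	 * No mm_count references were taken for lazy uses, so before
	 * the mm is freed, kick every CPU that might still have it as
	 * its lazy mm over to init_mm.  On archs that flush TLBs with
	 * IPIs anyway, this can piggyback on the final shootdown.
	 */
	for_each_cpu(cpu, mm_cpumask(mm))
		smp_call_function_single(cpu, exit_lazy_tlb_ipi, mm, 1);
}

The appeal is that all of the cost lands in the final teardown path
rather than in the context switch path.
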
>
>>>> So at least some of my arguments were based on me just mis-remembering
>>>> what Nick's patch actually did (mainly because I mentally recreated
>>>> the patch from "Nick did something like this" and what I thought would
>>>> be the way to do it on x86).
>>>
>>> On powerpc with the radix MMU using IPI-based shootdowns, we can
>>> actually do the switch-away-from-lazy on the final TLB flush and the
>>> final broadcast shootdown thing becomes a no-op. I didn't post that
>>> additional patch because it's powerpc-specific and I didn't want to
>>> post more code so widely.
>>>
>>>> So I guess I have to recant my arguments.
>>>> 
>>>> I still think my "get rid of lazy at last mmput" model should work,
>>>> and would be a perfect match for x86, but I can't really point to Nick
>>>> having done that.
>>>> 
>>>> So I was full of BS.
>>>> 
>>>> Hmm. I'd love to try to actually create a patch that does that "Nick
>>>> thing", but on last mmput() (ie when __mmput triggers). Because I
>>>> think this is interesting. But then I look at my schedule for the
>>>> upcoming week, and I go "I don't have a leg to stand on in this
>>>> discussion, and I'm just all hot air".
>>>
>>> I agree Andy's approach is very complicated and adds more overhead
>>> than necessary for powerpc, which is why I don't want to use it. I'm
>>> still not entirely sure what the big problem would be to convert x86
>>> to use it; I admit I haven't kept up with the exact details of its
>>> lazy tlb mm handling recently, though.
>> 
>> The big problem is the entire remainder of this series!  If x86 is going to do shootdowns without mm_count, I want the result to work and be maintainable. A few of the issues that needed solving:
>> 
>> - x86 tracks usage of the lazy mm on CPUs that have it loaded into the MMU, not CPUs that have it in active_mm.  Getting this in sync needed core changes.
>
> Definitely should have been done at the time x86 deviated, but better 
> late than never.
>

I suspect that this code may predate there being anything for x86 to deviate from.
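
To illustrate the mismatch: core code reasons about laziness via
active_mm, while x86 tracks which mm is actually live in the MMU per
CPU, so the two views can disagree once x86 unlazies a CPU from a flush
IPI. Roughly:

	/* The core scheduler's idea of the lazy mm: */
	struct mm_struct *core_view = current->active_mm;

	/* x86's idea: whatever is actually loaded into CR3: */
	struct mm_struct *x86_view =
		this_cpu_read(cpu_tlbstate.loaded_mm);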

>> 
>> - mmgrab and mmdrop are barriers, and core code relies on that. If we get rid of a bunch of calls (conditionally), we need to stop depending on the barriers. I fixed this.
>
> membarrier relied on a barrier that mmdrop was providing. Adding an
> smp_mb() instead, if mmdrop is a no-op, is fine. Patches changing
> membarrier's ordering requirements can be concurrent and are not
> fundamentally tied to lazy tlb mm switching; it just reuses an existing
> ordering point.

smp_mb() is rather expensive on x86 at least, and I think on powerpc too.  Let's not.
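
For reference, the ordering point we're talking about is roughly the
tail of finish_task_switch(); this is a sketch, not the actual code:

	if (mm) {
		/*
		 * mmdrop() does atomic_dec_and_test(&mm->mm_count),
		 * and a value-returning atomic is a full barrier.
		 * membarrier() needs a full barrier between the
		 * rq->curr update and the return to user mode.
		 */
		mmdrop(mm);
	} else {
		/*
		 * If a shoot-lazies config compiles the mmdrop()
		 * away, something else would have to provide that
		 * barrier, e.g. an explicit smp_mb() -- the cost
		 * I'd rather not pay.
		 */
		smp_mb();
	}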

Also, IMO my cleanups here generally make sense and make the code better, so I think we should just go with them.

>
>> - There were too many mmgrab and mmdrop calls, and the call sites had different semantics and different refcounting rules (thanks, kthread).  I cleaned this up.
>
> Seems like a decent cleanup. Again, not lazy tlb specific: general
> switch code should be factored and better contained in kernel/sched/,
> which is fine, but that's concurrent to lazy tlb improvements.

I personally rather dislike the model of:

...
mmgrab_lazy();
...
mmdrop_lazy();
...
mmgrab_lazy();
...
mmdrop_lazy();
...

where the different calls have incompatible logic at the call sites and a magic config option NOPs them all away and adds barriers.  I'm personally a big fan of cleaning up code before making it even more complicated.

>
>> - If we do a shootdown instead of a refcount, then, when exit() tears down its mm, we are lazily using *that* mm when we do the shootdowns. If active_mm continues to point to the being-freed mm and an NMI tries to dereference it, we’re toast. I fixed those issues.
>
> My shoot lazies patch has no such issue with that AFAIKS. What exactly
> is the issue and where did you fix it?

Without my unlazy_mm_irqs_off() (or something similar), x86's very sensible (and very old!) code to unlazy a lazy CPU when flushing leaves active_mm pointing to the old lazy mm.  That's not a problem at all on current kernels, but in a shootdown-lazy world, those active_mm pointers will stick around.  Even with that fixed, without some care, an NMI on the CPU doing the shootdown could dereference ->active_mm at a time when it's not being actively kept alive.

Fixed by unlazy_mm_irqs_off(), the patches that use it, and the patches that make x86 stop inappropriately using ->active_mm.
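
Schematically, the shape of unlazy_mm_irqs_off() is the following; this
is a sketch only, and the real patches differ in detail:

static void unlazy_mm_irqs_off(void)
{
	struct mm_struct *old_mm = current->active_mm;

	lockdep_assert_irqs_disabled();

	/* Only kernel threads lazily using old_mm need unlazying. */
	if (current->mm || old_mm == &init_mm)
		return;

	/*
	 * Switch the hardware MMU state *and* active_mm to init_mm
	 * together, so no stale active_mm pointer is left behind for
	 * an NMI to dereference.
	 */
	switch_mm_irqs_off(old_mm, &init_mm, current);
	current->active_mm = &init_mm;
	mmdrop(old_mm);	/* or the shoot-lazies equivalent */
}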

>
>> 
>> - If we do a UEFI runtime service call while lazy or a text_poke while lazy and the mm goes away while this is happening, we would blow up. Refcounting prevents this but, in current kernels, a shootdown IPI on x86 would not prevent this.  I fixed these issues (and removed duplicate code).
>> 
>> My point here is that the current lazy mm code is a huge mess. 90% of the complexity in this series is cleaning up core messiness and x86 messiness. I would still like to get rid of ->active_mm entirely (it appears to serve no good purpose on any architecture), but that can be saved for later, I think.
>
> I disagree, the lazy mm code is very clean and simple. And I can't see
> how you would propose to remove active_mm from core code; I'm skeptical
> but would be very interested to see it. But that's nothing to do with
> my shoot lazies patch and can also be concurrent except for mechanical
> merge issues.

I think we may just have to agree to disagree.  The more I looked at the lazy code, the more problems I found.  So I fixed them.  That work is done now (as far as I'm aware) except for rebasing and review.

--Andy
