From: Linus Torvalds <torvalds@linux-foundation.org>
To: Rik van Riel <riel@surriel.com>
Cc: Andy Lutomirski <luto@kernel.org>, Will Deacon <will@kernel.org>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linux-MM <linux-mm@kvack.org>,
	Nicholas Piggin <npiggin@gmail.com>,
	Anton Blanchard <anton@ozlabs.org>,
	Benjamin Herrenschmidt <benh@kernel.crashing.org>,
	Paul Mackerras <paulus@ozlabs.org>,
	Randy Dunlap <rdunlap@infradead.org>,
	linux-arch <linux-arch@vger.kernel.org>,
	"the arch/x86 maintainers" <x86@kernel.org>,
	Dave Hansen <dave.hansen@intel.com>,
	"Peter Zijlstra (Intel)" <peterz@infradead.org>,
	Nadav Amit <nadav.amit@gmail.com>,
	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Subject: Re: [PATCH 16/23] sched: Use lightweight hazard pointers to grab lazy mms
Date: Sun, 9 Jan 2022 19:51:26 -0800
Message-ID: <CAHk-=wgaQkeCW13TnvOxvNhGo4sz3ihLt+iNyQkkR_DpQZ=W7Q@mail.gmail.com>
In-Reply-To: <b1d963a8adf4618a53f996283c1bfae37323bbb6.camel@surriel.com>

On Sun, Jan 9, 2022 at 6:40 PM Rik van Riel <riel@surriel.com> wrote:
>
> Also, while 800 loads is kinda expensive, it is a heck of
> a lot less expensive than 800 IPIs.

Rik, the IPI's you have to do *anyway*. So there are exactly zero extra IPI's.

Go take a look. It's part of the whole "flush TLB's" thing in __mmput().

So let me explain one more time what I think we should have done, at
least on x86:

 (1) stop refcounting active_mm entries entirely on x86

Why can we do that? Because instead of worrying about doing those
mm_count games for the active_mm reference, we realize that any
active_mm has to have a _regular_ mm associated with it, and it has a
'mm_users' count.
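
Roughly (this is a sketch of the idea, not the actual scheduler code),
the lazy-mm adoption at context switch time looks something like this,
and the point of (1) is to make the refcount part of it go away on x86:

        if (!next->mm) {                /* switching to a kernel thread */
                next->active_mm = prev->active_mm;
                /*
                 * Today this does mmgrab(prev->active_mm) to pin the
                 * lazy mm.  The idea in (1) is: don't.  The real user's
                 * 'mm_users' reference keeps the page tables alive, and
                 * the exit-time TLB flush IPI (see (2) and (3) below)
                 * kicks every lazy user off the mm before it goes away.
                 */
        }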

And when that mm_users count goes to zero, we have:

 (2) mmput -> __mmput -> exit_mmap(), which already has to flush all
TLB's because it's tearing down the page tables

And since it has to flush those TLB's as part of tearing down the page
tables, we on x86 then have:

 (3) that TLB flush will have to do the IPI's to anybody who has that
mm active already

and that IPI has to be done *regardless*. And the TLB flushing done by
that IPI? That code already clears the lazy status (and not doing so
would be pointless and in fact wrong).
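
Spelled out (a rough sketch of the shape of that path, not the literal
code), the teardown side is:

        mmput(mm)
            -> __mmput(mm)
                -> exit_mmap(mm)   /* unmaps everything, frees page tables */
                    -> mmu_gather teardown, which ends up doing the
                       equivalent of flush_tlb_mm() and IPIs every CPU
                       that still has this mm loaded

and the flush IPI handler on a CPU that is merely lazy in that mm
doesn't even bother to flush, it just stops using the mm (this is
roughly what arch/x86/mm/tlb.c does):

        if (this_cpu_read(cpu_tlbstate_shared.is_lazy)) {
                /* lazy: switch to init_mm instead of flushing */
                switch_mm_irqs_off(NULL, &init_mm, NULL);
                return;
        }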

Notice? There isn't some "800 loads". There isn't some "800 IPI's".
And there isn't any refcounting cost of the lazy TLB.

Well, right now there *is* that refcounting cost, but my point is that
I don't think it should exist.

It shouldn't exist as an atomic access to mm_count (with those cache
ping-pongs when you have a lot of threads across a lot of CPUs), but
it *also* shouldn't exist as a "lightweight hazard pointer".

See my point? I think the lazy-tlb refcounting we do is pointless if
you have to do IPI's for TLB flushes.

Note: the above is for x86, which has to do the IPI's anyway (and it's
very possible that if you don't have to do IPI's because you have HW
TLB coherency, maybe lazy TLB's aren't what you should be using, but I
think that should be a separate discussion).

And yes, right now we do that pointless reference counting, because it
was simple and straightforward, and people historically didn't see it
as a problem.

Plus we had to have that whole secondary 'mm_count' anyway for other
reasons, since we use it for things that need to keep a ref to 'struct
mm_struct' around regardless of page table counts (eg things like a
lot of /proc references to 'struct mm_struct' do not want to keep
forced references to user page tables alive).
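
The usual shape of such a reference is something like this (a hedged
sketch, not any particular /proc file, and walk_address_space() is just
a made-up placeholder):

        mmgrab(mm);                     /* pin 'struct mm_struct' itself */

        /* much later, when actually looking at the address space: */
        if (mmget_not_zero(mm)) {       /* page tables still around? */
                walk_address_space(mm); /* hypothetical placeholder */
                mmput(mm);
        }

        /* when the long-lived reference is finally dropped: */
        mmdrop(mm);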

But I think conceptually mm_count (ie mmgrab/mmdrop) was always really
dodgy for lazy TLB. Lazy TLB really cares about the page tables still
being there, and that's not what mm_count is ostensibly about. That's
really what mm_users is about.

Yet mmgrab/mmdrop is exactly what the lazy TLB code uses, even if it's
technically odd (ie mmgrab really only keeps the 'struct mm' itself
around, and says nothing about the vma's and page tables).
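
To spell out the two counters (simplified from include/linux/sched/mm.h
and kernel/fork.c):

        /* mm_users: "the address space is in use": vma's + page tables */
        mmget(mm);      /* atomic_inc(&mm->mm_users) */
        mmput(mm);      /* dec; last one -> __mmput() -> exit_mmap() */

        /* mm_count: "the 'struct mm_struct' itself must stay allocated" */
        mmgrab(mm);     /* atomic_inc(&mm->mm_count) */
        mmdrop(mm);     /* dec; last one -> __mmdrop() -> free_mm() */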

Side note: you can see the effects of this mis-use of mmgrab/mmdrop in
how we tear down _almost_ all the page table state in __mmput(). But
look at what we keep until the final __mmdrop, even though there are
no page tables left:

        mm_free_pgd(mm);
        destroy_context(mm);

exactly because even though we've torn down all the page tables
earlier, we had to keep the page table *root* around for the lazy
case.

It's kind of a layering violation, but it comes from that lazy-tlb
mm_count use, and so we have that odd situation where the page table
directory lifetime is very different from the rest of the page table
lifetimes.

(You can easily make excuses for it by just saying that "mm_users" is
the user-space page table user count, and that the page directory has
a different lifetime because it's also about the kernel page tables,
so it's all a bit of a gray area, but I do think it's also a bit of a
sign of how our refcounting for lazy-tlb is a bit dodgy).

                Linus
