* [PATCH 00/23] mm, sched: Rework lazy mm handling
@ 2022-01-08 16:43 Andy Lutomirski
  2022-01-08 16:43 ` [PATCH 01/23] membarrier: Document why membarrier() works Andy Lutomirski
                   ` (22 more replies)
  0 siblings, 23 replies; 79+ messages in thread
From: Andy Lutomirski @ 2022-01-08 16:43 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Nicholas Piggin, Anton Blanchard, Benjamin Herrenschmidt,
	Paul Mackerras, Randy Dunlap, linux-arch, x86, Rik van Riel,
	Dave Hansen, Peter Zijlstra, Nadav Amit, Mathieu Desnoyers,
	Andy Lutomirski

Hi all-

Sorry I've been sitting on this for so long.  I think it's in decent shape:
it has no *known* bugs, and I think it's time to get the show on the road.
This series needs more eyeballs, too.

The overall point of this series is to get rid of the scalability
problems with mm_count.  My goal is to solve them once and for all,
for all architectures, in a way that doesn't have any gotchas for
unwary users of ->active_mm.

Most of this series is just cleanup, though.  mmgrab(), mmdrop(), and
->active_mm are a mess.  A number of ->active_mm users are simply
wrong.  kthread lazy mm handling is inconsistent with user thread lazy
mm handling (by accident, as far as I can tell).  And membarrier()
relies on the barrier semantics of mmdrop() and mmgrab(), such that
anything that gets rid of those barriers risks breaking membarrier().
x86 is sometimes non-lazy when the core thinks it's lazy because the
core mm code didn't offer any mechanism by which x86 could tell the core
that it's exiting lazy mode.

So: bogus users of ->active_mm are fixed, membarrier() is reworked so
that its barriers are explicit instead of depending on mmdrop() and
mmgrab(), x86 lazy handling is extensively tidied up, and x86's EFI mm
code gets tidied up a bit too.  I think I've done all of this in a way
that introduces little or no overhead.


Additionally, all the code paths that change current->mm are consolidated
so that there is only one path to start using an mm and only one path
to stop using it.
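
To sketch what that single path might look like, here is a rough
approximation assembled from the kthread_use_mm()/exec_mmap() fragments
that appear later in this series.  The name __change_current_mm() comes
from the patch titles; the real signature and details in patches 14-15
may well differ:

	void __change_current_mm(struct mm_struct *mm)
	{
		struct task_struct *tsk = current;
		struct mm_struct *active_mm;

		task_lock(tsk);
		local_irq_disable();
		active_mm = tsk->active_mm;
		if (active_mm != mm) {
			mmgrab(mm);
			tsk->active_mm = mm;
		}
		/* membarrier reads ->mm without locks */
		WRITE_ONCE(tsk->mm, mm);
		membarrier_update_current_mm(mm);
		switch_mm_irqs_off(active_mm, mm, tsk);
		membarrier_finish_switch_mm(mm);
		local_irq_enable();
		task_unlock(tsk);

		if (active_mm != mm)
			mmdrop(active_mm);
	}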

Once that's done, the actual meat (the hazard pointers) isn't so bad, and
the x86 optimization on top that should eliminate scanning of remote CPUs
in __mmput() is about two lines of code.  Other architectures with
sufficiently accurate mm_cpumask() tracking should be able to do the same
thing.
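
For a flavor of the hazard-pointer idea, here is a conceptual sketch
only, with made-up names (lazy_mm_hazard, start_lazy_mm,
kick_lazy_users) -- the real implementation is in patch 16:

	/* Each CPU advertises the mm it is currently borrowing lazily. */
	static DEFINE_PER_CPU(struct mm_struct *, lazy_mm_hazard);

	/* Scheduler side: publish the borrowed mm instead of mmgrab()ing it. */
	static void start_lazy_mm(struct mm_struct *mm)
	{
		this_cpu_write(lazy_mm_hazard, mm);
		smp_mb();	/* order the publish against later mm accesses */
	}

	/*
	 * __mmput()/exit side: scan for remaining lazy users instead of
	 * waiting for mm_count to drain.  With sufficiently accurate
	 * mm_cpumask() tracking (the x86 optimization), only the CPUs in
	 * that mask need to be scanned.
	 */
	static void kick_lazy_users(struct mm_struct *mm)
	{
		int cpu;

		for_each_possible_cpu(cpu) {
			if (READ_ONCE(per_cpu(lazy_mm_hazard, cpu)) == mm) {
				/* make that CPU drop the lazy mm, e.g. by
				 * IPIing it over to init_mm, before the mm
				 * is torn down */
			}
		}
	}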

akpm, this is intended mostly to replace Nick Piggin's lazy shootdown
series.  This series implements lazy shootdown implicitly on x86, and
powerpc should be able to do the same thing in just a couple of lines
of code if it wants to.  The result is, IMO, much cleaner and more
maintainable.

Once this is all reviewed, I'm hoping it can go in -tip (and -next) after
the merge window or go in -mm.  This is not intended for v5.16.  I suspect
-tip is easier in case other arch maintainers want to optimize their
code in the same release.

Andy Lutomirski (23):
  membarrier: Document why membarrier() works
  x86/mm: Handle unlazying membarrier core sync in the arch code
  membarrier: Remove membarrier_arch_switch_mm() prototype in core code
  membarrier: Make the post-switch-mm barrier explicit
  membarrier, kthread: Use _ONCE accessors for task->mm
  powerpc/membarrier: Remove special barrier on mm switch
  membarrier: Rewrite sync_core_before_usermode() and improve
    documentation
  membarrier: Remove redundant clear of mm->membarrier_state in
    exec_mmap()
  membarrier: Fix incorrect barrier positions during exec and
    kthread_use_mm()
  x86/events, x86/insn-eval: Remove incorrect active_mm references
  sched/scs: Initialize shadow stack on idle thread bringup, not
    shutdown
  Rework "sched/core: Fix illegal RCU from offline CPUs"
  exec: Remove unnecessary vmacache_seqnum clear in exec_mmap()
  sched, exec: Factor current mm changes out from exec
  kthread: Switch to __change_current_mm()
  sched: Use lightweight hazard pointers to grab lazy mms
  x86/mm: Make use/unuse_temporary_mm() non-static
  x86/mm: Allow temporary mms when IRQs are on
  x86/efi: Make efi_enter/leave_mm use the temporary_mm machinery
  x86/mm: Remove leave_mm() in favor of unlazy_mm_irqs_off()
  x86/mm: Use unlazy_mm_irqs_off() in TLB flush IPIs
  x86/mm: Optimize for_each_possible_lazymm_cpu()
  x86/mm: Opt in to IRQs-off activate_mm()

 .../membarrier-sync-core/arch-support.txt     |  69 +--
 arch/arm/include/asm/membarrier.h             |  21 +
 arch/arm/kernel/smp.c                         |   2 -
 arch/arm64/include/asm/membarrier.h           |  19 +
 arch/arm64/kernel/smp.c                       |   2 -
 arch/csky/kernel/smp.c                        |   2 -
 arch/ia64/kernel/process.c                    |   1 -
 arch/mips/cavium-octeon/smp.c                 |   1 -
 arch/mips/kernel/smp-bmips.c                  |   2 -
 arch/mips/kernel/smp-cps.c                    |   1 -
 arch/mips/loongson64/smp.c                    |   2 -
 arch/powerpc/include/asm/membarrier.h         |  28 +-
 arch/powerpc/mm/mmu_context.c                 |   1 -
 arch/powerpc/platforms/85xx/smp.c             |   2 -
 arch/powerpc/platforms/powermac/smp.c         |   2 -
 arch/powerpc/platforms/powernv/smp.c          |   1 -
 arch/powerpc/platforms/pseries/hotplug-cpu.c  |   2 -
 arch/powerpc/platforms/pseries/pmem.c         |   1 -
 arch/riscv/kernel/cpu-hotplug.c               |   2 -
 arch/s390/kernel/smp.c                        |   1 -
 arch/sh/kernel/smp.c                          |   1 -
 arch/sparc/kernel/smp_64.c                    |   2 -
 arch/x86/Kconfig                              |   2 +-
 arch/x86/events/core.c                        |   9 +-
 arch/x86/include/asm/membarrier.h             |  25 ++
 arch/x86/include/asm/mmu.h                    |   6 +-
 arch/x86/include/asm/mmu_context.h            |  15 +-
 arch/x86/include/asm/sync_core.h              |  20 -
 arch/x86/kernel/alternative.c                 |  67 +--
 arch/x86/kernel/cpu/mce/core.c                |   2 +-
 arch/x86/kernel/smpboot.c                     |   2 -
 arch/x86/lib/insn-eval.c                      |  13 +-
 arch/x86/mm/tlb.c                             | 155 +++++--
 arch/x86/platform/efi/efi_64.c                |   9 +-
 arch/x86/xen/mmu_pv.c                         |   2 +-
 arch/xtensa/kernel/smp.c                      |   1 -
 drivers/cpuidle/cpuidle.c                     |   2 +-
 drivers/idle/intel_idle.c                     |   4 +-
 drivers/misc/sgi-gru/grufault.c               |   2 +-
 drivers/misc/sgi-gru/gruhandles.c             |   2 +-
 drivers/misc/sgi-gru/grukservices.c           |   2 +-
 fs/exec.c                                     |  28 +-
 include/linux/mmu_context.h                   |   4 +-
 include/linux/sched/hotplug.h                 |   6 -
 include/linux/sched/mm.h                      |  58 ++-
 include/linux/sync_core.h                     |  21 -
 init/Kconfig                                  |   3 -
 kernel/cpu.c                                  |  21 +-
 kernel/exit.c                                 |   2 +-
 kernel/fork.c                                 |  11 +
 kernel/kthread.c                              |  50 +--
 kernel/sched/core.c                           | 409 +++++++++++++++---
 kernel/sched/idle.c                           |   1 +
 kernel/sched/membarrier.c                     |  97 ++++-
 kernel/sched/sched.h                          |  11 +-
 55 files changed, 745 insertions(+), 482 deletions(-)
 create mode 100644 arch/arm/include/asm/membarrier.h
 create mode 100644 arch/arm64/include/asm/membarrier.h
 create mode 100644 arch/x86/include/asm/membarrier.h
 delete mode 100644 include/linux/sync_core.h

-- 
2.33.1


* [PATCH 01/23] membarrier: Document why membarrier() works
  2022-01-08 16:43 [PATCH 00/23] mm, sched: Rework lazy mm handling Andy Lutomirski
@ 2022-01-08 16:43 ` Andy Lutomirski
  2022-01-12 15:30   ` Mathieu Desnoyers
  2022-01-08 16:43 ` [PATCH 02/23] x86/mm: Handle unlazying membarrier core sync in the arch code Andy Lutomirski
                   ` (21 subsequent siblings)
  22 siblings, 1 reply; 79+ messages in thread
From: Andy Lutomirski @ 2022-01-08 16:43 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Nicholas Piggin, Anton Blanchard, Benjamin Herrenschmidt,
	Paul Mackerras, Randy Dunlap, linux-arch, x86, Rik van Riel,
	Dave Hansen, Peter Zijlstra, Nadav Amit, Mathieu Desnoyers,
	Andy Lutomirski

We had a nice comment at the top of membarrier.c explaining why membarrier
worked in a handful of scenarios, but that consisted more of a list of
things not to forget than an actual description of the algorithm and why it
should be expected to work.

Add a comment explaining my understanding of the algorithm.  This exposes a
couple of implementation issues that I will hopefully fix up in subsequent
patches.
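
For orientation, here is a grossly simplified sketch of the sequence the
new comment describes.  The names membarrier_sketch() and
membarrier_ipi_mb() and the tmpmask handling are illustrative only; the
real logic lives in membarrier_private_expedited() and also handles the
single-cpu fast path, registration checks, and fallbacks:

	static void membarrier_ipi_mb(void *info)
	{
		smp_mb();	/* step 3: the barrier on the target CPU */
	}

	static void membarrier_sketch(struct mm_struct *mm,
				      struct cpumask *tmpmask)
	{
		int cpu;

		smp_mb();				/* step 1 */

		cpumask_clear(tmpmask);
		rcu_read_lock();
		for_each_online_cpu(cpu) {
			struct task_struct *p;

			p = rcu_dereference(cpu_rq(cpu)->curr);
			if (p && READ_ONCE(p->mm) == mm)
				__cpumask_set_cpu(cpu, tmpmask);
		}
		rcu_read_unlock();

		/* steps 2-4: send the IPIs and wait for them to finish */
		smp_call_function_many(tmpmask, membarrier_ipi_mb, NULL, 1);

		smp_mb();				/* step 5 */
	}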

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 kernel/sched/membarrier.c | 60 +++++++++++++++++++++++++++++++++++++--
 1 file changed, 58 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index b5add64d9698..30e964b9689d 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -7,8 +7,64 @@
 #include "sched.h"
 
 /*
- * For documentation purposes, here are some membarrier ordering
- * scenarios to keep in mind:
+ * The basic principle behind the regular memory barrier mode of
+ * membarrier() is as follows.  membarrier() is called in one thread.  It
+ * iterates over all CPUs, and, for each CPU, it either sends an IPI to
+ * that CPU or it does not. If it sends an IPI, then we have the
+ * following sequence of events:
+ *
+ * 1. membarrier() does smp_mb().
+ * 2. membarrier() does a store (the IPI request payload) that is observed by
+ *    the target CPU.
+ * 3. The target CPU does smp_mb().
+ * 4. The target CPU does a store (the completion indication) that is observed
+ *    by membarrier()'s wait-for-IPIs-to-finish request.
+ * 5. membarrier() does smp_mb().
+ *
+ * So all pre-membarrier() local accesses are visible after the IPI on the
+ * target CPU and all pre-IPI remote accesses are visible after
+ * membarrier(). IOW membarrier() has synchronized both ways with the target
+ * CPU.
+ *
+ * (This has the caveat that membarrier() does not interrupt the CPU that it's
+ * running on at the time it sends the IPIs. However, if that is the CPU on
+ * which membarrier() starts and/or finishes, membarrier() does smp_mb() and,
+ * if not, then the scheduler's migration of membarrier() is a full barrier.)
+ *
+ * membarrier() skips sending an IPI only if membarrier() sees
+ * cpu_rq(cpu)->curr->mm != target mm.  The sequence of events is:
+ *
+ *           membarrier()            |          target CPU
+ * ---------------------------------------------------------------------
+ *                                   | 1. smp_mb()
+ *                                   | 2. set rq->curr->mm = other_mm
+ *                                   |    (by writing to ->curr or to ->mm)
+ * 3. smp_mb()                       |
+ * 4. read rq->curr->mm == other_mm  |
+ * 5. smp_mb()                       |
+ *                                   | 6. rq->curr->mm = target_mm
+ *                                   |    (by writing to ->curr or to ->mm)
+ *                                   | 7. smp_mb()
+ *                                   |
+ *
+ * All memory accesses on the target CPU prior to scheduling are visible
+ * to membarrier()'s caller after membarrier() returns due to steps 1, 2, 4
+ * and 5.
+ *
+ * All memory accesses by membarrier()'s caller prior to membarrier() are
+ * visible to the target CPU after scheduling due to steps 3, 4, 6, and 7.
+ *
+ * Note that tasks can change their ->mm, e.g. via kthread_use_mm().  So
+ * tasks that switch their ->mm must follow the same rules as the scheduler
+ * changing rq->curr, and the membarrier() code needs to do both dereferences
+ * carefully.
+ *
+ * GLOBAL_EXPEDITED support works the same way except that all references
+ * to rq->curr->mm are replaced with references to rq->membarrier_state.
+ *
+ *
+ * Specific examples of how this produces the documented properties of
+ * membarrier():
  *
  * A) Userspace thread execution after IPI vs membarrier's memory
  *    barrier before sending the IPI
-- 
2.33.1


* [PATCH 02/23] x86/mm: Handle unlazying membarrier core sync in the arch code
  2022-01-08 16:43 [PATCH 00/23] mm, sched: Rework lazy mm handling Andy Lutomirski
  2022-01-08 16:43 ` [PATCH 01/23] membarrier: Document why membarrier() works Andy Lutomirski
@ 2022-01-08 16:43 ` Andy Lutomirski
  2022-01-12 15:40   ` Mathieu Desnoyers
  2022-01-08 16:43 ` [PATCH 03/23] membarrier: Remove membarrier_arch_switch_mm() prototype in core code Andy Lutomirski
                   ` (20 subsequent siblings)
  22 siblings, 1 reply; 79+ messages in thread
From: Andy Lutomirski @ 2022-01-08 16:43 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Nicholas Piggin, Anton Blanchard, Benjamin Herrenschmidt,
	Paul Mackerras, Randy Dunlap, linux-arch, x86, Rik van Riel,
	Dave Hansen, Peter Zijlstra, Nadav Amit, Mathieu Desnoyers,
	Andy Lutomirski

The core scheduler isn't a great place for
membarrier_mm_sync_core_before_usermode() -- the core scheduler
doesn't actually know whether we are lazy.  With the old code, if a
CPU is running a membarrier-registered task, goes idle, gets unlazied
via a TLB shootdown IPI, and switches back to the
membarrier-registered task, it will do an unnecessary core sync.

Conveniently, x86 is the only architecture that does anything in this
sync_core_before_usermode(), so membarrier_mm_sync_core_before_usermode()
is a no-op on all other architectures and we can just move the code.

(I am not claiming that the SYNC_CORE code was correct before or after this
 change on any non-x86 architecture.  I merely claim that this change
 improves readability, is correct on x86, and makes no change on any other
 architecture.)

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/mm/tlb.c        | 58 +++++++++++++++++++++++++++++++---------
 include/linux/sched/mm.h | 13 ---------
 kernel/sched/core.c      | 14 +++++-----
 3 files changed, 53 insertions(+), 32 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 59ba2968af1b..1ae15172885e 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -9,6 +9,7 @@
 #include <linux/cpu.h>
 #include <linux/debugfs.h>
 #include <linux/sched/smt.h>
+#include <linux/sched/mm.h>
 
 #include <asm/tlbflush.h>
 #include <asm/mmu_context.h>
@@ -485,6 +486,15 @@ void cr4_update_pce(void *ignored)
 static inline void cr4_update_pce_mm(struct mm_struct *mm) { }
 #endif
 
+static void sync_core_if_membarrier_enabled(struct mm_struct *next)
+{
+#ifdef CONFIG_MEMBARRIER
+	if (unlikely(atomic_read(&next->membarrier_state) &
+		     MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE))
+		sync_core_before_usermode();
+#endif
+}
+
 void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 			struct task_struct *tsk)
 {
@@ -539,16 +549,24 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		this_cpu_write(cpu_tlbstate_shared.is_lazy, false);
 
 	/*
-	 * The membarrier system call requires a full memory barrier and
-	 * core serialization before returning to user-space, after
-	 * storing to rq->curr, when changing mm.  This is because
-	 * membarrier() sends IPIs to all CPUs that are in the target mm
-	 * to make them issue memory barriers.  However, if another CPU
-	 * switches to/from the target mm concurrently with
-	 * membarrier(), it can cause that CPU not to receive an IPI
-	 * when it really should issue a memory barrier.  Writing to CR3
-	 * provides that full memory barrier and core serializing
-	 * instruction.
+	 * membarrier() support requires that, when we change rq->curr->mm:
+	 *
+	 *  - If next->mm has membarrier registered, a full memory barrier
+	 *    after writing rq->curr (or rq->curr->mm if we switched the mm
+	 *    without switching tasks) and before returning to user mode.
+	 *
+	 *  - If next->mm has SYNC_CORE registered, then we sync core before
+	 *    returning to user mode.
+	 *
+	 * In the case where prev->mm == next->mm, membarrier() uses an IPI
+	 * instead, and no particular barriers are needed while context
+	 * switching.
+	 *
+	 * x86 gets all of this as a side-effect of writing to CR3 except
+	 * in the case where we unlazy without flushing.
+	 *
+	 * All other architectures are civilized and do all of this implicitly
+	 * when transitioning from kernel to user mode.
 	 */
 	if (real_prev == next) {
 		VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
@@ -566,7 +584,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		/*
 		 * If the CPU is not in lazy TLB mode, we are just switching
 		 * from one thread in a process to another thread in the same
-		 * process. No TLB flush required.
+		 * process. No TLB flush or membarrier() synchronization
+		 * is required.
 		 */
 		if (!was_lazy)
 			return;
@@ -576,16 +595,31 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		 * If the TLB is up to date, just use it.
 		 * The barrier synchronizes with the tlb_gen increment in
 		 * the TLB shootdown code.
+		 *
+		 * As a future optimization opportunity, it's plausible
+		 * that the x86 memory model is strong enough that this
+		 * smp_mb() isn't needed.
 		 */
 		smp_mb();
 		next_tlb_gen = atomic64_read(&next->context.tlb_gen);
 		if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) ==
-				next_tlb_gen)
+		    next_tlb_gen) {
+			/*
+			 * We switched logical mm but we're not going to
+			 * write to CR3.  We already did smp_mb() above,
+			 * but membarrier() might require a sync_core()
+			 * as well.
+			 */
+			sync_core_if_membarrier_enabled(next);
+
 			return;
+		}
 
 		/*
 		 * TLB contents went out of date while we were in lazy
 		 * mode. Fall through to the TLB switching code below.
+		 * No need for an explicit membarrier invocation -- the CR3
+		 * write will serialize.
 		 */
 		new_asid = prev_asid;
 		need_flush = true;
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 5561486fddef..c256a7fc0423 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -345,16 +345,6 @@ enum {
 #include <asm/membarrier.h>
 #endif
 
-static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
-{
-	if (current->mm != mm)
-		return;
-	if (likely(!(atomic_read(&mm->membarrier_state) &
-		     MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE)))
-		return;
-	sync_core_before_usermode();
-}
-
 extern void membarrier_exec_mmap(struct mm_struct *mm);
 
 extern void membarrier_update_current_mm(struct mm_struct *next_mm);
@@ -370,9 +360,6 @@ static inline void membarrier_arch_switch_mm(struct mm_struct *prev,
 static inline void membarrier_exec_mmap(struct mm_struct *mm)
 {
 }
-static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
-{
-}
 static inline void membarrier_update_current_mm(struct mm_struct *next_mm)
 {
 }
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f21714ea3db8..6a1db8264c7b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4822,22 +4822,22 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 	kmap_local_sched_in();
 
 	fire_sched_in_preempt_notifiers(current);
+
 	/*
 	 * When switching through a kernel thread, the loop in
 	 * membarrier_{private,global}_expedited() may have observed that
 	 * kernel thread and not issued an IPI. It is therefore possible to
 	 * schedule between user->kernel->user threads without passing though
 	 * switch_mm(). Membarrier requires a barrier after storing to
-	 * rq->curr, before returning to userspace, so provide them here:
+	 * rq->curr, before returning to userspace, and mmdrop() provides
+	 * this barrier.
 	 *
-	 * - a full memory barrier for {PRIVATE,GLOBAL}_EXPEDITED, implicitly
-	 *   provided by mmdrop(),
-	 * - a sync_core for SYNC_CORE.
+	 * If an architecture needs to take a specific action for
+	 * SYNC_CORE, it can do so in switch_mm_irqs_off().
 	 */
-	if (mm) {
-		membarrier_mm_sync_core_before_usermode(mm);
+	if (mm)
 		mmdrop(mm);
-	}
+
 	if (unlikely(prev_state == TASK_DEAD)) {
 		if (prev->sched_class->task_dead)
 			prev->sched_class->task_dead(prev);
-- 
2.33.1


* [PATCH 03/23] membarrier: Remove membarrier_arch_switch_mm() prototype in core code
  2022-01-08 16:43 [PATCH 00/23] mm, sched: Rework lazy mm handling Andy Lutomirski
  2022-01-08 16:43 ` [PATCH 01/23] membarrier: Document why membarrier() works Andy Lutomirski
  2022-01-08 16:43 ` [PATCH 02/23] x86/mm: Handle unlazying membarrier core sync in the arch code Andy Lutomirski
@ 2022-01-08 16:43 ` Andy Lutomirski
  2022-01-08 16:43 ` [PATCH 04/23] membarrier: Make the post-switch-mm barrier explicit Andy Lutomirski
                   ` (19 subsequent siblings)
  22 siblings, 0 replies; 79+ messages in thread
From: Andy Lutomirski @ 2022-01-08 16:43 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Nicholas Piggin, Anton Blanchard, Benjamin Herrenschmidt,
	Paul Mackerras, Randy Dunlap, linux-arch, x86, Rik van Riel,
	Dave Hansen, Peter Zijlstra, Nadav Amit, Mathieu Desnoyers,
	Andy Lutomirski

membarrier_arch_switch_mm()'s sole implementation and caller are in
arch/powerpc.  Having a fallback implementation in include/linux is
confusing -- remove it.

It's still mentioned in a comment, but a subsequent patch will remove
it.

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Acked-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 include/linux/sched/mm.h | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index c256a7fc0423..0df706c099e5 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -350,13 +350,6 @@ extern void membarrier_exec_mmap(struct mm_struct *mm);
 extern void membarrier_update_current_mm(struct mm_struct *next_mm);
 
 #else
-#ifdef CONFIG_ARCH_HAS_MEMBARRIER_CALLBACKS
-static inline void membarrier_arch_switch_mm(struct mm_struct *prev,
-					     struct mm_struct *next,
-					     struct task_struct *tsk)
-{
-}
-#endif
 static inline void membarrier_exec_mmap(struct mm_struct *mm)
 {
 }
-- 
2.33.1


* [PATCH 04/23] membarrier: Make the post-switch-mm barrier explicit
  2022-01-08 16:43 [PATCH 00/23] mm, sched: Rework lazy mm handling Andy Lutomirski
                   ` (2 preceding siblings ...)
  2022-01-08 16:43 ` [PATCH 03/23] membarrier: Remove membarrier_arch_switch_mm() prototype in core code Andy Lutomirski
@ 2022-01-08 16:43 ` Andy Lutomirski
  2022-01-12 15:52   ` Mathieu Desnoyers
  2022-01-08 16:43 ` [PATCH 05/23] membarrier, kthread: Use _ONCE accessors for task->mm Andy Lutomirski
                   ` (18 subsequent siblings)
  22 siblings, 1 reply; 79+ messages in thread
From: Andy Lutomirski @ 2022-01-08 16:43 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Nicholas Piggin, Anton Blanchard, Benjamin Herrenschmidt,
	Paul Mackerras, Randy Dunlap, linux-arch, x86, Rik van Riel,
	Dave Hansen, Peter Zijlstra, Nadav Amit, Mathieu Desnoyers,
	Andy Lutomirski

membarrier() needs a barrier after any CPU changes mm.  There is currently
a comment explaining why this barrier probably exists in all cases. The
logic is based on ensuring that the barrier exists on every control flow
path through the scheduler.  It also relies on mmgrab() and mmdrop() being
full barriers.

mmgrab() and mmdrop() would be better if they were not full barriers.  As a
trivial optimization, mmgrab() could use a relaxed atomic and mmdrop()
could use a release on architectures that have these operations.  Larger
optimizations are also in the works.  Doing any of these optimizations
while preserving an unnecessary barrier will complicate the code and
penalize non-membarrier-using tasks.

Simplify the logic by adding an explicit barrier, and allow architectures
to override it as an optimization if they want to.
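
For illustration, an architecture whose mm switch already implies the
required barrier could override the generic helper roughly like this
(hypothetical arch "foo"; the generic version added below is guarded by
#ifndef, so defining the name suppresses it):

	/* arch/foo/include/asm/membarrier.h */
	#define membarrier_finish_switch_mm membarrier_finish_switch_mm
	static inline void membarrier_finish_switch_mm(struct mm_struct *mm)
	{
		/* switch_mm_irqs_off() already provided a full barrier */
	}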

One of the deleted comments in this patch said "It is therefore
possible to schedule between user->kernel->user threads without
passing through switch_mm()".  It is possible to do this without, say,
writing to CR3 on x86, but the core scheduler indeed calls
switch_mm_irqs_off() to tell the arch code to go back from lazy mode
to no-lazy mode.

The membarrier_finish_switch_mm() call in exec_mmap() is a no-op so long as
there is no way for a newly execed program to register for membarrier prior
to running user code.  Subsequent patches will merge the exec_mmap() code
with the kthread_use_mm() code, though, and keeping the paths consistent
will make the result more comprehensible.

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 fs/exec.c                |  1 +
 include/linux/sched/mm.h | 18 ++++++++++++++++++
 kernel/kthread.c         | 12 +-----------
 kernel/sched/core.c      | 34 +++++++++-------------------------
 4 files changed, 29 insertions(+), 36 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index a098c133d8d7..3abbd0294e73 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1019,6 +1019,7 @@ static int exec_mmap(struct mm_struct *mm)
 	activate_mm(active_mm, mm);
 	if (IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
 		local_irq_enable();
+	membarrier_finish_switch_mm(mm);
 	tsk->mm->vmacache_seqnum = 0;
 	vmacache_flush(tsk);
 	task_unlock(tsk);
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 0df706c099e5..e8919995d8dd 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -349,6 +349,20 @@ extern void membarrier_exec_mmap(struct mm_struct *mm);
 
 extern void membarrier_update_current_mm(struct mm_struct *next_mm);
 
+/*
+ * Called by the core scheduler after calling switch_mm_irqs_off().
+ * Architectures that have implicit barriers when switching mms can
+ * override this as an optimization.
+ */
+#ifndef membarrier_finish_switch_mm
+static inline void membarrier_finish_switch_mm(struct mm_struct *mm)
+{
+	if (atomic_read(&mm->membarrier_state) &
+	    (MEMBARRIER_STATE_GLOBAL_EXPEDITED | MEMBARRIER_STATE_PRIVATE_EXPEDITED))
+		smp_mb();
+}
+#endif
+
 #else
 static inline void membarrier_exec_mmap(struct mm_struct *mm)
 {
@@ -356,6 +370,10 @@ static inline void membarrier_exec_mmap(struct mm_struct *mm)
 static inline void membarrier_update_current_mm(struct mm_struct *next_mm)
 {
 }
+static inline void membarrier_finish_switch_mm(struct mm_struct *mm)
+{
+}
+
 #endif
 
 #endif /* _LINUX_SCHED_MM_H */
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 5b37a8567168..396ae78a1a34 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -1361,25 +1361,15 @@ void kthread_use_mm(struct mm_struct *mm)
 	tsk->mm = mm;
 	membarrier_update_current_mm(mm);
 	switch_mm_irqs_off(active_mm, mm, tsk);
+	membarrier_finish_switch_mm(mm);
 	local_irq_enable();
 	task_unlock(tsk);
 #ifdef finish_arch_post_lock_switch
 	finish_arch_post_lock_switch();
 #endif
 
-	/*
-	 * When a kthread starts operating on an address space, the loop
-	 * in membarrier_{private,global}_expedited() may not observe
-	 * that tsk->mm, and not issue an IPI. Membarrier requires a
-	 * memory barrier after storing to tsk->mm, before accessing
-	 * user-space memory. A full memory barrier for membarrier
-	 * {PRIVATE,GLOBAL}_EXPEDITED is implicitly provided by
-	 * mmdrop(), or explicitly with smp_mb().
-	 */
 	if (active_mm != mm)
 		mmdrop(active_mm);
-	else
-		smp_mb();
 
 	to_kthread(tsk)->oldfs = force_uaccess_begin();
 }
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6a1db8264c7b..917068b0a145 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4824,14 +4824,6 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 	fire_sched_in_preempt_notifiers(current);
 
 	/*
-	 * When switching through a kernel thread, the loop in
-	 * membarrier_{private,global}_expedited() may have observed that
-	 * kernel thread and not issued an IPI. It is therefore possible to
-	 * schedule between user->kernel->user threads without passing though
-	 * switch_mm(). Membarrier requires a barrier after storing to
-	 * rq->curr, before returning to userspace, and mmdrop() provides
-	 * this barrier.
-	 *
 	 * If an architecture needs to take a specific action for
 	 * SYNC_CORE, it can do so in switch_mm_irqs_off().
 	 */
@@ -4915,15 +4907,14 @@ context_switch(struct rq *rq, struct task_struct *prev,
 			prev->active_mm = NULL;
 	} else {                                        // to user
 		membarrier_switch_mm(rq, prev->active_mm, next->mm);
+		switch_mm_irqs_off(prev->active_mm, next->mm, next);
+
 		/*
 		 * sys_membarrier() requires an smp_mb() between setting
-		 * rq->curr / membarrier_switch_mm() and returning to userspace.
-		 *
-		 * The below provides this either through switch_mm(), or in
-		 * case 'prev->active_mm == next->mm' through
-		 * finish_task_switch()'s mmdrop().
+		 * rq->curr->mm to a membarrier-enabled mm and returning
+		 * to userspace.
 		 */
-		switch_mm_irqs_off(prev->active_mm, next->mm, next);
+		membarrier_finish_switch_mm(next->mm);
 
 		if (!prev->mm) {                        // from kernel
 			/* will mmdrop() in finish_task_switch(). */
@@ -6264,17 +6255,10 @@ static void __sched notrace __schedule(unsigned int sched_mode)
 		RCU_INIT_POINTER(rq->curr, next);
 		/*
 		 * The membarrier system call requires each architecture
-		 * to have a full memory barrier after updating
-		 * rq->curr, before returning to user-space.
-		 *
-		 * Here are the schemes providing that barrier on the
-		 * various architectures:
-		 * - mm ? switch_mm() : mmdrop() for x86, s390, sparc, PowerPC.
-		 *   switch_mm() rely on membarrier_arch_switch_mm() on PowerPC.
-		 * - finish_lock_switch() for weakly-ordered
-		 *   architectures where spin_unlock is a full barrier,
-		 * - switch_to() for arm64 (weakly-ordered, spin_unlock
-		 *   is a RELEASE barrier),
+		 * to have a full memory barrier before and after updating
+		 * rq->curr->mm, before returning to userspace.  This
+		 * is provided by membarrier_finish_switch_mm().  Architectures
+		 * that want to optimize this can override that function.
 		 */
 		++*switch_count;
 
-- 
2.33.1


* [PATCH 05/23] membarrier, kthread: Use _ONCE accessors for task->mm
  2022-01-08 16:43 [PATCH 00/23] mm, sched: Rework lazy mm handling Andy Lutomirski
                   ` (3 preceding siblings ...)
  2022-01-08 16:43 ` [PATCH 04/23] membarrier: Make the post-switch-mm barrier explicit Andy Lutomirski
@ 2022-01-08 16:43 ` Andy Lutomirski
  2022-01-12 15:55   ` Mathieu Desnoyers
  2022-01-08 16:43   ` Andy Lutomirski
                   ` (17 subsequent siblings)
  22 siblings, 1 reply; 79+ messages in thread
From: Andy Lutomirski @ 2022-01-08 16:43 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Nicholas Piggin, Anton Blanchard, Benjamin Herrenschmidt,
	Paul Mackerras, Randy Dunlap, linux-arch, x86, Rik van Riel,
	Dave Hansen, Peter Zijlstra, Nadav Amit, Mathieu Desnoyers,
	Andy Lutomirski

membarrier reads cpu_rq(remote cpu)->curr->mm without locking.  Use
READ_ONCE() and WRITE_ONCE() to remove the data races.

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Acked-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 fs/exec.c                 | 2 +-
 kernel/exit.c             | 2 +-
 kernel/kthread.c          | 4 ++--
 kernel/sched/membarrier.c | 7 ++++---
 4 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 3abbd0294e73..38b05e01c5bd 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1006,7 +1006,7 @@ static int exec_mmap(struct mm_struct *mm)
 	local_irq_disable();
 	active_mm = tsk->active_mm;
 	tsk->active_mm = mm;
-	tsk->mm = mm;
+	WRITE_ONCE(tsk->mm, mm);  /* membarrier reads this without locks */
 	/*
 	 * This prevents preemption while active_mm is being loaded and
 	 * it and mm are being updated, which could cause problems for
diff --git a/kernel/exit.c b/kernel/exit.c
index 91a43e57a32e..70f2cbc42015 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -491,7 +491,7 @@ static void exit_mm(void)
 	 */
 	smp_mb__after_spinlock();
 	local_irq_disable();
-	current->mm = NULL;
+	WRITE_ONCE(current->mm, NULL);
 	membarrier_update_current_mm(NULL);
 	enter_lazy_tlb(mm, current);
 	local_irq_enable();
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 396ae78a1a34..3b18329f885c 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -1358,7 +1358,7 @@ void kthread_use_mm(struct mm_struct *mm)
 		mmgrab(mm);
 		tsk->active_mm = mm;
 	}
-	tsk->mm = mm;
+	WRITE_ONCE(tsk->mm, mm);  /* membarrier reads this without locks */
 	membarrier_update_current_mm(mm);
 	switch_mm_irqs_off(active_mm, mm, tsk);
 	membarrier_finish_switch_mm(mm);
@@ -1399,7 +1399,7 @@ void kthread_unuse_mm(struct mm_struct *mm)
 	smp_mb__after_spinlock();
 	sync_mm_rss(mm);
 	local_irq_disable();
-	tsk->mm = NULL;
+	WRITE_ONCE(tsk->mm, NULL);  /* membarrier reads this without locks */
 	membarrier_update_current_mm(NULL);
 	/* active_mm is still 'mm' */
 	enter_lazy_tlb(mm, tsk);
diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index 30e964b9689d..327830f89c37 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -411,7 +411,7 @@ static int membarrier_private_expedited(int flags, int cpu_id)
 			goto out;
 		rcu_read_lock();
 		p = rcu_dereference(cpu_rq(cpu_id)->curr);
-		if (!p || p->mm != mm) {
+		if (!p || READ_ONCE(p->mm) != mm) {
 			rcu_read_unlock();
 			goto out;
 		}
@@ -424,7 +424,7 @@ static int membarrier_private_expedited(int flags, int cpu_id)
 			struct task_struct *p;
 
 			p = rcu_dereference(cpu_rq(cpu)->curr);
-			if (p && p->mm == mm)
+			if (p && READ_ONCE(p->mm) == mm)
 				__cpumask_set_cpu(cpu, tmpmask);
 		}
 		rcu_read_unlock();
@@ -522,7 +522,8 @@ static int sync_runqueues_membarrier_state(struct mm_struct *mm)
 		struct task_struct *p;
 
 		p = rcu_dereference(rq->curr);
-		if (p && p->mm == mm)
+		/* exec and kthread_use_mm() write ->mm without locks */
+		if (p && READ_ONCE(p->mm) == mm)
 			__cpumask_set_cpu(cpu, tmpmask);
 	}
 	rcu_read_unlock();
-- 
2.33.1


* [PATCH 06/23] powerpc/membarrier: Remove special barrier on mm switch
  2022-01-08 16:43 [PATCH 00/23] mm, sched: Rework lazy mm handling Andy Lutomirski
@ 2022-01-08 16:43   ` Andy Lutomirski
  2022-01-08 16:43 ` [PATCH 02/23] x86/mm: Handle unlazying membarrier core sync in the arch code Andy Lutomirski
                     ` (21 subsequent siblings)
  22 siblings, 0 replies; 79+ messages in thread
From: Andy Lutomirski @ 2022-01-08 16:43 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Nicholas Piggin, Anton Blanchard, Benjamin Herrenschmidt,
	Paul Mackerras, Randy Dunlap, linux-arch, x86, Rik van Riel,
	Dave Hansen, Peter Zijlstra, Nadav Amit, Mathieu Desnoyers,
	Andy Lutomirski, Michael Ellerman, Paul Mackerras, linuxppc-dev

powerpc did the following on some, but not all, paths through
switch_mm_irqs_off():

       /*
        * Only need the full barrier when switching between processes.
        * Barrier when switching from kernel to userspace is not
        * required here, given that it is implied by mmdrop(). Barrier
        * when switching from userspace to kernel is not needed after
        * store to rq->curr.
        */
       if (likely(!(atomic_read(&next->membarrier_state) &
                    (MEMBARRIER_STATE_PRIVATE_EXPEDITED |
                     MEMBARRIER_STATE_GLOBAL_EXPEDITED)) || !prev))
               return;

This is puzzling: if !prev, then one might expect that we are switching
from kernel to user, not user to kernel, which is inconsistent with the
comment.  But this is all nonsense, because the one and only caller would
never have prev == NULL and would, in fact, OOPS if prev == NULL.

In any event, this code is unnecessary, since the new generic
membarrier_finish_switch_mm() provides the same barrier without arch help.

arch/powerpc/include/asm/membarrier.h remains as an empty header,
because a later patch in this series will add code to it.

Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/powerpc/include/asm/membarrier.h | 24 ------------------------
 arch/powerpc/mm/mmu_context.c         |  1 -
 2 files changed, 25 deletions(-)

diff --git a/arch/powerpc/include/asm/membarrier.h b/arch/powerpc/include/asm/membarrier.h
index de7f79157918..b90766e95bd1 100644
--- a/arch/powerpc/include/asm/membarrier.h
+++ b/arch/powerpc/include/asm/membarrier.h
@@ -1,28 +1,4 @@
 #ifndef _ASM_POWERPC_MEMBARRIER_H
 #define _ASM_POWERPC_MEMBARRIER_H
 
-static inline void membarrier_arch_switch_mm(struct mm_struct *prev,
-					     struct mm_struct *next,
-					     struct task_struct *tsk)
-{
-	/*
-	 * Only need the full barrier when switching between processes.
-	 * Barrier when switching from kernel to userspace is not
-	 * required here, given that it is implied by mmdrop(). Barrier
-	 * when switching from userspace to kernel is not needed after
-	 * store to rq->curr.
-	 */
-	if (IS_ENABLED(CONFIG_SMP) &&
-	    likely(!(atomic_read(&next->membarrier_state) &
-		     (MEMBARRIER_STATE_PRIVATE_EXPEDITED |
-		      MEMBARRIER_STATE_GLOBAL_EXPEDITED)) || !prev))
-		return;
-
-	/*
-	 * The membarrier system call requires a full memory barrier
-	 * after storing to rq->curr, before going back to user-space.
-	 */
-	smp_mb();
-}
-
 #endif /* _ASM_POWERPC_MEMBARRIER_H */
diff --git a/arch/powerpc/mm/mmu_context.c b/arch/powerpc/mm/mmu_context.c
index 74246536b832..5f2daa6b0497 100644
--- a/arch/powerpc/mm/mmu_context.c
+++ b/arch/powerpc/mm/mmu_context.c
@@ -84,7 +84,6 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		asm volatile ("dssall");
 
 	if (!new_on_cpu)
-		membarrier_arch_switch_mm(prev, next, tsk);
 
 	/*
 	 * The actual HW switching method differs between the various
-- 
2.33.1


* [PATCH 07/23] membarrier: Rewrite sync_core_before_usermode() and improve documentation
  2022-01-08 16:43 [PATCH 00/23] mm, sched: Rework lazy mm handling Andy Lutomirski
  2022-01-08 16:43 ` [PATCH 01/23] membarrier: Document why membarrier() works Andy Lutomirski
@ 2022-01-08 16:43   ` Andy Lutomirski
  2022-01-08 16:43 ` [PATCH 03/23] membarrier: Remove membarrier_arch_switch_mm() prototype in core code Andy Lutomirski
                     ` (20 subsequent siblings)
  22 siblings, 0 replies; 79+ messages in thread
From: Andy Lutomirski @ 2022-01-08 16:43 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Nicholas Piggin, Anton Blanchard, Benjamin Herrenschmidt,
	Paul Mackerras, Randy Dunlap, linux-arch, x86, Rik van Riel,
	Dave Hansen, Peter Zijlstra, Nadav Amit, Mathieu Desnoyers,
	Andy Lutomirski, Michael Ellerman, Paul Mackerras, linuxppc-dev,
	Catalin Marinas, Will Deacon, linux-arm-kernel, stable

The old sync_core_before_usermode() comments suggested that a
non-icache-syncing return-to-usermode instruction is x86-specific and that
all other architectures automatically notice cross-modified code on return
to userspace.

This is misleading.  The incantation needed to modify code from one
CPU and execute it on another CPU is highly architecture dependent.
On x86, according to the SDM, one must modify the code, issue SFENCE
if the modification was WC or nontemporal, and then issue a "serializing
instruction" on the CPU that will execute the code.  membarrier() can do
the latter.

On arm, arm64 and powerpc, one must flush the icache and then flush the
pipeline on the target CPU, although the CPU manuals don't necessarily use
this language.

So let's drop any pretense that we can have a generic way to define or
implement membarrier's SYNC_CORE operation and instead require all
architectures to define the helper and supply their own documentation as to
how to use it.  This means x86, arm, arm64, and powerpc for now.  Let's also
rename the function from sync_core_before_usermode() to
membarrier_sync_core_before_usermode() because the precise flushing details
may very well be specific to membarrier, and even the concept of
"sync_core" in the kernel is mostly an x86-ism.

(It may well be the case that, on real x86 processors, synchronizing the
 icache (which requires no action at all) and "flushing the pipeline" is
 sufficient, but trying to use this language would be confusing at best.
 LFENCE does something awfully like "flushing the pipeline", but the SDM
 does not permit LFENCE as an alternative to a "serializing instruction"
 for this purpose.)
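
To make the usage model concrete, here is an illustrative userspace-side
sketch of the JIT publish sequence that the updated arch-support.txt
text describes.  The membarrier() wrapper and publish_jitted_code() are
made-up names, and the process is assumed to have issued
MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE once at startup:

	#include <linux/membarrier.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	static int membarrier(int cmd, unsigned int flags, int cpu_id)
	{
		return syscall(__NR_membarrier, cmd, flags, cpu_id);
	}

	static void publish_jitted_code(void *buf, size_t len)
	{
		/* 1. The new instructions have been written into buf. */

		/* 2. Make the icache coherent (needed on arm/arm64; a no-op
		 *    on x86). */
		__builtin___clear_cache((char *)buf, (char *)buf + len);

		/* 3. Force a core-serializing event on every CPU that might
		 *    run a thread of this process. */
		membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, 0, 0);

		/* 4. Any thread may now jump to buf. */
	}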

Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: linux-arm-kernel@lists.infradead.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: x86@kernel.org
Cc: stable@vger.kernel.org
Acked-by: Will Deacon <will@kernel.org> # for arm64
Fixes: 70216e18e519 ("membarrier: Provide core serializing command, *_SYNC_CORE")
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 .../membarrier-sync-core/arch-support.txt     | 69 ++++++-------------
 arch/arm/include/asm/membarrier.h             | 21 ++++++
 arch/arm64/include/asm/membarrier.h           | 19 +++++
 arch/powerpc/include/asm/membarrier.h         | 10 +++
 arch/x86/Kconfig                              |  1 -
 arch/x86/include/asm/membarrier.h             | 25 +++++++
 arch/x86/include/asm/sync_core.h              | 20 ------
 arch/x86/kernel/alternative.c                 |  2 +-
 arch/x86/kernel/cpu/mce/core.c                |  2 +-
 arch/x86/mm/tlb.c                             |  3 +-
 drivers/misc/sgi-gru/grufault.c               |  2 +-
 drivers/misc/sgi-gru/gruhandles.c             |  2 +-
 drivers/misc/sgi-gru/grukservices.c           |  2 +-
 include/linux/sched/mm.h                      |  1 -
 include/linux/sync_core.h                     | 21 ------
 init/Kconfig                                  |  3 -
 kernel/sched/membarrier.c                     | 14 +++-
 17 files changed, 115 insertions(+), 102 deletions(-)
 create mode 100644 arch/arm/include/asm/membarrier.h
 create mode 100644 arch/arm64/include/asm/membarrier.h
 create mode 100644 arch/x86/include/asm/membarrier.h
 delete mode 100644 include/linux/sync_core.h

diff --git a/Documentation/features/sched/membarrier-sync-core/arch-support.txt b/Documentation/features/sched/membarrier-sync-core/arch-support.txt
index 883d33b265d6..4009b26bf5c3 100644
--- a/Documentation/features/sched/membarrier-sync-core/arch-support.txt
+++ b/Documentation/features/sched/membarrier-sync-core/arch-support.txt
@@ -5,51 +5,26 @@
 #
 # Architecture requirements
 #
-# * arm/arm64/powerpc
 #
-# Rely on implicit context synchronization as a result of exception return
-# when returning from IPI handler, and when returning to user-space.
-#
-# * x86
-#
-# x86-32 uses IRET as return from interrupt, which takes care of the IPI.
-# However, it uses both IRET and SYSEXIT to go back to user-space. The IRET
-# instruction is core serializing, but not SYSEXIT.
-#
-# x86-64 uses IRET as return from interrupt, which takes care of the IPI.
-# However, it can return to user-space through either SYSRETL (compat code),
-# SYSRETQ, or IRET.
-#
-# Given that neither SYSRET{L,Q}, nor SYSEXIT, are core serializing, we rely
-# instead on write_cr3() performed by switch_mm() to provide core serialization
-# after changing the current mm, and deal with the special case of kthread ->
-# uthread (temporarily keeping current mm into active_mm) by issuing a
-# sync_core_before_usermode() in that specific case.
-#
-    -----------------------
-    |         arch |status|
-    -----------------------
-    |       alpha: | TODO |
-    |         arc: | TODO |
-    |         arm: |  ok  |
-    |       arm64: |  ok  |
-    |        csky: | TODO |
-    |       h8300: | TODO |
-    |     hexagon: | TODO |
-    |        ia64: | TODO |
-    |        m68k: | TODO |
-    |  microblaze: | TODO |
-    |        mips: | TODO |
-    |       nds32: | TODO |
-    |       nios2: | TODO |
-    |    openrisc: | TODO |
-    |      parisc: | TODO |
-    |     powerpc: |  ok  |
-    |       riscv: | TODO |
-    |        s390: | TODO |
-    |          sh: | TODO |
-    |       sparc: | TODO |
-    |          um: | TODO |
-    |         x86: |  ok  |
-    |      xtensa: | TODO |
-    -----------------------
+# An architecture that wants to support
+# MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE needs to define precisely what it
+# is supposed to do and implement membarrier_sync_core_before_usermode() to
+# make it do that.  Then it can select ARCH_HAS_MEMBARRIER_SYNC_CORE via
+# Kconfig and document what SYNC_CORE does on that architecture in this
+# list.
+#
+# On x86, a program can safely modify code, issue
+# MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, and then execute that code, via
+# the modified address or an alias, from any thread in the calling process.
+#
+# On arm and arm64, a program can modify code, flush the icache as needed,
+# and issue MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE to force a "context
+# synchronizing event", aka pipeline flush on all CPUs that might run the
+# calling process.  Then the program can execute the modified code as long
+# as it is executed from an address consistent with the icache flush and
+# the CPU's cache type.  On arm, cacheflush(2) can be used for the icache
+# flushing operation.
+#
+# On powerpc, a program can use MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE
+# similarly to arm64.  It would be nice if the powerpc maintainers could
+# add a more clear explanation.
diff --git a/arch/arm/include/asm/membarrier.h b/arch/arm/include/asm/membarrier.h
new file mode 100644
index 000000000000..c162a0758657
--- /dev/null
+++ b/arch/arm/include/asm/membarrier.h
@@ -0,0 +1,21 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_ARM_MEMBARRIER_H
+#define _ASM_ARM_MEMBARRIER_H
+
+#include <asm/barrier.h>
+
+/*
+ * On arm, anyone trying to use membarrier() to handle JIT code is required
+ * to first flush the icache (most likely by using cacheflush(2)) and then
+ * do SYNC_CORE.  All that's needed after the icache flush is to execute a
+ * "context synchronization event".
+ *
+ * Returning to user mode is a context synchronization event, so no
+ * specific action by the kernel is needed other than ensuring that the
+ * kernel is entered.
+ */
+static inline void membarrier_sync_core_before_usermode(void)
+{
+}
+
+#endif /* _ASM_ARM_MEMBARRIER_H */
diff --git a/arch/arm64/include/asm/membarrier.h b/arch/arm64/include/asm/membarrier.h
new file mode 100644
index 000000000000..db8e0ea57253
--- /dev/null
+++ b/arch/arm64/include/asm/membarrier.h
@@ -0,0 +1,19 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_ARM64_MEMBARRIER_H
+#define _ASM_ARM64_MEMBARRIER_H
+
+#include <asm/barrier.h>
+
+/*
+ * On arm64, anyone trying to use membarrier() to handle JIT code is
+ * required to first flush the icache and then do SYNC_CORE.  All that's
+ * needed after the icache flush is to execute a "context synchronization
+ * event".  Right now, ERET does this, and we are guaranteed to ERET before
+ * any user code runs.  If Linux ever programs the CPU to make ERET stop
+ * being a context synchronizing event, then this will need to be adjusted.
+ */
+static inline void membarrier_sync_core_before_usermode(void)
+{
+}
+
+#endif /* _ASM_ARM64_MEMBARRIER_H */
diff --git a/arch/powerpc/include/asm/membarrier.h b/arch/powerpc/include/asm/membarrier.h
index b90766e95bd1..466abe6fdcea 100644
--- a/arch/powerpc/include/asm/membarrier.h
+++ b/arch/powerpc/include/asm/membarrier.h
@@ -1,4 +1,14 @@
 #ifndef _ASM_POWERPC_MEMBARRIER_H
 #define _ASM_POWERPC_MEMBARRIER_H
 
+#include <asm/barrier.h>
+
+/*
+ * The RFI family of instructions are context synchronising, and
+ * that is how we return to userspace, so nothing is required here.
+ */
+static inline void membarrier_sync_core_before_usermode(void)
+{
+}
+
 #endif /* _ASM_POWERPC_MEMBARRIER_H */
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index d9830e7e1060..5060c38bf560 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -90,7 +90,6 @@ config X86
 	select ARCH_HAS_SET_DIRECT_MAP
 	select ARCH_HAS_STRICT_KERNEL_RWX
 	select ARCH_HAS_STRICT_MODULE_RWX
-	select ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
 	select ARCH_HAS_SYSCALL_WRAPPER
 	select ARCH_HAS_UBSAN_SANITIZE_ALL
 	select ARCH_HAS_DEBUG_WX
diff --git a/arch/x86/include/asm/membarrier.h b/arch/x86/include/asm/membarrier.h
new file mode 100644
index 000000000000..9b72a1b49359
--- /dev/null
+++ b/arch/x86/include/asm/membarrier.h
@@ -0,0 +1,25 @@
+#ifndef _ASM_X86_MEMBARRIER_H
+#define _ASM_X86_MEMBARRIER_H
+
+#include <asm/sync_core.h>
+
+/*
+ * Ensure that the CPU notices any instruction changes before the next time
+ * it returns to usermode.
+ */
+static inline void membarrier_sync_core_before_usermode(void)
+{
+	/* With PTI, we unconditionally serialize before running user code. */
+	if (static_cpu_has(X86_FEATURE_PTI))
+		return;
+
+	/*
+	 * Even if we're in an interrupt, we might reschedule before returning,
+	 * in which case we could switch to a different thread in the same mm
+	 * and return using SYSRET or SYSEXIT.  Instead of trying to keep
+	 * track of our need to sync the core, just sync right away.
+	 */
+	sync_core();
+}
+
+#endif /* _ASM_X86_MEMBARRIER_H */
diff --git a/arch/x86/include/asm/sync_core.h b/arch/x86/include/asm/sync_core.h
index ab7382f92aff..bfe4ac4e6be2 100644
--- a/arch/x86/include/asm/sync_core.h
+++ b/arch/x86/include/asm/sync_core.h
@@ -88,24 +88,4 @@ static inline void sync_core(void)
 	iret_to_self();
 }
 
-/*
- * Ensure that a core serializing instruction is issued before returning
- * to user-mode. x86 implements return to user-space through sysexit,
- * sysrel, and sysretq, which are not core serializing.
- */
-static inline void sync_core_before_usermode(void)
-{
-	/* With PTI, we unconditionally serialize before running user code. */
-	if (static_cpu_has(X86_FEATURE_PTI))
-		return;
-
-	/*
-	 * Even if we're in an interrupt, we might reschedule before returning,
-	 * in which case we could switch to a different thread in the same mm
-	 * and return using SYSRET or SYSEXIT.  Instead of trying to keep
-	 * track of our need to sync the core, just sync right away.
-	 */
-	sync_core();
-}
-
 #endif /* _ASM_X86_SYNC_CORE_H */
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index e9da3dc71254..b47cd22b2eb1 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -17,7 +17,7 @@
 #include <linux/kprobes.h>
 #include <linux/mmu_context.h>
 #include <linux/bsearch.h>
-#include <linux/sync_core.h>
+#include <asm/sync_core.h>
 #include <asm/text-patching.h>
 #include <asm/alternative.h>
 #include <asm/sections.h>
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 193204aee880..a2529e09f620 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -41,12 +41,12 @@
 #include <linux/irq_work.h>
 #include <linux/export.h>
 #include <linux/set_memory.h>
-#include <linux/sync_core.h>
 #include <linux/task_work.h>
 #include <linux/hardirq.h>
 
 #include <asm/intel-family.h>
 #include <asm/processor.h>
+#include <asm/sync_core.h>
 #include <asm/traps.h>
 #include <asm/tlbflush.h>
 #include <asm/mce.h>
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 1ae15172885e..74b7a615bc15 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -12,6 +12,7 @@
 #include <linux/sched/mm.h>
 
 #include <asm/tlbflush.h>
+#include <asm/membarrier.h>
 #include <asm/mmu_context.h>
 #include <asm/nospec-branch.h>
 #include <asm/cache.h>
@@ -491,7 +492,7 @@ static void sync_core_if_membarrier_enabled(struct mm_struct *next)
 #ifdef CONFIG_MEMBARRIER
 	if (unlikely(atomic_read(&next->membarrier_state) &
 		     MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE))
-		sync_core_before_usermode();
+		membarrier_sync_core_before_usermode();
 #endif
 }
 
diff --git a/drivers/misc/sgi-gru/grufault.c b/drivers/misc/sgi-gru/grufault.c
index d7ef61e602ed..462c667bd6c4 100644
--- a/drivers/misc/sgi-gru/grufault.c
+++ b/drivers/misc/sgi-gru/grufault.c
@@ -20,8 +20,8 @@
 #include <linux/io.h>
 #include <linux/uaccess.h>
 #include <linux/security.h>
-#include <linux/sync_core.h>
 #include <linux/prefetch.h>
+#include <asm/sync_core.h>
 #include "gru.h"
 #include "grutables.h"
 #include "grulib.h"
diff --git a/drivers/misc/sgi-gru/gruhandles.c b/drivers/misc/sgi-gru/gruhandles.c
index 1d75d5e540bc..c8cba1c1b00f 100644
--- a/drivers/misc/sgi-gru/gruhandles.c
+++ b/drivers/misc/sgi-gru/gruhandles.c
@@ -16,7 +16,7 @@
 #define GRU_OPERATION_TIMEOUT	(((cycles_t) local_cpu_data->itc_freq)*10)
 #define CLKS2NSEC(c)		((c) *1000000000 / local_cpu_data->itc_freq)
 #else
-#include <linux/sync_core.h>
+#include <asm/sync_core.h>
 #include <asm/tsc.h>
 #define GRU_OPERATION_TIMEOUT	((cycles_t) tsc_khz*10*1000)
 #define CLKS2NSEC(c)		((c) * 1000000 / tsc_khz)
diff --git a/drivers/misc/sgi-gru/grukservices.c b/drivers/misc/sgi-gru/grukservices.c
index 0ea923fe6371..ce03ff3f7c3a 100644
--- a/drivers/misc/sgi-gru/grukservices.c
+++ b/drivers/misc/sgi-gru/grukservices.c
@@ -16,10 +16,10 @@
 #include <linux/miscdevice.h>
 #include <linux/proc_fs.h>
 #include <linux/interrupt.h>
-#include <linux/sync_core.h>
 #include <linux/uaccess.h>
 #include <linux/delay.h>
 #include <linux/export.h>
+#include <asm/sync_core.h>
 #include <asm/io_apic.h>
 #include "gru.h"
 #include "grulib.h"
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index e8919995d8dd..e107f292fc42 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -7,7 +7,6 @@
 #include <linux/sched.h>
 #include <linux/mm_types.h>
 #include <linux/gfp.h>
-#include <linux/sync_core.h>
 
 /*
  * Routines for handling mm_structs
diff --git a/include/linux/sync_core.h b/include/linux/sync_core.h
deleted file mode 100644
index 013da4b8b327..000000000000
--- a/include/linux/sync_core.h
+++ /dev/null
@@ -1,21 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _LINUX_SYNC_CORE_H
-#define _LINUX_SYNC_CORE_H
-
-#ifdef CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
-#include <asm/sync_core.h>
-#else
-/*
- * This is a dummy sync_core_before_usermode() implementation that can be used
- * on all architectures which return to user-space through core serializing
- * instructions.
- * If your architecture returns to user-space through non-core-serializing
- * instructions, you need to write your own functions.
- */
-static inline void sync_core_before_usermode(void)
-{
-}
-#endif
-
-#endif /* _LINUX_SYNC_CORE_H */
-
diff --git a/init/Kconfig b/init/Kconfig
index 11f8a845f259..bbaf93f9438b 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -2364,9 +2364,6 @@ source "kernel/Kconfig.locks"
 config ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
 	bool
 
-config ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
-	bool
-
 # It may be useful for an architecture to override the definitions of the
 # SYSCALL_DEFINE() and __SYSCALL_DEFINEx() macros in <linux/syscalls.h>
 # and the COMPAT_ variants in <linux/compat.h>, in particular to use a
diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index 327830f89c37..eb73eeaedc7d 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -5,6 +5,14 @@
  * membarrier system call
  */
 #include "sched.h"
+#ifdef CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE
+#include <asm/membarrier.h>
+#else
+static inline void membarrier_sync_core_before_usermode(void)
+{
+	compiletime_assert(0, "architecture does not implement membarrier_sync_core_before_usermode");
+}
+#endif
 
 /*
  * The basic principle behind the regular memory barrier mode of
@@ -231,12 +239,12 @@ static void ipi_sync_core(void *info)
 	 * the big comment at the top of this file.
 	 *
 	 * A sync_core() would provide this guarantee, but
-	 * sync_core_before_usermode() might end up being deferred until
-	 * after membarrier()'s smp_mb().
+	 * membarrier_sync_core_before_usermode() might end up being deferred
+	 * until after membarrier()'s smp_mb().
 	 */
 	smp_mb();	/* IPIs should be serializing but paranoid. */
 
-	sync_core_before_usermode();
+	membarrier_sync_core_before_usermode();
 }
 
 static void ipi_rseq(void *info)
-- 
2.33.1


^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH 08/23] membarrier: Remove redundant clear of mm->membarrier_state in exec_mmap()
  2022-01-08 16:43 [PATCH 00/23] mm, sched: Rework lazy mm handling Andy Lutomirski
                   ` (6 preceding siblings ...)
  2022-01-08 16:43   ` Andy Lutomirski
@ 2022-01-08 16:43 ` Andy Lutomirski
  2022-01-12 16:13   ` Mathieu Desnoyers
  2022-01-08 16:43 ` [PATCH 09/23] membarrier: Fix incorrect barrier positions during exec and kthread_use_mm() Andy Lutomirski
                   ` (14 subsequent siblings)
  22 siblings, 1 reply; 79+ messages in thread
From: Andy Lutomirski @ 2022-01-08 16:43 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Nicholas Piggin, Anton Blanchard, Benjamin Herrenschmidt,
	Paul Mackerras, Randy Dunlap, linux-arch, x86, Rik van Riel,
	Dave Hansen, Peter Zijlstra, Nadav Amit, Mathieu Desnoyers,
	Andy Lutomirski

exec_mmap() supplies a brand-new mm from mm_alloc(), and membarrier_state
is already 0.  There's no need to clear it again.
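
For reference, a sketch of mm_alloc() itself, close to the version in
kernel/fork.c at the time of writing (simplified), shows why the state is
guaranteed to start at zero:

    struct mm_struct *mm_alloc(void)
    {
            struct mm_struct *mm;

            mm = allocate_mm();
            if (!mm)
                    return NULL;

            /* The zero-fill already leaves mm->membarrier_state == 0. */
            memset(mm, 0, sizeof(*mm));
            return mm_init(mm, current, current_user_ns());
    }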

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 kernel/sched/membarrier.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index eb73eeaedc7d..c38014c2ed66 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -285,7 +285,6 @@ void membarrier_exec_mmap(struct mm_struct *mm)
 	 * clearing this state.
 	 */
 	smp_mb();
-	atomic_set(&mm->membarrier_state, 0);
 	/*
 	 * Keep the runqueue membarrier_state in sync with this mm
 	 * membarrier_state.
-- 
2.33.1


^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH 09/23] membarrier: Fix incorrect barrier positions during exec and kthread_use_mm()
  2022-01-08 16:43 [PATCH 00/23] mm, sched: Rework lazy mm handling Andy Lutomirski
                   ` (7 preceding siblings ...)
  2022-01-08 16:43 ` [PATCH 08/23] membarrier: Remove redundant clear of mm->membarrier_state in exec_mmap() Andy Lutomirski
@ 2022-01-08 16:43 ` Andy Lutomirski
  2022-01-12 16:30   ` Mathieu Desnoyers
  2022-01-08 16:43 ` [PATCH 10/23] x86/events, x86/insn-eval: Remove incorrect active_mm references Andy Lutomirski
                   ` (13 subsequent siblings)
  22 siblings, 1 reply; 79+ messages in thread
From: Andy Lutomirski @ 2022-01-08 16:43 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Nicholas Piggin, Anton Blanchard, Benjamin Herrenschmidt,
	Paul Mackerras, Randy Dunlap, linux-arch, x86, Rik van Riel,
	Dave Hansen, Peter Zijlstra, Nadav Amit, Mathieu Desnoyers,
	Andy Lutomirski

membarrier() requires a barrier before changes to rq->curr->mm, not just
before writes to rq->membarrier_state.  Move the barrier in exec_mmap() to
the right place.  Add the barrier in kthread_use_mm() -- it was entirely
missing before.

This patch makes exec_mmap() and kthread_use_mm() use the same membarrier
hooks, which results in some code deletion.

As an added bonus, this will eliminate a redundant barrier in execve() on
arches for which spinlock acquisition is a barrier.
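
For context, the barrier pairs with the CPU-selection step in
kernel/sched/membarrier.c, paraphrased below (details elided, not code from
this patch): membarrier() only IPIs CPUs whose rq->curr->mm matches the
caller's mm, so a CPU that is concurrently installing a new mm must fully
order its earlier memory accesses before its ->mm store becomes visible.

    /* Paraphrased from membarrier_private_expedited(). */
    for_each_online_cpu(cpu) {
            struct task_struct *p;

            p = rcu_dereference(cpu_rq(cpu)->curr);
            if (p && p->mm == mm)
                    __cpumask_set_cpu(cpu, tmpmask);        /* gets an IPI */
    }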

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 fs/exec.c                 |  6 +++++-
 include/linux/sched/mm.h  |  2 --
 kernel/kthread.c          |  5 +++++
 kernel/sched/membarrier.c | 15 ---------------
 4 files changed, 10 insertions(+), 18 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 38b05e01c5bd..325dab98bc51 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1001,12 +1001,16 @@ static int exec_mmap(struct mm_struct *mm)
 	}
 
 	task_lock(tsk);
-	membarrier_exec_mmap(mm);
+	/*
+	 * membarrier() requires a full barrier before switching mm.
+	 */
+	smp_mb__after_spinlock();
 
 	local_irq_disable();
 	active_mm = tsk->active_mm;
 	tsk->active_mm = mm;
 	WRITE_ONCE(tsk->mm, mm);  /* membarrier reads this without locks */
+	membarrier_update_current_mm(mm);
 	/*
 	 * This prevents preemption while active_mm is being loaded and
 	 * it and mm are being updated, which could cause problems for
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index e107f292fc42..f1d2beac464c 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -344,8 +344,6 @@ enum {
 #include <asm/membarrier.h>
 #endif
 
-extern void membarrier_exec_mmap(struct mm_struct *mm);
-
 extern void membarrier_update_current_mm(struct mm_struct *next_mm);
 
 /*
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 3b18329f885c..18b0a2e0e3b2 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -1351,6 +1351,11 @@ void kthread_use_mm(struct mm_struct *mm)
 	WARN_ON_ONCE(tsk->mm);
 
 	task_lock(tsk);
+	/*
+	 * membarrier() requires a full barrier before switching mm.
+	 */
+	smp_mb__after_spinlock();
+
 	/* Hold off tlb flush IPIs while switching mm's */
 	local_irq_disable();
 	active_mm = tsk->active_mm;
diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index c38014c2ed66..44fafa6e1efd 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -277,21 +277,6 @@ static void ipi_sync_rq_state(void *info)
 	smp_mb();
 }
 
-void membarrier_exec_mmap(struct mm_struct *mm)
-{
-	/*
-	 * Issue a memory barrier before clearing membarrier_state to
-	 * guarantee that no memory access prior to exec is reordered after
-	 * clearing this state.
-	 */
-	smp_mb();
-	/*
-	 * Keep the runqueue membarrier_state in sync with this mm
-	 * membarrier_state.
-	 */
-	this_cpu_write(runqueues.membarrier_state, 0);
-}
-
 void membarrier_update_current_mm(struct mm_struct *next_mm)
 {
 	struct rq *rq = this_rq();
-- 
2.33.1


^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH 10/23] x86/events, x86/insn-eval: Remove incorrect active_mm references
  2022-01-08 16:43 [PATCH 00/23] mm, sched: Rework lazy mm handling Andy Lutomirski
                   ` (8 preceding siblings ...)
  2022-01-08 16:43 ` [PATCH 09/23] membarrier: Fix incorrect barrier positions during exec and kthread_use_mm() Andy Lutomirski
@ 2022-01-08 16:43 ` Andy Lutomirski
  2022-01-08 16:43 ` [PATCH 11/23] sched/scs: Initialize shadow stack on idle thread bringup, not shutdown Andy Lutomirski
                   ` (12 subsequent siblings)
  22 siblings, 0 replies; 79+ messages in thread
From: Andy Lutomirski @ 2022-01-08 16:43 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Nicholas Piggin, Anton Blanchard, Benjamin Herrenschmidt,
	Paul Mackerras, Randy Dunlap, linux-arch, x86, Rik van Riel,
	Dave Hansen, Peter Zijlstra, Nadav Amit, Mathieu Desnoyers,
	Andy Lutomirski, Joerg Roedel, Masami Hiramatsu

When decoding an instruction or handling a perf event that references an
LDT segment, if we don't have a valid user context, trying to access the
LDT by any means other than SLDT is racy.  Certainly, using
current->active_mm is wrong, as active_mm can point to a real user mm when
CR3 and LDTR no longer reference that mm.

Clean up the code.  If nmi_uaccess_okay() says we don't have a valid
context, just fail.  Otherwise use current->mm.
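
The resulting access pattern, written out as a hypothetical helper (not
something this patch adds), mirrors what the perf-event path now does:

    /* Hypothetical helper; illustrates the pattern only. */
    static struct ldt_struct *try_get_current_ldt(void)
    {
            if (!nmi_uaccess_okay())
                    return NULL;    /* no valid, non-lazy user mm: give up */

            /* Pairs with the smp_store_release() that installs a new LDT. */
            return smp_load_acquire(&current->mm->context.ldt);
    }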

Cc: Joerg Roedel <jroedel@suse.de>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/events/core.c   |  9 ++++++++-
 arch/x86/lib/insn-eval.c | 13 ++++++++++---
 2 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 6dfa8ddaa60f..930082f0eba5 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2800,8 +2800,15 @@ static unsigned long get_segment_base(unsigned int segment)
 #ifdef CONFIG_MODIFY_LDT_SYSCALL
 		struct ldt_struct *ldt;
 
+		/*
+		 * If we're not in a valid context with a real (not just lazy)
+		 * user mm, then don't even try.
+		 */
+		if (!nmi_uaccess_okay())
+			return 0;
+
 		/* IRQs are off, so this synchronizes with smp_store_release */
-		ldt = READ_ONCE(current->active_mm->context.ldt);
+		ldt = smp_load_acquire(&current->mm->context.ldt);
 		if (!ldt || idx >= ldt->nr_entries)
 			return 0;
 
diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c
index a1d24fdc07cf..87a85a9dcdc4 100644
--- a/arch/x86/lib/insn-eval.c
+++ b/arch/x86/lib/insn-eval.c
@@ -609,14 +609,21 @@ static bool get_desc(struct desc_struct *out, unsigned short sel)
 		/* Bits [15:3] contain the index of the desired entry. */
 		sel >>= 3;
 
-		mutex_lock(&current->active_mm->context.lock);
-		ldt = current->active_mm->context.ldt;
+		/*
+		 * If we're not in a valid context with a real (not just lazy)
+		 * user mm, then don't even try.
+		 */
+		if (!nmi_uaccess_okay())
+			return false;
+
+		mutex_lock(&current->mm->context.lock);
+		ldt = current->mm->context.ldt;
 		if (ldt && sel < ldt->nr_entries) {
 			*out = ldt->entries[sel];
 			success = true;
 		}
 
-		mutex_unlock(&current->active_mm->context.lock);
+		mutex_unlock(&current->mm->context.lock);
 
 		return success;
 	}
-- 
2.33.1


^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH 11/23] sched/scs: Initialize shadow stack on idle thread bringup, not shutdown
  2022-01-08 16:43 [PATCH 00/23] mm, sched: Rework lazy mm handling Andy Lutomirski
                   ` (9 preceding siblings ...)
  2022-01-08 16:43 ` [PATCH 10/23] x86/events, x86/insn-eval: Remove incorrect active_mm references Andy Lutomirski
@ 2022-01-08 16:43 ` Andy Lutomirski
  2022-01-10 22:06   ` Sami Tolvanen
  2022-01-08 16:43 ` [PATCH 12/23] Rework "sched/core: Fix illegal RCU from offline CPUs" Andy Lutomirski
                   ` (11 subsequent siblings)
  22 siblings, 1 reply; 79+ messages in thread
From: Andy Lutomirski @ 2022-01-08 16:43 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Nicholas Piggin, Anton Blanchard, Benjamin Herrenschmidt,
	Paul Mackerras, Randy Dunlap, linux-arch, x86, Rik van Riel,
	Dave Hansen, Peter Zijlstra, Nadav Amit, Mathieu Desnoyers,
	Andy Lutomirski, Woody Lin, Valentin Schneider, Sami Tolvanen

Starting with commit 63acd42c0d49 ("sched/scs: Reset the shadow stack when
idle_task_exit"), the idle thread's shadow stack was reset from the idle
task's context during CPU hot-unplug.  This was fragile: between resetting
the shadow stack and actually stopping the idle task, the shadow stack
did not match the actual call stack.

Clean this up by resetting the idle task's SCS in bringup_cpu().

init_idle() still does scs_task_reset() -- see the comments there.  I
leave this to an SCS maintainer to untangle further.
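
For reference, scs_task_reset() simply rewinds the task's shadow call stack
pointer to the base of its SCS area (paraphrased from include/linux/scs.h),
which is why calling it on a task that is still executing leaves the shadow
stack out of sync with the real call stack:

    static inline void scs_task_reset(struct task_struct *tsk)
    {
            /* Point the shadow stack pointer back at the base. */
            task_scs_sp(tsk) = task_scs(tsk);
    }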

Cc: Woody Lin <woodylin@google.com>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 kernel/cpu.c        | 3 +++
 kernel/sched/core.c | 9 ++++++++-
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/kernel/cpu.c b/kernel/cpu.c
index 192e43a87407..be16816bb87c 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -33,6 +33,7 @@
 #include <linux/slab.h>
 #include <linux/percpu-rwsem.h>
 #include <linux/cpuset.h>
+#include <linux/scs.h>
 
 #include <trace/events/power.h>
 #define CREATE_TRACE_POINTS
@@ -587,6 +588,8 @@ static int bringup_cpu(unsigned int cpu)
 	struct task_struct *idle = idle_thread_get(cpu);
 	int ret;
 
+	scs_task_reset(idle);
+
 	/*
 	 * Some architectures have to walk the irq descriptors to
 	 * setup the vector space for the cpu which comes online.
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 917068b0a145..acd52a7d1349 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8621,7 +8621,15 @@ void __init init_idle(struct task_struct *idle, int cpu)
 	idle->flags |= PF_IDLE | PF_KTHREAD | PF_NO_SETAFFINITY;
 	kthread_set_per_cpu(idle, cpu);
 
+	/*
+	 * NB: This is called from sched_init() on the *current* idle thread.
+	 * This seems fragile if not actively incorrect.
+	 *
+	 * Initializing SCS for about-to-be-brought-up CPU idle threads
+	 * is in bringup_cpu(), but that does not cover the boot CPU.
+	 */
 	scs_task_reset(idle);
+
 	kasan_unpoison_task_stack(idle);
 
 #ifdef CONFIG_SMP
@@ -8779,7 +8787,6 @@ void idle_task_exit(void)
 		finish_arch_post_lock_switch();
 	}
 
-	scs_task_reset(current);
 	/* finish_cpu(), as ran on the BP, will clean up the active_mm state */
 }
 
-- 
2.33.1


^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH 12/23] Rework "sched/core: Fix illegal RCU from offline CPUs"
  2022-01-08 16:43 [PATCH 00/23] mm, sched: Rework lazy mm handling Andy Lutomirski
                   ` (10 preceding siblings ...)
  2022-01-08 16:43 ` [PATCH 11/23] sched/scs: Initialize shadow stack on idle thread bringup, not shutdown Andy Lutomirski
@ 2022-01-08 16:43 ` Andy Lutomirski
  2022-01-08 16:43 ` [PATCH 13/23] exec: Remove unnecessary vmacache_seqnum clear in exec_mmap() Andy Lutomirski
                   ` (10 subsequent siblings)
  22 siblings, 0 replies; 79+ messages in thread
From: Andy Lutomirski @ 2022-01-08 16:43 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Nicholas Piggin, Anton Blanchard, Benjamin Herrenschmidt,
	Paul Mackerras, Randy Dunlap, linux-arch, x86, Rik van Riel,
	Dave Hansen, Peter Zijlstra, Nadav Amit, Mathieu Desnoyers,
	Andy Lutomirski

This reworks commit bf2c59fce4074e55d622089b34be3a6bc95484fb.  The problem
solved by that commit was that mmdrop() after cpuhp_report_idle_dead() is
an illegal use of RCU, so, with that commit applied, mmdrop() of the last
lazy mm on an offlined CPU was done by the BSP.

With the upcoming reworking of lazy mm references, retaining that design
would involve the cpu hotplug code poking into internal scheduler details.

Rework the fix.  Add a helper unlazy_mm_irqs_off() to fully switch a CPU to
init_mm, releasing any previous lazy active_mm, and do this before
cpuhp_report_idle_dead().

Note that the actual refcounting of init_mm is inconsistent both before and
after this patch.  Most (all?) arches mmgrab(&init_mm) when booting an AP
and set current->active_mm = &init_mm on that AP.  This is consistent with
the current ->active_mm refcounting rules, but architectures don't do a
corresponding mmdrop() when a CPU goes offline.  The result is that each
offline/online cycle leaks one init_mm reference.  This seems fairly
harmless.
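
As a rough sketch of the shape such a helper could take (illustrative only;
the real unlazy_mm_irqs_off() is added to kernel/sched/core.c further down
and may differ in detail):

    /* Sketch only -- see kernel/sched/core.c for the real thing. */
    static void unlazy_mm_irqs_off(void)
    {
            struct mm_struct *old_mm = current->active_mm;

            lockdep_assert_irqs_disabled();

            if (old_mm == &init_mm)
                    return;

            switch_mm_irqs_off(old_mm, &init_mm, current);
            current->active_mm = &init_mm;
            mmdrop(old_mm);         /* drop the lazy reference */
    }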

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/arm/kernel/smp.c                        |  2 -
 arch/arm64/kernel/smp.c                      |  2 -
 arch/csky/kernel/smp.c                       |  2 -
 arch/ia64/kernel/process.c                   |  1 -
 arch/mips/cavium-octeon/smp.c                |  1 -
 arch/mips/kernel/smp-bmips.c                 |  2 -
 arch/mips/kernel/smp-cps.c                   |  1 -
 arch/mips/loongson64/smp.c                   |  2 -
 arch/powerpc/platforms/85xx/smp.c            |  2 -
 arch/powerpc/platforms/powermac/smp.c        |  2 -
 arch/powerpc/platforms/powernv/smp.c         |  1 -
 arch/powerpc/platforms/pseries/hotplug-cpu.c |  2 -
 arch/powerpc/platforms/pseries/pmem.c        |  1 -
 arch/riscv/kernel/cpu-hotplug.c              |  2 -
 arch/s390/kernel/smp.c                       |  1 -
 arch/sh/kernel/smp.c                         |  1 -
 arch/sparc/kernel/smp_64.c                   |  2 -
 arch/x86/kernel/smpboot.c                    |  2 -
 arch/xtensa/kernel/smp.c                     |  1 -
 include/linux/sched/hotplug.h                |  6 ---
 kernel/cpu.c                                 | 18 +-------
 kernel/sched/core.c                          | 43 +++++++++++---------
 kernel/sched/idle.c                          |  1 +
 kernel/sched/sched.h                         |  1 +
 24 files changed, 27 insertions(+), 72 deletions(-)

diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
index 842427ff2b3c..19863ad2f852 100644
--- a/arch/arm/kernel/smp.c
+++ b/arch/arm/kernel/smp.c
@@ -323,8 +323,6 @@ void arch_cpu_idle_dead(void)
 {
 	unsigned int cpu = smp_processor_id();
 
-	idle_task_exit();
-
 	local_irq_disable();
 
 	/*
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index 6f6ff072acbd..4b38fb42543f 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -366,8 +366,6 @@ void cpu_die(void)
 	unsigned int cpu = smp_processor_id();
 	const struct cpu_operations *ops = get_cpu_ops(cpu);
 
-	idle_task_exit();
-
 	local_daif_mask();
 
 	/* Tell __cpu_die() that this CPU is now safe to dispose of */
diff --git a/arch/csky/kernel/smp.c b/arch/csky/kernel/smp.c
index e2993539af8e..4b17c3b8fcba 100644
--- a/arch/csky/kernel/smp.c
+++ b/arch/csky/kernel/smp.c
@@ -309,8 +309,6 @@ void __cpu_die(unsigned int cpu)
 
 void arch_cpu_idle_dead(void)
 {
-	idle_task_exit();
-
 	cpu_report_death();
 
 	while (!secondary_stack)
diff --git a/arch/ia64/kernel/process.c b/arch/ia64/kernel/process.c
index e56d63f4abf9..ddb13db7ff7e 100644
--- a/arch/ia64/kernel/process.c
+++ b/arch/ia64/kernel/process.c
@@ -209,7 +209,6 @@ static inline void play_dead(void)
 
 	max_xtp();
 	local_irq_disable();
-	idle_task_exit();
 	ia64_jump_to_sal(&sal_boot_rendez_state[this_cpu]);
 	/*
 	 * The above is a point of no-return, the processor is
diff --git a/arch/mips/cavium-octeon/smp.c b/arch/mips/cavium-octeon/smp.c
index 89954f5f87fb..7130ec7e9b61 100644
--- a/arch/mips/cavium-octeon/smp.c
+++ b/arch/mips/cavium-octeon/smp.c
@@ -343,7 +343,6 @@ void play_dead(void)
 {
 	int cpu = cpu_number_map(cvmx_get_core_num());
 
-	idle_task_exit();
 	octeon_processor_boot = 0xff;
 	per_cpu(cpu_state, cpu) = CPU_DEAD;
 
diff --git a/arch/mips/kernel/smp-bmips.c b/arch/mips/kernel/smp-bmips.c
index b6ef5f7312cf..bd1e650dd176 100644
--- a/arch/mips/kernel/smp-bmips.c
+++ b/arch/mips/kernel/smp-bmips.c
@@ -388,8 +388,6 @@ static void bmips_cpu_die(unsigned int cpu)
 
 void __ref play_dead(void)
 {
-	idle_task_exit();
-
 	/* flush data cache */
 	_dma_cache_wback_inv(0, ~0);
 
diff --git a/arch/mips/kernel/smp-cps.c b/arch/mips/kernel/smp-cps.c
index bcd6a944b839..23221fcee423 100644
--- a/arch/mips/kernel/smp-cps.c
+++ b/arch/mips/kernel/smp-cps.c
@@ -472,7 +472,6 @@ void play_dead(void)
 	unsigned int cpu;
 
 	local_irq_disable();
-	idle_task_exit();
 	cpu = smp_processor_id();
 	cpu_death = CPU_DEATH_POWER;
 
diff --git a/arch/mips/loongson64/smp.c b/arch/mips/loongson64/smp.c
index 09ebe84a17fe..a1fe59f354d1 100644
--- a/arch/mips/loongson64/smp.c
+++ b/arch/mips/loongson64/smp.c
@@ -788,8 +788,6 @@ void play_dead(void)
 	unsigned int cpu = smp_processor_id();
 	void (*play_dead_at_ckseg1)(int *);
 
-	idle_task_exit();
-
 	prid_imp = read_c0_prid() & PRID_IMP_MASK;
 	prid_rev = read_c0_prid() & PRID_REV_MASK;
 
diff --git a/arch/powerpc/platforms/85xx/smp.c b/arch/powerpc/platforms/85xx/smp.c
index c6df294054fe..9de9e1fcc87a 100644
--- a/arch/powerpc/platforms/85xx/smp.c
+++ b/arch/powerpc/platforms/85xx/smp.c
@@ -121,8 +121,6 @@ static void smp_85xx_cpu_offline_self(void)
 	/* mask all irqs to prevent cpu wakeup */
 	qoriq_pm_ops->irq_mask(cpu);
 
-	idle_task_exit();
-
 	mtspr(SPRN_TCR, 0);
 	mtspr(SPRN_TSR, mfspr(SPRN_TSR));
 
diff --git a/arch/powerpc/platforms/powermac/smp.c b/arch/powerpc/platforms/powermac/smp.c
index 3256a316e884..69d2bdd8246d 100644
--- a/arch/powerpc/platforms/powermac/smp.c
+++ b/arch/powerpc/platforms/powermac/smp.c
@@ -924,7 +924,6 @@ static void pmac_cpu_offline_self(void)
 	int cpu = smp_processor_id();
 
 	local_irq_disable();
-	idle_task_exit();
 	pr_debug("CPU%d offline\n", cpu);
 	generic_set_cpu_dead(cpu);
 	smp_wmb();
@@ -939,7 +938,6 @@ static void pmac_cpu_offline_self(void)
 	int cpu = smp_processor_id();
 
 	local_irq_disable();
-	idle_task_exit();
 
 	/*
 	 * turn off as much as possible, we'll be
diff --git a/arch/powerpc/platforms/powernv/smp.c b/arch/powerpc/platforms/powernv/smp.c
index cbb67813cd5d..cba21d053dae 100644
--- a/arch/powerpc/platforms/powernv/smp.c
+++ b/arch/powerpc/platforms/powernv/smp.c
@@ -169,7 +169,6 @@ static void pnv_cpu_offline_self(void)
 
 	/* Standard hot unplug procedure */
 
-	idle_task_exit();
 	cpu = smp_processor_id();
 	DBG("CPU%d offline\n", cpu);
 	generic_set_cpu_dead(cpu);
diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index d646c22e94ab..c11ccd038866 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -19,7 +19,6 @@
 #include <linux/kernel.h>
 #include <linux/interrupt.h>
 #include <linux/delay.h>
-#include <linux/sched.h>	/* for idle_task_exit */
 #include <linux/sched/hotplug.h>
 #include <linux/cpu.h>
 #include <linux/of.h>
@@ -63,7 +62,6 @@ static void pseries_cpu_offline_self(void)
 	unsigned int hwcpu = hard_smp_processor_id();
 
 	local_irq_disable();
-	idle_task_exit();
 	if (xive_enabled())
 		xive_teardown_cpu();
 	else
diff --git a/arch/powerpc/platforms/pseries/pmem.c b/arch/powerpc/platforms/pseries/pmem.c
index 439ac72c2470..5280fcd5b37d 100644
--- a/arch/powerpc/platforms/pseries/pmem.c
+++ b/arch/powerpc/platforms/pseries/pmem.c
@@ -9,7 +9,6 @@
 #include <linux/kernel.h>
 #include <linux/interrupt.h>
 #include <linux/delay.h>
-#include <linux/sched.h>	/* for idle_task_exit */
 #include <linux/sched/hotplug.h>
 #include <linux/cpu.h>
 #include <linux/of.h>
diff --git a/arch/riscv/kernel/cpu-hotplug.c b/arch/riscv/kernel/cpu-hotplug.c
index df84e0c13db1..6cced2d79f07 100644
--- a/arch/riscv/kernel/cpu-hotplug.c
+++ b/arch/riscv/kernel/cpu-hotplug.c
@@ -77,8 +77,6 @@ void __cpu_die(unsigned int cpu)
  */
 void cpu_stop(void)
 {
-	idle_task_exit();
-
 	(void)cpu_report_death();
 
 	cpu_ops[smp_processor_id()]->cpu_stop();
diff --git a/arch/s390/kernel/smp.c b/arch/s390/kernel/smp.c
index 1a04e5bdf655..328930549803 100644
--- a/arch/s390/kernel/smp.c
+++ b/arch/s390/kernel/smp.c
@@ -987,7 +987,6 @@ void __cpu_die(unsigned int cpu)
 
 void __noreturn cpu_die(void)
 {
-	idle_task_exit();
 	__bpon();
 	pcpu_sigp_retry(pcpu_devices + smp_processor_id(), SIGP_STOP, 0);
 	for (;;) ;
diff --git a/arch/sh/kernel/smp.c b/arch/sh/kernel/smp.c
index 65924d9ec245..cbd14604a736 100644
--- a/arch/sh/kernel/smp.c
+++ b/arch/sh/kernel/smp.c
@@ -106,7 +106,6 @@ int native_cpu_disable(unsigned int cpu)
 
 void play_dead_common(void)
 {
-	idle_task_exit();
 	irq_ctx_exit(raw_smp_processor_id());
 	mb();
 
diff --git a/arch/sparc/kernel/smp_64.c b/arch/sparc/kernel/smp_64.c
index 0224d8f19ed6..450dc9513ff0 100644
--- a/arch/sparc/kernel/smp_64.c
+++ b/arch/sparc/kernel/smp_64.c
@@ -1301,8 +1301,6 @@ void cpu_play_dead(void)
 	int cpu = smp_processor_id();
 	unsigned long pstate;
 
-	idle_task_exit();
-
 	if (tlb_type == hypervisor) {
 		struct trap_per_cpu *tb = &trap_block[cpu];
 
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 85f6e242b6b4..a57a709f2c35 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1656,8 +1656,6 @@ void native_cpu_die(unsigned int cpu)
 
 void play_dead_common(void)
 {
-	idle_task_exit();
-
 	/* Ack it */
 	(void)cpu_report_death();
 
diff --git a/arch/xtensa/kernel/smp.c b/arch/xtensa/kernel/smp.c
index 1254da07ead1..fb011807d041 100644
--- a/arch/xtensa/kernel/smp.c
+++ b/arch/xtensa/kernel/smp.c
@@ -329,7 +329,6 @@ void arch_cpu_idle_dead(void)
  */
 void __ref cpu_die(void)
 {
-	idle_task_exit();
 	local_irq_disable();
 	__asm__ __volatile__(
 			"	movi	a2, cpu_restart\n"
diff --git a/include/linux/sched/hotplug.h b/include/linux/sched/hotplug.h
index 412cdaba33eb..18fa3e63123e 100644
--- a/include/linux/sched/hotplug.h
+++ b/include/linux/sched/hotplug.h
@@ -18,10 +18,4 @@ extern int sched_cpu_dying(unsigned int cpu);
 # define sched_cpu_dying	NULL
 #endif
 
-#ifdef CONFIG_HOTPLUG_CPU
-extern void idle_task_exit(void);
-#else
-static inline void idle_task_exit(void) {}
-#endif
-
 #endif /* _LINUX_SCHED_HOTPLUG_H */
diff --git a/kernel/cpu.c b/kernel/cpu.c
index be16816bb87c..709e2a7583ad 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -3,7 +3,6 @@
  *
  * This code is licenced under the GPL.
  */
-#include <linux/sched/mm.h>
 #include <linux/proc_fs.h>
 #include <linux/smp.h>
 #include <linux/init.h>
@@ -605,21 +604,6 @@ static int bringup_cpu(unsigned int cpu)
 	return bringup_wait_for_ap(cpu);
 }
 
-static int finish_cpu(unsigned int cpu)
-{
-	struct task_struct *idle = idle_thread_get(cpu);
-	struct mm_struct *mm = idle->active_mm;
-
-	/*
-	 * idle_task_exit() will have switched to &init_mm, now
-	 * clean up any remaining active_mm state.
-	 */
-	if (mm != &init_mm)
-		idle->active_mm = &init_mm;
-	mmdrop(mm);
-	return 0;
-}
-
 /*
  * Hotplug state machine related functions
  */
@@ -1699,7 +1683,7 @@ static struct cpuhp_step cpuhp_hp_states[] = {
 	[CPUHP_BRINGUP_CPU] = {
 		.name			= "cpu:bringup",
 		.startup.single		= bringup_cpu,
-		.teardown.single	= finish_cpu,
+		.teardown.single	= NULL,
 		.cant_stop		= true,
 	},
 	/* Final state before CPU kills itself */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index acd52a7d1349..32275b4ff141 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8678,6 +8678,30 @@ void __init init_idle(struct task_struct *idle, int cpu)
 #endif
 }
 
+/*
+ * Drops current->active_mm and switches current->active_mm to &init_mm.
+ * Caller must have IRQs off and must have current->mm == NULL (i.e. must
+ * be in a kernel thread).
+ */
+void unlazy_mm_irqs_off(void)
+{
+	struct mm_struct *mm = current->active_mm;
+
+	lockdep_assert_irqs_disabled();
+
+	if (WARN_ON_ONCE(current->mm))
+		return;
+
+	if (mm == &init_mm)
+		return;
+
+	switch_mm_irqs_off(mm, &init_mm, current);
+	mmgrab(&init_mm);
+	current->active_mm = &init_mm;
+	finish_arch_post_lock_switch();
+	mmdrop(mm);
+}
+
 #ifdef CONFIG_SMP
 
 int cpuset_cpumask_can_shrink(const struct cpumask *cur,
@@ -8771,25 +8795,6 @@ void sched_setnuma(struct task_struct *p, int nid)
 #endif /* CONFIG_NUMA_BALANCING */
 
 #ifdef CONFIG_HOTPLUG_CPU
-/*
- * Ensure that the idle task is using init_mm right before its CPU goes
- * offline.
- */
-void idle_task_exit(void)
-{
-	struct mm_struct *mm = current->active_mm;
-
-	BUG_ON(cpu_online(smp_processor_id()));
-	BUG_ON(current != this_rq()->idle);
-
-	if (mm != &init_mm) {
-		switch_mm(mm, &init_mm, current);
-		finish_arch_post_lock_switch();
-	}
-
-	/* finish_cpu(), as ran on the BP, will clean up the active_mm state */
-}
-
 static int __balance_push_cpu_stop(void *arg)
 {
 	struct task_struct *p = arg;
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index d17b0a5ce6ac..af6a98e7a8d1 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -285,6 +285,7 @@ static void do_idle(void)
 		local_irq_disable();
 
 		if (cpu_is_offline(cpu)) {
+			unlazy_mm_irqs_off();
 			tick_nohz_idle_stop_tick();
 			cpuhp_report_idle_dead();
 			arch_cpu_idle_dead();
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3d3e5793e117..b496a9ee9aec 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3064,3 +3064,4 @@ extern int sched_dynamic_mode(const char *str);
 extern void sched_dynamic_update(int mode);
 #endif
 
+extern void unlazy_mm_irqs_off(void);
-- 
2.33.1


^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH 13/23] exec: Remove unnecessary vmacache_seqnum clear in exec_mmap()
  2022-01-08 16:43 [PATCH 00/23] mm, sched: Rework lazy mm handling Andy Lutomirski
                   ` (11 preceding siblings ...)
  2022-01-08 16:43 ` [PATCH 12/23] Rework "sched/core: Fix illegal RCU from offline CPUs" Andy Lutomirski
@ 2022-01-08 16:43 ` Andy Lutomirski
  2022-01-08 16:43 ` [PATCH 14/23] sched, exec: Factor current mm changes out from exec Andy Lutomirski
                   ` (9 subsequent siblings)
  22 siblings, 0 replies; 79+ messages in thread
From: Andy Lutomirski @ 2022-01-08 16:43 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Nicholas Piggin, Anton Blanchard, Benjamin Herrenschmidt,
	Paul Mackerras, Randy Dunlap, linux-arch, x86, Rik van Riel,
	Dave Hansen, Peter Zijlstra, Nadav Amit, Mathieu Desnoyers,
	Andy Lutomirski

exec_mmap() activates a brand new mm, so vmacache_seqnum is already 0.
Stop zeroing it.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 fs/exec.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/exec.c b/fs/exec.c
index 325dab98bc51..2afa7b0c75f2 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1024,7 +1024,6 @@ static int exec_mmap(struct mm_struct *mm)
 	if (IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
 		local_irq_enable();
 	membarrier_finish_switch_mm(mm);
-	tsk->mm->vmacache_seqnum = 0;
 	vmacache_flush(tsk);
 	task_unlock(tsk);
 	if (old_mm) {
-- 
2.33.1


^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH 14/23] sched, exec: Factor current mm changes out from exec
  2022-01-08 16:43 [PATCH 00/23] mm, sched: Rework lazy mm handling Andy Lutomirski
                   ` (12 preceding siblings ...)
  2022-01-08 16:43 ` [PATCH 13/23] exec: Remove unnecessary vmacache_seqnum clear in exec_mmap() Andy Lutomirski
@ 2022-01-08 16:43 ` Andy Lutomirski
  2022-01-08 16:44 ` [PATCH 15/23] kthread: Switch to __change_current_mm() Andy Lutomirski
                   ` (8 subsequent siblings)
  22 siblings, 0 replies; 79+ messages in thread
From: Andy Lutomirski @ 2022-01-08 16:43 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Nicholas Piggin, Anton Blanchard, Benjamin Herrenschmidt,
	Paul Mackerras, Randy Dunlap, linux-arch, x86, Rik van Riel,
	Dave Hansen, Peter Zijlstra, Nadav Amit, Mathieu Desnoyers,
	Andy Lutomirski

Currently, exec_mmap() open-codes an mm change.  Create new core
__change_current_mm() and __change_current_mm_to_kernel() helpers
and use the former from exec_mmap().  This moves the nasty scheduler
details out of exec.c and prepares for reusing this code elsewhere.
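
Roughly, the intended call pattern (the kthread side is converted in a
later patch in this series) is:

    /* exec: install a brand-new mm */
    __change_current_mm(mm, true);

    /* kthread: temporarily adopt an existing user mm, then drop it */
    __change_current_mm(mm, false);
    ...
    __change_current_mm_to_kernel();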

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 fs/exec.c                |  32 +----------
 include/linux/sched/mm.h |  20 +++++++
 kernel/sched/core.c      | 119 +++++++++++++++++++++++++++++++++++++++
 3 files changed, 141 insertions(+), 30 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 2afa7b0c75f2..9e1c2ee7c986 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -971,15 +971,13 @@ EXPORT_SYMBOL(read_code);
 static int exec_mmap(struct mm_struct *mm)
 {
 	struct task_struct *tsk;
-	struct mm_struct *old_mm, *active_mm;
+	struct mm_struct *old_mm;
 	int ret;
 
 	/* Notify parent that we're no longer interested in the old VM */
 	tsk = current;
 	old_mm = current->mm;
 	exec_mm_release(tsk, old_mm);
-	if (old_mm)
-		sync_mm_rss(old_mm);
 
 	ret = down_write_killable(&tsk->signal->exec_update_lock);
 	if (ret)
@@ -1000,41 +998,15 @@ static int exec_mmap(struct mm_struct *mm)
 		}
 	}
 
-	task_lock(tsk);
-	/*
-	 * membarrier() requires a full barrier before switching mm.
-	 */
-	smp_mb__after_spinlock();
+	__change_current_mm(mm, true);
 
-	local_irq_disable();
-	active_mm = tsk->active_mm;
-	tsk->active_mm = mm;
-	WRITE_ONCE(tsk->mm, mm);  /* membarrier reads this without locks */
-	membarrier_update_current_mm(mm);
-	/*
-	 * This prevents preemption while active_mm is being loaded and
-	 * it and mm are being updated, which could cause problems for
-	 * lazy tlb mm refcounting when these are updated by context
-	 * switches. Not all architectures can handle irqs off over
-	 * activate_mm yet.
-	 */
-	if (!IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
-		local_irq_enable();
-	activate_mm(active_mm, mm);
-	if (IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
-		local_irq_enable();
-	membarrier_finish_switch_mm(mm);
-	vmacache_flush(tsk);
-	task_unlock(tsk);
 	if (old_mm) {
 		mmap_read_unlock(old_mm);
-		BUG_ON(active_mm != old_mm);
 		setmax_mm_hiwater_rss(&tsk->signal->maxrss, old_mm);
 		mm_update_next_owner(old_mm);
 		mmput(old_mm);
 		return 0;
 	}
-	mmdrop(active_mm);
 	return 0;
 }
 
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index f1d2beac464c..7509b2b2e99d 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -83,6 +83,26 @@ extern void mmput(struct mm_struct *);
 void mmput_async(struct mm_struct *);
 #endif
 
+/*
+ * Switch the mm for current.  This does not mmget() mm, nor does it mmput()
+ * the previous mm, if any.  The caller is responsible for reference counting,
+ * although __change_current_mm() handles all details related to lazy mm
+ * refcounting.
+ *
+ * If the caller is a user task, the caller must call mm_update_next_owner().
+ */
+void __change_current_mm(struct mm_struct *mm, bool mm_is_brand_new);
+
+/*
+ * Switch the mm for current to the kernel mm.  This does not mmdrop()
+ * -- the caller is responsible for reference counting, although
+ * __change_current_mm_to_kernel() handles all details related to lazy
+ * mm refcounting.
+ *
+ * If the caller is a user task, the caller must call mm_update_next_owner().
+ */
+void __change_current_mm_to_kernel(void);
+
 /* Grab a reference to a task's mm, if it is not already going away */
 extern struct mm_struct *get_task_mm(struct task_struct *task);
 /*
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 32275b4ff141..95eb0e78f74c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -14,6 +14,7 @@
 
 #include <linux/nospec.h>
 
+#include <linux/vmacache.h>
 #include <linux/kcov.h>
 #include <linux/scs.h>
 
@@ -4934,6 +4935,124 @@ context_switch(struct rq *rq, struct task_struct *prev,
 	return finish_task_switch(prev);
 }
 
+void __change_current_mm(struct mm_struct *mm, bool mm_is_brand_new)
+{
+	struct task_struct *tsk = current;
+	struct mm_struct *old_active_mm, *mm_to_drop = NULL;
+
+	BUG_ON(!mm);	/* likely to cause corruption if we continue */
+
+	/*
+	 * We do not want to schedule, nor should procfs peek at current->mm
+	 * while we're modifying it.  task_lock() disables preemption and
+	 * locks against procfs.
+	 */
+	task_lock(tsk);
+	/*
+	 * membarrier() requires a full barrier before switching mm.
+	 */
+	smp_mb__after_spinlock();
+
+	local_irq_disable();
+
+	if (tsk->mm) {
+		/* We're detaching from an old mm.  Sync stats. */
+		sync_mm_rss(tsk->mm);
+	} else {
+		/*
+		 * Switching from kernel mm to user.  Drop the old lazy
+		 * mm reference.
+		 */
+		mm_to_drop = tsk->active_mm;
+	}
+
+	old_active_mm = tsk->active_mm;
+	tsk->active_mm = mm;
+	WRITE_ONCE(tsk->mm, mm);  /* membarrier reads this without locks */
+	membarrier_update_current_mm(mm);
+
+	if (mm_is_brand_new) {
+		/*
+		 * For historical reasons, some architectures want IRQs on
+		 * when activate_mm() is called.  If we're going to call
+		 * activate_mm(), turn on IRQs but leave preemption
+		 * disabled.
+		 */
+		if (!IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
+			local_irq_enable();
+		activate_mm(old_active_mm, mm);
+		if (IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
+			local_irq_enable();
+	} else {
+		switch_mm_irqs_off(old_active_mm, mm, tsk);
+		local_irq_enable();
+	}
+
+	/* IRQs are on now.  Preemption is still disabled by task_lock(). */
+
+	membarrier_finish_switch_mm(mm);
+	vmacache_flush(tsk);
+	task_unlock(tsk);
+
+#ifdef finish_arch_post_lock_switch
+	if (!mm_is_brand_new) {
+		/*
+		 * Some architectures want a callback after
+		 * switch_mm_irqs_off() once locks are dropped.  Callers of
+		 * activate_mm() historically did not do this, so skip it if
+		 * we did activate_mm().  On arm, this is because
+		 * activate_mm() switches mm with IRQs on, which uses a
+		 * different code path.
+		 *
+		 * Yes, this is extremely fragile and should be cleaned up.
+		 */
+		finish_arch_post_lock_switch();
+	}
+#endif
+
+	if (mm_to_drop)
+		mmdrop(mm_to_drop);
+}
+
+void __change_current_mm_to_kernel(void)
+{
+	struct task_struct *tsk = current;
+	struct mm_struct *old_mm = tsk->mm;
+
+	if (!old_mm)
+		return;	/* nothing to do */
+
+	/*
+	 * We do not want to schedule, nor should procfs peek at current->mm
+	 * while we're modifying it.  task_lock() disables preemption and
+	 * locks against procfs.
+	 */
+	task_lock(tsk);
+	/*
+	 * membarrier() requires a full barrier before switching mm.
+	 */
+	smp_mb__after_spinlock();
+
+	/* current has a real mm, so it must be active */
+	WARN_ON_ONCE(tsk->active_mm != tsk->mm);
+
+	local_irq_disable();
+
+	sync_mm_rss(old_mm);
+
+	WRITE_ONCE(tsk->mm, NULL);  /* membarrier reads this without locks */
+	membarrier_update_current_mm(NULL);
+	vmacache_flush(tsk);
+
+	/* active_mm is still 'old_mm' */
+	mmgrab(old_mm);
+	enter_lazy_tlb(old_mm, tsk);
+
+	local_irq_enable();
+
+	task_unlock(tsk);
+}
+
 /*
  * nr_running and nr_context_switches:
  *
-- 
2.33.1


^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH 15/23] kthread: Switch to __change_current_mm()
  2022-01-08 16:43 [PATCH 00/23] mm, sched: Rework lazy mm handling Andy Lutomirski
                   ` (13 preceding siblings ...)
  2022-01-08 16:43 ` [PATCH 14/23] sched, exec: Factor current mm changes out from exec Andy Lutomirski
@ 2022-01-08 16:44 ` Andy Lutomirski
  2022-01-08 16:44 ` [PATCH 16/23] sched: Use lightweight hazard pointers to grab lazy mms Andy Lutomirski
                   ` (7 subsequent siblings)
  22 siblings, 0 replies; 79+ messages in thread
From: Andy Lutomirski @ 2022-01-08 16:44 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Nicholas Piggin, Anton Blanchard, Benjamin Herrenschmidt,
	Paul Mackerras, Randy Dunlap, linux-arch, x86, Rik van Riel,
	Dave Hansen, Peter Zijlstra, Nadav Amit, Mathieu Desnoyers,
	Andy Lutomirski

Remove the open-coded mm switching in kthread_use_mm() and
kthread_unuse_mm().

This has one internally-visible effect: the old code's active_mm
refcounting was inconsistent with everything else and mmgrabbed the
mm in kthread_use_mm().
rules as normal user threads, so kthreads that are currently using
a user mm will not hold an mm_count reference.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 kernel/kthread.c | 45 ++-------------------------------------------
 1 file changed, 2 insertions(+), 43 deletions(-)

diff --git a/kernel/kthread.c b/kernel/kthread.c
index 18b0a2e0e3b2..77586f5b14e5 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -1344,37 +1344,12 @@ EXPORT_SYMBOL(kthread_destroy_worker);
  */
 void kthread_use_mm(struct mm_struct *mm)
 {
-	struct mm_struct *active_mm;
 	struct task_struct *tsk = current;
 
 	WARN_ON_ONCE(!(tsk->flags & PF_KTHREAD));
 	WARN_ON_ONCE(tsk->mm);
 
-	task_lock(tsk);
-	/*
-	 * membarrier() requires a full barrier before switching mm.
-	 */
-	smp_mb__after_spinlock();
-
-	/* Hold off tlb flush IPIs while switching mm's */
-	local_irq_disable();
-	active_mm = tsk->active_mm;
-	if (active_mm != mm) {
-		mmgrab(mm);
-		tsk->active_mm = mm;
-	}
-	WRITE_ONCE(tsk->mm, mm);  /* membarrier reads this without locks */
-	membarrier_update_current_mm(mm);
-	switch_mm_irqs_off(active_mm, mm, tsk);
-	membarrier_finish_switch_mm(mm);
-	local_irq_enable();
-	task_unlock(tsk);
-#ifdef finish_arch_post_lock_switch
-	finish_arch_post_lock_switch();
-#endif
-
-	if (active_mm != mm)
-		mmdrop(active_mm);
+	__change_current_mm(mm, false);
 
 	to_kthread(tsk)->oldfs = force_uaccess_begin();
 }
@@ -1393,23 +1368,7 @@ void kthread_unuse_mm(struct mm_struct *mm)
 
 	force_uaccess_end(to_kthread(tsk)->oldfs);
 
-	task_lock(tsk);
-	/*
-	 * When a kthread stops operating on an address space, the loop
-	 * in membarrier_{private,global}_expedited() may not observe
-	 * that tsk->mm, and not issue an IPI. Membarrier requires a
-	 * memory barrier after accessing user-space memory, before
-	 * clearing tsk->mm.
-	 */
-	smp_mb__after_spinlock();
-	sync_mm_rss(mm);
-	local_irq_disable();
-	WRITE_ONCE(tsk->mm, NULL);  /* membarrier reads this without locks */
-	membarrier_update_current_mm(NULL);
-	/* active_mm is still 'mm' */
-	enter_lazy_tlb(mm, tsk);
-	local_irq_enable();
-	task_unlock(tsk);
+	__change_current_mm_to_kernel();
 }
 EXPORT_SYMBOL_GPL(kthread_unuse_mm);
 
-- 
2.33.1


^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH 16/23] sched: Use lightweight hazard pointers to grab lazy mms
  2022-01-08 16:43 [PATCH 00/23] mm, sched: Rework lazy mm handling Andy Lutomirski
                   ` (14 preceding siblings ...)
  2022-01-08 16:44 ` [PATCH 15/23] kthread: Switch to __change_current_mm() Andy Lutomirski
@ 2022-01-08 16:44 ` Andy Lutomirski
  2022-01-08 19:22   ` Linus Torvalds
  2022-01-09  5:56   ` Nadav Amit
  2022-01-08 16:44 ` [PATCH 17/23] x86/mm: Make use/unuse_temporary_mm() non-static Andy Lutomirski
                   ` (6 subsequent siblings)
  22 siblings, 2 replies; 79+ messages in thread
From: Andy Lutomirski @ 2022-01-08 16:44 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Nicholas Piggin, Anton Blanchard, Benjamin Herrenschmidt,
	Paul Mackerras, Randy Dunlap, linux-arch, x86, Rik van Riel,
	Dave Hansen, Peter Zijlstra, Nadav Amit, Mathieu Desnoyers,
	Andy Lutomirski, Linus Torvalds

Currently, switching between a real user mm and a kernel context (including
idle) performs an atomic operation on a per-mm counter via mmgrab() and
mmdrop().  For a single-threaded program, this isn't a big problem: a pair
of atomic operations when entering and returning from idle isn't free, but
it's not very expensive in the grand scheme of things.  For a heavily
multithreaded program on a large system, however, the overhead can be very
large -- all CPUs can end up hammering the same cacheline with atomic
operations, and scalability suffers.

The purpose of mmgrab() and mmdrop() is to make "lazy tlb" mode safe.  When
Linux switches from user to kernel mm context, instead of immediately
reprogramming the MMU to use init_mm, the kernel continues to use the most
recent set of user page tables.  This is safe as long as those page tables
aren't freed.

RCU can't be used to keep the pagetables alive, since RCU read locks can't
be held when idle.

To improve scalability, this patch adds a percpu hazard pointer scheme to
keep lazily-used mms alive.  Each CPU has a single pointer to an mm that
must not be freed, and __mmput() checks the pointers belonging to all CPUs
that might be lazily using the mm in question.
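
Boiled down, the scheme added below looks roughly like this (heavily
simplified -- the real code also has to hand references off via
rq->drop_mm and deal with the associated races):

    /* CPU going lazy, while still holding an mm_users reference: */
    WRITE_ONCE(this_rq()->lazy_mm, mm);

    /* __mmput() -> mm_unlazy_mm_count(), once mm_users has hit zero: */
    for_each_possible_lazymm_cpu(cpu, mm) {
        if (smp_load_acquire(&cpu_rq(cpu)->lazy_mm) == mm)
            /* transfer a reference that the remote rq drops later */;
    }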

By default, this means walking all online CPUs, but arch code can override
the set of CPUs to check if they can do something more clever.  For
architectures that have accurate mm_cpumask(), mm_cpumask() is a reasonable
choice.  For architectures that can guarantee that *no* remote CPUs are
lazily using an mm after the user portion of the pagetables is torn down
(any architecture that uses IPI shootdowns in exit_mmap() and unlazies the
MMU in the IPI handler, e.g. x86 on bare metal), the set of CPUs to check
could be empty.
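
As a sketch (not part of this patch), such an architecture could
plausibly provide something like:

    #define for_each_possible_lazymm_cpu(cpu, mm) \
        for_each_cpu((cpu), mm_cpumask(mm))

and an architecture in the IPI-shootdown camp could shrink the
iteration even further.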

XXX: I *think* this is correct when hot-unplugging a CPU, but this needs
double-checking and maybe even a WARN to make sure the ordering is correct.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Anton Blanchard <anton@ozlabs.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Linux-MM <linux-mm@kvack.org>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 include/linux/sched/mm.h |   3 +
 kernel/fork.c            |  11 ++
 kernel/sched/core.c      | 230 +++++++++++++++++++++++++++++++++------
 kernel/sched/sched.h     |  10 +-
 4 files changed, 221 insertions(+), 33 deletions(-)

diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 7509b2b2e99d..3ceba11c049c 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -76,6 +76,9 @@ static inline bool mmget_not_zero(struct mm_struct *mm)
 
 /* mmput gets rid of the mappings and all user-space */
 extern void mmput(struct mm_struct *);
+
+extern void mm_unlazy_mm_count(struct mm_struct *mm);
+
 #ifdef CONFIG_MMU
 /* same as above but performs the slow path from the async context. Can
  * be called from the atomic context as well
diff --git a/kernel/fork.c b/kernel/fork.c
index 38681ad44c76..2df72cf3c0d2 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1122,6 +1122,17 @@ static inline void __mmput(struct mm_struct *mm)
 	}
 	if (mm->binfmt)
 		module_put(mm->binfmt->module);
+
+	/*
+	 * We hold one mm_count reference.  Convert all remaining lazy_mm
+	 * references to mm_count references so that the mm will be genuinely
+	 * unused when mm_count goes to zero.  Do this after exit_mmap() so
+	 * that, if the architecture shoots down remote TLB entries via IPI in
+	 * exit_mmap() and calls unlazy_mm_irqs_off() when doing so, most or
+	 * all lazy_mm references can be removed without
+	 * mm_unlazy_mm_count()'s help.
+	 */
+	mm_unlazy_mm_count(mm);
 	mmdrop(mm);
 }
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 95eb0e78f74c..64e4058b3c61 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -20,6 +20,7 @@
 
 #include <asm/switch_to.h>
 #include <asm/tlb.h>
+#include <asm/mmu.h>
 
 #include "../workqueue_internal.h"
 #include "../../fs/io-wq.h"
@@ -4750,6 +4751,144 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
 	prepare_arch_switch(next);
 }
 
+/*
+ * Called after each context switch.
+ *
+ * Strictly speaking, no action at all is required here.  This rq
+ * can hold an extra reference to at most one mm, so the memory
+ * wasted by deferring the mmdrop() forever is bounded.  That being
+ * said, it's straightforward to safely drop spare references
+ * in the common case.
+ */
+static void mmdrop_lazy(struct rq *rq)
+{
+	struct mm_struct *old_mm;
+
+	old_mm = READ_ONCE(rq->drop_mm);
+
+	do {
+		/*
+		 * If there is nothing to drop or if we are still using old_mm,
+		 * then don't call mmdrop().
+		 */
+		if (likely(!old_mm || old_mm == rq->lazy_mm))
+			return;
+	} while (!try_cmpxchg_relaxed(&rq->drop_mm, &old_mm, NULL));
+
+	mmdrop(old_mm);
+}
+
+#ifndef for_each_possible_lazymm_cpu
+#define for_each_possible_lazymm_cpu(cpu, mm) for_each_online_cpu((cpu))
+#endif
+
+static bool __try_mm_drop_rq_ref(struct rq *rq, struct mm_struct *mm)
+{
+	struct mm_struct *old_drop_mm = smp_load_acquire(&rq->drop_mm);
+
+	/*
+	 * We know that old_drop_mm != mm: this is the only function that
+	 * might set drop_mm to mm, and we haven't set it yet.
+	 */
+	WARN_ON_ONCE(old_drop_mm == mm);
+
+	if (!old_drop_mm) {
+		/*
+		 * Just set rq->drop_mm to mm and our reference will
+		 * get dropped eventually after rq is done with it.
+		 */
+		return try_cmpxchg(&rq->drop_mm, &old_drop_mm, mm);
+	}
+
+	/*
+	 * The target cpu could still be using old_drop_mm.  We know that, if
+	 * old_drop_mm still exists, then old_drop_mm->mm_users == 0.  Can we
+	 * drop it?
+	 *
+	 * NB: it is critical that we load rq->lazy_mm again after loading
+	 * drop_mm.  If we looked at a prior value of lazy_mm (which we
+	 * already know to be mm), then we would be subject to a race:
+	 *
+	 * Us:
+	 *     Load rq->lazy_mm.
+	 * Remote CPU:
+	 *     Switch to old_drop_mm (with mm_users > 0)
+	 *     Become lazy and set rq->lazy_mm = old_drop_mm
+	 * Third CPU:
+	 *     Set old_drop_mm->mm_users to 0.
+	 *     Set rq->drop_mm = old_drop_mm
+	 * Us:
+	 *     Incorrectly believe that old_drop_mm is unused
+	 *     because rq->lazy_mm != old_drop_mm
+	 *
+	 * In other words, to verify that rq->lazy_mm is not keeping a given
+	 * mm alive, we must load rq->lazy_mm _after_ we know that mm_users ==
+	 * 0 and therefore that rq will not switch to that mm.
+	 */
+	if (smp_load_acquire(&rq->lazy_mm) != mm) {
+		/*
+		 * We got lucky!  rq _was_ using mm, but it stopped.
+		 * Just drop our reference.
+		 */
+		mmdrop(mm);
+		return true;
+	}
+
+	/*
+	 * If we got here, rq->lazy_mm != old_drop_mm, and we ruled
+	 * out the race described above.  rq is done with old_drop_mm,
+	 * so we can steal the reference held by rq and replace it with
+	 * our reference to mm.
+	 */
+	if (cmpxchg(&rq->drop_mm, old_drop_mm, mm) != old_drop_mm)
+		return false;
+
+	mmdrop(old_drop_mm);
+	return true;
+}
+
+/*
+ * This converts all lazy_mm references to mm to mm_count refcounts.  Our
+ * caller holds an mm_count reference, so we don't need to worry about mm
+ * being freed out from under us.
+ */
+void mm_unlazy_mm_count(struct mm_struct *mm)
+{
+	unsigned int drop_count = 0;
+	int cpu;
+
+	/*
+	 * mm_users is zero, so no cpu will set its rq->lazy_mm to mm.
+	 */
+	WARN_ON_ONCE(atomic_read(&mm->mm_users) != 0);
+
+	for_each_possible_lazymm_cpu(cpu, mm) {
+		struct rq *rq = cpu_rq(cpu);
+
+		if (smp_load_acquire(&rq->lazy_mm) != mm)
+			continue;
+
+		/*
+		 * Grab one reference.  Do it as a batch so we do a maximum
+		 * of two atomic operations instead of one per lazy reference.
+		 */
+		if (!drop_count) {
+			/*
+			 * Collect lots of references.  We'll drop the ones we
+			 * don't use.
+			 */
+			drop_count = num_possible_cpus();
+			atomic_add(drop_count, &mm->mm_count);
+		}
+		drop_count--;
+
+		while (!__try_mm_drop_rq_ref(rq, mm))
+			;
+	}
+
+	atomic_sub(drop_count, &mm->mm_count);
+}
+
 /**
  * finish_task_switch - clean up after a task-switch
  * @prev: the thread we just switched away from.
@@ -4773,7 +4912,6 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 	__releases(rq->lock)
 {
 	struct rq *rq = this_rq();
-	struct mm_struct *mm = rq->prev_mm;
 	long prev_state;
 
 	/*
@@ -4792,8 +4930,6 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 		      current->comm, current->pid, preempt_count()))
 		preempt_count_set(FORK_PREEMPT_COUNT);
 
-	rq->prev_mm = NULL;
-
 	/*
 	 * A task struct has one reference for the use as "current".
 	 * If a task dies, then it sets TASK_DEAD in tsk->state and calls
@@ -4824,12 +4960,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 
 	fire_sched_in_preempt_notifiers(current);
 
-	/*
-	 * If an architecture needs to take a specific action for
-	 * SYNC_CORE, it can do so in switch_mm_irqs_off().
-	 */
-	if (mm)
-		mmdrop(mm);
+	mmdrop_lazy(rq);
 
 	if (unlikely(prev_state == TASK_DEAD)) {
 		if (prev->sched_class->task_dead)
@@ -4891,36 +5022,55 @@ context_switch(struct rq *rq, struct task_struct *prev,
 	 */
 	arch_start_context_switch(prev);
 
+	/*
+	 * Sanity check: if something went wrong and the previous mm was
+	 * freed while we were still using it, KASAN might not notice
+	 * without help.
+	 */
+	kasan_check_byte(prev->active_mm);
+
 	/*
 	 * kernel -> kernel   lazy + transfer active
-	 *   user -> kernel   lazy + mmgrab() active
+	 *   user -> kernel   lazy + lazy_mm grab active
 	 *
-	 * kernel ->   user   switch + mmdrop() active
+	 * kernel ->   user   switch + lazy_mm release active
 	 *   user ->   user   switch
 	 */
 	if (!next->mm) {                                // to kernel
 		enter_lazy_tlb(prev->active_mm, next);
 
 		next->active_mm = prev->active_mm;
-		if (prev->mm)                           // from user
-			mmgrab(prev->active_mm);
-		else
+		if (prev->mm) {                         // from user
+			SCHED_WARN_ON(rq->lazy_mm);
+
+			/*
+			 * Acquire a lazy_mm reference to the active
+			 * (lazy) mm.  No explicit barrier needed: we still
+			 * hold an explicit (mm_users) reference.  __mmput()
+			 * can't be called until we call mmput() to drop
+			 * our reference, and __mmput() is a release barrier.
+			 */
+			WRITE_ONCE(rq->lazy_mm, next->active_mm);
+		} else {
 			prev->active_mm = NULL;
+		}
 	} else {                                        // to user
 		membarrier_switch_mm(rq, prev->active_mm, next->mm);
 		switch_mm_irqs_off(prev->active_mm, next->mm, next);
 
 		/*
-		 * sys_membarrier() requires an smp_mb() between setting
-		 * rq->curr->mm to a membarrier-enabled mm and returning
-		 * to userspace.
+		 * An arch implementation of for_each_possible_lazymm_cpu()
+		 * may skip this CPU now that we have switched away from
+		 * prev->active_mm, so we must not reference it again.
 		 */
+
 		membarrier_finish_switch_mm(next->mm);
 
 		if (!prev->mm) {                        // from kernel
-			/* will mmdrop() in finish_task_switch(). */
-			rq->prev_mm = prev->active_mm;
 			prev->active_mm = NULL;
+
+			/* Drop our lazy_mm reference to the old lazy mm. */
+			smp_store_release(&rq->lazy_mm, NULL);
 		}
 	}
 
@@ -4938,7 +5088,8 @@ context_switch(struct rq *rq, struct task_struct *prev,
 void __change_current_mm(struct mm_struct *mm, bool mm_is_brand_new)
 {
 	struct task_struct *tsk = current;
-	struct mm_struct *old_active_mm, *mm_to_drop = NULL;
+	struct mm_struct *old_active_mm;
+	bool was_kernel;
 
 	BUG_ON(!mm);	/* likely to cause corruption if we continue */
 
@@ -4958,12 +5109,9 @@ void __change_current_mm(struct mm_struct *mm, bool mm_is_brand_new)
 	if (tsk->mm) {
 		/* We're detaching from an old mm.  Sync stats. */
 		sync_mm_rss(tsk->mm);
+		was_kernel = false;
 	} else {
-		/*
-		 * Switching from kernel mm to user.  Drop the old lazy
-		 * mm reference.
-		 */
-		mm_to_drop = tsk->active_mm;
+		was_kernel = true;
 	}
 
 	old_active_mm = tsk->active_mm;
@@ -4992,6 +5140,10 @@ void __change_current_mm(struct mm_struct *mm, bool mm_is_brand_new)
 
 	membarrier_finish_switch_mm(mm);
 	vmacache_flush(tsk);
+
+	if (was_kernel)
+		smp_store_release(&this_rq()->lazy_mm, NULL);
+
 	task_unlock(tsk);
 
 #ifdef finish_arch_post_lock_switch
@@ -5009,9 +5161,6 @@ void __change_current_mm(struct mm_struct *mm, bool mm_is_brand_new)
 		finish_arch_post_lock_switch();
 	}
 #endif
-
-	if (mm_to_drop)
-		mmdrop(mm_to_drop);
 }
 
 void __change_current_mm_to_kernel(void)
@@ -5044,8 +5193,17 @@ void __change_current_mm_to_kernel(void)
 	membarrier_update_current_mm(NULL);
 	vmacache_flush(tsk);
 
-	/* active_mm is still 'old_mm' */
-	mmgrab(old_mm);
+	/*
+	 * active_mm is still 'old_mm'
+	 *
+	 * Acquire a lazy_mm reference to the active (lazy) mm.  As in
+	 * context_switch(), no explicit barrier needed: we still hold an
+	 * explicit (mm_users) reference.  __mmput() can't be called until we
+	 * call mmput() to drop our reference, and __mmput() is a release
+	 * barrier.
+	 */
+	WRITE_ONCE(this_rq()->lazy_mm, old_mm);
+
 	enter_lazy_tlb(old_mm, tsk);
 
 	local_irq_enable();
@@ -8805,6 +8963,7 @@ void __init init_idle(struct task_struct *idle, int cpu)
 void unlazy_mm_irqs_off(void)
 {
 	struct mm_struct *mm = current->active_mm;
+	struct rq *rq = this_rq();
 
 	lockdep_assert_irqs_disabled();
 
@@ -8815,10 +8974,17 @@ void unlazy_mm_irqs_off(void)
 		return;
 
 	switch_mm_irqs_off(mm, &init_mm, current);
-	mmgrab(&init_mm);
 	current->active_mm = &init_mm;
+
+	/*
+	 * We don't need a lazy reference to init_mm -- it's not about
+	 * to go away.
+	 */
+	smp_store_release(&rq->lazy_mm, NULL);
+
 	finish_arch_post_lock_switch();
-	mmdrop(mm);
+
+	mmdrop_lazy(rq);
 }
 
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b496a9ee9aec..1010e63962d9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -977,7 +977,15 @@ struct rq {
 	struct task_struct	*idle;
 	struct task_struct	*stop;
 	unsigned long		next_balance;
-	struct mm_struct	*prev_mm;
+
+	/*
+	 * Fast refcounting scheme for lazy mm.  lazy_mm is a hazard pointer:
+	 * setting it to point to a lazily used mm keeps that mm from being
+	 * freed.  drop_mm points to an mm that needs an mmdrop() call
+	 * after the CPU owning the rq is done with it.
+	 */
+	struct mm_struct	*lazy_mm;
+	struct mm_struct	*drop_mm;
 
 	unsigned int		clock_update_flags;
 	u64			clock;
-- 
2.33.1


^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH 17/23] x86/mm: Make use/unuse_temporary_mm() non-static
  2022-01-08 16:43 [PATCH 00/23] mm, sched: Rework lazy mm handling Andy Lutomirski
                   ` (15 preceding siblings ...)
  2022-01-08 16:44 ` [PATCH 16/23] sched: Use lightweight hazard pointers to grab lazy mms Andy Lutomirski
@ 2022-01-08 16:44 ` Andy Lutomirski
  2022-01-08 16:44 ` [PATCH 18/23] x86/mm: Allow temporary mms when IRQs are on Andy Lutomirski
                   ` (5 subsequent siblings)
  22 siblings, 0 replies; 79+ messages in thread
From: Andy Lutomirski @ 2022-01-08 16:44 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Nicholas Piggin, Anton Blanchard, Benjamin Herrenschmidt,
	Paul Mackerras, Randy Dunlap, linux-arch, x86, Rik van Riel,
	Dave Hansen, Peter Zijlstra, Nadav Amit, Mathieu Desnoyers,
	Andy Lutomirski

This prepares them for use outside of the alternative machinery.
The code is unchanged.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/mmu_context.h |  7 ++++
 arch/x86/kernel/alternative.c      | 65 +-----------------------------
 arch/x86/mm/tlb.c                  | 60 +++++++++++++++++++++++++++
 3 files changed, 68 insertions(+), 64 deletions(-)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 27516046117a..2ca4fc4a8a0a 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -220,4 +220,11 @@ unsigned long __get_current_cr3_fast(void);
 
 #include <asm-generic/mmu_context.h>
 
+typedef struct {
+	struct mm_struct *mm;
+} temp_mm_state_t;
+
+extern temp_mm_state_t use_temporary_mm(struct mm_struct *mm);
+extern void unuse_temporary_mm(temp_mm_state_t prev_state);
+
 #endif /* _ASM_X86_MMU_CONTEXT_H */
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index b47cd22b2eb1..af4c37e177ff 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -29,6 +29,7 @@
 #include <asm/io.h>
 #include <asm/fixmap.h>
 #include <asm/paravirt.h>
+#include <asm/mmu_context.h>
 
 int __read_mostly alternatives_patched;
 
@@ -706,70 +707,6 @@ void __init_or_module text_poke_early(void *addr, const void *opcode,
 	}
 }
 
-typedef struct {
-	struct mm_struct *mm;
-} temp_mm_state_t;
-
-/*
- * Using a temporary mm allows to set temporary mappings that are not accessible
- * by other CPUs. Such mappings are needed to perform sensitive memory writes
- * that override the kernel memory protections (e.g., W^X), without exposing the
- * temporary page-table mappings that are required for these write operations to
- * other CPUs. Using a temporary mm also allows to avoid TLB shootdowns when the
- * mapping is torn down.
- *
- * Context: The temporary mm needs to be used exclusively by a single core. To
- *          harden security IRQs must be disabled while the temporary mm is
- *          loaded, thereby preventing interrupt handler bugs from overriding
- *          the kernel memory protection.
- */
-static inline temp_mm_state_t use_temporary_mm(struct mm_struct *mm)
-{
-	temp_mm_state_t temp_state;
-
-	lockdep_assert_irqs_disabled();
-
-	/*
-	 * Make sure not to be in TLB lazy mode, as otherwise we'll end up
-	 * with a stale address space WITHOUT being in lazy mode after
-	 * restoring the previous mm.
-	 */
-	if (this_cpu_read(cpu_tlbstate_shared.is_lazy))
-		leave_mm(smp_processor_id());
-
-	temp_state.mm = this_cpu_read(cpu_tlbstate.loaded_mm);
-	switch_mm_irqs_off(NULL, mm, current);
-
-	/*
-	 * If breakpoints are enabled, disable them while the temporary mm is
-	 * used. Userspace might set up watchpoints on addresses that are used
-	 * in the temporary mm, which would lead to wrong signals being sent or
-	 * crashes.
-	 *
-	 * Note that breakpoints are not disabled selectively, which also causes
-	 * kernel breakpoints (e.g., perf's) to be disabled. This might be
-	 * undesirable, but still seems reasonable as the code that runs in the
-	 * temporary mm should be short.
-	 */
-	if (hw_breakpoint_active())
-		hw_breakpoint_disable();
-
-	return temp_state;
-}
-
-static inline void unuse_temporary_mm(temp_mm_state_t prev_state)
-{
-	lockdep_assert_irqs_disabled();
-	switch_mm_irqs_off(NULL, prev_state.mm, current);
-
-	/*
-	 * Restore the breakpoints if they were disabled before the temporary mm
-	 * was loaded.
-	 */
-	if (hw_breakpoint_active())
-		hw_breakpoint_restore();
-}
-
 __ro_after_init struct mm_struct *poking_mm;
 __ro_after_init unsigned long poking_addr;
 
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 74b7a615bc15..4e371f30e2ab 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -702,6 +702,66 @@ void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
 	this_cpu_write(cpu_tlbstate_shared.is_lazy, true);
 }
 
+/*
+ * Using a temporary mm allows to set temporary mappings that are not accessible
+ * by other CPUs. Such mappings are needed to perform sensitive memory writes
+ * that override the kernel memory protections (e.g., W^X), without exposing the
+ * temporary page-table mappings that are required for these write operations to
+ * other CPUs. Using a temporary mm also allows to avoid TLB shootdowns when the
+ * mapping is torn down.
+ *
+ * Context: The temporary mm needs to be used exclusively by a single core. To
+ *          harden security IRQs must be disabled while the temporary mm is
+ *          loaded, thereby preventing interrupt handler bugs from overriding
+ *          the kernel memory protection.
+ */
+temp_mm_state_t use_temporary_mm(struct mm_struct *mm)
+{
+	temp_mm_state_t temp_state;
+
+	lockdep_assert_irqs_disabled();
+
+	/*
+	 * Make sure not to be in TLB lazy mode, as otherwise we'll end up
+	 * with a stale address space WITHOUT being in lazy mode after
+	 * restoring the previous mm.
+	 */
+	if (this_cpu_read(cpu_tlbstate_shared.is_lazy))
+		leave_mm(smp_processor_id());
+
+	temp_state.mm = this_cpu_read(cpu_tlbstate.loaded_mm);
+	switch_mm_irqs_off(NULL, mm, current);
+
+	/*
+	 * If breakpoints are enabled, disable them while the temporary mm is
+	 * used. Userspace might set up watchpoints on addresses that are used
+	 * in the temporary mm, which would lead to wrong signals being sent or
+	 * crashes.
+	 *
+	 * Note that breakpoints are not disabled selectively, which also causes
+	 * kernel breakpoints (e.g., perf's) to be disabled. This might be
+	 * undesirable, but still seems reasonable as the code that runs in the
+	 * temporary mm should be short.
+	 */
+	if (hw_breakpoint_active())
+		hw_breakpoint_disable();
+
+	return temp_state;
+}
+
+void unuse_temporary_mm(temp_mm_state_t prev_state)
+{
+	lockdep_assert_irqs_disabled();
+	switch_mm_irqs_off(NULL, prev_state.mm, current);
+
+	/*
+	 * Restore the breakpoints if they were disabled before the temporary mm
+	 * was loaded.
+	 */
+	if (hw_breakpoint_active())
+		hw_breakpoint_restore();
+}
+
 /*
  * Call this when reinitializing a CPU.  It fixes the following potential
  * problems:
-- 
2.33.1


^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH 18/23] x86/mm: Allow temporary mms when IRQs are on
  2022-01-08 16:43 [PATCH 00/23] mm, sched: Rework lazy mm handling Andy Lutomirski
                   ` (16 preceding siblings ...)
  2022-01-08 16:44 ` [PATCH 17/23] x86/mm: Make use/unuse_temporary_mm() non-static Andy Lutomirski
@ 2022-01-08 16:44 ` Andy Lutomirski
  2022-01-08 16:44 ` [PATCH 19/23] x86/efi: Make efi_enter/leave_mm use the temporary_mm machinery Andy Lutomirski
                   ` (4 subsequent siblings)
  22 siblings, 0 replies; 79+ messages in thread
From: Andy Lutomirski @ 2022-01-08 16:44 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Nicholas Piggin, Anton Blanchard, Benjamin Herrenschmidt,
	Paul Mackerras, Randy Dunlap, linux-arch, x86, Rik van Riel,
	Dave Hansen, Peter Zijlstra, Nadav Amit, Mathieu Desnoyers,
	Andy Lutomirski

EFI runtime services should use temporary mms, but they want IRQs on.
Preemption must still be disabled in a temporary mm context.

At some point, the entire temporary mm mechanism should be moved out of
arch code.
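
The resulting usage contract, roughly sketched (&some_mm is just a
placeholder):

    temp_mm_state_t state;

    preempt_disable();
    state = use_temporary_mm(&some_mm);
    /* Touch mappings in some_mm.  IRQs may stay on, but do not sleep. */
    unuse_temporary_mm(state);
    preempt_enable();

For sensitive writes (e.g. text poking), IRQs should still be disabled
across the temporary mm, as noted in the updated comment below.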

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/mm/tlb.c | 19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 4e371f30e2ab..36ce9dffb963 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -708,18 +708,23 @@ void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
  * that override the kernel memory protections (e.g., W^X), without exposing the
  * temporary page-table mappings that are required for these write operations to
  * other CPUs. Using a temporary mm also allows to avoid TLB shootdowns when the
- * mapping is torn down.
+ * mapping is torn down.  Temporary mms can also be used for EFI runtime service
+ * calls or similar functionality.
  *
- * Context: The temporary mm needs to be used exclusively by a single core. To
- *          harden security IRQs must be disabled while the temporary mm is
- *          loaded, thereby preventing interrupt handler bugs from overriding
- *          the kernel memory protection.
+ * It is illegal to schedule while using a temporary mm -- the context switch
+ * code is unaware of the temporary mm and does not know how to context switch.
+ * Use a real (non-temporary) mm in a kernel thread if you need to sleep.
+ *
+ * Note: For sensitive memory writes, the temporary mm needs to be used
+ *       exclusively by a single core, and IRQs should be disabled while the
+ *       temporary mm is loaded, thereby preventing interrupt handler bugs from
+ *       overriding the kernel memory protection.
  */
 temp_mm_state_t use_temporary_mm(struct mm_struct *mm)
 {
 	temp_mm_state_t temp_state;
 
-	lockdep_assert_irqs_disabled();
+	lockdep_assert_preemption_disabled();
 
 	/*
 	 * Make sure not to be in TLB lazy mode, as otherwise we'll end up
@@ -751,7 +756,7 @@ temp_mm_state_t use_temporary_mm(struct mm_struct *mm)
 
 void unuse_temporary_mm(temp_mm_state_t prev_state)
 {
-	lockdep_assert_irqs_disabled();
+	lockdep_assert_preemption_disabled();
 	switch_mm_irqs_off(NULL, prev_state.mm, current);
 
 	/*
-- 
2.33.1


^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH 19/23] x86/efi: Make efi_enter/leave_mm use the temporary_mm machinery
  2022-01-08 16:43 [PATCH 00/23] mm, sched: Rework lazy mm handling Andy Lutomirski
                   ` (17 preceding siblings ...)
  2022-01-08 16:44 ` [PATCH 18/23] x86/mm: Allow temporary mms when IRQs are on Andy Lutomirski
@ 2022-01-08 16:44 ` Andy Lutomirski
  2022-01-10 13:13   ` Ard Biesheuvel
  2022-01-08 16:44 ` [PATCH 20/23] x86/mm: Remove leave_mm() in favor of unlazy_mm_irqs_off() Andy Lutomirski
                   ` (3 subsequent siblings)
  22 siblings, 1 reply; 79+ messages in thread
From: Andy Lutomirski @ 2022-01-08 16:44 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Nicholas Piggin, Anton Blanchard, Benjamin Herrenschmidt,
	Paul Mackerras, Randy Dunlap, linux-arch, x86, Rik van Riel,
	Dave Hansen, Peter Zijlstra, Nadav Amit, Mathieu Desnoyers,
	Andy Lutomirski, Ard Biesheuvel

This should be considerably more robust.  It's also necessary for the optimized
for_each_possible_lazymm_cpu() on x86 -- without this patch, EFI calls in
lazy context would remove the lazy mm from mm_cpumask().

Cc: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/platform/efi/efi_64.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
index 7515e78ef898..b9a571904363 100644
--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -54,7 +54,7 @@
  * 0xffff_ffff_0000_0000 and limit EFI VA mapping space to 64G.
  */
 static u64 efi_va = EFI_VA_START;
-static struct mm_struct *efi_prev_mm;
+static temp_mm_state_t efi_temp_mm_state;
 
 /*
  * We need our own copy of the higher levels of the page tables
@@ -461,15 +461,12 @@ void __init efi_dump_pagetable(void)
  */
 void efi_enter_mm(void)
 {
-	efi_prev_mm = current->active_mm;
-	current->active_mm = &efi_mm;
-	switch_mm(efi_prev_mm, &efi_mm, NULL);
+	efi_temp_mm_state = use_temporary_mm(&efi_mm);
 }
 
 void efi_leave_mm(void)
 {
-	current->active_mm = efi_prev_mm;
-	switch_mm(&efi_mm, efi_prev_mm, NULL);
+	unuse_temporary_mm(efi_temp_mm_state);
 }
 
 static DEFINE_SPINLOCK(efi_runtime_lock);
-- 
2.33.1


^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH 20/23] x86/mm: Remove leave_mm() in favor of unlazy_mm_irqs_off()
  2022-01-08 16:43 [PATCH 00/23] mm, sched: Rework lazy mm handling Andy Lutomirski
                   ` (18 preceding siblings ...)
  2022-01-08 16:44 ` [PATCH 19/23] x86/efi: Make efi_enter/leave_mm use the temporary_mm machinery Andy Lutomirski
@ 2022-01-08 16:44 ` Andy Lutomirski
  2022-01-08 16:44 ` [PATCH 21/23] x86/mm: Use unlazy_mm_irqs_off() in TLB flush IPIs Andy Lutomirski
                   ` (2 subsequent siblings)
  22 siblings, 0 replies; 79+ messages in thread
From: Andy Lutomirski @ 2022-01-08 16:44 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Nicholas Piggin, Anton Blanchard, Benjamin Herrenschmidt,
	Paul Mackerras, Randy Dunlap, linux-arch, x86, Rik van Riel,
	Dave Hansen, Peter Zijlstra, Nadav Amit, Mathieu Desnoyers,
	Andy Lutomirski

x86's mm_cpumask() precisely tracks every CPU using an mm, with one
major caveat: x86 internally switches back to init_mm more
aggressively than the core code.  This means that it's possible for
x86 to point CR3 to init_mm and drop current->active_mm from
mm_cpumask().  The core scheduler doesn't know when this happens,
which is currently fine.

But if we want to use mm_cpumask() to optimize
for_each_possible_lazymm_cpu(), we need to keep mm_cpumask() in
sync with the core scheduler.

This patch removes x86's bespoke leave_mm() and uses the core scheduler's
unlazy_mm_irqs_off() so that a lazy mm can be dropped and ->active_mm
cleaned up together.  This allows for_each_possible_lazymm_cpu() to be
wired up on x86.

As a side effect, non-x86 architectures that use ACPI C3 will now leave
lazy mm mode before entering C3.  This can only possibly affect ia64,
because only x86 and ia64 enable CONFIG_ACPI_PROCESSOR_CSTATE.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/mmu.h  |  2 --
 arch/x86/mm/tlb.c           | 29 +++--------------------------
 arch/x86/xen/mmu_pv.c       |  2 +-
 drivers/cpuidle/cpuidle.c   |  2 +-
 drivers/idle/intel_idle.c   |  4 ++--
 include/linux/mmu_context.h |  4 +---
 kernel/sched/sched.h        |  2 --
 7 files changed, 8 insertions(+), 37 deletions(-)

diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
index 5d7494631ea9..03ba71420ff3 100644
--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -63,7 +63,5 @@ typedef struct {
 		.lock = __MUTEX_INITIALIZER(mm.context.lock),		\
 	}
 
-void leave_mm(int cpu);
-#define leave_mm leave_mm
 
 #endif /* _ASM_X86_MMU_H */
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 36ce9dffb963..e502565176b9 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -8,6 +8,7 @@
 #include <linux/export.h>
 #include <linux/cpu.h>
 #include <linux/debugfs.h>
+#include <linux/mmu_context.h>
 #include <linux/sched/smt.h>
 #include <linux/sched/mm.h>
 
@@ -294,28 +295,6 @@ static void load_new_mm_cr3(pgd_t *pgdir, u16 new_asid, bool need_flush)
 	write_cr3(new_mm_cr3);
 }
 
-void leave_mm(int cpu)
-{
-	struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
-
-	/*
-	 * It's plausible that we're in lazy TLB mode while our mm is init_mm.
-	 * If so, our callers still expect us to flush the TLB, but there
-	 * aren't any user TLB entries in init_mm to worry about.
-	 *
-	 * This needs to happen before any other sanity checks due to
-	 * intel_idle's shenanigans.
-	 */
-	if (loaded_mm == &init_mm)
-		return;
-
-	/* Warn if we're not lazy. */
-	WARN_ON(!this_cpu_read(cpu_tlbstate_shared.is_lazy));
-
-	switch_mm(NULL, &init_mm, NULL);
-}
-EXPORT_SYMBOL_GPL(leave_mm);
-
 void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 	       struct task_struct *tsk)
 {
@@ -512,8 +491,6 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 	 * from lazy TLB mode to normal mode if active_mm isn't changing.
 	 * When this happens, we don't assume that CR3 (and hence
 	 * cpu_tlbstate.loaded_mm) matches next.
-	 *
-	 * NB: leave_mm() calls us with prev == NULL and tsk == NULL.
 	 */
 
 	/* We don't want flush_tlb_func() to run concurrently with us. */
@@ -523,7 +500,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 	/*
 	 * Verify that CR3 is what we think it is.  This will catch
 	 * hypothetical buggy code that directly switches to swapper_pg_dir
-	 * without going through leave_mm() / switch_mm_irqs_off() or that
+	 * without going through switch_mm_irqs_off() or that
 	 * does something like write_cr3(read_cr3_pa()).
 	 *
 	 * Only do this check if CONFIG_DEBUG_VM=y because __read_cr3()
@@ -732,7 +709,7 @@ temp_mm_state_t use_temporary_mm(struct mm_struct *mm)
 	 * restoring the previous mm.
 	 */
 	if (this_cpu_read(cpu_tlbstate_shared.is_lazy))
-		leave_mm(smp_processor_id());
+		unlazy_mm_irqs_off();
 
 	temp_state.mm = this_cpu_read(cpu_tlbstate.loaded_mm);
 	switch_mm_irqs_off(NULL, mm, current);
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index 3359c23573c5..ba849185810a 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -898,7 +898,7 @@ static void drop_mm_ref_this_cpu(void *info)
 	struct mm_struct *mm = info;
 
 	if (this_cpu_read(cpu_tlbstate.loaded_mm) == mm)
-		leave_mm(smp_processor_id());
+		unlazy_mm_irqs_off();
 
 	/*
 	 * If this cpu still has a stale cr3 reference, then make sure
diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index ef2ea1b12cd8..b865822a6278 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -223,7 +223,7 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv,
 	}
 
 	if (target_state->flags & CPUIDLE_FLAG_TLB_FLUSHED)
-		leave_mm(dev->cpu);
+		unlazy_mm_irqs_off();
 
 	/* Take note of the planned idle state. */
 	sched_idle_set_state(target_state);
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index e6c543b5ee1d..bb5d3b3e28df 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -115,8 +115,8 @@ static unsigned int mwait_substates __initdata;
  * If the local APIC timer is not known to be reliable in the target idle state,
  * enable one-shot tick broadcasting for the target CPU before executing MWAIT.
  *
- * Optionally call leave_mm() for the target CPU upfront to avoid wakeups due to
- * flushing user TLBs.
+ * Optionally call unlazy_mm_irqs_off() for the target CPU upfront to avoid
+ * wakeups due to flushing user TLBs.
  *
  * Must be called under local_irq_disable().
  */
diff --git a/include/linux/mmu_context.h b/include/linux/mmu_context.h
index b9b970f7ab45..035e8e42eb78 100644
--- a/include/linux/mmu_context.h
+++ b/include/linux/mmu_context.h
@@ -10,9 +10,7 @@
 # define switch_mm_irqs_off switch_mm
 #endif
 
-#ifndef leave_mm
-static inline void leave_mm(int cpu) { }
-#endif
+extern void unlazy_mm_irqs_off(void);
 
 /*
  * CPUs that are capable of running user task @p. Must contain at least one
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1010e63962d9..e57121bc84d5 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3071,5 +3071,3 @@ extern int preempt_dynamic_mode;
 extern int sched_dynamic_mode(const char *str);
 extern void sched_dynamic_update(int mode);
 #endif
-
-extern void unlazy_mm_irqs_off(void);
-- 
2.33.1


^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH 21/23] x86/mm: Use unlazy_mm_irqs_off() in TLB flush IPIs
  2022-01-08 16:43 [PATCH 00/23] mm, sched: Rework lazy mm handling Andy Lutomirski
                   ` (19 preceding siblings ...)
  2022-01-08 16:44 ` [PATCH 20/23] x86/mm: Remove leave_mm() in favor of unlazy_mm_irqs_off() Andy Lutomirski
@ 2022-01-08 16:44 ` Andy Lutomirski
  2022-01-08 16:44 ` [PATCH 22/23] x86/mm: Optimize for_each_possible_lazymm_cpu() Andy Lutomirski
  2022-01-08 16:44 ` [PATCH 23/23] x86/mm: Opt in to IRQs-off activate_mm() Andy Lutomirski
  22 siblings, 0 replies; 79+ messages in thread
From: Andy Lutomirski @ 2022-01-08 16:44 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Nicholas Piggin, Anton Blanchard, Benjamin Herrenschmidt,
	Paul Mackerras, Randy Dunlap, linux-arch, x86, Rik van Riel,
	Dave Hansen, Peter Zijlstra, Nadav Amit, Mathieu Desnoyers,
	Andy Lutomirski

When IPI-flushing a lazy mm, we switch away from the lazy mm.  Use
unlazy_mm_irqs_off() so the scheduler knows we did this.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/mm/tlb.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index e502565176b9..225b407812c7 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -843,7 +843,7 @@ static void flush_tlb_func(void *info)
 		 * This should be rare, with native_flush_tlb_multi() skipping
 		 * IPIs to lazy TLB mode CPUs.
 		 */
-		switch_mm_irqs_off(NULL, &init_mm, NULL);
+		unlazy_mm_irqs_off();
 		return;
 	}
 
-- 
2.33.1


^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH 22/23] x86/mm: Optimize for_each_possible_lazymm_cpu()
  2022-01-08 16:43 [PATCH 00/23] mm, sched: Rework lazy mm handling Andy Lutomirski
                   ` (20 preceding siblings ...)
  2022-01-08 16:44 ` [PATCH 21/23] x86/mm: Use unlazy_mm_irqs_off() in TLB flush IPIs Andy Lutomirski
@ 2022-01-08 16:44 ` Andy Lutomirski
  2022-01-08 16:44 ` [PATCH 23/23] x86/mm: Opt in to IRQs-off activate_mm() Andy Lutomirski
  22 siblings, 0 replies; 79+ messages in thread
From: Andy Lutomirski @ 2022-01-08 16:44 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Nicholas Piggin, Anton Blanchard, Benjamin Herrenschmidt,
	Paul Mackerras, Randy Dunlap, linux-arch, x86, Rik van Riel,
	Dave Hansen, Peter Zijlstra, Nadav Amit, Mathieu Desnoyers,
	Andy Lutomirski

Now that x86 no longer switches away from a lazy mm behind the scheduler's
back (and thus no longer clears a CPU that the scheduler still considers
lazy from mm_cpumask()), x86 can use mm_cpumask() to optimize
for_each_possible_lazymm_cpu().

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/mmu.h | 4 ++++
 arch/x86/mm/tlb.c          | 4 +++-
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
index 03ba71420ff3..da55f768e68c 100644
--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -63,5 +63,9 @@ typedef struct {
 		.lock = __MUTEX_INITIALIZER(mm.context.lock),		\
 	}
 
+/* On x86, mm_cpumask(mm) contains all CPUs that might be lazily using mm */
+#define for_each_possible_lazymm_cpu(cpu, mm) \
+	for_each_cpu((cpu), mm_cpumask((mm)))
+
 
 #endif /* _ASM_X86_MMU_H */
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 225b407812c7..04eb43e96e23 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -706,7 +706,9 @@ temp_mm_state_t use_temporary_mm(struct mm_struct *mm)
 	/*
 	 * Make sure not to be in TLB lazy mode, as otherwise we'll end up
 	 * with a stale address space WITHOUT being in lazy mode after
-	 * restoring the previous mm.
+	 * restoring the previous mm.  Additionally, once we switch mms,
+	 * for_each_possible_lazymm_cpu() will no longer report this CPU,
+	 * so a lazymm pin wouldn't work.
 	 */
 	if (this_cpu_read(cpu_tlbstate_shared.is_lazy))
 		unlazy_mm_irqs_off();
-- 
2.33.1


^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH 23/23] x86/mm: Opt in to IRQs-off activate_mm()
  2022-01-08 16:43 [PATCH 00/23] mm, sched: Rework lazy mm handling Andy Lutomirski
                   ` (21 preceding siblings ...)
  2022-01-08 16:44 ` [PATCH 22/23] x86/mm: Optimize for_each_possible_lazymm_cpu() Andy Lutomirski
@ 2022-01-08 16:44 ` Andy Lutomirski
  22 siblings, 0 replies; 79+ messages in thread
From: Andy Lutomirski @ 2022-01-08 16:44 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Nicholas Piggin, Anton Blanchard, Benjamin Herrenschmidt,
	Paul Mackerras, Randy Dunlap, linux-arch, x86, Rik van Riel,
	Dave Hansen, Peter Zijlstra, Nadav Amit, Mathieu Desnoyers,
	Andy Lutomirski

We gain nothing by having the core code enable IRQs right before calling
activate_mm() only for us to turn them right back off again in switch_mm().

This will save a few cycles, so execve() should be blazingly fast with this
patch applied!

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/Kconfig                   | 1 +
 arch/x86/include/asm/mmu_context.h | 8 ++++----
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 5060c38bf560..908a596619f2 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -119,6 +119,7 @@ config X86
 	select ARCH_WANT_LD_ORPHAN_WARN
 	select ARCH_WANTS_THP_SWAP		if X86_64
 	select ARCH_HAS_PARANOID_L1D_FLUSH
+	select ARCH_WANT_IRQS_OFF_ACTIVATE_MM
 	select BUILDTIME_TABLE_SORT
 	select CLKEVT_I8253
 	select CLOCKSOURCE_VALIDATE_LAST_CYCLE
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 2ca4fc4a8a0a..f028f1b68bc0 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -132,10 +132,10 @@ extern void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 			       struct task_struct *tsk);
 #define switch_mm_irqs_off switch_mm_irqs_off
 
-#define activate_mm(prev, next)			\
-do {						\
-	paravirt_activate_mm((prev), (next));	\
-	switch_mm((prev), (next), NULL);	\
+#define activate_mm(prev, next)				\
+do {							\
+	paravirt_activate_mm((prev), (next));		\
+	switch_mm_irqs_off((prev), (next), NULL);	\
 } while (0);
 
 #ifdef CONFIG_X86_32
-- 
2.33.1


^ permalink raw reply related	[flat|nested] 79+ messages in thread

* Re: [PATCH 16/23] sched: Use lightweight hazard pointers to grab lazy mms
  2022-01-08 16:44 ` [PATCH 16/23] sched: Use lightweight hazard pointers to grab lazy mms Andy Lutomirski
@ 2022-01-08 19:22   ` Linus Torvalds
  2022-01-08 22:04     ` Andy Lutomirski
  2022-01-09  5:56   ` Nadav Amit
  1 sibling, 1 reply; 79+ messages in thread
From: Linus Torvalds @ 2022-01-08 19:22 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andrew Morton, Linux-MM, Nicholas Piggin, Anton Blanchard,
	Benjamin Herrenschmidt, Paul Mackerras, Randy Dunlap, linux-arch,
	the arch/x86 maintainers, Rik van Riel, Dave Hansen,
	Peter Zijlstra, Nadav Amit, Mathieu Desnoyers

On Sat, Jan 8, 2022 at 8:44 AM Andy Lutomirski <luto@kernel.org> wrote:
>
> To improve scalability, this patch adds a percpu hazard pointer scheme to
> keep lazily-used mms alive.  Each CPU has a single pointer to an mm that
> must not be freed, and __mmput() checks the pointers belonging to all CPUs
> that might be lazily using the mm in question.

Ugh. This feels horribly fragile to me, and also looks like it makes
some common cases potentially quite expensive for machines with large
CPU counts if they don't do that mm_cpumask optimization - which in
turn feels quite fragile as well.

IOW, this just feels *complicated*.

And I think it's overly so. I get the strong feeling that we could
make the rules much simpler and more straightforward.

For example, how about we make the rules be

 - a lazy TLB mm reference requires that there's an actual active user
of that mm (ie "mm_users > 0")

 - the last mm_users decrement (ie __mmput) forces a TLB flush, and
that TLB flush must make sure that no lazy users exist (which I think
it does already anyway).

Doesn't that seem like a really simple set of rules?

And the nice thing about it is that we *already* do that required TLB
flush in all normal circumstances. __mmput() already calls
exit_mmap(), and exit_mm() already forces that TLB flush in every
normal situation.

So we might have to make sure that every architecture really does that
"drop lazy mms on TLB flush", and maybe add a flag to the existing
'struct mmu_gather tlb' to make sure that flush actually always
happens (even if the process somehow managed to unmap all vma's even
before exiting).
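
Roughly, as a hypothetical sketch (made-up helper name; exit_mmap() and
mmdrop() are the real functions):

static void mmput_last_user(struct mm_struct *mm)	/* hypothetical */
{
	/*
	 * mm_users just hit zero, so no thread can start using this mm
	 * again.  The final flush in exit_mmap() must also kick every
	 * lazy user off the mm.
	 */
	exit_mmap(mm);

	/*
	 * After that flush, no CPU has this mm as its lazy ->active_mm,
	 * so lazy users hold no extra references and the mm can simply
	 * be freed.
	 */
	mmdrop(mm);
}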

Is there something silly I'm missing? Somebody pat me on the head, and
say "There, there, Linus, don't try to get involved with things you
don't understand.." and explain to me in small words.

                  Linus

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 16/23] sched: Use lightweight hazard pointers to grab lazy mms
  2022-01-08 19:22   ` Linus Torvalds
@ 2022-01-08 22:04     ` Andy Lutomirski
  2022-01-09  0:27       ` Linus Torvalds
  2022-01-09  0:53       ` Linus Torvalds
  0 siblings, 2 replies; 79+ messages in thread
From: Andy Lutomirski @ 2022-01-08 22:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Linux-MM, Nicholas Piggin, Anton Blanchard,
	Benjamin Herrenschmidt, Paul Mackerras, Randy Dunlap, linux-arch,
	the arch/x86 maintainers, Rik van Riel, Dave Hansen,
	Peter Zijlstra (Intel),
	Nadav Amit, Mathieu Desnoyers



On Sat, Jan 8, 2022, at 12:22 PM, Linus Torvalds wrote:
> On Sat, Jan 8, 2022 at 8:44 AM Andy Lutomirski <luto@kernel.org> wrote:
>>
>> To improve scalability, this patch adds a percpu hazard pointer scheme to
>> keep lazily-used mms alive.  Each CPU has a single pointer to an mm that
>> must not be freed, and __mmput() checks the pointers belonging to all CPUs
>> that might be lazily using the mm in question.
>
> Ugh. This feels horribly fragile to me, and also looks like it makes
> some common cases potentially quite expensive for machines with large
> CPU counts if they don't do that mm_cpumask optimization - which in
> turn feels quite fragile as well.
>
> IOW, this just feels *complicated*.
>
> And I think it's overly so. I get the strong feeling that we could
> make the rules much simpler and more straightforward.
>
> For example, how about we make the rules be

There there, Linus, not everything is as simple^Wincapable as x86 bare metal, and mm_cpumask does not have useful cross-arch semantics.  Is that good?

>
>  - a lazy TLB mm reference requires that there's an actual active user
> of that mm (ie "mm_users > 0")
>
>  - the last mm_users decrement (ie __mmput) forces a TLB flush, and
> that TLB flush must make sure that no lazy users exist (which I think
> it does already anyway).

It does, on x86 bare metal, in exit_mmap().  It’s implicit, but it could be made explicit, as below.

>
> Doesn't that seem like a really simple set of rules?
>
> And the nice thing about it is that we *already* do that required TLB
> flush in all normal circumstances. __mmput() already calls
> exit_mmap(), and exit_mm() already forces that TLB flush in every
> normal situation.

Exactly. On x86 bare metal and similar architectures, this flush is done by IPI, which involves a loop over all CPUs that might be using the mm.  And other patches in this series add the core ability to shoot down a lazy mm cleanly, so that the core drops its reference, and wire that ability up for x86.

>
> So we might have to make sure that every architecture really does that
> "drop lazy mms on TLB flush", and maybe add a flag to the existing
> 'struct mmu_gather tlb' to make sure that flush actually always
> happens (even if the process somehow managed to unmap all vma's even
> before exiting).

So this requires that all architectures actually walk all relevant CPUs to see if an IPI is needed and send that IPI.  On architectures that actually need an IPI anyway (x86 bare metal, powerpc (I think), and others), fine. But on architectures with a broadcast-to-all-CPUs flush (ARM64 IIUC), the extra IPI will be much, much slower than a simple load-acquire in a loop.

In fact, arm64 doesn’t even track mm_cpumask at all last time I checked, so even an IPI lazy shoot down would require looping *all* CPUs, doing a load-acquire, and possibly doing an IPI. I much prefer doing a load-acquire and possibly a cmpxchg.

(And x86 PV can do hypercall flushes.  If a bunch of vCPUs are not running, an IPI shootdown will end up sleeping until they run, whereas this patch will allow the hypervisor to leave them asleep and thus to finish __mmput without waking them. This only matters on a CPU-oversubscribed host, but still.  And it kind of looks like hardware remote flushes are coming in AMD land eventually.)

But yes, I fully agree that this patch is complicated and subtle.

>
> Is there something silly I'm missing? Somebody pat me on the head, and
> say "There, there, Linus, don't try to get involved with things you
> don't understand.." and explain to me in small words.
>
>                   Linus

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 16/23] sched: Use lightweight hazard pointers to grab lazy mms
  2022-01-08 22:04     ` Andy Lutomirski
@ 2022-01-09  0:27       ` Linus Torvalds
  2022-01-09  0:53       ` Linus Torvalds
  1 sibling, 0 replies; 79+ messages in thread
From: Linus Torvalds @ 2022-01-09  0:27 UTC (permalink / raw)
  To: Andy Lutomirski, Will Deacon, Catalin Marinas
  Cc: Andrew Morton, Linux-MM, Nicholas Piggin, Anton Blanchard,
	Benjamin Herrenschmidt, Paul Mackerras, Randy Dunlap, linux-arch,
	the arch/x86 maintainers, Rik van Riel, Dave Hansen,
	Peter Zijlstra (Intel),
	Nadav Amit, Mathieu Desnoyers

[-- Attachment #1: Type: text/plain, Size: 1902 bytes --]

On Sat, Jan 8, 2022 at 2:04 PM Andy Lutomirski <luto@kernel.org> wrote:
>
> So this requires that all architectures actually walk all relevant
> CPUs to see if an IPI is needed and send that IPI. On architectures
> that actually need an IPI anyway (x86 bare metal, powerpc (I think)
> and others, fine. But on architectures with a broadcast-to-all-CPUs
> flush (ARM64 IIUC), then the extra IPI will be much much slower than a
> simple load-acquire in a loop.

... hmm. How about a hybrid scheme?

 (a) architectures that already require that IPI anyway for TLB
invalidation (ie x86, but others too), just make the rule be that the TLB
flush by exit_mmap() get rid of any lazy TLB mm references. Which they
already do.

 (b) architectures like arm64 that do hw-assisted TLB shootdown will have
an ASID allocator model, and what you do is to use that to either
    (b') increment/decrement the mm_count at mm ASID allocation/freeing time
    (b'') use the existing ASID tracking data to find the CPU's that have
that ASID

 (c) can you really imagine hardware TLB shootdown without ASID allocation?
That doesn't seem to make sense. But if it exists, maybe that kind of crazy
case would do the percpu array walking.

(Honesty in advertising: I don't know the arm64 ASID code - I used to know
the old alpha version I wrote in a previous lifetime - but afaik any ASID
allocator has to be able to track CPU's that have a particular ASID in use
and be able to invalidate it).

Hmm. The x86 maintainers are on this thread, but they aren't even the
problem. Adding Catalin and Will to this, I think they should know if/how
this would fit with the arm64 ASID allocator.

Will/Catalin, background here:


https://lore.kernel.org/all/CAHk-=wj4LZaFB5HjZmzf7xLFSCcQri-WWqOEJHwQg0QmPRSdQA@mail.gmail.com/

for my objection to that special "keep non-refcounted magic per-cpu pointer
to lazy tlb mm".

           Linus

[-- Attachment #2: Type: text/html, Size: 2452 bytes --]

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 16/23] sched: Use lightweight hazard pointers to grab lazy mms
  2022-01-08 22:04     ` Andy Lutomirski
  2022-01-09  0:27       ` Linus Torvalds
@ 2022-01-09  0:53       ` Linus Torvalds
  2022-01-09  3:58         ` Andy Lutomirski
  1 sibling, 1 reply; 79+ messages in thread
From: Linus Torvalds @ 2022-01-09  0:53 UTC (permalink / raw)
  To: Andy Lutomirski, Will Deacon, Catalin Marinas
  Cc: Andrew Morton, Linux-MM, Nicholas Piggin, Anton Blanchard,
	Benjamin Herrenschmidt, Paul Mackerras, Randy Dunlap, linux-arch,
	the arch/x86 maintainers, Rik van Riel, Dave Hansen,
	Peter Zijlstra (Intel),
	Nadav Amit, Mathieu Desnoyers

[ Let's try this again, without the html crud this time. Apologies to
the people who see this reply twice ]

On Sat, Jan 8, 2022 at 2:04 PM Andy Lutomirski <luto@kernel.org> wrote:
>
> So this requires that all architectures actually walk all relevant
> CPUs to see if an IPI is needed and send that IPI. On architectures
> that actually need an IPI anyway (x86 bare metal, powerpc (I think)
> and others, fine. But on architectures with a broadcast-to-all-CPUs
> flush (ARM64 IIUC), then the extra IPI will be much much slower than a
> simple load-acquire in a loop.

... hmm. How about a hybrid scheme?

 (a) architectures that already require that IPI anyway for TLB
invalidation (ie x86, but others too), just make the rule be that the
TLB flush by exit_mmap() get rid of any lazy TLB mm references. Which
they already do.

 (b) architectures like arm64 that do hw-assisted TLB shootdown will
have an ASID allocator model, and what you do is to use that to either
    (b') increment/decrement the mm_count at mm ASID allocation/freeing time
    (b'') use the existing ASID tracking data to find the CPU's that
have that ASID

 (c) can you really imagine hardware TLB shootdown without ASID
allocation? That doesn't seem to make sense. But if it exists, maybe
that kind of crazy case would do the percpu array walking.

(Honesty in advertising: I don't know the arm64 ASID code - I used to
know the old alpha version I wrote in a previous lifetime - but afaik
any ASID allocator has to be able to track CPU's that have a
particular ASID in use and be able to invalidate it).

Hmm. The x86 maintainers are on this thread, but they aren't even the
problem. Adding Catalin and Will to this, I think they should know
if/how this would fit with the arm64 ASID allocator.

Will/Catalin, background here:

   https://lore.kernel.org/all/CAHk-=wj4LZaFB5HjZmzf7xLFSCcQri-WWqOEJHwQg0QmPRSdQA@mail.gmail.com/

for my objection to that special "keep non-refcounted magic per-cpu
pointer to lazy tlb mm".

           Linus

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 16/23] sched: Use lightweight hazard pointers to grab lazy mms
  2022-01-09  0:53       ` Linus Torvalds
@ 2022-01-09  3:58         ` Andy Lutomirski
  2022-01-09  4:38           ` Linus Torvalds
  0 siblings, 1 reply; 79+ messages in thread
From: Andy Lutomirski @ 2022-01-09  3:58 UTC (permalink / raw)
  To: Linus Torvalds, Will Deacon, Catalin Marinas
  Cc: Andrew Morton, Linux-MM, Nicholas Piggin, Anton Blanchard,
	Benjamin Herrenschmidt, Paul Mackerras, Randy Dunlap, linux-arch,
	the arch/x86 maintainers, Rik van Riel, Dave Hansen,
	Peter Zijlstra (Intel),
	Nadav Amit, Mathieu Desnoyers



On Sat, Jan 8, 2022, at 4:53 PM, Linus Torvalds wrote:
> [ Let's try this again, without the html crud this time. Apologies to
> the people who see this reply twice ]
>
> On Sat, Jan 8, 2022 at 2:04 PM Andy Lutomirski <luto@kernel.org> wrote:
>>
>> So this requires that all architectures actually walk all relevant
>> CPUs to see if an IPI is needed and send that IPI. On architectures
>> that actually need an IPI anyway (x86 bare metal, powerpc (I think)
>> and others, fine. But on architectures with a broadcast-to-all-CPUs
>> flush (ARM64 IIUC), then the extra IPI will be much much slower than a
>> simple load-acquire in a loop.
>
> ... hmm. How about a hybrid scheme?
>
>  (a) architectures that already require that IPI anyway for TLB
> invalidation (ie x86, but others too), just make the rule be that the
> TLB flush by exit_mmap() get rid of any lazy TLB mm references. Which
> they already do.
>
>  (b) architectures like arm64 that do hw-assisted TLB shootdown will
> have an ASID allocator model, and what you do is to use that to either
>     (b') increment/decrement the mm_count at mm ASID allocation/freeing time
>     (b'') use the existing ASID tracking data to find the CPU's that
> have that ASID
>
>  (c) can you really imagine hardware TLB shootdown without ASID
> allocation? That doesn't seem to make sense. But if it exists, maybe
> that kind of crazy case would do the percpu array walking.
>

So I can go over a handful of TLB flush schemes:

1. x86 bare metal.  As noted, just plain shootdown would work.  (Unless we switch to inexact mm_cpumask() tracking, which might be enough of a win that it's worth it.)  Right now, "ASID" (i.e. PCID, thanks Intel) is allocated per cpu.  They are never explicitly freed -- they just expire off a percpu LRU.  The data structures have no idea whether an mm still exists -- instead they track mm->context.ctx_id, which is 64 bits and never reused.

2. x86 paravirt.  This is just like bare metal except there's a hypercall to flush a specific target cpu.  (I think this is mutually exclusive with PCID, but I'm not sure.  I haven't looked that hard.  I'm not sure exactly what is implemented right now.  It could be an operation to flush (cpu, pcid), but that gets awkward for reasons that aren't too relevant to this discussion.)  In this model, the exit_mmap() shootdown would either need to switch to a non-paravirt flush or we need a fancy mm_count solution of some sort.

3. Hypothetical better x86.  AMD has INVLPGB, which is almost useless right now.  But it's *so* close to being very useful, and I've asked engineers at AMD and Intel to improve this.  Specifically, I want PCID to be widened to 64 bits.  (This would, as I understand it, not affect the TLB hardware at all.  It would affect the tiny table that sits in front of the real PCID and maintains the illusion that PCID is 12 bits, and it would affect the MOV CR3 instruction.  The latter makes it complicated.)  And INVLPGB would invalidate a given 64-bit PCID system-wide.  In this model, there would be no such thing as freeing an ASID.  So I think we would want something very much like this patch.

4. ARM64.  I only barely understand it, but I think it's an intermediate scheme with ASIDs that are wide enough to be useful but narrow enough to run out on occasion.  I don't think they're tracked -- I think the whole world just gets invalidated when they overflow.  I could be wrong.

In any event, ASID lifetimes aren't a magic solution -- how do we know when to expire an ASID?  Presumably it would be when an mm is fully freed (__mmdrop), which gets us right back to square one.

In any case, what I particularly like about my patch is that, while it's subtle, it's subtle just once.  I think it can handle all the interesting arch cases by merely redefining for_each_possible_lazymm_cpu() to do the right thing.
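
As a simplified illustration of the shape (an assumption, not the
literal definition from patch 16), the generic fallback has to treat
every CPU as a possible lazy user:

#ifndef for_each_possible_lazymm_cpu
/* Without better arch tracking, any CPU might be a lazy user. */
#define for_each_possible_lazymm_cpu(cpu, mm)	for_each_possible_cpu(cpu)
#endif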

> (Honesty in advertising: I don't know the arm64 ASID code - I used to
> know the old alpha version I wrote in a previous lifetime - but afaik
> any ASID allocator has to be able to track CPU's that have a
> particular ASID in use and be able to invalidate it).
>
> Hmm. The x86 maintainers are on this thread, but they aren't even the
> problem. Adding Catalin and Will to this, I think they should know
> if/how this would fit with the arm64 ASID allocator.
>

Well, I am an x86 mm maintainer, and there is definitely a performance problem on large x86 systems right now. :)

> Will/Catalin, background here:
>
>    
> https://lore.kernel.org/all/CAHk-=wj4LZaFB5HjZmzf7xLFSCcQri-WWqOEJHwQg0QmPRSdQA@mail.gmail.com/
>
> for my objection to that special "keep non-refcounted magic per-cpu
> pointer to lazy tlb mm".
>
>            Linus

--Andy

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 16/23] sched: Use lightweight hazard pointers to grab lazy mms
  2022-01-09  3:58         ` Andy Lutomirski
@ 2022-01-09  4:38           ` Linus Torvalds
  2022-01-09 20:19             ` Andy Lutomirski
  0 siblings, 1 reply; 79+ messages in thread
From: Linus Torvalds @ 2022-01-09  4:38 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Will Deacon, Catalin Marinas, Andrew Morton, Linux-MM,
	Nicholas Piggin, Anton Blanchard, Benjamin Herrenschmidt,
	Paul Mackerras, Randy Dunlap, linux-arch,
	the arch/x86 maintainers, Rik van Riel, Dave Hansen,
	Peter Zijlstra (Intel),
	Nadav Amit, Mathieu Desnoyers

On Sat, Jan 8, 2022 at 7:59 PM Andy Lutomirski <luto@kernel.org> wrote:
>
> > Hmm. The x86 maintainers are on this thread, but they aren't even the
> > problem. Adding Catalin and Will to this, I think they should know
> > if/how this would fit with the arm64 ASID allocator.
> >
>
> Well, I am an x86 mm maintainer, and there is definitely a performance problem on large x86 systems right now. :)

Well, my point was that on x86, the complexities of the patch you
posted are completely pointless.

So on x86, you can just remove the mmgrab/mmdrop reference counts from
the lazy mm use entirely, and voila, that performance problem is gone.
We don't _need_ reference counting on x86 at all, if we just say that
the rule is that a lazy mm is always associated with a
honest-to-goodness live mm.

So on x86 - and any platform with the IPI model - there is no need for
hundreds of lines of complexity at all.

THAT is my point. Your patch adds complexity that buys you ABSOLUTELY NOTHING.

You then saying that the mmgrab/mmdrop is a performance problem is
just trying to muddy the water. You can just remove it entirely.

Now, I do agree that that depends on the whole "TLB IPI will get rid
of any lazy mm users on other cpus". So I agree that if you have
hardware TLB invalidation that then doesn't have that software
component to it, you need something else.

But my argument _then_ was that hardware TLB invalidation then needs
the hardware ASID thing to be useful, and the ASID management code
already effectively keeps track of "this ASID is used on other CPU's".
And that's exactly the same kind of information that your patch
basically added a separate percpu array for.

So I think that even for that hardware TLB shootdown case, your patch
only adds overhead.

And it potentially adds a *LOT* of overhead, if you replace an atomic
refcount with a "for_each_possible_cpu()" loop that has to do cmpxchg
things too.

Now, on x86, where we maintain that mm_cpumask, and as a result that
overhead is much lower - but we maintain that mm_cpumask exactly
*because* we do that IPI thing, so I don't think you can use that
argument in favor of your patch. When we do the IPI thing, your patch
is worthless overhead.

See?

Btw, you don't even need to really solve the arm64 TLB invalidate
thing - we could make the rule be that we only do the mmgrab/mmput at
all on platforms that don't do that IPI flush.

I think that's basically exactly what Nick Piggin wanted to do on powerpc, no?

But you hated that patch, for non-obvious reasons, and are now
introducing this new patch that is clearly non-optimal on x86.

So I think there's some intellectual dishonesty on your part here.

                Linus

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 16/23] sched: Use lightweight hazard pointers to grab lazy mms
  2022-01-08 16:44 ` [PATCH 16/23] sched: Use lightweight hazard pointers to grab lazy mms Andy Lutomirski
  2022-01-08 19:22   ` Linus Torvalds
@ 2022-01-09  5:56   ` Nadav Amit
  2022-01-09  6:48     ` Linus Torvalds
  1 sibling, 1 reply; 79+ messages in thread
From: Nadav Amit @ 2022-01-09  5:56 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andrew Morton, Linux-MM, Nicholas Piggin, Anton Blanchard,
	Benjamin Herrenschmidt, Paul Mackerras, Randy Dunlap, linux-arch,
	X86 ML, Rik van Riel, Dave Hansen, Peter Zijlstra,
	Mathieu Desnoyers, Linus Torvalds


> On Jan 8, 2022, at 8:44 AM, Andy Lutomirski <luto@kernel.org> wrote:
> 
> The purpose of mmgrab() and mmdrop() is to make "lazy tlb" mode safe.

Just wondering: In a world of ASID/PCID - does the “lazy TLB” really
have a worthy advantage?

Considering the fact that with PTI anyhow address spaces are switched
all the time, can’t we just get rid of it?


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 16/23] sched: Use lightweight hazard pointers to grab lazy mms
  2022-01-09  5:56   ` Nadav Amit
@ 2022-01-09  6:48     ` Linus Torvalds
  2022-01-09  8:49       ` Nadav Amit
  0 siblings, 1 reply; 79+ messages in thread
From: Linus Torvalds @ 2022-01-09  6:48 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andy Lutomirski, Andrew Morton, Linux-MM, Nicholas Piggin,
	Anton Blanchard, Benjamin Herrenschmidt, Paul Mackerras,
	Randy Dunlap, linux-arch, X86 ML, Rik van Riel, Dave Hansen,
	Peter Zijlstra, Mathieu Desnoyers

On Sat, Jan 8, 2022 at 9:56 PM Nadav Amit <nadav.amit@gmail.com> wrote:
>
> Just wondering: In a world of ASID/PCID - does the “lazy TLB” really
> have a worthy advantage?
>
> Considering the fact that with PTI anyhow address spaces are switched
> all the time, can’t we just get rid of it?

Hmm.. That may indeed be the right thing to do.

I think arm64 already hardcodes ASID 0 to init_mm, and that kernel
threads (and the idle threads in particular) might as well just use
that. In that kind of situation, there's likely little advantage to
reusing a user address space ID, and quite possibly any potential
advantage is overshadowed by the costs.

The lazy tlb thing goes back a *looong* way, and lots of things have
changed since. Maybe it's not worth it any more.

Or maybe it's only worth it on platforms where it's free (UP, possibly
other situations - like if you have IPI and it's "free").

If I read the history correctly, it looks like PF_LAZY_TLB was
introduced in 2.3.11-pre4 or something. Back in summer of 1999. The
"active_mm" vs "mm" model came later.

                  Linus

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 16/23] sched: Use lightweight hazard pointers to grab lazy mms
  2022-01-09  6:48     ` Linus Torvalds
@ 2022-01-09  8:49       ` Nadav Amit
  2022-01-09 19:10         ` Linus Torvalds
  2022-01-09 19:22         ` Rik van Riel
  0 siblings, 2 replies; 79+ messages in thread
From: Nadav Amit @ 2022-01-09  8:49 UTC (permalink / raw)
  To: Linus Torvalds, Andy Lutomirski
  Cc: Andrew Morton, Linux-MM, Nicholas Piggin, Anton Blanchard,
	Benjamin Herrenschmidt, Paul Mackerras, Randy Dunlap, linux-arch,
	X86 ML, Rik van Riel, Dave Hansen, Peter Zijlstra,
	Mathieu Desnoyers



> On Jan 8, 2022, at 10:48 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> On Sat, Jan 8, 2022 at 9:56 PM Nadav Amit <nadav.amit@gmail.com> wrote:
>> 
>> Just wondering: In a world of ASID/PCID - does the “lazy TLB” really
>> have a worthy advantage?
>> 
>> Considering the fact that with PTI anyhow address spaces are switched
>> all the time, can’t we just get rid of it?
> 

[snip]

> 
> Or maybe it's only worth it on platforms where it's free (UP, possibly
> other situations - like if you have IPI and it's "free").

On UP it might be free, but on x86+IPIs there is a tradeoff.

When you decide which CPUs you want to send the IPI to, in the
common flow (no tables freed) you check whether they use
“lazy TLB” or not in order to filter out the lazy ones. In the
past this was on a cacheline with other frequently-dirtied data, so
the cacheline often bounced from cache to cache. Worse, the
test used an indirect branch, so it was expensive with Spectre v2
mitigations. I fixed it some time ago, so things are better and
today the is_lazy cacheline should bounce less between caches,
but there is a tradeoff in maintaining and checking both the cpumask
and then is_lazy for each CPU in the cpumask.

It is possible for instance to get rid of is_lazy, keep the CPU
on mm_cpumask when switching to kernel thread, and then if/when
an IPI is received remove it from cpumask to avoid further
unnecessary TLB shootdown IPIs.

I do not know whether it is a pure win, but there is a tradeoff.
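
A rough sketch of that alternative, with a made-up helper name (not a
real patch):

/* Hypothetical: called from the TLB flush IPI path on x86. */
static void lazy_cpu_drop_from_cpumask(const struct flush_tlb_info *f)
{
	if (current->mm ||
	    this_cpu_read(cpu_tlbstate.loaded_mm) != f->mm)
		return;

	/* Stop receiving shootdowns for an mm we only hold lazily... */
	cpumask_clear_cpu(smp_processor_id(), mm_cpumask(f->mm));

	/* ...and switch away so no stale translations remain. */
	unlazy_mm_irqs_off();
}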

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 16/23] sched: Use lightweight hazard pointers to grab lazy mms
  2022-01-09  8:49       ` Nadav Amit
@ 2022-01-09 19:10         ` Linus Torvalds
  2022-01-09 19:52           ` Andy Lutomirski
  2022-01-09 19:22         ` Rik van Riel
  1 sibling, 1 reply; 79+ messages in thread
From: Linus Torvalds @ 2022-01-09 19:10 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andy Lutomirski, Andrew Morton, Linux-MM, Nicholas Piggin,
	Anton Blanchard, Benjamin Herrenschmidt, Paul Mackerras,
	Randy Dunlap, linux-arch, X86 ML, Rik van Riel, Dave Hansen,
	Peter Zijlstra, Mathieu Desnoyers

On Sun, Jan 9, 2022 at 12:49 AM Nadav Amit <nadav.amit@gmail.com> wrote:
>
> I do not know whether it is a pure win, but there is a tradeoff.

Hmm. I guess only some serious testing would tell.

On x86, I'd be a bit worried about removing lazy TLB simply because
even with ASID support there (called PCIDs by Intel for NIH reasons),
the actual ASID space on x86 was at least originally very very
limited.

Architecturally, x86 may expose 12 bits of ASID space, but iirc at
least the first few implementations actually only internally had one
or two bits, and hashed the 12 bits down to that internal very limited
hardware TLB ID space.

We only use a handful of ASIDs per CPU on x86 partly for this reason
(but also since there's no remote hardware TLB shootdown, there's no
reason to have a bigger global ASID space, so ASIDs aren't _that_
common).

And I don't know how many non-PCID x86 systems (perhaps virtualized?)
there might be out there.

But it would be very interesting to test some "disable lazy tlb"
patch. The main problem workloads tend to be IO, and I'm not sure how
many of the automated performance tests would catch issues. I guess
some threaded pipe ping-pong test (with each thread pinned to
different cores) would show it.

And I guess there is some load that triggered the original powerpc
patch by Nick&co, and that Andy has been using..

Anybody willing to cook up a patch and run some benchmarks? Perhaps
one that basically just replaces "set ->mm to NULL" with "set ->mm to
&init_mm" - so that the lazy TLB code is still *there*, but it never
triggers..

I think it's mainly 'copy_thread()' in kernel/fork.c and the 'init_mm'
initializer in mm/init-mm.c, but there's probably other things too
that have that knowledge of the special "tsk->mm = NULL" situation.

                  Linus

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 16/23] sched: Use lightweight hazard pointers to grab lazy mms
  2022-01-09  8:49       ` Nadav Amit
  2022-01-09 19:10         ` Linus Torvalds
@ 2022-01-09 19:22         ` Rik van Riel
  2022-01-09 19:34           ` Nadav Amit
  1 sibling, 1 reply; 79+ messages in thread
From: Rik van Riel @ 2022-01-09 19:22 UTC (permalink / raw)
  To: Nadav Amit, Linus Torvalds, Andy Lutomirski
  Cc: Andrew Morton, Linux-MM, Nicholas Piggin, Anton Blanchard,
	Benjamin Herrenschmidt, Paul Mackerras, Randy Dunlap, linux-arch,
	X86 ML, Dave Hansen, Peter Zijlstra, Mathieu Desnoyers

[-- Attachment #1: Type: text/plain, Size: 624 bytes --]

On Sun, 2022-01-09 at 00:49 -0800, Nadav Amit wrote:
> 
> It is possible for instance to get rid of is_lazy, keep the CPU
> on mm_cpumask when switching to kernel thread, and then if/when
> an IPI is received remove it from cpumask to avoid further
> unnecessary TLB shootdown IPIs.
> 
> I do not know whether it is a pure win, but there is a tradeoff.

That's not a win at all. It is what we used to have before
the lazy mm stuff was re-introduced, and it led to quite a
few unnecessary IPIs, which are especially slow on virtual
machines, where the target CPU may not be running.

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 16/23] sched: Use lightweight hazard pointers to grab lazy mms
  2022-01-09 19:22         ` Rik van Riel
@ 2022-01-09 19:34           ` Nadav Amit
  2022-01-09 19:37             ` Rik van Riel
  0 siblings, 1 reply; 79+ messages in thread
From: Nadav Amit @ 2022-01-09 19:34 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Linus Torvalds, Andy Lutomirski, Andrew Morton, Linux-MM,
	Nicholas Piggin, Anton Blanchard, Benjamin Herrenschmidt,
	Paul Mackerras, Randy Dunlap, linux-arch, X86 ML, Dave Hansen,
	Peter Zijlstra, Mathieu Desnoyers



> On Jan 9, 2022, at 11:22 AM, Rik van Riel <riel@surriel.com> wrote:
> 
> On Sun, 2022-01-09 at 00:49 -0800, Nadav Amit wrote:
>> 
>> It is possible for instance to get rid of is_lazy, keep the CPU
>> on mm_cpumask when switching to kernel thread, and then if/when
>> an IPI is received remove it from cpumask to avoid further
>> unnecessary TLB shootdown IPIs.
>> 
>> I do not know whether it is a pure win, but there is a tradeoff.
> 
> That's not a win at all. It is what we used to have before
> the lazy mm stuff was re-introduced, and it led to quite a
> few unnecessary IPIs, which are especially slow on virtual
> machines, where the target CPU may not be running.

You make a good point about VMs.

IIUC Lazy-TLB serves several goals:

1. Avoid arch address-space switch to save switching time and
   TLB misses.
2. Prevent unnecessary IPIs while kernel threads run.
3. Avoid cache-contention on mm_cpumask when switching to a kernel
   thread.

Your concern is with (2), and this one is easy to keep regardless
of the rest.

I am not sure that (3) is actually helpful, since it might lead
to more cache activity than without lazy-TLB, but that is somewhat
orthogonal to everything else.

As for (1), which is the most fragile aspect, unless you use
shadow page-tables, I am not sure there is a significant benefit.


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 16/23] sched: Use lightweight hazard pointers to grab lazy mms
  2022-01-09 19:34           ` Nadav Amit
@ 2022-01-09 19:37             ` Rik van Riel
  2022-01-09 19:51               ` Nadav Amit
  0 siblings, 1 reply; 79+ messages in thread
From: Rik van Riel @ 2022-01-09 19:37 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Linus Torvalds, Andy Lutomirski, Andrew Morton, Linux-MM,
	Nicholas Piggin, Anton Blanchard, Benjamin Herrenschmidt,
	Paul Mackerras, Randy Dunlap, linux-arch, X86 ML, Dave Hansen,
	Peter Zijlstra, Mathieu Desnoyers

[-- Attachment #1: Type: text/plain, Size: 1684 bytes --]

On Sun, 2022-01-09 at 11:34 -0800, Nadav Amit wrote:
> 
> 
> > On Jan 9, 2022, at 11:22 AM, Rik van Riel <riel@surriel.com> wrote:
> > 
> > On Sun, 2022-01-09 at 00:49 -0800, Nadav Amit wrote:
> > > 
> > > It is possible for instance to get rid of is_lazy, keep the CPU
> > > on mm_cpumask when switching to kernel thread, and then if/when
> > > an IPI is received remove it from cpumask to avoid further
> > > unnecessary TLB shootdown IPIs.
> > > 
> > > I do not know whether it is a pure win, but there is a tradeoff.
> > 
> > That's not a win at all. It is what we used to have before
> > the lazy mm stuff was re-introduced, and it led to quite a
> > few unnecessary IPIs, which are especially slow on virtual
> > machines, where the target CPU may not be running.
> 
> You make a good point about VMs.
> 
> IIUC Lazy-TLB serves several goals:
> 
> 1. Avoid arch address-space switch to save switching time and
>    TLB misses.
> 2. Prevent unnecessary IPIs while kernel threads run.
> 3. Avoid cache-contention on mm_cpumask when switching to a kernel
>    thread.
> 
> Your concern is with (2), and this one is easy to keep regardless
> of the rest.
> 
> I am not sure that (3) is actually helpful, since it might lead
> to more cache activity than without lazy-TLB, but that is somewhat
> orthogonal to everything else.

I have seen problems with (3) in practice, too.

For many workloads, context switching is much, much more
of a hot path than TLB shootdowns, which are relatively
infrequent by comparison.

Basically ASID took away only the first concern from your
list above, not the other two.

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 16/23] sched: Use lightweight hazard pointers to grab lazy mms
  2022-01-09 19:37             ` Rik van Riel
@ 2022-01-09 19:51               ` Nadav Amit
  2022-01-09 19:54                 ` Linus Torvalds
  0 siblings, 1 reply; 79+ messages in thread
From: Nadav Amit @ 2022-01-09 19:51 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Linus Torvalds, Andy Lutomirski, Andrew Morton, Linux-MM,
	Nicholas Piggin, Anton Blanchard, Benjamin Herrenschmidt,
	Paul Mackerras, Randy Dunlap, linux-arch, X86 ML, Dave Hansen,
	Peter Zijlstra, Mathieu Desnoyers



> On Jan 9, 2022, at 11:37 AM, Rik van Riel <riel@surriel.com> wrote:
> 
> On Sun, 2022-01-09 at 11:34 -0800, Nadav Amit wrote:
>> 
>> 
>>> On Jan 9, 2022, at 11:22 AM, Rik van Riel <riel@surriel.com> wrote:
>>> 
>>> On Sun, 2022-01-09 at 00:49 -0800, Nadav Amit wrote:
>>>> 
>>>> It is possible for instance to get rid of is_lazy, keep the CPU
>>>> on mm_cpumask when switching to kernel thread, and then if/when
>>>> an IPI is received remove it from cpumask to avoid further
>>>> unnecessary TLB shootdown IPIs.
>>>> 
>>>> I do not know whether it is a pure win, but there is a tradeoff.
>>> 
>>> That's not a win at all. It is what we used to have before
>>> the lazy mm stuff was re-introduced, and it led to quite a
>>> few unnecessary IPIs, which are especially slow on virtual
>>> machines, where the target CPU may not be running.
>> 
>> You make a good point about VMs.
>> 
>> IIUC Lazy-TLB serves several goals:
>> 
>> 1. Avoid arch address-space switch to save switching time and
>>    TLB misses.
>> 2. Prevent unnecessary IPIs while kernel threads run.
>> 3. Avoid cache-contention on mm_cpumask when switching to a kernel
>>    thread.
>> 
>> Your concern is with (2), and this one is easy to keep regardless
>> of the rest.
>> 
>> I am not sure that (3) is actually helpful, since it might lead
>> to more cache activity than without lazy-TLB, but that is somewhat
>> orthogonal to everything else.
> 
> I have seen problems with (3) in practice, too.
> 
> For many workloads, context switching is much, much more
> of a hot path than TLB shootdowns, which are relatively
> infrequent by comparison.
> 
> Basically ASID took away only the first concern from your
> list above, not the other two.

I agree, but the point I was trying to make is that you can keep lazy
TLB for (2) and (3), but still switch the address-space. If you
already accept PTI, then the 600 cycles or so of switching the
address space back and forth on context switch, which should happen
less often than the switches PTI already does on syscalls/exceptions,
are not that painful.

You can also make a case that it is “safer” to switch the address
space, although SMAP/SMEP protection provides similar properties.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 16/23] sched: Use lightweight hazard pointers to grab lazy mms
  2022-01-09 19:10         ` Linus Torvalds
@ 2022-01-09 19:52           ` Andy Lutomirski
  2022-01-09 20:00             ` Linus Torvalds
  2022-01-09 20:34             ` Nadav Amit
  0 siblings, 2 replies; 79+ messages in thread
From: Andy Lutomirski @ 2022-01-09 19:52 UTC (permalink / raw)
  To: Linus Torvalds, Nadav Amit
  Cc: Andrew Morton, Linux-MM, Nicholas Piggin, Anton Blanchard,
	Benjamin Herrenschmidt, Paul Mackerras, Randy Dunlap, linux-arch,
	the arch/x86 maintainers, Rik van Riel, Dave Hansen,
	Peter Zijlstra (Intel),
	Mathieu Desnoyers

On Sun, Jan 9, 2022, at 11:10 AM, Linus Torvalds wrote:
> On Sun, Jan 9, 2022 at 12:49 AM Nadav Amit <nadav.amit@gmail.com> wrote:
>>
>> I do not know whether it is a pure win, but there is a tradeoff.
>
> Hmm. I guess only some serious testing would tell.
>
> On x86, I'd be a bit worried about removing lazy TLB simply because
> even with ASID support there (called PCIDs by Intel for NIH reasons),
> the actual ASID space on x86 was at least originally very very
> limited.
>
> Architecturally, x86 may expose 12 bits of ASID space, but iirc at
> least the first few implementations actually only internally had one
> or two bits, and hashed the 12 bits down to that internal very limited
> hardware TLB ID space.
>
> We only use a handful of ASIDs per CPU on x86 partly for this reason
> (but also since there's no remote hardware TLB shootdown, there's no
> reason to have a bigger global ASID space, so ASIDs aren't _that_
> common).
>
> And I don't know how many non-PCID x86 systems (perhaps virtualized?)
> there might be out there.
>
> But it would be very interesting to test some "disable lazy tlb"
> patch. The main problem workloads tend to be IO, and I'm not sure how
> many of the automated performance tests would catch issues. I guess
> some threaded pipe ping-pong test (with each thread pinned to
> different cores) would show it.

My original PCID series actually did remove lazy TLB on x86.  I don't remember why, but people objected.  The issue isn't the limited PCID space -- IIRC it's just that MOV CR3 is slooooow.  If we get rid of lazy TLB on x86, then we are writing CR3 twice on even a very short idle.  That adds maybe 1k cycles, which isn't great.

>
> And I guess there is some load that triggered the original powerpc
> patch by Nick&co, and that Andy has been using..

I don't own a big enough machine.  The workloads I'm aware of with the problem have massively multithreaded programs using many CPUs, and transitions into and out of lazy mode ping-pong the cacheline.

>
> Anybody willing to cook up a patch and run some benchmarks? Perhaps
> one that basically just replaces "set ->mm to NULL" with "set ->mm to
> &init_mm" - so that the lazy TLB code is still *there*, but it never
> triggers..

It would 

>
> I think it's mainly 'copy_thread()' in kernel/fork.c and the 'init_mm'
> initializer in mm/init-mm.c, but there's probably other things too
> that have that knowledge of the special "tsk->mm = NULL" situation.

I think, for a little test, we would leave all the mm == NULL code alone and just change the enter-lazy logic.  On top of all the cleanups in this series, that would be trivial.
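
Hypothetically, something like this (names and placement are guesses,
not a real patch):

/* Experiment: when switching to a kernel thread, don't go lazy. */
static void experiment_no_lazy_switch(struct task_struct *prev,
				      struct task_struct *next)
{
	/* Pay for a real switch to init_mm instead of borrowing prev's mm. */
	next->active_mm = &init_mm;
	mmgrab(&init_mm);
	switch_mm_irqs_off(prev->active_mm, &init_mm, next);
}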

>
>                   Linus

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 16/23] sched: Use lightweight hazard pointers to grab lazy mms
  2022-01-09 19:51               ` Nadav Amit
@ 2022-01-09 19:54                 ` Linus Torvalds
  0 siblings, 0 replies; 79+ messages in thread
From: Linus Torvalds @ 2022-01-09 19:54 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Rik van Riel, Andy Lutomirski, Andrew Morton, Linux-MM,
	Nicholas Piggin, Anton Blanchard, Benjamin Herrenschmidt,
	Paul Mackerras, Randy Dunlap, linux-arch, X86 ML, Dave Hansen,
	Peter Zijlstra, Mathieu Desnoyers

On Sun, Jan 9, 2022 at 11:51 AM Nadav Amit <nadav.amit@gmail.com> wrote:
>
> If you already accept PTI, [..]

I really don't think anybody should ever "accept PTI".

It's an absolutely enormous cost, and it should be seen as a
last-choice thing. Only really meant for completely broken hardware
(ie meltdown), and for people who have some very serious issues and
think it's "reasonable" to have the TLB isolation.

No real normal sane setup should ever "accept PTI", and it shouldn't
be used as an argument.

              Linus

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 16/23] sched: Use lightweight hazard pointers to grab lazy mms
  2022-01-09 19:52           ` Andy Lutomirski
@ 2022-01-09 20:00             ` Linus Torvalds
  2022-01-09 20:34             ` Nadav Amit
  1 sibling, 0 replies; 79+ messages in thread
From: Linus Torvalds @ 2022-01-09 20:00 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Nadav Amit, Andrew Morton, Linux-MM, Nicholas Piggin,
	Anton Blanchard, Benjamin Herrenschmidt, Paul Mackerras,
	Randy Dunlap, linux-arch, the arch/x86 maintainers, Rik van Riel,
	Dave Hansen, Peter Zijlstra (Intel),
	Mathieu Desnoyers

On Sun, Jan 9, 2022 at 11:53 AM Andy Lutomirski <luto@kernel.org> wrote:
>
> My original PCID series actually did remove lazy TLB on x86. I don't
> remember why, but people objected. The issue isn't the limited PCID
> space -- IIRC it's just that MOV CR3 is slooooow. If we get rid of
> lazy TLB on x86, then we are writing CR3 twice on even a very short
> idle. That adds maybe 1k cycles, which isn't great.

Yeah, my gut feel is that lazy-TLB almost certainly makes sense on x86.

And the grab/mmput overhead and associated cacheline ping-pong is (I
think) something we could just get rid of on x86 due to the IPI model.
There are probably other costs to lazy TLB, and I can imagine that
there are other maintenance costs, but yes, cr3 moves have always been
expensive on x86 even aside from the actual TLB switch.

But I could easily imagine the situation being different on arm64, for example.

But numbers beat "gut feel" and "easily imagine" every time. So it
would be kind of nice to have that ...

          Linus

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 16/23] sched: Use lightweight hazard pointers to grab lazy mms
  2022-01-09  4:38           ` Linus Torvalds
@ 2022-01-09 20:19             ` Andy Lutomirski
  2022-01-09 20:48               ` Linus Torvalds
  0 siblings, 1 reply; 79+ messages in thread
From: Andy Lutomirski @ 2022-01-09 20:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Will Deacon, Catalin Marinas, Andrew Morton, Linux-MM,
	Nicholas Piggin, Anton Blanchard, Benjamin Herrenschmidt,
	Paul Mackerras, Randy Dunlap, linux-arch,
	the arch/x86 maintainers, Rik van Riel, Dave Hansen,
	Peter Zijlstra (Intel),
	Nadav Amit, Mathieu Desnoyers



On Sat, Jan 8, 2022, at 8:38 PM, Linus Torvalds wrote:
> On Sat, Jan 8, 2022 at 7:59 PM Andy Lutomirski <luto@kernel.org> wrote:
>>
>> > Hmm. The x86 maintainers are on this thread, but they aren't even the
>> > problem. Adding Catalin and Will to this, I think they should know
>> > if/how this would fit with the arm64 ASID allocator.
>> >
>>
>> Well, I am an x86 mm maintainer, and there is definitely a performance problem on large x86 systems right now. :)
>
> Well, my point was that on x86, the complexities of the patch you
> posted are completely pointless.
>
> So on x86, you can just remove the mmgrab/mmdrop reference counts from
> the lazy mm use entirely, and voila, that performance problem is gone.
> We don't _need_ reference counting on x86 at all, if we just say that
> the rule is that a lazy mm is always associated with a
> honest-to-goodness live mm.
>
> So on x86 - and any platform with the IPI model - there is no need for
> hundreds of lines of complexity at all.
>
> THAT is my point. Your patch adds complexity that buys you ABSOLUTELY NOTHING.
>
> You then saying that the mmgrab/mmdrop is a performance problem is
> just trying to muddy the water. You can just remove it entirely.
>
> Now, I do agree that that depends on the whole "TLB IPI will get rid
> of any lazy mm users on other cpus". So I agree that if you have
> hardware TLB invalidation that then doesn't have that software
> component to it, you need something else.
>
> But my argument _then_ was that hardware TLB invalidation then needs
> the hardware ASID thing to be useful, and the ASID management code
> already effectively keeps track of "this ASID is used on other CPU's".
> And that's exactly the same kind of information that your patch
> basically added a separate percpu array for.
>

Are you *sure*?  The ASID management code on x86 is (as mentioned before) completely unaware of whether an ASID is actually in use anywhere.  The x86 ASID code is a per-cpu LRU -- it tracks whether an mm has been recently used on a cpu, not whether the mm exists.  If an mm hasn't been used recently, the ASID gets recycled.  If we had more bits, we wouldn't even recycle it.  An ASID can and does get recycled while the mm still exists.

> So I think that even for that hardware TLB shootdown case, your patch
> only adds overhead.

The overhead is literally:

exit_mmap();
for each cpu still in mm_cpumask:
  smp_load_acquire

That's it, unless the mm is actually in use, in which case there is also a cmpxchg.

>
> And it potentially adds a *LOT* of overhead, if you replace an atomic
> refcount with a "for_each_possible_cpu()" loop that has to do cmpxchg
> things too.

The cmpxchg is only in the case in which the mm is actually in use on that CPU.  I'm having trouble imagining a workload in which the loop is even measurable unless the bit scan itself is somehow problematic.

On a very large arm64 system, I would believe there could be real overhead.  But these very large systems are exactly the systems that currently ping-pong mm_count.
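
Spelled out as a sketch (the per-CPU variable name here is made up; the
real code is in patch 16):

static DEFINE_PER_CPU(struct mm_struct *, lazy_mm_hazard);

static void mm_drop_lazy_users(struct mm_struct *mm)
{
	int cpu;

	for_each_possible_lazymm_cpu(cpu, mm) {
		struct mm_struct **hazard = per_cpu_ptr(&lazy_mm_hazard, cpu);

		/* Pairs with the store made when that CPU went lazy. */
		if (smp_load_acquire(hazard) != mm)
			continue;

		/*
		 * The CPU may still be lazily using mm: repoint its hazard
		 * pointer at init_mm so mm can be freed.  If the cmpxchg
		 * fails, the CPU already moved on by itself.
		 */
		cmpxchg(hazard, mm, &init_mm);
	}
}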

>
> Btw, you don't even need to really solve the arm64 TLB invalidate
> thing - we could make the rule be that we only do the mmgrab/mmput at
> all on platforms that don't do that IPI flush.
>
> I think that's basically exactly what Nick Piggin wanted to do on powerpc, no?

More or less, but...

>
> But you hated that patch, for non-obvious reasons, and are now
> introducing this new patch that is clearly non-optimal on x86.

I hated that patch because it's messy and it leaves the core lazy handling in an IMO quite regrettable state, not because I'm particularly opposed to shooting down lazies on platforms where it makes sense (powerpc and mostly x86).

As just the most obvious issue, note the kasan_check_byte() in this patch that verifies that ->active_mm doesn't point to freed memory when the scheduler is entered.  If we flipped shoot-lazies on on x86, then KASAN would blow up with that.

For perspective, this whole series is 23 patches.  Exactly two of them are directly related to my hazard pointer scheme: patches 16 and 22.  The rest of them are, in my opinion, cleanups and some straight-up bugfixes that are worthwhile no matter what we do with lazy mm handling per se.

>
> So I think there's some intellectual dishonesty on your part here.

I never said I hated shoot-lazies.  I didn't like the *code*.  I thought I could do better, and I still think my hazard pointer scheme is nifty and, aside from some complexity, quite nice.  It even reduces to shoot-lazies if for_each_possible_lazymm_cpu() is defined to do nothing.  But mainly I wanted the code to be better, so I went and did it.  I could respin this series without the hazard pointers quite easily.

--Andy

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 16/23] sched: Use lightweight hazard pointers to grab lazy mms
  2022-01-09 19:52           ` Andy Lutomirski
  2022-01-09 20:00             ` Linus Torvalds
@ 2022-01-09 20:34             ` Nadav Amit
  2022-01-09 20:48               ` Andy Lutomirski
  1 sibling, 1 reply; 79+ messages in thread
From: Nadav Amit @ 2022-01-09 20:34 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linus Torvalds, Andrew Morton, Linux-MM, Nicholas Piggin,
	Anton Blanchard, Benjamin Herrenschmidt, Paul Mackerras,
	Randy Dunlap, linux-arch, the arch/x86 maintainers, Rik van Riel,
	Dave Hansen, Peter Zijlstra (Intel),
	Mathieu Desnoyers



> On Jan 9, 2022, at 11:52 AM, Andy Lutomirski <luto@kernel.org> wrote:
> 
> On Sun, Jan 9, 2022, at 11:10 AM, Linus Torvalds wrote:
>> On Sun, Jan 9, 2022 at 12:49 AM Nadav Amit <nadav.amit@gmail.com> wrote:
>>> 
>>> I do not know whether it is a pure win, but there is a tradeoff.
>> 
>> Hmm. I guess only some serious testing would tell.
>> 
>> On x86, I'd be a bit worried about removing lazy TLB simply because
>> even with ASID support there (called PCIDs by Intel for NIH reasons),
>> the actual ASID space on x86 was at least originally very very
>> limited.
>> 
>> Architecturally, x86 may expose 12 bits of ASID space, but iirc at
>> least the first few implementations actually only internally had one
>> or two bits, and hashed the 12 bits down to that internal very limited
>> hardware TLB ID space.
>> 
>> We only use a handful of ASIDs per CPU on x86 partly for this reason
>> (but also since there's no remote hardware TLB shootdown, there's no
>> reason to have a bigger global ASID space, so ASIDs aren't _that_
>> common).
>> 
>> And I don't know how many non-PCID x86 systems (perhaps virtualized?)
>> there might be out there.
>> 
>> But it would be very interesting to test some "disable lazy tlb"
>> patch. The main problem workloads tend to be IO, and I'm not sure how
>> many of the automated performance tests would catch issues. I guess
>> some threaded pipe ping-pong test (with each thread pinned to
>> different cores) would show it.
> 
> My original PCID series actually did remove lazy TLB on x86.  I don't remember why, but people objected.  The issue isn't the limited PCID space -- IIRC it's just that MOV CR3 is slooooow.  If we get rid of lazy TLB on x86, then we are writing CR3 twice on even a very short idle.  That adds maybe 1k cycles, which isn't great.

Just for the record: I just ran a short test when CPUs are on max freq
on my skylake. MOV-CR3 without flush is 250-300 cycles. One can argue
that you mostly only care for one of the switches for the idle thread
(once you wake up). And waking up by itself has its overheads.
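
Roughly what the measurement looks like, in case anyone wants to redo it.
This is a from-memory sketch, not the exact code I ran, and it assumes
PCIDs are enabled so the no-flush bit is legal to set:

static u64 time_one_cr3_write(void)
{
        unsigned long cr3 = __read_cr3();
        unsigned long flags;
        u64 t0, t1;

        local_irq_save(flags);
        t0 = rdtsc_ordered();
        /* Write the current CR3 back with the no-flush bit set. */
        native_write_cr3(cr3 | X86_CR3_PCID_NOFLUSH);
        t1 = rdtsc_ordered();
        local_irq_restore(flags);

        return t1 - t0;
}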

But you are the master of micro optimizations, and as Rik said, I
mostly think of TLB shootdowns and might miss the big picture. Just
trying to make your life easier by less coding and my life simpler
in understanding your super-smart code. ;-)


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 16/23] sched: Use lightweight hazard pointers to grab lazy mms
  2022-01-09 20:34             ` Nadav Amit
@ 2022-01-09 20:48               ` Andy Lutomirski
  0 siblings, 0 replies; 79+ messages in thread
From: Andy Lutomirski @ 2022-01-09 20:48 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Linus Torvalds, Andrew Morton, Linux-MM, Nicholas Piggin,
	Anton Blanchard, Benjamin Herrenschmidt, Paul Mackerras,
	Randy Dunlap, linux-arch, the arch/x86 maintainers, Rik van Riel,
	Dave Hansen, Peter Zijlstra (Intel),
	Mathieu Desnoyers



On Sun, Jan 9, 2022, at 1:34 PM, Nadav Amit wrote:
>> On Jan 9, 2022, at 11:52 AM, Andy Lutomirski <luto@kernel.org> wrote:
>> 
>> On Sun, Jan 9, 2022, at 11:10 AM, Linus Torvalds wrote:
>>> On Sun, Jan 9, 2022 at 12:49 AM Nadav Amit <nadav.amit@gmail.com> wrote:
>>>> 
>>>> I do not know whether it is a pure win, but there is a tradeoff.
>>> 
>>> Hmm. I guess only some serious testing would tell.
>>> 
>>> On x86, I'd be a bit worried about removing lazy TLB simply because
>>> even with ASID support there (called PCIDs by Intel for NIH reasons),
>>> the actual ASID space on x86 was at least originally very very
>>> limited.
>>> 
>>> Architecturally, x86 may expose 12 bits of ASID space, but iirc at
>>> least the first few implementations actually only internally had one
>>> or two bits, and hashed the 12 bits down to that internal very limited
>>> hardware TLB ID space.
>>> 
>>> We only use a handful of ASIDs per CPU on x86 partly for this reason
>>> (but also since there's no remote hardware TLB shootdown, there's no
>>> reason to have a bigger global ASID space, so ASIDs aren't _that_
>>> common).
>>> 
>>> And I don't know how many non-PCID x86 systems (perhaps virtualized?)
>>> there might be out there.
>>> 
>>> But it would be very interesting to test some "disable lazy tlb"
>>> patch. The main problem workloads tend to be IO, and I'm not sure how
>>> many of the automated performance tests would catch issues. I guess
>>> some threaded pipe ping-pong test (with each thread pinned to
>>> different cores) would show it.
>> 
>> My original PCID series actually did remove lazy TLB on x86.  I don't remember why, but people objected.  The issue isn't the limited PCID space -- IIRC it's just that MOV CR3 is slooooow.  If we get rid of lazy TLB on x86, then we are writing CR3 twice on even a very short idle.  That adds maybe 1k cycles, which isn't great.
>
> Just for the record: I just ran a short test when CPUs are on max freq
> on my skylake. MOV-CR3 without flush is 250-300 cycles. One can argue
> that you mostly only care for one of the switches for the idle thread
> (once you wake up). And waking up by itself has its overheads.
>
> But you are the master of micro optimizations, and as Rik said, I
> mostly think of TLB shootdowns and might miss the big picture. Just
> trying to make your life easier by less coding and my life simpler
> in understanding your super-smart code. ;-)

As Rik pointed out, the mm_cpumask manipulation is also expensive if we get rid of lazy. Let me ponder how to do this nicely.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 16/23] sched: Use lightweight hazard pointers to grab lazy mms
  2022-01-09 20:19             ` Andy Lutomirski
@ 2022-01-09 20:48               ` Linus Torvalds
  2022-01-09 21:51                 ` Linus Torvalds
  2022-01-11 10:39                 ` Will Deacon
  0 siblings, 2 replies; 79+ messages in thread
From: Linus Torvalds @ 2022-01-09 20:48 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Will Deacon, Catalin Marinas, Andrew Morton, Linux-MM,
	Nicholas Piggin, Anton Blanchard, Benjamin Herrenschmidt,
	Paul Mackerras, Randy Dunlap, linux-arch,
	the arch/x86 maintainers, Rik van Riel, Dave Hansen,
	Peter Zijlstra (Intel),
	Nadav Amit, Mathieu Desnoyers

On Sun, Jan 9, 2022 at 12:20 PM Andy Lutomirski <luto@kernel.org> wrote:
>
> Are you *sure*? The ASID management code on x86 is (as mentioned
> before) completely unaware of whether an ASID is actually in use
> anywhere.

Right.

But the ASID situation on x86 is very very different, exactly because
x86 doesn't have cross-CPU TLB invalidates.

Put another way: x86 TLB hardware is fundamentally per-cpu. As such,
any ASID management is also per-cpu.

That's fundamentally not true on arm64.  And that's not some "arm64
implementation detail". That's fundamental to doing cross-CPU TLB
invalidates in hardware.

If your TLB invalidates act across CPU's, then the state they act on
are also obviously across CPU's.

So the ASID situation is fundamentally different depending on the
hardware usage. On x86, TLB's are per-core, and on arm64 they are not,
and that's reflected in our code too.

As a result, on x86, each mm has a per-cpu ASID, and there's a small
array of per-cpu "mm->asid" mappings.
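
(For reference, that per-cpu mapping really is just a tiny array; this
is paraphrased from memory from arch/x86/include/asm/tlbflush.h, not a
verbatim copy:)

struct tlb_context {
        u64 ctx_id;     /* which mm this ASID slot was last used for */
        u64 tlb_gen;    /* how up to date that slot's TLB contents are */
};

struct tlb_state {
        struct mm_struct *loaded_mm;
        u16 loaded_mm_asid;
        u16 next_asid;
        /* ... */
        struct tlb_context ctxs[TLB_NR_DYN_ASIDS];  /* TLB_NR_DYN_ASIDS == 6 */
};
DECLARE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate);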

On arm, each mm has an asid, and it's allocated from a global asid
space - so there is no need for that "mm->asid" mapping, because the
asid is there in the mm, and it's shared across cpus.

That said, I still don't actually know the arm64 ASID management code.

The thing about TLB flushes is that it's ok to do them spuriously (as
long as you don't do _too_ many of them and tank performance), so two
different mm's can have the same hw ASID on two different cores and
that just makes cross-CPU TLB invalidates too aggressive. You can't
share an ASID on the _same_ core without flushing in between context
switches, because then the TLB on that core might be re-used for a
different mm. So the flushing rules aren't necessarily 100% 1:1 with
the "in use" rules, and who knows whether the arm64 ASID management
actually ends up matching that whole "this lazy TLB is still in use on
another CPU" tracking.

So I don't really know the arm64 situation. And it's possible that lazy
TLB isn't even worth it on arm64 in the first place.

> > So I think that even for that hardware TLB shootdown case, your patch
> > only adds overhead.
>
> The overhead is literally:
>
> exit_mmap();
> for each cpu still in mm_cpumask:
>   smp_load_acquire
>
> That's it, unless the mm is actually in use

Ok, now do this for a machine with 1024 CPU's.

And tell me it is "scalable".

> On a very large arm64 system, I would believe there could be real overhead.  But these very large systems are exactly the systems that currently ping-pong mm_count.

Right.

But I think your arguments against mm_count are questionable.

I'd much rather have a *much* smaller patch that says "on x86 and
powerpc, we don't need this overhead at all".

And then the arm64 people can look at it and say "Yeah, we'll still do
the mm_count thing", or maybe say "Yeah, we can solve it another way".

And maybe the arm64 people actually say "Yeah, this hazard pointer
thing is perfect for us". That still doesn't necessarily argue for it
on an architecture that ends up serializing with an IPI anyway.

                Linus

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 16/23] sched: Use lightweight hazard pointers to grab lazy mms
  2022-01-09 20:48               ` Linus Torvalds
@ 2022-01-09 21:51                 ` Linus Torvalds
  2022-01-10  0:52                   ` Andy Lutomirski
  2022-01-10  4:56                   ` Nicholas Piggin
  2022-01-11 10:39                 ` Will Deacon
  1 sibling, 2 replies; 79+ messages in thread
From: Linus Torvalds @ 2022-01-09 21:51 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Will Deacon, Catalin Marinas, Andrew Morton, Linux-MM,
	Nicholas Piggin, Anton Blanchard, Benjamin Herrenschmidt,
	Paul Mackerras, Randy Dunlap, linux-arch,
	the arch/x86 maintainers, Rik van Riel, Dave Hansen,
	Peter Zijlstra (Intel),
	Nadav Amit, Mathieu Desnoyers

[ Ugh, I actually went back and looked at Nick's patches again, to
just verify my memory, and they weren't as pretty as I thought they
were ]

On Sun, Jan 9, 2022 at 12:48 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I'd much rather have a *much* smaller patch that says "on x86 and
> powerpc, we don't need this overhead at all".

For some reason I thought Nick's patch worked at "last mmput" time and
the TLB flush IPIs that happen at that point anyway would then make
sure any lazy TLB is cleaned up.

But that's not actually what it does. It ties the
MMU_LAZY_TLB_REFCOUNT to an explicit TLB shootdown triggered by the
last mmdrop() instead. Because it really tied the whole logic to the
mm_count logic (and made lazy tlb to not do mm_count) rather than the
mm_users thing I mis-remembered it doing.

So at least some of my arguments were based on me just mis-remembering
what Nick's patch actually did (mainly because I mentally recreated
the patch from "Nick did something like this" and what I thought would
be the way to do it on x86).

So I guess I have to recant my arguments.

I still think my "get rid of lazy at last mmput" model should work,
and would be a perfect match for x86, but I can't really point to Nick
having done that.

So I was full of BS.

Hmm. I'd love to try to actually create a patch that does that "Nick
thing", but on last mmput() (ie when __mmput triggers). Because I
think this is interesting. But then I look at my schedule for the
upcoming week, and I go "I don't have a leg to stand on in this
discussion, and I'm just all hot air".

Because code talks, BS walks.

                Linus

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 16/23] sched: Use lightweight hazard pointers to grab lazy mms
  2022-01-09 21:51                 ` Linus Torvalds
@ 2022-01-10  0:52                   ` Andy Lutomirski
  2022-01-10  2:36                     ` Rik van Riel
  2022-01-10  4:56                   ` Nicholas Piggin
  1 sibling, 1 reply; 79+ messages in thread
From: Andy Lutomirski @ 2022-01-10  0:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Will Deacon, Catalin Marinas, Andrew Morton, Linux-MM,
	Nicholas Piggin, Anton Blanchard, Benjamin Herrenschmidt,
	Paul Mackerras, Randy Dunlap, linux-arch,
	the arch/x86 maintainers, Rik van Riel, Dave Hansen,
	Peter Zijlstra (Intel),
	Nadav Amit, Mathieu Desnoyers

On Sun, Jan 9, 2022, at 1:51 PM, Linus Torvalds wrote:
> [ Ugh, I actually went back and looked at Nick's patches again, to
> just verify my memory, and they weren't as pretty as I thought they
> were ]
>
> On Sun, Jan 9, 2022 at 12:48 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> I'd much rather have a *much* smaller patch that says "on x86 and
>> powerpc, we don't need this overhead at all".

I can whip this up.  It won’t be much smaller — we still need the entire remainder of the series as prep, and we’ll end up with an if (arch needs mmcount) somewhere, but it’s straightforward.

But I reserve the right to spam you with an even bigger patch, because mm_cpumask also pingpongs, and I have some ideas for helping that out too.

Also:

>> exit_mmap();
>> for each cpu still in mm_cpumask:
>>   smp_load_acquire
>>
>> That's it, unless the mm is actually in use
>
> Ok, now do this for a machine with 1024 CPU's.
>
> And tell me it is "scalable".
>

Do you mean a machine with 1024 CPUs and 2 bits set in mm_cpumask or 1024 CPUs with 800 bits set in mm_cpumask?  In the former case, this is fine.  In the latter case, *on x86*, sure it does 800 loads, but we're about to do 800 CR3 writes to tear the whole mess down, so the 800 loads should be in the noise.  (And this series won't actually do this anyway on bare metal, since exit_mmap does the shootdown.)

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 16/23] sched: Use lightweight hazard pointers to grab lazy mms
  2022-01-10  0:52                   ` Andy Lutomirski
@ 2022-01-10  2:36                     ` Rik van Riel
  2022-01-10  3:51                       ` Linus Torvalds
  0 siblings, 1 reply; 79+ messages in thread
From: Rik van Riel @ 2022-01-10  2:36 UTC (permalink / raw)
  To: Andy Lutomirski, Linus Torvalds
  Cc: Will Deacon, Catalin Marinas, Andrew Morton, Linux-MM,
	Nicholas Piggin, Anton Blanchard, Benjamin Herrenschmidt,
	Paul Mackerras, Randy Dunlap, linux-arch,
	the arch/x86 maintainers, Dave Hansen, Peter Zijlstra (Intel),
	Nadav Amit, Mathieu Desnoyers

[-- Attachment #1: Type: text/plain, Size: 941 bytes --]

On Sun, 2022-01-09 at 17:52 -0700, Andy Lutomirski wrote:
> On Sun, Jan 9, 2022, at 1:51 PM, Linus Torvalds wrote:
> 
> 
> 
> > > exit_mmap();
> > > for each cpu still in mm_cpumask:
> > >   smp_load_acquire
> > > 
> > > That's it, unless the mm is actually in use
> > 
> > Ok, now do this for a machine with 1024 CPU's.
> > 
> > And tell me it is "scalable".
> > 
> 
> Do you mean a machine with 1024 CPUs and 2 bits set in mm_cpumask or
> 1024 CPU with 800 bits set in mm_cpumask?  In the former case, this
> is fine.  In the latter case, *on x86*, sure it does 800 loads, but
> we're about to do 800 CR3 writes to tear the whole mess down, so the
> 800 loads should be in the noise.  (And this series won't actually do
> this anyway on bare metal, since exit_mmap does the shootdown.)

Also, while 800 loads is kinda expensive, it is a heck of
a lot less expensive than 800 IPIs.

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 16/23] sched: Use lightweight hazard pointers to grab lazy mms
  2022-01-10  2:36                     ` Rik van Riel
@ 2022-01-10  3:51                       ` Linus Torvalds
  0 siblings, 0 replies; 79+ messages in thread
From: Linus Torvalds @ 2022-01-10  3:51 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andy Lutomirski, Will Deacon, Catalin Marinas, Andrew Morton,
	Linux-MM, Nicholas Piggin, Anton Blanchard,
	Benjamin Herrenschmidt, Paul Mackerras, Randy Dunlap, linux-arch,
	the arch/x86 maintainers, Dave Hansen, Peter Zijlstra (Intel),
	Nadav Amit, Mathieu Desnoyers

On Sun, Jan 9, 2022 at 6:40 PM Rik van Riel <riel@surriel.com> wrote:
>
> Also, while 800 loads is kinda expensive, it is a heck of
> a lot less expensive than 800 IPIs.

Rik, the IPI's you have to do *anyway*. So there are exactly zero extra IPI's.

Go take a look. It's part of the whole "flush TLB's" thing in __mmput().

So let me explain one more time what I think we should have done, at
least on x86:

 (1) stop refcounting active_mm entries entirely on x86

Why can we do that? Because instead of worrying about doing those
mm_count games for the active_mm reference, we realize that any
active_mm has to have a _regular_ mm associated with it, and it has a
'mm_users' count.

And when that mm_users count goes to zero, we have:

 (2) mmput -> __mmput -> exit_mmap(), which already has to flush all
TLB's because it's tearing down the page tables

And since it has to flush those TLB's as part of tearing down the page
tables, we on x86 then have:

 (3) that TLB flush will have to do the IPI's to anybody who has that
mm active already

and that IPI has to be done *regardless*. And the TLB flushing done by
that IPI? That code already clears the lazy status (and not doing so
would be pointless and in fact wrong).
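
(That's what the x86 flush IPI already does today; paraphrasing
flush_tlb_func() in arch/x86/mm/tlb.c from memory, not verbatim:)

        if (this_cpu_read(cpu_tlbstate_shared.is_lazy)) {
                /*
                 * This CPU is only lazily using the mm being flushed.
                 * Instead of flushing, just switch to init_mm, which
                 * also takes this CPU out of lazy mode for that mm.
                 */
                switch_mm_irqs_off(NULL, &init_mm, NULL);
                return;
        }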

Notice? There isn't some "800 loads". There isn't some "800 IPI's".
And there isn't any refcounting cost of the lazy TLB.

Well, right now there *is* that refcounting cost, but my point is that
I don't think it should exist.

It shouldn't exist as an atomic access to mm_count (with those cache
ping-pongs when you have a lot of threads across a lot of CPUs), but
it *also* shouldn't exist as a "lightweight hazard pointer".

See my point? I think the lazy-tlb refcounting we do is pointless if
you have to do IPI's for TLB flushes.

Note: the above is for x86, which has to do the IPI's anyway (and it's
very possible that if you don't have to do IPI's because you have HW
TLB coherency, maybe lazy TLB's aren't what you should be using, but I
think that should be a separate discussion).

And yes, right now we do that pointless reference counting, because it
was simple and straightforward, and people historically didn't see it
as a problem.

Plus we had to have that whole secondary 'mm_count' anyway for other
reasons, since we use it for things that need to keep a ref to 'struct
mm_struct' around regardless of page table counts (eg things like a
lot of /proc references to 'struct mm_struct' do not want to keep
forced references to user page tables alive).

But I think conceptually mm_count (ie mmgrab/mmdrop) was always really
dodgy for lazy TLB. Lazy TLB really cares about the page tables still
being there, and that's not what mm_count is ostensibly about. That's
really what mm_users is about.

Yet mmgrab/mmdrop is exactly what the lazy TLB code uses, even if it's
technically odd (ie mmgrab really only keeps the 'struct mm' around,
but not about the vma's and page tables).
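
(For the record, the two counts and their helpers, simplified from
include/linux/sched/mm.h and kernel/fork.c rather than quoted verbatim:)

static inline void mmgrab(struct mm_struct *mm)
{
        atomic_inc(&mm->mm_count);      /* keeps the struct mm_struct itself alive */
}

static inline void mmdrop(struct mm_struct *mm)
{
        if (unlikely(atomic_dec_and_test(&mm->mm_count)))
                __mmdrop(mm);           /* frees pgd, destroys context, frees the struct */
}

void mmput(struct mm_struct *mm)
{
        if (atomic_dec_and_test(&mm->mm_users))
                __mmput(mm);            /* exit_mmap(): tears down vmas and page tables */
}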

Side note: you can see the effects of this mis-use of mmgrab/mmdrop in
 how we tear down _almost_ all the page table state in __mmput(). But
look at what we keep until the final __mmdrop, even though there are
no page tables left:

        mm_free_pgd(mm);
        destroy_context(mm);

exactly because even though we've torn down all the page tables
earlier, we had to keep the page table *root* around for the lazy
case.

It's kind of a layering violation, but it comes from that lazy-tlb
mm_count use, and so we have that odd situation where the page table
directory lifetime is very different from the rest of the page table
lifetimes.

(You can easily make excuses for it by just saying that "mm_users" is
the user-space page table user count, and that the page directory has
a different lifetime because it's also about the kernel page tables,
so it's all a bit of a gray area, but I do think it's also a bit of a
sign of how our refcounting for lazy-tlb is a bit dodgy).

                Linus

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 16/23] sched: Use lightweight hazard pointers to grab lazy mms
  2022-01-09 21:51                 ` Linus Torvalds
  2022-01-10  0:52                   ` Andy Lutomirski
@ 2022-01-10  4:56                   ` Nicholas Piggin
  2022-01-10  5:17                     ` Nicholas Piggin
  2022-01-10 20:52                     ` Andy Lutomirski
  1 sibling, 2 replies; 79+ messages in thread
From: Nicholas Piggin @ 2022-01-10  4:56 UTC (permalink / raw)
  To: Andy Lutomirski, Linus Torvalds
  Cc: Andrew Morton, Anton Blanchard, Benjamin Herrenschmidt,
	Catalin Marinas, Dave Hansen, linux-arch, Linux-MM,
	Mathieu Desnoyers, Nadav Amit, Paul Mackerras,
	Peter Zijlstra (Intel),
	Randy Dunlap, Rik van Riel, Will Deacon,
	the arch/x86 maintainers

Excerpts from Linus Torvalds's message of January 10, 2022 7:51 am:
> [ Ugh, I actually went back and looked at Nick's patches again, to
> just verify my memory, and they weren't as pretty as I thought they
> were ]
> 
> On Sun, Jan 9, 2022 at 12:48 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> I'd much rather have a *much* smaller patch that says "on x86 and
>> powerpc, we don't need this overhead at all".
> 
> For some reason I thought Nick's patch worked at "last mmput" time and
> the TLB flush IPIs that happen at that point anyway would then make
> sure any lazy TLB is cleaned up.
> 
> But that's not actually what it does. It ties the
> MMU_LAZY_TLB_REFCOUNT to an explicit TLB shootdown triggered by the
> last mmdrop() instead. Because it really tied the whole logic to the
> mm_count logic (and made lazy tlb to not do mm_count) rather than the
> mm_users thing I mis-remembered it doing.

It does this because on powerpc with hash MMU, we can't use IPIs for
TLB shootdowns.

> So at least some of my arguments were based on me just mis-remembering
> what Nick's patch actually did (mainly because I mentally recreated
> the patch from "Nick did something like this" and what I thought would
> be the way to do it on x86).

With powerpc with the radix MMU using IPI based shootdowns, we can 
actually do the switch-away-from-lazy on the final TLB flush and the
final broadcast shootdown thing becomes a no-op. I didn't post that
additional patch because it's powerpc-specific and I didn't want to
post more code so widely.

> So I guess I have to recant my arguments.
> 
> I still think my "get rid of lazy at last mmput" model should work,
> and would be a perfect match for x86, but I can't really point to Nick
> having done that.
> 
> So I was full of BS.
> 
> Hmm. I'd love to try to actually create a patch that does that "Nick
> thing", but on last mmput() (ie when __mmput triggers). Because I
> think this is interesting. But then I look at my schedule for the
> upcoming week, and I go "I don't have a leg to stand on in this
> discussion, and I'm just all hot air".

I agree Andy's approach is very complicated and adds more overhead than
necessary for powerpc, which is why I don't want to use it. I'm still
not entirely sure what the big problem would be in converting x86 to use
it; I admit I haven't kept up with the exact details of its lazy tlb
mm handling recently, though.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 16/23] sched: Use lightweight hazard pointers to grab lazy mms
  2022-01-10  4:56                   ` Nicholas Piggin
@ 2022-01-10  5:17                     ` Nicholas Piggin
  2022-01-10 17:19                       ` Linus Torvalds
  2022-01-10 20:52                     ` Andy Lutomirski
  1 sibling, 1 reply; 79+ messages in thread
From: Nicholas Piggin @ 2022-01-10  5:17 UTC (permalink / raw)
  To: Andy Lutomirski, Linus Torvalds
  Cc: Andrew Morton, Anton Blanchard, Benjamin Herrenschmidt,
	Catalin Marinas, Dave Hansen, linux-arch, Linux-MM,
	Mathieu Desnoyers, Nadav Amit, Paul Mackerras,
	Peter Zijlstra (Intel),
	Randy Dunlap, Rik van Riel, Will Deacon,
	the arch/x86 maintainers

Excerpts from Nicholas Piggin's message of January 10, 2022 2:56 pm:
> Excerpts from Linus Torvalds's message of January 10, 2022 7:51 am:
>> [ Ugh, I actually went back and looked at Nick's patches again, to
>> just verify my memory, and they weren't as pretty as I thought they
>> were ]
>> 
>> On Sun, Jan 9, 2022 at 12:48 PM Linus Torvalds
>> <torvalds@linux-foundation.org> wrote:
>>>
>>> I'd much rather have a *much* smaller patch that says "on x86 and
>>> powerpc, we don't need this overhead at all".
>> 
>> For some reason I thought Nick's patch worked at "last mmput" time and
>> the TLB flush IPIs that happen at that point anyway would then make
>> sure any lazy TLB is cleaned up.
>> 
>> But that's not actually what it does. It ties the
>> MMU_LAZY_TLB_REFCOUNT to an explicit TLB shootdown triggered by the
>> last mmdrop() instead. Because it really tied the whole logic to the
>> mm_count logic (and made lazy tlb to not do mm_count) rather than the
>> mm_users thing I mis-remembered it doing.
> 
> It does this because on powerpc with hash MMU, we can't use IPIs for
> TLB shootdowns.
> 
>> So at least some of my arguments were based on me just mis-remembering
>> what Nick's patch actually did (mainly because I mentally recreated
>> the patch from "Nick did something like this" and what I thought would
>> be the way to do it on x86).
> 
> With powerpc with the radix MMU using IPI based shootdowns, we can 
> actually do the switch-away-from-lazy on the final TLB flush and the
> final broadcast shootdown thing becomes a no-op. I didn't post that
> additional patch because it's powerpc-specific and I didn't want to
> post more code so widely.

This is the patch that goes on top of the series I posted. It's not
very clean at the moment; it was just a proof of concept. I wrote it
a year ago by the looks of it, so no guarantees. But it exits all other
lazies as part of the final exit_mmap TLB flush so there should not be
additional IPIs at drop-time. Possibly you could get preempted and
moved CPUs between them but the point is the vast majority of the time
you won't require more IPIs.

Well, with powerpc it's not _quite_ that simple: it is possible we
could use the broadcast TLBIE instruction rather than IPIs for this. In
practice I think that's not _so much_ faster that the IPIs are a
problem, and on highly threaded jobs where you might have hundreds of
other CPUs in the mask, you'd rather avoid the cacheline bouncing in
context switch anyway.

Thanks,
Nick

From 1f7fd5de284fab6b94bf49f55ce691ae22538473 Mon Sep 17 00:00:00 2001
From: Nicholas Piggin <npiggin@gmail.com>
Date: Tue, 23 Feb 2021 14:11:21 +1000
Subject: [PATCH] powerpc/64s/radix: TLB flush optimization for lazy mm
 shootdown refcounting

XXX: could also clear lazy at exit, perhaps?
XXX: doesn't really matter AFAIKS because it will soon go away with mmput
XXX: must audit exit flushes for nest MMU
---
 arch/powerpc/mm/book3s64/radix_tlb.c | 45 ++++++++++++++--------------
 kernel/fork.c                        | 16 ++++++++--
 2 files changed, 35 insertions(+), 26 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
index 59156c2d2ebe..b64cd0d22b8b 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -723,7 +723,7 @@ void radix__local_flush_all_mm(struct mm_struct *mm)
 }
 EXPORT_SYMBOL(radix__local_flush_all_mm);
 
-static void __flush_all_mm(struct mm_struct *mm, bool fullmm)
+static void radix__flush_all_mm(struct mm_struct *mm, bool fullmm)
 {
 	radix__local_flush_all_mm(mm);
 }
@@ -862,7 +862,7 @@ enum tlb_flush_type {
 	FLUSH_TYPE_GLOBAL,
 };
 
-static enum tlb_flush_type flush_type_needed(struct mm_struct *mm, bool fullmm)
+static enum tlb_flush_type flush_type_needed(struct mm_struct *mm)
 {
 	int active_cpus = atomic_read(&mm->context.active_cpus);
 	int cpu = smp_processor_id();
@@ -888,14 +888,6 @@ static enum tlb_flush_type flush_type_needed(struct mm_struct *mm, bool fullmm)
 	if (atomic_read(&mm->context.copros) > 0)
 		return FLUSH_TYPE_GLOBAL;
 
-	/*
-	 * In the fullmm case there's no point doing the exit_flush_lazy_tlbs
-	 * because the mm is being taken down anyway, and a TLBIE tends to
-	 * be faster than an IPI+TLBIEL.
-	 */
-	if (fullmm)
-		return FLUSH_TYPE_GLOBAL;
-
 	/*
 	 * If we are running the only thread of a single-threaded process,
 	 * then we should almost always be able to trim off the rest of the
@@ -947,7 +939,7 @@ void radix__flush_tlb_mm(struct mm_struct *mm)
 	 * switch_mm_irqs_off
 	 */
 	smp_mb();
-	type = flush_type_needed(mm, false);
+	type = flush_type_needed(mm);
 	if (type == FLUSH_TYPE_LOCAL) {
 		_tlbiel_pid(pid, RIC_FLUSH_TLB);
 	} else if (type == FLUSH_TYPE_GLOBAL) {
@@ -971,7 +963,7 @@ void radix__flush_tlb_mm(struct mm_struct *mm)
 }
 EXPORT_SYMBOL(radix__flush_tlb_mm);
 
-static void __flush_all_mm(struct mm_struct *mm, bool fullmm)
+void radix__flush_all_mm(struct mm_struct *mm)
 {
 	unsigned long pid;
 	enum tlb_flush_type type;
@@ -982,7 +974,7 @@ static void __flush_all_mm(struct mm_struct *mm, bool fullmm)
 
 	preempt_disable();
 	smp_mb(); /* see radix__flush_tlb_mm */
-	type = flush_type_needed(mm, fullmm);
+	type = flush_type_needed(mm);
 	if (type == FLUSH_TYPE_LOCAL) {
 		_tlbiel_pid(pid, RIC_FLUSH_ALL);
 	} else if (type == FLUSH_TYPE_GLOBAL) {
@@ -1002,11 +994,6 @@ static void __flush_all_mm(struct mm_struct *mm, bool fullmm)
 	}
 	preempt_enable();
 }
-
-void radix__flush_all_mm(struct mm_struct *mm)
-{
-	__flush_all_mm(mm, false);
-}
 EXPORT_SYMBOL(radix__flush_all_mm);
 
 void radix__flush_tlb_page_psize(struct mm_struct *mm, unsigned long vmaddr,
@@ -1021,7 +1008,7 @@ void radix__flush_tlb_page_psize(struct mm_struct *mm, unsigned long vmaddr,
 
 	preempt_disable();
 	smp_mb(); /* see radix__flush_tlb_mm */
-	type = flush_type_needed(mm, false);
+	type = flush_type_needed(mm);
 	if (type == FLUSH_TYPE_LOCAL) {
 		_tlbiel_va(vmaddr, pid, psize, RIC_FLUSH_TLB);
 	} else if (type == FLUSH_TYPE_GLOBAL) {
@@ -1127,7 +1114,7 @@ static inline void __radix__flush_tlb_range(struct mm_struct *mm,
 
 	preempt_disable();
 	smp_mb(); /* see radix__flush_tlb_mm */
-	type = flush_type_needed(mm, fullmm);
+	type = flush_type_needed(mm);
 	if (type == FLUSH_TYPE_NONE)
 		goto out;
 
@@ -1295,7 +1282,18 @@ void radix__tlb_flush(struct mmu_gather *tlb)
 	 * See the comment for radix in arch_exit_mmap().
 	 */
 	if (tlb->fullmm || tlb->need_flush_all) {
-		__flush_all_mm(mm, true);
+		/*
+		 * Shootdown based lazy tlb mm refcounting means we have to
+		 * IPI everyone in the mm_cpumask anyway soon when the mm goes
+		 * away, so might as well do it as part of the final flush now.
+		 *
+		 * If lazy shootdown was improved to reduce IPIs (e.g., by
+		 * batching), then it may end up being better to use tlbies
+		 * here instead.
+		 */
+		smp_mb(); /* see radix__flush_tlb_mm */
+		exit_flush_lazy_tlbs(mm);
+		_tlbiel_pid(mm->context.id, RIC_FLUSH_ALL);
 	} else if ( (psize = radix_get_mmu_psize(page_size)) == -1) {
 		if (!tlb->freed_tables)
 			radix__flush_tlb_mm(mm);
@@ -1326,10 +1324,11 @@ static void __radix__flush_tlb_range_psize(struct mm_struct *mm,
 		return;
 
 	fullmm = (end == TLB_FLUSH_ALL);
+	WARN_ON_ONCE(fullmm); /* XXX: this shouldn't get fullmm? */
 
 	preempt_disable();
 	smp_mb(); /* see radix__flush_tlb_mm */
-	type = flush_type_needed(mm, fullmm);
+	type = flush_type_needed(mm);
 	if (type == FLUSH_TYPE_NONE)
 		goto out;
 
@@ -1412,7 +1411,7 @@ void radix__flush_tlb_collapsed_pmd(struct mm_struct *mm, unsigned long addr)
 	/* Otherwise first do the PWC, then iterate the pages. */
 	preempt_disable();
 	smp_mb(); /* see radix__flush_tlb_mm */
-	type = flush_type_needed(mm, false);
+	type = flush_type_needed(mm);
 	if (type == FLUSH_TYPE_LOCAL) {
 		_tlbiel_va_range(addr, end, pid, PAGE_SIZE, mmu_virtual_psize, true);
 	} else if (type == FLUSH_TYPE_GLOBAL) {

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* Re: [PATCH 06/23] powerpc/membarrier: Remove special barrier on mm switch
  2022-01-08 16:43   ` Andy Lutomirski
@ 2022-01-10  8:42     ` Christophe Leroy
  0 siblings, 0 replies; 79+ messages in thread
From: Christophe Leroy @ 2022-01-10  8:42 UTC (permalink / raw)
  To: Andy Lutomirski, Andrew Morton, Linux-MM
  Cc: linux-arch, x86, Rik van Riel, Peter Zijlstra, Randy Dunlap,
	linuxppc-dev, Nicholas Piggin, Dave Hansen, Mathieu Desnoyers,
	Paul Mackerras, Nadav Amit



Le 08/01/2022 à 17:43, Andy Lutomirski a écrit :
> powerpc did the following on some, but not all, paths through
> switch_mm_irqs_off():
> 
>         /*
>          * Only need the full barrier when switching between processes.
>          * Barrier when switching from kernel to userspace is not
>          * required here, given that it is implied by mmdrop(). Barrier
>          * when switching from userspace to kernel is not needed after
>          * store to rq->curr.
>          */
>         if (likely(!(atomic_read(&next->membarrier_state) &
>                      (MEMBARRIER_STATE_PRIVATE_EXPEDITED |
>                       MEMBARRIER_STATE_GLOBAL_EXPEDITED)) || !prev))
>                 return;
> 
> This is puzzling: if !prev, then one might expect that we are switching
> from kernel to user, not user to kernel, which is inconsistent with the
> comment.  But this is all nonsense, because the one and only caller would
> never have prev == NULL and would, in fact, OOPS if prev == NULL.
> 
> In any event, this code is unnecessary, since the new generic
> membarrier_finish_switch_mm() provides the same barrier without arch help.

I can't find this function membarrier_finish_switch_mm(), neither in 
Linus tree, nor in linux-next tree.

> 
> arch/powerpc/include/asm/membarrier.h remains as an empty header,
> because a later patch in this series will add code to it.
> 
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Paul Mackerras <paulus@samba.org>
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>   arch/powerpc/include/asm/membarrier.h | 24 ------------------------
>   arch/powerpc/mm/mmu_context.c         |  1 -
>   2 files changed, 25 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/membarrier.h b/arch/powerpc/include/asm/membarrier.h
> index de7f79157918..b90766e95bd1 100644
> --- a/arch/powerpc/include/asm/membarrier.h
> +++ b/arch/powerpc/include/asm/membarrier.h
> @@ -1,28 +1,4 @@
>   #ifndef _ASM_POWERPC_MEMBARRIER_H
>   #define _ASM_POWERPC_MEMBARRIER_H
>   
> -static inline void membarrier_arch_switch_mm(struct mm_struct *prev,
> -					     struct mm_struct *next,
> -					     struct task_struct *tsk)
> -{
> -	/*
> -	 * Only need the full barrier when switching between processes.
> -	 * Barrier when switching from kernel to userspace is not
> -	 * required here, given that it is implied by mmdrop(). Barrier
> -	 * when switching from userspace to kernel is not needed after
> -	 * store to rq->curr.
> -	 */
> -	if (IS_ENABLED(CONFIG_SMP) &&
> -	    likely(!(atomic_read(&next->membarrier_state) &
> -		     (MEMBARRIER_STATE_PRIVATE_EXPEDITED |
> -		      MEMBARRIER_STATE_GLOBAL_EXPEDITED)) || !prev))
> -		return;
> -
> -	/*
> -	 * The membarrier system call requires a full memory barrier
> -	 * after storing to rq->curr, before going back to user-space.
> -	 */
> -	smp_mb();
> -}
> -
>   #endif /* _ASM_POWERPC_MEMBARRIER_H */
> diff --git a/arch/powerpc/mm/mmu_context.c b/arch/powerpc/mm/mmu_context.c
> index 74246536b832..5f2daa6b0497 100644
> --- a/arch/powerpc/mm/mmu_context.c
> +++ b/arch/powerpc/mm/mmu_context.c
> @@ -84,7 +84,6 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
>   		asm volatile ("dssall");
>   
>   	if (!new_on_cpu)
> -		membarrier_arch_switch_mm(prev, next, tsk);

Are you sure that's what you want?

It now means you have:

	if (!new_on_cpu)
	switch_mmu_context(prev, next, tsk);


>   
>   	/*
>   	 * The actual HW switching method differs between the various

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 19/23] x86/efi: Make efi_enter/leave_mm use the temporary_mm machinery
  2022-01-08 16:44 ` [PATCH 19/23] x86/efi: Make efi_enter/leave_mm use the temporary_mm machinery Andy Lutomirski
@ 2022-01-10 13:13   ` Ard Biesheuvel
  0 siblings, 0 replies; 79+ messages in thread
From: Ard Biesheuvel @ 2022-01-10 13:13 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andrew Morton, Linux-MM, Nicholas Piggin, Anton Blanchard,
	Benjamin Herrenschmidt, Paul Mackerras, Randy Dunlap, linux-arch,
	x86, Rik van Riel, Dave Hansen, Peter Zijlstra, Nadav Amit,
	Mathieu Desnoyers

On Sat, 8 Jan 2022 at 17:44, Andy Lutomirski <luto@kernel.org> wrote:
>
> This should be considerably more robust.  It's also necessary for optimized
> for_each_possible_lazymm_cpu() on x86 -- without this patch, EFI calls in
> lazy context would remove the lazy mm from mm_cpumask().
>
> Cc: Ard Biesheuvel <ardb@kernel.org>
> Signed-off-by: Andy Lutomirski <luto@kernel.org>

Acked-by: Ard Biesheuvel <ardb@kernel.org>

> ---
>  arch/x86/platform/efi/efi_64.c | 9 +++------
>  1 file changed, 3 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
> index 7515e78ef898..b9a571904363 100644
> --- a/arch/x86/platform/efi/efi_64.c
> +++ b/arch/x86/platform/efi/efi_64.c
> @@ -54,7 +54,7 @@
>   * 0xffff_ffff_0000_0000 and limit EFI VA mapping space to 64G.
>   */
>  static u64 efi_va = EFI_VA_START;
> -static struct mm_struct *efi_prev_mm;
> +static temp_mm_state_t efi_temp_mm_state;
>
>  /*
>   * We need our own copy of the higher levels of the page tables
> @@ -461,15 +461,12 @@ void __init efi_dump_pagetable(void)
>   */
>  void efi_enter_mm(void)
>  {
> -       efi_prev_mm = current->active_mm;
> -       current->active_mm = &efi_mm;
> -       switch_mm(efi_prev_mm, &efi_mm, NULL);
> +       efi_temp_mm_state = use_temporary_mm(&efi_mm);
>  }
>
>  void efi_leave_mm(void)
>  {
> -       current->active_mm = efi_prev_mm;
> -       switch_mm(&efi_mm, efi_prev_mm, NULL);
> +       unuse_temporary_mm(efi_temp_mm_state);
>  }
>
>  static DEFINE_SPINLOCK(efi_runtime_lock);
> --
> 2.33.1
>

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 16/23] sched: Use lightweight hazard pointers to grab lazy mms
  2022-01-10  5:17                     ` Nicholas Piggin
@ 2022-01-10 17:19                       ` Linus Torvalds
  2022-01-11  2:24                         ` Nicholas Piggin
  0 siblings, 1 reply; 79+ messages in thread
From: Linus Torvalds @ 2022-01-10 17:19 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Andy Lutomirski, Andrew Morton, Anton Blanchard,
	Benjamin Herrenschmidt, Catalin Marinas, Dave Hansen, linux-arch,
	Linux-MM, Mathieu Desnoyers, Nadav Amit, Paul Mackerras,
	Peter Zijlstra (Intel),
	Randy Dunlap, Rik van Riel, Will Deacon,
	the arch/x86 maintainers

On Sun, Jan 9, 2022 at 9:18 PM Nicholas Piggin <npiggin@gmail.com> wrote:
>
> This is the patch that goes on top of the series I posted. It's not
> very clean at the moment it was just a proof of concept.

Yeah, this looks like what x86 basically already effectively does.

x86 obviously doesn't have that TLBIE option, and already has that
"exit lazy mode" logic (although it does so differently, using
switch_mm_irqs_off(), and guards it with the 'info->freed_tables'
check).

But there are so many different possible ways to flush TLB's (the
whole "paravirt vs native") that it would still require some
double-checking that there isn't some case that does it differently..

               Linus

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 16/23] sched: Use lightweight hazard pointers to grab lazy mms
  2022-01-10  4:56                   ` Nicholas Piggin
  2022-01-10  5:17                     ` Nicholas Piggin
@ 2022-01-10 20:52                     ` Andy Lutomirski
  2022-01-11  3:10                       ` Nicholas Piggin
  1 sibling, 1 reply; 79+ messages in thread
From: Andy Lutomirski @ 2022-01-10 20:52 UTC (permalink / raw)
  To: Nicholas Piggin, Linus Torvalds
  Cc: Andrew Morton, Anton Blanchard, Benjamin Herrenschmidt,
	Catalin Marinas, Dave Hansen, linux-arch, Linux-MM,
	Mathieu Desnoyers, Nadav Amit, Paul Mackerras,
	Peter Zijlstra (Intel),
	Randy Dunlap, Rik van Riel, Will Deacon,
	the arch/x86 maintainers



On Sun, Jan 9, 2022, at 8:56 PM, Nicholas Piggin wrote:
> Excerpts from Linus Torvalds's message of January 10, 2022 7:51 am:
>> [ Ugh, I actually went back and looked at Nick's patches again, to
>> just verify my memory, and they weren't as pretty as I thought they
>> were ]
>> 
>> On Sun, Jan 9, 2022 at 12:48 PM Linus Torvalds
>> <torvalds@linux-foundation.org> wrote:
>>>
>>> I'd much rather have a *much* smaller patch that says "on x86 and
>>> powerpc, we don't need this overhead at all".
>> 
>> For some reason I thought Nick's patch worked at "last mmput" time and
>> the TLB flush IPIs that happen at that point anyway would then make
>> sure any lazy TLB is cleaned up.
>> 
>> But that's not actually what it does. It ties the
>> MMU_LAZY_TLB_REFCOUNT to an explicit TLB shootdown triggered by the
>> last mmdrop() instead. Because it really tied the whole logic to the
>> mm_count logic (and made lazy tlb to not do mm_count) rather than the
>> mm_users thing I mis-remembered it doing.
>
> It does this because on powerpc with hash MMU, we can't use IPIs for
> TLB shootdowns.

I know nothing about powerpc’s mmu. If you can’t do IPI shootdowns, it sounds like the hazard pointer scheme might actually be pretty good.

>
>> So at least some of my arguments were based on me just mis-remembering
>> what Nick's patch actually did (mainly because I mentally recreated
>> the patch from "Nick did something like this" and what I thought would
>> be the way to do it on x86).
>
> With powerpc with the radix MMU using IPI based shootdowns, we can 
> actually do the switch-away-from-lazy on the final TLB flush and the
> final broadcast shootdown thing becomes a no-op. I didn't post that
> additional patch because it's powerpc-specific and I didn't want to
> post more code so widely.
>
>> So I guess I have to recant my arguments.
>> 
>> I still think my "get rid of lazy at last mmput" model should work,
>> and would be a perfect match for x86, but I can't really point to Nick
>> having done that.
>> 
>> So I was full of BS.
>> 
>> Hmm. I'd love to try to actually create a patch that does that "Nick
>> thing", but on last mmput() (ie when __mmput triggers). Because I
>> think this is interesting. But then I look at my schedule for the
>> upcoming week, and I go "I don't have a leg to stand on in this
>> discussion, and I'm just all hot air".
>
> I agree Andy's approach is very complicated and adds more overhead than
> necessary for powerpc, which is why I don't want to use it. I'm still
> not entirely sure what the big problem would be to convert x86 to use
> it, I admit I haven't kept up with the exact details of its lazy tlb
> mm handling recently though.

The big problem is the entire remainder of this series!  If x86 is going to do shootdowns without mm_count, I want the result to work and be maintainable. A few of the issues that needed solving:

- x86 tracks usage of the lazy mm on CPUs that have it loaded into the MMU, not CPUs that have it in active_mm.  Getting this in sync needed core changes.

- mmgrab and mmdrop are barriers, and core code relies on that. If we get rid of a bunch of calls (conditionally), we need to stop depending on the barriers. I fixed this.

- There were too many mmgrab and mmdrop calls, and the call sites had different semantics and different refcounting rules (thanks, kthread).  I cleaned this up.

- If we do a shootdown instead of a refcount, then, when exit() tears down its mm, we are lazily using *that* mm when we do the shootdowns. If active_mm continues to point to the being-freed mm and an NMI tries to dereference it, we’re toast. I fixed those issues.

- If we do a UEFI runtime service call while lazy or a text_poke while lazy and the mm goes away while this is happening, we would blow up. Refcounting prevents this but, in current kernels, a shootdown IPI on x86 would not prevent this.  I fixed these issues (and removed duplicate code).

My point here is that the current lazy mm code is a huge mess. 90% of the complexity in this series is cleaning up core messiness and x86 messiness. I would still like to get rid of ->active_mm entirely (it appears to serve no good purpose on any architecture), but that can be saved for later, I think.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 11/23] sched/scs: Initialize shadow stack on idle thread bringup, not shutdown
  2022-01-08 16:43 ` [PATCH 11/23] sched/scs: Initialize shadow stack on idle thread bringup, not shutdown Andy Lutomirski
@ 2022-01-10 22:06   ` Sami Tolvanen
  0 siblings, 0 replies; 79+ messages in thread
From: Sami Tolvanen @ 2022-01-10 22:06 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andrew Morton, Linux-MM, Nicholas Piggin, Anton Blanchard,
	Benjamin Herrenschmidt, Paul Mackerras, Randy Dunlap, linux-arch,
	X86 ML, Rik van Riel, Dave Hansen, Peter Zijlstra, Nadav Amit,
	Mathieu Desnoyers, Woody Lin, Valentin Schneider, Mark Rutland

Hi Andy,

On Sat, Jan 8, 2022 at 8:44 AM Andy Lutomirski <luto@kernel.org> wrote:
>
> Starting with commit 63acd42c0d49 ("sched/scs: Reset the shadow stack when
> idle_task_exit"), the idle thread's shadow stack was reset from the idle
> task's context during CPU hot-unplug.  This was fragile: between resetting
> the shadow stack and actually stopping the idle task, the shadow stack
> did not match the actual call stack.
>
> Clean this up by resetting the idle task's SCS in bringup_cpu().
>
> init_idle() still does scs_task_reset() -- see the comments there.  I
> leave this to an SCS maintainer to untangle further.
>
> Cc: Woody Lin <woodylin@google.com>
> Cc: Valentin Schneider <valentin.schneider@arm.com>
> Cc: Sami Tolvanen <samitolvanen@google.com>
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  kernel/cpu.c        | 3 +++
>  kernel/sched/core.c | 9 ++++++++-
>  2 files changed, 11 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index 192e43a87407..be16816bb87c 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -33,6 +33,7 @@
>  #include <linux/slab.h>
>  #include <linux/percpu-rwsem.h>
>  #include <linux/cpuset.h>
> +#include <linux/scs.h>
>
>  #include <trace/events/power.h>
>  #define CREATE_TRACE_POINTS
> @@ -587,6 +588,8 @@ static int bringup_cpu(unsigned int cpu)
>         struct task_struct *idle = idle_thread_get(cpu);
>         int ret;
>
> +       scs_task_reset(idle);
> +
>         /*
>          * Some architectures have to walk the irq descriptors to
>          * setup the vector space for the cpu which comes online.
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 917068b0a145..acd52a7d1349 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -8621,7 +8621,15 @@ void __init init_idle(struct task_struct *idle, int cpu)
>         idle->flags |= PF_IDLE | PF_KTHREAD | PF_NO_SETAFFINITY;
>         kthread_set_per_cpu(idle, cpu);
>
> +       /*
> +        * NB: This is called from sched_init() on the *current* idle thread.
> +        * This seems fragile if not actively incorrect.
> +        *
> +        * Initializing SCS for about-to-be-brought-up CPU idle threads
> +        * is in bringup_cpu(), but that does not cover the boot CPU.
> +        */
>         scs_task_reset(idle);
> +
>         kasan_unpoison_task_stack(idle);
>
>  #ifdef CONFIG_SMP
> @@ -8779,7 +8787,6 @@ void idle_task_exit(void)
>                 finish_arch_post_lock_switch();
>         }
>
> -       scs_task_reset(current);
>         /* finish_cpu(), as ran on the BP, will clean up the active_mm state */
>  }

I believe Mark already fixed this one here:

https://lore.kernel.org/lkml/20211123114047.45918-1-mark.rutland@arm.com/

Sami

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 16/23] sched: Use lightweight hazard pointers to grab lazy mms
  2022-01-10 17:19                       ` Linus Torvalds
@ 2022-01-11  2:24                         ` Nicholas Piggin
  0 siblings, 0 replies; 79+ messages in thread
From: Nicholas Piggin @ 2022-01-11  2:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Anton Blanchard, Benjamin Herrenschmidt,
	Catalin Marinas, Dave Hansen, linux-arch, Linux-MM,
	Andy Lutomirski, Mathieu Desnoyers, Nadav Amit, Paul Mackerras,
	Peter Zijlstra (Intel),
	Randy Dunlap, Rik van Riel, Will Deacon,
	the arch/x86 maintainers

Excerpts from Linus Torvalds's message of January 11, 2022 3:19 am:
> On Sun, Jan 9, 2022 at 9:18 PM Nicholas Piggin <npiggin@gmail.com> wrote:
>>
>> This is the patch that goes on top of the series I posted. It's not
>> very clean at the moment it was just a proof of concept.
> 
> Yeah, this looks like what x86 basically already effectively does.
> 
> x86 obviously doesn't have that TLBIE option, and already has that
> "exit lazy mode" logic (although it does so differently, using
> switch_mm_irqs_off(), and guards it with the 'info->freed_tables'
> check).
> 
> But there are so many different possible ways to flush TLB's (the
> whole "paravirt vs native") that it would still require some
> double-checking that there isn't some case that does it differently..

Oh yeah x86 needs a little porting to be able to use this for sure,
but there's no reason it couldn't do it.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 16/23] sched: Use lightweight hazard pointers to grab lazy mms
  2022-01-10 20:52                     ` Andy Lutomirski
@ 2022-01-11  3:10                       ` Nicholas Piggin
  2022-01-11 15:39                         ` Andy Lutomirski
  0 siblings, 1 reply; 79+ messages in thread
From: Nicholas Piggin @ 2022-01-11  3:10 UTC (permalink / raw)
  To: Andy Lutomirski, Linus Torvalds
  Cc: Andrew Morton, Anton Blanchard, Benjamin Herrenschmidt,
	Catalin Marinas, Dave Hansen, linux-arch, Linux-MM,
	Mathieu Desnoyers, Nadav Amit, Paul Mackerras,
	Peter Zijlstra (Intel),
	Randy Dunlap, Rik van Riel, Will Deacon,
	the arch/x86 maintainers

Excerpts from Andy Lutomirski's message of January 11, 2022 6:52 am:
> 
> 
> On Sun, Jan 9, 2022, at 8:56 PM, Nicholas Piggin wrote:
>> Excerpts from Linus Torvalds's message of January 10, 2022 7:51 am:
>>> [ Ugh, I actually went back and looked at Nick's patches again, to
>>> just verify my memory, and they weren't as pretty as I thought they
>>> were ]
>>> 
>>> On Sun, Jan 9, 2022 at 12:48 PM Linus Torvalds
>>> <torvalds@linux-foundation.org> wrote:
>>>>
>>>> I'd much rather have a *much* smaller patch that says "on x86 and
>>>> powerpc, we don't need this overhead at all".
>>> 
>>> For some reason I thought Nick's patch worked at "last mmput" time and
>>> the TLB flush IPIs that happen at that point anyway would then make
>>> sure any lazy TLB is cleaned up.
>>> 
>>> But that's not actually what it does. It ties the
>>> MMU_LAZY_TLB_REFCOUNT to an explicit TLB shootdown triggered by the
>>> last mmdrop() instead. Because it really tied the whole logic to the
>>> mm_count logic (and made lazy tlb to not do mm_count) rather than the
>>> mm_users thing I mis-remembered it doing.
>>
>> It does this because on powerpc with hash MMU, we can't use IPIs for
>> TLB shootdowns.
> 
> I know nothing about powerpc’s mmu. If you can’t do IPI shootdowns,

The paravirtualised hash MMU environment can't do IPI shootdowns because
it has a single level of translation: the guest uses hypercalls to insert
and remove translations, and the hypervisor flushes TLBs. The HV could
flush TLBs with IPIs, but obviously the guest can't use those to execute
the lazy switch. In radix guests (and on all bare metal) the OS flushes
its own TLBs.

We are moving over to radix, but powerpc also has a hardware broadcast 
flush instruction which can be a bit faster than IPIs and is usable by 
bare metal and radix guests, so those can also avoid the IPIs if they 
want. Part of the powerpc patch I just sent to combine the lazy switch 
with the final TLB flush is to force it to always take the IPI path and 
not use the TLBIE instruction on the final exit.

So hazard pointers could avoid some IPIs there too.

> it sounds like the hazard pointer scheme might actually be pretty good.

Some IPIs in the exit path just aren't that big a concern. I measured,
got numbers, tried to irritate it; it just wasn't really a problem. Some
archs use IPIs for all threaded TLB shootdowns anyway, and exits aren't
that frequent. Very fast, short-lived processes that do a lot of exits
just don't tend to spread across a lot of CPUs leaving lazy tlb mms to
shoot, and the long-lived, multi-threaded ones that do spread don't exit
at high rates.

So from what I can see it's premature optimization. Actually maybe not
even an optimization, because IIRC it adds complexity and even a barrier
on powerpc in the context switch path, which for us is a lot more critical
than exit(); we don't want slowdowns there.

It's a pretty high-complexity, boutique kind of synchronization. Which,
don't get me wrong, is the kind of thing I like; it is clever and may be
perfectly bug free, but it needs to prove itself over the simple, dumb
shoot-lazies approach.

>>> So at least some of my arguments were based on me just mis-remembering
>>> what Nick's patch actually did (mainly because I mentally recreated
>>> the patch from "Nick did something like this" and what I thought would
>>> be the way to do it on x86).
>>
>> With powerpc with the radix MMU using IPI based shootdowns, we can 
>> actually do the switch-away-from-lazy on the final TLB flush and the
>> final broadcast shootdown thing becomes a no-op. I didn't post that
>> additional patch because it's powerpc-specific and I didn't want to
>> post more code so widely.
>>
>>> So I guess I have to recant my arguments.
>>> 
>>> I still think my "get rid of lazy at last mmput" model should work,
>>> and would be a perfect match for x86, but I can't really point to Nick
>>> having done that.
>>> 
>>> So I was full of BS.
>>> 
>>> Hmm. I'd love to try to actually create a patch that does that "Nick
>>> thing", but on last mmput() (ie when __mmput triggers). Because I
>>> think this is interesting. But then I look at my schedule for the
>>> upcoming week, and I go "I don't have a leg to stand on in this
>>> discussion, and I'm just all hot air".
>>
>> I agree Andy's approach is very complicated and adds more overhead than
>> necessary for powerpc, which is why I don't want to use it. I'm still
>> not entirely sure what the big problem would be to convert x86 to use
>> it, I admit I haven't kept up with the exact details of its lazy tlb
>> mm handling recently though.
> 
> The big problem is the entire remainder of this series!  If x86 is going to do shootdowns without mm_count, I want the result to work and be maintainable. A few of the issues that needed solving:
> 
> - x86 tracks usage of the lazy mm on CPUs that have it loaded into the MMU, not CPUs that have it in active_mm.  Getting this in sync needed core changes.

Definitely should have been done at the time x86 deviated, but better 
late than never.

> 
> - mmgrab and mmdrop are barriers, and core code relies on that. If we get rid of a bunch of calls (conditionally), we need to stop depending on the barriers. I fixed this.

membarrier relied on a barrier that mmdrop was providing. Adding an smp_mb()
instead, if mmdrop becomes a no-op, is fine. Patches changing membarrier's 
ordering requirements can go in concurrently and are not fundamentally tied
to lazy tlb mm switching; it just reuses an existing ordering point.

> - There were too many mmgrab and mmdrop calls, and the call sites had different semantics and different refcounting rules (thanks, kthread).  I cleaned this up.

Seems like a decent cleanup. Again, nothing lazy tlb specific: just general
switch code that should be factored and better contained in kernel/sched/,
which is fine, but it can go in concurrently with the lazy tlb improvements.

> - If we do a shootdown instead of a refcount, then, when exit() tears down its mm, we are lazily using *that* mm when we do the shootdowns. If active_mm continues to point to the being-freed mm and an NMI tries to dereference it, we’re toast. I fixed those issues.

My shoot lazies patch has no such issues with that AFAIKS. What exact 
issue is it and where did you fix it?

> 
> - If we do a UEFI runtime service call while lazy or a text_poke while lazy and the mm goes away while this is happening, we would blow up. Refcounting prevents this but, in current kernels, a shootdown IPI on x86 would not prevent this.  I fixed these issues (and removed duplicate code).
> 
> My point here is that the current lazy mm code is a huge mess. 90% of the complexity in this series is cleaning up core messiness and x86 messiness. I would still like to get rid of ->active_mm entirely (it appears to serve no good purpose on any architecture),  it that can be saved for later, I think.

I disagree, the lazy mm code is very clean and simple. And I can't see 
how you would propose to remove active_mm from core code; I'm skeptical,
but would be very interested to see it. Either way, that's nothing to do
with my shoot lazies patch and can also be done concurrently, except for
mechanical merge issues.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 16/23] sched: Use lightweight hazard pointers to grab lazy mms
  2022-01-09 20:48               ` Linus Torvalds
  2022-01-09 21:51                 ` Linus Torvalds
@ 2022-01-11 10:39                 ` Will Deacon
  2022-01-11 15:22                   ` Andy Lutomirski
  1 sibling, 1 reply; 79+ messages in thread
From: Will Deacon @ 2022-01-11 10:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Catalin Marinas, Andrew Morton, Linux-MM,
	Nicholas Piggin, Anton Blanchard, Benjamin Herrenschmidt,
	Paul Mackerras, Randy Dunlap, linux-arch,
	the arch/x86 maintainers, Rik van Riel, Dave Hansen,
	Peter Zijlstra (Intel),
	Nadav Amit, Mathieu Desnoyers

Hi Andy, Linus,

On Sun, Jan 09, 2022 at 12:48:42PM -0800, Linus Torvalds wrote:
> On Sun, Jan 9, 2022 at 12:20 PM Andy Lutomirski <luto@kernel.org> wrote:
> >
> > Are you *sure*? The ASID management code on x86 is (as mentioned
> > before) completely unaware of whether an ASID is actually in use
> > anywhere.
> 
> Right.
> 
> But the ASID situation on x86 is very very different, exactly because
> x86 doesn't have cross-CPU TLB invalidates.
> 
> Put another way: x86 TLB hardware is fundamentally per-cpu. As such,
> any ASID management is also per-cpu.
> 
> That's fundamentally not true on arm64.  And that's not some "arm64
> implementation detail". That's fundamental to doing cross-CPU TLB
> invalidates in hardware.
> 
> If your TLB invalidates act across CPU's, then the state they act on
> are also obviously across CPU's.
> 
> So the ASID situation is fundamentally different depending on the
> hardware usage. On x86, TLB's are per-core, and on arm64 they are not,
> and that's reflected in our code too.
> 
> As a result, on x86, each mm has a per-cpu ASID, and there's a small
> array of per-cpu "mm->asid" mappings.
> 
> On arm, each mm has an asid, and it's allocated from a global asid
> space - so there is no need for that "mm->asid" mapping, because the
> asid is there in the mm, and it's shared across cpus.
> 
> That said, I still don't actually know the arm64 ASID management code.

That appears to be a common theme in this thread, so hopefully I can shed
some light on the arm64 side of things:

The CPU supports either 8-bit or 16-bit ASIDs and we require that we don't
have more CPUs than we can represent in the ASID space (well, we WARN in
this case but it's likely to go badly wrong). We reserve ASID 0 for things
like the idmap, so as far as the allocator is concerned ASID 0 is "invalid"
and we rely on this.

As Linus says above, the ASID is per-'mm' and we require that all threads
of an 'mm' use the same ASID at the same time, otherwise the hardware TLB
broadcasting isn't going to work properly because the invalidations are
typically tagged by ASID.

As Andy points out later, this means that we have to reuse ASIDs for
different 'mm's once we have enough of them. We do this using a 64-bit
context ID in mm_context_t, where the lower bits are the ASID for the 'mm'
and the upper bits are a generation count. The ASID allocator keeps an
internal generation count which is incremented whenever we fail to allocate
an ASID and are forced to invalidate them all and start re-allocating. We
assume that the generation count doesn't overflow.

When switching to an 'mm', we check if the generation count held in the
'mm' is behind the allocator's current generation count. If it is, then
we know that the 'mm' needs to be allocated a new ASID. Allocation is
performed with a spinlock held and basically involves setting a new bit
in the bitmap and updating the 'mm' with the new ASID and current
generation. We don't reclaim ASIDs greedily on 'mm' teardown -- this was
pretty slow when I looked at it in the past.

So far so good, but it gets more complicated when we look at the details of
the overflow handling. Overflow is always detected on the allocation path
with the spinlock held but other CPUs could happily be running other code
(inc. user code) at this point. Therefore, we can't simply invalidate the
TLBs, clear the bitmap and start re-allocating ASIDs because we could end up
with an ASID shared between two running 'mm's, leading to both invalidation
interference but also the potential to hit stale TLB entries allocated after
the invalidation on rollover. We handle this with a couple of per-cpu
variables, 'active_asids' and 'reserved_asids'.

'active_asids' is set to the current ASID in switch_mm() just before
writing the actual TTBR register. On a rollover, the CPU holding the lock
goes through each CPU's 'active_asids' entry, atomic xchg()s it to 0 and
writes the result into the corresponding 'reserved_asids' entry. These
'reserved_asids' are then immediately marked as allocated and a flag is
set for each CPU to indicate that its TLBs are dirty. This allows the
CPU handling the rollover to continue with its allocation without stopping
the world and without broadcasting TLB invalidation; other CPUs will
hit a generation mismatch on their next switch_mm(), notice that they are
running a reserved ASID from an older generation, upgrade the generation
(i.e. keep the same ASID) and then invalidate their local TLB.

So we do have some tracking of which ASIDs are where, but we can't generally
say "is this ASID dirty in the TLBs of this CPU". That also gets more
complicated on some systems where a TLB can be shared between some of the
CPUs (I haven't covered that case above, since I think that this is enough
detail already.)
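
To make the switch-time check above a bit more concrete, here is a much
simplified C-like sketch of it. The names and structure are illustrative
only; the real thing lives in arch/arm64/mm/context.c, has a lockless
fast path and handles details omitted here. asid_gen() and new_context()
stand in for the real helpers:

static u64 asid_generation;		/* allocator's current generation */
static DEFINE_RAW_SPINLOCK(asid_lock);
static DEFINE_PER_CPU(u64, active_asids);
static DEFINE_PER_CPU(bool, tlb_flush_pending);

void check_and_switch_context(struct mm_struct *mm)
{
	u64 ctx = atomic64_read(&mm->context.id);	/* generation | ASID */

	if (asid_gen(ctx) != READ_ONCE(asid_generation)) {
		raw_spin_lock(&asid_lock);
		/*
		 * new_context() may roll over: it xchg()s every CPU's
		 * active_asids into reserved_asids, bumps the generation
		 * and marks all TLBs dirty.
		 */
		ctx = new_context(mm);
		atomic64_set(&mm->context.id, ctx);
		raw_spin_unlock(&asid_lock);
	}

	if (this_cpu_xchg(tlb_flush_pending, false))
		local_flush_tlb_all();		/* first switch after rollover */

	this_cpu_write(active_asids, ctx);
	cpu_switch_mm(mm->pgd, mm);		/* TTBR write tagged with the ASID */
}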

FWIW, we have a TLA+ model of some of this, which may (or may not) be easier
to follow than my text:

https://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/kernel-tla.git/tree/asidalloc.tla

although the syntax is pretty hard going :(

> The thing about TLB flushes is that it's ok to do them spuriously (as
> long as you don't do _too_ many of them and tank performance), so two
> different mm's can have the same hw ASID on two different cores and
> that just makes cross-CPU TLB invalidates too aggressive. You can't
> share an ASID on the _same_ core without flushing in between context
> switches, because then the TLB on that core might be re-used for a
> different mm. So the flushing rules aren't necessarily 100% 1:1 with
> the "in use" rules, and who knows if the arm64 ASID management
> actually ends up just matching what that whole "this lazy TLB is still
> in use on another CPU".

The shared TLBs (Arm calls this "Common-not-private") make this problematic,
as the TLB is no longer necessarily per-core.

> So I don't really know the arm64 situation. And i's possible that lazy
> TLB isn't even worth it on arm64 in the first place.

ASID allocation aside, I think there are a few useful things to point out
for arm64:

	- We only have "local" or "all" TLB invalidation; nothing targetted
	  (and for KVM guests this is always "all").

	- Most mms end up running on more than one CPU (at least, when I
	  last looked at this a fork+exec would end up with the mm having
	  been installed on two CPUs)

	- We don't track mm_cpumask as it showed up as a bottleneck in the
	  past and, because of the earlier points, it wasn't very useful
	  anyway

	- mmgrab() should be fast for us (it's a posted atomic add),
	  although mmdrop() will be slower as it has to return data to
	  check against the count going to zero.
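
	  For reference, this is roughly what those two look like today
	  (simplified from include/linux/sched/mm.h):

	static inline void mmgrab(struct mm_struct *mm)
	{
		atomic_inc(&mm->mm_count);	/* fire-and-forget increment */
	}

	static inline void mmdrop(struct mm_struct *mm)
	{
		/* needs the old value back, so the CPU has to wait for it */
		if (unlikely(atomic_dec_and_test(&mm->mm_count)))
			__mmdrop(mm);
	}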

So it doesn't feel like an obvious win to me for us to scan these new hazard
pointers on arm64. At least, I would love to see some numbers if we're going
to make changes here.

Will

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 16/23] sched: Use lightweight hazard pointers to grab lazy mms
  2022-01-11 10:39                 ` Will Deacon
@ 2022-01-11 15:22                   ` Andy Lutomirski
  0 siblings, 0 replies; 79+ messages in thread
From: Andy Lutomirski @ 2022-01-11 15:22 UTC (permalink / raw)
  To: Will Deacon, Linus Torvalds
  Cc: Catalin Marinas, Andrew Morton, Linux-MM, Nicholas Piggin,
	Anton Blanchard, Benjamin Herrenschmidt, Paul Mackerras,
	Randy Dunlap, linux-arch, the arch/x86 maintainers, Rik van Riel,
	Dave Hansen, Peter Zijlstra (Intel),
	Nadav Amit, Mathieu Desnoyers



On Tue, Jan 11, 2022, at 2:39 AM, Will Deacon wrote:
> Hi Andy, Linus,
>
> On Sun, Jan 09, 2022 at 12:48:42PM -0800, Linus Torvalds wrote:
>> On Sun, Jan 9, 2022 at 12:20 PM Andy Lutomirski <luto@kernel.org> wrote:

>> That said, I still don't actually know the arm64 ASID management code.
>
> That appears to be a common theme in this thread, so hopefully I can shed
> some light on the arm64 side of things:
>

Thanks!

>
> FWIW, we have a TLA+ model of some of this, which may (or may not) be easier
> to follow than my text:

Yikes. Your fine hardware engineers should consider 64-bit ASIDs :)

I suppose x86-on-AMD could copy this, but eww.  OTOH x86 can easily have more CPUs than ASIDs, so maybe not.

>
> https://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/kernel-tla.git/tree/asidalloc.tla
>
> although the syntax is pretty hard going :(
>
>> The thing about TLB flushes is that it's ok to do them spuriously (as
>> long as you don't do _too_ many of them and tank performance), so two
>> different mm's can have the same hw ASID on two different cores and
>> that just makes cross-CPU TLB invalidates too aggressive. You can't
>> share an ASID on the _same_ core without flushing in between context
>> switches, because then the TLB on that core might be re-used for a
>> different mm. So the flushing rules aren't necessarily 100% 1:1 with
>> the "in use" rules, and who knows if the arm64 ASID management
>> actually ends up just matching what that whole "this lazy TLB is still
>> in use on another CPU".
>
> The shared TLBs (Arm calls this "Common-not-private") make this problematic,
> as the TLB is no longer necessarily per-core.
>
>> So I don't really know the arm64 situation. And i's possible that lazy
>> TLB isn't even worth it on arm64 in the first place.
>
> ASID allocation aside, I think there are a few useful things to point out
> for arm64:
>
> 	- We only have "local" or "all" TLB invalidation; nothing targetted
> 	  (and for KVM guests this is always "all").
>
> 	- Most mms end up running on more than one CPU (at least, when I
> 	  last looked at this a fork+exec would end up with the mm having
> 	  been installed on two CPUs)
>
> 	- We don't track mm_cpumask as it showed up as a bottleneck in the
> 	  past and, because of the earlier points, it wasn't very useful
> 	  anyway
>
> 	- mmgrab() should be fast for us (it's a posted atomic add),
> 	  although mmdrop() will be slower as it has to return data to
> 	  check against the count going to zero.
>
> So it doesn't feel like an obvious win to me for us to scan these new hazard
> pointers on arm64. At least, I would love to see some numbers if we're going
> to make changes here.

I will table the hazard pointer scheme, then, and adjust the series to do shootdowns.

I would guess that once arm64 hits a few hundred CPUs, you'll start finding workloads where mmdrop() at least starts to hurt.  But we can cross that bridge when we get to it.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 16/23] sched: Use lightweight hazard pointers to grab lazy mms
  2022-01-11  3:10                       ` Nicholas Piggin
@ 2022-01-11 15:39                         ` Andy Lutomirski
  2022-01-11 22:48                           ` Nicholas Piggin
  0 siblings, 1 reply; 79+ messages in thread
From: Andy Lutomirski @ 2022-01-11 15:39 UTC (permalink / raw)
  To: Nicholas Piggin, Linus Torvalds
  Cc: Andrew Morton, Anton Blanchard, Benjamin Herrenschmidt,
	Catalin Marinas, Dave Hansen, linux-arch, Linux-MM,
	Mathieu Desnoyers, Nadav Amit, Paul Mackerras,
	Peter Zijlstra (Intel),
	Randy Dunlap, Rik van Riel, Will Deacon,
	the arch/x86 maintainers



On Mon, Jan 10, 2022, at 7:10 PM, Nicholas Piggin wrote:
> Excerpts from Andy Lutomirski's message of January 11, 2022 6:52 am:
>> 
>> 
>> On Sun, Jan 9, 2022, at 8:56 PM, Nicholas Piggin wrote:
>>> Excerpts from Linus Torvalds's message of January 10, 2022 7:51 am:
>>>> [ Ugh, I actually went back and looked at Nick's patches again, to
>>>> just verify my memory, and they weren't as pretty as I thought they
>>>> were ]
>>>> 
>>>> On Sun, Jan 9, 2022 at 12:48 PM Linus Torvalds
>>>> <torvalds@linux-foundation.org> wrote:
>>>>>
>>>>> I'd much rather have a *much* smaller patch that says "on x86 and
>>>>> powerpc, we don't need this overhead at all".
>>>> 
>>>> For some reason I thought Nick's patch worked at "last mmput" time and
>>>> the TLB flush IPIs that happen at that point anyway would then make
>>>> sure any lazy TLB is cleaned up.
>>>> 
>>>> But that's not actually what it does. It ties the
>>>> MMU_LAZY_TLB_REFCOUNT to an explicit TLB shootdown triggered by the
>>>> last mmdrop() instead. Because it really tied the whole logic to the
>>>> mm_count logic (and made lazy tlb to not do mm_count) rather than the
>>>> mm_users thing I mis-remembered it doing.
>>>
>>> It does this because on powerpc with hash MMU, we can't use IPIs for
>>> TLB shootdowns.
>> 
>> I know nothing about powerpc’s mmu. If you can’t do IPI shootdowns,
>
> The paravirtualised hash MMU environment doesn't because it has a single 
> level translation and the guest uses hypercalls to insert and remove 
> translations and the hypervisor flushes TLBs. The HV could flush TLBs
> with IPIs but obviously the guest can't use those to execute the lazy
> switch. In radix guests (and all bare metal) the OS flushes its own
> TLBs.
>
> We are moving over to radix, but powerpc also has a hardware broadcast 
> flush instruction which can be a bit faster than IPIs and is usable by 
> bare metal and radix guests so those can also avoid the IPIs if they 
> want. Part of the powerpc patch I just sent to combine the lazy switch 
> with the final TLB flush is to force it to always take the IPI path and 
> not use TLBIE instruction on the final exit.
>
> So hazard points could avoid some IPIs there too.
>
>> it sounds like the hazard pointer scheme might actually be pretty good.
>
> Some IPIs in the exit path just aren't that big a concern. I measured,
> got numbers, tried to irritate it, just wasn't really a problem. Some
> archs use IPIs for all threaded TLB shootdowns and exits not that
> frequent. Very fast short lived processes that do a lot of exits just
> don't tend to spread across a lot of CPUs leaving lazy tlb mms to shoot,
> and long lived and multi threaded ones that do don't exit at high rates.
>
> So from what I can see it's premature optimization. Actually maybe not
> even optimization because IIRC it adds complexity and even a barrier on
> powerpc in the context switch path which is a lot more critical than
> exit() for us we don't want slowdowns there.
>
> It's a pretty high complexity boutique kind of synchronization. Which
> don't get me wrong is the kind of thing I like, it is clever and may be
> perfectly bug free but it needs to prove itself over the simple dumb
> shoot lazies approach.
>
>>>> So at least some of my arguments were based on me just mis-remembering
>>>> what Nick's patch actually did (mainly because I mentally recreated
>>>> the patch from "Nick did something like this" and what I thought would
>>>> be the way to do it on x86).
>>>
>>> With powerpc with the radix MMU using IPI based shootdowns, we can 
>>> actually do the switch-away-from-lazy on the final TLB flush and the
>>> final broadcast shootdown thing becomes a no-op. I didn't post that
>>> additional patch because it's powerpc-specific and I didn't want to
>>> post more code so widely.
>>>
>>>> So I guess I have to recant my arguments.
>>>> 
>>>> I still think my "get rid of lazy at last mmput" model should work,
>>>> and would be a perfect match for x86, but I can't really point to Nick
>>>> having done that.
>>>> 
>>>> So I was full of BS.
>>>> 
>>>> Hmm. I'd love to try to actually create a patch that does that "Nick
>>>> thing", but on last mmput() (ie when __mmput triggers). Because I
>>>> think this is interesting. But then I look at my schedule for the
>>>> upcoming week, and I go "I don't have a leg to stand on in this
>>>> discussion, and I'm just all hot air".
>>>
>>> I agree Andy's approach is very complicated and adds more overhead than
>>> necessary for powerpc, which is why I don't want to use it. I'm still
>>> not entirely sure what the big problem would be to convert x86 to use
>>> it, I admit I haven't kept up with the exact details of its lazy tlb
>>> mm handling recently though.
>> 
>> The big problem is the entire remainder of this series!  If x86 is going to do shootdowns without mm_count, I want the result to work and be maintainable. A few of the issues that needed solving:
>> 
>> - x86 tracks usage of the lazy mm on CPUs that have it loaded into the MMU, not CPUs that have it in active_mm.  Getting this in sync needed core changes.
>
> Definitely should have been done at the time x86 deviated, but better 
> late than never.
>

I suspect that this code may predate there being anything for x86 to deviate from.

>> 
>> - mmgrab and mmdrop are barriers, and core code relies on that. If we get rid of a bunch of calls (conditionally), we need to stop depending on the barriers. I fixed this.
>
> membarrier relied on a call that mmdrop was providing. Adding a smp_mb()
> instead if mmdrop is a no-op is fine. Patches changing membarrier's 
> ordering requirements can be concurrent and are not fundmentally tied
> to lazy tlb mm switching, it just reuses an existing ordering point.

smp_mb() is rather expensive on x86 at least, and I think on powerpc too.  Let's not.

Also, IMO my cleanups here generally make sense and make the code better, so I think we should just go with them.

>
>> - There were too many mmgrab and mmdrop calls, and the call sites had different semantics and different refcounting rules (thanks, kthread).  I cleaned this up.
>
> Seems like a decent cleanup. Again lazy tlb specific, just general switch
> code should be factored and better contained in kernel/sched/ which is
> fine, but concurrent to lazy tlb improvements.

I personally rather dislike the model of:

...
mmgrab_lazy();
...
mmdrop_lazy();
...
mmgrab_lazy();
...
mmdrop_lazy();
...

where the different calls have incompatible logic at the call sites and a magic config option NOPs them all away and adds barriers.  I'm personally a big fan of cleaning up code before making it even more complicated.

>
>> - If we do a shootdown instead of a refcount, then, when exit() tears down its mm, we are lazily using *that* mm when we do the shootdowns. If active_mm continues to point to the being-freed mm and an NMI tries to dereference it, we’re toast. I fixed those issues.
>
> My shoot lazies patch has no such issues with that AFAIKS. What exact 
> issue is it and where did you fix it?

Without my unlazy_mm_irqs_off() (or something similar), x86's very sensible (and very old!) code to unlazy a lazy CPU when flushing leaves active_mm pointing to the old lazy mm.  That's not a problem at all on current kernels, but in a shootdown-lazy world, those active_mm pointers will stick around.  Even with that fixed, without some care, an NMI during the shootdown could dereference ->active_mm at a time when it's not being actively kept alive.

Fixed by unlazy_mm_irqs_off(), the patches that use it, and the patches that make x86 stop inappropriately using ->active_mm.

>
>> 
>> - If we do a UEFI runtime service call while lazy or a text_poke while lazy and the mm goes away while this is happening, we would blow up. Refcounting prevents this but, in current kernels, a shootdown IPI on x86 would not prevent this.  I fixed these issues (and removed duplicate code).
>> 
>> My point here is that the current lazy mm code is a huge mess. 90% of the complexity in this series is cleaning up core messiness and x86 messiness. I would still like to get rid of ->active_mm entirely (it appears to serve no good purpose on any architecture),  it that can be saved for later, I think.
>
> I disagree, the lazy mm code is very clean and simple. And I can't see 
> how you would propose to remove active_mm from core code I'm skeptical
> but would be very interested to see, but that's nothing to do with my
> shoot lazies patch and can also be concurrent except for mechanical
> merge issues.

I think we may just have to agree to disagree.  The more I looked at the lazy code, the more problems I found.  So I fixed them.  That work is done now (as far as I'm aware) except for rebasing and review.

--Andy

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 16/23] sched: Use lightweight hazard pointers to grab lazy mms
  2022-01-11 15:39                         ` Andy Lutomirski
@ 2022-01-11 22:48                           ` Nicholas Piggin
  2022-01-12  0:42                             ` Nicholas Piggin
  0 siblings, 1 reply; 79+ messages in thread
From: Nicholas Piggin @ 2022-01-11 22:48 UTC (permalink / raw)
  To: Andy Lutomirski, Linus Torvalds
  Cc: Andrew Morton, Anton Blanchard, Benjamin Herrenschmidt,
	Catalin Marinas, Dave Hansen, linux-arch, Linux-MM,
	Mathieu Desnoyers, Nadav Amit, Paul Mackerras,
	Peter Zijlstra (Intel),
	Randy Dunlap, Rik van Riel, Will Deacon,
	the arch/x86 maintainers

Excerpts from Andy Lutomirski's message of January 12, 2022 1:39 am:
> 
> 
> On Mon, Jan 10, 2022, at 7:10 PM, Nicholas Piggin wrote:
>> Excerpts from Andy Lutomirski's message of January 11, 2022 6:52 am:
>>> 
>>> 
>>> On Sun, Jan 9, 2022, at 8:56 PM, Nicholas Piggin wrote:
>>>> Excerpts from Linus Torvalds's message of January 10, 2022 7:51 am:
>>>>> [ Ugh, I actually went back and looked at Nick's patches again, to
>>>>> just verify my memory, and they weren't as pretty as I thought they
>>>>> were ]
>>>>> 
>>>>> On Sun, Jan 9, 2022 at 12:48 PM Linus Torvalds
>>>>> <torvalds@linux-foundation.org> wrote:
>>>>>>
>>>>>> I'd much rather have a *much* smaller patch that says "on x86 and
>>>>>> powerpc, we don't need this overhead at all".
>>>>> 
>>>>> For some reason I thought Nick's patch worked at "last mmput" time and
>>>>> the TLB flush IPIs that happen at that point anyway would then make
>>>>> sure any lazy TLB is cleaned up.
>>>>> 
>>>>> But that's not actually what it does. It ties the
>>>>> MMU_LAZY_TLB_REFCOUNT to an explicit TLB shootdown triggered by the
>>>>> last mmdrop() instead. Because it really tied the whole logic to the
>>>>> mm_count logic (and made lazy tlb to not do mm_count) rather than the
>>>>> mm_users thing I mis-remembered it doing.
>>>>
>>>> It does this because on powerpc with hash MMU, we can't use IPIs for
>>>> TLB shootdowns.
>>> 
>>> I know nothing about powerpc’s mmu. If you can’t do IPI shootdowns,
>>
>> The paravirtualised hash MMU environment doesn't because it has a single 
>> level translation and the guest uses hypercalls to insert and remove 
>> translations and the hypervisor flushes TLBs. The HV could flush TLBs
>> with IPIs but obviously the guest can't use those to execute the lazy
>> switch. In radix guests (and all bare metal) the OS flushes its own
>> TLBs.
>>
>> We are moving over to radix, but powerpc also has a hardware broadcast 
>> flush instruction which can be a bit faster than IPIs and is usable by 
>> bare metal and radix guests so those can also avoid the IPIs if they 
>> want. Part of the powerpc patch I just sent to combine the lazy switch 
>> with the final TLB flush is to force it to always take the IPI path and 
>> not use TLBIE instruction on the final exit.
>>
>> So hazard points could avoid some IPIs there too.
>>
>>> it sounds like the hazard pointer scheme might actually be pretty good.
>>
>> Some IPIs in the exit path just aren't that big a concern. I measured,
>> got numbers, tried to irritate it, just wasn't really a problem. Some
>> archs use IPIs for all threaded TLB shootdowns and exits not that
>> frequent. Very fast short lived processes that do a lot of exits just
>> don't tend to spread across a lot of CPUs leaving lazy tlb mms to shoot,
>> and long lived and multi threaded ones that do don't exit at high rates.
>>
>> So from what I can see it's premature optimization. Actually maybe not
>> even optimization because IIRC it adds complexity and even a barrier on
>> powerpc in the context switch path which is a lot more critical than
>> exit() for us we don't want slowdowns there.
>>
>> It's a pretty high complexity boutique kind of synchronization. Which
>> don't get me wrong is the kind of thing I like, it is clever and may be
>> perfectly bug free but it needs to prove itself over the simple dumb
>> shoot lazies approach.
>>
>>>>> So at least some of my arguments were based on me just mis-remembering
>>>>> what Nick's patch actually did (mainly because I mentally recreated
>>>>> the patch from "Nick did something like this" and what I thought would
>>>>> be the way to do it on x86).
>>>>
>>>> With powerpc with the radix MMU using IPI based shootdowns, we can 
>>>> actually do the switch-away-from-lazy on the final TLB flush and the
>>>> final broadcast shootdown thing becomes a no-op. I didn't post that
>>>> additional patch because it's powerpc-specific and I didn't want to
>>>> post more code so widely.
>>>>
>>>>> So I guess I have to recant my arguments.
>>>>> 
>>>>> I still think my "get rid of lazy at last mmput" model should work,
>>>>> and would be a perfect match for x86, but I can't really point to Nick
>>>>> having done that.
>>>>> 
>>>>> So I was full of BS.
>>>>> 
>>>>> Hmm. I'd love to try to actually create a patch that does that "Nick
>>>>> thing", but on last mmput() (ie when __mmput triggers). Because I
>>>>> think this is interesting. But then I look at my schedule for the
>>>>> upcoming week, and I go "I don't have a leg to stand on in this
>>>>> discussion, and I'm just all hot air".
>>>>
>>>> I agree Andy's approach is very complicated and adds more overhead than
>>>> necessary for powerpc, which is why I don't want to use it. I'm still
>>>> not entirely sure what the big problem would be to convert x86 to use
>>>> it, I admit I haven't kept up with the exact details of its lazy tlb
>>>> mm handling recently though.
>>> 
>>> The big problem is the entire remainder of this series!  If x86 is going to do shootdowns without mm_count, I want the result to work and be maintainable. A few of the issues that needed solving:
>>> 
>>> - x86 tracks usage of the lazy mm on CPUs that have it loaded into the MMU, not CPUs that have it in active_mm.  Getting this in sync needed core changes.
>>
>> Definitely should have been done at the time x86 deviated, but better 
>> late than never.
>>
> 
> I suspect that this code may predate there being anything for x86 to deviate from.

Interesting, active_mm came in 2.3.11 and x86's cpu_tlbstate[].active_mm 
2.3.43. Longer than I thought.

>>> 
>>> - mmgrab and mmdrop are barriers, and core code relies on that. If we get rid of a bunch of calls (conditionally), we need to stop depending on the barriers. I fixed this.
>>
>> membarrier relied on a call that mmdrop was providing. Adding a smp_mb()
>> instead if mmdrop is a no-op is fine. Patches changing membarrier's 
>> ordering requirements can be concurrent and are not fundmentally tied
>> to lazy tlb mm switching, it just reuses an existing ordering point.
> 
> smp_mb() is rather expensive on x86 at least, and I think on powerpc to.  Let's not.

You misunderstand me. If mmdrop is not there to provide the required 
ordering that membarrier needs, then the membarrier code does its own
smp_mb(). It's not _more_ expensive than before because the full barrier
from the mmdrop is gone.
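
Something like this, purely as an illustration (the config name follows
Linus's mention of MMU_LAZY_TLB_REFCOUNT above; nothing here is taken
verbatim from either series):

	/* finish_task_switch(), with lazy tlb refcounting compiled out: */
	if (mm) {
		if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT))
			mmdrop(mm);	/* atomic_dec_and_test() is a full barrier */
		else
			smp_mb();	/* keep the ordering membarrier() relies on */
	}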

> Also, IMO my cleanups here generally make sense and make the code better, so I think we should just go with them.

Sure if you can make the membarrier code better and avoid this ordering
requirement that's nice. This is orthogonal to the lazy tlb code though
and they can go in parallel (again aside from mechanical merge issues).

I'm not sure what you don't understand about this: that ordering is a
membarrier requirement, and it happens to use an existing barrier
that the lazy mm code had anyway, which is perfectly fine and something
that is done all over the kernel in performance-critical code.

>>> - There were too many mmgrab and mmdrop calls, and the call sites had different semantics and different refcounting rules (thanks, kthread).  I cleaned this up.
>>
>> Seems like a decent cleanup. Again lazy tlb specific, just general switch
>> code should be factored and better contained in kernel/sched/ which is
>> fine, but concurrent to lazy tlb improvements.
> 
> I personally rather dislike the model of:
> 
> ...
> mmgrab_lazy();
> ...
> mmdrop_lazy();
> ...
> mmgrab_lazy();
> ...
> mmdrop_lazy();
> ...
> 
> where the different calls have incompatible logic at the call sites and a magic config option NOPs them all away and adds barriers.  I'm personally a big fan of cleaning up code before making it even more complicated.

Not sure what model that is though. Call sites don't have to know 
anything at all about the options or barriers. The rule is simply
that the lazy tlb mm reference is manipulated with the _lazy variants.

It was just mostly duplicated code. Again, can go in parallel and no
dependencies other than mechanical merge.

> 
>>
>>> - If we do a shootdown instead of a refcount, then, when exit() tears down its mm, we are lazily using *that* mm when we do the shootdowns. If active_mm continues to point to the being-freed mm and an NMI tries to dereference it, we’re toast. I fixed those issues.
>>
>> My shoot lazies patch has no such issues with that AFAIKS. What exact 
>> issue is it and where did you fix it?
> 
> Without my unlazy_mm_irqs_off() (or something similar), x86's very sensible (and very old!) code to unlazy a lazy CPU when flushing leaves active_mm pointing to the old lazy mm.  That's not a problem at all on current kernels, but in a shootdown-lazy world, those active_mm pointers will stick around.  Even with that fixed, without some care, an NMI during the shootdown CPU could dereference ->active_mm at a time when it's not being actively kept alive.
> 
> Fixed by unlazy_mm_irqs_off(), the patches that use it, and the patches that make x86 stop inappropriately using ->active_mm.

Oh you're talking about some x86 specific issue where you would have
problems if you didn't do the port properly? Don't tell me this is why
you've been nacking my patches for 15 months.

>>> - If we do a UEFI runtime service call while lazy or a text_poke while lazy and the mm goes away while this is happening, we would blow up. Refcounting prevents this but, in current kernels, a shootdown IPI on x86 would not prevent this.  I fixed these issues (and removed duplicate code).
>>> 
>>> My point here is that the current lazy mm code is a huge mess. 90% of the complexity in this series is cleaning up core messiness and x86 messiness. I would still like to get rid of ->active_mm entirely (it appears to serve no good purpose on any architecture),  it that can be saved for later, I think.
>>
>> I disagree, the lazy mm code is very clean and simple. And I can't see 
>> how you would propose to remove active_mm from core code I'm skeptical
>> but would be very interested to see, but that's nothing to do with my
>> shoot lazies patch and can also be concurrent except for mechanical
>> merge issues.
> 
> I think we may just have to agree to disagree.  The more I looked at the lazy code, the more problems I found.  So I fixed them.  That work is done now (as far as I'm aware) except for rebasing and review.

I don't see any problems with the lazy tlb mm code outside arch/x86 or 
anything your series fixed with it aside from a bit of code duplication.

Anyway I will try to take a look over and review bits I can before the
5.18 merge window. For 5.17 my series has been ready to go for a year or
so and very small, so let's merge that first since Linus wants to try to go
with that approach rather than the refcount one.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 16/23] sched: Use lightweight hazard pointers to grab lazy mms
  2022-01-11 22:48                           ` Nicholas Piggin
@ 2022-01-12  0:42                             ` Nicholas Piggin
  0 siblings, 0 replies; 79+ messages in thread
From: Nicholas Piggin @ 2022-01-12  0:42 UTC (permalink / raw)
  To: Andy Lutomirski, Linus Torvalds
  Cc: Andrew Morton, Anton Blanchard, Benjamin Herrenschmidt,
	Catalin Marinas, Dave Hansen, linux-arch, Linux-MM,
	Mathieu Desnoyers, Nadav Amit, Paul Mackerras,
	Peter Zijlstra (Intel),
	Randy Dunlap, Rik van Riel, Will Deacon,
	the arch/x86 maintainers

Excerpts from Nicholas Piggin's message of January 12, 2022 8:48 am:
> Anyway I will try to take a look over and review bits I can before the
> 5.18 merge window. For 5.17 my series has been ready to go for a year or
> so and very small, so let's merge that first since Linus wants to try to go
> with that approach rather than the refcount one.

5.19 and 5.18 respectively.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 01/23] membarrier: Document why membarrier() works
  2022-01-08 16:43 ` [PATCH 01/23] membarrier: Document why membarrier() works Andy Lutomirski
@ 2022-01-12 15:30   ` Mathieu Desnoyers
  0 siblings, 0 replies; 79+ messages in thread
From: Mathieu Desnoyers @ 2022-01-12 15:30 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andrew Morton, linux-mm, Nicholas Piggin, Anton Blanchard,
	Benjamin Herrenschmidt, Paul Mackerras, Randy Dunlap, linux-arch,
	x86, riel, Dave Hansen, Peter Zijlstra, Nadav Amit

----- On Jan 8, 2022, at 11:43 AM, Andy Lutomirski luto@kernel.org wrote:

> We had a nice comment at the top of membarrier.c explaining why membarrier
> worked in a handful of scenarios, but that consisted more of a list of
> things not to forget than an actual description of the algorithm and why it
> should be expected to work.
> 
> Add a comment explaining my understanding of the algorithm.  This exposes a
> couple of implementation issues that I will hopefully fix up in subsequent
> patches.

Given that no explanation about the specific implementation issues is provided
here, I would be tempted to remove the last sentence above, and keep that for
the commit messages of the subsequent patches.

The explanation you add here is clear and very much fits my mental model, thanks!

Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

> 
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
> kernel/sched/membarrier.c | 60 +++++++++++++++++++++++++++++++++++++--
> 1 file changed, 58 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
> index b5add64d9698..30e964b9689d 100644
> --- a/kernel/sched/membarrier.c
> +++ b/kernel/sched/membarrier.c
> @@ -7,8 +7,64 @@
> #include "sched.h"
> 
> /*
> - * For documentation purposes, here are some membarrier ordering
> - * scenarios to keep in mind:
> + * The basic principle behind the regular memory barrier mode of
> + * membarrier() is as follows.  membarrier() is called in one thread.  It
> + * iterates over all CPUs, and, for each CPU, it either sends an IPI to
> + * that CPU or it does not. If it sends an IPI, then we have the
> + * following sequence of events:
> + *
> + * 1. membarrier() does smp_mb().
> + * 2. membarrier() does a store (the IPI request payload) that is observed by
> + *    the target CPU.
> + * 3. The target CPU does smp_mb().
> + * 4. The target CPU does a store (the completion indication) that is observed
> + *    by membarrier()'s wait-for-IPIs-to-finish request.
> + * 5. membarrier() does smp_mb().
> + *
> + * So all pre-membarrier() local accesses are visible after the IPI on the
> + * target CPU and all pre-IPI remote accesses are visible after
> + * membarrier(). IOW membarrier() has synchronized both ways with the target
> + * CPU.
> + *
> + * (This has the caveat that membarrier() does not interrupt the CPU that it's
> + * running on at the time it sends the IPIs. However, if that is the CPU on
> + * which membarrier() starts and/or finishes, membarrier() does smp_mb() and,
> + * if not, then the scheduler's migration of membarrier() is a full barrier.)
> + *
> + * membarrier() skips sending an IPI only if membarrier() sees
> + * cpu_rq(cpu)->curr->mm != target mm.  The sequence of events is:
> + *
> + *           membarrier()            |          target CPU
> + * ---------------------------------------------------------------------
> + *                                   | 1. smp_mb()
> + *                                   | 2. set rq->curr->mm = other_mm
> + *                                   |    (by writing to ->curr or to ->mm)
> + * 3. smp_mb()                       |
> + * 4. read rq->curr->mm == other_mm  |
> + * 5. smp_mb()                       |
> + *                                   | 6. rq->curr->mm = target_mm
> + *                                   |    (by writing to ->curr or to ->mm)
> + *                                   | 7. smp_mb()
> + *                                   |
> + *
> + * All memory accesses on the target CPU prior to scheduling are visible
> + * to membarrier()'s caller after membarrier() returns due to steps 1, 2, 4
> + * and 5.
> + *
> + * All memory accesses by membarrier()'s caller prior to membarrier() are
> + * visible to the target CPU after scheduling due to steps 3, 4, 6, and 7.
> + *
> + * Note that tasks can change their ->mm, e.g. via kthread_use_mm().  So
> + * tasks that switch their ->mm must follow the same rules as the scheduler
> + * changing rq->curr, and the membarrier() code needs to do both dereferences
> + * carefully.
> + *
> + * GLOBAL_EXPEDITED support works the same way except that all references
> + * to rq->curr->mm are replaced with references to rq->membarrier_state.
> + *
> + *
> + * Specific examples of how this produces the documented properties of
> + * membarrier():
>  *
>  * A) Userspace thread execution after IPI vs membarrier's memory
>  *    barrier before sending the IPI
> --
> 2.33.1

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 02/23] x86/mm: Handle unlazying membarrier core sync in the arch code
  2022-01-08 16:43 ` [PATCH 02/23] x86/mm: Handle unlazying membarrier core sync in the arch code Andy Lutomirski
@ 2022-01-12 15:40   ` Mathieu Desnoyers
  0 siblings, 0 replies; 79+ messages in thread
From: Mathieu Desnoyers @ 2022-01-12 15:40 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andrew Morton, linux-mm, Nicholas Piggin, Anton Blanchard,
	Benjamin Herrenschmidt, Paul Mackerras, Randy Dunlap, linux-arch,
	x86, riel, Dave Hansen, Peter Zijlstra, Nadav Amit

----- On Jan 8, 2022, at 11:43 AM, Andy Lutomirski luto@kernel.org wrote:

> The core scheduler isn't a great place for
> membarrier_mm_sync_core_before_usermode() -- the core scheduler
> doesn't actually know whether we are lazy.  With the old code, if a
> CPU is running a membarrier-registered task, goes idle, gets unlazied
> via a TLB shootdown IPI, and switches back to the
> membarrier-registered task, it will do an unnecessary core sync.
> 
> Conveniently, x86 is the only architecture that does anything in this
> sync_core_before_usermode(), so membarrier_mm_sync_core_before_usermode()
> is a no-op on all other architectures and we can just move the code.
> 
> (I am not claiming that the SYNC_CORE code was correct before or after this
> change on any non-x86 architecture.  I merely claim that this change
> improves readability, is correct on x86, and makes no change on any other
> architecture.)
> 

Looks good to me! Thanks!

Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
> arch/x86/mm/tlb.c        | 58 +++++++++++++++++++++++++++++++---------
> include/linux/sched/mm.h | 13 ---------
> kernel/sched/core.c      | 14 +++++-----
> 3 files changed, 53 insertions(+), 32 deletions(-)
> 
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 59ba2968af1b..1ae15172885e 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -9,6 +9,7 @@
> #include <linux/cpu.h>
> #include <linux/debugfs.h>
> #include <linux/sched/smt.h>
> +#include <linux/sched/mm.h>
> 
> #include <asm/tlbflush.h>
> #include <asm/mmu_context.h>
> @@ -485,6 +486,15 @@ void cr4_update_pce(void *ignored)
> static inline void cr4_update_pce_mm(struct mm_struct *mm) { }
> #endif
> 
> +static void sync_core_if_membarrier_enabled(struct mm_struct *next)
> +{
> +#ifdef CONFIG_MEMBARRIER
> +	if (unlikely(atomic_read(&next->membarrier_state) &
> +		     MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE))
> +		sync_core_before_usermode();
> +#endif
> +}
> +
> void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
> 			struct task_struct *tsk)
> {
> @@ -539,16 +549,24 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct
> mm_struct *next,
> 		this_cpu_write(cpu_tlbstate_shared.is_lazy, false);
> 
> 	/*
> -	 * The membarrier system call requires a full memory barrier and
> -	 * core serialization before returning to user-space, after
> -	 * storing to rq->curr, when changing mm.  This is because
> -	 * membarrier() sends IPIs to all CPUs that are in the target mm
> -	 * to make them issue memory barriers.  However, if another CPU
> -	 * switches to/from the target mm concurrently with
> -	 * membarrier(), it can cause that CPU not to receive an IPI
> -	 * when it really should issue a memory barrier.  Writing to CR3
> -	 * provides that full memory barrier and core serializing
> -	 * instruction.
> +	 * membarrier() support requires that, when we change rq->curr->mm:
> +	 *
> +	 *  - If next->mm has membarrier registered, a full memory barrier
> +	 *    after writing rq->curr (or rq->curr->mm if we switched the mm
> +	 *    without switching tasks) and before returning to user mode.
> +	 *
> +	 *  - If next->mm has SYNC_CORE registered, then we sync core before
> +	 *    returning to user mode.
> +	 *
> +	 * In the case where prev->mm == next->mm, membarrier() uses an IPI
> +	 * instead, and no particular barriers are needed while context
> +	 * switching.
> +	 *
> +	 * x86 gets all of this as a side-effect of writing to CR3 except
> +	 * in the case where we unlazy without flushing.
> +	 *
> +	 * All other architectures are civilized and do all of this implicitly
> +	 * when transitioning from kernel to user mode.
> 	 */
> 	if (real_prev == next) {
> 		VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
> @@ -566,7 +584,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct
> mm_struct *next,
> 		/*
> 		 * If the CPU is not in lazy TLB mode, we are just switching
> 		 * from one thread in a process to another thread in the same
> -		 * process. No TLB flush required.
> +		 * process. No TLB flush or membarrier() synchronization
> +		 * is required.
> 		 */
> 		if (!was_lazy)
> 			return;
> @@ -576,16 +595,31 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct
> mm_struct *next,
> 		 * If the TLB is up to date, just use it.
> 		 * The barrier synchronizes with the tlb_gen increment in
> 		 * the TLB shootdown code.
> +		 *
> +		 * As a future optimization opportunity, it's plausible
> +		 * that the x86 memory model is strong enough that this
> +		 * smp_mb() isn't needed.
> 		 */
> 		smp_mb();
> 		next_tlb_gen = atomic64_read(&next->context.tlb_gen);
> 		if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) ==
> -				next_tlb_gen)
> +		    next_tlb_gen) {
> +			/*
> +			 * We switched logical mm but we're not going to
> +			 * write to CR3.  We already did smp_mb() above,
> +			 * but membarrier() might require a sync_core()
> +			 * as well.
> +			 */
> +			sync_core_if_membarrier_enabled(next);
> +
> 			return;
> +		}
> 
> 		/*
> 		 * TLB contents went out of date while we were in lazy
> 		 * mode. Fall through to the TLB switching code below.
> +		 * No need for an explicit membarrier invocation -- the CR3
> +		 * write will serialize.
> 		 */
> 		new_asid = prev_asid;
> 		need_flush = true;
> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> index 5561486fddef..c256a7fc0423 100644
> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -345,16 +345,6 @@ enum {
> #include <asm/membarrier.h>
> #endif
> 
> -static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct
> *mm)
> -{
> -	if (current->mm != mm)
> -		return;
> -	if (likely(!(atomic_read(&mm->membarrier_state) &
> -		     MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE)))
> -		return;
> -	sync_core_before_usermode();
> -}
> -
> extern void membarrier_exec_mmap(struct mm_struct *mm);
> 
> extern void membarrier_update_current_mm(struct mm_struct *next_mm);
> @@ -370,9 +360,6 @@ static inline void membarrier_arch_switch_mm(struct
> mm_struct *prev,
> static inline void membarrier_exec_mmap(struct mm_struct *mm)
> {
> }
> -static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct
> *mm)
> -{
> -}
> static inline void membarrier_update_current_mm(struct mm_struct *next_mm)
> {
> }
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index f21714ea3db8..6a1db8264c7b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4822,22 +4822,22 @@ static struct rq *finish_task_switch(struct task_struct
> *prev)
> 	kmap_local_sched_in();
> 
> 	fire_sched_in_preempt_notifiers(current);
> +
> 	/*
> 	 * When switching through a kernel thread, the loop in
> 	 * membarrier_{private,global}_expedited() may have observed that
> 	 * kernel thread and not issued an IPI. It is therefore possible to
> 	 * schedule between user->kernel->user threads without passing though
> 	 * switch_mm(). Membarrier requires a barrier after storing to
> -	 * rq->curr, before returning to userspace, so provide them here:
> +	 * rq->curr, before returning to userspace, and mmdrop() provides
> +	 * this barrier.
> 	 *
> -	 * - a full memory barrier for {PRIVATE,GLOBAL}_EXPEDITED, implicitly
> -	 *   provided by mmdrop(),
> -	 * - a sync_core for SYNC_CORE.
> +	 * If an architecture needs to take a specific action for
> +	 * SYNC_CORE, it can do so in switch_mm_irqs_off().
> 	 */
> -	if (mm) {
> -		membarrier_mm_sync_core_before_usermode(mm);
> +	if (mm)
> 		mmdrop(mm);
> -	}
> +
> 	if (unlikely(prev_state == TASK_DEAD)) {
> 		if (prev->sched_class->task_dead)
> 			prev->sched_class->task_dead(prev);
> --
> 2.33.1

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 04/23] membarrier: Make the post-switch-mm barrier explicit
  2022-01-08 16:43 ` [PATCH 04/23] membarrier: Make the post-switch-mm barrier explicit Andy Lutomirski
@ 2022-01-12 15:52   ` Mathieu Desnoyers
  0 siblings, 0 replies; 79+ messages in thread
From: Mathieu Desnoyers @ 2022-01-12 15:52 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andrew Morton, linux-mm, Nicholas Piggin, Anton Blanchard,
	Benjamin Herrenschmidt, Paul Mackerras, Randy Dunlap, linux-arch,
	x86, riel, Dave Hansen, Peter Zijlstra, Nadav Amit

----- On Jan 8, 2022, at 11:43 AM, Andy Lutomirski luto@kernel.org wrote:

> membarrier() needs a barrier after any CPU changes mm.  There is currently
> a comment explaining why this barrier probably exists in all cases. The
> logic is based on ensuring that the barrier exists on every control flow
> path through the scheduler.  It also relies on mmgrab() and mmdrop() being
> full barriers.
> 
> mmgrab() and mmdrop() would be better if they were not full barriers.  As a
> trivial optimization, mmgrab() could use a relaxed atomic and mmdrop()
> could use a release on architectures that have these operations.  Larger
> optimizations are also in the works.  Doing any of these optimizations
> while preserving an unnecessary barrier will complicate the code and
> penalize non-membarrier-using tasks.
> 
> Simplify the logic by adding an explicit barrier, and allow architectures
> to override it as an optimization if they want to.
> 
> One of the deleted comments in this patch said "It is therefore
> possible to schedule between user->kernel->user threads without
> passing through switch_mm()".  It is possible to do this without, say,
> writing to CR3 on x86, but the core scheduler indeed calls
> switch_mm_irqs_off() to tell the arch code to go back from lazy mode
> to no-lazy mode.
> 
> The membarrier_finish_switch_mm() call in exec_mmap() is a no-op so long as
> there is no way for a newly execed program to register for membarrier prior
> to running user code.  Subsequent patches will merge the exec_mmap() code
> with the kthread_use_mm() code, though, and keeping the paths consistent
> will make the result more comprehensible.

I don't agree with the approach here. Adding additional memory barrier overhead
for the sake of possible future optimization work is IMHO not an appropriate
justification for a performance regression.

However I think we can manage to allow forward progress on optimization of
mmgrab/mmdrop without hurting performance.

One possible approach would be to introduce smp_mb__{before,after}_{mmgrab,mmdrop},
which would initially be no-ops. Those could be used by membarrier without
introducing any overhead, and would allow to gradually move the implicit barriers
from mmgrab/mmdrop to smp_mb__{before,after}_{mmgrab,mmdrop} on a per-architecture
basis.
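
Concretely, something along these lines (sketch only; none of these
helpers exist yet, the names just follow the proposal above):

/* Default: mmgrab()/mmdrop() already imply full barriers, so no-ops. */
#ifndef smp_mb__before_mmgrab
#define smp_mb__before_mmgrab()		do { } while (0)
#endif
#ifndef smp_mb__after_mmdrop
#define smp_mb__after_mmdrop()		do { } while (0)
#endif

/* membarrier-sensitive path, e.g. in finish_task_switch(): */
if (mm) {
	mmdrop(mm);
	smp_mb__after_mmdrop();		/* ordering membarrier depends on */
}

An architecture that later relaxes mmdrop() to a release (or elides it
for lazy mms) would then define smp_mb__after_mmdrop() as smp_mb().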

Thanks,

Mathieu


> 
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
> fs/exec.c                |  1 +
> include/linux/sched/mm.h | 18 ++++++++++++++++++
> kernel/kthread.c         | 12 +-----------
> kernel/sched/core.c      | 34 +++++++++-------------------------
> 4 files changed, 29 insertions(+), 36 deletions(-)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index a098c133d8d7..3abbd0294e73 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1019,6 +1019,7 @@ static int exec_mmap(struct mm_struct *mm)
> 	activate_mm(active_mm, mm);
> 	if (IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
> 		local_irq_enable();
> +	membarrier_finish_switch_mm(mm);
> 	tsk->mm->vmacache_seqnum = 0;
> 	vmacache_flush(tsk);
> 	task_unlock(tsk);
> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> index 0df706c099e5..e8919995d8dd 100644
> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -349,6 +349,20 @@ extern void membarrier_exec_mmap(struct mm_struct *mm);
> 
> extern void membarrier_update_current_mm(struct mm_struct *next_mm);
> 
> +/*
> + * Called by the core scheduler after calling switch_mm_irqs_off().
> + * Architectures that have implicit barriers when switching mms can
> + * override this as an optimization.
> + */
> +#ifndef membarrier_finish_switch_mm
> +static inline void membarrier_finish_switch_mm(struct mm_struct *mm)
> +{
> +	if (atomic_read(&mm->membarrier_state) &
> +	    (MEMBARRIER_STATE_GLOBAL_EXPEDITED | MEMBARRIER_STATE_PRIVATE_EXPEDITED))
> +		smp_mb();
> +}
> +#endif
> +
> #else
> static inline void membarrier_exec_mmap(struct mm_struct *mm)
> {
> @@ -356,6 +370,10 @@ static inline void membarrier_exec_mmap(struct mm_struct *mm)
> static inline void membarrier_update_current_mm(struct mm_struct *next_mm)
> {
> }
> +static inline void membarrier_finish_switch_mm(struct mm_struct *mm)
> +{
> +}
> +
> #endif
> 
> #endif /* _LINUX_SCHED_MM_H */
> diff --git a/kernel/kthread.c b/kernel/kthread.c
> index 5b37a8567168..396ae78a1a34 100644
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -1361,25 +1361,15 @@ void kthread_use_mm(struct mm_struct *mm)
> 	tsk->mm = mm;
> 	membarrier_update_current_mm(mm);
> 	switch_mm_irqs_off(active_mm, mm, tsk);
> +	membarrier_finish_switch_mm(mm);
> 	local_irq_enable();
> 	task_unlock(tsk);
> #ifdef finish_arch_post_lock_switch
> 	finish_arch_post_lock_switch();
> #endif
> 
> -	/*
> -	 * When a kthread starts operating on an address space, the loop
> -	 * in membarrier_{private,global}_expedited() may not observe
> -	 * that tsk->mm, and not issue an IPI. Membarrier requires a
> -	 * memory barrier after storing to tsk->mm, before accessing
> -	 * user-space memory. A full memory barrier for membarrier
> -	 * {PRIVATE,GLOBAL}_EXPEDITED is implicitly provided by
> -	 * mmdrop(), or explicitly with smp_mb().
> -	 */
> 	if (active_mm != mm)
> 		mmdrop(active_mm);
> -	else
> -		smp_mb();
> 
> 	to_kthread(tsk)->oldfs = force_uaccess_begin();
> }
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 6a1db8264c7b..917068b0a145 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4824,14 +4824,6 @@ static struct rq *finish_task_switch(struct task_struct *prev)
> 	fire_sched_in_preempt_notifiers(current);
> 
> 	/*
> -	 * When switching through a kernel thread, the loop in
> -	 * membarrier_{private,global}_expedited() may have observed that
> -	 * kernel thread and not issued an IPI. It is therefore possible to
> -	 * schedule between user->kernel->user threads without passing though
> -	 * switch_mm(). Membarrier requires a barrier after storing to
> -	 * rq->curr, before returning to userspace, and mmdrop() provides
> -	 * this barrier.
> -	 *
> 	 * If an architecture needs to take a specific action for
> 	 * SYNC_CORE, it can do so in switch_mm_irqs_off().
> 	 */
> @@ -4915,15 +4907,14 @@ context_switch(struct rq *rq, struct task_struct *prev,
> 			prev->active_mm = NULL;
> 	} else {                                        // to user
> 		membarrier_switch_mm(rq, prev->active_mm, next->mm);
> +		switch_mm_irqs_off(prev->active_mm, next->mm, next);
> +
> 		/*
> 		 * sys_membarrier() requires an smp_mb() between setting
> -		 * rq->curr / membarrier_switch_mm() and returning to userspace.
> -		 *
> -		 * The below provides this either through switch_mm(), or in
> -		 * case 'prev->active_mm == next->mm' through
> -		 * finish_task_switch()'s mmdrop().
> +		 * rq->curr->mm to a membarrier-enabled mm and returning
> +		 * to userspace.
> 		 */
> -		switch_mm_irqs_off(prev->active_mm, next->mm, next);
> +		membarrier_finish_switch_mm(next->mm);
> 
> 		if (!prev->mm) {                        // from kernel
> 			/* will mmdrop() in finish_task_switch(). */
> @@ -6264,17 +6255,10 @@ static void __sched notrace __schedule(unsigned int sched_mode)
> 		RCU_INIT_POINTER(rq->curr, next);
> 		/*
> 		 * The membarrier system call requires each architecture
> -		 * to have a full memory barrier after updating
> -		 * rq->curr, before returning to user-space.
> -		 *
> -		 * Here are the schemes providing that barrier on the
> -		 * various architectures:
> -		 * - mm ? switch_mm() : mmdrop() for x86, s390, sparc, PowerPC.
> -		 *   switch_mm() rely on membarrier_arch_switch_mm() on PowerPC.
> -		 * - finish_lock_switch() for weakly-ordered
> -		 *   architectures where spin_unlock is a full barrier,
> -		 * - switch_to() for arm64 (weakly-ordered, spin_unlock
> -		 *   is a RELEASE barrier),
> +		 * to have a full memory barrier before and after updating
> +		 * rq->curr->mm, before returning to userspace.  This
> +		 * is provided by membarrier_finish_switch_mm().  Architectures
> +		 * that want to optimize this can override that function.
> 		 */
> 		++*switch_count;
> 
> --
> 2.33.1

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 05/23] membarrier, kthread: Use _ONCE accessors for task->mm
  2022-01-08 16:43 ` [PATCH 05/23] membarrier, kthread: Use _ONCE accessors for task->mm Andy Lutomirski
@ 2022-01-12 15:55   ` Mathieu Desnoyers
  0 siblings, 0 replies; 79+ messages in thread
From: Mathieu Desnoyers @ 2022-01-12 15:55 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andrew Morton, linux-mm, Nicholas Piggin, Anton Blanchard,
	Benjamin Herrenschmidt, Paul Mackerras, Randy Dunlap, linux-arch,
	x86, riel, Dave Hansen, Peter Zijlstra, Nadav Amit

----- On Jan 8, 2022, at 11:43 AM, Andy Lutomirski luto@kernel.org wrote:

> membarrier reads cpu_rq(remote cpu)->curr->mm without locking.  Use
> READ_ONCE() and WRITE_ONCE() to remove the data races.
> 

Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Acked-by: Nicholas Piggin <npiggin@gmail.com>
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
> fs/exec.c                 | 2 +-
> kernel/exit.c             | 2 +-
> kernel/kthread.c          | 4 ++--
> kernel/sched/membarrier.c | 7 ++++---
> 4 files changed, 8 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index 3abbd0294e73..38b05e01c5bd 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1006,7 +1006,7 @@ static int exec_mmap(struct mm_struct *mm)
> 	local_irq_disable();
> 	active_mm = tsk->active_mm;
> 	tsk->active_mm = mm;
> -	tsk->mm = mm;
> +	WRITE_ONCE(tsk->mm, mm);  /* membarrier reads this without locks */
> 	/*
> 	 * This prevents preemption while active_mm is being loaded and
> 	 * it and mm are being updated, which could cause problems for
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 91a43e57a32e..70f2cbc42015 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -491,7 +491,7 @@ static void exit_mm(void)
> 	 */
> 	smp_mb__after_spinlock();
> 	local_irq_disable();
> -	current->mm = NULL;
> +	WRITE_ONCE(current->mm, NULL);
> 	membarrier_update_current_mm(NULL);
> 	enter_lazy_tlb(mm, current);
> 	local_irq_enable();
> diff --git a/kernel/kthread.c b/kernel/kthread.c
> index 396ae78a1a34..3b18329f885c 100644
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -1358,7 +1358,7 @@ void kthread_use_mm(struct mm_struct *mm)
> 		mmgrab(mm);
> 		tsk->active_mm = mm;
> 	}
> -	tsk->mm = mm;
> +	WRITE_ONCE(tsk->mm, mm);  /* membarrier reads this without locks */
> 	membarrier_update_current_mm(mm);
> 	switch_mm_irqs_off(active_mm, mm, tsk);
> 	membarrier_finish_switch_mm(mm);
> @@ -1399,7 +1399,7 @@ void kthread_unuse_mm(struct mm_struct *mm)
> 	smp_mb__after_spinlock();
> 	sync_mm_rss(mm);
> 	local_irq_disable();
> -	tsk->mm = NULL;
> +	WRITE_ONCE(tsk->mm, NULL);  /* membarrier reads this without locks */
> 	membarrier_update_current_mm(NULL);
> 	/* active_mm is still 'mm' */
> 	enter_lazy_tlb(mm, tsk);
> diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
> index 30e964b9689d..327830f89c37 100644
> --- a/kernel/sched/membarrier.c
> +++ b/kernel/sched/membarrier.c
> @@ -411,7 +411,7 @@ static int membarrier_private_expedited(int flags, int cpu_id)
> 			goto out;
> 		rcu_read_lock();
> 		p = rcu_dereference(cpu_rq(cpu_id)->curr);
> -		if (!p || p->mm != mm) {
> +		if (!p || READ_ONCE(p->mm) != mm) {
> 			rcu_read_unlock();
> 			goto out;
> 		}
> @@ -424,7 +424,7 @@ static int membarrier_private_expedited(int flags, int cpu_id)
> 			struct task_struct *p;
> 
> 			p = rcu_dereference(cpu_rq(cpu)->curr);
> -			if (p && p->mm == mm)
> +			if (p && READ_ONCE(p->mm) == mm)
> 				__cpumask_set_cpu(cpu, tmpmask);
> 		}
> 		rcu_read_unlock();
> @@ -522,7 +522,8 @@ static int sync_runqueues_membarrier_state(struct mm_struct *mm)
> 		struct task_struct *p;
> 
> 		p = rcu_dereference(rq->curr);
> -		if (p && p->mm == mm)
> +		/* exec and kthread_use_mm() write ->mm without locks */
> +		if (p && READ_ONCE(p->mm) == mm)
> 			__cpumask_set_cpu(cpu, tmpmask);
> 	}
> 	rcu_read_unlock();
> --
> 2.33.1

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 06/23] powerpc/membarrier: Remove special barrier on mm switch
  2022-01-08 16:43   ` Andy Lutomirski
@ 2022-01-12 15:57     ` Mathieu Desnoyers
  -1 siblings, 0 replies; 79+ messages in thread
From: Mathieu Desnoyers @ 2022-01-12 15:57 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andrew Morton, linux-mm, Nicholas Piggin, Anton Blanchard,
	Benjamin Herrenschmidt, Paul Mackerras, Randy Dunlap, linux-arch,
	x86, riel, Dave Hansen, Peter Zijlstra, Nadav Amit,
	Michael Ellerman, Paul Mackerras, linuxppc-dev

----- On Jan 8, 2022, at 11:43 AM, Andy Lutomirski luto@kernel.org wrote:

> powerpc did the following on some, but not all, paths through
> switch_mm_irqs_off():
> 
>       /*
>        * Only need the full barrier when switching between processes.
>        * Barrier when switching from kernel to userspace is not
>        * required here, given that it is implied by mmdrop(). Barrier
>        * when switching from userspace to kernel is not needed after
>        * store to rq->curr.
>        */
>       if (likely(!(atomic_read(&next->membarrier_state) &
>                    (MEMBARRIER_STATE_PRIVATE_EXPEDITED |
>                     MEMBARRIER_STATE_GLOBAL_EXPEDITED)) || !prev))
>               return;
> 
> This is puzzling: if !prev, then one might expect that we are switching
> from kernel to user, not user to kernel, which is inconsistent with the
> comment.  But this is all nonsense, because the one and only caller would
> never have prev == NULL and would, in fact, OOPS if prev == NULL.
> 
> In any event, this code is unnecessary, since the new generic
> membarrier_finish_switch_mm() provides the same barrier without arch help.
> 
> arch/powerpc/include/asm/membarrier.h remains as an empty header,
> because a later patch in this series will add code to it.

My disagreement with "membarrier: Make the post-switch-mm barrier explicit"
may affect this patch significantly, or even make it irrelevant.

Thanks,

Mathieu

> 
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Paul Mackerras <paulus@samba.org>
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
> arch/powerpc/include/asm/membarrier.h | 24 ------------------------
> arch/powerpc/mm/mmu_context.c         |  1 -
> 2 files changed, 25 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/membarrier.h b/arch/powerpc/include/asm/membarrier.h
> index de7f79157918..b90766e95bd1 100644
> --- a/arch/powerpc/include/asm/membarrier.h
> +++ b/arch/powerpc/include/asm/membarrier.h
> @@ -1,28 +1,4 @@
> #ifndef _ASM_POWERPC_MEMBARRIER_H
> #define _ASM_POWERPC_MEMBARRIER_H
> 
> -static inline void membarrier_arch_switch_mm(struct mm_struct *prev,
> -					     struct mm_struct *next,
> -					     struct task_struct *tsk)
> -{
> -	/*
> -	 * Only need the full barrier when switching between processes.
> -	 * Barrier when switching from kernel to userspace is not
> -	 * required here, given that it is implied by mmdrop(). Barrier
> -	 * when switching from userspace to kernel is not needed after
> -	 * store to rq->curr.
> -	 */
> -	if (IS_ENABLED(CONFIG_SMP) &&
> -	    likely(!(atomic_read(&next->membarrier_state) &
> -		     (MEMBARRIER_STATE_PRIVATE_EXPEDITED |
> -		      MEMBARRIER_STATE_GLOBAL_EXPEDITED)) || !prev))
> -		return;
> -
> -	/*
> -	 * The membarrier system call requires a full memory barrier
> -	 * after storing to rq->curr, before going back to user-space.
> -	 */
> -	smp_mb();
> -}
> -
> #endif /* _ASM_POWERPC_MEMBARRIER_H */
> diff --git a/arch/powerpc/mm/mmu_context.c b/arch/powerpc/mm/mmu_context.c
> index 74246536b832..5f2daa6b0497 100644
> --- a/arch/powerpc/mm/mmu_context.c
> +++ b/arch/powerpc/mm/mmu_context.c
> @@ -84,7 +84,6 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
> 		asm volatile ("dssall");
> 
> 	if (!new_on_cpu)
> -		membarrier_arch_switch_mm(prev, next, tsk);
> 
> 	/*
> 	 * The actual HW switching method differs between the various
> --
> 2.33.1

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 07/23] membarrier: Rewrite sync_core_before_usermode() and improve documentation
  2022-01-08 16:43   ` Andy Lutomirski
  (?)
@ 2022-01-12 16:11     ` Mathieu Desnoyers
  -1 siblings, 0 replies; 79+ messages in thread
From: Mathieu Desnoyers @ 2022-01-12 16:11 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andrew Morton, linux-mm, Nicholas Piggin, Anton Blanchard,
	Benjamin Herrenschmidt, Paul Mackerras, Randy Dunlap, linux-arch,
	x86, riel, Dave Hansen, Peter Zijlstra, Nadav Amit,
	Michael Ellerman, Paul Mackerras, linuxppc-dev, Catalin Marinas,
	Will Deacon, linux-arm-kernel, stable

----- On Jan 8, 2022, at 11:43 AM, Andy Lutomirski luto@kernel.org wrote:

> The old sync_core_before_usermode() comments suggested that a
> non-icache-syncing return-to-usermode instruction is x86-specific and that
> all other architectures automatically notice cross-modified code on return
> to userspace.
> 
> This is misleading.  The incantation needed to modify code from one
> CPU and execute it on another CPU is highly architecture dependent.
> On x86, according to the SDM, one must modify the code, issue SFENCE
> if the modification was WC or nontemporal, and then issue a "serializing
> instruction" on the CPU that will execute the code.  membarrier() can do
> the latter.
> 
> On arm, arm64 and powerpc, one must flush the icache and then flush the
> pipeline on the target CPU, although the CPU manuals don't necessarily use
> this language.
> 
> So let's drop any pretense that we can have a generic way to define or
> implement membarrier's SYNC_CORE operation and instead require all
> architectures to define the helper and supply their own documentation as to
> how to use it.  This means x86, arm64, and powerpc for now.  Let's also
> rename the function from sync_core_before_usermode() to
> membarrier_sync_core_before_usermode() because the precise flushing details
> may very well be specific to membarrier, and even the concept of
> "sync_core" in the kernel is mostly an x86-ism.
> 
> (It may well be the case that, on real x86 processors, synchronizing the
> icache (which requires no action at all) and "flushing the pipeline" is
> sufficient, but trying to use this language would be confusing at best.
> LFENCE does something awfully like "flushing the pipeline", but the SDM
> does not permit LFENCE as an alternative to a "serializing instruction"
> for this purpose.)
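
As an illustrative aside (not part of the patch), the user-space half of
that x86 sequence maps onto the existing membarrier() commands roughly as
follows; the wrapper and helper names here are made up, and error handling
is omitted:

#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

static int membarrier(int cmd, unsigned int flags, int cpu_id)
{
	return syscall(__NR_membarrier, cmd, flags, cpu_id);
}

static void jit_init(void)
{
	/* Register once per process before relying on SYNC_CORE. */
	membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE, 0, 0);
}

static void jit_publish_code(void)
{
	/*
	 * The new instructions have already been written (preceded by
	 * sfence if the stores were WC/non-temporal).  Ask the kernel to
	 * issue a serializing instruction on every CPU running this mm
	 * before it returns to user mode.
	 */
	membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, 0, 0);
}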

A few comments below:

[...]

> +# On powerpc, a program can use MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE
> +# similarly to arm64.  It would be nice if the powerpc maintainers could
> +# add a more clear explanation.

Any thoughts from ppc maintainers ?

[...]

> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> index e9da3dc71254..b47cd22b2eb1 100644
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -17,7 +17,7 @@
> #include <linux/kprobes.h>
> #include <linux/mmu_context.h>
> #include <linux/bsearch.h>
> -#include <linux/sync_core.h>
> +#include <asm/sync_core.h>

All this churn wrt move from linux/sync_core.h to asm/sync_core.h
should probably be moved to a separate cleanup patch.

> #include <asm/text-patching.h>
> #include <asm/alternative.h>
> #include <asm/sections.h>
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index 193204aee880..a2529e09f620 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -41,12 +41,12 @@
> #include <linux/irq_work.h>
> #include <linux/export.h>
> #include <linux/set_memory.h>
> -#include <linux/sync_core.h>
> #include <linux/task_work.h>
> #include <linux/hardirq.h>
> 
> #include <asm/intel-family.h>
> #include <asm/processor.h>
> +#include <asm/sync_core.h>
> #include <asm/traps.h>
> #include <asm/tlbflush.h>
> #include <asm/mce.h>

[...]

> diff --git a/drivers/misc/sgi-gru/grufault.c b/drivers/misc/sgi-gru/grufault.c
> index d7ef61e602ed..462c667bd6c4 100644
> --- a/drivers/misc/sgi-gru/grufault.c
> +++ b/drivers/misc/sgi-gru/grufault.c
> @@ -20,8 +20,8 @@
> #include <linux/io.h>
> #include <linux/uaccess.h>
> #include <linux/security.h>
> -#include <linux/sync_core.h>
> #include <linux/prefetch.h>
> +#include <asm/sync_core.h>
> #include "gru.h"
> #include "grutables.h"
> #include "grulib.h"
> diff --git a/drivers/misc/sgi-gru/gruhandles.c b/drivers/misc/sgi-gru/gruhandles.c
> index 1d75d5e540bc..c8cba1c1b00f 100644
> --- a/drivers/misc/sgi-gru/gruhandles.c
> +++ b/drivers/misc/sgi-gru/gruhandles.c
> @@ -16,7 +16,7 @@
> #define GRU_OPERATION_TIMEOUT	(((cycles_t) local_cpu_data->itc_freq)*10)
> #define CLKS2NSEC(c)		((c) *1000000000 / local_cpu_data->itc_freq)
> #else
> -#include <linux/sync_core.h>
> +#include <asm/sync_core.h>
> #include <asm/tsc.h>
> #define GRU_OPERATION_TIMEOUT	((cycles_t) tsc_khz*10*1000)
> #define CLKS2NSEC(c)		((c) * 1000000 / tsc_khz)
> diff --git a/drivers/misc/sgi-gru/grukservices.c b/drivers/misc/sgi-gru/grukservices.c
> index 0ea923fe6371..ce03ff3f7c3a 100644
> --- a/drivers/misc/sgi-gru/grukservices.c
> +++ b/drivers/misc/sgi-gru/grukservices.c
> @@ -16,10 +16,10 @@
> #include <linux/miscdevice.h>
> #include <linux/proc_fs.h>
> #include <linux/interrupt.h>
> -#include <linux/sync_core.h>
> #include <linux/uaccess.h>
> #include <linux/delay.h>
> #include <linux/export.h>
> +#include <asm/sync_core.h>
> #include <asm/io_apic.h>
> #include "gru.h"
> #include "grulib.h"
> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> index e8919995d8dd..e107f292fc42 100644
> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -7,7 +7,6 @@
> #include <linux/sched.h>
> #include <linux/mm_types.h>
> #include <linux/gfp.h>
> -#include <linux/sync_core.h>
> 
> /*
>  * Routines for handling mm_structs
> diff --git a/include/linux/sync_core.h b/include/linux/sync_core.h
> deleted file mode 100644
> index 013da4b8b327..000000000000
> --- a/include/linux/sync_core.h
> +++ /dev/null
> @@ -1,21 +0,0 @@
> -/* SPDX-License-Identifier: GPL-2.0 */
> -#ifndef _LINUX_SYNC_CORE_H
> -#define _LINUX_SYNC_CORE_H
> -
> -#ifdef CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
> -#include <asm/sync_core.h>
> -#else
> -/*
> - * This is a dummy sync_core_before_usermode() implementation that can be used
> - * on all architectures which return to user-space through core serializing
> - * instructions.
> - * If your architecture returns to user-space through non-core-serializing
> - * instructions, you need to write your own functions.
> - */
> -static inline void sync_core_before_usermode(void)
> -{
> -}
> -#endif
> -
> -#endif /* _LINUX_SYNC_CORE_H */
> -

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 08/23] membarrier: Remove redundant clear of mm->membarrier_state in exec_mmap()
  2022-01-08 16:43 ` [PATCH 08/23] membarrier: Remove redundant clear of mm->membarrier_state in exec_mmap() Andy Lutomirski
@ 2022-01-12 16:13   ` Mathieu Desnoyers
  0 siblings, 0 replies; 79+ messages in thread
From: Mathieu Desnoyers @ 2022-01-12 16:13 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andrew Morton, linux-mm, Nicholas Piggin, Anton Blanchard,
	Benjamin Herrenschmidt, Paul Mackerras, Randy Dunlap, linux-arch,
	x86, riel, Dave Hansen, Peter Zijlstra, Nadav Amit

----- On Jan 8, 2022, at 11:43 AM, Andy Lutomirski luto@kernel.org wrote:

> exec_mmap() supplies a brand-new mm from mm_alloc(), and membarrier_state
> is already 0.  There's no need to clear it again.

Then I suspect we might want to tweak the comment just above the memory barrier?

        /*
         * Issue a memory barrier before clearing membarrier_state to
         * guarantee that no memory access prior to exec is reordered after
         * clearing this state.
         */

Is that barrier still needed?

Thanks,

Mathieu

> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
> kernel/sched/membarrier.c | 1 -
> 1 file changed, 1 deletion(-)
> 
> diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
> index eb73eeaedc7d..c38014c2ed66 100644
> --- a/kernel/sched/membarrier.c
> +++ b/kernel/sched/membarrier.c
> @@ -285,7 +285,6 @@ void membarrier_exec_mmap(struct mm_struct *mm)
> 	 * clearing this state.
> 	 */
> 	smp_mb();
> -	atomic_set(&mm->membarrier_state, 0);
> 	/*
> 	 * Keep the runqueue membarrier_state in sync with this mm
> 	 * membarrier_state.
> --
> 2.33.1

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com




^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 09/23] membarrier: Fix incorrect barrier positions during exec and kthread_use_mm()
  2022-01-08 16:43 ` [PATCH 09/23] membarrier: Fix incorrect barrier positions during exec and kthread_use_mm() Andy Lutomirski
@ 2022-01-12 16:30   ` Mathieu Desnoyers
  2022-01-12 17:08     ` Mathieu Desnoyers
  0 siblings, 1 reply; 79+ messages in thread
From: Mathieu Desnoyers @ 2022-01-12 16:30 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andrew Morton, linux-mm, Nicholas Piggin, Anton Blanchard,
	Benjamin Herrenschmidt, Paul Mackerras, Randy Dunlap, linux-arch,
	x86, riel, Dave Hansen, Peter Zijlstra, Nadav Amit

----- On Jan 8, 2022, at 11:43 AM, Andy Lutomirski luto@kernel.org wrote:

> membarrier() requires a barrier before changes to rq->curr->mm, not just
> before writes to rq->membarrier_state.  Move the barrier in exec_mmap() to
> the right place.

I don't see anything that was technically wrong with membarrier_exec_mmap()
before this patchset. membarrier_exec_mmap() issued an smp_mb() just after
the task_lock(), and proceeded to clear the mm->membarrier_state and
runqueue membarrier state. And then the tsk->mm is set *after* the smp_mb().

So from this commit message we could be led to think there was something
wrong before, but I do not think it's true. This first part of the proposed
change is merely a performance optimization that removes a useless memory
barrier on architectures where smp_mb__after_spinlock() is a no-op, and
removes a useless store to mm->membarrier_state because it is already
zero-initialized. This is all very nice, but does not belong in a "Fix" patch.
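
For reference, a simplified sketch of the pre-series ordering described
above, reconstructed from the quoted diffs (not verbatim kernel code):

	task_lock(tsk);
	membarrier_exec_mmap(mm);	/* smp_mb(), then clear membarrier state */
	local_irq_disable();
	active_mm = tsk->active_mm;
	tsk->active_mm = mm;
	tsk->mm = mm;			/* the store to ->mm happens after the smp_mb() */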

> Add the barrier in kthread_use_mm() -- it was entirely
> missing before.

This is correct. This second part of the patch is indeed a relevant fix.

Thanks,

Mathieu

> 
> This patch makes exec_mmap() and kthread_use_mm() use the same membarrier
> hooks, which results in some code deletion.
> 
> As an added bonus, this will eliminate a redundant barrier in execve() on
> arches for which spinlock acquisition is a barrier.
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
> fs/exec.c                 |  6 +++++-
> include/linux/sched/mm.h  |  2 --
> kernel/kthread.c          |  5 +++++
> kernel/sched/membarrier.c | 15 ---------------
> 4 files changed, 10 insertions(+), 18 deletions(-)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index 38b05e01c5bd..325dab98bc51 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1001,12 +1001,16 @@ static int exec_mmap(struct mm_struct *mm)
> 	}
> 
> 	task_lock(tsk);
> -	membarrier_exec_mmap(mm);
> +	/*
> +	 * membarrier() requires a full barrier before switching mm.
> +	 */
> +	smp_mb__after_spinlock();
> 
> 	local_irq_disable();
> 	active_mm = tsk->active_mm;
> 	tsk->active_mm = mm;
> 	WRITE_ONCE(tsk->mm, mm);  /* membarrier reads this without locks */
> +	membarrier_update_current_mm(mm);
> 	/*
> 	 * This prevents preemption while active_mm is being loaded and
> 	 * it and mm are being updated, which could cause problems for
> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> index e107f292fc42..f1d2beac464c 100644
> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -344,8 +344,6 @@ enum {
> #include <asm/membarrier.h>
> #endif
> 
> -extern void membarrier_exec_mmap(struct mm_struct *mm);
> -
> extern void membarrier_update_current_mm(struct mm_struct *next_mm);
> 
> /*
> diff --git a/kernel/kthread.c b/kernel/kthread.c
> index 3b18329f885c..18b0a2e0e3b2 100644
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -1351,6 +1351,11 @@ void kthread_use_mm(struct mm_struct *mm)
> 	WARN_ON_ONCE(tsk->mm);
> 
> 	task_lock(tsk);
> +	/*
> +	 * membarrier() requires a full barrier before switching mm.
> +	 */
> +	smp_mb__after_spinlock();
> +
> 	/* Hold off tlb flush IPIs while switching mm's */
> 	local_irq_disable();
> 	active_mm = tsk->active_mm;
> diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
> index c38014c2ed66..44fafa6e1efd 100644
> --- a/kernel/sched/membarrier.c
> +++ b/kernel/sched/membarrier.c
> @@ -277,21 +277,6 @@ static void ipi_sync_rq_state(void *info)
> 	smp_mb();
> }
> 
> -void membarrier_exec_mmap(struct mm_struct *mm)
> -{
> -	/*
> -	 * Issue a memory barrier before clearing membarrier_state to
> -	 * guarantee that no memory access prior to exec is reordered after
> -	 * clearing this state.
> -	 */
> -	smp_mb();
> -	/*
> -	 * Keep the runqueue membarrier_state in sync with this mm
> -	 * membarrier_state.
> -	 */
> -	this_cpu_write(runqueues.membarrier_state, 0);
> -}
> -
> void membarrier_update_current_mm(struct mm_struct *next_mm)
> {
> 	struct rq *rq = this_rq();
> --
> 2.33.1

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 09/23] membarrier: Fix incorrect barrier positions during exec and kthread_use_mm()
  2022-01-12 16:30   ` Mathieu Desnoyers
@ 2022-01-12 17:08     ` Mathieu Desnoyers
  0 siblings, 0 replies; 79+ messages in thread
From: Mathieu Desnoyers @ 2022-01-12 17:08 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Andy Lutomirski, Andrew Morton, linux-mm, Nicholas Piggin,
	Anton Blanchard, Benjamin Herrenschmidt, Paul Mackerras,
	Randy Dunlap, linux-arch, x86, riel, Dave Hansen, Peter Zijlstra,
	Nadav Amit


----- Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> ----- On Jan 8, 2022, at 11:43 AM, Andy Lutomirski luto@kernel.org wrote:
> 
> > membarrier() requires a barrier before changes to rq->curr->mm, not just
> > before writes to rq->membarrier_state.  Move the barrier in exec_mmap() to
> > the right place.
> 
> I don't see anything that was technically wrong with membarrier_exec_mmap()
> before this patchset. membarrier_exec_mmap() issued an smp_mb() just after
> the task_lock(), and proceeded to clear the mm->membarrier_state and
> runqueue membarrier state. And then the tsk->mm is set *after* the smp_mb().
> 
> So from this commit message we could be led to think there was something
> wrong before, but I do not think it's true. This first part of the proposed
> change is merely a performance optimization that removes a useless memory
> barrier on architectures where smp_mb__after_spinlock() is a no-op, and
> removes a useless store to mm->membarrier_state because it is already
> zero-initialized. This is all very nice, but does not belong in a "Fix" patch.
> 
> > Add the barrier in kthread_use_mm() -- it was entirely
> > missing before.
> 
> This is correct. This second part of the patch is indeed a relevant fix.

However, this adds a useless barrier for CONFIG_MEMBARRIER=n.
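
One way to avoid that (a sketch only; the helper name below is made up and
is not part of the series) would be to hide the barrier behind a
membarrier-specific helper that compiles away for CONFIG_MEMBARRIER=n:

#ifdef CONFIG_MEMBARRIER
static inline void membarrier_mm_switch_barrier(void)
{
	/* membarrier() requires a full barrier before switching mm. */
	smp_mb__after_spinlock();
}
#else
static inline void membarrier_mm_switch_barrier(void)
{
}
#endif

exec_mmap() and kthread_use_mm() would then call
membarrier_mm_switch_barrier() right after task_lock() instead of
open-coding smp_mb__after_spinlock().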

Thanks,

Mathieu


> 
> Thanks,
> 
> Mathieu
> 
> > 
> > This patch makes exec_mmap() and kthread_use_mm() use the same membarrier
> > hooks, which results in some code deletion.
> > 
> > As an added bonus, this will eliminate a redundant barrier in execve() on
> > arches for which spinlock acquisition is a barrier.
> > 
> > Signed-off-by: Andy Lutomirski <luto@kernel.org>
> > ---
> > fs/exec.c                 |  6 +++++-
> > include/linux/sched/mm.h  |  2 --
> > kernel/kthread.c          |  5 +++++
> > kernel/sched/membarrier.c | 15 ---------------
> > 4 files changed, 10 insertions(+), 18 deletions(-)
> > 
> > diff --git a/fs/exec.c b/fs/exec.c
> > index 38b05e01c5bd..325dab98bc51 100644
> > --- a/fs/exec.c
> > +++ b/fs/exec.c
> > @@ -1001,12 +1001,16 @@ static int exec_mmap(struct mm_struct *mm)
> > 	}
> > 
> > 	task_lock(tsk);
> > -	membarrier_exec_mmap(mm);
> > +	/*
> > +	 * membarrier() requires a full barrier before switching mm.
> > +	 */
> > +	smp_mb__after_spinlock();
> > 
> > 	local_irq_disable();
> > 	active_mm = tsk->active_mm;
> > 	tsk->active_mm = mm;
> > 	WRITE_ONCE(tsk->mm, mm);  /* membarrier reads this without locks */
> > +	membarrier_update_current_mm(mm);
> > 	/*
> > 	 * This prevents preemption while active_mm is being loaded and
> > 	 * it and mm are being updated, which could cause problems for
> > diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> > index e107f292fc42..f1d2beac464c 100644
> > --- a/include/linux/sched/mm.h
> > +++ b/include/linux/sched/mm.h
> > @@ -344,8 +344,6 @@ enum {
> > #include <asm/membarrier.h>
> > #endif
> > 
> > -extern void membarrier_exec_mmap(struct mm_struct *mm);
> > -
> > extern void membarrier_update_current_mm(struct mm_struct *next_mm);
> > 
> > /*
> > diff --git a/kernel/kthread.c b/kernel/kthread.c
> > index 3b18329f885c..18b0a2e0e3b2 100644
> > --- a/kernel/kthread.c
> > +++ b/kernel/kthread.c
> > @@ -1351,6 +1351,11 @@ void kthread_use_mm(struct mm_struct *mm)
> > 	WARN_ON_ONCE(tsk->mm);
> > 
> > 	task_lock(tsk);
> > +	/*
> > +	 * membarrier() requires a full barrier before switching mm.
> > +	 */
> > +	smp_mb__after_spinlock();
> > +
> > 	/* Hold off tlb flush IPIs while switching mm's */
> > 	local_irq_disable();
> > 	active_mm = tsk->active_mm;
> > diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
> > index c38014c2ed66..44fafa6e1efd 100644
> > --- a/kernel/sched/membarrier.c
> > +++ b/kernel/sched/membarrier.c
> > @@ -277,21 +277,6 @@ static void ipi_sync_rq_state(void *info)
> > 	smp_mb();
> > }
> > 
> > -void membarrier_exec_mmap(struct mm_struct *mm)
> > -{
> > -	/*
> > -	 * Issue a memory barrier before clearing membarrier_state to
> > -	 * guarantee that no memory access prior to exec is reordered after
> > -	 * clearing this state.
> > -	 */
> > -	smp_mb();
> > -	/*
> > -	 * Keep the runqueue membarrier_state in sync with this mm
> > -	 * membarrier_state.
> > -	 */
> > -	this_cpu_write(runqueues.membarrier_state, 0);
> > -}
> > -
> > void membarrier_update_current_mm(struct mm_struct *next_mm)
> > {
> > 	struct rq *rq = this_rq();
> > --
> > 2.33.1
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 79+ messages in thread

end of thread, other threads:[~2022-01-12 17:08 UTC | newest]

Thread overview: 79+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-01-08 16:43 [PATCH 00/23] mm, sched: Rework lazy mm handling Andy Lutomirski
2022-01-08 16:43 ` [PATCH 01/23] membarrier: Document why membarrier() works Andy Lutomirski
2022-01-12 15:30   ` Mathieu Desnoyers
2022-01-08 16:43 ` [PATCH 02/23] x86/mm: Handle unlazying membarrier core sync in the arch code Andy Lutomirski
2022-01-12 15:40   ` Mathieu Desnoyers
2022-01-08 16:43 ` [PATCH 03/23] membarrier: Remove membarrier_arch_switch_mm() prototype in core code Andy Lutomirski
2022-01-08 16:43 ` [PATCH 04/23] membarrier: Make the post-switch-mm barrier explicit Andy Lutomirski
2022-01-12 15:52   ` Mathieu Desnoyers
2022-01-08 16:43 ` [PATCH 05/23] membarrier, kthread: Use _ONCE accessors for task->mm Andy Lutomirski
2022-01-12 15:55   ` Mathieu Desnoyers
2022-01-08 16:43 ` [PATCH 06/23] powerpc/membarrier: Remove special barrier on mm switch Andy Lutomirski
2022-01-08 16:43   ` Andy Lutomirski
2022-01-10  8:42   ` Christophe Leroy
2022-01-10  8:42     ` Christophe Leroy
2022-01-12 15:57   ` Mathieu Desnoyers
2022-01-12 15:57     ` Mathieu Desnoyers
2022-01-08 16:43 ` [PATCH 07/23] membarrier: Rewrite sync_core_before_usermode() and improve documentation Andy Lutomirski
2022-01-08 16:43   ` Andy Lutomirski
2022-01-08 16:43   ` Andy Lutomirski
2022-01-12 16:11   ` Mathieu Desnoyers
2022-01-12 16:11     ` Mathieu Desnoyers
2022-01-12 16:11     ` Mathieu Desnoyers
2022-01-08 16:43 ` [PATCH 08/23] membarrier: Remove redundant clear of mm->membarrier_state in exec_mmap() Andy Lutomirski
2022-01-12 16:13   ` Mathieu Desnoyers
2022-01-08 16:43 ` [PATCH 09/23] membarrier: Fix incorrect barrier positions during exec and kthread_use_mm() Andy Lutomirski
2022-01-12 16:30   ` Mathieu Desnoyers
2022-01-12 17:08     ` Mathieu Desnoyers
2022-01-08 16:43 ` [PATCH 10/23] x86/events, x86/insn-eval: Remove incorrect active_mm references Andy Lutomirski
2022-01-08 16:43 ` [PATCH 11/23] sched/scs: Initialize shadow stack on idle thread bringup, not shutdown Andy Lutomirski
2022-01-10 22:06   ` Sami Tolvanen
2022-01-08 16:43 ` [PATCH 12/23] Rework "sched/core: Fix illegal RCU from offline CPUs" Andy Lutomirski
2022-01-08 16:43 ` [PATCH 13/23] exec: Remove unnecessary vmacache_seqnum clear in exec_mmap() Andy Lutomirski
2022-01-08 16:43 ` [PATCH 14/23] sched, exec: Factor current mm changes out from exec Andy Lutomirski
2022-01-08 16:44 ` [PATCH 15/23] kthread: Switch to __change_current_mm() Andy Lutomirski
2022-01-08 16:44 ` [PATCH 16/23] sched: Use lightweight hazard pointers to grab lazy mms Andy Lutomirski
2022-01-08 19:22   ` Linus Torvalds
2022-01-08 22:04     ` Andy Lutomirski
2022-01-09  0:27       ` Linus Torvalds
2022-01-09  0:53       ` Linus Torvalds
2022-01-09  3:58         ` Andy Lutomirski
2022-01-09  4:38           ` Linus Torvalds
2022-01-09 20:19             ` Andy Lutomirski
2022-01-09 20:48               ` Linus Torvalds
2022-01-09 21:51                 ` Linus Torvalds
2022-01-10  0:52                   ` Andy Lutomirski
2022-01-10  2:36                     ` Rik van Riel
2022-01-10  3:51                       ` Linus Torvalds
2022-01-10  4:56                   ` Nicholas Piggin
2022-01-10  5:17                     ` Nicholas Piggin
2022-01-10 17:19                       ` Linus Torvalds
2022-01-11  2:24                         ` Nicholas Piggin
2022-01-10 20:52                     ` Andy Lutomirski
2022-01-11  3:10                       ` Nicholas Piggin
2022-01-11 15:39                         ` Andy Lutomirski
2022-01-11 22:48                           ` Nicholas Piggin
2022-01-12  0:42                             ` Nicholas Piggin
2022-01-11 10:39                 ` Will Deacon
2022-01-11 15:22                   ` Andy Lutomirski
2022-01-09  5:56   ` Nadav Amit
2022-01-09  6:48     ` Linus Torvalds
2022-01-09  8:49       ` Nadav Amit
2022-01-09 19:10         ` Linus Torvalds
2022-01-09 19:52           ` Andy Lutomirski
2022-01-09 20:00             ` Linus Torvalds
2022-01-09 20:34             ` Nadav Amit
2022-01-09 20:48               ` Andy Lutomirski
2022-01-09 19:22         ` Rik van Riel
2022-01-09 19:34           ` Nadav Amit
2022-01-09 19:37             ` Rik van Riel
2022-01-09 19:51               ` Nadav Amit
2022-01-09 19:54                 ` Linus Torvalds
2022-01-08 16:44 ` [PATCH 17/23] x86/mm: Make use/unuse_temporary_mm() non-static Andy Lutomirski
2022-01-08 16:44 ` [PATCH 18/23] x86/mm: Allow temporary mms when IRQs are on Andy Lutomirski
2022-01-08 16:44 ` [PATCH 19/23] x86/efi: Make efi_enter/leave_mm use the temporary_mm machinery Andy Lutomirski
2022-01-10 13:13   ` Ard Biesheuvel
2022-01-08 16:44 ` [PATCH 20/23] x86/mm: Remove leave_mm() in favor of unlazy_mm_irqs_off() Andy Lutomirski
2022-01-08 16:44 ` [PATCH 21/23] x86/mm: Use unlazy_mm_irqs_off() in TLB flush IPIs Andy Lutomirski
2022-01-08 16:44 ` [PATCH 22/23] x86/mm: Optimize for_each_possible_lazymm_cpu() Andy Lutomirski
2022-01-08 16:44 ` [PATCH 23/23] x86/mm: Opt in to IRQs-off activate_mm() Andy Lutomirski
