linuxppc-dev.lists.ozlabs.org archive mirror
* [PATCH v7 0/5] shoot lazy tlbs (lazy tlb refcount scalability improvement)
@ 2023-02-03  7:18 Nicholas Piggin
  2023-02-03  7:18 ` [PATCH v7 1/5] kthread: simplify kthread_use_mm refcounting Nicholas Piggin
                   ` (4 more replies)
  0 siblings, 5 replies; 9+ messages in thread
From: Nicholas Piggin @ 2023-02-03  7:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-arch, Rik van Riel, Will Deacon, Peter Zijlstra,
	Linus Torvalds, Dave Hansen, linuxppc-dev, Nicholas Piggin,
	linux-mm, Andy Lutomirski, Catalin Marinas, Nadav Amit

(Sorry about the double send)

Hi Andrew,

This series improves scalability of context switching between user and
kernel threads on large systems with a threaded process spread across a
lot of CPUs.

Please consider these patches for mm. Discussion of v6 here:

https://lore.kernel.org/linux-mm/20230118080011.2258375-1-npiggin@gmail.com/

No objections so far; Linus thinks they look okay in principle but has
not reviewed them in detail.

With the exception of patch 1, there should be no functional change
on non-powerpc archs with this series.

Changes since v6:
- Dropped the final patch that optimises powerpc further; as mentioned,
  this will be taken through the powerpc tree after the base series is
  upstream.
- Split the first patch into patches 1 and 2 in this series so the
  functional change is isolated to a minimal patch.
- Removed ifdefs and churn from sched/core.c that were not required,
  because the ifdefs in the .h refcount functions do the same job.
- Split the DEBUG_VM option out into its own sub-option because it IPIs
  all CPUs on every process exit, which is pretty heavy.
- Changed comment style as noted by Nadav.
- Added description about how to test it, requested by Linus.
- Added link and credit to Rik's earlier work in the same vein.
- Did a pass over comments and changelogs to improve readability.

Nicholas Piggin (5):
  kthread: simplify kthread_use_mm refcounting
  lazy tlb: introduce lazy tlb mm refcount helper functions
  lazy tlb: allow lazy tlb mm refcounting to be configurable
  lazy tlb: shoot lazies, non-refcounting lazy tlb mm reference handling
    scheme
  powerpc/64s: enable MMU_LAZY_TLB_SHOOTDOWN

 Documentation/mm/active_mm.rst       |  6 +++
 arch/Kconfig                         | 32 ++++++++++++++
 arch/arm/mach-rpc/ecard.c            |  2 +-
 arch/powerpc/Kconfig                 |  1 +
 arch/powerpc/kernel/smp.c            |  2 +-
 arch/powerpc/mm/book3s64/radix_tlb.c |  4 +-
 fs/exec.c                            |  2 +-
 include/linux/sched/mm.h             | 28 ++++++++++++
 kernel/cpu.c                         |  2 +-
 kernel/exit.c                        |  2 +-
 kernel/fork.c                        | 65 ++++++++++++++++++++++++++++
 kernel/kthread.c                     | 22 ++++++----
 kernel/sched/core.c                  | 15 ++++---
 lib/Kconfig.debug                    | 10 +++++
 14 files changed, 170 insertions(+), 23 deletions(-)

-- 
2.37.2



* [PATCH v7 1/5] kthread: simplify kthread_use_mm refcounting
  2023-02-03  7:18 [PATCH v7 0/5] shoot lazy tlbs (lazy tlb refcount scalability improvement) Nicholas Piggin
@ 2023-02-03  7:18 ` Nicholas Piggin
  2023-02-03  7:18 ` [PATCH v7 2/5] lazy tlb: introduce lazy tlb mm refcount helper functions Nicholas Piggin
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 9+ messages in thread
From: Nicholas Piggin @ 2023-02-03  7:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-arch, Rik van Riel, Will Deacon, Peter Zijlstra,
	Linus Torvalds, Dave Hansen, linuxppc-dev, Nicholas Piggin,
	linux-mm, Andy Lutomirski, Catalin Marinas, Nadav Amit

Remove the special case avoiding refcounting when the mm to be used is
the same as the kernel thread's active (lazy tlb) mm. kthread_use_mm()
should not be such a performance-critical path that this matters much.
This simplifies a later change to lazy tlb mm refcounting.
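
For context, the usual calling pattern looks something like the sketch
below (worker_process_batch, work_batch, batch_has_work() and
do_one_item() are made-up names for illustration, not kernel APIs): a
kernel thread adopts a user mm once, processes a batch of work against
it, then detaches, so the grab/drop pair is per attach/detach rather
than per unit of work.

/*
 * Illustrative usage pattern only; the helpers named here are
 * hypothetical, only kthread_use_mm()/kthread_unuse_mm() are real.
 */
static void worker_process_batch(struct mm_struct *mm,
				 struct work_batch *batch)
{
	kthread_use_mm(mm);		/* adopt mm once per batch */
	while (batch_has_work(batch))
		do_one_item(batch);	/* may copy_{from,to}_user() */
	kthread_unuse_mm(mm);		/* detach once per batch */
}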

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 kernel/kthread.c | 14 +++++---------
 1 file changed, 5 insertions(+), 9 deletions(-)

diff --git a/kernel/kthread.c b/kernel/kthread.c
index f97fd01a2932..7424a1839e9a 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -1410,14 +1410,13 @@ void kthread_use_mm(struct mm_struct *mm)
 	WARN_ON_ONCE(!(tsk->flags & PF_KTHREAD));
 	WARN_ON_ONCE(tsk->mm);
 
+	mmgrab(mm);
+
 	task_lock(tsk);
 	/* Hold off tlb flush IPIs while switching mm's */
 	local_irq_disable();
 	active_mm = tsk->active_mm;
-	if (active_mm != mm) {
-		mmgrab(mm);
-		tsk->active_mm = mm;
-	}
+	tsk->active_mm = mm;
 	tsk->mm = mm;
 	membarrier_update_current_mm(mm);
 	switch_mm_irqs_off(active_mm, mm, tsk);
@@ -1434,12 +1433,9 @@ void kthread_use_mm(struct mm_struct *mm)
 	 * memory barrier after storing to tsk->mm, before accessing
 	 * user-space memory. A full memory barrier for membarrier
 	 * {PRIVATE,GLOBAL}_EXPEDITED is implicitly provided by
-	 * mmdrop(), or explicitly with smp_mb().
+	 * mmdrop().
 	 */
-	if (active_mm != mm)
-		mmdrop(active_mm);
-	else
-		smp_mb();
+	mmdrop(active_mm);
 }
 EXPORT_SYMBOL_GPL(kthread_use_mm);
 
-- 
2.37.2



* [PATCH v7 2/5] lazy tlb: introduce lazy tlb mm refcount helper functions
  2023-02-03  7:18 [PATCH v7 0/5] shoot lazy tlbs (lazy tlb refcount scalability improvement) Nicholas Piggin
  2023-02-03  7:18 ` [PATCH v7 1/5] kthread: simplify kthread_use_mm refcounting Nicholas Piggin
@ 2023-02-03  7:18 ` Nicholas Piggin
  2023-02-03  7:18 ` [PATCH v7 3/5] lazy tlb: allow lazy tlb mm refcounting to be configurable Nicholas Piggin
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 9+ messages in thread
From: Nicholas Piggin @ 2023-02-03  7:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-arch, Rik van Riel, Will Deacon, Peter Zijlstra,
	Linus Torvalds, Dave Hansen, linuxppc-dev, Nicholas Piggin,
	linux-mm, Andy Lutomirski, Catalin Marinas, Nadav Amit

Add explicit _lazy_tlb annotated functions for lazy tlb mm refcounting.
This makes the lazy tlb mm references more obvious, and allows the
refcounting scheme to be modified in later changes. There is no
functional change with this patch.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/arm/mach-rpc/ecard.c            |  2 +-
 arch/powerpc/kernel/smp.c            |  2 +-
 arch/powerpc/mm/book3s64/radix_tlb.c |  4 ++--
 fs/exec.c                            |  2 +-
 include/linux/sched/mm.h             | 16 ++++++++++++++++
 kernel/cpu.c                         |  2 +-
 kernel/exit.c                        |  2 +-
 kernel/kthread.c                     | 12 ++++++++++--
 kernel/sched/core.c                  | 15 ++++++++-------
 9 files changed, 41 insertions(+), 16 deletions(-)

diff --git a/arch/arm/mach-rpc/ecard.c b/arch/arm/mach-rpc/ecard.c
index 53813f9464a2..c30df1097c52 100644
--- a/arch/arm/mach-rpc/ecard.c
+++ b/arch/arm/mach-rpc/ecard.c
@@ -253,7 +253,7 @@ static int ecard_init_mm(void)
 	current->mm = mm;
 	current->active_mm = mm;
 	activate_mm(active_mm, mm);
-	mmdrop(active_mm);
+	mmdrop_lazy_tlb(active_mm);
 	ecard_init_pgtables(mm);
 	return 0;
 }
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 6b90f10a6c81..7db6b3faea65 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1611,7 +1611,7 @@ void start_secondary(void *unused)
 	if (IS_ENABLED(CONFIG_PPC32))
 		setup_kup();
 
-	mmgrab(&init_mm);
+	mmgrab_lazy_tlb(&init_mm);
 	current->active_mm = &init_mm;
 
 	smp_store_cpu_info(cpu);
diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
index 4e29b619578c..282359ab525b 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -794,10 +794,10 @@ void exit_lazy_flush_tlb(struct mm_struct *mm, bool always_flush)
 	if (current->active_mm == mm) {
 		WARN_ON_ONCE(current->mm != NULL);
 		/* Is a kernel thread and is using mm as the lazy tlb */
-		mmgrab(&init_mm);
+		mmgrab_lazy_tlb(&init_mm);
 		current->active_mm = &init_mm;
 		switch_mm_irqs_off(mm, &init_mm, current);
-		mmdrop(mm);
+		mmdrop_lazy_tlb(mm);
 	}
 
 	/*
diff --git a/fs/exec.c b/fs/exec.c
index ab913243a367..1a32a88db173 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1033,7 +1033,7 @@ static int exec_mmap(struct mm_struct *mm)
 		mmput(old_mm);
 		return 0;
 	}
-	mmdrop(active_mm);
+	mmdrop_lazy_tlb(active_mm);
 	return 0;
 }
 
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 2a243616f222..5376caf6fcf3 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -79,6 +79,22 @@ static inline void mmdrop_sched(struct mm_struct *mm)
 }
 #endif
 
+/* Helpers for lazy TLB mm refcounting */
+static inline void mmgrab_lazy_tlb(struct mm_struct *mm)
+{
+	mmgrab(mm);
+}
+
+static inline void mmdrop_lazy_tlb(struct mm_struct *mm)
+{
+	mmdrop(mm);
+}
+
+static inline void mmdrop_lazy_tlb_sched(struct mm_struct *mm)
+{
+	mmdrop_sched(mm);
+}
+
 /**
  * mmget() - Pin the address space associated with a &struct mm_struct.
  * @mm: The address space to pin.
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 6c0a92ca6bb5..189895288d9d 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -623,7 +623,7 @@ static int finish_cpu(unsigned int cpu)
 	 */
 	if (mm != &init_mm)
 		idle->active_mm = &init_mm;
-	mmdrop(mm);
+	mmdrop_lazy_tlb(mm);
 	return 0;
 }
 
diff --git a/kernel/exit.c b/kernel/exit.c
index 15dc2ec80c46..1a4608d765e4 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -537,7 +537,7 @@ static void exit_mm(void)
 		return;
 	sync_mm_rss(mm);
 	mmap_read_lock(mm);
-	mmgrab(mm);
+	mmgrab_lazy_tlb(mm);
 	BUG_ON(mm != current->active_mm);
 	/* more a memory barrier than a real lock */
 	task_lock(current);
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 7424a1839e9a..e4bc32a88866 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -1410,6 +1410,11 @@ void kthread_use_mm(struct mm_struct *mm)
 	WARN_ON_ONCE(!(tsk->flags & PF_KTHREAD));
 	WARN_ON_ONCE(tsk->mm);
 
+	/*
+	 * It is possible for mm to be the same as tsk->active_mm, but
+	 * we must still mmgrab(mm) and mmdrop_lazy_tlb(active_mm),
+	 * because these references are not equivalent.
+	 */
 	mmgrab(mm);
 
 	task_lock(tsk);
@@ -1433,9 +1438,9 @@ void kthread_use_mm(struct mm_struct *mm)
 	 * memory barrier after storing to tsk->mm, before accessing
 	 * user-space memory. A full memory barrier for membarrier
 	 * {PRIVATE,GLOBAL}_EXPEDITED is implicitly provided by
-	 * mmdrop().
+	 * mmdrop_lazy_tlb().
 	 */
-	mmdrop(active_mm);
+	mmdrop_lazy_tlb(active_mm);
 }
 EXPORT_SYMBOL_GPL(kthread_use_mm);
 
@@ -1463,10 +1468,13 @@ void kthread_unuse_mm(struct mm_struct *mm)
 	local_irq_disable();
 	tsk->mm = NULL;
 	membarrier_update_current_mm(NULL);
+	mmgrab_lazy_tlb(mm);
 	/* active_mm is still 'mm' */
 	enter_lazy_tlb(mm, tsk);
 	local_irq_enable();
 	task_unlock(tsk);
+
+	mmdrop(mm);
 }
 EXPORT_SYMBOL_GPL(kthread_unuse_mm);
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e838feb6adc5..495f9a021de9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5189,13 +5189,14 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 	 * rq->curr, before returning to userspace, so provide them here:
 	 *
 	 * - a full memory barrier for {PRIVATE,GLOBAL}_EXPEDITED, implicitly
-	 *   provided by mmdrop(),
+	 *   provided by mmdrop_lazy_tlb(),
 	 * - a sync_core for SYNC_CORE.
 	 */
 	if (mm) {
 		membarrier_mm_sync_core_before_usermode(mm);
-		mmdrop_sched(mm);
+		mmdrop_lazy_tlb_sched(mm);
 	}
+
 	if (unlikely(prev_state == TASK_DEAD)) {
 		if (prev->sched_class->task_dead)
 			prev->sched_class->task_dead(prev);
@@ -5252,9 +5253,9 @@ context_switch(struct rq *rq, struct task_struct *prev,
 
 	/*
 	 * kernel -> kernel   lazy + transfer active
-	 *   user -> kernel   lazy + mmgrab() active
+	 *   user -> kernel   lazy + mmgrab_lazy_tlb() active
 	 *
-	 * kernel ->   user   switch + mmdrop() active
+	 * kernel ->   user   switch + mmdrop_lazy_tlb() active
 	 *   user ->   user   switch
 	 */
 	if (!next->mm) {                                // to kernel
@@ -5262,7 +5263,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
 
 		next->active_mm = prev->active_mm;
 		if (prev->mm)                           // from user
-			mmgrab(prev->active_mm);
+			mmgrab_lazy_tlb(prev->active_mm);
 		else
 			prev->active_mm = NULL;
 	} else {                                        // to user
@@ -5279,7 +5280,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
 		lru_gen_use_mm(next->mm);
 
 		if (!prev->mm) {                        // from kernel
-			/* will mmdrop() in finish_task_switch(). */
+			/* will mmdrop_lazy_tlb() in finish_task_switch(). */
 			rq->prev_mm = prev->active_mm;
 			prev->active_mm = NULL;
 		}
@@ -9916,7 +9917,7 @@ void __init sched_init(void)
 	/*
 	 * The boot idle thread does lazy MMU switching as well:
 	 */
-	mmgrab(&init_mm);
+	mmgrab_lazy_tlb(&init_mm);
 	enter_lazy_tlb(&init_mm, current);
 
 	/*
-- 
2.37.2



* [PATCH v7 3/5] lazy tlb: allow lazy tlb mm refcounting to be configurable
  2023-02-03  7:18 [PATCH v7 0/5] shoot lazy tlbs (lazy tlb refcount scalability improvement) Nicholas Piggin
  2023-02-03  7:18 ` [PATCH v7 1/5] kthread: simplify kthread_use_mm refcounting Nicholas Piggin
  2023-02-03  7:18 ` [PATCH v7 2/5] lazy tlb: introduce lazy tlb mm refcount helper functions Nicholas Piggin
@ 2023-02-03  7:18 ` Nicholas Piggin
  2023-02-03  7:18 ` [PATCH v7 4/5] lazy tlb: shoot lazies, non-refcounting lazy tlb mm reference handling scheme Nicholas Piggin
  2023-02-03  7:18 ` [PATCH v7 5/5] powerpc/64s: enable MMU_LAZY_TLB_SHOOTDOWN Nicholas Piggin
  4 siblings, 0 replies; 9+ messages in thread
From: Nicholas Piggin @ 2023-02-03  7:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-arch, Rik van Riel, Will Deacon, Peter Zijlstra,
	Linus Torvalds, Dave Hansen, linuxppc-dev, Nicholas Piggin,
	linux-mm, Andy Lutomirski, Catalin Marinas, Nadav Amit

Add CONFIG_MMU_LAZY_TLB_REFCOUNT, which enables refcounting of the lazy
tlb mm when it is context switched. This can be disabled by architectures
that
don't require this refcounting if they clean up lazy tlb mms when the
last refcount is dropped. Currently this is always enabled, so the patch
introduces no functional change.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 Documentation/mm/active_mm.rst |  6 ++++++
 arch/Kconfig                   | 17 +++++++++++++++++
 include/linux/sched/mm.h       | 18 +++++++++++++++---
 3 files changed, 38 insertions(+), 3 deletions(-)

diff --git a/Documentation/mm/active_mm.rst b/Documentation/mm/active_mm.rst
index 6f8269c284ed..0114d80d406a 100644
--- a/Documentation/mm/active_mm.rst
+++ b/Documentation/mm/active_mm.rst
@@ -4,6 +4,12 @@
 Active MM
 =========
 
+Note, the mm_count refcount may no longer include the "lazy" users
+(running tasks with ->active_mm == mm && ->mm == NULL) on kernels
+with CONFIG_MMU_LAZY_TLB_REFCOUNT=n. Taking and releasing these lazy
+references must be done with mmgrab_lazy_tlb() and mmdrop_lazy_tlb()
+helpers, which abstract this config option.
+
 ::
 
  List:       linux-kernel
diff --git a/arch/Kconfig b/arch/Kconfig
index 12e3ddabac9d..11e8915c0652 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -465,6 +465,23 @@ config ARCH_WANT_IRQS_OFF_ACTIVATE_MM
 	  irqs disabled over activate_mm. Architectures that do IPI based TLB
 	  shootdowns should enable this.
 
+# Use normal mm refcounting for MMU_LAZY_TLB kernel thread references.
+# MMU_LAZY_TLB_REFCOUNT=n can improve the scalability of context switching
+# to/from kernel threads when the same mm is running on a lot of CPUs (a large
+# multi-threaded application), by reducing contention on the mm refcount.
+#
+# This can be disabled if the architecture ensures no CPUs are using an mm as a
+# "lazy tlb" beyond its final refcount (i.e., by the time __mmdrop frees the mm
+# or its kernel page tables). This could be arranged by arch_exit_mmap(), or
+# final exit(2) TLB flush, for example.
+#
+# To implement this, an arch *must*:
+# Ensure the _lazy_tlb variants of mmgrab/mmdrop are used when manipulating
+# the lazy tlb reference of a kthread's ->active_mm (non-arch code has been
+# converted already).
+config MMU_LAZY_TLB_REFCOUNT
+	def_bool y
+
 config ARCH_HAVE_NMI_SAFE_CMPXCHG
 	bool
 
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 5376caf6fcf3..689dbe812563 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -82,17 +82,29 @@ static inline void mmdrop_sched(struct mm_struct *mm)
 /* Helpers for lazy TLB mm refcounting */
 static inline void mmgrab_lazy_tlb(struct mm_struct *mm)
 {
-	mmgrab(mm);
+	if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT))
+		mmgrab(mm);
 }
 
 static inline void mmdrop_lazy_tlb(struct mm_struct *mm)
 {
-	mmdrop(mm);
+	if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT)) {
+		mmdrop(mm);
+	} else {
+		/*
+		 * mmdrop_lazy_tlb must provide a full memory barrier, see the
+		 * membarrier comment in finish_task_switch() which relies on this.
+		 */
+		smp_mb();
+	}
 }
 
 static inline void mmdrop_lazy_tlb_sched(struct mm_struct *mm)
 {
-	mmdrop_sched(mm);
+	if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT))
+		mmdrop_sched(mm);
+	else
+		smp_mb(); /* see mmdrop_lazy_tlb() above */
 }
 
 /**
-- 
2.37.2



* [PATCH v7 4/5] lazy tlb: shoot lazies, non-refcounting lazy tlb mm reference handling scheme
  2023-02-03  7:18 [PATCH v7 0/5] shoot lazy tlbs (lazy tlb refcount scalability improvement) Nicholas Piggin
                   ` (2 preceding siblings ...)
  2023-02-03  7:18 ` [PATCH v7 3/5] lazy tlb: allow lazy tlb mm refcounting to be configurable Nicholas Piggin
@ 2023-02-03  7:18 ` Nicholas Piggin
  2023-02-03  7:18 ` [PATCH v7 5/5] powerpc/64s: enable MMU_LAZY_TLB_SHOOTDOWN Nicholas Piggin
  4 siblings, 0 replies; 9+ messages in thread
From: Nicholas Piggin @ 2023-02-03  7:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-arch, Rik van Riel, Will Deacon, Peter Zijlstra,
	Linus Torvalds, Dave Hansen, linuxppc-dev, Nicholas Piggin,
	linux-mm, Andy Lutomirski, Catalin Marinas, Nadav Amit

On big systems, the mm refcount can become highly contended when doing a
lot of context switching with threaded applications. The user<->idle
switch is one of the important cases. Abandoning lazy tlb entirely slows
this switching down quite a bit in the common uncontended case, so that
is not viable.

Implement a scheme where lazy tlb mm references do not contribute to the
refcount, instead they get explicitly removed when the refcount reaches
zero.

The final mmdrop() sends IPIs to all CPUs in the mm_cpumask and they
switch away from this mm to init_mm if it was being used as the lazy tlb
mm. Enabling the shoot lazies option therefore requires that the arch
ensures that mm_cpumask contains all CPUs that could possibly be using
mm. A DEBUG_VM option IPIs every CPU in the system after this to ensure
there are no references remaining before the mm is freed.

Shootdown IPI cost could be an issue, but the IPIs have not been observed
to be a serious problem with this scheme, because short-lived processes
tend not to migrate CPUs much, therefore they don't get much chance to
leave lazy tlb mm references on remote CPUs. There are a lot of options
to reduce them if necessary, described in comments.

The near-worst-case can be benchmarked with will-it-scale:

  context_switch1_threads -t $(($(nproc) / 2))

This will create nproc threads (nproc / 2 switching pairs), all sharing
the same mm, spread over all CPUs, so each CPU does thread->idle->thread
switching.
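
As a rough stand-in (illustrative only, not the will-it-scale benchmark
itself), a single switching pair can be reproduced with two threads that
ping-pong a byte over pipes, so each thread blocks and its CPU goes idle
while the partner runs:

/*
 * Illustrative only: one thread->idle->thread switching pair sharing
 * an mm. Build with: gcc -O2 -pthread pair.c
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static int ping[2], pong[2];

static void *partner(void *arg)
{
	char c;

	for (;;) {
		if (read(ping[0], &c, 1) != 1)	/* block; CPU can go idle */
			break;
		if (write(pong[1], &c, 1) != 1)	/* wake the other thread */
			break;
	}
	return NULL;
}

int main(void)
{
	unsigned long iters = 0;
	pthread_t thr;
	char c = 0;

	if (pipe(ping) || pipe(pong))
		return 1;
	if (pthread_create(&thr, NULL, partner, NULL))
		return 1;
	for (;;) {
		if (write(ping[1], &c, 1) != 1 || read(pong[0], &c, 1) != 1)
			break;
		if (++iters % 1000000 == 0)
			printf("%lu round trips\n", iters);
	}
	return 0;
}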

[ Rik came up with basically the same idea a few years ago, so credit
  to him for that. ]

Link: https://lore.kernel.org/linux-mm/20230118080011.2258375-1-npiggin@gmail.com/
Link: https://lore.kernel.org/all/20180728215357.3249-11-riel@surriel.com/
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/Kconfig      | 15 +++++++++++
 kernel/fork.c     | 65 +++++++++++++++++++++++++++++++++++++++++++++++
 lib/Kconfig.debug | 10 ++++++++
 3 files changed, 90 insertions(+)

diff --git a/arch/Kconfig b/arch/Kconfig
index 11e8915c0652..0d2021aed57e 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -481,6 +481,21 @@ config ARCH_WANT_IRQS_OFF_ACTIVATE_MM
 # converted already).
 config MMU_LAZY_TLB_REFCOUNT
 	def_bool y
+	depends on !MMU_LAZY_TLB_SHOOTDOWN
+
+# This option allows MMU_LAZY_TLB_REFCOUNT=n. It ensures no CPUs are using an
+# mm as a lazy tlb beyond its last reference count, by shooting down these
+# users before the mm is deallocated. __mmdrop() first IPIs all CPUs that may
+# be using the mm as a lazy tlb, so that they may switch themselves to using
+# init_mm for their active mm. mm_cpumask(mm) is used to determine which CPUs
+# may be using mm as a lazy tlb mm.
+#
+# To implement this, an arch *must*:
+# - At the time of the final mmdrop of the mm, ensure mm_cpumask(mm) contains
+#   at least all possible CPUs in which the mm is lazy.
+# - It must meet the requirements for MMU_LAZY_TLB_REFCOUNT=n (see above).
+config MMU_LAZY_TLB_SHOOTDOWN
+	bool
 
 config ARCH_HAVE_NMI_SAFE_CMPXCHG
 	bool
diff --git a/kernel/fork.c b/kernel/fork.c
index 9f7fe3541897..e7d81db7e885 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -780,6 +780,67 @@ static void check_mm(struct mm_struct *mm)
 #define allocate_mm()	(kmem_cache_alloc(mm_cachep, GFP_KERNEL))
 #define free_mm(mm)	(kmem_cache_free(mm_cachep, (mm)))
 
+static void do_check_lazy_tlb(void *arg)
+{
+	struct mm_struct *mm = arg;
+
+	WARN_ON_ONCE(current->active_mm == mm);
+}
+
+static void do_shoot_lazy_tlb(void *arg)
+{
+	struct mm_struct *mm = arg;
+
+	if (current->active_mm == mm) {
+		WARN_ON_ONCE(current->mm);
+		current->active_mm = &init_mm;
+		switch_mm(mm, &init_mm, current);
+	}
+}
+
+static void cleanup_lazy_tlbs(struct mm_struct *mm)
+{
+	if (!IS_ENABLED(CONFIG_MMU_LAZY_TLB_SHOOTDOWN)) {
+		/*
+		 * In this case, lazy tlb mms are refcounted and would not reach
+		 * __mmdrop until all CPUs have switched away and mmdrop()ed.
+		 */
+		return;
+	}
+
+	/*
+	 * Lazy mm shootdown does not refcount "lazy tlb mm" usage, rather it
+	 * requires lazy mm users to switch to another mm when the refcount
+	 * drops to zero, before the mm is freed. This requires IPIs here to
+	 * switch kernel threads to init_mm.
+	 *
+	 * archs that use IPIs to flush TLBs can piggy-back that lazy tlb mm
+	 * switch with the final userspace teardown TLB flush which leaves the
+	 * mm lazy on this CPU but no others, reducing the need for additional
+	 * IPIs here. There are cases where a final IPI is still required here,
+	 * such as the final mmdrop being performed on a different CPU than the
+	 * one exiting, or kernel threads using the mm when userspace exits.
+	 *
+	 * IPI overheads have not been found to be expensive, but they could be
+	 * reduced in a number of possible ways, for example (roughly
+	 * increasing order of complexity):
+	 * - The last lazy reference created by exit_mm() could instead switch
+	 *   to init_mm, however it's probable this will run on the same CPU
+	 *   immediately afterwards, so this may not reduce IPIs much.
+	 * - A batch of mms requiring IPIs could be gathered and freed at once.
+	 * - CPUs store active_mm where it can be remotely checked without a
+	 *   lock, to filter out false-positives in the cpumask.
+	 * - After mm_users or mm_count reaches zero, switching away from the
+	 *   mm could clear mm_cpumask to reduce some IPIs, perhaps together
+	 *   with some batching or delaying of the final IPIs.
+	 * - A delayed freeing and RCU-like quiescing sequence based on mm
+	 *   switching to avoid IPIs completely.
+	 */
+	on_each_cpu_mask(mm_cpumask(mm), do_shoot_lazy_tlb, (void *)mm, 1);
+	if (IS_ENABLED(CONFIG_DEBUG_VM_SHOOT_LAZIES))
+		on_each_cpu(do_check_lazy_tlb, (void *)mm, 1);
+}
+
 /*
  * Called when the last reference to the mm
  * is dropped: either by a lazy thread or by
@@ -791,6 +852,10 @@ void __mmdrop(struct mm_struct *mm)
 
 	BUG_ON(mm == &init_mm);
 	WARN_ON_ONCE(mm == current->mm);
+
+	/* Ensure no CPUs are using this as their lazy tlb mm */
+	cleanup_lazy_tlbs(mm);
+
 	WARN_ON_ONCE(mm == current->active_mm);
 	mm_free_pgd(mm);
 	destroy_context(mm);
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 61a9425a311f..1a5849f9f414 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -852,6 +852,16 @@ config DEBUG_VM
 
 	  If unsure, say N.
 
+config DEBUG_VM_SHOOT_LAZIES
+	bool "Debug MMU_LAZY_TLB_SHOOTDOWN implementation"
+	depends on DEBUG_VM
+	depends on MMU_LAZY_TLB_SHOOTDOWN
+	help
+	  Enable additional IPIs that ensure lazy tlb mm references are removed
+	  before the mm is freed.
+
+	  If unsure, say N.
+
 config DEBUG_VM_MAPLE_TREE
 	bool "Debug VM maple trees"
 	depends on DEBUG_VM
-- 
2.37.2



* [PATCH v7 5/5] powerpc/64s: enable MMU_LAZY_TLB_SHOOTDOWN
  2023-02-03  7:18 [PATCH v7 0/5] shoot lazy tlbs (lazy tlb refcount scalability improvement) Nicholas Piggin
                   ` (3 preceding siblings ...)
  2023-02-03  7:18 ` [PATCH v7 4/5] lazy tlb: shoot lazies, non-refcounting lazy tlb mm reference handling scheme Nicholas Piggin
@ 2023-02-03  7:18 ` Nicholas Piggin
  2023-02-26 22:12   ` Andrew Morton
  4 siblings, 1 reply; 9+ messages in thread
From: Nicholas Piggin @ 2023-02-03  7:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-arch, Rik van Riel, Will Deacon, Peter Zijlstra,
	Linus Torvalds, Dave Hansen, linuxppc-dev, Nicholas Piggin,
	linux-mm, Andy Lutomirski, Catalin Marinas, Nadav Amit

On a 16-socket 192-core POWER8 system running the context_switch1_threads
benchmark from will-it-scale (see earlier changelog), upstream can
achieve a rate of about 1 million context switches per second, due to
contention on the mm refcount.

64s meets the prerequisites for CONFIG_MMU_LAZY_TLB_SHOOTDOWN, so enable
the option. This increases the above benchmark to 118 million context
switches per second.

This generates 314 additional IPI interrupts on a 144 CPU system doing
a kernel compile, which is in the noise in terms of kernel cycles.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index b8c4ac56bddc..600ace5a7f1a 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -265,6 +265,7 @@ config PPC
 	select MMU_GATHER_PAGE_SIZE
 	select MMU_GATHER_RCU_TABLE_FREE
 	select MMU_GATHER_MERGE_VMAS
+	select MMU_LAZY_TLB_SHOOTDOWN		if PPC_BOOK3S_64
 	select MODULES_USE_ELF_RELA
 	select NEED_DMA_MAP_STATE		if PPC64 || NOT_COHERENT_CACHE
 	select NEED_PER_CPU_EMBED_FIRST_CHUNK	if PPC64
-- 
2.37.2



* Re: [PATCH v7 5/5] powerpc/64s: enable MMU_LAZY_TLB_SHOOTDOWN
  2023-02-03  7:18 ` [PATCH v7 5/5] powerpc/64s: enable MMU_LAZY_TLB_SHOOTDOWN Nicholas Piggin
@ 2023-02-26 22:12   ` Andrew Morton
  2023-02-27 13:33     ` Peter Zijlstra
  0 siblings, 1 reply; 9+ messages in thread
From: Andrew Morton @ 2023-02-26 22:12 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: linux-arch, Rik van Riel, Will Deacon, Peter Zijlstra,
	Linus Torvalds, Dave Hansen, linuxppc-dev, linux-mm,
	Andy Lutomirski, Catalin Marinas, Nadav Amit

On Fri,  3 Feb 2023 17:18:37 +1000 Nicholas Piggin <npiggin@gmail.com> wrote:

> On a 16-socket 192-core POWER8 system, the context_switch1_threads
> benchmark from will-it-scale (see earlier changelog), upstream can
> achieve a rate of about 1 million context switches per second, due to
> contention on the mm refcount.
> 
> 64s meets the prerequisites for CONFIG_MMU_LAZY_TLB_SHOOTDOWN, so enable
> the option. This increases the above benchmark to 118 million context
> switches per second.

Is that the best you can do ;)

> This generates 314 additional IPI interrupts on a 144 CPU system doing
> a kernel compile, which is in the noise in terms of kernel cycles.
> 
> ...
>
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -265,6 +265,7 @@ config PPC
>  	select MMU_GATHER_PAGE_SIZE
>  	select MMU_GATHER_RCU_TABLE_FREE
>  	select MMU_GATHER_MERGE_VMAS
> +	select MMU_LAZY_TLB_SHOOTDOWN		if PPC_BOOK3S_64
>  	select MODULES_USE_ELF_RELA
>  	select NEED_DMA_MAP_STATE		if PPC64 || NOT_COHERENT_CACHE
>  	select NEED_PER_CPU_EMBED_FIRST_CHUNK	if PPC64

Can we please have a summary of which other architectures might benefit
from this, and what must they do?

As this is powerpc-only, I expect it won't get a lot of testing in
mm.git or in linux-next.  The powerpc maintainers might choose to merge
in the mm-stable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm if this is a
concern.


* Re: [PATCH v7 5/5] powerpc/64s: enable MMU_LAZY_TLB_SHOOTDOWN
  2023-02-26 22:12   ` Andrew Morton
@ 2023-02-27 13:33     ` Peter Zijlstra
  2023-03-21  3:54       ` Nicholas Piggin
  0 siblings, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2023-02-27 13:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-arch, Rik van Riel, Will Deacon, Linus Torvalds,
	Dave Hansen, linuxppc-dev, Nicholas Piggin, Andy Lutomirski,
	linux-mm, Andy Lutomirski, Catalin Marinas, Nadav Amit

On Sun, Feb 26, 2023 at 02:12:38PM -0800, Andrew Morton wrote:
> On Fri,  3 Feb 2023 17:18:37 +1000 Nicholas Piggin <npiggin@gmail.com> wrote:
> 
> > On a 16-socket 192-core POWER8 system, the context_switch1_threads
> > benchmark from will-it-scale (see earlier changelog), upstream can
> > achieve a rate of about 1 million context switches per second, due to
> > contention on the mm refcount.
> > 
> > 64s meets the prerequisites for CONFIG_MMU_LAZY_TLB_SHOOTDOWN, so enable
> > the option. This increases the above benchmark to 118 million context
> > switches per second.
> 
> Is that the best you can do ;)
> 
> > This generates 314 additional IPI interrupts on a 144 CPU system doing
> > a kernel compile, which is in the noise in terms of kernel cycles.
> > 
> > ...
> >
> > --- a/arch/powerpc/Kconfig
> > +++ b/arch/powerpc/Kconfig
> > @@ -265,6 +265,7 @@ config PPC
> >  	select MMU_GATHER_PAGE_SIZE
> >  	select MMU_GATHER_RCU_TABLE_FREE
> >  	select MMU_GATHER_MERGE_VMAS
> > +	select MMU_LAZY_TLB_SHOOTDOWN		if PPC_BOOK3S_64
> >  	select MODULES_USE_ELF_RELA
> >  	select NEED_DMA_MAP_STATE		if PPC64 || NOT_COHERENT_CACHE
> >  	select NEED_PER_CPU_EMBED_FIRST_CHUNK	if PPC64
> 
> Can we please have a summary of which other architectures might benefit
> from this, and what must they do?
> 
> As this is powerpc-only, I expect it won't get a lot of testing in
> mm.git or in linux-next.  The powerpc maintainers might choose to merge
> in the mm-stable branch at
> git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm if this is a
> concern.

I haven't really had time to page all of this back in, but x86 is very
close to being able to use this; it mostly just needs cleaning up some
accidental active_mm usage.

I've got a branch here:

  https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=x86/lazy

That's mostly Nick's patches with a bunch of Andy's old patches stuck on
top. I also have a pile of notes, but alas, not finished in any way.


* Re: [PATCH v7 5/5] powerpc/64s: enable MMU_LAZY_TLB_SHOOTDOWN
  2023-02-27 13:33     ` Peter Zijlstra
@ 2023-03-21  3:54       ` Nicholas Piggin
  0 siblings, 0 replies; 9+ messages in thread
From: Nicholas Piggin @ 2023-03-21  3:54 UTC (permalink / raw)
  To: Peter Zijlstra, Andrew Morton
  Cc: linux-arch, Rik van Riel, linux-mm, Will Deacon, Dave Hansen,
	linuxppc-dev, Andy Lutomirski, Linus Torvalds, Andy Lutomirski,
	Catalin Marinas, Nadav Amit

On Mon Feb 27, 2023 at 11:33 PM AEST, Peter Zijlstra wrote:
> On Sun, Feb 26, 2023 at 02:12:38PM -0800, Andrew Morton wrote:
> > On Fri,  3 Feb 2023 17:18:37 +1000 Nicholas Piggin <npiggin@gmail.com> wrote:
> > 
> > > On a 16-socket 192-core POWER8 system, the context_switch1_threads
> > > benchmark from will-it-scale (see earlier changelog), upstream can
> > > achieve a rate of about 1 million context switches per second, due to
> > > contention on the mm refcount.
> > > 
> > > 64s meets the prerequisites for CONFIG_MMU_LAZY_TLB_SHOOTDOWN, so enable
> > > the option. This increases the above benchmark to 118 million context
> > > switches per second.
> > 
> > Is that the best you can do ;)
> > 
> > > This generates 314 additional IPI interrupts on a 144 CPU system doing
> > > a kernel compile, which is in the noise in terms of kernel cycles.
> > > 
> > > ...
> > >
> > > --- a/arch/powerpc/Kconfig
> > > +++ b/arch/powerpc/Kconfig
> > > @@ -265,6 +265,7 @@ config PPC
> > >  	select MMU_GATHER_PAGE_SIZE
> > >  	select MMU_GATHER_RCU_TABLE_FREE
> > >  	select MMU_GATHER_MERGE_VMAS
> > > +	select MMU_LAZY_TLB_SHOOTDOWN		if PPC_BOOK3S_64
> > >  	select MODULES_USE_ELF_RELA
> > >  	select NEED_DMA_MAP_STATE		if PPC64 || NOT_COHERENT_CACHE
> > >  	select NEED_PER_CPU_EMBED_FIRST_CHUNK	if PPC64
> > 
> > Can we please have a summary of which other architectures might benefit
> > from this, and what must they do?

Coming back to this... The recipes to enable it are somewhat documented
in Kconfig. If those weren't clear I can improve them, or... I'm not sure
where else to add this stuff. It would be nice if all these options had
more explanation and requirements, I'm just not sure what's going to
work best (beyond what I did in Kconfig).
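
To spell out the recipe concretely, the arch-side piece looks roughly
like the powerpc hunk in patch 2: when the final user teardown TLB flush
reaches a CPU that is still using the mm as its lazy tlb mm, that CPU
switches itself over to init_mm. A sketch of that requirement (the
function name below is made up; this is not an actual x86/s390 patch):

/*
 * Runs on each CPU in mm_cpumask(mm) during the final teardown flush;
 * modeled on powerpc's exit_lazy_flush_tlb() from patch 2. After this,
 * no CPU is left lazily using mm when the final mmdrop() happens.
 */
static void example_exit_lazy_tlb(struct mm_struct *mm)
{
	if (current->active_mm == mm) {
		WARN_ON_ONCE(current->mm != NULL);
		/* kernel thread lazily using mm: move it to init_mm */
		mmgrab_lazy_tlb(&init_mm);
		current->active_mm = &init_mm;
		switch_mm_irqs_off(mm, &init_mm, current);
		mmdrop_lazy_tlb(mm);
	}
}

The Kconfig side is then just selecting MMU_LAZY_TLB_SHOOTDOWN and
guaranteeing mm_cpumask(mm) covers every CPU that could still be lazy
at the time of the final mmdrop.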

Not much noise from other archs so far, so I'll take a guess and say
archs that have large SMP systems might; x86 and s390 perhaps. There
seems to be some work still ongoing in the x86 branch; I didn't hear
whether you found the docs inadequate or had any suggestions to improve
understanding. Some people were very confused by it, but I was never
able to help them grasp the concepts or get to the bottom of what the
problem was, so that was a dead end unfortunately.


> > 
> > As this is powerpc-only, I expect it won't get a lot of testing in
> > mm.git or in linux-next.  The powerpc maintainers might choose to merge
> > in the mm-stable branch at
> > git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm if this is a
> > concern.
>
> I haven't really had time to page all of this back in, but x86 is very
> close to be able to use this, it mostly just needs cleaning up some
> accidental active_mm usage.
>
> I've got a branch here:
>
>   https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=x86/lazy
>
> That's mostly Nick's patches with a bunch of Andy's old patches stuck on
> top. I also have a pile of notes, but alas, not finished in any way.

Great that a proof of concept shows it can work for x86, I guess
that's an ack for this series from x86? :)

The x86 implementation presumably won't be merged until the objectionable
active_mm and other core code that makes things difficult for the arch
is cleaned up, so we don't get into the situation again where crap keeps
getting built on crap and everybody else's nice clean patches get nacked
for years because one arch is festering. It will be great to see those
cleanups.

Thanks,
Nick


