linux-mm.kvack.org archive mirror
* [PATCH v6 0/5] shoot lazy tlbs
@ 2023-01-18  8:00 Nicholas Piggin
  2023-01-18  8:00 ` [PATCH v6 1/5] lazy tlb: introduce lazy tlb mm refcount helper functions Nicholas Piggin
                   ` (4 more replies)
  0 siblings, 5 replies; 16+ messages in thread
From: Nicholas Piggin @ 2023-01-18  8:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nicholas Piggin, Andy Lutomirski, Linus Torvalds, linux-arch,
	linux-mm, linuxppc-dev

It's time for that annual flamewar. Nothing really changed in core code
to clean things up or make it better for x86 last year, so I think we
can table that objection.

IIRC the situation left off with Andy proposing a different approach,
and Linus preferring to shoot the lazies at exit time (piggybacking on
the TLB flush IPI), which is what this series allows an arch to do.
Discussion thread here:

https://lore.kernel.org/linux-arch/7c9c388c388df8e88bb5d14828053ac0cb11cf69.1641659630.git.luto@kernel.org/

I don't think there was any movement on this or other alternatives, or
code cleanups since then, but correct me if I'm wrong.

Since v5 of this series, there has just been a minor rebase to upstream,
and some tweaking of comments and code style. No functional changes.

Also included is patch 5, which is the optimisation that combines the
final TLB shootdown with the lazy tlb mm shootdown IPIs. It is included
because Linus expected to see it. It works fine, but I have some other
powerpc changes I would like to go ahead of it, so I would like to take
those through the powerpc tree. And actually, giving it a release cycle
without that optimisation will help stress test the final IPI cleanup
path too, which I would like.

Even without the last patch, the additional IPIs caused by shoot lazies
are down in the noise, so I'm not too concerned about them.

Thanks,
Nick

Nicholas Piggin (5):
  lazy tlb: introduce lazy tlb mm refcount helper functions
  lazy tlb: allow lazy tlb mm refcounting to be configurable
  lazy tlb: shoot lazies, non-refcounting lazy tlb mm reference handling
    scheme
  powerpc/64s: enable MMU_LAZY_TLB_SHOOTDOWN
  powerpc/64s/radix: combine final TLB flush and lazy tlb mm shootdown
    IPIs

 Documentation/mm/active_mm.rst       |  6 +++
 arch/Kconfig                         | 32 ++++++++++++++
 arch/arm/mach-rpc/ecard.c            |  2 +-
 arch/powerpc/Kconfig                 |  1 +
 arch/powerpc/kernel/smp.c            |  2 +-
 arch/powerpc/mm/book3s64/radix_tlb.c | 30 +++++++++++--
 fs/exec.c                            |  2 +-
 include/linux/sched/mm.h             | 28 ++++++++++++
 kernel/cpu.c                         |  2 +-
 kernel/exit.c                        |  2 +-
 kernel/fork.c                        | 65 ++++++++++++++++++++++++++++
 kernel/kthread.c                     | 21 +++++----
 kernel/sched/core.c                  | 35 ++++++++++-----
 kernel/sched/sched.h                 |  4 +-
 14 files changed, 205 insertions(+), 27 deletions(-)

-- 
2.37.2



^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH v6 1/5] lazy tlb: introduce lazy tlb mm refcount helper functions
  2023-01-18  8:00 [PATCH v6 0/5] shoot lazy tlbs Nicholas Piggin
@ 2023-01-18  8:00 ` Nicholas Piggin
  2023-01-18  8:00 ` [PATCH v6 2/5] lazy tlb: allow lazy tlb mm refcounting to be configurable Nicholas Piggin
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 16+ messages in thread
From: Nicholas Piggin @ 2023-01-18  8:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nicholas Piggin, Andy Lutomirski, Linus Torvalds, linux-arch,
	linux-mm, linuxppc-dev

Add explicit _lazy_tlb annotated functions for lazy tlb mm refcounting.
This makes the lazy tlb mm references more obvious, and allows the
refcounting scheme to be modified in later changes.

The only functional change is in kthread_use_mm/kthread_unuse_mm,
because that code is clever with refcounting: if it happens that the
kthread's lazy tlb mm (active_mm) is the same as the mm to be used, the
code doesn't touch the refcount but rather transfers the lazy refcount
to the used-mm refcount. If the lazy tlb mm refcount is no longer
equivalent to the regular refcount, this trick can not be used, so
instead mmgrab a regular reference on the mm to use, and
mmdrop_lazy_tlb the previous active_mm.
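
A simplified sketch of the kthread_use_mm() side of this change
(illustrative only; the real change is in the kthread.c hunks below):

	/* Before: reuse the lazy reference when active_mm == mm */
	if (active_mm != mm) {
		mmgrab(mm);		/* take a regular reference */
		tsk->active_mm = mm;
	}
	/* ... switch_mm_irqs_off(active_mm, mm, tsk) ... */
	if (active_mm != mm)
		mmdrop(active_mm);	/* drop the old lazy reference */
	else
		smp_mb();		/* keep the membarrier ordering */

	/* After: lazy and regular references are accounted separately */
	mmgrab(mm);			/* always take a regular reference */
	if (active_mm != mm)
		tsk->active_mm = mm;
	/* ... switch_mm_irqs_off(active_mm, mm, tsk) ... */
	mmdrop_lazy_tlb(active_mm);	/* always drop the lazy reference;
					   provides the memory barrier */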

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/arm/mach-rpc/ecard.c            |  2 +-
 arch/powerpc/kernel/smp.c            |  2 +-
 arch/powerpc/mm/book3s64/radix_tlb.c |  4 ++--
 fs/exec.c                            |  2 +-
 include/linux/sched/mm.h             | 16 ++++++++++++++++
 kernel/cpu.c                         |  2 +-
 kernel/exit.c                        |  2 +-
 kernel/kthread.c                     | 21 +++++++++++++--------
 kernel/sched/core.c                  | 15 ++++++++-------
 9 files changed, 44 insertions(+), 22 deletions(-)

diff --git a/arch/arm/mach-rpc/ecard.c b/arch/arm/mach-rpc/ecard.c
index 53813f9464a2..c30df1097c52 100644
--- a/arch/arm/mach-rpc/ecard.c
+++ b/arch/arm/mach-rpc/ecard.c
@@ -253,7 +253,7 @@ static int ecard_init_mm(void)
 	current->mm = mm;
 	current->active_mm = mm;
 	activate_mm(active_mm, mm);
-	mmdrop(active_mm);
+	mmdrop_lazy_tlb(active_mm);
 	ecard_init_pgtables(mm);
 	return 0;
 }
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 6b90f10a6c81..7db6b3faea65 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1611,7 +1611,7 @@ void start_secondary(void *unused)
 	if (IS_ENABLED(CONFIG_PPC32))
 		setup_kup();
 
-	mmgrab(&init_mm);
+	mmgrab_lazy_tlb(&init_mm);
 	current->active_mm = &init_mm;
 
 	smp_store_cpu_info(cpu);
diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
index 4e29b619578c..282359ab525b 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -794,10 +794,10 @@ void exit_lazy_flush_tlb(struct mm_struct *mm, bool always_flush)
 	if (current->active_mm == mm) {
 		WARN_ON_ONCE(current->mm != NULL);
 		/* Is a kernel thread and is using mm as the lazy tlb */
-		mmgrab(&init_mm);
+		mmgrab_lazy_tlb(&init_mm);
 		current->active_mm = &init_mm;
 		switch_mm_irqs_off(mm, &init_mm, current);
-		mmdrop(mm);
+		mmdrop_lazy_tlb(mm);
 	}
 
 	/*
diff --git a/fs/exec.c b/fs/exec.c
index ab913243a367..1a32a88db173 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1033,7 +1033,7 @@ static int exec_mmap(struct mm_struct *mm)
 		mmput(old_mm);
 		return 0;
 	}
-	mmdrop(active_mm);
+	mmdrop_lazy_tlb(active_mm);
 	return 0;
 }
 
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 2a243616f222..5376caf6fcf3 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -79,6 +79,22 @@ static inline void mmdrop_sched(struct mm_struct *mm)
 }
 #endif
 
+/* Helpers for lazy TLB mm refcounting */
+static inline void mmgrab_lazy_tlb(struct mm_struct *mm)
+{
+	mmgrab(mm);
+}
+
+static inline void mmdrop_lazy_tlb(struct mm_struct *mm)
+{
+	mmdrop(mm);
+}
+
+static inline void mmdrop_lazy_tlb_sched(struct mm_struct *mm)
+{
+	mmdrop_sched(mm);
+}
+
 /**
  * mmget() - Pin the address space associated with a &struct mm_struct.
  * @mm: The address space to pin.
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 6c0a92ca6bb5..189895288d9d 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -623,7 +623,7 @@ static int finish_cpu(unsigned int cpu)
 	 */
 	if (mm != &init_mm)
 		idle->active_mm = &init_mm;
-	mmdrop(mm);
+	mmdrop_lazy_tlb(mm);
 	return 0;
 }
 
diff --git a/kernel/exit.c b/kernel/exit.c
index 15dc2ec80c46..1a4608d765e4 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -537,7 +537,7 @@ static void exit_mm(void)
 		return;
 	sync_mm_rss(mm);
 	mmap_read_lock(mm);
-	mmgrab(mm);
+	mmgrab_lazy_tlb(mm);
 	BUG_ON(mm != current->active_mm);
 	/* more a memory barrier than a real lock */
 	task_lock(current);
diff --git a/kernel/kthread.c b/kernel/kthread.c
index f97fd01a2932..691b213e578f 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -1410,14 +1410,19 @@ void kthread_use_mm(struct mm_struct *mm)
 	WARN_ON_ONCE(!(tsk->flags & PF_KTHREAD));
 	WARN_ON_ONCE(tsk->mm);
 
+	/*
+	 * It's possible that tsk->active_mm == mm here, but we must
+	 * still mmgrab(mm) and mmdrop_lazy_tlb(active_mm), because lazy
+	 * mm may not have its own refcount (see mmgrab/drop_lazy_tlb()).
+	 */
+	mmgrab(mm);
+
 	task_lock(tsk);
 	/* Hold off tlb flush IPIs while switching mm's */
 	local_irq_disable();
 	active_mm = tsk->active_mm;
-	if (active_mm != mm) {
-		mmgrab(mm);
+	if (active_mm != mm)
 		tsk->active_mm = mm;
-	}
 	tsk->mm = mm;
 	membarrier_update_current_mm(mm);
 	switch_mm_irqs_off(active_mm, mm, tsk);
@@ -1434,12 +1439,9 @@ void kthread_use_mm(struct mm_struct *mm)
 	 * memory barrier after storing to tsk->mm, before accessing
 	 * user-space memory. A full memory barrier for membarrier
 	 * {PRIVATE,GLOBAL}_EXPEDITED is implicitly provided by
-	 * mmdrop(), or explicitly with smp_mb().
+	 * mmdrop_lazy_tlb().
 	 */
-	if (active_mm != mm)
-		mmdrop(active_mm);
-	else
-		smp_mb();
+	mmdrop_lazy_tlb(active_mm);
 }
 EXPORT_SYMBOL_GPL(kthread_use_mm);
 
@@ -1467,10 +1469,13 @@ void kthread_unuse_mm(struct mm_struct *mm)
 	local_irq_disable();
 	tsk->mm = NULL;
 	membarrier_update_current_mm(NULL);
+	mmgrab_lazy_tlb(mm);
 	/* active_mm is still 'mm' */
 	enter_lazy_tlb(mm, tsk);
 	local_irq_enable();
 	task_unlock(tsk);
+
+	mmdrop(mm);
 }
 EXPORT_SYMBOL_GPL(kthread_unuse_mm);
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 25b582b6ee5f..26aaa974ee6d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5140,13 +5140,14 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 	 * rq->curr, before returning to userspace, so provide them here:
 	 *
 	 * - a full memory barrier for {PRIVATE,GLOBAL}_EXPEDITED, implicitly
-	 *   provided by mmdrop(),
+	 *   provided by mmdrop_lazy_tlb(),
 	 * - a sync_core for SYNC_CORE.
 	 */
 	if (mm) {
 		membarrier_mm_sync_core_before_usermode(mm);
-		mmdrop_sched(mm);
+		mmdrop_lazy_tlb_sched(mm);
 	}
+
 	if (unlikely(prev_state == TASK_DEAD)) {
 		if (prev->sched_class->task_dead)
 			prev->sched_class->task_dead(prev);
@@ -5203,9 +5204,9 @@ context_switch(struct rq *rq, struct task_struct *prev,
 
 	/*
 	 * kernel -> kernel   lazy + transfer active
-	 *   user -> kernel   lazy + mmgrab() active
+	 *   user -> kernel   lazy + mmgrab_lazy_tlb() active
 	 *
-	 * kernel ->   user   switch + mmdrop() active
+	 * kernel ->   user   switch + mmdrop_lazy_tlb() active
 	 *   user ->   user   switch
 	 */
 	if (!next->mm) {                                // to kernel
@@ -5213,7 +5214,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
 
 		next->active_mm = prev->active_mm;
 		if (prev->mm)                           // from user
-			mmgrab(prev->active_mm);
+			mmgrab_lazy_tlb(prev->active_mm);
 		else
 			prev->active_mm = NULL;
 	} else {                                        // to user
@@ -5230,7 +5231,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
 		lru_gen_use_mm(next->mm);
 
 		if (!prev->mm) {                        // from kernel
-			/* will mmdrop() in finish_task_switch(). */
+			/* will mmdrop_lazy_tlb() in finish_task_switch(). */
 			rq->prev_mm = prev->active_mm;
 			prev->active_mm = NULL;
 		}
@@ -9859,7 +9860,7 @@ void __init sched_init(void)
 	/*
 	 * The boot idle thread does lazy MMU switching as well:
 	 */
-	mmgrab(&init_mm);
+	mmgrab_lazy_tlb(&init_mm);
 	enter_lazy_tlb(&init_mm, current);
 
 	/*
-- 
2.37.2



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v6 2/5] lazy tlb: allow lazy tlb mm refcounting to be configurable
  2023-01-18  8:00 [PATCH v6 0/5] shoot lazy tlbs Nicholas Piggin
  2023-01-18  8:00 ` [PATCH v6 1/5] lazy tlb: introduce lazy tlb mm refcount helper functions Nicholas Piggin
@ 2023-01-18  8:00 ` Nicholas Piggin
  2023-01-23  7:35   ` Nadav Amit
  2023-01-18  8:00 ` [PATCH v6 3/5] lazy tlb: shoot lazies, non-refcounting lazy tlb mm reference handling scheme Nicholas Piggin
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 16+ messages in thread
From: Nicholas Piggin @ 2023-01-18  8:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nicholas Piggin, Andy Lutomirski, Linus Torvalds, linux-arch,
	linux-mm, linuxppc-dev

Add CONFIG_MMU_LAZY_TLB_REFCOUNT, which enables refcounting of the lazy tlb mm
when it is context switched. This can be disabled by architectures that
don't require this refcounting if they clean up lazy tlb mms when the
last refcount is dropped. Currently this is always enabled, which is
what existing code does, so the patch is effectively a no-op.

Rename rq->prev_mm to rq->prev_lazy_mm, because that's what it is.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 Documentation/mm/active_mm.rst |  6 ++++++
 arch/Kconfig                   | 17 +++++++++++++++++
 include/linux/sched/mm.h       | 18 +++++++++++++++---
 kernel/sched/core.c            | 22 ++++++++++++++++++----
 kernel/sched/sched.h           |  4 +++-
 5 files changed, 59 insertions(+), 8 deletions(-)

diff --git a/Documentation/mm/active_mm.rst b/Documentation/mm/active_mm.rst
index 6f8269c284ed..2b0d08332400 100644
--- a/Documentation/mm/active_mm.rst
+++ b/Documentation/mm/active_mm.rst
@@ -4,6 +4,12 @@
 Active MM
 =========
 
+Note, the mm_count refcount may no longer include the "lazy" users
+(running tasks with ->active_mm == mm && ->mm == NULL) on kernels
+with CONFIG_MMU_LAZY_TLB_REFCOUNT=n. Taking and releasing these lazy
+references must be done with mmgrab_lazy_tlb() and mmdrop_lazy_tlb()
+helpers which abstract this config option.
+
 ::
 
  List:       linux-kernel
diff --git a/arch/Kconfig b/arch/Kconfig
index 12e3ddabac9d..b07d36f08fea 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -465,6 +465,23 @@ config ARCH_WANT_IRQS_OFF_ACTIVATE_MM
 	  irqs disabled over activate_mm. Architectures that do IPI based TLB
 	  shootdowns should enable this.
 
+# Use normal mm refcounting for MMU_LAZY_TLB kernel thread references.
+# MMU_LAZY_TLB_REFCOUNT=n can improve the scalability of context switching
+# to/from kernel threads when the same mm is running on a lot of CPUs (a large
+# multi-threaded application), by reducing contention on the mm refcount.
+#
+# This can be disabled if the architecture ensures no CPUs are using an mm as a
+# "lazy tlb" beyond its final refcount (i.e., by the time __mmdrop frees the mm
+# or its kernel page tables). This could be arranged by arch_exit_mmap(), or
+# final exit(2) TLB flush, for example.
+#
+# To implement this, an arch *must*:
+# Ensure the _lazy_tlb variants of mmgrab/mmdrop are used when dropping the
+# lazy reference of a kthread's ->active_mm (non-arch code has been converted
+# already).
+config MMU_LAZY_TLB_REFCOUNT
+	def_bool y
+
 config ARCH_HAVE_NMI_SAFE_CMPXCHG
 	bool
 
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 5376caf6fcf3..68bbe8d90c2e 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -82,17 +82,29 @@ static inline void mmdrop_sched(struct mm_struct *mm)
 /* Helpers for lazy TLB mm refcounting */
 static inline void mmgrab_lazy_tlb(struct mm_struct *mm)
 {
-	mmgrab(mm);
+	if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT))
+		mmgrab(mm);
 }
 
 static inline void mmdrop_lazy_tlb(struct mm_struct *mm)
 {
-	mmdrop(mm);
+	if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT)) {
+		mmdrop(mm);
+	} else {
+		/*
+		 * mmdrop_lazy_tlb must provide a full memory barrier, see the
+		 * membarrier comment finish_task_switch which relies on this.
+		 */
+		smp_mb();
+	}
 }
 
 static inline void mmdrop_lazy_tlb_sched(struct mm_struct *mm)
 {
-	mmdrop_sched(mm);
+	if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT))
+		mmdrop_sched(mm);
+	else
+		smp_mb(); // see above
 }
 
 /**
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 26aaa974ee6d..1ea14d849a0d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5081,7 +5081,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 	__releases(rq->lock)
 {
 	struct rq *rq = this_rq();
-	struct mm_struct *mm = rq->prev_mm;
+	struct mm_struct *mm = NULL;
 	unsigned int prev_state;
 
 	/*
@@ -5100,7 +5100,10 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 		      current->comm, current->pid, preempt_count()))
 		preempt_count_set(FORK_PREEMPT_COUNT);
 
-	rq->prev_mm = NULL;
+#ifdef CONFIG_MMU_LAZY_TLB_REFCOUNT
+	mm = rq->prev_lazy_mm;
+	rq->prev_lazy_mm = NULL;
+#endif
 
 	/*
 	 * A task struct has one reference for the use as "current".
@@ -5231,9 +5234,20 @@ context_switch(struct rq *rq, struct task_struct *prev,
 		lru_gen_use_mm(next->mm);
 
 		if (!prev->mm) {                        // from kernel
-			/* will mmdrop_lazy_tlb() in finish_task_switch(). */
-			rq->prev_mm = prev->active_mm;
+#ifdef CONFIG_MMU_LAZY_TLB_REFCOUNT
+			/* Will mmdrop_lazy_tlb() in finish_task_switch(). */
+			rq->prev_lazy_mm = prev->active_mm;
 			prev->active_mm = NULL;
+#else
+			/*
+			 * Without MMU_LAZY_TLB_REFCOUNT there is no lazy
+			 * tracking (because no rq->prev_lazy_mm) in
+			 * finish_task_switch, so no mmdrop_lazy_tlb(), so no
+			 * memory barrier for membarrier (see the membarrier
+			 * comment in finish_task_switch()).  Do it here.
+			 */
+			smp_mb();
+#endif
 		}
 	}
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 771f8ddb7053..33da8fa8b5a5 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1009,7 +1009,9 @@ struct rq {
 	struct task_struct	*idle;
 	struct task_struct	*stop;
 	unsigned long		next_balance;
-	struct mm_struct	*prev_mm;
+#ifdef CONFIG_MMU_LAZY_TLB_REFCOUNT
+	struct mm_struct	*prev_lazy_mm;
+#endif
 
 	unsigned int		clock_update_flags;
 	u64			clock;
-- 
2.37.2



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v6 3/5] lazy tlb: shoot lazies, non-refcounting lazy tlb mm reference handling scheme
  2023-01-18  8:00 [PATCH v6 0/5] shoot lazy tlbs Nicholas Piggin
  2023-01-18  8:00 ` [PATCH v6 1/5] lazy tlb: introduce lazy tlb mm refcount helper functions Nicholas Piggin
  2023-01-18  8:00 ` [PATCH v6 2/5] lazy tlb: allow lazy tlb mm refcounting to be configurable Nicholas Piggin
@ 2023-01-18  8:00 ` Nicholas Piggin
  2023-01-18 22:22   ` Nadav Amit
  2023-01-18  8:00 ` [PATCH v6 4/5] powerpc/64s: enable MMU_LAZY_TLB_SHOOTDOWN Nicholas Piggin
  2023-01-18  8:00 ` [PATCH v6 5/5] powerpc/64s/radix: combine final TLB flush and lazy tlb mm shootdown IPIs Nicholas Piggin
  4 siblings, 1 reply; 16+ messages in thread
From: Nicholas Piggin @ 2023-01-18  8:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nicholas Piggin, Andy Lutomirski, Linus Torvalds, linux-arch,
	linux-mm, linuxppc-dev

On big systems, the mm refcount can become highly contented when doing
a lot of context switching with threaded applications (particularly
switching between the idle thread and an application thread).

Abandoning lazy tlb slows switching down quite a bit in the important
user->idle->user cases, so instead implement a non-refcounted scheme
that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down
any remaining lazy ones.

Shootdown IPI cost could be an issue, but it has not been observed
to be a serious problem with this scheme, because short-lived processes
tend not to migrate CPUs much, therefore they don't get much chance to
leave lazy tlb mm references on remote CPUs. There are a lot of options
to reduce them if necessary.
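
For reference, the teardown ordering this gives is roughly the following
(a sketch using the names from the patch below, not exact code):

	__mmdrop(mm)			/* last mm_count reference dropped */
	    cleanup_lazy_tlbs(mm)
		on_each_cpu_mask(mm_cpumask(mm), do_shoot_lazy_tlb, mm, 1)
		    /* each CPU still lazily using mm does:
		     *   current->active_mm = &init_mm;
		     *   switch_mm(mm, &init_mm, current);
		     */
	    mm_free_pgd(mm)		/* now safe: no lazy users remain */
	    destroy_context(mm)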

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/Kconfig  | 15 ++++++++++++
 kernel/fork.c | 65 +++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 80 insertions(+)

diff --git a/arch/Kconfig b/arch/Kconfig
index b07d36f08fea..f7da34e4bc62 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -481,6 +481,21 @@ config ARCH_WANT_IRQS_OFF_ACTIVATE_MM
 # already).
 config MMU_LAZY_TLB_REFCOUNT
 	def_bool y
+	depends on !MMU_LAZY_TLB_SHOOTDOWN
+
+# This option allows MMU_LAZY_TLB_REFCOUNT=n. It ensures no CPUs are using an
+# mm as a lazy tlb beyond its last reference count, by shooting down these
+# users before the mm is deallocated. __mmdrop() first IPIs all CPUs that may
+# be using the mm as a lazy tlb, so that they may switch themselves to using
+# init_mm for their active mm. mm_cpumask(mm) is used to determine which CPUs
+# may be using mm as a lazy tlb mm.
+#
+# To implement this, an arch *must*:
+# - At the time of the final mmdrop of the mm, ensure mm_cpumask(mm) contains
+#   at least all possible CPUs in which the mm is lazy.
+# - It must meet the requirements for MMU_LAZY_TLB_REFCOUNT=n (see above).
+config MMU_LAZY_TLB_SHOOTDOWN
+	bool
 
 config ARCH_HAVE_NMI_SAFE_CMPXCHG
 	bool
diff --git a/kernel/fork.c b/kernel/fork.c
index 9f7fe3541897..263660e78c2a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -780,6 +780,67 @@ static void check_mm(struct mm_struct *mm)
 #define allocate_mm()	(kmem_cache_alloc(mm_cachep, GFP_KERNEL))
 #define free_mm(mm)	(kmem_cache_free(mm_cachep, (mm)))
 
+static void do_check_lazy_tlb(void *arg)
+{
+	struct mm_struct *mm = arg;
+
+	WARN_ON_ONCE(current->active_mm == mm);
+}
+
+static void do_shoot_lazy_tlb(void *arg)
+{
+	struct mm_struct *mm = arg;
+
+	if (current->active_mm == mm) {
+		WARN_ON_ONCE(current->mm);
+		current->active_mm = &init_mm;
+		switch_mm(mm, &init_mm, current);
+	}
+}
+
+static void cleanup_lazy_tlbs(struct mm_struct *mm)
+{
+	if (!IS_ENABLED(CONFIG_MMU_LAZY_TLB_SHOOTDOWN)) {
+		/*
+		 * In this case, lazy tlb mms are refcounted and would not reach
+		 * __mmdrop until all CPUs have switched away and mmdrop()ed.
+		 */
+		return;
+	}
+
+	/*
+	 * Lazy TLB shootdown does not refcount "lazy tlb mm" usage, rather it
+	 * requires lazy mm users to switch to another mm when the refcount
+	 * drops to zero, before the mm is freed. This requires IPIs here to
+	 * switch kernel threads to init_mm.
+	 *
+	 * archs that use IPIs to flush TLBs can piggy-back that lazy tlb mm
+	 * switch with the final userspace teardown TLB flush which leaves the
+	 * mm lazy on this CPU but no others, reducing the need for additional
+	 * IPIs here. There are cases where a final IPI is still required here,
+	 * such as the final mmdrop being performed on a different CPU than the
+	 * one exiting, or kernel threads using the mm when userspace exits.
+	 *
+	 * IPI overheads have not been found to be expensive, but they could be
+	 * reduced in a number of possible ways, for example (roughly
+	 * increasing order of complexity):
+	 * - The last lazy reference created by exit_mm() could instead switch
+	 *   to init_mm, however it's probable this will run on the same CPU
+	 *   immediately afterwards, so this may not reduce IPIs much.
+	 * - A batch of mms requiring IPIs could be gathered and freed at once.
+	 * - CPUs store active_mm where it can be remotely checked without a
+	 *   lock, to filter out false-positives in the cpumask.
+	 * - After mm_users or mm_count reaches zero, switching away from the
+	 *   mm could clear mm_cpumask to reduce some IPIs, perhaps together
+	 *   with some batching or delaying of the final IPIs.
+	 * - A delayed freeing and RCU-like quiescing sequence based on mm
+	 *   switching to avoid IPIs completely.
+	 */
+	on_each_cpu_mask(mm_cpumask(mm), do_shoot_lazy_tlb, (void *)mm, 1);
+	if (IS_ENABLED(CONFIG_DEBUG_VM))
+		on_each_cpu(do_check_lazy_tlb, (void *)mm, 1);
+}
+
 /*
  * Called when the last reference to the mm
  * is dropped: either by a lazy thread or by
@@ -791,6 +852,10 @@ void __mmdrop(struct mm_struct *mm)
 
 	BUG_ON(mm == &init_mm);
 	WARN_ON_ONCE(mm == current->mm);
+
+	/* Ensure no CPUs are using this as their lazy tlb mm */
+	cleanup_lazy_tlbs(mm);
+
 	WARN_ON_ONCE(mm == current->active_mm);
 	mm_free_pgd(mm);
 	destroy_context(mm);
-- 
2.37.2



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v6 4/5] powerpc/64s: enable MMU_LAZY_TLB_SHOOTDOWN
  2023-01-18  8:00 [PATCH v6 0/5] shoot lazy tlbs Nicholas Piggin
                   ` (2 preceding siblings ...)
  2023-01-18  8:00 ` [PATCH v6 3/5] lazy tlb: shoot lazies, non-refcounting lazy tlb mm reference handling scheme Nicholas Piggin
@ 2023-01-18  8:00 ` Nicholas Piggin
  2023-01-18 17:30   ` Linus Torvalds
  2023-01-18  8:00 ` [PATCH v6 5/5] powerpc/64s/radix: combine final TLB flush and lazy tlb mm shootdown IPIs Nicholas Piggin
  4 siblings, 1 reply; 16+ messages in thread
From: Nicholas Piggin @ 2023-01-18  8:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nicholas Piggin, Andy Lutomirski, Linus Torvalds, linux-arch,
	linux-mm, linuxppc-dev

On a 16-socket 192-core POWER8 system, a context switching benchmark
with as many software threads as CPUs (so each switch will go in and
out of idle), upstream can achieve a rate of about 1 million context
switches per second, due to contention on the mm refcount.

64s meets the prerequisites for CONFIG_MMU_LAZY_TLB_SHOOTDOWN, so enable
the option. This increases the above benchmark to 118 million context
switches per second.

This generates 314 additional IPI interrupts on a 144 CPU system doing
a kernel compile, which is in the noise in terms of kernel cycles.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index b8c4ac56bddc..600ace5a7f1a 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -265,6 +265,7 @@ config PPC
 	select MMU_GATHER_PAGE_SIZE
 	select MMU_GATHER_RCU_TABLE_FREE
 	select MMU_GATHER_MERGE_VMAS
+	select MMU_LAZY_TLB_SHOOTDOWN		if PPC_BOOK3S_64
 	select MODULES_USE_ELF_RELA
 	select NEED_DMA_MAP_STATE		if PPC64 || NOT_COHERENT_CACHE
 	select NEED_PER_CPU_EMBED_FIRST_CHUNK	if PPC64
-- 
2.37.2



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v6 5/5] powerpc/64s/radix: combine final TLB flush and lazy tlb mm shootdown IPIs
  2023-01-18  8:00 [PATCH v6 0/5] shoot lazy tlbs Nicholas Piggin
                   ` (3 preceding siblings ...)
  2023-01-18  8:00 ` [PATCH v6 4/5] powerpc/64s: enable MMU_LAZY_TLB_SHOOTDOWN Nicholas Piggin
@ 2023-01-18  8:00 ` Nicholas Piggin
  4 siblings, 0 replies; 16+ messages in thread
From: Nicholas Piggin @ 2023-01-18  8:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nicholas Piggin, Andy Lutomirski, Linus Torvalds, linux-arch,
	linux-mm, linuxppc-dev

** Not for merge **

CONFIG_MMU_LAZY_TLB_SHOOTDOWN requires IPIs to clear the "lazy tlb"
references to an mm that is being freed. With the radix MMU, the final
userspace exit TLB flush can be performed with IPIs, and those IPIs can
also clear lazy tlb mm references, which mostly eliminates the final
IPIs required by MMU_LAZY_TLB_SHOOTDOWN.

This does mean the final TLB flush is not done with TLBIE, which can be
faster than IPI+TLBIEL, but we would have to do those IPIs for lazy
shootdown so using TLBIEL should be a win.

The final cpumask test and possible IPIs are still needed to clean up
some rare race cases. We could prevent those entirely (e.g., prevent new
lazy tlb mm references if userspace has gone away, or move the final
TLB flush later), but I'd have to see actual numbers that matter before
adding any more complexity for it. I can't imagine it would ever be
worthwhile.

This takes lazy tlb mm shootdown IPI interrupts from 314 to 3 on a 144
CPU system doing a kernel compile. It also takes care of the one
potential problem workload which is a short-lived process with multiple
CPU-bound threads that want to be spread to other CPUs, because the mm
exit happens after the process is back to single-threaded.

---
 arch/powerpc/mm/book3s64/radix_tlb.c | 26 +++++++++++++++++++++++++-
 1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
index 282359ab525b..f34b78cb4c7d 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -1303,7 +1303,31 @@ void radix__tlb_flush(struct mmu_gather *tlb)
 	 * See the comment for radix in arch_exit_mmap().
 	 */
 	if (tlb->fullmm || tlb->need_flush_all) {
-		__flush_all_mm(mm, true);
+		if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_SHOOTDOWN)) {
+			/*
+			 * Shootdown based lazy tlb mm refcounting means we
+			 * have to IPI everyone in the mm_cpumask anyway soon
+			 * when the mm goes away, so might as well do it as
+			 * part of the final flush now.
+			 *
+			 * If lazy shootdown was improved to reduce IPIs (e.g.,
+			 * by batching), then it may end up being better to use
+			 * tlbies here instead.
+			 */
+			smp_mb(); /* see radix__flush_tlb_mm */
+			exit_flush_lazy_tlbs(mm);
+			_tlbiel_pid(mm->context.id, RIC_FLUSH_ALL);
+
+			/*
+			 * It should not be possible to have coprocessors still
+			 * attached here.
+			 */
+			if (WARN_ON_ONCE(atomic_read(&mm->context.copros) > 0))
+				__flush_all_mm(mm, true);
+		} else {
+			__flush_all_mm(mm, true);
+		}
+
 	} else if ( (psize = radix_get_mmu_psize(page_size)) == -1) {
 		if (!tlb->freed_tables)
 			radix__flush_tlb_mm(mm);
-- 
2.37.2



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH v6 4/5] powerpc/64s: enable MMU_LAZY_TLB_SHOOTDOWN
  2023-01-18  8:00 ` [PATCH v6 4/5] powerpc/64s: enable MMU_LAZY_TLB_SHOOTDOWN Nicholas Piggin
@ 2023-01-18 17:30   ` Linus Torvalds
  2023-01-19  3:04     ` Nicholas Piggin
  0 siblings, 1 reply; 16+ messages in thread
From: Linus Torvalds @ 2023-01-18 17:30 UTC (permalink / raw)
  To: Nicholas Piggin, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Catalin Marinas, Will Deacon
  Cc: Andrew Morton, linux-arch, linux-mm, linuxppc-dev

[ Adding a few more x86 and arm64 maintainers - while linux-arch is
the right mailing list, I'm not convinced people actually follow it
all that closely ]

On Wed, Jan 18, 2023 at 12:00 AM Nicholas Piggin <npiggin@gmail.com> wrote:
>
> On a 16-socket 192-core POWER8 system, a context switching benchmark
> with as many software threads as CPUs (so each switch will go in and
> out of idle), upstream can achieve a rate of about 1 million context
> switches per second, due to contention on the mm refcount.
>
> 64s meets the prerequisites for CONFIG_MMU_LAZY_TLB_SHOOTDOWN, so enable
> the option. This increases the above benchmark to 118 million context
> switches per second.

Well, the 1M -> 118M change does seem like a good reason for this series.

The patches certainly don't look offensive to me, so Ack as far as I'm
concerned, but honestly, it's been some time since I've personally
been active on the idle and lazy TLB code, so that ack is probably
largely worthless.

If anything, my main reaction to this all is to wonder whether the
config option is a good idea - maybe we could do this unconditionally,
and make the source code (and logic) simpler to follow when you don't
have to worry about the CONFIG_MMU_LAZY_TLB_REFCOUNT option.

I wouldn't be surprised to hear that x86 can have the same issue where
the mm_struct refcount is a bigger issue than the possibility of an
extra TLB shootdown at the final exit time.

But having the config option as a way to switch people over gradually
(and perhaps then removing it later) doesn't sound wrong to me either.

And I personally find the argument in patch 3/5 fairly convincing:

  Shootdown IPIs cost could be an issue, but they have not been observed
  to be a serious problem with this scheme, because short-lived processes
  tend not to migrate CPUs much, therefore they don't get much chance to
  leave lazy tlb mm references on remote CPUs.

Andy? PeterZ? Catalin?

Nick - it might be good to link to the actual benchmark, and let
people who have access to big machines perhaps just try it out on
non-powerpc platforms...

                   Linus


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v6 3/5] lazy tlb: shoot lazies, non-refcounting lazy tlb mm reference handling scheme
  2023-01-18  8:00 ` [PATCH v6 3/5] lazy tlb: shoot lazies, non-refcounting lazy tlb mm reference handling scheme Nicholas Piggin
@ 2023-01-18 22:22   ` Nadav Amit
  2023-01-19  0:53     ` Nicholas Piggin
  2023-01-19  4:22     ` Nicholas Piggin
  0 siblings, 2 replies; 16+ messages in thread
From: Nadav Amit @ 2023-01-18 22:22 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Andrew Morton, Andy Lutomirski, Linus Torvalds, linux-arch,
	linux-mm, linuxppc-dev



> On Jan 18, 2023, at 12:00 AM, Nicholas Piggin <npiggin@gmail.com> wrote:
> 
> +static void do_shoot_lazy_tlb(void *arg)
> +{
> +	struct mm_struct *mm = arg;
> +
> + 	if (current->active_mm == mm) {
> + 		WARN_ON_ONCE(current->mm);
> + 		current->active_mm = &init_mm;
> + 		switch_mm(mm, &init_mm, current);
> + 	}
> +}

I might be out of touch - doesn’t a flush already take place when we free
the page-tables, at least in the common cases on x86?

IIUC exit_mmap() would free page-tables, and whenever page-tables are
freed, on x86, we do a shootdown regardless of whether the target CPU TLB state
marks is_lazy. Then, flush_tlb_func() should call switch_mm_irqs_off() and
everything should be fine, no?

[ I understand you care about powerpc, just wondering on the effect on x86 ]
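
In other words, the sequence I have in mind is roughly the following
(an illustrative sketch of the flow, not the actual x86 code):

	exit_mmap(mm)
	    /* frees the page-tables, which on x86 sends a shootdown IPI
	     * to all CPUs in mm_cpumask(mm), lazy or not */
	    flush_tlb_func()		/* runs on each target CPU */
		/* a CPU that is lazily using mm switches away via
		 * switch_mm_irqs_off(), so it no longer references mm */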



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v6 3/5] lazy tlb: shoot lazies, non-refcounting lazy tlb mm reference handling scheme
  2023-01-18 22:22   ` Nadav Amit
@ 2023-01-19  0:53     ` Nicholas Piggin
  2023-01-19  4:22     ` Nicholas Piggin
  1 sibling, 0 replies; 16+ messages in thread
From: Nicholas Piggin @ 2023-01-19  0:53 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andrew Morton, Andy Lutomirski, Linus Torvalds, linux-arch,
	linux-mm, linuxppc-dev

On Thu Jan 19, 2023 at 8:22 AM AEST, Nadav Amit wrote:
>
>
> > On Jan 18, 2023, at 12:00 AM, Nicholas Piggin <npiggin@gmail.com> wrote:
> > 
> > +static void do_shoot_lazy_tlb(void *arg)
> > +{
> > +	struct mm_struct *mm = arg;
> > +
> > + 	if (current->active_mm == mm) {
> > + 		WARN_ON_ONCE(current->mm);
> > + 		current->active_mm = &init_mm;
> > + 		switch_mm(mm, &init_mm, current);
> > + 	}
> > +}
>
> I might be out of touch - doesn’t a flush already take place when we free
> the page-tables, at least on common cases on x86?
>
> IIUC exit_mmap() would free page-tables, and whenever page-tables are
> freed, on x86, we do shootdown regardless to whether the target CPU TLB state
> marks is_lazy. Then, flush_tlb_func() should call switch_mm_irqs_off() and
> everything should be fine, no?
>
> [ I understand you care about powerpc, just wondering on the effect on x86 ]

If you can easily piggyback on IPI work you already do in exit_mmap then
that's likely to be preferable. I don't know the details of x86 these
days, but there is some discussion about it in last year's thread, and
it sounded quite feasible.

This is still required at final __mmdrop() time because it's still
possible that lazy mm refs will need to be cleaned. exit_mmap() itself
explicitly creates one, so if the __mmdrop() runs on a different CPU,
then there's one. kthreads using the mm could create others. If that
part of it is unclear or under-commented, I can try to improve it.

Thanks,
Nick



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v6 4/5] powerpc/64s: enable MMU_LAZY_TLB_SHOOTDOWN
  2023-01-18 17:30   ` Linus Torvalds
@ 2023-01-19  3:04     ` Nicholas Piggin
  0 siblings, 0 replies; 16+ messages in thread
From: Nicholas Piggin @ 2023-01-19  3:04 UTC (permalink / raw)
  To: Linus Torvalds, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Catalin Marinas, Will Deacon
  Cc: Andrew Morton, linux-arch, linux-mm, linuxppc-dev

On Thu Jan 19, 2023 at 3:30 AM AEST, Linus Torvalds wrote:
> [ Adding a few more x86 and arm64 maintainers - while linux-arch is
> the right mailing list, I'm not convinced people actually follow it
> all that closely ]
>
> On Wed, Jan 18, 2023 at 12:00 AM Nicholas Piggin <npiggin@gmail.com> wrote:
> >
> > On a 16-socket 192-core POWER8 system, a context switching benchmark
> > with as many software threads as CPUs (so each switch will go in and
> > out of idle), upstream can achieve a rate of about 1 million context
> > switches per second, due to contention on the mm refcount.
> >
> > 64s meets the prerequisites for CONFIG_MMU_LAZY_TLB_SHOOTDOWN, so enable
> > the option. This increases the above benchmark to 118 million context
> > switches per second.
>
> Well, the 1M -> 118M change does seem like a good reason for this series.

It was an artificial corner case, mind you. I don't think it's a reason
to panic, and smaller systems with faster atomics will likely care far
less than our big 2-hop systems.

Benchmark is will-it-scale:

  ./context_switch1_threads -t 768
  min:2174 max:2690 total:1827952

    33.52%  [k] finish_task_switch
    27.26%  [k] interrupt_return
    22.66%  [k] __schedule
     2.30%  [k] _raw_spin_trylock

  ./context_switch1_threads -t 1536
  min:103000 max:120100 total:177201906

The top case has half as many switching pairs as available CPUs, which
makes them all switch the same mm between real and lazy. The bottom
case is just switching between user threads, so that doesn't hit the
lazy refcount.

> The patches certainly don't look offensive to me, so Ack as far as I'm
> concerned, but honestly, it's been some time since I've personally
> been active on the idle and lazy TLB code, so that ack is probably
> largely worthless.
>
> If anything, my main reaction to this all is to wonder whether the
> config option is a good idea - maybe we could do this unconditionally,
> and make the source code (and logic) simpler to follow when you don't
> have to worry about the CONFIG_MMU_LAZY_TLB_REFCOUNT option.
>
> I wouldn't be surprised to hear that x86 can have the same issue where
> the mm_struct refcount is a bigger issue than the possibility of an
> extra TLB shootdown at the final exit time.
>
> But having the config options as a way to switch people over gradually
> (and perhaps then removing it later) doesn't sound wrong to me either.

IMO it's trivial enough that we could carry both, but everything's a
straw on the camel's back, so if we can consolidate it would always be
preferable. Let's see how it plays out for a few releases.

> And I personally find the argument in patch 3/5 fairly convincing:
>
>   Shootdown IPIs cost could be an issue, but they have not been observed
>   to be a serious problem with this scheme, because short-lived processes
>   tend not to migrate CPUs much, therefore they don't get much chance to
>   leave lazy tlb mm references on remote CPUs.
>
> Andy? PeterZ? Catalin?
>
> Nick - it might be good to link to the actual benchmark, and let
> people who have access to big machines perhaps just try it out on
> non-powerpc platforms...

Yep good point, I'll put it in the changelog. I might submit another
round to Andrew in a bit with acks and any minor tweaks and minus the
last patch, assuming no major changes or objections.

Thanks,
Nick


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v6 3/5] lazy tlb: shoot lazies, non-refcounting lazy tlb mm reference handling scheme
  2023-01-18 22:22   ` Nadav Amit
  2023-01-19  0:53     ` Nicholas Piggin
@ 2023-01-19  4:22     ` Nicholas Piggin
  2023-01-23  8:16       ` Nadav Amit
  1 sibling, 1 reply; 16+ messages in thread
From: Nicholas Piggin @ 2023-01-19  4:22 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andrew Morton, Andy Lutomirski, Linus Torvalds, linux-arch,
	linux-mm, linuxppc-dev

On Thu Jan 19, 2023 at 8:22 AM AEST, Nadav Amit wrote:
>
>
> > On Jan 18, 2023, at 12:00 AM, Nicholas Piggin <npiggin@gmail.com> wrote:
> > 
> > +static void do_shoot_lazy_tlb(void *arg)
> > +{
> > +	struct mm_struct *mm = arg;
> > +
> > + 	if (current->active_mm == mm) {
> > + 		WARN_ON_ONCE(current->mm);
> > + 		current->active_mm = &init_mm;
> > + 		switch_mm(mm, &init_mm, current);
> > + 	}
> > +}
>
> I might be out of touch - doesn’t a flush already take place when we free
> the page-tables, at least on common cases on x86?
>
> IIUC exit_mmap() would free page-tables, and whenever page-tables are
> freed, on x86, we do shootdown regardless to whether the target CPU TLB state
> marks is_lazy. Then, flush_tlb_func() should call switch_mm_irqs_off() and
> everything should be fine, no?
>
> [ I understand you care about powerpc, just wondering on the effect on x86 ]

Now I come to think of it, Rik had done this for x86 a while back.

https://lore.kernel.org/all/20180728215357.3249-10-riel@surriel.com/

I didn't know about it when I wrote this, so I never dug into why it
didn't get merged. It might have missed the final __mmdrop races, but
I'm not sure; x86 lazy tlb mode is too complicated to know at a
glance. I would check with him though.

Thanks,
Nick


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v6 2/5] lazy tlb: allow lazy tlb mm refcounting to be configurable
  2023-01-18  8:00 ` [PATCH v6 2/5] lazy tlb: allow lazy tlb mm refcounting to be configurable Nicholas Piggin
@ 2023-01-23  7:35   ` Nadav Amit
  2023-01-23  8:02     ` Nadav Amit
  0 siblings, 1 reply; 16+ messages in thread
From: Nadav Amit @ 2023-01-23  7:35 UTC (permalink / raw)
  To: Nicholas Piggin, Andrew Morton
  Cc: Andy Lutomirski, Linus Torvalds, linux-arch, linux-mm, linuxppc-dev



On 1/18/23 10:00 AM, Nicholas Piggin wrote:
> Add CONFIG_MMU_TLB_REFCOUNT which enables refcounting of the lazy tlb mm
> when it is context switched. This can be disabled by architectures that
> don't require this refcounting if they clean up lazy tlb mms when the
> last refcount is dropped. Currently this is always enabled, which is
> what existing code does, so the patch is effectively a no-op.
> 
> Rename rq->prev_mm to rq->prev_lazy_mm, because that's what it is.
> 
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> ---
>   Documentation/mm/active_mm.rst |  6 ++++++
>   arch/Kconfig                   | 17 +++++++++++++++++
>   include/linux/sched/mm.h       | 18 +++++++++++++++---
>   kernel/sched/core.c            | 22 ++++++++++++++++++----
>   kernel/sched/sched.h           |  4 +++-
>   5 files changed, 59 insertions(+), 8 deletions(-)
> 
> diff --git a/Documentation/mm/active_mm.rst b/Documentation/mm/active_mm.rst
> index 6f8269c284ed..2b0d08332400 100644
> --- a/Documentation/mm/active_mm.rst
> +++ b/Documentation/mm/active_mm.rst
> @@ -4,6 +4,12 @@
>   Active MM
>   =========
>   
> +Note, the mm_count refcount may no longer include the "lazy" users
> +(running tasks with ->active_mm == mm && ->mm == NULL) on kernels
> +with CONFIG_MMU_LAZY_TLB_REFCOUNT=n. Taking and releasing these lazy
> +references must be done with mmgrab_lazy_tlb() and mmdrop_lazy_tlb()
> +helpers which abstracts this config option.
> +
>   ::
>   
>    List:       linux-kernel
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 12e3ddabac9d..b07d36f08fea 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -465,6 +465,23 @@ config ARCH_WANT_IRQS_OFF_ACTIVATE_MM
>   	  irqs disabled over activate_mm. Architectures that do IPI based TLB
>   	  shootdowns should enable this.
>   
> +# Use normal mm refcounting for MMU_LAZY_TLB kernel thread references.
> +# MMU_LAZY_TLB_REFCOUNT=n can improve the scalability of context switching
> +# to/from kernel threads when the same mm is running on a lot of CPUs (a large
> +# multi-threaded application), by reducing contention on the mm refcount.
> +#
> +# This can be disabled if the architecture ensures no CPUs are using an mm as a
> +# "lazy tlb" beyond its final refcount (i.e., by the time __mmdrop frees the mm
> +# or its kernel page tables). This could be arranged by arch_exit_mmap(), or
> +# final exit(2) TLB flush, for example.
> +#
> +# To implement this, an arch *must*:
> +# Ensure the _lazy_tlb variants of mmgrab/mmdrop are used when dropping the
> +# lazy reference of a kthread's ->active_mm (non-arch code has been converted
> +# already).
> +config MMU_LAZY_TLB_REFCOUNT
> +	def_bool y
> +
>   config ARCH_HAVE_NMI_SAFE_CMPXCHG
>   	bool
>   
> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> index 5376caf6fcf3..68bbe8d90c2e 100644
> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -82,17 +82,29 @@ static inline void mmdrop_sched(struct mm_struct *mm)
>   /* Helpers for lazy TLB mm refcounting */
>   static inline void mmgrab_lazy_tlb(struct mm_struct *mm)
>   {
> -	mmgrab(mm);
> +	if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT))
> +		mmgrab(mm);
>   }
>   
>   static inline void mmdrop_lazy_tlb(struct mm_struct *mm)
>   {
> -	mmdrop(mm);
> +	if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT)) {
> +		mmdrop(mm);
> +	} else {
> +		/*
> +		 * mmdrop_lazy_tlb must provide a full memory barrier, see the
> +		 * membarrier comment finish_task_switch which relies on this.
> +		 */
> +		smp_mb();
> +	}
>   }

Considering the fact that mmdrop_lazy_tlb() replaced mmdrop() in various
locations in which smp_mb() was not required, this comment might be
confusing. IOW, in most of the cases where mmdrop_lazy_tlb() replaced
mmdrop(), this comment was irrelevant, and therefore it now becomes
confusing.

I am not sure that including the smp_mb() here instead of "open-coding"
it helps.

>   
>   static inline void mmdrop_lazy_tlb_sched(struct mm_struct *mm)
>   {
> -	mmdrop_sched(mm);
> +	if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT))
> +		mmdrop_sched(mm);
> +	else
> +		smp_mb(); // see above
>   }

Wrong style of comment.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v6 2/5] lazy tlb: allow lazy tlb mm refcounting to be configurable
  2023-01-23  7:35   ` Nadav Amit
@ 2023-01-23  8:02     ` Nadav Amit
  2023-01-24  2:29       ` Nicholas Piggin
  0 siblings, 1 reply; 16+ messages in thread
From: Nadav Amit @ 2023-01-23  8:02 UTC (permalink / raw)
  To: Nicholas Piggin, Andrew Morton
  Cc: Andy Lutomirski, Linus Torvalds, linux-arch, linux-mm, linuxppc-dev



On 1/23/23 9:35 AM, Nadav Amit wrote:
>> +    if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT)) {
>> +        mmdrop(mm);
>> +    } else {
>> +        /*
>> +         * mmdrop_lazy_tlb must provide a full memory barrier, see the
>> +         * membarrier comment finish_task_switch which relies on this.
>> +         */
>> +        smp_mb();
>> +    }
>>   }
> 
> Considering the fact that mmdrop_lazy_tlb() replaced mmdrop() in various 
> locations in which smp_mb() was not required, this comment might be 
> confusing. IOW, for the cases in most cases where mmdrop_lazy_tlb() 
> replaced mmdrop(), this comment was irrelevant, and therefore it now 
> becomes confusing.
> 
> I am not sure the include the smp_mb() here instead of "open-coding" it 
> helps.
I think that I now understand why you do need the smp_mb() here, so 
ignore my comment.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v6 3/5] lazy tlb: shoot lazies, non-refcounting lazy tlb mm reference handling scheme
  2023-01-19  4:22     ` Nicholas Piggin
@ 2023-01-23  8:16       ` Nadav Amit
  2023-01-24  3:16         ` Nicholas Piggin
  0 siblings, 1 reply; 16+ messages in thread
From: Nadav Amit @ 2023-01-23  8:16 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Andrew Morton, Andy Lutomirski, Linus Torvalds, linux-arch,
	linux-mm, linuxppc-dev



On 1/19/23 6:22 AM, Nicholas Piggin wrote:
> On Thu Jan 19, 2023 at 8:22 AM AEST, Nadav Amit wrote:
>>
>>
>>> On Jan 18, 2023, at 12:00 AM, Nicholas Piggin <npiggin@gmail.com> wrote:
>>>
>>> +static void do_shoot_lazy_tlb(void *arg)
>>> +{
>>> +	struct mm_struct *mm = arg;
>>> +
>>> + 	if (current->active_mm == mm) {
>>> + 		WARN_ON_ONCE(current->mm);
>>> + 		current->active_mm = &init_mm;
>>> + 		switch_mm(mm, &init_mm, current);
>>> + 	}
>>> +}
>>
>> I might be out of touch - doesn’t a flush already take place when we free
>> the page-tables, at least on common cases on x86?
>>
>> IIUC exit_mmap() would free page-tables, and whenever page-tables are
>> freed, on x86, we do shootdown regardless to whether the target CPU TLB state
>> marks is_lazy. Then, flush_tlb_func() should call switch_mm_irqs_off() and
>> everything should be fine, no?
>>
>> [ I understand you care about powerpc, just wondering on the effect on x86 ]
> 
> Now I come to think of it, Rik had done this for x86 a while back.
> 
> https://lore.kernel.org/all/20180728215357.3249-10-riel@surriel.com/
> 
> I didn't know about it when I wrote this, so I never dug into why it
> didn't get merged. It might have missed the final __mmdrop races but
> I'm not not sure, x86 lazy tlb mode is too complicated to know at a
> glance. I would check with him though.

My point was that naturally (i.e., as done today), when exit_mmap() is 
done, you release the page tables (not just the pages). On x86 it means 
that you also send a shootdown IPI to all the *lazy* CPUs to perform a 
flush, so they would exit the lazy mode.

[ this should be true for 99% of the cases, excluding cases where there
   were no page-tables, for instance ]

So Rik's patch, I think, does not help in the common cases, 
although it may perhaps make implicit actions more explicit in the code.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v6 2/5] lazy tlb: allow lazy tlb mm refcounting to be configurable
  2023-01-23  8:02     ` Nadav Amit
@ 2023-01-24  2:29       ` Nicholas Piggin
  0 siblings, 0 replies; 16+ messages in thread
From: Nicholas Piggin @ 2023-01-24  2:29 UTC (permalink / raw)
  To: Nadav Amit, Andrew Morton
  Cc: Andy Lutomirski, Linus Torvalds, linux-arch, linux-mm, linuxppc-dev

On Mon Jan 23, 2023 at 6:02 PM AEST, Nadav Amit wrote:
>
>
> On 1/23/23 9:35 AM, Nadav Amit wrote:
> >> +    if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT)) {
> >> +        mmdrop(mm);
> >> +    } else {
> >> +        /*
> >> +         * mmdrop_lazy_tlb must provide a full memory barrier, see the
> >> +         * membarrier comment finish_task_switch which relies on this.
> >> +         */
> >> +        smp_mb();
> >> +    }
> >>   }
> > 
> > Considering the fact that mmdrop_lazy_tlb() replaced mmdrop() in various 
> > locations in which smp_mb() was not required, this comment might be 
> > confusing. IOW, for the cases in most cases where mmdrop_lazy_tlb() 
> > replaced mmdrop(), this comment was irrelevant, and therefore it now 
> > becomes confusing.
> > 
> > I am not sure the include the smp_mb() here instead of "open-coding" it 
> > helps.
> I think that I now understand why you do need the smp_mb() here, so 
> ignore my comment.

For the moment it's basically a convenience thing so the caller does not
have to care what option is configured. Possibly we could weaken it and
do necessary barriers in callers if we consolidated to one option, but
I'd have to be convinced it'd be worthwhile, because it would still make
it deviate from mmdrop(), and we'd probably at least need a release
barrier to drop the reference.

Thanks,
Nick


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v6 3/5] lazy tlb: shoot lazies, non-refcounting lazy tlb mm reference handling scheme
  2023-01-23  8:16       ` Nadav Amit
@ 2023-01-24  3:16         ` Nicholas Piggin
  0 siblings, 0 replies; 16+ messages in thread
From: Nicholas Piggin @ 2023-01-24  3:16 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andrew Morton, Andy Lutomirski, Linus Torvalds, linux-arch,
	linux-mm, linuxppc-dev

On Mon Jan 23, 2023 at 6:16 PM AEST, Nadav Amit wrote:
>
>
> On 1/19/23 6:22 AM, Nicholas Piggin wrote:
> > On Thu Jan 19, 2023 at 8:22 AM AEST, Nadav Amit wrote:
> >>
> >>
> >>> On Jan 18, 2023, at 12:00 AM, Nicholas Piggin <npiggin@gmail.com> wrote:
> >>>
> >>> +static void do_shoot_lazy_tlb(void *arg)
> >>> +{
> >>> +	struct mm_struct *mm = arg;
> >>> +
> >>> + 	if (current->active_mm == mm) {
> >>> + 		WARN_ON_ONCE(current->mm);
> >>> + 		current->active_mm = &init_mm;
> >>> + 		switch_mm(mm, &init_mm, current);
> >>> + 	}
> >>> +}
> >>
> >> I might be out of touch - doesn’t a flush already take place when we free
> >> the page-tables, at least on common cases on x86?
> >>
> >> IIUC exit_mmap() would free page-tables, and whenever page-tables are
> >> freed, on x86, we do shootdown regardless to whether the target CPU TLB state
> >> marks is_lazy. Then, flush_tlb_func() should call switch_mm_irqs_off() and
> >> everything should be fine, no?
> >>
> >> [ I understand you care about powerpc, just wondering on the effect on x86 ]
> > 
> > Now I come to think of it, Rik had done this for x86 a while back.
> > 
> > https://lore.kernel.org/all/20180728215357.3249-10-riel@surriel.com/
> > 
> > I didn't know about it when I wrote this, so I never dug into why it
> > didn't get merged. It might have missed the final __mmdrop races but
> > I'm not not sure, x86 lazy tlb mode is too complicated to know at a
> > glance. I would check with him though.
>
> My point was that naturally (i.e., as done today), when exit_mmap() is 
> done, you release the page tables (not just the pages). On x86 it means 
> that you also send shootdown IPI to all the *lazy* CPUs to perform a 
> flush, so they would exit the lazy mode.
>
> [ this should be true for 99% of the cases, excluding cases where there
>    were not page-tables, for instance ]
>
> So the patch of Rik, I think, does not help in the common cases, 
> although it may perhaps make implicit actions more explicit in the code.

If that's what it does, then sure. IIRC x86 didn't use to work that way
long ago, but you would know what it does today. You might find it
doesn't need much arch change to work. OTOH Andy has major problems with
active_mm and some other x86 use-after-free weirdness that I wasn't
able to comprehend. He'll be NAKing an x86 implementation until that's
all cleaned up, so better to try to understand what's going on with
that first.

Thanks,
Nick


^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2023-01-24  3:16 UTC | newest]

Thread overview: 16+ messages
2023-01-18  8:00 [PATCH v6 0/5] shoot lazy tlbs Nicholas Piggin
2023-01-18  8:00 ` [PATCH v6 1/5] lazy tlb: introduce lazy tlb mm refcount helper functions Nicholas Piggin
2023-01-18  8:00 ` [PATCH v6 2/5] lazy tlb: allow lazy tlb mm refcounting to be configurable Nicholas Piggin
2023-01-23  7:35   ` Nadav Amit
2023-01-23  8:02     ` Nadav Amit
2023-01-24  2:29       ` Nicholas Piggin
2023-01-18  8:00 ` [PATCH v6 3/5] lazy tlb: shoot lazies, non-refcounting lazy tlb mm reference handling scheme Nicholas Piggin
2023-01-18 22:22   ` Nadav Amit
2023-01-19  0:53     ` Nicholas Piggin
2023-01-19  4:22     ` Nicholas Piggin
2023-01-23  8:16       ` Nadav Amit
2023-01-24  3:16         ` Nicholas Piggin
2023-01-18  8:00 ` [PATCH v6 4/5] powerpc/64s: enable MMU_LAZY_TLB_SHOOTDOWN Nicholas Piggin
2023-01-18 17:30   ` Linus Torvalds
2023-01-19  3:04     ` Nicholas Piggin
2023-01-18  8:00 ` [PATCH v6 5/5] powerpc/64s/radix: combine final TLB flush and lazy tlb mm shootdown IPIs Nicholas Piggin
