* [PATCH v4 0/4] shoot lazy tlbs
@ 2021-06-05  1:42 ` Nicholas Piggin
  0 siblings, 0 replies; 48+ messages in thread
From: Nicholas Piggin @ 2021-06-05  1:42 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nicholas Piggin, Randy Dunlap, linux-kernel, linux-arch,
	linuxppc-dev, linux-mm, Anton Blanchard, Andy Lutomirski

The additional unused config option was a valid criticism, so this version
now just toggles refcounting of the lazy tlb mm.

Thanks,
Nick

Since v3:
- Removed the extra config option, MMU_LAZY_TLB=n. This can be
  resurrected if an arch wants it.

Nicholas Piggin (4):
  lazy tlb: introduce lazy mm refcount helper functions
  lazy tlb: allow lazy tlb mm refcounting to be configurable
  lazy tlb: shoot lazies, a non-refcounting lazy tlb option
  powerpc/64s: enable MMU_LAZY_TLB_SHOOTDOWN

 arch/Kconfig                         | 17 ++++++++++
 arch/arm/mach-rpc/ecard.c            |  2 +-
 arch/powerpc/Kconfig                 |  1 +
 arch/powerpc/kernel/smp.c            |  2 +-
 arch/powerpc/mm/book3s64/radix_tlb.c |  4 +--
 fs/exec.c                            |  4 +--
 include/linux/sched/mm.h             | 20 +++++++++++
 kernel/cpu.c                         |  2 +-
 kernel/exit.c                        |  2 +-
 kernel/fork.c                        | 51 ++++++++++++++++++++++++++++
 kernel/kthread.c                     | 11 +++---
 kernel/sched/core.c                  | 35 +++++++++++++------
 kernel/sched/sched.h                 |  4 ++-
 13 files changed, 132 insertions(+), 23 deletions(-)

-- 
2.23.0


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH v4 1/4] lazy tlb: introduce lazy mm refcount helper functions
  2021-06-05  1:42 ` Nicholas Piggin
@ 2021-06-05  1:42   ` Nicholas Piggin
  -1 siblings, 0 replies; 48+ messages in thread
From: Nicholas Piggin @ 2021-06-05  1:42 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nicholas Piggin, Randy Dunlap, linux-kernel, linux-arch,
	linuxppc-dev, linux-mm, Anton Blanchard, Andy Lutomirski

Add explicit _lazy_tlb annotated functions for lazy mm refcounting.
This makes lazy mm references more obvious, and allows explicit
refcounting to be removed if it is not used.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/arm/mach-rpc/ecard.c            |  2 +-
 arch/powerpc/kernel/smp.c            |  2 +-
 arch/powerpc/mm/book3s64/radix_tlb.c |  4 ++--
 fs/exec.c                            |  4 ++--
 include/linux/sched/mm.h             | 11 +++++++++++
 kernel/cpu.c                         |  2 +-
 kernel/exit.c                        |  2 +-
 kernel/kthread.c                     | 11 +++++++----
 kernel/sched/core.c                  | 15 ++++++++-------
 9 files changed, 34 insertions(+), 19 deletions(-)

diff --git a/arch/arm/mach-rpc/ecard.c b/arch/arm/mach-rpc/ecard.c
index 827b50f1c73e..1b4a41aad793 100644
--- a/arch/arm/mach-rpc/ecard.c
+++ b/arch/arm/mach-rpc/ecard.c
@@ -253,7 +253,7 @@ static int ecard_init_mm(void)
 	current->mm = mm;
 	current->active_mm = mm;
 	activate_mm(active_mm, mm);
-	mmdrop(active_mm);
+	mmdrop_lazy_tlb(active_mm);
 	ecard_init_pgtables(mm);
 	return 0;
 }
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 2e05c783440a..fb0bdfc67366 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1541,7 +1541,7 @@ void start_secondary(void *unused)
 {
 	unsigned int cpu = raw_smp_processor_id();
 
-	mmgrab(&init_mm);
+	mmgrab_lazy_tlb(&init_mm);
 	current->active_mm = &init_mm;
 
 	smp_store_cpu_info(cpu);
diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
index 409e61210789..2962082787c0 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -663,10 +663,10 @@ void exit_lazy_flush_tlb(struct mm_struct *mm, bool always_flush)
 	if (current->active_mm == mm) {
 		WARN_ON_ONCE(current->mm != NULL);
 		/* Is a kernel thread and is using mm as the lazy tlb */
-		mmgrab(&init_mm);
+		mmgrab_lazy_tlb(&init_mm);
 		current->active_mm = &init_mm;
 		switch_mm_irqs_off(mm, &init_mm, current);
-		mmdrop(mm);
+		mmdrop_lazy_tlb(mm);
 	}
 
 	/*
diff --git a/fs/exec.c b/fs/exec.c
index 18594f11c31f..ca0f8b1af23a 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1029,9 +1029,9 @@ static int exec_mmap(struct mm_struct *mm)
 		setmax_mm_hiwater_rss(&tsk->signal->maxrss, old_mm);
 		mm_update_next_owner(old_mm);
 		mmput(old_mm);
-		return 0;
+	} else {
+		mmdrop_lazy_tlb(active_mm);
 	}
-	mmdrop(active_mm);
 	return 0;
 }
 
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index e24b1fe348e3..bfd1baca5266 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -49,6 +49,17 @@ static inline void mmdrop(struct mm_struct *mm)
 		__mmdrop(mm);
 }
 
+/* Helpers for lazy TLB mm refcounting */
+static inline void mmgrab_lazy_tlb(struct mm_struct *mm)
+{
+	mmgrab(mm);
+}
+
+static inline void mmdrop_lazy_tlb(struct mm_struct *mm)
+{
+	mmdrop(mm);
+}
+
 /**
  * mmget() - Pin the address space associated with a &struct mm_struct.
  * @mm: The address space to pin.
diff --git a/kernel/cpu.c b/kernel/cpu.c
index e538518556f4..e87a89824e6c 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -602,7 +602,7 @@ static int finish_cpu(unsigned int cpu)
 	 */
 	if (mm != &init_mm)
 		idle->active_mm = &init_mm;
-	mmdrop(mm);
+	mmdrop_lazy_tlb(mm);
 	return 0;
 }
 
diff --git a/kernel/exit.c b/kernel/exit.c
index fd1c04193e18..8e87ec5f6be2 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -476,7 +476,7 @@ static void exit_mm(void)
 		__set_current_state(TASK_RUNNING);
 		mmap_read_lock(mm);
 	}
-	mmgrab(mm);
+	mmgrab_lazy_tlb(mm);
 	BUG_ON(mm != current->active_mm);
 	/* more a memory barrier than a real lock */
 	task_lock(current);
diff --git a/kernel/kthread.c b/kernel/kthread.c
index fe3f2a40d61e..b70e28431a01 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -1314,14 +1314,14 @@ void kthread_use_mm(struct mm_struct *mm)
 	WARN_ON_ONCE(!(tsk->flags & PF_KTHREAD));
 	WARN_ON_ONCE(tsk->mm);
 
+	mmgrab(mm);
+
 	task_lock(tsk);
 	/* Hold off tlb flush IPIs while switching mm's */
 	local_irq_disable();
 	active_mm = tsk->active_mm;
-	if (active_mm != mm) {
-		mmgrab(mm);
+	if (active_mm != mm)
 		tsk->active_mm = mm;
-	}
 	tsk->mm = mm;
 	membarrier_update_current_mm(mm);
 	switch_mm_irqs_off(active_mm, mm, tsk);
@@ -1341,7 +1341,7 @@ void kthread_use_mm(struct mm_struct *mm)
 	 * mmdrop(), or explicitly with smp_mb().
 	 */
 	if (active_mm != mm)
-		mmdrop(active_mm);
+		mmdrop_lazy_tlb(active_mm);
 	else
 		smp_mb();
 
@@ -1375,10 +1375,13 @@ void kthread_unuse_mm(struct mm_struct *mm)
 	local_irq_disable();
 	tsk->mm = NULL;
 	membarrier_update_current_mm(NULL);
+	mmgrab_lazy_tlb(mm);
 	/* active_mm is still 'mm' */
 	enter_lazy_tlb(mm, tsk);
 	local_irq_enable();
 	task_unlock(tsk);
+
+	mmdrop(mm);
 }
 EXPORT_SYMBOL_GPL(kthread_unuse_mm);
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5226cc26a095..e359c76ea2e2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4229,13 +4229,14 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 	 * rq->curr, before returning to userspace, so provide them here:
 	 *
 	 * - a full memory barrier for {PRIVATE,GLOBAL}_EXPEDITED, implicitly
-	 *   provided by mmdrop(),
+	 *   provided by mmdrop_lazy_tlb(),
 	 * - a sync_core for SYNC_CORE.
 	 */
 	if (mm) {
 		membarrier_mm_sync_core_before_usermode(mm);
-		mmdrop(mm);
+		mmdrop_lazy_tlb(mm);
 	}
+
 	if (unlikely(prev_state == TASK_DEAD)) {
 		if (prev->sched_class->task_dead)
 			prev->sched_class->task_dead(prev);
@@ -4299,9 +4300,9 @@ context_switch(struct rq *rq, struct task_struct *prev,
 
 	/*
 	 * kernel -> kernel   lazy + transfer active
-	 *   user -> kernel   lazy + mmgrab() active
+	 *   user -> kernel   lazy + mmgrab_lazy_tlb() active
 	 *
-	 * kernel ->   user   switch + mmdrop() active
+	 * kernel ->   user   switch + mmdrop_lazy_tlb() active
 	 *   user ->   user   switch
 	 */
 	if (!next->mm) {                                // to kernel
@@ -4309,7 +4310,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
 
 		next->active_mm = prev->active_mm;
 		if (prev->mm)                           // from user
-			mmgrab(prev->active_mm);
+			mmgrab_lazy_tlb(prev->active_mm);
 		else
 			prev->active_mm = NULL;
 	} else {                                        // to user
@@ -4325,7 +4326,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
 		switch_mm_irqs_off(prev->active_mm, next->mm, next);
 
 		if (!prev->mm) {                        // from kernel
-			/* will mmdrop() in finish_task_switch(). */
+			/* will mmdrop_lazy_tlb() in finish_task_switch(). */
 			rq->prev_mm = prev->active_mm;
 			prev->active_mm = NULL;
 		}
@@ -8239,7 +8240,7 @@ void __init sched_init(void)
 	/*
 	 * The boot idle thread does lazy MMU switching as well:
 	 */
-	mmgrab(&init_mm);
+	mmgrab_lazy_tlb(&init_mm);
 	enter_lazy_tlb(&init_mm, current);
 
 	/*
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v4 2/4] lazy tlb: allow lazy tlb mm refcounting to be configurable
  2021-06-05  1:42 ` Nicholas Piggin
@ 2021-06-05  1:42   ` Nicholas Piggin
  -1 siblings, 0 replies; 48+ messages in thread
From: Nicholas Piggin @ 2021-06-05  1:42 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nicholas Piggin, Randy Dunlap, linux-kernel, linux-arch,
	linuxppc-dev, linux-mm, Anton Blanchard, Andy Lutomirski

Add CONFIG_MMU_LAZY_TLB_REFCOUNT, which enables refcounting of the lazy tlb mm
when it is context switched. This can be disabled by architectures that
don't require this refcounting if they clean up lazy tlb mms when the
last refcount is dropped. Currently this is always enabled, which is
what existing code does, so the patch is effectively a no-op.

Rename rq->prev_mm to rq->prev_lazy_mm, because that's what it is.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/Kconfig             |  4 ++++
 include/linux/sched/mm.h | 13 +++++++++++--
 kernel/sched/core.c      | 22 ++++++++++++++++++----
 kernel/sched/sched.h     |  4 +++-
 4 files changed, 36 insertions(+), 7 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index c45b770d3579..1cff045cdde6 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -418,6 +418,10 @@ config ARCH_WANT_IRQS_OFF_ACTIVATE_MM
 	  irqs disabled over activate_mm. Architectures that do IPI based TLB
 	  shootdowns should enable this.
 
+# Use normal mm refcounting for MMU_LAZY_TLB kernel thread references.
+config MMU_LAZY_TLB_REFCOUNT
+	def_bool y
+
 config ARCH_HAVE_NMI_SAFE_CMPXCHG
 	bool
 
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index bfd1baca5266..29e4638ad124 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -52,12 +52,21 @@ static inline void mmdrop(struct mm_struct *mm)
 /* Helpers for lazy TLB mm refcounting */
 static inline void mmgrab_lazy_tlb(struct mm_struct *mm)
 {
-	mmgrab(mm);
+	if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT))
+		mmgrab(mm);
 }
 
 static inline void mmdrop_lazy_tlb(struct mm_struct *mm)
 {
-	mmdrop(mm);
+	if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT)) {
+		mmdrop(mm);
+	} else {
+		/*
+		 * mmdrop_lazy_tlb must provide a full memory barrier; see the
+		 * membarrier comment in finish_task_switch(), which relies on this.
+		 */
+		smp_mb();
+	}
 }
 
 /**
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e359c76ea2e2..5e10cb712be3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4171,7 +4171,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 	__releases(rq->lock)
 {
 	struct rq *rq = this_rq();
-	struct mm_struct *mm = rq->prev_mm;
+	struct mm_struct *mm = NULL;
 	long prev_state;
 
 	/*
@@ -4190,7 +4190,10 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 		      current->comm, current->pid, preempt_count()))
 		preempt_count_set(FORK_PREEMPT_COUNT);
 
-	rq->prev_mm = NULL;
+#ifdef CONFIG_MMU_LAZY_TLB_REFCOUNT
+	mm = rq->prev_lazy_mm;
+	rq->prev_lazy_mm = NULL;
+#endif
 
 	/*
 	 * A task struct has one reference for the use as "current".
@@ -4326,9 +4329,20 @@ context_switch(struct rq *rq, struct task_struct *prev,
 		switch_mm_irqs_off(prev->active_mm, next->mm, next);
 
 		if (!prev->mm) {                        // from kernel
-			/* will mmdrop_lazy_tlb() in finish_task_switch(). */
-			rq->prev_mm = prev->active_mm;
+#ifdef CONFIG_MMU_LAZY_TLB_REFCOUNT
+			/* Will mmdrop_lazy_tlb() in finish_task_switch(). */
+			rq->prev_lazy_mm = prev->active_mm;
 			prev->active_mm = NULL;
+#else
+			/*
+			 * Without MMU_LAZY_TLB_REFCOUNT there is no lazy
+			 * tracking (because no rq->prev_lazy_mm) in
+			 * finish_task_switch, so no mmdrop_lazy_tlb(), so no
+			 * memory barrier for membarrier (see the membarrier
+			 * comment in finish_task_switch()).  Do it here.
+			 */
+			smp_mb();
+#endif
 		}
 	}
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a189bec13729..0729cf19a987 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -961,7 +961,9 @@ struct rq {
 	struct task_struct	*idle;
 	struct task_struct	*stop;
 	unsigned long		next_balance;
-	struct mm_struct	*prev_mm;
+#ifdef CONFIG_MMU_LAZY_TLB_REFCOUNT
+	struct mm_struct	*prev_lazy_mm;
+#endif
 
 	unsigned int		clock_update_flags;
 	u64			clock;
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v4 3/4] lazy tlb: shoot lazies, a non-refcounting lazy tlb option
  2021-06-05  1:42 ` Nicholas Piggin
@ 2021-06-05  1:42   ` Nicholas Piggin
  -1 siblings, 0 replies; 48+ messages in thread
From: Nicholas Piggin @ 2021-06-05  1:42 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nicholas Piggin, Randy Dunlap, linux-kernel, linux-arch,
	linuxppc-dev, linux-mm, Anton Blanchard, Andy Lutomirski

On big systems, the mm refcount can become highly contended when doing
a lot of context switching with threaded applications (particularly
switching between the idle thread and an application thread).

Abandoning lazy tlb slows switching down quite a bit in the important
user->idle->user cases, so instead implement a non-refcounted scheme
that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down
any remaining lazy ones.

Shootdown IPIs are some concern, but they have not been observed to be
a big problem with this scheme (the powerpc implementation generated
314 additional interrupts on a 144 CPU system during a kernel compile).
There are a number of strategies that could be employed to reduce IPIs
if they turn out to be a problem for some workload.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/Kconfig  | 13 +++++++++++++
 kernel/fork.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 64 insertions(+)

diff --git a/arch/Kconfig b/arch/Kconfig
index 1cff045cdde6..f8136c893991 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -421,6 +421,19 @@ config ARCH_WANT_IRQS_OFF_ACTIVATE_MM
 # Use normal mm refcounting for MMU_LAZY_TLB kernel thread references.
 config MMU_LAZY_TLB_REFCOUNT
 	def_bool y
+	depends on !MMU_LAZY_TLB_SHOOTDOWN
+
+# Instead of refcounting the lazy mm struct for kernel thread references
+# (which can cause contention with multi-threaded apps on large multiprocessor
+# systems), this option causes __mmdrop to IPI all CPUs in the mm_cpumask and
+# switch to init_mm if they were using the to-be-freed mm as the lazy tlb. To
+# implement this, architectures must use _lazy_tlb variants of mm refcounting
+# when releasing kernel thread mm references, and mm_cpumask must include at
+# least all possible CPUs in which the mm might be lazy, at the time of the
+# final mmdrop. mmgrab/mmdrop in arch/ code must be switched to _lazy_tlb
+# postfix as necessary.
+config MMU_LAZY_TLB_SHOOTDOWN
+	bool
 
 config ARCH_HAVE_NMI_SAFE_CMPXCHG
 	bool
diff --git a/kernel/fork.c b/kernel/fork.c
index dc06afd725cb..8085ff33c7f6 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -674,6 +674,53 @@ static void check_mm(struct mm_struct *mm)
 #define allocate_mm()	(kmem_cache_alloc(mm_cachep, GFP_KERNEL))
 #define free_mm(mm)	(kmem_cache_free(mm_cachep, (mm)))
 
+static void do_shoot_lazy_tlb(void *arg)
+{
+	struct mm_struct *mm = arg;
+
+	if (current->active_mm == mm) {
+		WARN_ON_ONCE(current->mm);
+		current->active_mm = &init_mm;
+		switch_mm(mm, &init_mm, current);
+	}
+}
+
+static void do_check_lazy_tlb(void *arg)
+{
+	struct mm_struct *mm = arg;
+
+	WARN_ON_ONCE(current->active_mm == mm);
+}
+
+static void shoot_lazy_tlbs(struct mm_struct *mm)
+{
+	if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_SHOOTDOWN)) {
+		/*
+		 * IPI overheads have not been found to be expensive, but they could
+		 * be reduced in a number of possible ways, for example (in
+		 * roughly increasing order of complexity):
+		 * - A batch of mms requiring IPIs could be gathered and freed
+		 *   at once.
+		 * - CPUs could store their active mm somewhere that can be
+		 *   remotely checked without a lock, to filter out
+		 *   false-positives in the cpumask.
+		 * - After mm_users or mm_count reaches zero, switching away
+		 *   from the mm could clear mm_cpumask to reduce some IPIs
+		 *   (some batching or delaying would help).
+		 * - A delayed freeing and RCU-like quiescing sequence based on
+		 *   mm switching to avoid IPIs completely.
+		 */
+		on_each_cpu_mask(mm_cpumask(mm), do_shoot_lazy_tlb, (void *)mm, 1);
+		if (IS_ENABLED(CONFIG_DEBUG_VM))
+			on_each_cpu(do_check_lazy_tlb, (void *)mm, 1);
+	} else {
+		/*
+		 * In this case, lazy tlb mms are refcounted and would not reach
+		 * __mmdrop until all CPUs have switched away and mmdrop()ed.
+		 */
+	}
+}
+
 /*
  * Called when the last reference to the mm
  * is dropped: either by a lazy thread or by
@@ -683,6 +730,10 @@ void __mmdrop(struct mm_struct *mm)
 {
 	BUG_ON(mm == &init_mm);
 	WARN_ON_ONCE(mm == current->mm);
+
+	/* Ensure no CPUs are using this as their lazy tlb mm */
+	shoot_lazy_tlbs(mm);
+
 	WARN_ON_ONCE(mm == current->active_mm);
 	mm_free_pgd(mm);
 	destroy_context(mm);
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v4 4/4] powerpc/64s: enable MMU_LAZY_TLB_SHOOTDOWN
  2021-06-05  1:42 ` Nicholas Piggin
@ 2021-06-05  1:42   ` Nicholas Piggin
  -1 siblings, 0 replies; 48+ messages in thread
From: Nicholas Piggin @ 2021-06-05  1:42 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nicholas Piggin, Randy Dunlap, linux-kernel, linux-arch,
	linuxppc-dev, linux-mm, Anton Blanchard, Andy Lutomirski

On a 16-socket 192-core POWER8 system, a context switching benchmark
with as many software threads as CPUs (so each switch will go in and
out of idle), upstream can achieve a rate of about 1 million context
switches per second. After this patch it goes up to 118 million.
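
The pattern being measured is roughly the following sketch (illustrative
only; this is not the actual benchmark used for the numbers above, and the
thread-per-CPU pinning is omitted): each thread blocks in read() while its
partner runs, so every switch goes through the idle thread.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define ITERS 1000000

/* One ping-pong pair; a full run would use one pair per pair of CPUs. */
static int ping[2], pong[2];

static void *partner(void *arg)
{
	char c;

	for (long i = 0; i < ITERS; i++) {
		read(ping[0], &c, 1);	/* block; this CPU can go idle */
		write(pong[1], &c, 1);	/* wake the other thread */
	}
	return NULL;
}

int main(void)
{
	pthread_t t;
	char c = 0;

	pipe(ping);
	pipe(pong);
	pthread_create(&t, NULL, partner, NULL);
	for (long i = 0; i < ITERS; i++) {
		write(ping[1], &c, 1);	/* wake the partner */
		read(pong[0], &c, 1);	/* block; this CPU can go idle */
	}
	pthread_join(t, NULL);
	printf("%d round trips\n", ITERS);
	return 0;
}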

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 088dd2afcfe4..8a092eedc692 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -252,6 +252,7 @@ config PPC
 	select IRQ_FORCED_THREADING
 	select MMU_GATHER_PAGE_SIZE
 	select MMU_GATHER_RCU_TABLE_FREE
+	select MMU_LAZY_TLB_SHOOTDOWN		if PPC_BOOK3S_64
 	select MODULES_USE_ELF_RELA
 	select NEED_DMA_MAP_STATE		if PPC64 || NOT_COHERENT_CACHE
 	select NEED_SG_DMA_LENGTH
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH v4 1/4] lazy tlb: introduce lazy mm refcount helper functions
  2021-06-05  1:42   ` Nicholas Piggin
@ 2021-06-07 23:49     ` Andrew Morton
  -1 siblings, 0 replies; 48+ messages in thread
From: Andrew Morton @ 2021-06-07 23:49 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Randy Dunlap, linux-kernel, linux-arch, linuxppc-dev, linux-mm,
	Anton Blanchard, Andy Lutomirski

On Sat,  5 Jun 2021 11:42:13 +1000 Nicholas Piggin <npiggin@gmail.com> wrote:

> Add explicit _lazy_tlb annotated functions for lazy mm refcounting.
> This makes lazy mm references more obvious, and allows explicit
> refcounting to be removed if it is not used.
> 
> ...
>
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -1314,14 +1314,14 @@ void kthread_use_mm(struct mm_struct *mm)
>  	WARN_ON_ONCE(!(tsk->flags & PF_KTHREAD));
>  	WARN_ON_ONCE(tsk->mm);
>  
> +	mmgrab(mm);
> +
>  	task_lock(tsk);
>  	/* Hold off tlb flush IPIs while switching mm's */
>  	local_irq_disable();
>  	active_mm = tsk->active_mm;
> -	if (active_mm != mm) {
> -		mmgrab(mm);
> +	if (active_mm != mm)
>  		tsk->active_mm = mm;
> -	}

Looks like a functional change.  What's happening here?



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v4 4/4] powerpc/64s: enable MMU_LAZY_TLB_SHOOTDOWN
  2021-06-05  1:42   ` Nicholas Piggin
@ 2021-06-07 23:52     ` Andrew Morton
  -1 siblings, 0 replies; 48+ messages in thread
From: Andrew Morton @ 2021-06-07 23:52 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Randy Dunlap, linux-kernel, linux-arch, linuxppc-dev, linux-mm,
	Anton Blanchard, Andy Lutomirski

On Sat,  5 Jun 2021 11:42:16 +1000 Nicholas Piggin <npiggin@gmail.com> wrote:

> On a 16-socket 192-core POWER8 system, a context switching benchmark
> with as many software threads as CPUs (so each switch will go in and
> out of idle), upstream can achieve a rate of about 1 million context
> switches per second. After this patch it goes up to 118 million.

Nice.  Do we have a feel for the benefit on any real-world workloads?

Could any other architectures benefit from these changes?

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v4 1/4] lazy tlb: introduce lazy mm refcount helper functions
  2021-06-07 23:49     ` Andrew Morton
@ 2021-06-08  1:39       ` Nicholas Piggin
  -1 siblings, 0 replies; 48+ messages in thread
From: Nicholas Piggin @ 2021-06-08  1:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Anton Blanchard, linux-arch, linux-kernel, linux-mm,
	linuxppc-dev, Andy Lutomirski, Randy Dunlap

Excerpts from Andrew Morton's message of June 8, 2021 9:49 am:
> On Sat,  5 Jun 2021 11:42:13 +1000 Nicholas Piggin <npiggin@gmail.com> wrote:
> 
>> Add explicit _lazy_tlb annotated functions for lazy mm refcounting.
>> This makes lazy mm references more obvious, and allows explicit
>> refcounting to be removed if it is not used.
>> 
>> ...
>>
>> --- a/kernel/kthread.c
>> +++ b/kernel/kthread.c
>> @@ -1314,14 +1314,14 @@ void kthread_use_mm(struct mm_struct *mm)
>>  	WARN_ON_ONCE(!(tsk->flags & PF_KTHREAD));
>>  	WARN_ON_ONCE(tsk->mm);
>>  
>> +	mmgrab(mm);
>> +
>>  	task_lock(tsk);
>>  	/* Hold off tlb flush IPIs while switching mm's */
>>  	local_irq_disable();
>>  	active_mm = tsk->active_mm;
>> -	if (active_mm != mm) {
>> -		mmgrab(mm);
>> +	if (active_mm != mm)
>>  		tsk->active_mm = mm;
>> -	}
> 
> Looks like a functional change.  What's happening here?

That's kthread_use_mm being clever about the lazy tlb mm. If it happened 
that the kthread had inherited the lazy tlb mm that happens to be the 
one we want to use here, then we already have a refcount to it via the 
lazy tlb ref.

So then it doesn't have to touch the refcount, but rather just converts
it from the lazy tlb ref to the returned reference. If the lazy tlb mm
doesn't get a reference, we can't do that.
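
To spell that out, a rough before/after sketch of the reference handling
in kthread_use_mm() (illustrative only; the authoritative change is the
hunk quoted above):

	/*
	 * Before: if the kthread's lazy tlb mm was already 'mm', the lazy
	 * reference it held was simply converted into the ->mm reference,
	 * so no refcount operation was needed.
	 */
	active_mm = tsk->active_mm;
	if (active_mm != mm) {
		mmgrab(mm);
		tsk->active_mm = mm;
	}
	tsk->mm = mm;

	/*
	 * After: a lazy tlb mm may hold no reference at all (e.g., with
	 * MMU_LAZY_TLB_SHOOTDOWN), so the conversion trick is no longer
	 * valid. Always take a real reference up front, and drop the old
	 * lazy reference with the _lazy_tlb helper.
	 */
	mmgrab(mm);
	active_mm = tsk->active_mm;
	if (active_mm != mm)
		tsk->active_mm = mm;
	tsk->mm = mm;
	/* ... switch_mm_irqs_off() etc. ... */
	if (active_mm != mm)
		mmdrop_lazy_tlb(active_mm);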

Thanks,
Nick

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v4 1/4] lazy tlb: introduce lazy mm refcount helper functions
  2021-06-08  1:39       ` Nicholas Piggin
@ 2021-06-08  1:48         ` Andrew Morton
  -1 siblings, 0 replies; 48+ messages in thread
From: Andrew Morton @ 2021-06-08  1:48 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Anton Blanchard, linux-arch, linux-kernel, linux-mm,
	linuxppc-dev, Andy Lutomirski, Randy Dunlap

On Tue, 08 Jun 2021 11:39:56 +1000 Nicholas Piggin <npiggin@gmail.com> wrote:

> > Looks like a functional change.  What's happening here?
> 
> That's kthread_use_mm being clever about the lazy tlb mm. If it happened 
> that the kthread had inherited a the lazy tlb mm that happens to be the 
> one we want to use here, then we already have a refcount to it via the 
> lazy tlb ref.
> 
> So then it doesn't have to touch the refcount, but rather just converts
> it from the lazy tlb ref to the returned reference. If the lazy tlb mm
> doesn't get a reference, we can't do that.

Please cover this in the changelog and perhaps a code comment.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v4 4/4] powerpc/64s: enable MMU_LAZY_TLB_SHOOTDOWN
  2021-06-07 23:52     ` Andrew Morton
@ 2021-06-08  2:13       ` Nicholas Piggin
  -1 siblings, 0 replies; 48+ messages in thread
From: Nicholas Piggin @ 2021-06-08  2:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Anton Blanchard, linux-arch, linux-kernel, linux-mm,
	linuxppc-dev, Andy Lutomirski, Randy Dunlap, Rik van Riel

Excerpts from Andrew Morton's message of June 8, 2021 9:52 am:
> On Sat,  5 Jun 2021 11:42:16 +1000 Nicholas Piggin <npiggin@gmail.com> wrote:
> 
>> On a 16-socket 192-core POWER8 system, a context switching benchmark
>> with as many software threads as CPUs (so each switch will go in and
>> out of idle), upstream can achieve a rate of about 1 million context
>> switches per second. After this patch it goes up to 118 million.
> 
> Nice.  Do we have a feel for the benefit on any real-world workloads?

Not really, unfortunately. I think it's always been a "known" cacheline;
it just showed up badly on will-it-scale tests recently when Anton was
doing a sweep of low-hanging scalability issues on big systems.

We have some very big systems running certain in-memory databases that
get into very high contention on mutexes, which pushes context switch
rates right up while idle time stays fairly high. That produces a lot of
parallel context switching between user and idle threads, so we might be
getting a bit of this contention there.

It's not something at the top of profiles though. And on multi-threaded
workloads like this, the normal refcounting of the user mm still has
fundamental contention. It's tricky to get the change tested on these
workloads (machine time is very limited and I can't drive the software).

I suspect it could also show up in things that do high net or disk IO rates
(enough to need a lot of cores), and do some user processing steps along
the way. You'd potentially get a lot of idle switching.

> 
> Could any other architectures benefit from these changes?
> 

The cacheline is going to bounce in the same situations on other archs,
so I would say yes. Rik at one stage had some patches to try to avoid it
on x86 some years ago, but I don't know what happened to those.

The way powerpc has to maintain mm_cpumask for its TLB flushing makes it
relatively easy to do this shootdown, and we decided the additional IPIs
were less of a concern than the bouncing. Others have different concerns,
but I tried to make it generic and add comments explaining what other
archs can do, or possibly different ways it might be achieved.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v4 2/4] lazy tlb: allow lazy tlb mm refcounting to be configurable
  2021-06-05  1:42   ` Nicholas Piggin
@ 2021-06-08  3:11     ` Nicholas Piggin
  -1 siblings, 0 replies; 48+ messages in thread
From: Nicholas Piggin @ 2021-06-08  3:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Anton Blanchard, linux-arch, linux-kernel, linux-mm,
	linuxppc-dev, Andy Lutomirski, Randy Dunlap

Excerpts from Nicholas Piggin's message of June 5, 2021 11:42 am:
> Add CONFIG_MMU_LAZY_TLB_REFCOUNT, which enables refcounting of the lazy tlb mm
> when it is context switched. This can be disabled by architectures that
> don't require this refcounting if they clean up lazy tlb mms when the
> last refcount is dropped. Currently this is always enabled, which is
> what existing code does, so the patch is effectively a no-op.
> 
> Rename rq->prev_mm to rq->prev_lazy_mm, because that's what it is.
> 
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>

Can I give you a couple of incremental patches for 2/4 and 3/4 to
improve the implementation requirement comments a bit, for the benefit
of other archs?

Thanks,
Nick
--

Explain the requirements for lazy tlb mm refcounting in the comment,
to help with archs that may want to disable this by some means other
than MMU_LAZY_TLB_SHOOTDOWN.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/Kconfig | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/arch/Kconfig b/arch/Kconfig
index 1cff045cdde6..39d8c7dcf220 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -419,6 +419,16 @@ config ARCH_WANT_IRQS_OFF_ACTIVATE_MM
 	  shootdowns should enable this.
 
 # Use normal mm refcounting for MMU_LAZY_TLB kernel thread references.
+# MMU_LAZY_TLB_REFCOUNT=n can improve the scalability of context switching
+# to/from kernel threads when the same mm is running on a lot of CPUs (a large
+# multi-threaded application), by reducing contention on the mm refcount.
+#
+# This can be disabled if the architecture ensures no CPUs are using an mm as a
+# "lazy tlb" beyond its final refcount (i.e., by the time __mmdrop frees the mm
+# or its kernel page tables). This could be arranged by arch_exit_mmap(), or
+# final exit(2) TLB flush, for example. arch code must also ensure the
+# _lazy_tlb variants of mmgrab/mmdrop are used when dropping the lazy reference
+# to a kthread ->active_mm (non-arch code has been converted already).
 config MMU_LAZY_TLB_REFCOUNT
 	def_bool y
 
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH v4 3/4] lazy tlb: shoot lazies, a non-refcounting lazy tlb option
  2021-06-05  1:42   ` Nicholas Piggin
@ 2021-06-08  3:15     ` Nicholas Piggin
  -1 siblings, 0 replies; 48+ messages in thread
From: Nicholas Piggin @ 2021-06-08  3:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Anton Blanchard, linux-arch, linux-kernel, linux-mm,
	linuxppc-dev, Andy Lutomirski, Randy Dunlap

Excerpts from Nicholas Piggin's message of June 5, 2021 11:42 am:
> On big systems, the mm refcount can become highly contended when doing
> a lot of context switching with threaded applications (particularly
> switching between the idle thread and an application thread).
> 
> Abandoning lazy tlb slows switching down quite a bit in the important
> user->idle->user cases, so instead implement a non-refcounted scheme
> that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down
> any remaining lazy ones.
> 
> Shootdown IPIs are some concern, but they have not been observed to be
> a big problem with this scheme (the powerpc implementation generated
> 314 additional interrupts on a 144 CPU system during a kernel compile).
> There are a number of strategies that could be employed to reduce IPIs
> if they turn out to be a problem for some workload.
> 
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> ---

Update the comment to be clearer, and account for the improvement
to MMU_LAZY_TLB_REFCOUNT comment.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/Kconfig | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 2ad1a505ca55..cf468c9777d8 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -433,15 +433,16 @@ config MMU_LAZY_TLB_REFCOUNT
 	def_bool y
 	depends on !MMU_LAZY_TLB_SHOOTDOWN
 
-# Instead of refcounting the lazy mm struct for kernel thread references
-# (which can cause contention with multi-threaded apps on large multiprocessor
-# systems), this option causes __mmdrop to IPI all CPUs in the mm_cpumask and
-# switch to init_mm if they were using the to-be-freed mm as the lazy tlb. To
-# implement this, architectures must use _lazy_tlb variants of mm refcounting
-# when releasing kernel thread mm references, and mm_cpumask must include at
-# least all possible CPUs in which the mm might be lazy, at the time of the
-# final mmdrop. mmgrab/mmdrop in arch/ code must be switched to _lazy_tlb
-# postfix as necessary.
+# This option allows MMU_LAZY_TLB_REFCOUNT=n. It ensures no CPUs are using an
+# mm as a lazy tlb beyond its last reference count, by shooting down these
+# users before the mm is deallocated. __mmdrop() first IPIs all CPUs that may
+# be using the mm as a lazy tlb, so that they may switch themselves to using
+# init_mm for their active mm. mm_cpumask(mm) is used to determine which CPUs
+# may be using mm as a lazy tlb mm.
+#
+# To implement this, an arch must ensure mm_cpumask(mm) contains at least all
+# possible CPUs in which the mm is lazy, and it must meet the requirements for
+# MMU_LAZY_TLB_REFCOUNT=n (see above).
 config MMU_LAZY_TLB_SHOOTDOWN
 	bool
 
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 48+ messages in thread
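
Concretely, the shootdown described in that comment can be sketched as
below; this is an illustrative version of the __mmdrop() path, not the
exact kernel/fork.c hunk from patch 3/4.

static void do_shoot_lazy_tlb(void *arg)
{
	struct mm_struct *mm = arg;

	if (current->active_mm == mm) {
		WARN_ON_ONCE(current->mm);
		current->active_mm = &init_mm;
		switch_mm(mm, &init_mm, current);
	}
}

static void cleanup_lazy_tlbs(struct mm_struct *mm)
{
	if (!IS_ENABLED(CONFIG_MMU_LAZY_TLB_SHOOTDOWN))
		return;

	/*
	 * IPI every CPU that might still be using this mm as its lazy tlb
	 * mm, so that none remain by the time __mmdrop() frees it.
	 */
	on_each_cpu_mask(mm_cpumask(mm), do_shoot_lazy_tlb, (void *)mm, 1);
}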

* Re: [PATCH v4 1/4] lazy tlb: introduce lazy mm refcount helper functions
  2021-06-08  1:48         ` Andrew Morton
@ 2021-06-08  4:11           ` Nicholas Piggin
  -1 siblings, 0 replies; 48+ messages in thread
From: Nicholas Piggin @ 2021-06-08  4:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Anton Blanchard, linux-arch, linux-kernel, linux-mm,
	linuxppc-dev, Andy Lutomirski, Randy Dunlap

Excerpts from Andrew Morton's message of June 8, 2021 11:48 am:
> On Tue, 08 Jun 2021 11:39:56 +1000 Nicholas Piggin <npiggin@gmail.com> wrote:
> 
>> > Looks like a functional change.  What's happening here?
>> 
>> That's kthread_use_mm being clever about the lazy tlb mm. If it happened 
>> that the kthread had inherited the lazy tlb mm that happens to be the 
>> one we want to use here, then we already have a refcount to it via the 
>> lazy tlb ref.
>> 
>> So then it doesn't have to touch the refcount, but rather just converts
>> it from the lazy tlb ref to the returned reference. If the lazy tlb mm
>> doesn't get a reference, we can't do that.
> 
> Please cover this in the changelog and perhaps a code comment.
> 

Yeah fair enough, I'll even throw in a bug fix as well (your nose was right, 
and it was too clever for me by half...)

Thanks,
Nick

--
Fix a refcounting bug in kthread_use_mm (the mm reference is increased
unconditionally now, but the lazy tlb refcount is still dropped only
if mm != active_mm).

And an update for the changelog:

If a kernel thread's current lazy tlb mm happens to be the one it wants to
use, then kthread_use_mm() cleverly transfers the mm refcount from the
lazy tlb mm reference to the returned reference. If the lazy tlb mm
reference is no longer identical to a normal reference, this trick does not
work, so that is changed to be explicit about the two references.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 kernel/kthread.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/kernel/kthread.c b/kernel/kthread.c
index b70e28431a01..5e9797b2d06e 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -1314,6 +1314,11 @@ void kthread_use_mm(struct mm_struct *mm)
 	WARN_ON_ONCE(!(tsk->flags & PF_KTHREAD));
 	WARN_ON_ONCE(tsk->mm);
 
+	/*
+	 * It's possible that tsk->active_mm == mm here, but we must
+	 * still mmgrab(mm) and mmdrop_lazy_tlb(active_mm), because lazy
+	 * mm may not have its own refcount (see mmgrab/drop_lazy_tlb()).
+	 */
 	mmgrab(mm);
 
 	task_lock(tsk);
@@ -1338,12 +1343,9 @@ void kthread_use_mm(struct mm_struct *mm)
 	 * memory barrier after storing to tsk->mm, before accessing
 	 * user-space memory. A full memory barrier for membarrier
 	 * {PRIVATE,GLOBAL}_EXPEDITED is implicitly provided by
-	 * mmdrop(), or explicitly with smp_mb().
+	 * mmdrop_lazy_tlb().
 	 */
-	if (active_mm != mm)
-		mmdrop_lazy_tlb(active_mm);
-	else
-		smp_mb();
+	mmdrop_lazy_tlb(active_mm);
 
 	to_kthread(tsk)->oldfs = force_uaccess_begin();
 }
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 48+ messages in thread
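
For context, the usual calling pattern for kthread_use_mm() looks
roughly like the sketch below (my_worker_fn is a made-up example; the
caller is assumed to already hold a reference on the mm).

static int my_worker_fn(void *data)
{
	struct mm_struct *mm = data;	/* user mm, reference held by the caller */

	kthread_use_mm(mm);		/* tsk->mm = mm; the lazy active_mm ref is dropped */

	/* ... access user mappings via copy_from_user()/copy_to_user() ... */

	kthread_unuse_mm(mm);		/* detach again, back to lazy tlb mode */
	return 0;
}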

* Re: [PATCH v4 2/4] lazy tlb: allow lazy tlb mm refcounting to be configurable
  2021-06-05  1:42   ` Nicholas Piggin
@ 2021-06-08 16:20     ` Andy Lutomirski
  -1 siblings, 0 replies; 48+ messages in thread
From: Andy Lutomirski @ 2021-06-08 16:20 UTC (permalink / raw)
  To: Nicholas Piggin, Andrew Morton
  Cc: Randy Dunlap, linux-kernel, linux-arch, linuxppc-dev, linux-mm,
	Anton Blanchard

On 6/4/21 6:42 PM, Nicholas Piggin wrote:
> Add CONFIG_MMU_TLB_REFCOUNT which enables refcounting of the lazy tlb mm
> when it is context switched. This can be disabled by architectures that
> don't require this refcounting if they clean up lazy tlb mms when the
> last refcount is dropped. Currently this is always enabled, which is
> what existing code does, so the patch is effectively a no-op.
> 
> Rename rq->prev_mm to rq->prev_lazy_mm, because that's what it is.

I am in favor of this approach, but I would be a lot more comfortable
with the resulting code if task->active_mm were at least better
documented and possibly even guarded by ifdefs.

x86 bare metal currently does not need the core lazy mm refcounting, and
x86 bare metal *also* does not need ->active_mm.  Under the x86 scheme,
if lazy mm refcounting were configured out, ->active_mm could become a
dangling pointer, and this makes me extremely uncomfortable.

So I tend to think that, depending on config, the core code should
either keep ->active_mm [1] alive or get rid of it entirely.

[1] I don't really think it belongs in task_struct at all.  It's not a
property of the task.  It's the *per-cpu* mm that the core code is
keeping alive for lazy purposes.  How about consolidating it with the
copy in rq?

I guess the short summary of my opinion is that I like making this
configurable, but I do not like the state of the code.

--Andy

^ permalink raw reply	[flat|nested] 48+ messages in thread
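
The refcounting in question is essentially the lazy mm handoff in
context_switch(). Simplified, with this series' _lazy_tlb helpers and
the rq->prev_lazy_mm rename applied (membarrier barriers and other
details omitted):

	if (!next->mm) {			/* switching to a kernel thread */
		next->active_mm = prev->active_mm;
		if (prev->mm)			/* from a user thread: take a lazy reference */
			mmgrab_lazy_tlb(prev->active_mm);
		else				/* from a kernel thread: hand the reference over */
			prev->active_mm = NULL;
	} else {				/* switching to a user thread */
		switch_mm_irqs_off(prev->active_mm, next->mm, next);
		if (!prev->mm) {		/* leaving a kernel thread */
			rq->prev_lazy_mm = prev->active_mm;	/* dropped in finish_task_switch() */
			prev->active_mm = NULL;
		}
	}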

* Re: [PATCH v4 2/4] lazy tlb: allow lazy tlb mm refcounting to be configurable
  2021-06-08 16:20     ` Andy Lutomirski
@ 2021-06-14  0:45       ` Nicholas Piggin
  -1 siblings, 0 replies; 48+ messages in thread
From: Nicholas Piggin @ 2021-06-14  0:45 UTC (permalink / raw)
  To: Andrew Morton, Andy Lutomirski
  Cc: Anton Blanchard, linux-arch, linux-kernel, linux-mm,
	linuxppc-dev, Randy Dunlap, Linus Torvalds

Excerpts from Andy Lutomirski's message of June 9, 2021 2:20 am:
> On 6/4/21 6:42 PM, Nicholas Piggin wrote:
>> Add CONFIG_MMU_TLB_REFCOUNT which enables refcounting of the lazy tlb mm
>> when it is context switched. This can be disabled by architectures that
>> don't require this refcounting if they clean up lazy tlb mms when the
>> last refcount is dropped. Currently this is always enabled, which is
>> what existing code does, so the patch is effectively a no-op.
>> 
>> Rename rq->prev_mm to rq->prev_lazy_mm, because that's what it is.
> 
> I am in favor of this approach, but I would be a lot more comfortable
> with the resulting code if task->active_mm were at least better
> documented and possibly even guarded by ifdefs.

active_mm is fairly well documented in Documentation/active_mm.rst IMO.
I don't think anything has changed in 20 years, I don't know what more
is needed, but if you can add to documentation that would be nice. Maybe
moving a bit of that into .c and .h files?

> x86 bare metal currently does not need the core lazy mm refcounting, and
> x86 bare metal *also* does not need ->active_mm.  Under the x86 scheme,
> if lazy mm refcounting were configured out, ->active_mm could become a
> dangling pointer, and this makes me extremely uncomfortable.
> 
> So I tend to think that, depending on config, the core code should
> either keep ->active_mm [1] alive or get rid of it entirely.

I don't actually know what you mean.

core code needs the concept of an "active_mm". This is the mm that your 
kernel threads are using, even in the unmerged CONFIG_LAZY_TLB=n patch,
active_mm still points to init_mm for kernel threads.

We could hide that idea behind an active_mm() function that would always 
return &init_mm if mm==NULL, but you still have the concept of an active
mm and a pointer that callers must not access after free (because some
cases will be CONFIG_LAZY_TLB=y).

> [1] I don't really think it belongs in task_struct at all.  It's not a
> property of the task.  It's the *per-cpu* mm that the core code is
> keeping alive for lazy purposes.  How about consolidating it with the
> copy in rq?

I agree it's conceptually a per-cpu property. I don't know why it was 
done this way, maybe it was just convenient and works well for mm and 
active_mm to be adjacent. Linus might have a better insight.

> I guess the short summary of my opinion is that I like making this
> configurable, but I do not like the state of the code.

I don't think I'd object to moving active_mm to rq and converting all
usages to active_mm() while we're there, it would make things a bit
more configurable. But I don't see it making core code fundamentally
less complex... if you're referring to the x86 mm switching monstrosity,
then that's understandable, but I admit I haven't spent enough time
looking at it to make a useful comment. A patch would be enlightening.
I have the leftover CONFIG_LAZY_TLB=n patch; if you were thinking of
building on that, I can send it to you.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 48+ messages in thread
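
A sketch of the active_mm() accessor floated above (illustrative only,
not part of this series):

static inline struct mm_struct *active_mm(struct task_struct *tsk)
{
	/*
	 * Kernel threads have no mm of their own; report init_mm as the
	 * active address space, as in the unmerged CONFIG_LAZY_TLB=n patch.
	 */
	return tsk->mm ? tsk->mm : &init_mm;
}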

* Re: [PATCH v4 2/4] lazy tlb: allow lazy tlb mm refcounting to be configurable
  2021-06-14  0:45       ` Nicholas Piggin
@ 2021-06-14  3:52         ` Andy Lutomirski
  -1 siblings, 0 replies; 48+ messages in thread
From: Andy Lutomirski @ 2021-06-14  3:52 UTC (permalink / raw)
  To: Nicholas Piggin, Andrew Morton
  Cc: Anton Blanchard, linux-arch, linux-kernel, linux-mm,
	linuxppc-dev, Randy Dunlap, Linus Torvalds

On 6/13/21 5:45 PM, Nicholas Piggin wrote:
> Excerpts from Andy Lutomirski's message of June 9, 2021 2:20 am:
>> On 6/4/21 6:42 PM, Nicholas Piggin wrote:
>>> Add CONFIG_MMU_TLB_REFCOUNT which enables refcounting of the lazy tlb mm
>>> when it is context switched. This can be disabled by architectures that
>>> don't require this refcounting if they clean up lazy tlb mms when the
>>> last refcount is dropped. Currently this is always enabled, which is
>>> what existing code does, so the patch is effectively a no-op.
>>>
>>> Rename rq->prev_mm to rq->prev_lazy_mm, because that's what it is.
>>
>> I am in favor of this approach, but I would be a lot more comfortable
>> with the resulting code if task->active_mm were at least better
>> documented and possibly even guarded by ifdefs.
> 
> active_mm is fairly well documented in Documentation/active_mm.rst IMO.
> I don't think anything has changed in 20 years, I don't know what more
> is needed, but if you can add to documentation that would be nice. Maybe
> moving a bit of that into .c and .h files?
> 

Quoting from that file:

  - however, we obviously need to keep track of which address space we
    "stole" for such an anonymous user. For that, we have "tsk->active_mm",
    which shows what the currently active address space is.

This isn't even true right now on x86.  With your patch applied:

 To support all that, the "struct mm_struct" now has two counters: a
 "mm_users" counter that is how many "real address space users" there are,
 and a "mm_count" counter that is the number of "lazy" users (ie anonymous
 users) plus one if there are any real users.

isn't even true any more.


>> x86 bare metal currently does not need the core lazy mm refcounting, and
>> x86 bare metal *also* does not need ->active_mm.  Under the x86 scheme,
>> if lazy mm refcounting were configured out, ->active_mm could become a
>> dangling pointer, and this makes me extremely uncomfortable.
>>
>> So I tend to think that, depending on config, the core code should
>> either keep ->active_mm [1] alive or get rid of it entirely.
> 
> I don't actually know what you mean.
> 
> core code needs the concept of an "active_mm". This is the mm that your 
> kernel threads are using, even in the unmerged CONFIG_LAZY_TLB=n patch,
> active_mm still points to init_mm for kernel threads.

Core code does *not* need this concept.  First, it's wrong on x86 since
at least 4.15.  Any core code that actually assumes that ->active_mm is
"active" for any sensible definition of the word active is wrong.
Fortunately there is no such code.

I looked through all active_mm references in core code.  We have:

kernel/sched/core.c: it's all refcounting, although it's a bit tangled
with membarrier.

kernel/kthread.c: same.  refcounting and membarrier stuff.

kernel/exit.c: exit_mm() a BUG_ON().

kernel/fork.c: initialization code and a warning.

kernel/cpu.c: cpu offline stuff.  wouldn't be needed if active_mm went away.

fs/exec.c: nothing of interest

I didn't go through drivers, but I maintain my point.  active_mm is
there for refcounting.  So please don't just make it even more confusing
-- do your performance improvement, but improve the code at the same
time: get rid of active_mm, at least on architectures that opt out of
the refcounting.



> 
> We could hide that idea behind an active_mm() function that would always 
> return &init_mm if mm==NULL, but you still have the concept of an active
> mm and a pointer that callers must not access after free (because some
> cases will be CONFIG_LAZY_TLB=y).
> 
>> [1] I don't really think it belongs in task_struct at all.  It's not a
>> property of the task.  It's the *per-cpu* mm that the core code is
>> keeping alive for lazy purposes.  How about consolidating it with the
>> copy in rq?
> 
> I agree it's conceptually a per-cpu property. I don't know why it was 
> done this way, maybe it was just convenient and works well for mm and 
> active_mm to be adjacent. Linus might have a better insight.
> 
>> I guess the short summary of my opinion is that I like making this
>> configurable, but I do not like the state of the code.
> 
> I don't think I'd object to moving active_mm to rq and converting all
> usages to active_mm() while we're there, it would make things a bit
> more configurable. But I don't see it making core code fundamentally
> less complex... if you're referring to the x86 mm switching monstrosity,
> then that's understandable, but I admit I haven't spent enough time
> looking at it to make a useful comment. A patch would be enlightening,
> I have the leftover CONFIG_LAZY_TLB=n patch if you were thinking of 
> building on that I can send it to you.
> 
> Thanks,
> Nick
> 


^ permalink raw reply	[flat|nested] 48+ messages in thread
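
For comparison, the x86 scheme referred to here tracks the CR3-loaded mm
per-cpu rather than through ->active_mm; roughly (a sketch only, see the
arch/x86 tlb code for the real thing):

static inline struct mm_struct *x86_loaded_mm(void)
{
	/* the mm actually loaded in CR3 is tracked per-cpu, not per-task */
	return this_cpu_read(cpu_tlbstate.loaded_mm);
}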

* Re: [PATCH v4 2/4] lazy tlb: allow lazy tlb mm refcounting to be configurable
  2021-06-14  3:52         ` Andy Lutomirski
@ 2021-06-14  4:14           ` Nicholas Piggin
  -1 siblings, 0 replies; 48+ messages in thread
From: Nicholas Piggin @ 2021-06-14  4:14 UTC (permalink / raw)
  To: Andrew Morton, Andy Lutomirski
  Cc: Anton Blanchard, linux-arch, linux-kernel, linux-mm,
	linuxppc-dev, Randy Dunlap, Linus Torvalds

Excerpts from Andy Lutomirski's message of June 14, 2021 1:52 pm:
> On 6/13/21 5:45 PM, Nicholas Piggin wrote:
>> Excerpts from Andy Lutomirski's message of June 9, 2021 2:20 am:
>>> On 6/4/21 6:42 PM, Nicholas Piggin wrote:
>>>> Add CONFIG_MMU_TLB_REFCOUNT which enables refcounting of the lazy tlb mm
>>>> when it is context switched. This can be disabled by architectures that
>>>> don't require this refcounting if they clean up lazy tlb mms when the
>>>> last refcount is dropped. Currently this is always enabled, which is
>>>> what existing code does, so the patch is effectively a no-op.
>>>>
>>>> Rename rq->prev_mm to rq->prev_lazy_mm, because that's what it is.
>>>
>>> I am in favor of this approach, but I would be a lot more comfortable
>>> with the resulting code if task->active_mm were at least better
>>> documented and possibly even guarded by ifdefs.
>> 
>> active_mm is fairly well documented in Documentation/active_mm.rst IMO.
>> I don't think anything has changed in 20 years, I don't know what more
>> is needed, but if you can add to documentation that would be nice. Maybe
>> moving a bit of that into .c and .h files?
>> 
> 
> Quoting from that file:
> 
>   - however, we obviously need to keep track of which address space we
>     "stole" for such an anonymous user. For that, we have "tsk->active_mm",
>     which shows what the currently active address space is.
> 
> This isn't even true right now on x86.

From the perspective of core code, it is. x86 might do something crazy 
with it, but it has to make it appear this way to non-arch code that
uses active_mm.

Is x86's scheme documented?

> With your patch applied:
> 
>  To support all that, the "struct mm_struct" now has two counters: a
>  "mm_users" counter that is how many "real address space users" there are,
>  and a "mm_count" counter that is the number of "lazy" users (ie anonymous
>  users) plus one if there are any real users.
> 
> isn't even true any more.

Well yeah but the active_mm concept hasn't changed. The refcounting 
change is hopefully reasonably documented?

> 
> 
>>> x86 bare metal currently does not need the core lazy mm refcounting, and
>>> x86 bare metal *also* does not need ->active_mm.  Under the x86 scheme,
>>> if lazy mm refcounting were configured out, ->active_mm could become a
>>> dangling pointer, and this makes me extremely uncomfortable.
>>>
>>> So I tend to think that, depending on config, the core code should
>>> either keep ->active_mm [1] alive or get rid of it entirely.
>> 
>> I don't actually know what you mean.
>> 
>> core code needs the concept of an "active_mm". This is the mm that your 
>> kernel threads are using, even in the unmerged CONFIG_LAZY_TLB=n patch,
>> active_mm still points to init_mm for kernel threads.
> 
> Core code does *not* need this concept.  First, it's wrong on x86 since
> at least 4.15.  Any core code that actually assumes that ->active_mm is
> "active" for any sensible definition of the word active is wrong.
> Fortunately there is no such code.
> 
> I looked through all active_mm references in core code.  We have:
> 
> kernel/sched/core.c: it's all refcounting, although it's a bit tangled
> with membarrier.
> 
> kernel/kthread.c: same.  refcounting and membarrier stuff.
> 
> kernel/exit.c: exit_mm() a BUG_ON().
> 
> kernel/fork.c: initialization code and a warning.
> 
> kernel/cpu.c: cpu offline stuff.  wouldn't be needed if active_mm went away.
> 
> fs/exec.c: nothing of interest

I might not have been clear. Core code doesn't need active_mm if 
active_mm somehow goes away. I'm saying active_mm can't go away because
it's needed to support (most) archs that do lazy tlb mm switching.

The part I don't understand is when you say it can just go away. How? 

> I didn't go through drivers, but I maintain my point.  active_mm is
> there for refcounting.  So please don't just make it even more confusing
> -- do your performance improvement, but improve the code at the same
> time: get rid of active_mm, at least on architectures that opt out of
> the refcounting.

powerpc opts out of the refcounting and can not "get rid of active_mm".
Not even in theory.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v4 2/4] lazy tlb: allow lazy tlb mm refcounting to be configurable
  2021-06-14  4:14           ` Nicholas Piggin
@ 2021-06-14  4:47             ` Nicholas Piggin
  -1 siblings, 0 replies; 48+ messages in thread
From: Nicholas Piggin @ 2021-06-14  4:47 UTC (permalink / raw)
  To: Andrew Morton, Andy Lutomirski
  Cc: Anton Blanchard, linux-arch, linux-kernel, linux-mm,
	linuxppc-dev, Randy Dunlap, Linus Torvalds

Excerpts from Nicholas Piggin's message of June 14, 2021 2:14 pm:
> Excerpts from Andy Lutomirski's message of June 14, 2021 1:52 pm:
>> On 6/13/21 5:45 PM, Nicholas Piggin wrote:
>>> Excerpts from Andy Lutomirski's message of June 9, 2021 2:20 am:
>>>> On 6/4/21 6:42 PM, Nicholas Piggin wrote:
>>>>> Add CONFIG_MMU_TLB_REFCOUNT which enables refcounting of the lazy tlb mm
>>>>> when it is context switched. This can be disabled by architectures that
>>>>> don't require this refcounting if they clean up lazy tlb mms when the
>>>>> last refcount is dropped. Currently this is always enabled, which is
>>>>> what existing code does, so the patch is effectively a no-op.
>>>>>
>>>>> Rename rq->prev_mm to rq->prev_lazy_mm, because that's what it is.
>>>>
>>>> I am in favor of this approach, but I would be a lot more comfortable
>>>> with the resulting code if task->active_mm were at least better
>>>> documented and possibly even guarded by ifdefs.
>>> 
>>> active_mm is fairly well documented in Documentation/active_mm.rst IMO.
>>> I don't think anything has changed in 20 years, I don't know what more
>>> is needed, but if you can add to documentation that would be nice. Maybe
>>> moving a bit of that into .c and .h files?
>>> 
>> 
>> Quoting from that file:
>> 
>>   - however, we obviously need to keep track of which address space we
>>     "stole" for such an anonymous user. For that, we have "tsk->active_mm",
>>     which shows what the currently active address space is.
>> 
>> This isn't even true right now on x86.
> 
> From the perspective of core code, it is. x86 might do something crazy 
> with it, but it has to make it appear this way to non-arch code that
> uses active_mm.
> 
> Is x86's scheme documented?
> 
>> With your patch applied:
>> 
>>  To support all that, the "struct mm_struct" now has two counters: a
>>  "mm_users" counter that is how many "real address space users" there are,
>>  and a "mm_count" counter that is the number of "lazy" users (ie anonymous
>>  users) plus one if there are any real users.
>> 
>> isn't even true any more.
> 
> Well yeah but the active_mm concept hasn't changed. The refcounting 
> change is hopefully reasonably documented?
> 
>> 
>> 
>>>> x86 bare metal currently does not need the core lazy mm refcounting, and
>>>> x86 bare metal *also* does not need ->active_mm.  Under the x86 scheme,
>>>> if lazy mm refcounting were configured out, ->active_mm could become a
>>>> dangling pointer, and this makes me extremely uncomfortable.
>>>>
>>>> So I tend to think that, depending on config, the core code should
>>>> either keep ->active_mm [1] alive or get rid of it entirely.
>>> 
>>> I don't actually know what you mean.
>>> 
>>> core code needs the concept of an "active_mm". This is the mm that your 
>>> kernel threads are using, even in the unmerged CONFIG_LAZY_TLB=n patch,
>>> active_mm still points to init_mm for kernel threads.
>> 
>> Core code does *not* need this concept.  First, it's wrong on x86 since
>> at least 4.15.  Any core code that actually assumes that ->active_mm is
>> "active" for any sensible definition of the word active is wrong.
>> Fortunately there is no such code.
>> 
>> I looked through all active_mm references in core code.  We have:
>> 
>> kernel/sched/core.c: it's all refcounting, although it's a bit tangled
>> with membarrier.
>> 
>> kernel/kthread.c: same.  refcounting and membarrier stuff.
>> 
>> kernel/exit.c: exit_mm() a BUG_ON().
>> 
>> kernel/fork.c: initialization code and a warning.
>> 
>> kernel/cpu.c: cpu offline stuff.  wouldn't be needed if active_mm went away.
>> 
>> fs/exec.c: nothing of interest
> 
> I might not have been clear. Core code doesn't need active_mm if 
> active_mm somehow goes away. I'm saying active_mm can't go away because
> it's needed to support (most) archs that do lazy tlb mm switching.
> 
> The part I don't understand is when you say it can just go away. How? 
> 
>> I didn't go through drivers, but I maintain my point.  active_mm is
>> there for refcounting.  So please don't just make it even more confusing
>> -- do your performance improvement, but improve the code at the same
>> time: get rid of active_mm, at least on architectures that opt out of
>> the refcounting.
> 
> powerpc opts out of the refcounting and can not "get rid of active_mm".
> Not even in theory.

That is to say, it does do a type of reference management that requires 
active_mm so you can argue it has not entirely opted out of refcounting.
But we're not just doing refcounting for the sake of refcounting! That
would make no sense.

active_mm is required because that's the mm that we have switched to 
(from core code's perspective), and it is integral to know when to 
switch to a different mm. See how active_mm is a fundamental concept
in core code? It's part of the contract between core code and the
arch mm context management calls. reference counting follows from there
but it's not the _reason_ for this code.

Pretend the reference problem does not exist (whether by refcounting or 
shootdown or garbage collection or whatever). We still can't remove 
active_mm! We need it to know how to call into arch functions like 
switch_mm.

I don't know if you just forgot that critical requirement in your above
list, or you really are applying x86's mental model to this code, which
is doing something entirely different that does not need it at all. If
that is the case I really don't mind some cleanup or wrapper functions
so that x86 can do entirely its own thing, but then you can't criticize
core code's use of active_mm based on the current state of x86. It's
x86 that needs documentation and cleaning up.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v4 2/4] lazy tlb: allow lazy tlb mm refcounting to be configurable
  2021-06-14  4:47             ` Nicholas Piggin
@ 2021-06-14  5:21               ` Nicholas Piggin
  -1 siblings, 0 replies; 48+ messages in thread
From: Nicholas Piggin @ 2021-06-14  5:21 UTC (permalink / raw)
  To: Andrew Morton, Andy Lutomirski
  Cc: Anton Blanchard, linux-arch, linux-kernel, linux-mm,
	linuxppc-dev, Randy Dunlap, Linus Torvalds

Excerpts from Nicholas Piggin's message of June 14, 2021 2:47 pm:
> Excerpts from Nicholas Piggin's message of June 14, 2021 2:14 pm:
>> Excerpts from Andy Lutomirski's message of June 14, 2021 1:52 pm:
>>> On 6/13/21 5:45 PM, Nicholas Piggin wrote:
>>>> Excerpts from Andy Lutomirski's message of June 9, 2021 2:20 am:
>>>>> On 6/4/21 6:42 PM, Nicholas Piggin wrote:
>>>>>> Add CONFIG_MMU_TLB_REFCOUNT which enables refcounting of the lazy tlb mm
>>>>>> when it is context switched. This can be disabled by architectures that
>>>>>> don't require this refcounting if they clean up lazy tlb mms when the
>>>>>> last refcount is dropped. Currently this is always enabled, which is
>>>>>> what existing code does, so the patch is effectively a no-op.
>>>>>>
>>>>>> Rename rq->prev_mm to rq->prev_lazy_mm, because that's what it is.
>>>>>
>>>>> I am in favor of this approach, but I would be a lot more comfortable
>>>>> with the resulting code if task->active_mm were at least better
>>>>> documented and possibly even guarded by ifdefs.
>>>> 
>>>> active_mm is fairly well documented in Documentation/active_mm.rst IMO.
>>>> I don't think anything has changed in 20 years, I don't know what more
>>>> is needed, but if you can add to documentation that would be nice. Maybe
>>>> moving a bit of that into .c and .h files?
>>>> 
>>> 
>>> Quoting from that file:
>>> 
>>>   - however, we obviously need to keep track of which address space we
>>>     "stole" for such an anonymous user. For that, we have "tsk->active_mm",
>>>     which shows what the currently active address space is.
>>> 
>>> This isn't even true right now on x86.
>> 
>> From the perspective of core code, it is. x86 might do something crazy 
>> with it, but it has to make it appear this way to non-arch code that
>> uses active_mm.
>> 
>> Is x86's scheme documented?
>> 
>>> With your patch applied:
>>> 
>>>  To support all that, the "struct mm_struct" now has two counters: a
>>>  "mm_users" counter that is how many "real address space users" there are,
>>>  and a "mm_count" counter that is the number of "lazy" users (ie anonymous
>>>  users) plus one if there are any real users.
>>> 
>>> isn't even true any more.
>> 
>> Well yeah but the active_mm concept hasn't changed. The refcounting 
>> change is hopefully reasonably documented?
>> 
>>> 
>>> 
>>>>> x86 bare metal currently does not need the core lazy mm refcounting, and
>>>>> x86 bare metal *also* does not need ->active_mm.  Under the x86 scheme,
>>>>> if lazy mm refcounting were configured out, ->active_mm could become a
>>>>> dangling pointer, and this makes me extremely uncomfortable.
>>>>>
>>>>> So I tend to think that, depending on config, the core code should
>>>>> either keep ->active_mm [1] alive or get rid of it entirely.
>>>> 
>>>> I don't actually know what you mean.
>>>> 
>>>> core code needs the concept of an "active_mm". This is the mm that your 
>>>> kernel threads are using, even in the unmerged CONFIG_LAZY_TLB=n patch,
>>>> active_mm still points to init_mm for kernel threads.
>>> 
>>> Core code does *not* need this concept.  First, it's wrong on x86 since
>>> at least 4.15.  Any core code that actually assumes that ->active_mm is
>>> "active" for any sensible definition of the word active is wrong.
>>> Fortunately there is no such code.
>>> 
>>> I looked through all active_mm references in core code.  We have:
>>> 
>>> kernel/sched/core.c: it's all refcounting, although it's a bit tangled
>>> with membarrier.
>>> 
>>> kernel/kthread.c: same.  refcounting and membarrier stuff.
>>> 
>>> kernel/exit.c: exit_mm() a BUG_ON().
>>> 
>>> kernel/fork.c: initialization code and a warning.
>>> 
>>> kernel/cpu.c: cpu offline stuff.  wouldn't be needed if active_mm went away.
>>> 
>>> fs/exec.c: nothing of interest
>> 
>> I might not have been clear. Core code doesn't need active_mm if 
>> active_mm somehow goes away. I'm saying active_mm can't go away because
>> it's needed to support (most) archs that do lazy tlb mm switching.
>> 
>> The part I don't understand is when you say it can just go away. How? 
>> 
>>> I didn't go through drivers, but I maintain my point.  active_mm is
>>> there for refcounting.  So please don't just make it even more confusing
>>> -- do your performance improvement, but improve the code at the same
>>> time: get rid of active_mm, at least on architectures that opt out of
>>> the refcounting.
>> 
>> powerpc opts out of the refcounting and can not "get rid of active_mm".
>> Not even in theory.
> 
> That is to say, it does do a type of reference management that requires 
> active_mm so you can argue it has not entirely opted out of refcounting.
> But we're not just doing refcounting for the sake of refcounting! That
> would make no sense.
> 
> active_mm is required because that's the mm that we have switched to 
> (from core code's perspective), and it is integral to know when to 
> switch to a different mm. See how active_mm is a fundamental concept
> in core code? It's part of the contract between core code and the
> arch mm context management calls. reference counting follows from there
> but it's not the _reason_ for this code.
> 
> Pretend the reference problem does not exit (whether by refcounting or 
> shootdown or garbage collection or whatever). We still can't remove 
> active_mm! We need it to know how to call into arch functions like 
> switch_mm.
> 
> I don't know if you just forgot that critical requirement in your above 
> list, or you actually are entirely using x86's mental model for this 
> code which is doing something entirely different that does not need it 
> at all. If that is the case I really don't mind some cleanup or wrapper 
> functions for x86 do entirely do its own thing, but if that's the case
> you can't criticize core code's use of active_mm due to the current
> state of x86. It's x86 that needs documentation and cleaning up.

Ah, that must be where your confusion is coming from: x86's switch_mm 
doesn't use prev anywhere, and the reference scheme it is using appears 
to be under-documented, although vague references in changelogs suggest 
it has not actually "opted out" of active_mm refcounting.

That's understandable, but please redirect your objections to the proper 
place. git blame suggests 3d28ebceaffab.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v4 2/4] lazy tlb: allow lazy tlb mm refcounting to be configurable
  2021-06-14  5:21               ` Nicholas Piggin
@ 2021-06-14 16:20                 ` Andy Lutomirski
  -1 siblings, 0 replies; 48+ messages in thread
From: Andy Lutomirski @ 2021-06-14 16:20 UTC (permalink / raw)
  To: Nicholas Piggin, Andrew Morton
  Cc: Anton Blanchard, linux-arch, linux-kernel, linux-mm,
	linuxppc-dev, Randy Dunlap, Linus Torvalds, Rik van Riel

Replying to several emails at once...


On 6/13/21 10:21 PM, Nicholas Piggin wrote:
> Excerpts from Nicholas Piggin's message of June 14, 2021 2:47 pm:
>> Excerpts from Nicholas Piggin's message of June 14, 2021 2:14 pm:
>>> Excerpts from Andy Lutomirski's message of June 14, 2021 1:52 pm:
>>>> On 6/13/21 5:45 PM, Nicholas Piggin wrote:
>>>>> Excerpts from Andy Lutomirski's message of June 9, 2021 2:20 am:
>>>>>> On 6/4/21 6:42 PM, Nicholas Piggin wrote:
>>>>>>> Add CONFIG_MMU_TLB_REFCOUNT which enables refcounting of the lazy tlb mm
>>>>>>> when it is context switched. This can be disabled by architectures that
>>>>>>> don't require this refcounting if they clean up lazy tlb mms when the
>>>>>>> last refcount is dropped. Currently this is always enabled, which is
>>>>>>> what existing code does, so the patch is effectively a no-op.
>>>>>>>
>>>>>>> Rename rq->prev_mm to rq->prev_lazy_mm, because that's what it is.
>>>>>>
>>>>>> I am in favor of this approach, but I would be a lot more comfortable
>>>>>> with the resulting code if task->active_mm were at least better
>>>>>> documented and possibly even guarded by ifdefs.
>>>>>
>>>>> active_mm is fairly well documented in Documentation/active_mm.rst IMO.
>>>>> I don't think anything has changed in 20 years, I don't know what more
>>>>> is needed, but if you can add to documentation that would be nice. Maybe
>>>>> moving a bit of that into .c and .h files?
>>>>>
>>>>
>>>> Quoting from that file:
>>>>
>>>>   - however, we obviously need to keep track of which address space we
>>>>     "stole" for such an anonymous user. For that, we have "tsk->active_mm",
>>>>     which shows what the currently active address space is.
>>>>
>>>> This isn't even true right now on x86.
>>>
>>> From the perspective of core code, it is. x86 might do something crazy 
>>> with it, but it has to make it appear this way to non-arch code that
>>> uses active_mm.
>>>
>>> Is x86's scheme documented?

arch/x86/include/asm/tlbflush.h documents it a bit:

        /*
         * cpu_tlbstate.loaded_mm should match CR3 whenever interrupts
         * are on.  This means that it may not match current->active_mm,
         * which will contain the previous user mm when we're in lazy TLB
         * mode even if we've already switched back to swapper_pg_dir.
         *
         * During switch_mm_irqs_off(), loaded_mm will be set to
         * LOADED_MM_SWITCHING during the brief interrupts-off window
         * when CR3 and loaded_mm would otherwise be inconsistent.  This
         * is for nmi_uaccess_okay()'s benefit.
         */



>>>
>>>> With your patch applied:
>>>>
>>>>  To support all that, the "struct mm_struct" now has two counters: a
>>>>  "mm_users" counter that is how many "real address space users" there are,
>>>>  and a "mm_count" counter that is the number of "lazy" users (ie anonymous
>>>>  users) plus one if there are any real users.
>>>>
>>>> isn't even true any more.
>>>
>>> Well yeah but the active_mm concept hasn't changed. The refcounting 
>>> change is hopefully reasonably documented?

active_mm is *only* refcounting in the core code.  See below.

>>>>
>>>> I looked through all active_mm references in core code.  We have:
>>>>
>>>> kernel/sched/core.c: it's all refcounting, although it's a bit tangled
>>>> with membarrier.
>>>>
>>>> kernel/kthread.c: same.  refcounting and membarrier stuff.
>>>>
>>>> kernel/exit.c: exit_mm() a BUG_ON().
>>>>
>>>> kernel/fork.c: initialization code and a warning.
>>>>
>>>> kernel/cpu.c: cpu offline stuff.  wouldn't be needed if active_mm went away.
>>>>
>>>> fs/exec.c: nothing of interest
>>>
>>> I might not have been clear. Core code doesn't need active_mm if 
>>> active_mm somehow goes away. I'm saying active_mm can't go away because
>>> it's needed to support (most) archs that do lazy tlb mm switching.
>>>
>>> The part I don't understand is when you say it can just go away. How? 

#ifdef CONFIG_MMU_TLB_REFCOUNT
	struct mm_struct *active_mm;
#endif

>>>
>>>> I didn't go through drivers, but I maintain my point.  active_mm is
>>>> there for refcounting.  So please don't just make it even more confusing
>>>> -- do your performance improvement, but improve the code at the same
>>>> time: get rid of active_mm, at least on architectures that opt out of
>>>> the refcounting.
>>>
>>> powerpc opts out of the refcounting and can not "get rid of active_mm".
>>> Not even in theory.
>>
>> That is to say, it does do a type of reference management that requires 
>> active_mm so you can argue it has not entirely opted out of refcounting.
>> But we're not just doing refcounting for the sake of refcounting! That
>> would make no sense.
>>
>> active_mm is required because that's the mm that we have switched to 
>> (from core code's perspective), and it is integral to know when to 
>> switch to a different mm. See how active_mm is a fundamental concept
>> in core code? It's part of the contract between core code and the
>> arch mm context management calls. reference counting follows from there
>> but it's not the _reason_ for this code.

I don't understand what contract you're talking about.  The core code
maintains an active_mm counter and keeps that mm_struct from
disappearing.  That's *it*.  The core code does not care that active_mm
is active, and x86 provides evidence of that -- on x86,
current->active_mm may well be completely unused.

>>
>> Pretend the reference problem does not exit (whether by refcounting or 
>> shootdown or garbage collection or whatever). We still can't remove 
>> active_mm! We need it to know how to call into arch functions like 
>> switch_mm.

static inline void do_switch_mm(struct task_struct *prev_task, ...)
{
#ifdef CONFIG_MMU_TLB_REFCOUNT
	switch_mm(...);
#else
	switch_mm(fewer parameters);
	/* or pass NULL or whatever. */
#endif
}

>>
>> I don't know if you just forgot that critical requirement in your above 
>> list, or you actually are entirely using x86's mental model for this 
>> code which is doing something entirely different that does not need it 
>> at all. If that is the case I really don't mind some cleanup or wrapper 
>> functions for x86 do entirely do its own thing, but if that's the case
>> you can't criticize core code's use of active_mm due to the current
>> state of x86. It's x86 that needs documentation and cleaning up.
> 
> Ah, that must be where your confusion is coming from: x86's switch_mm 
> doesn't use prev anywhere, and the reference scheme it is using appears 
> to be under-documented, although vague references in changelogs suggest 
> it has not actually "opted out" of active_mm refcounting.

All of this is true, except I don't believe I'm confused.

> 
> That's understandable, but please redirect your objections to the proper 
> place. git blame suggests 3d28ebceaffab.

Thanks for the snark.

Here's the situation.

Before that patch, x86 more or less fully used the core scheme.  With
that patch, x86 has the property that loaded_mm (which is used) either
points to init_mm (which is permanently live) or to active_mm (which the
core code keeps alive, which is the whole point).  x86 still uses the
core active_mm refcounting scheme.  The result is a bit overcomplicated,
but it works, and it enabled massive improvements to the x86 arch code.

You are proposing a whole new simplification in which an arch can opt in
to telling the core code that it doesn't need to keep active_mm alive.
Great!  But now it's possible to get quite confused -- either an
elaborate dance is needed or current->active_mm could point to freed
memory.  This is poor design.

I'm entirely in favor of allowing arches to opt out of active_mm
refcounting.  As you've noticed, it's quite expensive on large systems.
x86 would opt out, too, if given the opportunity [1].  But I think you
should do it right.  If an arch opts out of active_mm refcounting, then
keeping a lazy mm (if any!) alive becomes entirely the arch code's
responsibility.  Once that happens, task->active_mm is not just a waste
of 4-8 bytes of memory per task, it's actively harmful -- some code,
somewhere in the kernel, might dereference it and access freed memory!

So please do your patches right.  By all means add a new config option,
but make that config option make active_mm go away entirely.  Then it
can't be misused.

NAK to the current set.

--Andy

[1] Hi Rik!  I think you had benchmarks that made mm refcounting look
quite bad.  If x86 can opt out of the core scheme, we can just rejigger
the exit_mm path to force-IPI everyone instead of allowing the paravirt
code to possibly optimize that path.  Or we can use a hazard pointer
scheme like my WIP patch.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v4 2/4] lazy tlb: allow lazy tlb mm refcounting to be configurable
  2021-06-14 16:20                 ` Andy Lutomirski
@ 2021-06-15  0:55                   ` Nicholas Piggin
  -1 siblings, 0 replies; 48+ messages in thread
From: Nicholas Piggin @ 2021-06-15  0:55 UTC (permalink / raw)
  To: Andrew Morton, Andy Lutomirski
  Cc: Anton Blanchard, linux-arch, linux-kernel, linux-mm,
	linuxppc-dev, Randy Dunlap, Rik van Riel, Linus Torvalds

Excerpts from Andy Lutomirski's message of June 15, 2021 2:20 am:
> Replying to several emails at once...
> 
> 
> On 6/13/21 10:21 PM, Nicholas Piggin wrote:
>> Excerpts from Nicholas Piggin's message of June 14, 2021 2:47 pm:
>>> Excerpts from Nicholas Piggin's message of June 14, 2021 2:14 pm:
>>>> Excerpts from Andy Lutomirski's message of June 14, 2021 1:52 pm:
>>>>> On 6/13/21 5:45 PM, Nicholas Piggin wrote:
>>>>>> Excerpts from Andy Lutomirski's message of June 9, 2021 2:20 am:
>>>>>>> On 6/4/21 6:42 PM, Nicholas Piggin wrote:
>>>>>>>> Add CONFIG_MMU_TLB_REFCOUNT which enables refcounting of the lazy tlb mm
>>>>>>>> when it is context switched. This can be disabled by architectures that
>>>>>>>> don't require this refcounting if they clean up lazy tlb mms when the
>>>>>>>> last refcount is dropped. Currently this is always enabled, which is
>>>>>>>> what existing code does, so the patch is effectively a no-op.
>>>>>>>>
>>>>>>>> Rename rq->prev_mm to rq->prev_lazy_mm, because that's what it is.
>>>>>>>
>>>>>>> I am in favor of this approach, but I would be a lot more comfortable
>>>>>>> with the resulting code if task->active_mm were at least better
>>>>>>> documented and possibly even guarded by ifdefs.
>>>>>>
>>>>>> active_mm is fairly well documented in Documentation/active_mm.rst IMO.
>>>>>> I don't think anything has changed in 20 years, I don't know what more
>>>>>> is needed, but if you can add to documentation that would be nice. Maybe
>>>>>> moving a bit of that into .c and .h files?
>>>>>>
>>>>>
>>>>> Quoting from that file:
>>>>>
>>>>>   - however, we obviously need to keep track of which address space we
>>>>>     "stole" for such an anonymous user. For that, we have "tsk->active_mm",
>>>>>     which shows what the currently active address space is.
>>>>>
>>>>> This isn't even true right now on x86.
>>>>
>>>> From the perspective of core code, it is. x86 might do something crazy 
>>>> with it, but it has to make it appear this way to non-arch code that
>>>> uses active_mm.
>>>>
>>>> Is x86's scheme documented?
> 
> arch/x86/include/asm/tlbflush.h documents it a bit:
> 
>         /*
>          * cpu_tlbstate.loaded_mm should match CR3 whenever interrupts
>          * are on.  This means that it may not match current->active_mm,
>          * which will contain the previous user mm when we're in lazy TLB
>          * mode even if we've already switched back to swapper_pg_dir.
>          *
>          * During switch_mm_irqs_off(), loaded_mm will be set to
>          * LOADED_MM_SWITCHING during the brief interrupts-off window
>          * when CR3 and loaded_mm would otherwise be inconsistent.  This
>          * is for nmi_uaccess_okay()'s benefit.
>          */

So the only documentation relating to the current active_mm value or 
refcounting is that it may not match what the x86-specific code is 
doing?

All this complexity you accuse me of adding is entirely in x86 code.
On other architectures, it's very simple and understandable, and 
documented. I don't know how else to explain this.

>>>>
>>>>> With your patch applied:
>>>>>
>>>>>  To support all that, the "struct mm_struct" now has two counters: a
>>>>>  "mm_users" counter that is how many "real address space users" there are,
>>>>>  and a "mm_count" counter that is the number of "lazy" users (ie anonymous
>>>>>  users) plus one if there are any real users.
>>>>>
>>>>> isn't even true any more.
>>>>
>>>> Well yeah but the active_mm concept hasn't changed. The refcounting 
>>>> change is hopefully reasonably documented?
> 
> active_mm is *only* refcounting in the core code.  See below.

It's just not. It's passed in to switch_mm, and most architectures other 
than x86 require it there.

>>>>>
>>>>> I looked through all active_mm references in core code.  We have:
>>>>>
>>>>> kernel/sched/core.c: it's all refcounting, although it's a bit tangled
>>>>> with membarrier.
>>>>>
>>>>> kernel/kthread.c: same.  refcounting and membarrier stuff.
>>>>>
>>>>> kernel/exit.c: exit_mm() a BUG_ON().
>>>>>
>>>>> kernel/fork.c: initialization code and a warning.
>>>>>
>>>>> kernel/cpu.c: cpu offline stuff.  wouldn't be needed if active_mm went away.
>>>>>
>>>>> fs/exec.c: nothing of interest
>>>>
>>>> I might not have been clear. Core code doesn't need active_mm if 
>>>> active_mm somehow goes away. I'm saying active_mm can't go away because
>>>> it's needed to support (most) archs that do lazy tlb mm switching.
>>>>
>>>> The part I don't understand is when you say it can just go away. How? 
> 
> #ifdef CONFIG_MMU_TLB_REFCOUNT
> 	struct mm_struct *active_mm;
> #endif

Thanks for returning the snark.

>>>>
>>>>> I didn't go through drivers, but I maintain my point.  active_mm is
>>>>> there for refcounting.  So please don't just make it even more confusing
>>>>> -- do your performance improvement, but improve the code at the same
>>>>> time: get rid of active_mm, at least on architectures that opt out of
>>>>> the refcounting.
>>>>
>>>> powerpc opts out of the refcounting and can not "get rid of active_mm".
>>>> Not even in theory.
>>>
>>> That is to say, it does do a type of reference management that requires 
>>> active_mm so you can argue it has not entirely opted out of refcounting.
>>> But we're not just doing refcounting for the sake of refcounting! That
>>> would make no sense.
>>>
>>> active_mm is required because that's the mm that we have switched to 
>>> (from core code's perspective), and it is integral to know when to 
>>> switch to a different mm. See how active_mm is a fundamental concept
>>> in core code? It's part of the contract between core code and the
>>> arch mm context management calls. reference counting follows from there
>>> but it's not the _reason_ for this code.
> 
> I don't understand what contract you're talking about.  The core code
> maintains an active_mm counter and keeps that mm_struct from
> disappearing.  That's *it*.  The core code does not care that active_mm
> is active, and x86 provides evidence of that -- on x86,
> current->active_mm may well be completely unused.

I already acknowledged archs can do their own thing under the covers if 
they want.

> 
>>>
>>> Pretend the reference problem does not exit (whether by refcounting or 
>>> shootdown or garbage collection or whatever). We still can't remove 
>>> active_mm! We need it to know how to call into arch functions like 
>>> switch_mm.
> 
> static inline void do_switch_mm(struct task_struct *prev_task, ...)
> {
> #ifdef CONFIG_MMU_TLB_REFCOUNT
> 	switch_mm(...);
> #else
> 	switch_mm(fewer parameters);
> 	/* or pass NULL or whatever. */
> #endif
> }

And the prev mm passed to switch_mm there comes from active_mm, ergo core 
code requires the concept of active_mm.

I already said I'm quite happy for wrappers to be added so those can 
compile out on an arch like x86. That doesn't do anything for the 
complexity of core code, though, because it still needs to maintain the 
active_mm wrappers.
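
To illustrate what I mean by wrappers (hypothetical names, a sketch only, 
not proposed code), core code would stop touching tsk->active_mm directly 
and instead go through something like:

/* hypothetical wrappers, for illustration only */
static inline void task_set_lazy_mm(struct task_struct *tsk,
				    struct mm_struct *mm)
{
#ifdef CONFIG_MMU_LAZY_TLB
	tsk->active_mm = mm;
#endif
}

static inline struct mm_struct *task_lazy_mm(struct task_struct *tsk)
{
#ifdef CONFIG_MMU_LAZY_TLB
	return tsk->active_mm;
#else
	return &init_mm;	/* or NULL; the arch owns the lazy mm */
#endif
}

An arch that wants to do its own thing could then compile the field out 
without sprinkling ifdefs through the scheduler.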

>>> I don't know if you just forgot that critical requirement in your above 
>>> list, or you actually are entirely using x86's mental model for this 
>>> code which is doing something entirely different that does not need it 
>>> at all. If that is the case I really don't mind some cleanup or wrapper 
>>> functions for x86 do entirely do its own thing, but if that's the case
>>> you can't criticize core code's use of active_mm due to the current
>>> state of x86. It's x86 that needs documentation and cleaning up.
>> 
>> Ah, that must be where your confusion is coming from: x86's switch_mm 
>> doesn't use prev anywhere, and the reference scheme it is using appears 
>> to be under-documented, although vague references in changelogs suggest 
>> it has not actually "opted out" of active_mm refcounting.
> 
> All of this is true, except I don't believe I'm confused.

Well, someone is. My patches don't change any of the fundamental 
complexity of core code's maintenance of active_mm, nor do they alleviate
the requirement to keep active_mm around under the newly introduced config,
and yet you are saying:

  I am in favor of this approach, but I would be a lot more comfortable
  with the resulting code if task->active_mm were at least better
  documented and possibly even guarded by ifdefs.

It doesn't make sense. It can't be guarded by ifdefs even if we took
away the shootdown IPI's use of it as part of reference / lifetime
management.

And I understand that you don't like the state of the code, but I'm trying 
to wrangle specifics out of you on that, and it's difficult.

>> 
>> That's understandable, but please redirect your objections to the proper 
>> place. git blame suggests 3d28ebceaffab.
> 
> Thanks for the snark.

Is it not true? I don't mean that one patch caused all the x86 
complexity, or even made the situation worse by itself. But you seem to be 
asking my series to do things that really apply to the x86 changes over
the past few years that got us here.

> 
> Here's the situation.
> 
> Before that patch, x86 more or less fully used the core scheme.  With
> that patch, x86 has the property that loaded_mm (which is used) either
> points to init_mm (which is permanently live) or to active_mm (which the
> core code keeps alive, which is the whole point).  x86 still uses the
> core active_mm refcounting scheme.  The result is a bit overcomplicated,
> but it works, and it enabled massive improvements to the x86 arch code.
> 
> You are proposing a whole new simplification in which an arch can opt in
> to telling the core code that it doesn't need to keep active_mm alive.
> Great!  But now it's possible to get quite confused -- either an
> elaborate dance is needed or current->active_mm could point to freed
> memory.  This is poor design.

powerpc still needs active_mm; there's zero point in configuring it away 
and maintaining exactly the same thing in a per-cpu variable in the arch
code, because the code has to be there in the core for all other archs
anyway, so that would be stupid.

I would be more than happy to try to help review a patch that helps the
x86 situation, but my patch is not intended to do what x86 code wants.

> I'm entirely in favor of allowing arches to opt out of active_mm
> refcounting.  As you've noticed, it's quite expensive on large systems.
> x86 would opt out, too, if given the opportunity [1].  But I think you
> should do it right.  If an arch opts out of active_mm refcounting, then
> keeping a lazy mm (if any!) alive becomes entirely the arch code's
> responsibility.
>
> Once that happens, task->active_mm is not just a waste
> of 4-8 bytes of memory per task, it's actively harmful -- some code,
> somewhere in the kernel, might dereference it and access freed memory!

That's not an accurate description of my patches. powerpc still requires 
active_mm exactly the same, and the "elaborate dance" amounts to having
CPUs stop using the mm as a lazy tlb when the last user goes away.
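
Roughly, the idea looks like this (an illustrative sketch only, not the 
exact code in the series): when the mm is finally being torn down, any CPU 
that might still be using it as its lazy tlb mm is IPI'd and made to 
switch to init_mm, so no lazy reference can outlive the mm:

static void do_shoot_lazy_tlb(void *arg)	/* sketch, for illustration */
{
	struct mm_struct *mm = arg;

	if (current->active_mm == mm) {
		WARN_ON_ONCE(current->mm);
		current->active_mm = &init_mm;
		switch_mm_irqs_off(mm, &init_mm, current);
	}
}

	/* on the final teardown path, with interrupts enabled: */
	on_each_cpu_mask(mm_cpumask(mm), do_shoot_lazy_tlb, mm, 1);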

> 
> So please do your patches right.  By all means add a new config option,
> but make that config option make active_mm go away entirely.  Then it
> can't be misused.

I already have a patch that's halfway there, which I pulled out of the 
series for the last submission. As I said, feel free to build on it.

> 
> NAK to the current set.

I propose instead we do take the series, and write some patches on top 
of that which help x86. Here's 1/n

2/n is to make wrappers for active_mm maintenance in core code.

3/n is to adjust context_switch_mm_nolazy() to just use init_mm (or NULL)
directly and put ifdefs around active_mm.

Is that what x86 wants?

Thanks,
Nick

commit 99f90520821b9717d4447872354c2933894a1100
Author: Nicholas Piggin <npiggin@gmail.com>
Date:   Thu Jul 9 15:01:25 2020 +1000

    lazy tlb: allow lazy tlb mm switching to be configurable
    
    Add CONFIG_MMU_LAZY_TLB which can be configured out to disable the lazy
    tlb mechanism entirely, and switches to init_mm when switching to a
    kernel thread.
    
    NOMMU systems could easily go without this and save a bit of code and
    the refcount atomics, because their mm switch is a no-op. They have not
    been switched over by default because the arch code needs to be audited
    and tested for lazy tlb mm refcounting and converted to _lazy_tlb
    refcounting if necessary.
    
    CONFIG_MMU_LAZY_TLB_REFCOUNT is also added, but it must always be
    enabled if CONFIG_MMU_LAZY_TLB is enabled until the next patch which
    provides an alternate scheme.
    
    Signed-off-by: Nicholas Piggin <npiggin@gmail.com>

diff --git a/arch/Kconfig b/arch/Kconfig
index cf468c9777d8..3b80042bd0c2 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -418,6 +418,26 @@ config ARCH_WANT_IRQS_OFF_ACTIVATE_MM
 	  irqs disabled over activate_mm. Architectures that do IPI based TLB
 	  shootdowns should enable this.
 
+# Enable "lazy TLB", which means a user->kernel thread context switch does not
+# switch the mm to init_mm and the kernel thread takes a reference to the user
+# mm to provide its kernel mapping. This is how Linux has traditionally worked
+# (see Documentation/vm/active_mm.rst), for performance. Switching to and from
+# idle thread is a performance-critical case.
+#
+# If mm context switches are inexpensive or free (in the case of NOMMU) then
+# this could be disabled.
+#
+# It would make sense to have this depend on MMU, but need to audit and test
+# the NOMMU architectures for lazy mm refcounting first.
+config MMU_LAZY_TLB
+	def_bool y
+	depends on !NO_MMU_LAZY_TLB
+
+# This allows archs to disable MMU_LAZY_TLB. mmgrab/mmdrop in arch/ code has
+# to be audited and switched to _lazy_tlb postfix as necessary.
+config NO_MMU_LAZY_TLB
+	def_bool n
+
 # Use normal mm refcounting for MMU_LAZY_TLB kernel thread references.
 # MMU_LAZY_TLB_REFCOUNT=n can improve the scalability of context switching
 # to/from kernel threads when the same mm is running on a lot of CPUs (a large
@@ -431,6 +451,7 @@ config ARCH_WANT_IRQS_OFF_ACTIVATE_MM
 # to a kthread ->active_mm (non-arch code has been converted already).
 config MMU_LAZY_TLB_REFCOUNT
 	def_bool y
+	depends on MMU_LAZY_TLB
 	depends on !MMU_LAZY_TLB_SHOOTDOWN
 
 # This option allows MMU_LAZY_TLB_REFCOUNT=n. It ensures no CPUs are using an
@@ -445,6 +466,7 @@ config MMU_LAZY_TLB_REFCOUNT
 # MMU_LAZY_TLB_REFCOUNT=n (see above).
 config MMU_LAZY_TLB_SHOOTDOWN
 	bool
+	depends on MMU_LAZY_TLB
 
 config ARCH_HAVE_NMI_SAFE_CMPXCHG
 	bool
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5e10cb712be3..5cb039c686a6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4285,22 +4285,10 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev)
 	calculate_sigpending();
 }
 
-/*
- * context_switch - switch to the new MM and the new thread's register state.
- */
-static __always_inline struct rq *
-context_switch(struct rq *rq, struct task_struct *prev,
-	       struct task_struct *next, struct rq_flags *rf)
+static __always_inline void
+context_switch_mm(struct rq *rq, struct task_struct *prev,
+	       struct task_struct *next)
 {
-	prepare_task_switch(rq, prev, next);
-
-	/*
-	 * For paravirt, this is coupled with an exit in switch_to to
-	 * combine the page table reload and the switch backend into
-	 * one hypercall.
-	 */
-	arch_start_context_switch(prev);
-
 	/*
 	 * kernel -> kernel   lazy + transfer active
 	 *   user -> kernel   lazy + mmgrab_lazy_tlb() active
@@ -4345,6 +4333,40 @@ context_switch(struct rq *rq, struct task_struct *prev,
 #endif
 		}
 	}
+}
+
+static __always_inline void
+context_switch_mm_nolazy(struct rq *rq, struct task_struct *prev,
+	       struct task_struct *next)
+{
+	if (!next->mm)
+		next->active_mm = &init_mm;
+	membarrier_switch_mm(rq, prev->active_mm, next->active_mm);
+	switch_mm_irqs_off(prev->active_mm, next->active_mm, next);
+	if (!prev->mm)
+		prev->active_mm = NULL;
+}
+
+/*
+ * context_switch - switch to the new MM and the new thread's register state.
+ */
+static __always_inline struct rq *
+context_switch(struct rq *rq, struct task_struct *prev,
+	       struct task_struct *next, struct rq_flags *rf)
+{
+	prepare_task_switch(rq, prev, next);
+
+	/*
+	 * For paravirt, this is coupled with an exit in switch_to to
+	 * combine the page table reload and the switch backend into
+	 * one hypercall.
+	 */
+	arch_start_context_switch(prev);
+
+	if (IS_ENABLED(CONFIG_MMU_LAZY_TLB))
+		context_switch_mm(rq, prev, next);
+	else
+		context_switch_mm_nolazy(rq, prev, next);
 
 	rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);
 

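(For reference, the mmgrab_lazy_tlb()/mmdrop_lazy_tlb() calls referred to 
above are presumably thin wrappers along these lines -- a sketch, not 
quoted from the series:

static inline void mmgrab_lazy_tlb(struct mm_struct *mm)
{
	if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT))
		mmgrab(mm);
}

static inline void mmdrop_lazy_tlb(struct mm_struct *mm)
{
	if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT))
		mmdrop(mm);
}

i.e. they take a normal mm_count reference when lazy tlb refcounting is 
enabled and compile to nothing when an arch opts out.)
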
^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH v4 2/4] lazy tlb: allow lazy tlb mm refcounting to be configurable
@ 2021-06-15  0:55                   ` Nicholas Piggin
  0 siblings, 0 replies; 48+ messages in thread
From: Nicholas Piggin @ 2021-06-15  0:55 UTC (permalink / raw)
  To: Andrew Morton, Andy Lutomirski
  Cc: linux-arch, Rik van Riel, Linus Torvalds, Randy Dunlap,
	linux-kernel, linux-mm, linuxppc-dev

Excerpts from Andy Lutomirski's message of June 15, 2021 2:20 am:
> Replying to several emails at once...
> 
> 
> On 6/13/21 10:21 PM, Nicholas Piggin wrote:
>> Excerpts from Nicholas Piggin's message of June 14, 2021 2:47 pm:
>>> Excerpts from Nicholas Piggin's message of June 14, 2021 2:14 pm:
>>>> Excerpts from Andy Lutomirski's message of June 14, 2021 1:52 pm:
>>>>> On 6/13/21 5:45 PM, Nicholas Piggin wrote:
>>>>>> Excerpts from Andy Lutomirski's message of June 9, 2021 2:20 am:
>>>>>>> On 6/4/21 6:42 PM, Nicholas Piggin wrote:
>>>>>>>> Add CONFIG_MMU_TLB_REFCOUNT which enables refcounting of the lazy tlb mm
>>>>>>>> when it is context switched. This can be disabled by architectures that
>>>>>>>> don't require this refcounting if they clean up lazy tlb mms when the
>>>>>>>> last refcount is dropped. Currently this is always enabled, which is
>>>>>>>> what existing code does, so the patch is effectively a no-op.
>>>>>>>>
>>>>>>>> Rename rq->prev_mm to rq->prev_lazy_mm, because that's what it is.
>>>>>>>
>>>>>>> I am in favor of this approach, but I would be a lot more comfortable
>>>>>>> with the resulting code if task->active_mm were at least better
>>>>>>> documented and possibly even guarded by ifdefs.
>>>>>>
>>>>>> active_mm is fairly well documented in Documentation/active_mm.rst IMO.
>>>>>> I don't think anything has changed in 20 years, I don't know what more
>>>>>> is needed, but if you can add to documentation that would be nice. Maybe
>>>>>> moving a bit of that into .c and .h files?
>>>>>>
>>>>>
>>>>> Quoting from that file:
>>>>>
>>>>>   - however, we obviously need to keep track of which address space we
>>>>>     "stole" for such an anonymous user. For that, we have "tsk->active_mm",
>>>>>     which shows what the currently active address space is.
>>>>>
>>>>> This isn't even true right now on x86.
>>>>
>>>> From the perspective of core code, it is. x86 might do something crazy 
>>>> with it, but it has to make it appear this way to non-arch code that
>>>> uses active_mm.
>>>>
>>>> Is x86's scheme documented?
> 
> arch/x86/include/asm/tlbflush.h documents it a bit:
> 
>         /*
>          * cpu_tlbstate.loaded_mm should match CR3 whenever interrupts
>          * are on.  This means that it may not match current->active_mm,
>          * which will contain the previous user mm when we're in lazy TLB
>          * mode even if we've already switched back to swapper_pg_dir.
>          *
>          * During switch_mm_irqs_off(), loaded_mm will be set to
>          * LOADED_MM_SWITCHING during the brief interrupts-off window
>          * when CR3 and loaded_mm would otherwise be inconsistent.  This
>          * is for nmi_uaccess_okay()'s benefit.
>          */

So the only documentation relating to the current active_mm value or 
refcounting is that it may not match what the x86 specific code is 
doing?

All this complexity you accuse me of adding is entirely in x86 code.
On other architectures, it's very simple and understandable, and 
documented. I don't know how else to explain this.

>>>>
>>>>> With your patch applied:
>>>>>
>>>>>  To support all that, the "struct mm_struct" now has two counters: a
>>>>>  "mm_users" counter that is how many "real address space users" there are,
>>>>>  and a "mm_count" counter that is the number of "lazy" users (ie anonymous
>>>>>  users) plus one if there are any real users.
>>>>>
>>>>> isn't even true any more.
>>>>
>>>> Well yeah but the active_mm concept hasn't changed. The refcounting 
>>>> change is hopefully reasonably documented?
> 
> active_mm is *only* refcounting in the core code.  See below.

It's just not. It's passed in to switch_mm. Most architectures except 
for x86 require this.

>>>>>
>>>>> I looked through all active_mm references in core code.  We have:
>>>>>
>>>>> kernel/sched/core.c: it's all refcounting, although it's a bit tangled
>>>>> with membarrier.
>>>>>
>>>>> kernel/kthread.c: same.  refcounting and membarrier stuff.
>>>>>
>>>>> kernel/exit.c: exit_mm() a BUG_ON().
>>>>>
>>>>> kernel/fork.c: initialization code and a warning.
>>>>>
>>>>> kernel/cpu.c: cpu offline stuff.  wouldn't be needed if active_mm went away.
>>>>>
>>>>> fs/exec.c: nothing of interest
>>>>
>>>> I might not have been clear. Core code doesn't need active_mm if 
>>>> active_mm somehow goes away. I'm saying active_mm can't go away because
>>>> it's needed to support (most) archs that do lazy tlb mm switching.
>>>>
>>>> The part I don't understand is when you say it can just go away. How? 
> 
> #ifdef CONFIG_MMU_TLB_REFCOUNT
> 	struct mm_struct *active_mm;
> #endif

Thanks for returning the snark.

>>>>
>>>>> I didn't go through drivers, but I maintain my point.  active_mm is
>>>>> there for refcounting.  So please don't just make it even more confusing
>>>>> -- do your performance improvement, but improve the code at the same
>>>>> time: get rid of active_mm, at least on architectures that opt out of
>>>>> the refcounting.
>>>>
>>>> powerpc opts out of the refcounting and can not "get rid of active_mm".
>>>> Not even in theory.
>>>
>>> That is to say, it does do a type of reference management that requires 
>>> active_mm so you can argue it has not entirely opted out of refcounting.
>>> But we're not just doing refcounting for the sake of refcounting! That
>>> would make no sense.
>>>
>>> active_mm is required because that's the mm that we have switched to 
>>> (from core code's perspective), and it is integral to know when to 
>>> switch to a different mm. See how active_mm is a fundamental concept
>>> in core code? It's part of the contract between core code and the
>>> arch mm context management calls. reference counting follows from there
>>> but it's not the _reason_ for this code.
> 
> I don't understand what contract you're talking about.  The core code
> maintains an active_mm counter and keeps that mm_struct from
> disappearing.  That's *it*.  The core code does not care that active_mm
> is active, and x86 provides evidence of that -- on x86,
> current->active_mm may well be completely unused.

I already acknowledged archs can do their own thing under the covers if 
they want.

> 
>>>
>>> Pretend the reference problem does not exit (whether by refcounting or 
>>> shootdown or garbage collection or whatever). We still can't remove 
>>> active_mm! We need it to know how to call into arch functions like 
>>> switch_mm.
> 
> static inline void do_switch_mm(struct task_struct *prev_task, ...)
> {
> #ifdef CONFIG_MMU_TLB_REFCOUNT
> 	switch_mm(...);
> #else
> 	switch_mm(fewer parameters);
> 	/* or pass NULL or whatever. */
> #endif
> }

And prev_task comes from active_mm, ergo core code requires the concept 
of active_mm.

I already said I'm quite happy for wrappers to be added so those can 
compile out of an arch like x86. That's not doing anything for 
complexity of core code, because it still needs to maintain the active 
mm wrappers.

>>> I don't know if you just forgot that critical requirement in your above 
>>> list, or you actually are entirely using x86's mental model for this 
>>> code which is doing something entirely different that does not need it 
>>> at all. If that is the case I really don't mind some cleanup or wrapper 
>>> functions for x86 do entirely do its own thing, but if that's the case
>>> you can't criticize core code's use of active_mm due to the current
>>> state of x86. It's x86 that needs documentation and cleaning up.
>> 
>> Ah, that must be where your confusion is coming from: x86's switch_mm 
>> doesn't use prev anywhere, and the reference scheme it is using appears 
>> to be under-documented, although vague references in changelogs suggest 
>> it has not actually "opted out" of active_mm refcounting.
> 
> All of this is true, except I don't believe I'm confused.

Well someone is. My patches don't change any of the fundamental 
complexity of core code's maintaining of active_mm, nor alleviate the
requirement to keep active_mm around under the new config introduced,
and yet you are saying:

  I am in favor of this approach, but I would be a lot more comfortable
  with the resulting code if task->active_mm were at least better
  documented and possibly even guarded by ifdefs.

It doesn't make sense. It can't be guarded by ifdefs even if we took
away the shootdown IPI's use of it as part of reference / lifetime
management.

And you say you don't like the state of the code, but I'm trying to wrangle 
specifics out of you on that, and it's difficult.

>> 
>> That's understandable, but please redirect your objections to the proper 
>> place. git blame suggests 3d28ebceaffab.
> 
> Thanks for the snark.

Is it not true? I don't mean that one patch causing all the x86 
complexity or even making the situation worse itself. But you seem to be 
asking my series to do things that really apply to the x86 changes over
the past few years that got us here.

> 
> Here's the situation.
> 
> Before that patch, x86 more or less fully used the core scheme.  With
> that patch, x86 has the property that loaded_mm (which is used) either
> points to init_mm (which is permanently live) or to active_mm (which the
> core code keeps alive, which is the whole point).  x86 still uses the
> core active_mm refcounting scheme.  The result is a bit overcomplicated,
> but it works, and it enabled massive improvements to the x86 arch code.
> 
> You are proposing a whole new simplification in which an arch can opt in
> to telling the core code that it doesn't need to keep active_mm alive.
> Great!  But now it's possible to get quite confused -- either an
> elaborate dance is needed or current->active_mm could point to freed
> memory.  This is poor design.

powerpc still needs active_mm; there's zero point in configuring it away 
and maintaining exactly the same thing in a per-cpu variable in the arch
code, because the code has to be there in the core for all other archs
anyway, so that would be stupid.

I would be more than happy to try help review a patch that helps the
x86 situation, but my patch is not intended to do what x86 code wants.

> I'm entirely in favor of allowing arches to opt out of active_mm
> refcounting.  As you've noticed, it's quite expensive on large systems.
> x86 would opt out, too, if given the opportunity [1].  But I think you
> should do it right.  If an arch opts out of active_mm refcounting, then
> keeping a lazy mm (if any!) alive becomes entirely the arch code's
> responsibility.
>
> Once that happens, task->active_mm is not just a waste
> of 4-8 bytes of memory per task, it's actively harmful -- some code,
> somewhere in the kernel, might dereference it and access freed memory!

That's not an accurate description of my patches. powerpc still requires 
active_mm exactly the same, and the "elaborate dance" amounts to having
CPUs stop using the mm as a lazy tlb when the last user goes away.
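
To be concrete, the "dance" is roughly the following at the point the
final mm_users reference goes away (a sketch only, not the literal
kernel/fork.c hunks, and the function names here are illustrative):

static void do_shoot_lazy_tlb(void *arg)
{
	struct mm_struct *mm = arg;

	/* runs in IPI context with interrupts disabled */
	if (current->active_mm == mm) {
		WARN_ON_ONCE(current->mm);
		current->active_mm = &init_mm;
		switch_mm_irqs_off(mm, &init_mm, current);
	}
}

static void shoot_lazy_tlbs(struct mm_struct *mm)
{
	/*
	 * IPI every CPU that may still be using mm as its lazy tlb mm so
	 * it switches to init_mm; after this no lazy references remain
	 * and the mm can be freed without per-switch refcounting.
	 */
	on_each_cpu_mask(mm_cpumask(mm), do_shoot_lazy_tlb, mm, 1);
}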

> 
> So please do your patches right.  By all means add a new config option,
> but make that config option make active_mm go away entirely.  Then it
> can't be misused.

I already have a patch that's halfway there, which I pulled out of the series
for the last submission. As I said, feel free to build on it.

> 
> NAK to the current set.

I propose instead we do take the series, and write some patches on top 
of that which help x86. Here's 1/n:

2/n is to make wrappers for active_mm maintenance in core code.

3/n is to adjust context_switch_mm_nolazy() to just use init_mm (or NULL) directly
and put ifdefs around active_mm.

Is that what x86 wants?

Thanks,
Nick

commit 99f90520821b9717d4447872354c2933894a1100
Author: Nicholas Piggin <npiggin@gmail.com>
Date:   Thu Jul 9 15:01:25 2020 +1000

    lazy tlb: allow lazy tlb mm switching to be configurable
    
    Add CONFIG_MMU_LAZY_TLB which can be configured out to disable the lazy
    tlb mechanism entirely, and switches to init_mm when switching to a
    kernel thread.
    
    NOMMU systems could easily go without this and save a bit of code and
    the refcount atomics, because their mm switch is a no-op. They have not
    been switched over by default because the arch code needs to be audited
    and tested for lazy tlb mm refcounting and converted to _lazy_tlb
    refcounting if necessary.
    
    CONFIG_MMU_LAZY_TLB_REFCOUNT is also added, but it must always be
    enabled if CONFIG_MMU_LAZY_TLB is enabled until the next patch which
    provides an alternate scheme.
    
    Signed-off-by: Nicholas Piggin <npiggin@gmail.com>

diff --git a/arch/Kconfig b/arch/Kconfig
index cf468c9777d8..3b80042bd0c2 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -418,6 +418,26 @@ config ARCH_WANT_IRQS_OFF_ACTIVATE_MM
 	  irqs disabled over activate_mm. Architectures that do IPI based TLB
 	  shootdowns should enable this.
 
+# Enable "lazy TLB", which means a user->kernel thread context switch does not
+# switch the mm to init_mm and the kernel thread takes a reference to the user
+# mm to provide its kernel mapping. This is how Linux has traditionally worked
+# (see Documentation/vm/active_mm.rst), for performance. Switching to and from
+# idle thread is a performance-critical case.
+#
+# If mm context switches are inexpensive or free (in the case of NOMMU) then
+# this could be disabled.
+#
+# It would make sense to have this depend on MMU, but need to audit and test
+# the NOMMU architectures for lazy mm refcounting first.
+config MMU_LAZY_TLB
+	def_bool y
+	depends on !NO_MMU_LAZY_TLB
+
+# This allows archs to disable MMU_LAZY_TLB. mmgrab/mmdrop in arch/ code has
+# to be audited and switched to _lazy_tlb postfix as necessary.
+config NO_MMU_LAZY_TLB
+	def_bool n
+
 # Use normal mm refcounting for MMU_LAZY_TLB kernel thread references.
 # MMU_LAZY_TLB_REFCOUNT=n can improve the scalability of context switching
 # to/from kernel threads when the same mm is running on a lot of CPUs (a large
@@ -431,6 +451,7 @@ config ARCH_WANT_IRQS_OFF_ACTIVATE_MM
 # to a kthread ->active_mm (non-arch code has been converted already).
 config MMU_LAZY_TLB_REFCOUNT
 	def_bool y
+	depends on MMU_LAZY_TLB
 	depends on !MMU_LAZY_TLB_SHOOTDOWN
 
 # This option allows MMU_LAZY_TLB_REFCOUNT=n. It ensures no CPUs are using an
@@ -445,6 +466,7 @@ config MMU_LAZY_TLB_REFCOUNT
 # MMU_LAZY_TLB_REFCOUNT=n (see above).
 config MMU_LAZY_TLB_SHOOTDOWN
 	bool
+	depends on MMU_LAZY_TLB
 
 config ARCH_HAVE_NMI_SAFE_CMPXCHG
 	bool
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5e10cb712be3..5cb039c686a6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4285,22 +4285,10 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev)
 	calculate_sigpending();
 }
 
-/*
- * context_switch - switch to the new MM and the new thread's register state.
- */
-static __always_inline struct rq *
-context_switch(struct rq *rq, struct task_struct *prev,
-	       struct task_struct *next, struct rq_flags *rf)
+static __always_inline void
+context_switch_mm(struct rq *rq, struct task_struct *prev,
+	       struct task_struct *next)
 {
-	prepare_task_switch(rq, prev, next);
-
-	/*
-	 * For paravirt, this is coupled with an exit in switch_to to
-	 * combine the page table reload and the switch backend into
-	 * one hypercall.
-	 */
-	arch_start_context_switch(prev);
-
 	/*
 	 * kernel -> kernel   lazy + transfer active
 	 *   user -> kernel   lazy + mmgrab_lazy_tlb() active
@@ -4345,6 +4333,40 @@ context_switch(struct rq *rq, struct task_struct *prev,
 #endif
 		}
 	}
+}
+
+static __always_inline void
+context_switch_mm_nolazy(struct rq *rq, struct task_struct *prev,
+	       struct task_struct *next)
+{
+	if (!next->mm)
+		next->active_mm = &init_mm;
+	membarrier_switch_mm(rq, prev->active_mm, next->active_mm);
+	switch_mm_irqs_off(prev->active_mm, next->active_mm, next);
+	if (!prev->mm)
+		prev->active_mm = NULL;
+}
+
+/*
+ * context_switch - switch to the new MM and the new thread's register state.
+ */
+static __always_inline struct rq *
+context_switch(struct rq *rq, struct task_struct *prev,
+	       struct task_struct *next, struct rq_flags *rf)
+{
+	prepare_task_switch(rq, prev, next);
+
+	/*
+	 * For paravirt, this is coupled with an exit in switch_to to
+	 * combine the page table reload and the switch backend into
+	 * one hypercall.
+	 */
+	arch_start_context_switch(prev);
+
+	if (IS_ENABLED(CONFIG_MMU_LAZY_TLB))
+		context_switch_mm(rq, prev, next);
+	else
+		context_switch_mm_nolazy(rq, prev, next);
 
 	rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);
 

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH v4 2/4] lazy tlb: allow lazy tlb mm refcounting to be configurable
  2021-06-15  0:55                   ` Nicholas Piggin
@ 2021-06-16  0:14                     ` Andy Lutomirski
  -1 siblings, 0 replies; 48+ messages in thread
From: Andy Lutomirski @ 2021-06-16  0:14 UTC (permalink / raw)
  To: Nicholas Piggin, Andrew Morton
  Cc: Anton Blanchard, linux-arch, linux-kernel, linux-mm,
	linuxppc-dev, Randy Dunlap, Rik van Riel, Linus Torvalds

On 6/14/21 5:55 PM, Nicholas Piggin wrote:
> Excerpts from Andy Lutomirski's message of June 15, 2021 2:20 am:
>> Replying to several emails at once...
>>

> 
> So the only documentation relating to the current active_mm value or 
> refcounting is that it may not match what the x86 specific code is 
> doing?
> 
> All this complexity you accuse me of adding is entirely in x86 code.
> On other architectures, it's very simple and understandable, and 
> documented. I don't know how else to explain this.

And the docs you referred me to will be *wrong* with your patches
applied.  They are your patches, and they break the semantics.

> 
>>>>>
>>>>>> With your patch applied:
>>>>>>
>>>>>>  To support all that, the "struct mm_struct" now has two counters: a
>>>>>>  "mm_users" counter that is how many "real address space users" there are,
>>>>>>  and a "mm_count" counter that is the number of "lazy" users (ie anonymous
>>>>>>  users) plus one if there are any real users.
>>>>>>
>>>>>> isn't even true any more.
>>>>>
>>>>> Well yeah but the active_mm concept hasn't changed. The refcounting 
>>>>> change is hopefully reasonably documented?
>>
>> active_mm is *only* refcounting in the core code.  See below.
> 
> It's just not. It's passed in to switch_mm. Most architectures except 
> for x86 require this.
> 

Sorry, I was obviously blatantly wrong.  Let me say it differently.
active_mm does two things:

1. It keeps an mm alive via a refcounting scheme.

2. It passes a parameter to switch_mm() to remind the arch code what the
most recently switch-to mm was.

#2 is basically useless.  An architecture can handle *that* with a
percpu variable and two lines of code.
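
For example (purely illustrative, the names are made up):

static DEFINE_PER_CPU(struct mm_struct *, arch_last_mm);

static inline void arch_note_switched_mm(struct mm_struct *next)
{
	/* call this from the arch's switch_mm() path */
	this_cpu_write(arch_last_mm, next);
}

static inline struct mm_struct *arch_prev_mm(void)
{
	/* use this instead of trusting current->active_mm */
	return this_cpu_read(arch_last_mm);
}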

If you are getting rid of functionality #1 in the core code via a new
arch opt-out, please get rid of #2 as well.  *Especially* because, when
the arch asks the core code to stop refcounting active_mm, there is
absolutely nothing guaranteeing that the parameter that the core code
will pass to switch_mm() points to memory that hasn't been freed and
reused for another purpose.

>>>>> I might not have been clear. Core code doesn't need active_mm if 
>>>>> active_mm somehow goes away. I'm saying active_mm can't go away because
>>>>> it's needed to support (most) archs that do lazy tlb mm switching.
>>>>>
>>>>> The part I don't understand is when you say it can just go away. How? 
>>
>> #ifdef CONFIG_MMU_TLB_REFCOUNT
>> 	struct mm_struct *active_mm;
>> #endif
> 
> Thanks for returning the snark.

That wasn't intended to be snark.  It was a literal suggestion, and, in
fact, it's *exactly* what I'm asking you to do to fix your patches.

>> I don't understand what contract you're talking about.  The core code
>> maintains an active_mm counter and keeps that mm_struct from
>> disappearing.  That's *it*.  The core code does not care that active_mm
>> is active, and x86 provides evidence of that -- on x86,
>> current->active_mm may well be completely unused.
> 
> I already acknowledged archs can do their own thing under the covers if 
> they want.

No.

I am *not* going to write x86 patches that use your feature in a way
that will cause the core code to pass around a complete garbage pointer
to an mm_struct that is completely unreferenced and may well be deleted.
 Doing something private in arch code is one thing.  Doing something
that causes basic common sense to be violated in core code is another
thing entirely.

>>
>> static inline void do_switch_mm(struct task_struct *prev_task, ...)
>> {
>> #ifdef CONFIG_MMU_TLB_REFCOUNT
>> 	switch_mm(...);
>> #else
>> 	switch_mm(fewer parameters);
>> 	/* or pass NULL or whatever. */
>> #endif
>> }
> 
> And prev_task comes from active_mm, ergo core code requires the concept 
> of active_mm.

I don't see why this concept is hard.  We are literally quibbling about
this single line of core code in kernel/sched/core.c:

switch_mm_irqs_off(prev->active_mm, next->mm, next);

This is not rocket science.  There are any number of ways to fix it.
For example:

#ifdef CONFIG_MMU_TLB_REFCOUNT
	switch_mm_irqs_off(prev->active_mm, next->mm, next);
#else
	switch_mm_irqs_off(NULL, next->mm, next);
#endif

If you don't like the NULL, then make the signature depend on the config
option.
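
Something like this, prototypes only, just to sketch what I mean:

#ifdef CONFIG_MMU_TLB_REFCOUNT
void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
			struct task_struct *tsk);
#else
void switch_mm_irqs_off(struct mm_struct *next, struct task_struct *tsk);
#endif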

What you may not do is what your patch actually does:

switch_mm_irqs_off(random invalid pointer, next->mm, next);

Now maybe it works because powerpc's lifecycle rules happen to keep
active_mm alive, but I haven't verified it.  x86's lifecycle rules *do not*.

>>>
>>> That's understandable, but please redirect your objections to the proper 
>>> place. git blame suggests 3d28ebceaffab.
>>
>> Thanks for the snark.
> 
> Is it not true? I don't mean that one patch causing all the x86 
> complexity or even making the situation worse itself. But you seem to be 
> asking my series to do things that really apply to the x86 changes over
> the past few years that got us here.

With just my patch from 4.15 applied, task->active_mm points to an
actual mm_struct, and that mm_struct will not be freed early.  If I opt
x86 into your patch's new behavior, then task->active_mm may be freed.

akpm, please drop this series until it's fixed.  It's a core change to
better support arch usecases, but it's unnecessarily fragile, and there
is already an arch maintainer pointing out that it's inadequate to
robustly support arch usecases.  There is no reason to merge it in its
present state.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v4 2/4] lazy tlb: allow lazy tlb mm refcounting to be configurable
  2021-06-16  0:14                     ` Andy Lutomirski
@ 2021-06-16  1:02                       ` Nicholas Piggin
  -1 siblings, 0 replies; 48+ messages in thread
From: Nicholas Piggin @ 2021-06-16  1:02 UTC (permalink / raw)
  To: Andrew Morton, Andy Lutomirski
  Cc: Anton Blanchard, linux-arch, linux-kernel, linux-mm,
	linuxppc-dev, Randy Dunlap, Rik van Riel, Linus Torvalds

Excerpts from Andy Lutomirski's message of June 16, 2021 10:14 am:
> On 6/14/21 5:55 PM, Nicholas Piggin wrote:
>> Excerpts from Andy Lutomirski's message of June 15, 2021 2:20 am:
>>> Replying to several emails at once...
>>>
> 
>> 
>> So the only documentation relating to the current active_mm value or 
>> refcounting is that it may not match what the x86 specific code is 
>> doing?
>> 
>> All this complexity you accuse me of adding is entirely in x86 code.
>> On other architectures, it's very simple and understandable, and 
>> documented. I don't know how else to explain this.
> 
> And the docs you referred me to will be *wrong* with your patches
> applied.  They are your patches, and they break the semantics.

No, they aren't wrong. I've documented the shootdown refcounting scheme, 
and Linus' historical email is still right about the active_mm concept,
and the refcounting details still match the CONFIG_MMU_TLB_REFCOUNT=y
case.

If you have some particular place you would like to see more 
documentation added, please tell me where and I'll add something before
the series gets upstreamed.

>>>>>>
>>>>>>> With your patch applied:
>>>>>>>
>>>>>>>  To support all that, the "struct mm_struct" now has two counters: a
>>>>>>>  "mm_users" counter that is how many "real address space users" there are,
>>>>>>>  and a "mm_count" counter that is the number of "lazy" users (ie anonymous
>>>>>>>  users) plus one if there are any real users.
>>>>>>>
>>>>>>> isn't even true any more.
>>>>>>
>>>>>> Well yeah but the active_mm concept hasn't changed. The refcounting 
>>>>>> change is hopefully reasonably documented?
>>>
>>> active_mm is *only* refcounting in the core code.  See below.
>> 
>> It's just not. It's passed in to switch_mm. Most architectures except 
>> for x86 require this.
>> 
> 
> Sorry, I was obviously blatantly wrong.  Let me say it differently.
> active_mm does two things:
> 
> 1. It keeps an mm alive via a refcounting scheme.
> 
> 2. It passes a parameter to switch_mm() to remind the arch code what the
> most recently switch-to mm was.
> 
> #2 is basically useless.  An architecture can handle *that* with a
> percpu variable and two lines of code.
> 
> If you are getting rid of functionality #1 in the core code via a new
> arch opt-out, please get rid of #2 as well.  *Especially* because, when
> the arch asks the core code to stop refcounting active_mm, there is
> absolutely nothing guaranteeing that the parameter that the core code
> will pass to switch_mm() points to memory that hasn't been freed and
> reused for another purpose.

You're confused about the patch. It's not opting out of lazy tlb mm or 
opting out of active_mm; it is changing how the refcounting is done.
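
Concretely, the refcount helpers boil down to roughly this (a sketch,
not the literal hunks from the series):

static inline void mmgrab_lazy_tlb(struct mm_struct *mm)
{
	if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT))
		mmgrab(mm);
}

static inline void mmdrop_lazy_tlb(struct mm_struct *mm)
{
	if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT))
		mmdrop(mm);
}

With MMU_LAZY_TLB_SHOOTDOWN the lazy references are simply not counted;
the mm is instead made unreferenced by shooting down the lazy users
before it is freed.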

Just before, you were trying to tell me it would make the code simpler to 
remove it, but it clearly wouldn't if powerpc just had to reimplement it 
in arch code anyway, would it?

powerpc is not going to add code to reimplement the exact same thing as 
active_mm, because it uses active_mm and so does everyone else. For the
last time I'm not going to change that.

That is something x86 wants. This series is not for x86. It doesn't 
change anything that x86 does. I have kindly tried to give you 
suggestions and patches about what you might do with x86, and it can 
easily be changed, but that is not the purpose of my patch. Please don't 
reply again telling me to get rid of active_mm. We'll draw a line under
this and move on.

>>> I don't understand what contract you're talking about.  The core code
>>> maintains an active_mm counter and keeps that mm_struct from
>>> disappearing.  That's *it*.  The core code does not care that active_mm
>>> is active, and x86 provides evidence of that -- on x86,
>>> current->active_mm may well be completely unused.
>> 
>> I already acknowledged archs can do their own thing under the covers if 
>> they want.
> 
> No.
> 
> I am *not* going to write x86 patches that use your feature in a way
> that will cause the core code to pass around a complete garbage pointer
> to an mm_struct that is completely unreferenced and may well be deleted.
>  Doing something private in arch code is one thing.  Doing something
> that causes basic common sense to be violated in core code is another
> thing entirely.

I'm talking about the relationship between core code's idea of active_mm 
and the arch's idea as it stands now. I'm not talking about what you might
do with x86.

The whole point of this was that you have been operating under the 
misconception that active_mm is not a core kernel concept, because x86 
has gone and done its own thing and doesn't use it. That does not mean 
it is not a core kernel concept!

> 
>>>
>>> static inline void do_switch_mm(struct task_struct *prev_task, ...)
>>> {
>>> #ifdef CONFIG_MMU_TLB_REFCOUNT
>>> 	switch_mm(...);
>>> #else
>>> 	switch_mm(fewer parameters);
>>> 	/* or pass NULL or whatever. */
>>> #endif
>>> }
>> 
>> And prev_task comes from active_mm, ergo core code requires the concept 
>> of active_mm.
> 
> I don't see why this concept is hard.  We are literally quibbling about
> this single line of core code in kernel/sched/core.c:
> 
> switch_mm_irqs_off(prev->active_mm, next->mm, next);
> 
> This is not rocket science.  There are any number of ways to fix it.
> For example:
> 
> #ifdef CONFIG_MMU_TLB_REFCOUNT
> 	switch_mm_irqs_off(prev->active_mm, next->mm, next);
> #else
> 	switch_mm_irqs_off(NULL, next->mm, next);
> #endif
> 
> If you don't like the NULL, then make the signature depend on the config
> option.
> 
> What you may not do is what your patch actually does:
> 
> switch_mm_irqs_off(random invalid pointer, next->mm, next);

You're totally confused about the code, and my patch series. That's not 
what it does at all. Unless you have spotted a bug, in which case point 
it out.

> 
> Now maybe it works because powerpc's lifecycle rules happen to keep
> active_mm alive, but I haven't verified it.  x86's lifecycle rules *do not*.
> 
>>>>
>>>> That's understandable, but please redirect your objections to the proper 
>>>> place. git blame suggests 3d28ebceaffab.
>>>
>>> Thanks for the snark.
>> 
>> Is it not true? I don't mean that one patch causing all the x86 
>> complexity or even making the situation worse itself. But you seem to be 
>> asking my series to do things that really apply to the x86 changes over
>> the past few years that got us here.
> 
> With just my patch from 4.15 applied, task->active_mm points to an
> actual mm_struct, and that mm_struct will not be freed early.  If I opt
> x86 into your patch's new behavior, then task->active_mm may be freed.
> 
> akpm, please drop this series until it's fixed.  It's a core change to
> better support arch usecases, but it's unnecessarily fragile, and there
> is already an arch maintainer pointing out that it's inadequate to
> robustly support arch usecases.  There is no reason to merge it in its
> present state.
> 

No, you've been sniping these for nearly a year now, and I've been pretty 
accommodating in not pushing it upstream, and trying to get anything out of 
you, a patch or an actual description of the problem, but it's like 
getting blood from a stone. I've tried to ask what help x86 needs, but 
nothing. You didn't even reply to my (actual working) patch or questions 
about it in the last comment! Doing that and then throwing out NAK like it's 
a magic word is pretty rude really.

The series is fine: it's documented, it works well, it solves a problem,
the core code changes are small and reusable, and nobody has pointed out
a bug.

So at this point, just get over it. Do some patches for x86 on top of it 
later when you have some time. Or Rik might. None of the patches in my 
series prevents that. I'll be happy to review your changes or adjust 
some of the code added in this series so it's usable by x86 if you need.
The x86 work could have been done already with the time spent bickering about 
this.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v4 2/4] lazy tlb: allow lazy tlb mm refcounting to be configurable
  2021-06-16  1:02                       ` Nicholas Piggin
@ 2021-06-17  0:32                         ` Nicholas Piggin
  -1 siblings, 0 replies; 48+ messages in thread
From: Nicholas Piggin @ 2021-06-17  0:32 UTC (permalink / raw)
  To: Andrew Morton, Andy Lutomirski
  Cc: Anton Blanchard, linux-arch, linux-kernel, linux-mm,
	linuxppc-dev, Randy Dunlap, Rik van Riel, Linus Torvalds

Excerpts from Nicholas Piggin's message of June 16, 2021 11:02 am:
> Excerpts from Andy Lutomirski's message of June 16, 2021 10:14 am:
>> akpm, please drop this series until it's fixed.  It's a core change to
>> better support arch usecases, but it's unnecessarily fragile, and there
>> is already an arch maintainer pointing out that it's inadequate to
>> robustly support arch usecases.  There is no reason to merge it in its
>> present state.

Just to make sure I'm not doing anything stupid or fragile for other 
archs, I had a closer look at a few. sparc32 is the only one I have an 
SMP-capable qemu and initramfs at hand for; it took about 5 minutes to 
convert after fixing 2 other sparc32/mm bugs (patches on linux-sparc),
one of them found by the DEBUG_VM code my series added. It seems to work 
fine, with what little stressing my qemu setup can muster.

Simple. Robust. Pretty mechanical conversion follows the documented 
recipe. Re-uses every single line of code I added outside 
arch/powerpc/. Requires no elaborate dances.

alpha and arm64 are both 4-liners by the looks; sparc64 might require a 
bit of actual code but doesn't look too hard.

So I'm satisfied the code added outside arch/powerpc/ is not some 
fragile powerpc specific hack. I don't know if other archs will use 
it, but they easily can use it[*].

And we can make changes to help x86 whenever it's needed -- I already 
posted patch 1/n for configuring out lazy tlb and active_mm from core 
code rebased on top of mmotm so the series is not preventing such 
changes.

Hopefully this allays some concerns.

[*] I do think mmgrab_lazy_tlb is a nice change that self-documents the 
    active_mm refcounting, so I will try to get all the arch code 
    converted to use it over the next few releases, even if they never
    switch to use lazy tlb shootdown.

Thanks,
Nick

---
 arch/sparc/Kconfig            | 1 +
 arch/sparc/kernel/leon_smp.c  | 2 +-
 arch/sparc/kernel/smp_64.c    | 2 +-
 arch/sparc/kernel/sun4d_smp.c | 2 +-
 arch/sparc/kernel/sun4m_smp.c | 2 +-
 arch/sparc/kernel/traps_32.c  | 2 +-
 arch/sparc/kernel/traps_64.c  | 2 +-
 7 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index 164a5254c91c..db9954af57a2 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -58,6 +58,7 @@ config SPARC32
 	select GENERIC_ATOMIC64
 	select CLZ_TAB
 	select HAVE_UID16
+	select MMU_LAZY_TLB_SHOOTDOWN
 	select OLD_SIGACTION
 
 config SPARC64
diff --git a/arch/sparc/kernel/leon_smp.c b/arch/sparc/kernel/leon_smp.c
index 1eed26d423fb..d00460788048 100644
--- a/arch/sparc/kernel/leon_smp.c
+++ b/arch/sparc/kernel/leon_smp.c
@@ -92,7 +92,7 @@ void leon_cpu_pre_online(void *arg)
 			     : "memory" /* paranoid */);
 
 	/* Attach to the address space of init_task. */
-	mmgrab(&init_mm);
+	mmgrab_lazy_tlb(&init_mm);
 	current->active_mm = &init_mm;
 
 	while (!cpumask_test_cpu(cpuid, &smp_commenced_mask))
diff --git a/arch/sparc/kernel/smp_64.c b/arch/sparc/kernel/smp_64.c
index e38d8bf454e8..19aa12991f2b 100644
--- a/arch/sparc/kernel/smp_64.c
+++ b/arch/sparc/kernel/smp_64.c
@@ -127,7 +127,7 @@ void smp_callin(void)
 	current_thread_info()->new_child = 0;
 
 	/* Attach to the address space of init_task. */
-	mmgrab(&init_mm);
+	mmgrab_lazy_tlb(&init_mm);
 	current->active_mm = &init_mm;
 
 	/* inform the notifiers about the new cpu */
diff --git a/arch/sparc/kernel/sun4d_smp.c b/arch/sparc/kernel/sun4d_smp.c
index ff30f03beb7c..a6f392dcfeaf 100644
--- a/arch/sparc/kernel/sun4d_smp.c
+++ b/arch/sparc/kernel/sun4d_smp.c
@@ -94,7 +94,7 @@ void sun4d_cpu_pre_online(void *arg)
 	show_leds(cpuid);
 
 	/* Attach to the address space of init_task. */
-	mmgrab(&init_mm);
+	mmgrab_lazy_tlb(&init_mm);
 	current->active_mm = &init_mm;
 
 	local_ops->cache_all();
diff --git a/arch/sparc/kernel/sun4m_smp.c b/arch/sparc/kernel/sun4m_smp.c
index 228a6527082d..0ee77f066c9e 100644
--- a/arch/sparc/kernel/sun4m_smp.c
+++ b/arch/sparc/kernel/sun4m_smp.c
@@ -60,7 +60,7 @@ void sun4m_cpu_pre_online(void *arg)
 			     : "memory" /* paranoid */);
 
 	/* Attach to the address space of init_task. */
-	mmgrab(&init_mm);
+	mmgrab_lazy_tlb(&init_mm);
 	current->active_mm = &init_mm;
 
 	while (!cpumask_test_cpu(cpuid, &smp_commenced_mask))
diff --git a/arch/sparc/kernel/traps_32.c b/arch/sparc/kernel/traps_32.c
index 247a0d9683b2..a3186bb30109 100644
--- a/arch/sparc/kernel/traps_32.c
+++ b/arch/sparc/kernel/traps_32.c
@@ -387,7 +387,7 @@ void trap_init(void)
 		thread_info_offsets_are_bolixed_pete();
 
 	/* Attach to the address space of init_task. */
-	mmgrab(&init_mm);
+	mmgrab_lazy_tlb(&init_mm);
 	current->active_mm = &init_mm;
 
 	/* NOTE: Other cpus have this done as they are started
diff --git a/arch/sparc/kernel/traps_64.c b/arch/sparc/kernel/traps_64.c
index a850dccd78ea..b6e46732fa69 100644
--- a/arch/sparc/kernel/traps_64.c
+++ b/arch/sparc/kernel/traps_64.c
@@ -2929,6 +2929,6 @@ void __init trap_init(void)
 	/* Attach to the address space of init_task.  On SMP we
 	 * do this in smp.c:smp_callin for other cpus.
 	 */
-	mmgrab(&init_mm);
+	mmgrab_lazy_tlb(&init_mm);
 	current->active_mm = &init_mm;
 }
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 48+ messages in thread
