linux-kernel.vger.kernel.org archive mirror
* [PATCH v2 0/4] more mm switching vs TLB shootdown and lazy tlb fixes
@ 2020-09-14  4:52 Nicholas Piggin
  2020-09-14  4:52 ` [PATCH v2 1/4] mm: fix exec activate_mm vs TLB shootdown and lazy tlb switching race Nicholas Piggin
                   ` (4 more replies)
  0 siblings, 5 replies; 16+ messages in thread
From: Nicholas Piggin @ 2020-09-14  4:52 UTC (permalink / raw)
  To: linux-mm @ kvack . org
  Cc: Nicholas Piggin, linux-arch, linux-kernel, linuxppc-dev,
	sparclinux, Aneesh Kumar K . V, Andrew Morton, Jens Axboe,
	Peter Zijlstra, David S . Miller

This is an attempt to fix a few different related issues around
switching mm, TLB flushing, and lazy tlb mm handling.

This will require all architectures to eventually move to disabling
irqs over activate_mm, but it's possible we could add another arch
call after irqs are re-enabled for those few which can't do their
entire activation with irqs disabled.
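
As a rough illustration of that fallback (entirely hypothetical, not part
of this series), exec_mmap() could re-enable irqs and then call an extra
arch hook; arch_post_activate_mm() below is an invented name:

    local_irq_disable();
    /* ... update ->mm / ->active_mm and call activate_mm() ... */
    local_irq_enable();
    /*
     * Hypothetical hook for the few architectures that cannot run all
     * of their activation with irqs disabled.
     */
    arch_post_activate_mm(active_mm, mm);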

Testing so far indicates this has fixed a mm refcounting bug that
powerpc was running into (via distro report and backport). I haven't
had any real feedback on this series outside powerpc (and it doesn't
really affect other archs), so I propose patches 1,2,4 go via the
powerpc tree.

There is no dependency between those patches and patch 3; I included
patch 3 in the series only because it follows the history of the code
(the powerpc code was written using the sparc64 logic), but I guess it
has to go via a different arch tree. Dave, I'll leave patch 3 with you.

Thanks,
Nick

Since v1:
- Updates from Michael Ellerman's review comments.

Nicholas Piggin (4):
  mm: fix exec activate_mm vs TLB shootdown and lazy tlb switching race
  powerpc: select ARCH_WANT_IRQS_OFF_ACTIVATE_MM
  sparc64: remove mm_cpumask clearing to fix kthread_use_mm race
  powerpc/64s/radix: Fix mm_cpumask trimming race vs kthread_use_mm

 arch/Kconfig                           |  7 +++
 arch/powerpc/Kconfig                   |  1 +
 arch/powerpc/include/asm/mmu_context.h |  2 +-
 arch/powerpc/include/asm/tlb.h         | 13 ------
 arch/powerpc/mm/book3s64/radix_tlb.c   | 23 ++++++---
 arch/sparc/kernel/smp_64.c             | 65 ++++++--------------------
 fs/exec.c                              | 17 ++++++-
 7 files changed, 54 insertions(+), 74 deletions(-)

-- 
2.23.0



* [PATCH v2 1/4] mm: fix exec activate_mm vs TLB shootdown and lazy tlb switching race
  2020-09-14  4:52 [PATCH v2 0/4] more mm switching vs TLB shootdown and lazy tlb fixes Nicholas Piggin
@ 2020-09-14  4:52 ` Nicholas Piggin
  2020-09-14 10:56   ` peterz
  2020-09-14  4:52 ` [PATCH v2 2/4] powerpc: select ARCH_WANT_IRQS_OFF_ACTIVATE_MM Nicholas Piggin
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 16+ messages in thread
From: Nicholas Piggin @ 2020-09-14  4:52 UTC (permalink / raw)
  To: linux-mm @ kvack . org
  Cc: Nicholas Piggin, linux-arch, linux-kernel, linuxppc-dev,
	sparclinux, Aneesh Kumar K . V, Andrew Morton, Jens Axboe,
	Peter Zijlstra, David S . Miller

Reading and modifying current->mm and current->active_mm and switching
mm should be done with irqs off, to prevent races seeing an intermediate
state.

This is similar to commit 38cf307c1f20 ("mm: fix kthread_use_mm() vs TLB
invalidate"). At exec-time when the new mm is activated, the old one
should usually be single-threaded and no longer used, unless something
else is holding an mm_users reference (which may be possible).

Absent other mm_users, there is also a race with preemption and lazy tlb
switching. Consider the kernel_execve case where the current thread is
using a lazy tlb active mm:

  call_usermodehelper()
    kernel_execve()
      old_mm = current->mm;
      active_mm = current->active_mm;
      *** preempt *** -------------------->  schedule()
                                               prev->active_mm = NULL;
                                               mmdrop(prev active_mm);
                                             ...
                      <--------------------  schedule()
      current->mm = mm;
      current->active_mm = mm;
      if (!old_mm)
          mmdrop(active_mm);

If we switch back to the kernel thread from a different mm, there is a
double free of the old active_mm, and a missing free of the new one.
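
For reference, the scheduler side of that trace is the lazy tlb handling
in context_switch(); a simplified sketch (paraphrased from
kernel/sched/core.c, details elided) of how the lazy tlb reference is
taken and dropped:

    if (!next->mm) {                        /* switching to a kernel thread */
        next->active_mm = prev->active_mm;
        if (prev->mm)                       /* from a user thread */
            mmgrab(prev->active_mm);        /* take a lazy tlb reference */
        else
            prev->active_mm = NULL;
    } else {                                /* switching to a user thread */
        switch_mm_irqs_off(prev->active_mm, next->mm, next);
        if (!prev->mm) {                    /* from a (lazy tlb) kernel thread */
            /* reference dropped with mmdrop() in finish_task_switch() */
            rq->prev_mm = prev->active_mm;
            prev->active_mm = NULL;
        }
    }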

Closing this race only requires interrupts to be disabled while ->mm
and ->active_mm are being switched, but the TLB problem requires also
holding interrupts off over activate_mm. Unfortunately not all archs
can do that yet, e.g., arm defers the switch if irqs are disabled and
expects finish_arch_post_lock_switch() to be called to complete the
flush; um takes a blocking lock in activate_mm().

So as a first step, disable interrupts across the mm/active_mm updates
to close the lazy tlb preempt race, and provide an arch option to
extend that to activate_mm which allows architectures doing IPI based
TLB shootdowns to close the second race.

This is a bit ugly, but in the interest of fixing the bug and backporting
before all architectures are converted this is a compromise.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/Kconfig |  7 +++++++
 fs/exec.c    | 17 +++++++++++++++--
 2 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index af14a567b493..94821e3f94d1 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -414,6 +414,13 @@ config MMU_GATHER_NO_GATHER
 	bool
 	depends on MMU_GATHER_TABLE_FREE
 
+config ARCH_WANT_IRQS_OFF_ACTIVATE_MM
+	bool
+	help
+	  Temporary select until all architectures can be converted to have
+	  irqs disabled over activate_mm. Architectures that do IPI based TLB
+	  shootdowns should enable this.
+
 config ARCH_HAVE_NMI_SAFE_CMPXCHG
 	bool
 
diff --git a/fs/exec.c b/fs/exec.c
index a91003e28eaa..d4fb18baf1fb 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1130,11 +1130,24 @@ static int exec_mmap(struct mm_struct *mm)
 	}
 
 	task_lock(tsk);
-	active_mm = tsk->active_mm;
 	membarrier_exec_mmap(mm);
-	tsk->mm = mm;
+
+	local_irq_disable();
+	active_mm = tsk->active_mm;
 	tsk->active_mm = mm;
+	tsk->mm = mm;
+	/*
+	 * This prevents preemption while active_mm is being loaded and
+	 * it and mm are being updated, which could cause problems for
+	 * lazy tlb mm refcounting when these are updated by context
+	 * switches. Not all architectures can handle irqs off over
+	 * activate_mm yet.
+	 */
+	if (!IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
+		local_irq_enable();
 	activate_mm(active_mm, mm);
+	if (IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
+		local_irq_enable();
 	tsk->mm->vmacache_seqnum = 0;
 	vmacache_flush(tsk);
 	task_unlock(tsk);
-- 
2.23.0



* [PATCH v2 2/4] powerpc: select ARCH_WANT_IRQS_OFF_ACTIVATE_MM
  2020-09-14  4:52 [PATCH v2 0/4] more mm switching vs TLB shootdown and lazy tlb fixes Nicholas Piggin
  2020-09-14  4:52 ` [PATCH v2 1/4] mm: fix exec activate_mm vs TLB shootdown and lazy tlb switching race Nicholas Piggin
@ 2020-09-14  4:52 ` Nicholas Piggin
  2020-09-14  4:52 ` [PATCH v2 3/4] sparc64: remove mm_cpumask clearing to fix kthread_use_mm race Nicholas Piggin
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 16+ messages in thread
From: Nicholas Piggin @ 2020-09-14  4:52 UTC (permalink / raw)
  To: linux-mm @ kvack . org
  Cc: Nicholas Piggin, linux-arch, linux-kernel, linuxppc-dev,
	sparclinux, Aneesh Kumar K . V, Andrew Morton, Jens Axboe,
	Peter Zijlstra, David S . Miller

powerpc uses IPIs in some situations to switch a kernel thread away
from a lazy tlb mm, which is subject to the TLB flushing race
described in the changelog introducing ARCH_WANT_IRQS_OFF_ACTIVATE_MM.
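
The activate_mm() change below uses switch_mm_irqs_off() because, with
ARCH_WANT_IRQS_OFF_ACTIVATE_MM selected, exec_mmap() now calls
activate_mm() with interrupts already disabled. For context, powerpc's
switch_mm() is roughly an irq-save wrapper around switch_mm_irqs_off()
(paraphrased, not part of this patch):

    static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
                                 struct task_struct *tsk)
    {
        unsigned long flags;

        local_irq_save(flags);
        switch_mm_irqs_off(prev, next, tsk);
        local_irq_restore(flags);
    }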

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/Kconfig                   | 1 +
 arch/powerpc/include/asm/mmu_context.h | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 65bed1fdeaad..587ba8352d01 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -149,6 +149,7 @@ config PPC
 	select ARCH_USE_QUEUED_RWLOCKS		if PPC_QUEUED_SPINLOCKS
 	select ARCH_USE_QUEUED_SPINLOCKS	if PPC_QUEUED_SPINLOCKS
 	select ARCH_WANT_IPC_PARSE_VERSION
+	select ARCH_WANT_IRQS_OFF_ACTIVATE_MM
 	select ARCH_WEAK_RELEASE_ACQUIRE
 	select BINFMT_ELF
 	select BUILDTIME_TABLE_SORT
diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
index a3a12a8341b2..b42813359f49 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -244,7 +244,7 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 #define activate_mm activate_mm
 static inline void activate_mm(struct mm_struct *prev, struct mm_struct *next)
 {
-	switch_mm(prev, next, current);
+	switch_mm_irqs_off(prev, next, current);
 }
 
 /* We don't currently use enter_lazy_tlb() for anything */
-- 
2.23.0



* [PATCH v2 3/4] sparc64: remove mm_cpumask clearing to fix kthread_use_mm race
  2020-09-14  4:52 [PATCH v2 0/4] more mm switching vs TLB shootdown and lazy tlb fixes Nicholas Piggin
  2020-09-14  4:52 ` [PATCH v2 1/4] mm: fix exec activate_mm vs TLB shootdown and lazy tlb switching race Nicholas Piggin
  2020-09-14  4:52 ` [PATCH v2 2/4] powerpc: select ARCH_WANT_IRQS_OFF_ACTIVATE_MM Nicholas Piggin
@ 2020-09-14  4:52 ` Nicholas Piggin
  2020-09-14  7:00   ` Nicholas Piggin
  2020-09-14 19:59   ` David Miller
  2020-09-14  4:52 ` [PATCH v2 4/4] powerpc/64s/radix: Fix mm_cpumask trimming race vs kthread_use_mm Nicholas Piggin
  2020-09-24 12:28 ` [PATCH v2 0/4] more mm switching vs TLB shootdown and lazy tlb fixes Michael Ellerman
  4 siblings, 2 replies; 16+ messages in thread
From: Nicholas Piggin @ 2020-09-14  4:52 UTC (permalink / raw)
  To: linux-mm @ kvack . org
  Cc: Nicholas Piggin, linux-arch, linux-kernel, linuxppc-dev,
	sparclinux, Aneesh Kumar K . V, Andrew Morton, Jens Axboe,
	Peter Zijlstra, David S . Miller

The de facto (and apparently uncommented) standard for using an mm had,
thanks to this code in sparc if nothing else, been that you must have a
reference on mm_users *and that reference must have been obtained with
mmget()*, i.e., from a thread with a reference to mm_users that had used
the mm.

The introduction of mmget_not_zero() in commit d2005e3f41d4
("userfaultfd: don't pin the user memory in userfaultfd_file_create()")
allowed mm_count holders to operate on user mappings asynchronously
from the actual threads using the mm, but they were not to load those
mappings into their TLB (i.e., walking vmas and page tables is okay,
kthread_use_mm() is not).

io_uring 2b188cc1bb857 ("Add io_uring IO interface") added code which
does a kthread_use_mm() from a mmget_not_zero() refcount.

The problem with this is that code which previously assumed that
mm == current->mm and mm->mm_users == 1 imply the mm will remain
single-threaded (at least until this thread creates another mm_users
reference) is now broken.

arch/sparc/kernel/smp_64.c:

    if (atomic_read(&mm->mm_users) == 1) {
        cpumask_copy(mm_cpumask(mm), cpumask_of(cpu));
        goto local_flush_and_out;
    }

vs fs/io_uring.c

    if (unlikely(!(ctx->flags & IORING_SETUP_SQPOLL) ||
                 !mmget_not_zero(ctx->sqo_mm)))
        return -EFAULT;
    kthread_use_mm(ctx->sqo_mm);

mmget_not_zero() could come in right after the mm_users == 1 test,
followed by kthread_use_mm(), which sets the kthread's CPU in
mm_cpumask(mm). That update could be lost if cpumask_copy() occurs
afterward.
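
One possible interleaving (a sketch of the race, not a captured trace):

    CPU A (smp_flush_tlb_mm)               CPU B (io_uring SQ thread)
    ------------------------               --------------------------
    atomic_read(&mm->mm_users) == 1
                                           mmget_not_zero(ctx->sqo_mm)
                                           kthread_use_mm(ctx->sqo_mm)
                                             sets CPU B in mm_cpumask(mm)
    cpumask_copy(mm_cpumask(mm),
                 cpumask_of(cpu A))
      CPU B's bit is lost, so later
      flushes never target CPU B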

I propose we fix this by allowing mmget_not_zero() to be a first-class
reference, and not have this obscure undocumented and unchecked
restriction.
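
The pattern being allowed as a first-class mm reference is, roughly, the
one io_uring uses (a sketch only; the release side with
kthread_unuse_mm()/mmput() is implied rather than quoted above):

    if (!mmget_not_zero(mm))    /* mm_users may already have hit zero */
        return -EFAULT;
    kthread_use_mm(mm);         /* sets current->mm and this CPU in mm_cpumask */
    /* ... access user memory with copy_{from,to}_user() ... */
    kthread_unuse_mm(mm);
    mmput(mm);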

The basic fix for sparc64 is to remove its mm_cpumask clearing code. The
optimisation could be effectively restored by sending IPIs to mm_cpumask
members and having them remove themselves from mm_cpumask. This is more
tricky so I leave it as an exercise for someone with a sparc64 SMP.
powerpc has a (currently similarly broken) example.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/sparc/kernel/smp_64.c | 65 ++++++++------------------------------
 1 file changed, 14 insertions(+), 51 deletions(-)

diff --git a/arch/sparc/kernel/smp_64.c b/arch/sparc/kernel/smp_64.c
index e286e2badc8a..e38d8bf454e8 100644
--- a/arch/sparc/kernel/smp_64.c
+++ b/arch/sparc/kernel/smp_64.c
@@ -1039,38 +1039,9 @@ void smp_fetch_global_pmu(void)
  * are flush_tlb_*() routines, and these run after flush_cache_*()
  * which performs the flushw.
  *
- * The SMP TLB coherency scheme we use works as follows:
- *
- * 1) mm->cpu_vm_mask is a bit mask of which cpus an address
- *    space has (potentially) executed on, this is the heuristic
- *    we use to avoid doing cross calls.
- *
- *    Also, for flushing from kswapd and also for clones, we
- *    use cpu_vm_mask as the list of cpus to make run the TLB.
- *
- * 2) TLB context numbers are shared globally across all processors
- *    in the system, this allows us to play several games to avoid
- *    cross calls.
- *
- *    One invariant is that when a cpu switches to a process, and
- *    that processes tsk->active_mm->cpu_vm_mask does not have the
- *    current cpu's bit set, that tlb context is flushed locally.
- *
- *    If the address space is non-shared (ie. mm->count == 1) we avoid
- *    cross calls when we want to flush the currently running process's
- *    tlb state.  This is done by clearing all cpu bits except the current
- *    processor's in current->mm->cpu_vm_mask and performing the
- *    flush locally only.  This will force any subsequent cpus which run
- *    this task to flush the context from the local tlb if the process
- *    migrates to another cpu (again).
- *
- * 3) For shared address spaces (threads) and swapping we bite the
- *    bullet for most cases and perform the cross call (but only to
- *    the cpus listed in cpu_vm_mask).
- *
- *    The performance gain from "optimizing" away the cross call for threads is
- *    questionable (in theory the big win for threads is the massive sharing of
- *    address space state across processors).
+ * mm->cpu_vm_mask is a bit mask of which cpus an address
+ * space has (potentially) executed on, this is the heuristic
+ * we use to limit cross calls.
  */
 
 /* This currently is only used by the hugetlb arch pre-fault
@@ -1080,18 +1051,13 @@ void smp_fetch_global_pmu(void)
 void smp_flush_tlb_mm(struct mm_struct *mm)
 {
 	u32 ctx = CTX_HWBITS(mm->context);
-	int cpu = get_cpu();
 
-	if (atomic_read(&mm->mm_users) == 1) {
-		cpumask_copy(mm_cpumask(mm), cpumask_of(cpu));
-		goto local_flush_and_out;
-	}
+	get_cpu();
 
 	smp_cross_call_masked(&xcall_flush_tlb_mm,
 			      ctx, 0, 0,
 			      mm_cpumask(mm));
 
-local_flush_and_out:
 	__flush_tlb_mm(ctx, SECONDARY_CONTEXT);
 
 	put_cpu();
@@ -1114,17 +1080,15 @@ void smp_flush_tlb_pending(struct mm_struct *mm, unsigned long nr, unsigned long
 {
 	u32 ctx = CTX_HWBITS(mm->context);
 	struct tlb_pending_info info;
-	int cpu = get_cpu();
+
+	get_cpu();
 
 	info.ctx = ctx;
 	info.nr = nr;
 	info.vaddrs = vaddrs;
 
-	if (mm == current->mm && atomic_read(&mm->mm_users) == 1)
-		cpumask_copy(mm_cpumask(mm), cpumask_of(cpu));
-	else
-		smp_call_function_many(mm_cpumask(mm), tlb_pending_func,
-				       &info, 1);
+	smp_call_function_many(mm_cpumask(mm), tlb_pending_func,
+			       &info, 1);
 
 	__flush_tlb_pending(ctx, nr, vaddrs);
 
@@ -1134,14 +1098,13 @@ void smp_flush_tlb_pending(struct mm_struct *mm, unsigned long nr, unsigned long
 void smp_flush_tlb_page(struct mm_struct *mm, unsigned long vaddr)
 {
 	unsigned long context = CTX_HWBITS(mm->context);
-	int cpu = get_cpu();
 
-	if (mm == current->mm && atomic_read(&mm->mm_users) == 1)
-		cpumask_copy(mm_cpumask(mm), cpumask_of(cpu));
-	else
-		smp_cross_call_masked(&xcall_flush_tlb_page,
-				      context, vaddr, 0,
-				      mm_cpumask(mm));
+	get_cpu();
+
+	smp_cross_call_masked(&xcall_flush_tlb_page,
+			      context, vaddr, 0,
+			      mm_cpumask(mm));
+
 	__flush_tlb_page(context, vaddr);
 
 	put_cpu();
-- 
2.23.0



* [PATCH v2 4/4] powerpc/64s/radix: Fix mm_cpumask trimming race vs kthread_use_mm
  2020-09-14  4:52 [PATCH v2 0/4] more mm switching vs TLB shootdown and lazy tlb fixes Nicholas Piggin
                   ` (2 preceding siblings ...)
  2020-09-14  4:52 ` [PATCH v2 3/4] sparc64: remove mm_cpumask clearing to fix kthread_use_mm race Nicholas Piggin
@ 2020-09-14  4:52 ` Nicholas Piggin
  2020-09-24 12:28 ` [PATCH v2 0/4] more mm switching vs TLB shootdown and lazy tlb fixes Michael Ellerman
  4 siblings, 0 replies; 16+ messages in thread
From: Nicholas Piggin @ 2020-09-14  4:52 UTC (permalink / raw)
  To: linux-mm @ kvack . org
  Cc: Nicholas Piggin, linux-arch, linux-kernel, linuxppc-dev,
	sparclinux, Aneesh Kumar K . V, Andrew Morton, Jens Axboe,
	Peter Zijlstra, David S . Miller, Michael Ellerman

Commit 0cef77c7798a7 ("powerpc/64s/radix: flush remote CPUs out of
single-threaded mm_cpumask") added a mechanism to trim the mm_cpumask of
a process under certain conditions. One of the assumptions is that
mm_users would not be incremented via a reference outside the process
context with mmget_not_zero() then go on to kthread_use_mm() via that
reference.

That invariant was broken by io_uring code (see previous sparc64 fix),
but I'll point Fixes: to the original powerpc commit because we are
changing that assumption going forward, so this will make backports
match up.

Fix this by no longer relying on that assumption, but by having each CPU
check that the mm is not being used, and clear its own bit from the mask
only if the mm hasn't been switched to by the time the IPI is processed.

This relies on commit 38cf307c1f20 ("mm: fix kthread_use_mm() vs TLB
invalidate") and ARCH_WANT_IRQS_OFF_ACTIVATE_MM to disable irqs over mm
switch sequences.
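
For reference, the irqs-off window this relies on in kthread_use_mm()
looks roughly like the following (a simplified sketch of kernel/kthread.c
after 38cf307c1f20, not something changed by this patch):

    task_lock(tsk);
    /* hold off TLB flush IPIs while switching mm's */
    local_irq_disable();
    active_mm = tsk->active_mm;
    if (active_mm != mm) {
        mmgrab(mm);
        tsk->active_mm = mm;
    }
    tsk->mm = mm;                           /* seen by do_exit_flush_lazy_tlb() */
    switch_mm_irqs_off(active_mm, mm, tsk); /* sets this CPU in mm_cpumask(mm) */
    local_irq_enable();
    task_unlock(tsk);

So the IPI handler below either runs entirely before the switch (and can
safely clear this CPU's bit) or entirely after it (and sees
current->mm == mm).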

Reviewed-by: Michael Ellerman <mpe@ellerman.id.au>
Depends-on: 38cf307c1f20 ("mm: fix kthread_use_mm() vs TLB invalidate")
Fixes: 0cef77c7798a7 ("powerpc/64s/radix: flush remote CPUs out of single-threaded mm_cpumask")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/include/asm/tlb.h       | 13 -------------
 arch/powerpc/mm/book3s64/radix_tlb.c | 23 ++++++++++++++++-------
 2 files changed, 16 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
index fbc6f3002f23..d97f061fecac 100644
--- a/arch/powerpc/include/asm/tlb.h
+++ b/arch/powerpc/include/asm/tlb.h
@@ -66,19 +66,6 @@ static inline int mm_is_thread_local(struct mm_struct *mm)
 		return false;
 	return cpumask_test_cpu(smp_processor_id(), mm_cpumask(mm));
 }
-static inline void mm_reset_thread_local(struct mm_struct *mm)
-{
-	WARN_ON(atomic_read(&mm->context.copros) > 0);
-	/*
-	 * It's possible for mm_access to take a reference on mm_users to
-	 * access the remote mm from another thread, but it's not allowed
-	 * to set mm_cpumask, so mm_users may be > 1 here.
-	 */
-	WARN_ON(current->mm != mm);
-	atomic_set(&mm->context.active_cpus, 1);
-	cpumask_clear(mm_cpumask(mm));
-	cpumask_set_cpu(smp_processor_id(), mm_cpumask(mm));
-}
 #else /* CONFIG_PPC_BOOK3S_64 */
 static inline int mm_is_thread_local(struct mm_struct *mm)
 {
diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
index 0d233763441f..143b4fd396f0 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -645,19 +645,29 @@ static void do_exit_flush_lazy_tlb(void *arg)
 	struct mm_struct *mm = arg;
 	unsigned long pid = mm->context.id;
 
+	/*
+	 * A kthread could have done a mmget_not_zero() after the flushing CPU
+	 * checked mm_is_singlethreaded, and be in the process of
+	 * kthread_use_mm when interrupted here. In that case, current->mm will
+	 * be set to mm, because kthread_use_mm() setting ->mm and switching to
+	 * the mm is done with interrupts off.
+	 */
 	if (current->mm == mm)
-		return; /* Local CPU */
+		goto out_flush;
 
 	if (current->active_mm == mm) {
-		/*
-		 * Must be a kernel thread because sender is single-threaded.
-		 */
-		BUG_ON(current->mm);
+		WARN_ON_ONCE(current->mm != NULL);
+		/* Is a kernel thread and is using mm as the lazy tlb */
 		mmgrab(&init_mm);
-		switch_mm(mm, &init_mm, current);
 		current->active_mm = &init_mm;
+		switch_mm_irqs_off(mm, &init_mm, current);
 		mmdrop(mm);
 	}
+
+	atomic_dec(&mm->context.active_cpus);
+	cpumask_clear_cpu(smp_processor_id(), mm_cpumask(mm));
+
+out_flush:
 	_tlbiel_pid(pid, RIC_FLUSH_ALL);
 }
 
@@ -672,7 +682,6 @@ static void exit_flush_lazy_tlbs(struct mm_struct *mm)
 	 */
 	smp_call_function_many(mm_cpumask(mm), do_exit_flush_lazy_tlb,
 				(void *)mm, 1);
-	mm_reset_thread_local(mm);
 }
 
 void radix__flush_tlb_mm(struct mm_struct *mm)
-- 
2.23.0



* Re: [PATCH v2 3/4] sparc64: remove mm_cpumask clearing to fix kthread_use_mm race
  2020-09-14  4:52 ` [PATCH v2 3/4] sparc64: remove mm_cpumask clearing to fix kthread_use_mm race Nicholas Piggin
@ 2020-09-14  7:00   ` Nicholas Piggin
  2020-09-14 10:23     ` Anatoly Pugachev
  2020-09-14 19:59   ` David Miller
  1 sibling, 1 reply; 16+ messages in thread
From: Nicholas Piggin @ 2020-09-14  7:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Aneesh Kumar K . V, Jens Axboe, David S . Miller,
	linux-arch, linux-kernel, linuxppc-dev, Peter Zijlstra,
	sparclinux

Excerpts from Nicholas Piggin's message of September 14, 2020 2:52 pm:

[...]

> The basic fix for sparc64 is to remove its mm_cpumask clearing code. The
> optimisation could be effectively restored by sending IPIs to mm_cpumask
> members and having them remove themselves from mm_cpumask. This is more
> tricky so I leave it as an exercise for someone with a sparc64 SMP.
> powerpc has a (currently similarly broken) example.

So this compiles and boots on qemu, but qemu does not support any
sparc64 machines with SMP. Attempting some simple hacks doesn't get
me far because openbios isn't populating an SMP device tree, which
blows up everywhere.

The patch is _relatively_ simple, hopefully it shouldn't explode, so
it's probably ready for testing on real SMP hardware, if someone has
a few cycles.

Thanks,
Nick

> 
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> ---
>  arch/sparc/kernel/smp_64.c | 65 ++++++++------------------------------
>  1 file changed, 14 insertions(+), 51 deletions(-)
> 
> diff --git a/arch/sparc/kernel/smp_64.c b/arch/sparc/kernel/smp_64.c
> index e286e2badc8a..e38d8bf454e8 100644
> --- a/arch/sparc/kernel/smp_64.c
> +++ b/arch/sparc/kernel/smp_64.c
> @@ -1039,38 +1039,9 @@ void smp_fetch_global_pmu(void)
>   * are flush_tlb_*() routines, and these run after flush_cache_*()
>   * which performs the flushw.
>   *
> - * The SMP TLB coherency scheme we use works as follows:
> - *
> - * 1) mm->cpu_vm_mask is a bit mask of which cpus an address
> - *    space has (potentially) executed on, this is the heuristic
> - *    we use to avoid doing cross calls.
> - *
> - *    Also, for flushing from kswapd and also for clones, we
> - *    use cpu_vm_mask as the list of cpus to make run the TLB.
> - *
> - * 2) TLB context numbers are shared globally across all processors
> - *    in the system, this allows us to play several games to avoid
> - *    cross calls.
> - *
> - *    One invariant is that when a cpu switches to a process, and
> - *    that processes tsk->active_mm->cpu_vm_mask does not have the
> - *    current cpu's bit set, that tlb context is flushed locally.
> - *
> - *    If the address space is non-shared (ie. mm->count == 1) we avoid
> - *    cross calls when we want to flush the currently running process's
> - *    tlb state.  This is done by clearing all cpu bits except the current
> - *    processor's in current->mm->cpu_vm_mask and performing the
> - *    flush locally only.  This will force any subsequent cpus which run
> - *    this task to flush the context from the local tlb if the process
> - *    migrates to another cpu (again).
> - *
> - * 3) For shared address spaces (threads) and swapping we bite the
> - *    bullet for most cases and perform the cross call (but only to
> - *    the cpus listed in cpu_vm_mask).
> - *
> - *    The performance gain from "optimizing" away the cross call for threads is
> - *    questionable (in theory the big win for threads is the massive sharing of
> - *    address space state across processors).
> + * mm->cpu_vm_mask is a bit mask of which cpus an address
> + * space has (potentially) executed on, this is the heuristic
> + * we use to limit cross calls.
>   */
>  
>  /* This currently is only used by the hugetlb arch pre-fault
> @@ -1080,18 +1051,13 @@ void smp_fetch_global_pmu(void)
>  void smp_flush_tlb_mm(struct mm_struct *mm)
>  {
>  	u32 ctx = CTX_HWBITS(mm->context);
> -	int cpu = get_cpu();
>  
> -	if (atomic_read(&mm->mm_users) == 1) {
> -		cpumask_copy(mm_cpumask(mm), cpumask_of(cpu));
> -		goto local_flush_and_out;
> -	}
> +	get_cpu();
>  
>  	smp_cross_call_masked(&xcall_flush_tlb_mm,
>  			      ctx, 0, 0,
>  			      mm_cpumask(mm));
>  
> -local_flush_and_out:
>  	__flush_tlb_mm(ctx, SECONDARY_CONTEXT);
>  
>  	put_cpu();
> @@ -1114,17 +1080,15 @@ void smp_flush_tlb_pending(struct mm_struct *mm, unsigned long nr, unsigned long
>  {
>  	u32 ctx = CTX_HWBITS(mm->context);
>  	struct tlb_pending_info info;
> -	int cpu = get_cpu();
> +
> +	get_cpu();
>  
>  	info.ctx = ctx;
>  	info.nr = nr;
>  	info.vaddrs = vaddrs;
>  
> -	if (mm == current->mm && atomic_read(&mm->mm_users) == 1)
> -		cpumask_copy(mm_cpumask(mm), cpumask_of(cpu));
> -	else
> -		smp_call_function_many(mm_cpumask(mm), tlb_pending_func,
> -				       &info, 1);
> +	smp_call_function_many(mm_cpumask(mm), tlb_pending_func,
> +			       &info, 1);
>  
>  	__flush_tlb_pending(ctx, nr, vaddrs);
>  
> @@ -1134,14 +1098,13 @@ void smp_flush_tlb_pending(struct mm_struct *mm, unsigned long nr, unsigned long
>  void smp_flush_tlb_page(struct mm_struct *mm, unsigned long vaddr)
>  {
>  	unsigned long context = CTX_HWBITS(mm->context);
> -	int cpu = get_cpu();
>  
> -	if (mm == current->mm && atomic_read(&mm->mm_users) == 1)
> -		cpumask_copy(mm_cpumask(mm), cpumask_of(cpu));
> -	else
> -		smp_cross_call_masked(&xcall_flush_tlb_page,
> -				      context, vaddr, 0,
> -				      mm_cpumask(mm));
> +	get_cpu();
> +
> +	smp_cross_call_masked(&xcall_flush_tlb_page,
> +			      context, vaddr, 0,
> +			      mm_cpumask(mm));
> +
>  	__flush_tlb_page(context, vaddr);
>  
>  	put_cpu();
> -- 
> 2.23.0
> 
> 


* Re: [PATCH v2 3/4] sparc64: remove mm_cpumask clearing to fix kthread_use_mm race
  2020-09-14  7:00   ` Nicholas Piggin
@ 2020-09-14 10:23     ` Anatoly Pugachev
  2020-09-15  2:49       ` Nicholas Piggin
  0 siblings, 1 reply; 16+ messages in thread
From: Anatoly Pugachev @ 2020-09-14 10:23 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: linux-mm, Andrew Morton, Aneesh Kumar K . V, Jens Axboe,
	David S . Miller, linux-arch, Linux Kernel list, linuxppc-dev,
	Peter Zijlstra, Sparc kernel list

On Mon, Sep 14, 2020 at 10:00 AM Nicholas Piggin <npiggin@gmail.com> wrote:
>
> Excerpts from Nicholas Piggin's message of September 14, 2020 2:52 pm:
>
> [...]
>
> > The basic fix for sparc64 is to remove its mm_cpumask clearing code. The
> > optimisation could be effectively restored by sending IPIs to mm_cpumask
> > members and having them remove themselves from mm_cpumask. This is more
> > tricky so I leave it as an exercise for someone with a sparc64 SMP.
> > powerpc has a (currently similarly broken) example.
>
> So this compiles and boots on qemu, but qemu does not support any
> sparc64 machines with SMP. Attempting some simple hacks doesn't get
> me far because openbios isn't populating an SMP device tree, which
> blows up everywhere.
>
> The patch is _relatively_ simple, hopefully it shouldn't explode, so
> it's probably ready for testing on real SMP hardware, if someone has
> a few cycles.

Nick,

applied this patch on top of the 'v5.9-rc5' tag, using my test VM (ldom)
with 32 vcpus.
The machine boots, and a stress-ng test (run as
"stress-ng --cpu 8 --io 8 --vm 8 --vm-bytes 2G --fork 8 --timeout 15m")
finishes without errors.


* Re: [PATCH v2 1/4] mm: fix exec activate_mm vs TLB shootdown and lazy tlb switching race
  2020-09-14  4:52 ` [PATCH v2 1/4] mm: fix exec activate_mm vs TLB shootdown and lazy tlb switching race Nicholas Piggin
@ 2020-09-14 10:56   ` peterz
  2020-09-15  2:48     ` Nicholas Piggin
  0 siblings, 1 reply; 16+ messages in thread
From: peterz @ 2020-09-14 10:56 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: linux-mm @ kvack . org, linux-arch, linux-kernel, linuxppc-dev,
	sparclinux, Aneesh Kumar K . V, Andrew Morton, Jens Axboe,
	David S . Miller, Andy Lutomirski, Dave Hansen

On Mon, Sep 14, 2020 at 02:52:16PM +1000, Nicholas Piggin wrote:
> Reading and modifying current->mm and current->active_mm and switching
> mm should be done with irqs off, to prevent races seeing an intermediate
> state.
> 
> This is similar to commit 38cf307c1f20 ("mm: fix kthread_use_mm() vs TLB
> invalidate"). At exec-time when the new mm is activated, the old one
> should usually be single-threaded and no longer used, unless something
> else is holding an mm_users reference (which may be possible).
> 
> Absent other mm_users, there is also a race with preemption and lazy tlb
> switching. Consider the kernel_execve case where the current thread is
> using a lazy tlb active mm:
> 
>   call_usermodehelper()
>     kernel_execve()
>       old_mm = current->mm;
>       active_mm = current->active_mm;
>       *** preempt *** -------------------->  schedule()
>                                                prev->active_mm = NULL;
>                                                mmdrop(prev active_mm);
>                                              ...
>                       <--------------------  schedule()
>       current->mm = mm;
>       current->active_mm = mm;
>       if (!old_mm)
>           mmdrop(active_mm);
> 
> If we switch back to the kernel thread from a different mm, there is a
> double free of the old active_mm, and a missing free of the new one.
> 
> Closing this race only requires interrupts to be disabled while ->mm
> and ->active_mm are being switched, but the TLB problem requires also
> holding interrupts off over activate_mm. Unfortunately not all archs
> can do that yet, e.g., arm defers the switch if irqs are disabled and
> expects finish_arch_post_lock_switch() to be called to complete the
> flush; um takes a blocking lock in activate_mm().
> 
> So as a first step, disable interrupts across the mm/active_mm updates
> to close the lazy tlb preempt race, and provide an arch option to
> extend that to activate_mm which allows architectures doing IPI based
> TLB shootdowns to close the second race.
> 
> This is a bit ugly, but in the interest of fixing the bug and backporting
> before all architectures are converted this is a compromise.
> 
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

I'm thinking we want this selected on x86 as well. Andy?

> ---
>  arch/Kconfig |  7 +++++++
>  fs/exec.c    | 17 +++++++++++++++--
>  2 files changed, 22 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/Kconfig b/arch/Kconfig
> index af14a567b493..94821e3f94d1 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -414,6 +414,13 @@ config MMU_GATHER_NO_GATHER
>  	bool
>  	depends on MMU_GATHER_TABLE_FREE
>  
> +config ARCH_WANT_IRQS_OFF_ACTIVATE_MM
> +	bool
> +	help
> +	  Temporary select until all architectures can be converted to have
> +	  irqs disabled over activate_mm. Architectures that do IPI based TLB
> +	  shootdowns should enable this.
> +
>  config ARCH_HAVE_NMI_SAFE_CMPXCHG
>  	bool
>  
> diff --git a/fs/exec.c b/fs/exec.c
> index a91003e28eaa..d4fb18baf1fb 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1130,11 +1130,24 @@ static int exec_mmap(struct mm_struct *mm)
>  	}
>  
>  	task_lock(tsk);
> -	active_mm = tsk->active_mm;
>  	membarrier_exec_mmap(mm);
> -	tsk->mm = mm;
> +
> +	local_irq_disable();
> +	active_mm = tsk->active_mm;
>  	tsk->active_mm = mm;
> +	tsk->mm = mm;
> +	/*
> +	 * This prevents preemption while active_mm is being loaded and
> +	 * it and mm are being updated, which could cause problems for
> +	 * lazy tlb mm refcounting when these are updated by context
> +	 * switches. Not all architectures can handle irqs off over
> +	 * activate_mm yet.
> +	 */
> +	if (!IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
> +		local_irq_enable();
>  	activate_mm(active_mm, mm);
> +	if (IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
> +		local_irq_enable();
>  	tsk->mm->vmacache_seqnum = 0;
>  	vmacache_flush(tsk);
>  	task_unlock(tsk);
> -- 
> 2.23.0
> 


* Re: [PATCH v2 3/4] sparc64: remove mm_cpumask clearing to fix kthread_use_mm race
  2020-09-14  4:52 ` [PATCH v2 3/4] sparc64: remove mm_cpumask clearing to fix kthread_use_mm race Nicholas Piggin
  2020-09-14  7:00   ` Nicholas Piggin
@ 2020-09-14 19:59   ` David Miller
  2020-09-15  3:24     ` Nicholas Piggin
  1 sibling, 1 reply; 16+ messages in thread
From: David Miller @ 2020-09-14 19:59 UTC (permalink / raw)
  To: npiggin
  Cc: linux-mm, linux-arch, linux-kernel, linuxppc-dev, sparclinux,
	aneesh.kumar, akpm, axboe, peterz

From: Nicholas Piggin <npiggin@gmail.com>
Date: Mon, 14 Sep 2020 14:52:18 +1000

 ...
> The basic fix for sparc64 is to remove its mm_cpumask clearing code. The
> optimisation could be effectively restored by sending IPIs to mm_cpumask
> members and having them remove themselves from mm_cpumask. This is more
> tricky so I leave it as an exercise for someone with a sparc64 SMP.
> powerpc has a (currently similarly broken) example.
> 
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>

Sad to see this optimization go away, but what can I do:

Acked-by: David S. Miller <davem@davemloft.net>


* Re: [PATCH v2 1/4] mm: fix exec activate_mm vs TLB shootdown and lazy tlb switching race
  2020-09-14 10:56   ` peterz
@ 2020-09-15  2:48     ` Nicholas Piggin
  2020-09-15 11:26       ` Michael Ellerman
  2020-09-18 12:18       ` Michael Ellerman
  0 siblings, 2 replies; 16+ messages in thread
From: Nicholas Piggin @ 2020-09-15  2:48 UTC (permalink / raw)
  To: peterz
  Cc: Andrew Morton, Aneesh Kumar K . V, Jens Axboe, Dave Hansen,
	David S . Miller, linux-arch, linux-kernel,
	linux-mm @ kvack . org, linuxppc-dev, Andy Lutomirski,
	sparclinux

Excerpts from peterz@infradead.org's message of September 14, 2020 8:56 pm:
> On Mon, Sep 14, 2020 at 02:52:16PM +1000, Nicholas Piggin wrote:
>> Reading and modifying current->mm and current->active_mm and switching
>> mm should be done with irqs off, to prevent races seeing an intermediate
>> state.
>> 
>> This is similar to commit 38cf307c1f20 ("mm: fix kthread_use_mm() vs TLB
>> invalidate"). At exec-time when the new mm is activated, the old one
>> should usually be single-threaded and no longer used, unless something
>> else is holding an mm_users reference (which may be possible).
>> 
>> Absent other mm_users, there is also a race with preemption and lazy tlb
>> switching. Consider the kernel_execve case where the current thread is
>> using a lazy tlb active mm:
>> 
>>   call_usermodehelper()
>>     kernel_execve()
>>       old_mm = current->mm;
>>       active_mm = current->active_mm;
>>       *** preempt *** -------------------->  schedule()
>>                                                prev->active_mm = NULL;
>>                                                mmdrop(prev active_mm);
>>                                              ...
>>                       <--------------------  schedule()
>>       current->mm = mm;
>>       current->active_mm = mm;
>>       if (!old_mm)
>>           mmdrop(active_mm);
>> 
>> If we switch back to the kernel thread from a different mm, there is a
>> double free of the old active_mm, and a missing free of the new one.
>> 
>> Closing this race only requires interrupts to be disabled while ->mm
>> and ->active_mm are being switched, but the TLB problem requires also
>> holding interrupts off over activate_mm. Unfortunately not all archs
>> can do that yet, e.g., arm defers the switch if irqs are disabled and
>> expects finish_arch_post_lock_switch() to be called to complete the
>> flush; um takes a blocking lock in activate_mm().
>> 
>> So as a first step, disable interrupts across the mm/active_mm updates
>> to close the lazy tlb preempt race, and provide an arch option to
>> extend that to activate_mm which allows architectures doing IPI based
>> TLB shootdowns to close the second race.
>> 
>> This is a bit ugly, but in the interest of fixing the bug and backporting
>> before all architectures are converted this is a compromise.
>> 
>> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> 
> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> 
> I'm thinking we want this selected on x86 as well. Andy?

Thanks for the ack. The plan was to take it through the powerpc tree,
but if you'd want x86 to select it, maybe a topic branch? Although
Michael will be away during the next merge window so I don't want to
get too fancy. Would you mind doing it in a follow up merge after
powerpc, being that it's (I think) a small change?

I do think all archs should be selecting this, and we want to remove
the divergent code paths from here as soon as possible. I was planning
to send patches for the N+1 window at least for all the easy archs.
But the sooner the better really, we obviously want to share code
coverage with x86 :)

Thanks,
Nick


* Re: [PATCH v2 3/4] sparc64: remove mm_cpumask clearing to fix kthread_use_mm race
  2020-09-14 10:23     ` Anatoly Pugachev
@ 2020-09-15  2:49       ` Nicholas Piggin
  0 siblings, 0 replies; 16+ messages in thread
From: Nicholas Piggin @ 2020-09-15  2:49 UTC (permalink / raw)
  To: Anatoly Pugachev
  Cc: Andrew Morton, Aneesh Kumar K . V, Jens Axboe, David S . Miller,
	linux-arch, Linux Kernel list, linux-mm, linuxppc-dev,
	Peter Zijlstra, Sparc kernel list

Excerpts from Anatoly Pugachev's message of September 14, 2020 8:23 pm:
> On Mon, Sep 14, 2020 at 10:00 AM Nicholas Piggin <npiggin@gmail.com> wrote:
>>
>> Excerpts from Nicholas Piggin's message of September 14, 2020 2:52 pm:
>>
>> [...]
>>
>> > The basic fix for sparc64 is to remove its mm_cpumask clearing code. The
>> > optimisation could be effectively restored by sending IPIs to mm_cpumask
>> > members and having them remove themselves from mm_cpumask. This is more
>> > tricky so I leave it as an exercise for someone with a sparc64 SMP.
>> > powerpc has a (currently similarly broken) example.
>>
>> So this compiles and boots on qemu, but qemu does not support any
>> sparc64 machines with SMP. Attempting some simple hacks doesn't get
>> me far because openbios isn't populating an SMP device tree, which
>> blows up everywhere.
>>
>> The patch is _relatively_ simple, hopefully it shouldn't explode, so
>> it's probably ready for testing on real SMP hardware, if someone has
>> a few cycles.
> 
> Nick,
> 
> applied this patch on top of the 'v5.9-rc5' tag, using my test VM (ldom)
> with 32 vcpus.
> The machine boots, and a stress-ng test (run as
> "stress-ng --cpu 8 --io 8 --vm 8 --vm-bytes 2G --fork 8 --timeout 15m")
> finishes without errors.
> 

Thank you very much Anatoly.

Thanks,
Nick


* Re: [PATCH v2 3/4] sparc64: remove mm_cpumask clearing to fix kthread_use_mm race
  2020-09-14 19:59   ` David Miller
@ 2020-09-15  3:24     ` Nicholas Piggin
  2020-09-15 19:42       ` David Miller
  0 siblings, 1 reply; 16+ messages in thread
From: Nicholas Piggin @ 2020-09-15  3:24 UTC (permalink / raw)
  To: David Miller
  Cc: akpm, aneesh.kumar, axboe, linux-arch, linux-kernel, linux-mm,
	linuxppc-dev, peterz, sparclinux

Excerpts from David Miller's message of September 15, 2020 5:59 am:
> From: Nicholas Piggin <npiggin@gmail.com>
> Date: Mon, 14 Sep 2020 14:52:18 +1000
> 
>  ...
>> The basic fix for sparc64 is to remove its mm_cpumask clearing code. The
>> optimisation could be effectively restored by sending IPIs to mm_cpumask
>> members and having them remove themselves from mm_cpumask. This is more
>> tricky so I leave it as an exercise for someone with a sparc64 SMP.
>> powerpc has a (currently similarly broken) example.
>> 
>> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> 
> Sad to see this optimization go away, but what can I do:
> 
> Acked-by: David S. Miller <davem@davemloft.net>
> 

Thanks Dave, any objection if we merge this via the powerpc tree
to keep the commits together?

Thanks,
Nick


* Re: [PATCH v2 1/4] mm: fix exec activate_mm vs TLB shootdown and lazy tlb switching race
  2020-09-15  2:48     ` Nicholas Piggin
@ 2020-09-15 11:26       ` Michael Ellerman
  2020-09-18 12:18       ` Michael Ellerman
  1 sibling, 0 replies; 16+ messages in thread
From: Michael Ellerman @ 2020-09-15 11:26 UTC (permalink / raw)
  To: Nicholas Piggin, peterz
  Cc: Jens Axboe, linux-arch, linux-mm @ kvack . org,
	Aneesh Kumar K . V, linux-kernel, Andy Lutomirski, Dave Hansen,
	sparclinux, Andrew Morton, linuxppc-dev, David S . Miller

Nicholas Piggin <npiggin@gmail.com> writes:
> Excerpts from peterz@infradead.org's message of September 14, 2020 8:56 pm:
>> On Mon, Sep 14, 2020 at 02:52:16PM +1000, Nicholas Piggin wrote:
>>> Reading and modifying current->mm and current->active_mm and switching
>>> mm should be done with irqs off, to prevent races seeing an intermediate
>>> state.
...
>>> 
>>> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
>> 
>> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>> 
>> I'm thinking we want this selected on x86 as well. Andy?
>
> Thanks for the ack. The plan was to take it through the powerpc tree,
> but if you'd want x86 to select it, maybe a topic branch? Although
> Michael will be away during the next merge window so I don't want to
> get too fancy. Would you mind doing it in a follow up merge after
> powerpc, being that it's (I think) a small change?

Or get akpm to take the series, including the x86 change.

cheers


* Re: [PATCH v2 3/4] sparc64: remove mm_cpumask clearing to fix kthread_use_mm race
  2020-09-15  3:24     ` Nicholas Piggin
@ 2020-09-15 19:42       ` David Miller
  0 siblings, 0 replies; 16+ messages in thread
From: David Miller @ 2020-09-15 19:42 UTC (permalink / raw)
  To: npiggin
  Cc: akpm, aneesh.kumar, axboe, linux-arch, linux-kernel, linux-mm,
	linuxppc-dev, peterz, sparclinux

From: Nicholas Piggin <npiggin@gmail.com>
Date: Tue, 15 Sep 2020 13:24:07 +1000

> Excerpts from David Miller's message of September 15, 2020 5:59 am:
>> From: Nicholas Piggin <npiggin@gmail.com>
>> Date: Mon, 14 Sep 2020 14:52:18 +1000
>> 
>>  ...
>>> The basic fix for sparc64 is to remove its mm_cpumask clearing code. The
>>> optimisation could be effectively restored by sending IPIs to mm_cpumask
>>> members and having them remove themselves from mm_cpumask. This is more
>>> tricky so I leave it as an exercise for someone with a sparc64 SMP.
>>> powerpc has a (currently similarly broken) example.
>>> 
>>> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
>> 
>> Sad to see this optimization go away, but what can I do:
>> 
>> Acked-by: David S. Miller <davem@davemloft.net>
>> 
> 
> Thanks Dave, any objection if we merge this via the powerpc tree
> to keep the commits together?

No objection.


* Re: [PATCH v2 1/4] mm: fix exec activate_mm vs TLB shootdown and lazy tlb switching race
  2020-09-15  2:48     ` Nicholas Piggin
  2020-09-15 11:26       ` Michael Ellerman
@ 2020-09-18 12:18       ` Michael Ellerman
  1 sibling, 0 replies; 16+ messages in thread
From: Michael Ellerman @ 2020-09-18 12:18 UTC (permalink / raw)
  To: Nicholas Piggin, peterz
  Cc: Jens Axboe, linux-arch, linux-mm @ kvack . org,
	Aneesh Kumar K . V, linux-kernel, Andy Lutomirski, Dave Hansen,
	sparclinux, Andrew Morton, linuxppc-dev, David S . Miller

Nicholas Piggin <npiggin@gmail.com> writes:
> Excerpts from peterz@infradead.org's message of September 14, 2020 8:56 pm:
>> On Mon, Sep 14, 2020 at 02:52:16PM +1000, Nicholas Piggin wrote:
>>> Reading and modifying current->mm and current->active_mm and switching
>>> mm should be done with irqs off, to prevent races seeing an intermediate
>>> state.
...
>>> 
>>> This is a bit ugly, but in the interest of fixing the bug and backporting
>>> before all architectures are converted this is a compromise.
>>> 
>>> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
>> 
>> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>> 
>> I'm thinking we want this selected on x86 as well. Andy?
>
> Thanks for the ack. The plan was to take it through the powerpc tree,
> but if you'd want x86 to select it, maybe a topic branch?

I've put this series in a topic branch based on v5.9-rc2:

  https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/log/?h=topic/irqs-off-activate-mm

I plan to merge it into the powerpc/next tree for v5.10, but if anyone
else wants to merge it that's fine too.

cheers


* Re: [PATCH v2 0/4] more mm switching vs TLB shootdown and lazy tlb fixes
  2020-09-14  4:52 [PATCH v2 0/4] more mm switching vs TLB shootdown and lazy tlb fixes Nicholas Piggin
                   ` (3 preceding siblings ...)
  2020-09-14  4:52 ` [PATCH v2 4/4] powerpc/64s/radix: Fix mm_cpumask trimming race vs kthread_use_mm Nicholas Piggin
@ 2020-09-24 12:28 ` Michael Ellerman
  4 siblings, 0 replies; 16+ messages in thread
From: Michael Ellerman @ 2020-09-24 12:28 UTC (permalink / raw)
  To: Nicholas Piggin, linux-mm @ kvack . org
  Cc: Aneesh Kumar K . V, linux-kernel, Andrew Morton, Peter Zijlstra,
	David S . Miller, sparclinux, Jens Axboe, linux-arch,
	linuxppc-dev

On Mon, 14 Sep 2020 14:52:15 +1000, Nicholas Piggin wrote:
> This is an attempt to fix a few different related issues around
> switching mm, TLB flushing, and lazy tlb mm handling.
> 
> This will require all architectures to eventually move to disabling
> irqs over activate_mm, but it's possible we could add another arch
> call after irqs are re-enabled for those few which can't do their
> entire activation with irqs disabled.
> 
> [...]

Applied to powerpc/next.

[1/4] mm: fix exec activate_mm vs TLB shootdown and lazy tlb switching race
      https://git.kernel.org/powerpc/c/d53c3dfb23c45f7d4f910c3a3ca84bf0a99c6143
[2/4] powerpc: select ARCH_WANT_IRQS_OFF_ACTIVATE_MM
      https://git.kernel.org/powerpc/c/66acd46080bd9e5ad2be4b0eb1d498d5145d058e
[3/4] sparc64: remove mm_cpumask clearing to fix kthread_use_mm race
      https://git.kernel.org/powerpc/c/bafb056ce27940c9994ea905336aa8f27b4f7275
[4/4] powerpc/64s/radix: Fix mm_cpumask trimming race vs kthread_use_mm
      https://git.kernel.org/powerpc/c/a665eec0a22e11cdde708c1c256a465ebe768047

cheers


