* [PATCH 0/4] more mm switching vs TLB shootdown and lazy tlb
@ 2020-08-28 10:00 Nicholas Piggin
  2020-08-28 10:00 ` [PATCH 1/4] mm: fix exec activate_mm vs TLB shootdown and lazy tlb switching race Nicholas Piggin
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Nicholas Piggin @ 2020-08-28 10:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Nicholas Piggin, linux-arch, linux-kernel, linuxppc-dev,
	Aneesh Kumar K.V, Andrew Morton, Jens Axboe, Peter Zijlstra,
	David S. Miller

This is an attempt to fix a few different related issues around
switching mm, TLB flushing, and lazy tlb mm handling.

This will require all architectures to eventually move to disabling
irqs over activate_mm, but it's possible we could add another arch
call after irqs are re-enabled for those few which can't do their
entire activation with irqs disabled.

I'd like some feedback on the sparc/powerpc vs kthread_use_mm
problem too.

Thanks,
Nick

Nicholas Piggin (4):
  mm: fix exec activate_mm vs TLB shootdown and lazy tlb switching race
  powerpc: select ARCH_WANT_IRQS_OFF_ACTIVATE_MM
  sparc64: remove mm_cpumask clearing to fix kthread_use_mm race
  powerpc/64s/radix: Fix mm_cpumask trimming race vs kthread_use_mm

 arch/Kconfig                           |  7 +++
 arch/powerpc/Kconfig                   |  1 +
 arch/powerpc/include/asm/mmu_context.h |  2 +-
 arch/powerpc/include/asm/tlb.h         | 13 ------
 arch/powerpc/mm/book3s64/radix_tlb.c   | 23 ++++++---
 arch/sparc/kernel/smp_64.c             | 65 ++++++--------------------
 fs/exec.c                              | 17 ++++++-
 7 files changed, 54 insertions(+), 74 deletions(-)

-- 
2.23.0




* [PATCH 1/4] mm: fix exec activate_mm vs TLB shootdown and lazy tlb switching race
  2020-08-28 10:00 [PATCH 0/4] more mm switching vs TLB shootdown and lazy tlb Nicholas Piggin
@ 2020-08-28 10:00 ` Nicholas Piggin
  2020-08-28 11:15   ` peterz
  2020-08-28 10:00 ` [PATCH 2/4] powerpc: select ARCH_WANT_IRQS_OFF_ACTIVATE_MM Nicholas Piggin
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 9+ messages in thread
From: Nicholas Piggin @ 2020-08-28 10:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Nicholas Piggin, linux-arch, linux-kernel, linuxppc-dev,
	Aneesh Kumar K.V, Andrew Morton, Jens Axboe, Peter Zijlstra,
	David S. Miller

Reading and modifying current->mm and current->active_mm and switching
mm should be done with irqs off, to prevent races seeing an intermediate
state.

This is similar to commit 38cf307c1f20 ("mm: fix kthread_use_mm() vs TLB
invalidate"). At exec-time when the new mm is activated, the old one
should usually be single-threaded and no longer used, unless something
else is holding an mm_users reference (which may be possible).

Absent other mm_users, there is also a race with preemption and lazy tlb
switching. Consider the kernel_execve case where the current thread is
using a lazy tlb active mm:

  call_usermodehelper()
    kernel_execve()
      old_mm = current->mm;
      active_mm = current->active_mm;
      *** preempt *** -------------------->  schedule()
                                               prev->active_mm = NULL;
                                               mmdrop(prev active_mm);
                                             ...
                      <--------------------  schedule()
      current->mm = mm;
      current->active_mm = mm;
      if (!old_mm)
          mmdrop(active_mm);

If we switch back to the kernel thread from a different mm, there is a
double free of the old active_mm, and a missing free of the new one.
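
For comparison, a simplified sketch of the same path with the fix below
applied (not the verbatim code):

  call_usermodehelper()
    kernel_execve()
      old_mm = current->mm;
      local_irq_disable();        /* preempt/schedule() excluded here */
      active_mm = current->active_mm;
      current->active_mm = mm;
      current->mm = mm;
      activate_mm(active_mm, mm); /* covered only if the arch selects
                                     ARCH_WANT_IRQS_OFF_ACTIVATE_MM */
      local_irq_enable();
      if (!old_mm)
          mmdrop(active_mm);

schedule() can now only observe the old or the new (mm, active_mm)
pair, never the intermediate state, so the double free cannot occur.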

Closing this race only requires interrupts to be disabled while ->mm
and ->active_mm are being switched, but the TLB problem requires also
holding interrupts off over activate_mm. Unfortunately not all archs
can do that yet, e.g., arm defers the switch if irqs are disabled and
expects finish_arch_post_lock_switch() to be called to complete the
flush; um takes a blocking lock in activate_mm().

So as a first step, disable interrupts across the mm/active_mm updates
to close the lazy tlb preempt race, and provide an arch option to
extend that to activate_mm which allows architectures doing IPI based
TLB shootdowns to close the second race.

This is a bit ugly, but it is a compromise in the interest of fixing the
bug and allowing backports before all architectures are converted.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/Kconfig |  7 +++++++
 fs/exec.c    | 17 +++++++++++++++--
 2 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index af14a567b493..94821e3f94d1 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -414,6 +414,13 @@ config MMU_GATHER_NO_GATHER
 	bool
 	depends on MMU_GATHER_TABLE_FREE
 
+config ARCH_WANT_IRQS_OFF_ACTIVATE_MM
+	bool
+	help
+	  Temporary select until all architectures can be converted to have
+	  irqs disabled over activate_mm. Architectures that do IPI based TLB
+	  shootdowns should enable this.
+
 config ARCH_HAVE_NMI_SAFE_CMPXCHG
 	bool
 
diff --git a/fs/exec.c b/fs/exec.c
index a91003e28eaa..d4fb18baf1fb 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1130,11 +1130,24 @@ static int exec_mmap(struct mm_struct *mm)
 	}
 
 	task_lock(tsk);
-	active_mm = tsk->active_mm;
 	membarrier_exec_mmap(mm);
-	tsk->mm = mm;
+
+	local_irq_disable();
+	active_mm = tsk->active_mm;
 	tsk->active_mm = mm;
+	tsk->mm = mm;
+	/*
+	 * This prevents preemption while active_mm is being loaded and
+	 * it and mm are being updated, which could cause problems for
+	 * lazy tlb mm refcounting when these are updated by context
+	 * switches. Not all architectures can handle irqs off over
+	 * activate_mm yet.
+	 */
+	if (!IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
+		local_irq_enable();
 	activate_mm(active_mm, mm);
+	if (IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
+		local_irq_enable();
 	tsk->mm->vmacache_seqnum = 0;
 	vmacache_flush(tsk);
 	task_unlock(tsk);
-- 
2.23.0




* [PATCH 2/4] powerpc: select ARCH_WANT_IRQS_OFF_ACTIVATE_MM
  2020-08-28 10:00 [PATCH 0/4] more mm switching vs TLB shootdown and lazy tlb Nicholas Piggin
  2020-08-28 10:00 ` [PATCH 1/4] mm: fix exec activate_mm vs TLB shootdown and lazy tlb switching race Nicholas Piggin
@ 2020-08-28 10:00 ` Nicholas Piggin
  2020-08-28 10:00 ` [PATCH 3/4] sparc64: remove mm_cpumask clearing to fix kthread_use_mm race Nicholas Piggin
  2020-08-28 10:00 ` [PATCH 4/4] powerpc/64s/radix: Fix mm_cpumask trimming race vs kthread_use_mm Nicholas Piggin
  3 siblings, 0 replies; 9+ messages in thread
From: Nicholas Piggin @ 2020-08-28 10:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Nicholas Piggin, linux-arch, linux-kernel, linuxppc-dev,
	Aneesh Kumar K.V, Andrew Morton, Jens Axboe, Peter Zijlstra,
	David S. Miller

powerpc uses IPIs in some situations to switch a kernel thread away
from a lazy tlb mm, which is subject to the TLB flushing race
described in the changelog introducing ARCH_WANT_IRQS_OFF_ACTIVATE_MM.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/Kconfig                   | 1 +
 arch/powerpc/include/asm/mmu_context.h | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 1f48bbfb3ce9..65cb32211574 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -149,6 +149,7 @@ config PPC
 	select ARCH_USE_QUEUED_RWLOCKS		if PPC_QUEUED_SPINLOCKS
 	select ARCH_USE_QUEUED_SPINLOCKS	if PPC_QUEUED_SPINLOCKS
 	select ARCH_WANT_IPC_PARSE_VERSION
+	select ARCH_WANT_IRQS_OFF_ACTIVATE_MM
 	select ARCH_WEAK_RELEASE_ACQUIRE
 	select BINFMT_ELF
 	select BUILDTIME_TABLE_SORT
diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
index 7f3658a97384..e02aa793420b 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -244,7 +244,7 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
  */
 static inline void activate_mm(struct mm_struct *prev, struct mm_struct *next)
 {
-	switch_mm(prev, next, current);
+	switch_mm_irqs_off(prev, next, current);
 }
 
 /* We don't currently use enter_lazy_tlb() for anything */
-- 
2.23.0




* [PATCH 3/4] sparc64: remove mm_cpumask clearing to fix kthread_use_mm race
  2020-08-28 10:00 [PATCH 0/4] more mm switching vs TLB shootdown and lazy tlb Nicholas Piggin
  2020-08-28 10:00 ` [PATCH 1/4] mm: fix exec activate_mm vs TLB shootdown and lazy tlb switching race Nicholas Piggin
  2020-08-28 10:00 ` [PATCH 2/4] powerpc: select ARCH_WANT_IRQS_OFF_ACTIVATE_MM Nicholas Piggin
@ 2020-08-28 10:00 ` Nicholas Piggin
  2020-08-28 10:00 ` [PATCH 4/4] powerpc/64s/radix: Fix mm_cpumask trimming race vs kthread_use_mm Nicholas Piggin
  3 siblings, 0 replies; 9+ messages in thread
From: Nicholas Piggin @ 2020-08-28 10:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Nicholas Piggin, linux-arch, linux-kernel, linuxppc-dev,
	Aneesh Kumar K.V, Andrew Morton, Jens Axboe, Peter Zijlstra,
	David S. Miller, sparclinux

The de facto (and apparently undocumented) rule for using an mm had,
thanks to this code in sparc if nothing else, been that you must have a
reference on mm_users *and that reference must have been obtained with
mmget()*, i.e., it must trace back to a thread that had actually used
the mm.

The introduction of mmget_not_zero() in commit d2005e3f41d4
("userfaultfd: don't pin the user memory in userfaultfd_file_create()")
allowed mm_count holders to operate on user mappings asynchronously
from the actual threads using the mm, but they were not to load those
mappings into their TLB (i.e., walking vmas and page tables is okay,
kthread_use_mm() is not).

io_uring commit 2b188cc1bb857 ("Add io_uring IO interface") added code
which does kthread_use_mm() from an mmget_not_zero() refcount.

The problem is that code which previously assumed that mm == current->mm
and mm->mm_users == 1 imply the mm will remain single-threaded (at least
until this thread creates another mm_users reference) is now broken.

arch/sparc/kernel/smp_64.c:

    if (atomic_read(&mm->mm_users) == 1) {
        cpumask_copy(mm_cpumask(mm), cpumask_of(cpu));
        goto local_flush_and_out;
    }

vs fs/io_uring.c

    if (unlikely(!(ctx->flags & IORING_SETUP_SQPOLL) ||
                 !mmget_not_zero(ctx->sqo_mm)))
        return -EFAULT;
    kthread_use_mm(ctx->sqo_mm);

An mmget_not_zero() could come in right after the mm_users == 1 test,
followed by kthread_use_mm(), which sets the kthread's CPU in
mm_cpumask(mm). That update is lost if cpumask_copy() runs afterward.
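
Written out as one possible interleaving (illustrative):

    CPU0 (task, mm_users == 1)           CPU1 (io_uring kthread)
    --------------------------           -----------------------
    smp_flush_tlb_mm()
      atomic_read(&mm->mm_users) == 1
                                         mmget_not_zero(ctx->sqo_mm)
                                         kthread_use_mm(ctx->sqo_mm)
                                           sets CPU1 in mm_cpumask(mm)
      cpumask_copy(mm_cpumask(mm),
                   cpumask_of(cpu0))     <- CPU1's bit is lost
      local flush only                   CPU1 keeps running with stale
                                         TLB entries for mm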

I propose we fix this by allowing mmget_not_zero() to be a first-class
reference, and not have this obscure undocumented and unchecked
restriction.

The basic fix for sparc64 is to remove its mm_cpumask clearing code. The
optimisation could be effectively restored by sending IPIs to mm_cpumask
members and having them remove themselves from mm_cpumask. That is
trickier, so I leave it as an exercise for someone with sparc64 SMP
hardware. powerpc has a (currently similarly broken) example.
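
For the record, a rough, untested sketch of what such a trimming cross
call might look like (handler name hypothetical; this is only safe if
the architecture switches mm with interrupts disabled):

    /* Hypothetical sketch: run on each CPU in mm_cpumask(mm) via
     * smp_call_function_many(), like tlb_pending_func() is.
     */
    static void tlb_mm_trim_func(void *arg)
    {
        struct mm_struct *mm = arg;

        /* mm still in use here? Then our bit must stay set. */
        if (current->mm == mm || current->active_mm == mm)
            return;

        /* Flush before leaving the mask so no stale entries survive. */
        __flush_tlb_mm(CTX_HWBITS(mm->context), SECONDARY_CONTEXT);
        cpumask_clear_cpu(smp_processor_id(), mm_cpumask(mm));
    }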

Cc: sparclinux@vger.kernel.org
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/sparc/kernel/smp_64.c | 65 ++++++++------------------------------
 1 file changed, 14 insertions(+), 51 deletions(-)

diff --git a/arch/sparc/kernel/smp_64.c b/arch/sparc/kernel/smp_64.c
index e286e2badc8a..e38d8bf454e8 100644
--- a/arch/sparc/kernel/smp_64.c
+++ b/arch/sparc/kernel/smp_64.c
@@ -1039,38 +1039,9 @@ void smp_fetch_global_pmu(void)
  * are flush_tlb_*() routines, and these run after flush_cache_*()
  * which performs the flushw.
  *
- * The SMP TLB coherency scheme we use works as follows:
- *
- * 1) mm->cpu_vm_mask is a bit mask of which cpus an address
- *    space has (potentially) executed on, this is the heuristic
- *    we use to avoid doing cross calls.
- *
- *    Also, for flushing from kswapd and also for clones, we
- *    use cpu_vm_mask as the list of cpus to make run the TLB.
- *
- * 2) TLB context numbers are shared globally across all processors
- *    in the system, this allows us to play several games to avoid
- *    cross calls.
- *
- *    One invariant is that when a cpu switches to a process, and
- *    that processes tsk->active_mm->cpu_vm_mask does not have the
- *    current cpu's bit set, that tlb context is flushed locally.
- *
- *    If the address space is non-shared (ie. mm->count == 1) we avoid
- *    cross calls when we want to flush the currently running process's
- *    tlb state.  This is done by clearing all cpu bits except the current
- *    processor's in current->mm->cpu_vm_mask and performing the
- *    flush locally only.  This will force any subsequent cpus which run
- *    this task to flush the context from the local tlb if the process
- *    migrates to another cpu (again).
- *
- * 3) For shared address spaces (threads) and swapping we bite the
- *    bullet for most cases and perform the cross call (but only to
- *    the cpus listed in cpu_vm_mask).
- *
- *    The performance gain from "optimizing" away the cross call for threads is
- *    questionable (in theory the big win for threads is the massive sharing of
- *    address space state across processors).
+ * mm->cpu_vm_mask is a bit mask of which cpus an address
+ * space has (potentially) executed on, this is the heuristic
+ * we use to limit cross calls.
  */
 
 /* This currently is only used by the hugetlb arch pre-fault
@@ -1080,18 +1051,13 @@ void smp_fetch_global_pmu(void)
 void smp_flush_tlb_mm(struct mm_struct *mm)
 {
 	u32 ctx = CTX_HWBITS(mm->context);
-	int cpu = get_cpu();
 
-	if (atomic_read(&mm->mm_users) == 1) {
-		cpumask_copy(mm_cpumask(mm), cpumask_of(cpu));
-		goto local_flush_and_out;
-	}
+	get_cpu();
 
 	smp_cross_call_masked(&xcall_flush_tlb_mm,
 			      ctx, 0, 0,
 			      mm_cpumask(mm));
 
-local_flush_and_out:
 	__flush_tlb_mm(ctx, SECONDARY_CONTEXT);
 
 	put_cpu();
@@ -1114,17 +1080,15 @@ void smp_flush_tlb_pending(struct mm_struct *mm, unsigned long nr, unsigned long
 {
 	u32 ctx = CTX_HWBITS(mm->context);
 	struct tlb_pending_info info;
-	int cpu = get_cpu();
+
+	get_cpu();
 
 	info.ctx = ctx;
 	info.nr = nr;
 	info.vaddrs = vaddrs;
 
-	if (mm == current->mm && atomic_read(&mm->mm_users) == 1)
-		cpumask_copy(mm_cpumask(mm), cpumask_of(cpu));
-	else
-		smp_call_function_many(mm_cpumask(mm), tlb_pending_func,
-				       &info, 1);
+	smp_call_function_many(mm_cpumask(mm), tlb_pending_func,
+			       &info, 1);
 
 	__flush_tlb_pending(ctx, nr, vaddrs);
 
@@ -1134,14 +1098,13 @@ void smp_flush_tlb_pending(struct mm_struct *mm, unsigned long nr, unsigned long
 void smp_flush_tlb_page(struct mm_struct *mm, unsigned long vaddr)
 {
 	unsigned long context = CTX_HWBITS(mm->context);
-	int cpu = get_cpu();
 
-	if (mm == current->mm && atomic_read(&mm->mm_users) == 1)
-		cpumask_copy(mm_cpumask(mm), cpumask_of(cpu));
-	else
-		smp_cross_call_masked(&xcall_flush_tlb_page,
-				      context, vaddr, 0,
-				      mm_cpumask(mm));
+	get_cpu();
+
+	smp_cross_call_masked(&xcall_flush_tlb_page,
+			      context, vaddr, 0,
+			      mm_cpumask(mm));
+
 	__flush_tlb_page(context, vaddr);
 
 	put_cpu();
-- 
2.23.0




* [PATCH 4/4] powerpc/64s/radix: Fix mm_cpumask trimming race vs kthread_use_mm
  2020-08-28 10:00 [PATCH 0/4] more mm switching vs TLB shootdown and lazy tlb Nicholas Piggin
                   ` (2 preceding siblings ...)
  2020-08-28 10:00 ` [PATCH 3/4] sparc64: remove mm_cpumask clearing to fix kthread_use_mm race Nicholas Piggin
@ 2020-08-28 10:00 ` Nicholas Piggin
  2020-09-01 12:00   ` Michael Ellerman
  3 siblings, 1 reply; 9+ messages in thread
From: Nicholas Piggin @ 2020-08-28 10:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Nicholas Piggin, linux-arch, linux-kernel, linuxppc-dev,
	Aneesh Kumar K.V, Andrew Morton, Jens Axboe, Peter Zijlstra,
	David S. Miller

Commit 0cef77c7798a7 ("powerpc/64s/radix: flush remote CPUs out of
single-threaded mm_cpumask") added a mechanism to trim the mm_cpumask of
a process under certain conditions. One of the assumptions is that
mm_users would not be incremented via a reference outside the process
context with mmget_not_zero() then go on to kthread_use_mm() via that
reference.

That invariant was broken by io_uring code (see previous sparc64 fix),
but I'll point Fixes: to the original powerpc commit because we are
changing that assumption going forward, so this will make backports
match up.

Fix this by no longer relying on that assumption, but by having each CPU
check that the mm is not being used, and clear its own bit from the mask
if so. This fix relies on commit 38cf307c1f20 ("mm: fix
kthread_use_mm() vs TLB invalidate") to disable irqs over the mm switch,
and ARCH_WANT_IRQS_OFF_ACTIVATE_MM to be enabled.
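
Roughly why that is sufficient (illustrative):

    flushing CPU                         remote CPU
    ------------                         ----------
    sees mm_users == 1, trims            kthread_use_mm(mm)
    IPIs mm_cpumask(mm) ------------->     local_irq_disable()
                                           current->mm = mm
                                           switch_mm_irqs_off(...)
                                           local_irq_enable()

The IPI cannot interrupt the irqs-off region, so on each CPU the
handler runs either entirely before it (the CPU flushes and clears its
own bit; switching the mm in afterwards sets the bit again) or entirely
after it (the handler sees current->mm == mm, only flushes, and keeps
the bit).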

Fixes: 0cef77c7798a7 ("powerpc/64s/radix: flush remote CPUs out of single-threaded mm_cpumask")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/include/asm/tlb.h       | 13 -------------
 arch/powerpc/mm/book3s64/radix_tlb.c | 23 ++++++++++++++++-------
 2 files changed, 16 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
index fbc6f3002f23..d97f061fecac 100644
--- a/arch/powerpc/include/asm/tlb.h
+++ b/arch/powerpc/include/asm/tlb.h
@@ -66,19 +66,6 @@ static inline int mm_is_thread_local(struct mm_struct *mm)
 		return false;
 	return cpumask_test_cpu(smp_processor_id(), mm_cpumask(mm));
 }
-static inline void mm_reset_thread_local(struct mm_struct *mm)
-{
-	WARN_ON(atomic_read(&mm->context.copros) > 0);
-	/*
-	 * It's possible for mm_access to take a reference on mm_users to
-	 * access the remote mm from another thread, but it's not allowed
-	 * to set mm_cpumask, so mm_users may be > 1 here.
-	 */
-	WARN_ON(current->mm != mm);
-	atomic_set(&mm->context.active_cpus, 1);
-	cpumask_clear(mm_cpumask(mm));
-	cpumask_set_cpu(smp_processor_id(), mm_cpumask(mm));
-}
 #else /* CONFIG_PPC_BOOK3S_64 */
 static inline int mm_is_thread_local(struct mm_struct *mm)
 {
diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
index 0d233763441f..a421a0e3f930 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -645,19 +645,29 @@ static void do_exit_flush_lazy_tlb(void *arg)
 	struct mm_struct *mm = arg;
 	unsigned long pid = mm->context.id;
 
+	/*
+	 * A kthread could have done a mmget_not_zero() after the flushing CPU
+	 * checked mm_users == 1, and be in the process of kthread_use_mm when
+	 * interrupted here. In that case, current->mm will be set to mm,
+	 * because kthread_use_mm() setting ->mm and switching to the mm is
+	 * done with interrupts off.
+	 */
 	if (current->mm == mm)
-		return; /* Local CPU */
+		goto out_flush;
 
 	if (current->active_mm == mm) {
-		/*
-		 * Must be a kernel thread because sender is single-threaded.
-		 */
-		BUG_ON(current->mm);
+		WARN_ON_ONCE(current->mm != NULL);
+		/* Is a kernel thread and is using mm as the lazy tlb */
 		mmgrab(&init_mm);
-		switch_mm(mm, &init_mm, current);
 		current->active_mm = &init_mm;
+		switch_mm_irqs_off(mm, &init_mm, current);
 		mmdrop(mm);
 	}
+
+	atomic_dec(&mm->context.active_cpus);
+	cpumask_clear_cpu(smp_processor_id(), mm_cpumask(mm));
+
+out_flush:
 	_tlbiel_pid(pid, RIC_FLUSH_ALL);
 }
 
@@ -672,7 +682,6 @@ static void exit_flush_lazy_tlbs(struct mm_struct *mm)
 	 */
 	smp_call_function_many(mm_cpumask(mm), do_exit_flush_lazy_tlb,
 				(void *)mm, 1);
-	mm_reset_thread_local(mm);
 }
 
 void radix__flush_tlb_mm(struct mm_struct *mm)
-- 
2.23.0




* Re: [PATCH 1/4] mm: fix exec activate_mm vs TLB shootdown and lazy tlb switching race
  2020-08-28 10:00 ` [PATCH 1/4] mm: fix exec activate_mm vs TLB shootdown and lazy tlb switching race Nicholas Piggin
@ 2020-08-28 11:15   ` peterz
  2020-08-31  1:25     ` Nicholas Piggin
  0 siblings, 1 reply; 9+ messages in thread
From: peterz @ 2020-08-28 11:15 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: linux-mm, linux-arch, linux-kernel, linuxppc-dev,
	Aneesh Kumar K.V, Andrew Morton, Jens Axboe, David S. Miller

On Fri, Aug 28, 2020 at 08:00:19PM +1000, Nicholas Piggin wrote:

> Closing this race only requires interrupts to be disabled while ->mm
> and ->active_mm are being switched, but the TLB problem requires also
> holding interrupts off over activate_mm. Unfortunately not all archs
> can do that yet, e.g., arm defers the switch if irqs are disabled and
> expects finish_arch_post_lock_switch() to be called to complete the
> flush; um takes a blocking lock in activate_mm().

ARM at least has activate_mm() := switch_mm(), so it could be made to
work.



* Re: [PATCH 1/4] mm: fix exec activate_mm vs TLB shootdown and lazy tlb switching race
  2020-08-28 11:15   ` peterz
@ 2020-08-31  1:25     ` Nicholas Piggin
  0 siblings, 0 replies; 9+ messages in thread
From: Nicholas Piggin @ 2020-08-31  1:25 UTC (permalink / raw)
  To: peterz
  Cc: Andrew Morton, Aneesh Kumar K.V, Jens Axboe, David S. Miller,
	linux-arch, linux-kernel, linux-mm, linuxppc-dev

Excerpts from peterz@infradead.org's message of August 28, 2020 9:15 pm:
> On Fri, Aug 28, 2020 at 08:00:19PM +1000, Nicholas Piggin wrote:
> 
>> Closing this race only requires interrupts to be disabled while ->mm
>> and ->active_mm are being switched, but the TLB problem requires also
>> holding interrupts off over activate_mm. Unfortunately not all archs
>> can do that yet, e.g., arm defers the switch if irqs are disabled and
>> expects finish_arch_post_lock_switch() to be called to complete the
>> flush; um takes a blocking lock in activate_mm().
> 
> ARM at least has activate_mm() := switch_mm(), so it could be made to
> work.
>

Yeah, so long as that post_lock_switch switch did the right thing with
respect to its TLB flushing. It should do because arm doesn't seem to
check ->mm or ->active_mm (and if it was broken, the scheduler context
switch would be suspect too). I don't think the fix would be hard, just
that I don't have a good way to test it and qemu isn't great for testing
this kind of thing.

um too I think could probably defer that lock until after interrupts are
enabled again. I might throw a bunch of arch conversion patches over the
wall if this gets merged and try to move things along.

Thanks,
Nick



* Re: [PATCH 4/4] powerpc/64s/radix: Fix mm_cpumask trimming race vs kthread_use_mm
  2020-08-28 10:00 ` [PATCH 4/4] powerpc/64s/radix: Fix mm_cpumask trimming race vs kthread_use_mm Nicholas Piggin
@ 2020-09-01 12:00   ` Michael Ellerman
  2020-09-02  9:48     ` Nicholas Piggin
  0 siblings, 1 reply; 9+ messages in thread
From: Michael Ellerman @ 2020-09-01 12:00 UTC (permalink / raw)
  To: Nicholas Piggin, linux-mm
  Cc: Nicholas Piggin, linux-arch, linux-kernel, linuxppc-dev,
	Aneesh Kumar K.V, Andrew Morton, Jens Axboe, Peter Zijlstra,
	David S. Miller

Nicholas Piggin <npiggin@gmail.com> writes:
> Commit 0cef77c7798a7 ("powerpc/64s/radix: flush remote CPUs out of
> single-threaded mm_cpumask") added a mechanism to trim the mm_cpumask of
> a process under certain conditions. One of the assumptions is that
> mm_users would not be incremented via a reference outside the process
> context with mmget_not_zero() then go on to kthread_use_mm() via that
> reference.
>
> That invariant was broken by io_uring code (see previous sparc64 fix),
> but I'll point Fixes: to the original powerpc commit because we are
> changing that assumption going forward, so this will make backports
> match up.
>
> Fix this by no longer relying on that assumption, but by having each CPU
> check that the mm is not being used, and clear its own bit from the mask
> if so. This fix relies on commit 38cf307c1f20 ("mm: fix
> kthread_use_mm() vs TLB invalidate") to disable irqs over the mm switch,
> and ARCH_WANT_IRQS_OFF_ACTIVATE_MM to be enabled.

You could use:

Depends-on: 38cf307c1f20 ("mm: fix kthread_use_mm() vs TLB invalidate")

> Fixes: 0cef77c7798a7 ("powerpc/64s/radix: flush remote CPUs out of single-threaded mm_cpumask")
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> ---
>  arch/powerpc/include/asm/tlb.h       | 13 -------------
>  arch/powerpc/mm/book3s64/radix_tlb.c | 23 ++++++++++++++++-------
>  2 files changed, 16 insertions(+), 20 deletions(-)

One minor nit below if you're respinning anyway.

You know this stuff better than me, but I still reviewed it and it seems
good to me.

Reviewed-by: Michael Ellerman <mpe@ellerman.id.au>

> diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
> index fbc6f3002f23..d97f061fecac 100644
> --- a/arch/powerpc/include/asm/tlb.h
> +++ b/arch/powerpc/include/asm/tlb.h
> @@ -66,19 +66,6 @@ static inline int mm_is_thread_local(struct mm_struct *mm)
>  		return false;
>  	return cpumask_test_cpu(smp_processor_id(), mm_cpumask(mm));
>  }
> -static inline void mm_reset_thread_local(struct mm_struct *mm)
> -{
> -	WARN_ON(atomic_read(&mm->context.copros) > 0);
> -	/*
> -	 * It's possible for mm_access to take a reference on mm_users to
> -	 * access the remote mm from another thread, but it's not allowed
> -	 * to set mm_cpumask, so mm_users may be > 1 here.
> -	 */
> -	WARN_ON(current->mm != mm);
> -	atomic_set(&mm->context.active_cpus, 1);
> -	cpumask_clear(mm_cpumask(mm));
> -	cpumask_set_cpu(smp_processor_id(), mm_cpumask(mm));
> -}
>  #else /* CONFIG_PPC_BOOK3S_64 */
>  static inline int mm_is_thread_local(struct mm_struct *mm)
>  {
> diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
> index 0d233763441f..a421a0e3f930 100644
> --- a/arch/powerpc/mm/book3s64/radix_tlb.c
> +++ b/arch/powerpc/mm/book3s64/radix_tlb.c
> @@ -645,19 +645,29 @@ static void do_exit_flush_lazy_tlb(void *arg)
>  	struct mm_struct *mm = arg;
>  	unsigned long pid = mm->context.id;
>  
> +	/*
> +	 * A kthread could have done a mmget_not_zero() after the flushing CPU
> +	 * checked mm_users == 1, and be in the process of kthread_use_mm when
                                ^
                                in mm_is_singlethreaded()

Adding that reference would help join the dots for a new reader I think.

cheers

> +	 * interrupted here. In that case, current->mm will be set to mm,
> +	 * because kthread_use_mm() setting ->mm and switching to the mm is
> +	 * done with interrupts off.
> +	 */
>  	if (current->mm == mm)
> -		return; /* Local CPU */
> +		goto out_flush;
>  
>  	if (current->active_mm == mm) {
> -		/*
> -		 * Must be a kernel thread because sender is single-threaded.
> -		 */
> -		BUG_ON(current->mm);
> +		WARN_ON_ONCE(current->mm != NULL);
> +		/* Is a kernel thread and is using mm as the lazy tlb */
>  		mmgrab(&init_mm);
> -		switch_mm(mm, &init_mm, current);
>  		current->active_mm = &init_mm;
> +		switch_mm_irqs_off(mm, &init_mm, current);
>  		mmdrop(mm);
>  	}
> +
> +	atomic_dec(&mm->context.active_cpus);
> +	cpumask_clear_cpu(smp_processor_id(), mm_cpumask(mm));
> +
> +out_flush:
>  	_tlbiel_pid(pid, RIC_FLUSH_ALL);
>  }
>  
> @@ -672,7 +682,6 @@ static void exit_flush_lazy_tlbs(struct mm_struct *mm)
>  	 */
>  	smp_call_function_many(mm_cpumask(mm), do_exit_flush_lazy_tlb,
>  				(void *)mm, 1);
> -	mm_reset_thread_local(mm);
>  }
>  
>  void radix__flush_tlb_mm(struct mm_struct *mm)



* Re: [PATCH 4/4] powerpc/64s/radix: Fix mm_cpumask trimming race vs kthread_use_mm
  2020-09-01 12:00   ` Michael Ellerman
@ 2020-09-02  9:48     ` Nicholas Piggin
  0 siblings, 0 replies; 9+ messages in thread
From: Nicholas Piggin @ 2020-09-02  9:48 UTC (permalink / raw)
  To: linux-mm, Michael Ellerman
  Cc: Andrew Morton, Aneesh Kumar K.V, Jens Axboe, David S. Miller,
	linux-arch, linux-kernel, linuxppc-dev, Peter Zijlstra

Excerpts from Michael Ellerman's message of September 1, 2020 10:00 pm:
> Nicholas Piggin <npiggin@gmail.com> writes:
>> Commit 0cef77c7798a7 ("powerpc/64s/radix: flush remote CPUs out of
>> single-threaded mm_cpumask") added a mechanism to trim the mm_cpumask of
>> a process under certain conditions. One of the assumptions is that
>> mm_users would not be incremented via a reference outside the process
>> context with mmget_not_zero() then go on to kthread_use_mm() via that
>> reference.
>>
>> That invariant was broken by io_uring code (see previous sparc64 fix),
>> but I'll point Fixes: to the original powerpc commit because we are
>> changing that assumption going forward, so this will make backports
>> match up.
>>
>> Fix this by no longer relying on that assumption, but by having each CPU
>> check that the mm is not being used, and clear its own bit from the mask
>> if so. This fix relies on commit 38cf307c1f20 ("mm: fix
>> kthread_use_mm() vs TLB invalidate") to disable irqs over the mm switch,
>> and ARCH_WANT_IRQS_OFF_ACTIVATE_MM to be enabled.
> 
> You could use:
> 
> Depends-on: 38cf307c1f20 ("mm: fix kthread_use_mm() vs TLB invalidate")

Good idea, I will.

>> Fixes: 0cef77c7798a7 ("powerpc/64s/radix: flush remote CPUs out of single-threaded mm_cpumask")
>> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
>> ---
>>  arch/powerpc/include/asm/tlb.h       | 13 -------------
>>  arch/powerpc/mm/book3s64/radix_tlb.c | 23 ++++++++++++++++-------
>>  2 files changed, 16 insertions(+), 20 deletions(-)
> 
> One minor nit below if you're respinning anyway.
> 
> You know this stuff better than me, but I still reviewed it and it seems
> good to me.
> 
> Reviewed-by: Michael Ellerman <mpe@ellerman.id.au>

Thanks.

> 
>> diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
>> index fbc6f3002f23..d97f061fecac 100644
>> --- a/arch/powerpc/include/asm/tlb.h
>> +++ b/arch/powerpc/include/asm/tlb.h
>> @@ -66,19 +66,6 @@ static inline int mm_is_thread_local(struct mm_struct *mm)
>>  		return false;
>>  	return cpumask_test_cpu(smp_processor_id(), mm_cpumask(mm));
>>  }
>> -static inline void mm_reset_thread_local(struct mm_struct *mm)
>> -{
>> -	WARN_ON(atomic_read(&mm->context.copros) > 0);
>> -	/*
>> -	 * It's possible for mm_access to take a reference on mm_users to
>> -	 * access the remote mm from another thread, but it's not allowed
>> -	 * to set mm_cpumask, so mm_users may be > 1 here.
>> -	 */
>> -	WARN_ON(current->mm != mm);
>> -	atomic_set(&mm->context.active_cpus, 1);
>> -	cpumask_clear(mm_cpumask(mm));
>> -	cpumask_set_cpu(smp_processor_id(), mm_cpumask(mm));
>> -}
>>  #else /* CONFIG_PPC_BOOK3S_64 */
>>  static inline int mm_is_thread_local(struct mm_struct *mm)
>>  {
>> diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
>> index 0d233763441f..a421a0e3f930 100644
>> --- a/arch/powerpc/mm/book3s64/radix_tlb.c
>> +++ b/arch/powerpc/mm/book3s64/radix_tlb.c
>> @@ -645,19 +645,29 @@ static void do_exit_flush_lazy_tlb(void *arg)
>>  	struct mm_struct *mm = arg;
>>  	unsigned long pid = mm->context.id;
>>  
>> +	/*
>> +	 * A kthread could have done a mmget_not_zero() after the flushing CPU
>> +	 * checked mm_users == 1, and be in the process of kthread_use_mm when
>                                 ^
>                                 in mm_is_singlethreaded()
> 
> Adding that reference would help join the dots for a new reader I think.

Yes you're right I can change that.

Thanks,
Nick


