* [PATCH 0/3] Membarrier updates
@ 2020-10-20 13:47 Mathieu Desnoyers
  2020-10-20 13:47 ` [PATCH 1/3] sched: fix exit_mm vs membarrier (v4) Mathieu Desnoyers
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Mathieu Desnoyers @ 2020-10-20 13:47 UTC (permalink / raw)
  To: Peter Zijlstra, Boqun Feng; +Cc: linux-kernel, Mathieu Desnoyers

Please find the following membarrier updates series posted for inclusion
upstream.

Thanks,

Mathieu

Mathieu Desnoyers (3):
  sched: fix exit_mm vs membarrier (v4)
  sched: membarrier: cover kthread_use_mm (v4)
  sched: membarrier: document memory ordering scenarios

 include/linux/sched/mm.h  |   5 ++
 kernel/exit.c             |  16 ++++-
 kernel/kthread.c          |  21 ++++++
 kernel/sched/idle.c       |   1 +
 kernel/sched/membarrier.c | 147 ++++++++++++++++++++++++++++++++++++--
 5 files changed, 185 insertions(+), 5 deletions(-)

-- 
2.17.1



* [PATCH 1/3] sched: fix exit_mm vs membarrier (v4)
  2020-10-20 13:47 [PATCH 0/3] Membarrier updates Mathieu Desnoyers
@ 2020-10-20 13:47 ` Mathieu Desnoyers
  2020-10-20 14:36   ` Peter Zijlstra
  2020-10-29 10:51   ` [tip: sched/core] " tip-bot2 for Mathieu Desnoyers
  2020-10-20 13:47 ` [PATCH 2/3] sched: membarrier: cover kthread_use_mm (v4) Mathieu Desnoyers
  2020-10-20 13:47 ` [PATCH 3/3] sched: membarrier: document memory ordering scenarios Mathieu Desnoyers
  2 siblings, 2 replies; 11+ messages in thread
From: Mathieu Desnoyers @ 2020-10-20 13:47 UTC (permalink / raw)
  To: Peter Zijlstra, Boqun Feng
  Cc: linux-kernel, Mathieu Desnoyers, Will Deacon, Paul E. McKenney,
	Nicholas Piggin, Andy Lutomirski, Thomas Gleixner,
	Linus Torvalds, Alan Stern, linux-mm

exit_mm should issue memory barriers after user-space memory accesses
and before clearing current->mm, so that user-space memory accesses
performed prior to exit_mm are ordered before the clearing of tsk->mm,
which has the effect of skipping the membarrier private expedited IPIs.

exit_mm should also update the runqueue's membarrier_state so
membarrier global expedited IPIs are not sent when they are not
needed.

The membarrier system call can be issued concurrently with do_exit
if we have thread groups created with CLONE_VM but not CLONE_THREAD.

Here is the scenario I have in mind:

Two thread groups are created, A and B. Thread group B is created by
issuing clone from group A with flag CLONE_VM set, but not CLONE_THREAD.
Let's assume we have a single thread within each thread group (Thread A
and Thread B).

AFAIU, we can have:

Userspace variables:

int x = 0, y = 0;

CPU 0                   CPU 1
Thread A                Thread B
(in thread group A)     (in thread group B)

x = 1
barrier()
y = 1
exit()
exit_mm()
current->mm = NULL;
                        r1 = load y
                        membarrier()
                          skips CPU 0 (no IPI) because its current mm is NULL
                        r2 = load x
                        BUG_ON(r1 == 1 && r2 == 0)
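
For illustration only, here is a minimal userspace sketch of the above
setup (not part of the patch; the membarrier(2) UAPI command names are
real, everything else is made up for the example). It creates the
sibling thread group with CLONE_VM but not CLONE_THREAD, and issues
the private expedited command, which must IPI every CPU running this
mm:

#define _GNU_SOURCE
#include <linux/membarrier.h>
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

static int membarrier(int cmd, unsigned int flags)
{
	return syscall(__NR_membarrier, cmd, flags);
}

static int thread_b(void *arg)
{
	/* Thread B: every CPU running this mm must receive an IPI. */
	if (membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0))
		perror("membarrier");
	return 0;
}

int main(void)
{
	static char stack[64 * 1024];

	/* Registration must precede the expedited command. */
	if (membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0))
		perror("membarrier register");
	/* CLONE_VM without CLONE_THREAD: sibling thread group, shared mm. */
	if (clone(thread_b, stack + sizeof(stack), CLONE_VM | SIGCHLD,
		  NULL) == -1)
		perror("clone");
	/* Thread A's exit() from here races exit_mm() with membarrier(). */
	return 0;
}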

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Will Deacon <will@kernel.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: linux-mm@kvack.org
---
Changes since v1:
- Use smp_mb__after_spinlock rather than smp_mb.
- Document race scenario in commit message.

Changes since v2:
- Introduce membarrier_update_current_mm,
- Use membarrier_update_current_mm to update rq's membarrier_state from
  exit_mm.

Changes since v3:
- Disable interrupts around call to membarrier_update_current_mm, which
  is required to access the runqueue's fields.
---
 include/linux/sched/mm.h  |  5 +++++
 kernel/exit.c             | 16 +++++++++++++++-
 kernel/sched/membarrier.c | 12 ++++++++++++
 3 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index f889e332912f..5dd7f56baaba 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -370,6 +370,8 @@ static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
 
 extern void membarrier_exec_mmap(struct mm_struct *mm);
 
+extern void membarrier_update_current_mm(struct mm_struct *next_mm);
+
 #else
 #ifdef CONFIG_ARCH_HAS_MEMBARRIER_CALLBACKS
 static inline void membarrier_arch_switch_mm(struct mm_struct *prev,
@@ -384,6 +386,9 @@ static inline void membarrier_exec_mmap(struct mm_struct *mm)
 static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
 {
 }
+static inline void membarrier_update_current_mm(struct mm_struct *next_mm)
+{
+}
 #endif
 
 #endif /* _LINUX_SCHED_MM_H */
diff --git a/kernel/exit.c b/kernel/exit.c
index 733e80f334e7..18ca74c07085 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -475,10 +475,24 @@ static void exit_mm(void)
 	BUG_ON(mm != current->active_mm);
 	/* more a memory barrier than a real lock */
 	task_lock(current);
+	/*
+	 * When a thread stops operating on an address space, the loop
+	 * in membarrier_private_expedited() may no longer observe
+	 * tsk->mm, and the loop in membarrier_global_expedited() may
+	 * not observe a MEMBARRIER_STATE_GLOBAL_EXPEDITED
+	 * rq->membarrier_state, so neither would issue an IPI.
+	 * Membarrier requires a memory barrier after accessing
+	 * user-space memory, before clearing tsk->mm or the
+	 * rq->membarrier_state.
+	 */
+	smp_mb__after_spinlock();
+	local_irq_disable();
 	current->mm = NULL;
-	mmap_read_unlock(mm);
+	membarrier_update_current_mm(NULL);
 	enter_lazy_tlb(mm, current);
+	local_irq_enable();
 	task_unlock(current);
+	mmap_read_unlock(mm);
 	mm_update_next_owner(mm);
 	mmput(mm);
 	if (test_thread_flag(TIF_MEMDIE))
diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index 168479a7d61b..8bc8b8a888b7 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -63,6 +63,18 @@ void membarrier_exec_mmap(struct mm_struct *mm)
 	this_cpu_write(runqueues.membarrier_state, 0);
 }
 
+void membarrier_update_current_mm(struct mm_struct *next_mm)
+{
+	struct rq *rq = this_rq();
+	int membarrier_state = 0;
+
+	if (next_mm)
+		membarrier_state = atomic_read(&next_mm->membarrier_state);
+	if (READ_ONCE(rq->membarrier_state) == membarrier_state)
+		return;
+	WRITE_ONCE(rq->membarrier_state, membarrier_state);
+}
+
 static int membarrier_global_expedited(void)
 {
 	int cpu;
-- 
2.17.1



* [PATCH 2/3] sched: membarrier: cover kthread_use_mm (v4)
  2020-10-20 13:47 [PATCH 0/3] Membarrier updates Mathieu Desnoyers
  2020-10-20 13:47 ` [PATCH 1/3] sched: fix exit_mm vs membarrier (v4) Mathieu Desnoyers
@ 2020-10-20 13:47 ` Mathieu Desnoyers
  2020-10-29 10:51   ` [tip: sched/core] " tip-bot2 for Mathieu Desnoyers
  2020-10-20 13:47 ` [PATCH 3/3] sched: membarrier: document memory ordering scenarios Mathieu Desnoyers
  2 siblings, 1 reply; 11+ messages in thread
From: Mathieu Desnoyers @ 2020-10-20 13:47 UTC (permalink / raw)
  To: Peter Zijlstra, Boqun Feng
  Cc: linux-kernel, Mathieu Desnoyers, Will Deacon, Paul E. McKenney,
	Nicholas Piggin, Andy Lutomirski, Andrew Morton

Add comments and memory barrier to kthread_use_mm and kthread_unuse_mm
to allow the effect of membarrier(2) to apply to kthreads accessing
user-space memory as well.

Given that no prior kthread uses this guarantee and that it only affects
kthreads, adding this guarantee does not affect user-space ABI.

Refine the check in membarrier_global_expedited to exclude runqueues
running the idle thread rather than all kthreads from the IPI cpumask.

Now that membarrier_global_expedited can IPI kthreads, the scheduler
also needs to update the runqueue's membarrier_state when entering lazy
TLB state.
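
As a side note, here is a schematic sketch of the kthread pattern the
new guarantee covers (not code from this series; the function and
parameter names are made up):

static int kthread_touch_user(struct mm_struct *mm, void __user *ubuf,
			      void *kbuf, size_t len)
{
	int err;

	kthread_use_mm(mm);	/* barrier after storing tsk->mm */
	err = copy_to_user(ubuf, kbuf, len) ? -EFAULT : 0;
	kthread_unuse_mm(mm);	/* barrier before clearing tsk->mm */
	return err;
}

A concurrent membarrier(2) either observes the kthread's tsk->mm and
sends it an IPI, or the barriers in kthread_{use,unuse}_mm() order the
user-space accesses with respect to the membarrier.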

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Will Deacon <will@kernel.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
Changes since v1:
- Add WARN_ON_ONCE(current->mm) in play_idle_precise (PeterZ),
- Use smp_mb__after_spinlock rather than smp_mb after task_lock.

Changes since v2:
- Update the rq's membarrier state on kthread use/unuse mm,
- The scheduler must use membarrier_switch_mm for the lazy TLB case as well,
  now that global expedited membarrier IPIs kthreads.

Changes since v3:
- Revert to not using membarrier_switch_mm for the lazy TLB case. This
  is made OK by ensuring that the global expedited membarrier only skips
  kthreads which have a NULL mm, so it only skips kthreads which are in
  lazy TLB mode.
---
 kernel/kthread.c          | 21 +++++++++++++++++++++
 kernel/sched/idle.c       |  1 +
 kernel/sched/membarrier.c |  7 +++----
 3 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/kernel/kthread.c b/kernel/kthread.c
index 3edaa380dc7b..a396734d31f3 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -1248,6 +1248,7 @@ void kthread_use_mm(struct mm_struct *mm)
 		tsk->active_mm = mm;
 	}
 	tsk->mm = mm;
+	membarrier_update_current_mm(mm);
 	switch_mm_irqs_off(active_mm, mm, tsk);
 	local_irq_enable();
 	task_unlock(tsk);
@@ -1255,8 +1256,19 @@ void kthread_use_mm(struct mm_struct *mm)
 	finish_arch_post_lock_switch();
 #endif
 
+	/*
+	 * When a kthread starts operating on an address space, the loop
+	 * in membarrier_{private,global}_expedited() may not observe
+	 * the store to tsk->mm and thus not issue an IPI. Membarrier
+	 * requires a memory barrier after storing to tsk->mm, before
+	 * accessing user-space memory. A full memory barrier for membarrier
+	 * {PRIVATE,GLOBAL}_EXPEDITED is implicitly provided by
+	 * mmdrop(), or explicitly with smp_mb().
+	 */
 	if (active_mm != mm)
 		mmdrop(active_mm);
+	else
+		smp_mb();
 
 	to_kthread(tsk)->oldfs = force_uaccess_begin();
 }
@@ -1276,9 +1288,18 @@ void kthread_unuse_mm(struct mm_struct *mm)
 	force_uaccess_end(to_kthread(tsk)->oldfs);
 
 	task_lock(tsk);
+	/*
+	 * When a kthread stops operating on an address space, the loop
+	 * in membarrier_{private,global}_expedited() may no longer
+	 * observe tsk->mm and thus not issue an IPI. Membarrier
+	 * requires a memory barrier after accessing user-space
+	 * memory, before clearing tsk->mm.
+	 */
+	smp_mb__after_spinlock();
 	sync_mm_rss(mm);
 	local_irq_disable();
 	tsk->mm = NULL;
+	membarrier_update_current_mm(NULL);
 	/* active_mm is still 'mm' */
 	enter_lazy_tlb(mm, tsk);
 	local_irq_enable();
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index f324dc36fc43..f0d81a5ea471 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -338,6 +338,7 @@ void play_idle_precise(u64 duration_ns, u64 latency_ns)
 	WARN_ON_ONCE(!(current->flags & PF_KTHREAD));
 	WARN_ON_ONCE(!(current->flags & PF_NO_SETAFFINITY));
 	WARN_ON_ONCE(!duration_ns);
+	WARN_ON_ONCE(current->mm);
 
 	rcu_sleep_check();
 	preempt_disable();
diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index 8bc8b8a888b7..8b93b6844901 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -113,12 +113,11 @@ static int membarrier_global_expedited(void)
 			continue;
 
 		/*
-		 * Skip the CPU if it runs a kernel thread. The scheduler
-		 * leaves the prior task mm in place as an optimization when
-		 * scheduling a kthread.
+		 * Skip the CPU if it runs a kernel thread which is not using
+		 * a task mm.
 		 */
 		p = rcu_dereference(cpu_rq(cpu)->curr);
-		if (p->flags & PF_KTHREAD)
+		if (!p->mm)
 			continue;
 
 		__cpumask_set_cpu(cpu, tmpmask);
-- 
2.17.1



* [PATCH 3/3] sched: membarrier: document memory ordering scenarios
  2020-10-20 13:47 [PATCH 0/3] Membarrier updates Mathieu Desnoyers
  2020-10-20 13:47 ` [PATCH 1/3] sched: fix exit_mm vs membarrier (v4) Mathieu Desnoyers
  2020-10-20 13:47 ` [PATCH 2/3] sched: membarrier: cover kthread_use_mm (v4) Mathieu Desnoyers
@ 2020-10-20 13:47 ` Mathieu Desnoyers
  2020-10-29 10:51   ` [tip: sched/core] " tip-bot2 for Mathieu Desnoyers
  2 siblings, 1 reply; 11+ messages in thread
From: Mathieu Desnoyers @ 2020-10-20 13:47 UTC (permalink / raw)
  To: Peter Zijlstra, Boqun Feng
  Cc: linux-kernel, Mathieu Desnoyers, Will Deacon, Paul E. McKenney,
	Nicholas Piggin, Andy Lutomirski, Andrew Morton, Alan Stern

Document membarrier ordering scenarios in membarrier.c. Thanks to Alan
Stern for refreshing my memory. Now that I have those in mind, it seems
appropriate to serialize them to comments for posterity.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Will Deacon <will@kernel.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
---
 kernel/sched/membarrier.c | 128 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 128 insertions(+)

diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index 8b93b6844901..943bdf5e9108 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -6,6 +6,134 @@
  */
 #include "sched.h"
 
+/*
+ * For documentation purposes, here are some membarrier ordering
+ * scenarios to keep in mind:
+ *
+ * A) Userspace thread execution after IPI vs membarrier's memory
+ *    barrier before sending the IPI
+ *
+ * Userspace variables:
+ *
+ * int x = 0, y = 0;
+ *
+ * The memory barrier at the start of membarrier() on CPU0 is necessary in
+ * order to enforce the guarantee that any writes occurring on CPU0 before
+ * the membarrier() is executed will be visible to any code executing on
+ * CPU1 after the IPI-induced memory barrier:
+ *
+ *         CPU0                              CPU1
+ *
+ *         x = 1
+ *         membarrier():
+ *           a: smp_mb()
+ *           b: send IPI                       IPI-induced mb
+ *           c: smp_mb()
+ *         r2 = y
+ *                                           y = 1
+ *                                           barrier()
+ *                                           r1 = x
+ *
+ *                     BUG_ON(r1 == 0 && r2 == 0)
+ *
+ * The write to y and load from x by CPU1 are unordered by the hardware,
+ * so it's possible to have "r1 = x" reordered before "y = 1" at any
+ * point after (b).  If the memory barrier at (a) is omitted, then "x = 1"
+ * can be reordered after (a) (although not after (c)), so we get r1 == 0
+ * and r2 == 0.  This violates the guarantee that membarrier() is
+ * supposed to provide.
+ *
+ * The timing of the memory barrier at (a) has to ensure that it executes
+ * before the IPI-induced memory barrier on CPU1.
+ *
+ * B) Userspace thread execution before IPI vs membarrier's memory
+ *    barrier after completing the IPI
+ *
+ * Userspace variables:
+ *
+ * int x = 0, y = 0;
+ *
+ * The memory barrier at the end of membarrier() on CPU0 is necessary in
+ * order to enforce the guarantee that any writes occurring on CPU1 before
+ * the membarrier() is executed will be visible to any code executing on
+ * CPU0 after the membarrier():
+ *
+ *         CPU0                              CPU1
+ *
+ *                                           x = 1
+ *                                           barrier()
+ *                                           y = 1
+ *         r2 = y
+ *         membarrier():
+ *           a: smp_mb()
+ *           b: send IPI                       IPI-induced mb
+ *           c: smp_mb()
+ *         r1 = x
+ *         BUG_ON(r1 == 0 && r2 == 1)
+ *
+ * The writes to x and y are unordered by the hardware, so it's possible to
+ * have "r2 = 1" even though the write to x doesn't execute until (b).  If
+ * the memory barrier at (c) is omitted then "r1 = x" can be reordered
+ * before (b) (although not before (a)), so we get "r1 = 0".  This violates
+ * the guarantee that membarrier() is supposed to provide.
+ *
+ * The timing of the memory barrier at (c) has to ensure that it executes
+ * after the IPI-induced memory barrier on CPU1.
+ *
+ * C) Scheduling userspace thread -> kthread -> userspace thread vs membarrier
+ *
+ *           CPU0                            CPU1
+ *
+ *           membarrier():
+ *           a: smp_mb()
+ *                                           d: switch to kthread (includes mb)
+ *           b: read rq->curr->mm == NULL
+ *                                           e: switch to user (includes mb)
+ *           c: smp_mb()
+ *
+ * Using the scenario from (A), we can show that (a) needs to be paired
+ * with (e). Using the scenario from (B), we can show that (c) needs to
+ * be paired with (d).
+ *
+ * D) exit_mm vs membarrier
+ *
+ * Two thread groups are created, A and B.  Thread group B is created by
+ * issuing clone from group A with flag CLONE_VM set, but not CLONE_THREAD.
+ * Let's assume we have a single thread within each thread group (Thread A
+ * and Thread B).  Thread A runs on CPU0, Thread B runs on CPU1.
+ *
+ *           CPU0                            CPU1
+ *
+ *           membarrier():
+ *             a: smp_mb()
+ *                                           exit_mm():
+ *                                             d: smp_mb()
+ *                                             e: current->mm = NULL
+ *             b: read rq->curr->mm == NULL
+ *             c: smp_mb()
+ *
+ * Using scenario (B), we can show that (c) needs to be paired with (d).
+ *
+ * E) kthread_{use,unuse}_mm vs membarrier
+ *
+ *           CPU0                            CPU1
+ *
+ *           membarrier():
+ *           a: smp_mb()
+ *                                           kthread_unuse_mm()
+ *                                             d: smp_mb()
+ *                                             e: current->mm = NULL
+ *           b: read rq->curr->mm == NULL
+ *                                           kthread_use_mm()
+ *                                             f: current->mm = mm
+ *                                             g: smp_mb()
+ *           c: smp_mb()
+ *
+ * Using the scenario from (A), we can show that (a) needs to be paired
+ * with (g). Using the scenario from (B), we can show that (c) needs to
+ * be paired with (d).
+ */
+
 /*
  * Bitmask made from a "or" of all commands within enum membarrier_cmd,
  * except MEMBARRIER_CMD_QUERY.
-- 
2.17.1



* Re: [PATCH 1/3] sched: fix exit_mm vs membarrier (v4)
  2020-10-20 13:47 ` [PATCH 1/3] sched: fix exit_mm vs membarrier (v4) Mathieu Desnoyers
@ 2020-10-20 14:36   ` Peter Zijlstra
  2020-10-20 14:59       ` Mathieu Desnoyers
  2020-10-29 10:51   ` [tip: sched/core] " tip-bot2 for Mathieu Desnoyers
  1 sibling, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2020-10-20 14:36 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Boqun Feng, linux-kernel, Will Deacon, Paul E. McKenney,
	Nicholas Piggin, Andy Lutomirski, Thomas Gleixner,
	Linus Torvalds, Alan Stern, linux-mm

On Tue, Oct 20, 2020 at 09:47:13AM -0400, Mathieu Desnoyers wrote:
> +void membarrier_update_current_mm(struct mm_struct *next_mm)
> +{
> +	struct rq *rq = this_rq();
> +	int membarrier_state = 0;
> +
> +	if (next_mm)
> +		membarrier_state = atomic_read(&next_mm->membarrier_state);
> +	if (READ_ONCE(rq->membarrier_state) == membarrier_state)
> +		return;
> +	WRITE_ONCE(rq->membarrier_state, membarrier_state);
> +}

This is suspiciously similar to membarrier_switch_mm().

Would something like so make sense?

---
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -206,14 +206,7 @@ void membarrier_exec_mmap(struct mm_stru
 
 void membarrier_update_current_mm(struct mm_struct *next_mm)
 {
-	struct rq *rq = this_rq();
-	int membarrier_state = 0;
-
-	if (next_mm)
-		membarrier_state = atomic_read(&next_mm->membarrier_state);
-	if (READ_ONCE(rq->membarrier_state) == membarrier_state)
-		return;
-	WRITE_ONCE(rq->membarrier_state, membarrier_state);
+	membarrier_switch_mm(this_rq(), NULL, next_mm);
 }
 
 static int membarrier_global_expedited(void)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d2621155393c..3d589c2ffd28 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2645,12 +2645,14 @@ static inline void membarrier_switch_mm(struct rq *rq,
 					struct mm_struct *prev_mm,
 					struct mm_struct *next_mm)
 {
-	int membarrier_state;
+	int membarrier_state = 0;
 
 	if (prev_mm == next_mm)
 		return;
 
-	membarrier_state = atomic_read(&next_mm->membarrier_state);
+	if (next_mm)
+		membarrier_state = atomic_read(&next_mm->membarrier_state);
+
 	if (READ_ONCE(rq->membarrier_state) == membarrier_state)
 		return;
 


* Re: [PATCH 1/3] sched: fix exit_mm vs membarrier (v4)
  2020-10-20 14:36   ` Peter Zijlstra
@ 2020-10-20 14:59       ` Mathieu Desnoyers
  0 siblings, 0 replies; 11+ messages in thread
From: Mathieu Desnoyers @ 2020-10-20 14:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Boqun Feng, linux-kernel, Will Deacon, paulmck, Nicholas Piggin,
	Andy Lutomirski, Thomas Gleixner, Linus Torvalds, Alan Stern,
	linux-mm

----- On Oct 20, 2020, at 10:36 AM, Peter Zijlstra peterz@infradead.org wrote:

> On Tue, Oct 20, 2020 at 09:47:13AM -0400, Mathieu Desnoyers wrote:
>> +void membarrier_update_current_mm(struct mm_struct *next_mm)
>> +{
>> +	struct rq *rq = this_rq();
>> +	int membarrier_state = 0;
>> +
>> +	if (next_mm)
>> +		membarrier_state = atomic_read(&next_mm->membarrier_state);
>> +	if (READ_ONCE(rq->membarrier_state) == membarrier_state)
>> +		return;
>> +	WRITE_ONCE(rq->membarrier_state, membarrier_state);
>> +}
> 
> This is suspiciously similar to membarrier_switch_mm().
> 
> Would something like so make sense?

Very much yes. Do you want me to re-send the series, or do you
want to fold this in as you merge it?

Thanks,

Mathieu

> 
> ---
> --- a/kernel/sched/membarrier.c
> +++ b/kernel/sched/membarrier.c
> @@ -206,14 +206,7 @@ void membarrier_exec_mmap(struct mm_stru
> 
> void membarrier_update_current_mm(struct mm_struct *next_mm)
> {
> -	struct rq *rq = this_rq();
> -	int membarrier_state = 0;
> -
> -	if (next_mm)
> -		membarrier_state = atomic_read(&next_mm->membarrier_state);
> -	if (READ_ONCE(rq->membarrier_state) == membarrier_state)
> -		return;
> -	WRITE_ONCE(rq->membarrier_state, membarrier_state);
> +	membarrier_switch_mm(this_rq(), NULL, next_mm);
> }
> 
> static int membarrier_global_expedited(void)
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index d2621155393c..3d589c2ffd28 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2645,12 +2645,14 @@ static inline void membarrier_switch_mm(struct rq *rq,
> 					struct mm_struct *prev_mm,
> 					struct mm_struct *next_mm)
> {
> -	int membarrier_state;
> +	int membarrier_state = 0;
> 
> 	if (prev_mm == next_mm)
> 		return;
> 
> -	membarrier_state = atomic_read(&next_mm->membarrier_state);
> +	if (next_mm)
> +		membarrier_state = atomic_read(&next_mm->membarrier_state);
> +
> 	if (READ_ONCE(rq->membarrier_state) == membarrier_state)
>  		return;

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: [PATCH 1/3] sched: fix exit_mm vs membarrier (v4)
  2020-10-20 14:59       ` Mathieu Desnoyers
@ 2020-10-22  6:51       ` Boqun Feng
  0 siblings, 0 replies; 11+ messages in thread
From: Boqun Feng @ 2020-10-22  6:51 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, linux-kernel, Will Deacon, paulmck,
	Nicholas Piggin, Andy Lutomirski, Thomas Gleixner,
	Linus Torvalds, Alan Stern, linux-mm

Hi,

On Tue, Oct 20, 2020 at 10:59:58AM -0400, Mathieu Desnoyers wrote:
> ----- On Oct 20, 2020, at 10:36 AM, Peter Zijlstra peterz@infradead.org wrote:
> 
> > On Tue, Oct 20, 2020 at 09:47:13AM -0400, Mathieu Desnoyers wrote:
> >> +void membarrier_update_current_mm(struct mm_struct *next_mm)
> >> +{
> >> +	struct rq *rq = this_rq();
> >> +	int membarrier_state = 0;
> >> +
> >> +	if (next_mm)
> >> +		membarrier_state = atomic_read(&next_mm->membarrier_state);
> >> +	if (READ_ONCE(rq->membarrier_state) == membarrier_state)
> >> +		return;
> >> +	WRITE_ONCE(rq->membarrier_state, membarrier_state);
> >> +}
> > 
> > This is suspiciously similar to membarrier_switch_mm().
> > 
> > Would something like so make sense?
> 
> Very much yes. Do you want me to re-send the series, or do you
> want to fold this in as you merge it?
> 
> Thanks,
> 
> Mathieu
> 
> > 
> > ---
> > --- a/kernel/sched/membarrier.c
> > +++ b/kernel/sched/membarrier.c
> > @@ -206,14 +206,7 @@ void membarrier_exec_mmap(struct mm_stru
> > 
> > void membarrier_update_current_mm(struct mm_struct *next_mm)
> > {
> > -	struct rq *rq = this_rq();
> > -	int membarrier_state = 0;
> > -
> > -	if (next_mm)
> > -		membarrier_state = atomic_read(&next_mm->membarrier_state);
> > -	if (READ_ONCE(rq->membarrier_state) == membarrier_state)
> > -		return;
> > -	WRITE_ONCE(rq->membarrier_state, membarrier_state);
> > +	membarrier_switch_mm(this_rq(), NULL, next_mm);
> > }
> > 
> > static int membarrier_global_expedited(void)
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index d2621155393c..3d589c2ffd28 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -2645,12 +2645,14 @@ static inline void membarrier_switch_mm(struct rq *rq,
> > 					struct mm_struct *prev_mm,
> > 					struct mm_struct *next_mm)
> > {
> > -	int membarrier_state;
> > +	int membarrier_state = 0;
> > 
> > 	if (prev_mm == next_mm)

Unless I'm missing something subtle, in exit_mm(),
membarrier_update_current_mm() is called with @next_mm == NULL, and
inside membarrier_update_current_mm(), membarrier_switch_mm() is called
with @prev_mm == NULL. As a result, the branch above is taken, so
membarrier_update_current_mm() becomes a nop. I think we should use the
previous value of current->mm as the @prev_mm, something like below
maybe?

void update_current_mm(struct mm_struct *next_mm)
{
	struct mm_struct *prev_mm;
	unsigned long flags;

	local_irq_save(flags);
	prev_mm = current->mm;
	current->mm = next_mm;
	membarrier_switch_mm(this_rq(), prev_mm, next_mm);
	local_irq_restore(flags);
}

, and replace all stores to "current->mm" in the kernel with
update_current_mm().
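
For illustration (sketch only, glossing over the irq-off section the
current patch keeps around enter_lazy_tlb()), the exit_mm() hunk would
then reduce to:

	task_lock(current);
	smp_mb__after_spinlock();
	update_current_mm(NULL);	/* clears current->mm + rq state */
	enter_lazy_tlb(mm, current);
	task_unlock(current);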

Thoughts?

Regards,
Boqun

> > 		return;
> > 
> > -	membarrier_state = atomic_read(&next_mm->membarrier_state);
> > +	if (next_mm)
> > +		membarrier_state = atomic_read(&next_mm->membarrier_state);
> > +
> > 	if (READ_ONCE(rq->membarrier_state) == membarrier_state)
> >  		return;
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com


* [tip: sched/core] sched: membarrier: cover kthread_use_mm (v4)
  2020-10-20 13:47 ` [PATCH 2/3] sched: membarrier: cover kthread_use_mm (v4) Mathieu Desnoyers
@ 2020-10-29 10:51   ` tip-bot2 for Mathieu Desnoyers
  0 siblings, 0 replies; 11+ messages in thread
From: tip-bot2 for Mathieu Desnoyers @ 2020-10-29 10:51 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Mathieu Desnoyers, Peter Zijlstra (Intel), x86, LKML

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     618758ed3a4f7d790414d020b362111748ebbf9f
Gitweb:        https://git.kernel.org/tip/618758ed3a4f7d790414d020b362111748ebbf9f
Author:        Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
AuthorDate:    Tue, 20 Oct 2020 09:47:14 -04:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Thu, 29 Oct 2020 11:00:31 +01:00

sched: membarrier: cover kthread_use_mm (v4)

Add comments and memory barrier to kthread_use_mm and kthread_unuse_mm
to allow the effect of membarrier(2) to apply to kthreads accessing
user-space memory as well.

Given that no prior kthread uses this guarantee and that it only affects
kthreads, adding this guarantee does not affect user-space ABI.

Refine the check in membarrier_global_expedited to exclude runqueues
running the idle thread rather than all kthreads from the IPI cpumask.

Now that membarrier_global_expedited can IPI kthreads, the scheduler
also needs to update the runqueue's membarrier_state when entering lazy
TLB state.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201020134715.13909-3-mathieu.desnoyers@efficios.com
---
 kernel/kthread.c          | 21 +++++++++++++++++++++
 kernel/sched/idle.c       |  1 +
 kernel/sched/membarrier.c |  7 +++----
 3 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/kernel/kthread.c b/kernel/kthread.c
index e29773c..481428f 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -1248,6 +1248,7 @@ void kthread_use_mm(struct mm_struct *mm)
 		tsk->active_mm = mm;
 	}
 	tsk->mm = mm;
+	membarrier_update_current_mm(mm);
 	switch_mm_irqs_off(active_mm, mm, tsk);
 	local_irq_enable();
 	task_unlock(tsk);
@@ -1255,8 +1256,19 @@ void kthread_use_mm(struct mm_struct *mm)
 	finish_arch_post_lock_switch();
 #endif
 
+	/*
+	 * When a kthread starts operating on an address space, the loop
+	 * in membarrier_{private,global}_expedited() may not observe
+	 * the store to tsk->mm and thus not issue an IPI. Membarrier
+	 * requires a memory barrier after storing to tsk->mm, before
+	 * accessing user-space memory. A full memory barrier for membarrier
+	 * {PRIVATE,GLOBAL}_EXPEDITED is implicitly provided by
+	 * mmdrop(), or explicitly with smp_mb().
+	 */
 	if (active_mm != mm)
 		mmdrop(active_mm);
+	else
+		smp_mb();
 
 	to_kthread(tsk)->oldfs = force_uaccess_begin();
 }
@@ -1276,9 +1288,18 @@ void kthread_unuse_mm(struct mm_struct *mm)
 	force_uaccess_end(to_kthread(tsk)->oldfs);
 
 	task_lock(tsk);
+	/*
+	 * When a kthread stops operating on an address space, the loop
+	 * in membarrier_{private,global}_expedited() may no longer
+	 * observe tsk->mm and thus not issue an IPI. Membarrier
+	 * requires a memory barrier after accessing user-space
+	 * memory, before clearing tsk->mm.
+	 */
+	smp_mb__after_spinlock();
 	sync_mm_rss(mm);
 	local_irq_disable();
 	tsk->mm = NULL;
+	membarrier_update_current_mm(NULL);
 	/* active_mm is still 'mm' */
 	enter_lazy_tlb(mm, tsk);
 	local_irq_enable();
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 24d0ee2..846743e 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -338,6 +338,7 @@ void play_idle_precise(u64 duration_ns, u64 latency_ns)
 	WARN_ON_ONCE(!(current->flags & PF_KTHREAD));
 	WARN_ON_ONCE(!(current->flags & PF_NO_SETAFFINITY));
 	WARN_ON_ONCE(!duration_ns);
+	WARN_ON_ONCE(current->mm);
 
 	rcu_sleep_check();
 	preempt_disable();
diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index aac3292..f223f35 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -126,12 +126,11 @@ static int membarrier_global_expedited(void)
 			continue;
 
 		/*
-		 * Skip the CPU if it runs a kernel thread. The scheduler
-		 * leaves the prior task mm in place as an optimization when
-		 * scheduling a kthread.
+		 * Skip the CPU if it runs a kernel thread which is not using
+		 * a task mm.
 		 */
 		p = rcu_dereference(cpu_rq(cpu)->curr);
-		if (p->flags & PF_KTHREAD)
+		if (!p->mm)
 			continue;
 
 		__cpumask_set_cpu(cpu, tmpmask);


* [tip: sched/core] sched: membarrier: document memory ordering scenarios
  2020-10-20 13:47 ` [PATCH 3/3] sched: membarrier: document memory ordering scenarios Mathieu Desnoyers
@ 2020-10-29 10:51   ` tip-bot2 for Mathieu Desnoyers
  0 siblings, 0 replies; 11+ messages in thread
From: tip-bot2 for Mathieu Desnoyers @ 2020-10-29 10:51 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Mathieu Desnoyers, Peter Zijlstra (Intel), x86, LKML

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     25595eb6aaa9fbb31330f1e0b400642694bc6574
Gitweb:        https://git.kernel.org/tip/25595eb6aaa9fbb31330f1e0b400642694bc6574
Author:        Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
AuthorDate:    Tue, 20 Oct 2020 09:47:15 -04:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Thu, 29 Oct 2020 11:00:31 +01:00

sched: membarrier: document memory ordering scenarios

Document membarrier ordering scenarios in membarrier.c. Thanks to Alan
Stern for refreshing my memory. Now that I have those in mind, it seems
appropriate to serialize them to comments for posterity.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201020134715.13909-4-mathieu.desnoyers@efficios.com
---
 kernel/sched/membarrier.c | 128 +++++++++++++++++++++++++++++++++++++-
 1 file changed, 128 insertions(+)

diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index f223f35..5a40b38 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -7,6 +7,134 @@
 #include "sched.h"
 
 /*
+ * For documentation purposes, here are some membarrier ordering
+ * scenarios to keep in mind:
+ *
+ * A) Userspace thread execution after IPI vs membarrier's memory
+ *    barrier before sending the IPI
+ *
+ * Userspace variables:
+ *
+ * int x = 0, y = 0;
+ *
+ * The memory barrier at the start of membarrier() on CPU0 is necessary in
+ * order to enforce the guarantee that any writes occurring on CPU0 before
+ * the membarrier() is executed will be visible to any code executing on
+ * CPU1 after the IPI-induced memory barrier:
+ *
+ *         CPU0                              CPU1
+ *
+ *         x = 1
+ *         membarrier():
+ *           a: smp_mb()
+ *           b: send IPI                       IPI-induced mb
+ *           c: smp_mb()
+ *         r2 = y
+ *                                           y = 1
+ *                                           barrier()
+ *                                           r1 = x
+ *
+ *                     BUG_ON(r1 == 0 && r2 == 0)
+ *
+ * The write to y and load from x by CPU1 are unordered by the hardware,
+ * so it's possible to have "r1 = x" reordered before "y = 1" at any
+ * point after (b).  If the memory barrier at (a) is omitted, then "x = 1"
+ * can be reordered after (a) (although not after (c)), so we get r1 == 0
+ * and r2 == 0.  This violates the guarantee that membarrier() is
+ * supposed to provide.
+ *
+ * The timing of the memory barrier at (a) has to ensure that it executes
+ * before the IPI-induced memory barrier on CPU1.
+ *
+ * B) Userspace thread execution before IPI vs membarrier's memory
+ *    barrier after completing the IPI
+ *
+ * Userspace variables:
+ *
+ * int x = 0, y = 0;
+ *
+ * The memory barrier at the end of membarrier() on CPU0 is necessary in
+ * order to enforce the guarantee that any writes occurring on CPU1 before
+ * the membarrier() is executed will be visible to any code executing on
+ * CPU0 after the membarrier():
+ *
+ *         CPU0                              CPU1
+ *
+ *                                           x = 1
+ *                                           barrier()
+ *                                           y = 1
+ *         r2 = y
+ *         membarrier():
+ *           a: smp_mb()
+ *           b: send IPI                       IPI-induced mb
+ *           c: smp_mb()
+ *         r1 = x
+ *         BUG_ON(r1 == 0 && r2 == 1)
+ *
+ * The writes to x and y are unordered by the hardware, so it's possible to
+ * have "r2 = 1" even though the write to x doesn't execute until (b).  If
+ * the memory barrier at (c) is omitted then "r1 = x" can be reordered
+ * before (b) (although not before (a)), so we get "r1 = 0".  This violates
+ * the guarantee that membarrier() is supposed to provide.
+ *
+ * The timing of the memory barrier at (c) has to ensure that it executes
+ * after the IPI-induced memory barrier on CPU1.
+ *
+ * C) Scheduling userspace thread -> kthread -> userspace thread vs membarrier
+ *
+ *           CPU0                            CPU1
+ *
+ *           membarrier():
+ *           a: smp_mb()
+ *                                           d: switch to kthread (includes mb)
+ *           b: read rq->curr->mm == NULL
+ *                                           e: switch to user (includes mb)
+ *           c: smp_mb()
+ *
+ * Using the scenario from (A), we can show that (a) needs to be paired
+ * with (e). Using the scenario from (B), we can show that (c) needs to
+ * be paired with (d).
+ *
+ * D) exit_mm vs membarrier
+ *
+ * Two thread groups are created, A and B.  Thread group B is created by
+ * issuing clone from group A with flag CLONE_VM set, but not CLONE_THREAD.
+ * Let's assume we have a single thread within each thread group (Thread A
+ * and Thread B).  Thread A runs on CPU0, Thread B runs on CPU1.
+ *
+ *           CPU0                            CPU1
+ *
+ *           membarrier():
+ *             a: smp_mb()
+ *                                           exit_mm():
+ *                                             d: smp_mb()
+ *                                             e: current->mm = NULL
+ *             b: read rq->curr->mm == NULL
+ *             c: smp_mb()
+ *
+ * Using scenario (B), we can show that (c) needs to be paired with (d).
+ *
+ * E) kthread_{use,unuse}_mm vs membarrier
+ *
+ *           CPU0                            CPU1
+ *
+ *           membarrier():
+ *           a: smp_mb()
+ *                                           kthread_unuse_mm()
+ *                                             d: smp_mb()
+ *                                             e: current->mm = NULL
+ *           b: read rq->curr->mm == NULL
+ *                                           kthread_use_mm()
+ *                                             f: current->mm = mm
+ *                                             g: smp_mb()
+ *           c: smp_mb()
+ *
+ * Using the scenario from (A), we can show that (a) needs to be paired
+ * with (g). Using the scenario from (B), we can show that (c) needs to
+ * be paired with (d).
+ */
+
+/*
  * Bitmask made from a "or" of all commands within enum membarrier_cmd,
  * except MEMBARRIER_CMD_QUERY.
  */


* [tip: sched/core] sched: fix exit_mm vs membarrier (v4)
  2020-10-20 13:47 ` [PATCH 1/3] sched: fix exit_mm vs membarrier (v4) Mathieu Desnoyers
  2020-10-20 14:36   ` Peter Zijlstra
@ 2020-10-29 10:51   ` tip-bot2 for Mathieu Desnoyers
  1 sibling, 0 replies; 11+ messages in thread
From: tip-bot2 for Mathieu Desnoyers @ 2020-10-29 10:51 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Mathieu Desnoyers, Peter Zijlstra (Intel), x86, LKML

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     5bc78502322a5e4eef3f1b2a2813751dc6434143
Gitweb:        https://git.kernel.org/tip/5bc78502322a5e4eef3f1b2a2813751dc6434143
Author:        Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
AuthorDate:    Tue, 20 Oct 2020 09:47:13 -04:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Thu, 29 Oct 2020 11:00:30 +01:00

sched: fix exit_mm vs membarrier (v4)

exit_mm should issue memory barriers after user-space memory accesses
and before clearing current->mm, so that user-space memory accesses
performed prior to exit_mm are ordered before the clearing of tsk->mm,
which has the effect of skipping the membarrier private expedited IPIs.

exit_mm should also update the runqueue's membarrier_state so
membarrier global expedited IPIs are not sent when they are not
needed.

The membarrier system call can be issued concurrently with do_exit
if we have thread groups created with CLONE_VM but not CLONE_THREAD.

Here is the scenario I have in mind:

Two thread groups are created, A and B. Thread group B is created by
issuing clone from group A with flag CLONE_VM set, but not CLONE_THREAD.
Let's assume we have a single thread within each thread group (Thread A
and Thread B).

AFAIU, we can have:

Userspace variables:

int x = 0, y = 0;

CPU 0                   CPU 1
Thread A                Thread B
(in thread group A)     (in thread group B)

x = 1
barrier()
y = 1
exit()
exit_mm()
current->mm = NULL;
                        r1 = load y
                        membarrier()
                          skips CPU 0 (no IPI) because its current mm is NULL
                        r2 = load x
                        BUG_ON(r1 == 1 && r2 == 0)

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201020134715.13909-2-mathieu.desnoyers@efficios.com
---
 include/linux/sched/mm.h  |  5 +++++
 kernel/exit.c             | 16 +++++++++++++++-
 kernel/sched/membarrier.c | 12 ++++++++++++
 3 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index d5ece7a..a91fb3a 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -347,6 +347,8 @@ static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
 
 extern void membarrier_exec_mmap(struct mm_struct *mm);
 
+extern void membarrier_update_current_mm(struct mm_struct *next_mm);
+
 #else
 #ifdef CONFIG_ARCH_HAS_MEMBARRIER_CALLBACKS
 static inline void membarrier_arch_switch_mm(struct mm_struct *prev,
@@ -361,6 +363,9 @@ static inline void membarrier_exec_mmap(struct mm_struct *mm)
 static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
 {
 }
+static inline void membarrier_update_current_mm(struct mm_struct *next_mm)
+{
+}
 #endif
 
 #endif /* _LINUX_SCHED_MM_H */
diff --git a/kernel/exit.c b/kernel/exit.c
index 87a2d51..a3dd6b3 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -475,10 +475,24 @@ static void exit_mm(void)
 	BUG_ON(mm != current->active_mm);
 	/* more a memory barrier than a real lock */
 	task_lock(current);
+	/*
+	 * When a thread stops operating on an address space, the loop
+	 * in membarrier_private_expedited() may no longer observe
+	 * tsk->mm, and the loop in membarrier_global_expedited() may
+	 * not observe a MEMBARRIER_STATE_GLOBAL_EXPEDITED
+	 * rq->membarrier_state, so neither would issue an IPI.
+	 * Membarrier requires a memory barrier after accessing
+	 * user-space memory, before clearing tsk->mm or the
+	 * rq->membarrier_state.
+	 */
+	smp_mb__after_spinlock();
+	local_irq_disable();
 	current->mm = NULL;
-	mmap_read_unlock(mm);
+	membarrier_update_current_mm(NULL);
 	enter_lazy_tlb(mm, current);
+	local_irq_enable();
 	task_unlock(current);
+	mmap_read_unlock(mm);
 	mm_update_next_owner(mm);
 	mmput(mm);
 	if (test_thread_flag(TIF_MEMDIE))
diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index e23e74d..aac3292 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -76,6 +76,18 @@ void membarrier_exec_mmap(struct mm_struct *mm)
 	this_cpu_write(runqueues.membarrier_state, 0);
 }
 
+void membarrier_update_current_mm(struct mm_struct *next_mm)
+{
+	struct rq *rq = this_rq();
+	int membarrier_state = 0;
+
+	if (next_mm)
+		membarrier_state = atomic_read(&next_mm->membarrier_state);
+	if (READ_ONCE(rq->membarrier_state) == membarrier_state)
+		return;
+	WRITE_ONCE(rq->membarrier_state, membarrier_state);
+}
+
 static int membarrier_global_expedited(void)
 {
 	int cpu;
