linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH 0/3] sched: membarrier updates
@ 2020-08-14 16:43 Mathieu Desnoyers
  2020-08-14 16:43 ` [RFC PATCH 1/3] sched: fix exit_mm vs membarrier (v2) Mathieu Desnoyers
                   ` (3 more replies)
  0 siblings, 4 replies; 11+ messages in thread
From: Mathieu Desnoyers @ 2020-08-14 16:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Will Deacon, Paul E . McKenney, Andy Lutomirski,
	Andrew Morton, Alan Stern, Nicholas Piggin, Mathieu Desnoyers

Hi,

Please find these three changes, based on our recent discussions about the
interaction between membarrier, exit_mm, and kthread_use_mm. I have also
added documentation of the memory ordering scenarios to membarrier.c.

Thanks,

Mathieu

Mathieu Desnoyers (3):
  sched: fix exit_mm vs membarrier (v2)
  sched: membarrier: cover kthread_use_mm (v2)
  sched: membarrier: document memory ordering scenarios

 kernel/exit.c             |   8 +++
 kernel/kthread.c          |  19 +++++++
 kernel/sched/idle.c       |   1 +
 kernel/sched/membarrier.c | 136 ++++++++++++++++++++++++++++++++++++++++++++--
 4 files changed, 158 insertions(+), 6 deletions(-)

-- 
2.11.0



* [RFC PATCH 1/3] sched: fix exit_mm vs membarrier (v2)
  2020-08-14 16:43 [RFC PATCH 0/3] sched: membarrier updates Mathieu Desnoyers
@ 2020-08-14 16:43 ` Mathieu Desnoyers
  2020-08-16 15:23   ` Boqun Feng
  2020-08-14 16:43 ` [RFC PATCH 2/3] sched: membarrier: cover kthread_use_mm (v2) Mathieu Desnoyers
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 11+ messages in thread
From: Mathieu Desnoyers @ 2020-08-14 16:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Will Deacon, Paul E . McKenney, Andy Lutomirski,
	Andrew Morton, Alan Stern, Nicholas Piggin, Mathieu Desnoyers,
	Thomas Gleixner, Linus Torvalds, linux-mm

exit_mm should issue a memory barrier after its user-space memory accesses
and before clearing current->mm, so that user-space accesses performed
prior to exit_mm are ordered before tsk->mm is cleared. Clearing tsk->mm
is what causes the membarrier private expedited IPIs to be skipped.

The membarrier system call can be issued concurrently with do_exit
if we have thread groups created with CLONE_VM but not CLONE_THREAD.

Here is the scenario I have in mind:

Two thread groups are created, A and B. Thread group B is created by
issuing clone from group A with flag CLONE_VM set, but not CLONE_THREAD.
Let's assume we have a single thread within each thread group (Thread A
and Thread B).

AFAIU we can have:

Userspace variables:

int x = 0, y = 0;

CPU 0                   CPU 1
Thread A                Thread B
(in thread group A)     (in thread group B)

x = 1
barrier()
y = 1
exit()
exit_mm()
current->mm = NULL;
                        r1 = load y
                        membarrier()
                          skips CPU 0 (no IPI) because its current mm is NULL
                        r2 = load x
                        BUG_ON(r1 == 1 && r2 == 0)
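
For completeness, here is a minimal user-space sketch of this scenario. It is
illustrative only (not part of the patch), error handling is omitted, and
whether the racy interleaving shown above actually happens depends on timing:

#define _GNU_SOURCE
#include <linux/membarrier.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

static volatile int x, y;
static char stack_b[64 * 1024];

static int thread_b(void *arg)          /* Thread B, thread group B */
{
	int r1, r2;

	r1 = y;
	/* Expect a barrier on every CPU running our mm, including CPU 0. */
	syscall(__NR_membarrier, MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0);
	r2 = x;
	if (r1 == 1 && r2 == 0)
		abort();                /* the BUG_ON() condition above */
	return 0;
}

int main(void)                          /* Thread A, thread group A */
{
	syscall(__NR_membarrier,
		MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0);
	/* CLONE_VM without CLONE_THREAD: a new thread group sharing the mm. */
	clone(thread_b, stack_b + sizeof(stack_b), CLONE_VM | SIGCHLD, NULL);
	x = 1;
	__asm__ __volatile__("" ::: "memory");  /* barrier() */
	y = 1;
	exit(0);                        /* exit_mm() clears current->mm */
}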

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: linux-mm@kvack.org
---
Changes since v1:
- Use smp_mb__after_spinlock rather than smp_mb.
- Document race scenario in commit message.
---
 kernel/exit.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/kernel/exit.c b/kernel/exit.c
index 733e80f334e7..fe64e6e28dd5 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -475,6 +475,14 @@ static void exit_mm(void)
 	BUG_ON(mm != current->active_mm);
 	/* more a memory barrier than a real lock */
 	task_lock(current);
+	/*
+	 * When a thread stops operating on an address space, the loop
+	 * in membarrier_{private,global}_expedited() may not observe
+	 * that tsk->mm, and not issue an IPI. Membarrier requires a
+	 * memory barrier after accessing user-space memory, before
+	 * clearing tsk->mm.
+	 */
+	smp_mb__after_spinlock();
 	current->mm = NULL;
 	mmap_read_unlock(mm);
 	enter_lazy_tlb(mm, current);
-- 
2.11.0



* [RFC PATCH 2/3] sched: membarrier: cover kthread_use_mm (v2)
  2020-08-14 16:43 [RFC PATCH 0/3] sched: membarrier updates Mathieu Desnoyers
  2020-08-14 16:43 ` [RFC PATCH 1/3] sched: fix exit_mm vs membarrier (v2) Mathieu Desnoyers
@ 2020-08-14 16:43 ` Mathieu Desnoyers
  2020-08-16 15:29   ` Boqun Feng
  2020-08-14 16:43 ` [RFC PATCH 3/3] sched: membarrier: document memory ordering scenarios Mathieu Desnoyers
       [not found] ` <20200816070932.10752-1-hdanton@sina.com>
  3 siblings, 1 reply; 11+ messages in thread
From: Mathieu Desnoyers @ 2020-08-14 16:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Will Deacon, Paul E . McKenney, Andy Lutomirski,
	Andrew Morton, Alan Stern, Nicholas Piggin, Mathieu Desnoyers

Add comments and a memory barrier to kthread_use_mm and kthread_unuse_mm
to allow the effect of membarrier(2) to apply to kthreads accessing
user-space memory as well.

Given that no prior kthread uses this guarantee and that it only affects
kthreads, adding this guarantee does not affect user-space ABI.

Refine the check in membarrier_global_expedited to exclude runqueues
running the idle thread rather than all kthreads from the IPI cpumask.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
Changes since v1:
- Add WARN_ON_ONCE(current->mm) in play_idle_precise (PeterZ),
- Use smp_mb__after_spinlock rather than smp_mb after task_lock.
---
 kernel/kthread.c          | 19 +++++++++++++++++++
 kernel/sched/idle.c       |  1 +
 kernel/sched/membarrier.c |  8 ++------
 3 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/kernel/kthread.c b/kernel/kthread.c
index 3edaa380dc7b..77aaaa7bc8d9 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -1255,8 +1255,19 @@ void kthread_use_mm(struct mm_struct *mm)
 	finish_arch_post_lock_switch();
 #endif
 
+	/*
+	 * When a kthread starts operating on an address space, the loop
+	 * in membarrier_{private,global}_expedited() may not observe
+	 * that tsk->mm, and not issue an IPI. Membarrier requires a
+	 * memory barrier after storing to tsk->mm, before accessing
+	 * user-space memory. A full memory barrier for membarrier
+	 * {PRIVATE,GLOBAL}_EXPEDITED is implicitly provided by
+	 * mmdrop(), or explicitly with smp_mb().
+	 */
 	if (active_mm != mm)
 		mmdrop(active_mm);
+	else
+		smp_mb();
 
 	to_kthread(tsk)->oldfs = force_uaccess_begin();
 }
@@ -1276,6 +1287,14 @@ void kthread_unuse_mm(struct mm_struct *mm)
 	force_uaccess_end(to_kthread(tsk)->oldfs);
 
 	task_lock(tsk);
+	/*
+	 * When a kthread stops operating on an address space, the loop
+	 * in membarrier_{private,global}_expedited() may not observe
+	 * that tsk->mm, and not issue an IPI. Membarrier requires a
+	 * memory barrier after accessing user-space memory, before
+	 * clearing tsk->mm.
+	 */
+	smp_mb__after_spinlock();
 	sync_mm_rss(mm);
 	local_irq_disable();
 	tsk->mm = NULL;
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 6bf34986f45c..3443ee8335d0 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -341,6 +341,7 @@ void play_idle_precise(u64 duration_ns, u64 latency_ns)
 	WARN_ON_ONCE(!(current->flags & PF_KTHREAD));
 	WARN_ON_ONCE(!(current->flags & PF_NO_SETAFFINITY));
 	WARN_ON_ONCE(!duration_ns);
+	WARN_ON_ONCE(current->mm);
 
 	rcu_sleep_check();
 	preempt_disable();
diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index 168479a7d61b..8a294483074d 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -100,13 +100,9 @@ static int membarrier_global_expedited(void)
 		    MEMBARRIER_STATE_GLOBAL_EXPEDITED))
 			continue;
 
-		/*
-		 * Skip the CPU if it runs a kernel thread. The scheduler
-		 * leaves the prior task mm in place as an optimization when
-		 * scheduling a kthread.
-		 */
+		/* Skip the CPU if it runs the idle thread. */
 		p = rcu_dereference(cpu_rq(cpu)->curr);
-		if (p->flags & PF_KTHREAD)
+		if (is_idle_task(p))
 			continue;
 
 		__cpumask_set_cpu(cpu, tmpmask);
-- 
2.11.0



* [RFC PATCH 3/3] sched: membarrier: document memory ordering scenarios
  2020-08-14 16:43 [RFC PATCH 0/3] sched: membarrier updates Mathieu Desnoyers
  2020-08-14 16:43 ` [RFC PATCH 1/3] sched: fix exit_mm vs membarrier (v2) Mathieu Desnoyers
  2020-08-14 16:43 ` [RFC PATCH 2/3] sched: membarrier: cover kthread_use_mm (v2) Mathieu Desnoyers
@ 2020-08-14 16:43 ` Mathieu Desnoyers
       [not found] ` <20200816070932.10752-1-hdanton@sina.com>
  3 siblings, 0 replies; 11+ messages in thread
From: Mathieu Desnoyers @ 2020-08-14 16:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Will Deacon, Paul E . McKenney, Andy Lutomirski,
	Andrew Morton, Alan Stern, Nicholas Piggin, Mathieu Desnoyers

Document membarrier ordering scenarios in membarrier.c. Thanks to Alan
Stern for refreshing my memory. Now that I have those in mind, it seems
appropriate to serialize them to comments for posterity.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
---
 kernel/sched/membarrier.c | 128 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 128 insertions(+)

diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index 8a294483074d..103f5edb8ba5 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -7,6 +7,134 @@
 #include "sched.h"
 
 /*
+ * For documentation purposes, here are some membarrier ordering
+ * scenarios to keep in mind:
+ *
+ * A) Userspace thread execution after IPI vs membarrier's memory
+ *    barrier before sending the IPI
+ *
+ * Userspace variables:
+ *
+ * int x = 0, y = 0;
+ *
+ * The memory barrier at the start of membarrier() on CPU0 is necessary in
+ * order to enforce the guarantee that any writes occurring on CPU0 before
+ * the membarrier() is executed will be visible to any code executing on
+ * CPU1 after the IPI-induced memory barrier:
+ *
+ *         CPU0                              CPU1
+ *
+ *         x = 1
+ *         membarrier():
+ *           a: smp_mb()
+ *           b: send IPI                       IPI-induced mb
+ *           c: smp_mb()
+ *         r2 = y
+ *                                           y = 1
+ *                                           barrier()
+ *                                           r1 = x
+ *
+ *                     BUG_ON(r1 == 0 && r2 == 0)
+ *
+ * The write to y and load from x by CPU1 are unordered by the hardware,
+ * so it's possible to have "r1 = x" reordered before "y = 1" at any
+ * point after (b).  If the memory barrier at (a) is omitted, then "x = 1"
+ * can be reordered after (a) (although not after (c)), so we get r1 == 0
+ * and r2 == 0.  This violates the guarantee that membarrier() is
+ * supposed to provide.
+ *
+ * The timing of the memory barrier at (a) has to ensure that it executes
+ * before the IPI-induced memory barrier on CPU1.
+ *
+ * B) Userspace thread execution before IPI vs membarrier's memory
+ *    barrier after completing the IPI
+ *
+ * Userspace variables:
+ *
+ * int x = 0, y = 0;
+ *
+ * The memory barrier at the end of membarrier() on CPU0 is necessary in
+ * order to enforce the guarantee that any writes occurring on CPU1 before
+ * the membarrier() is executed will be visible to any code executing on
+ * CPU0 after the membarrier():
+ *
+ *         CPU0                              CPU1
+ *
+ *                                           x = 1
+ *                                           barrier()
+ *                                           y = 1
+ *         r2 = y
+ *         membarrier():
+ *           a: smp_mb()
+ *           b: send IPI                       IPI-induced mb
+ *           c: smp_mb()
+ *         r1 = x
+ *         BUG_ON(r1 == 0 && r2 == 1)
+ *
+ * The writes to x and y are unordered by the hardware, so it's possible to
+ * have "r2 = 1" even though the write to x doesn't execute until (b).  If
+ * the memory barrier at (c) is omitted then "r1 = x" can be reordered
+ * before (b) (although not before (a)), so we get "r1 = 0".  This violates
+ * the guarantee that membarrier() is supposed to provide.
+ *
+ * The timing of the memory barrier at (c) has to ensure that it executes
+ * after the IPI-induced memory barrier on CPU1.
+ *
+ * C) Scheduling userspace thread -> kthread -> userspace thread vs membarrier
+ *
+ *           CPU0                            CPU1
+ *
+ *           membarrier():
+ *           a: smp_mb()
+ *                                           d: switch to kthread (includes mb)
+ *           b: read rq->curr->mm == NULL
+ *                                           e: switch to user (includes mb)
+ *           c: smp_mb()
+ *
+ * Using the scenario from (A), we can show that (a) needs to be paired
+ * with (e). Using the scenario from (B), we can show that (c) needs to
+ * be paired with (d).
+ *
+ * D) exit_mm vs membarrier
+ *
+ * Two thread groups are created, A and B.  Thread group B is created by
+ * issuing clone from group A with flag CLONE_VM set, but not CLONE_THREAD.
+ * Let's assume we have a single thread within each thread group (Thread A
+ * and Thread B).  Thread A runs on CPU0, Thread B runs on CPU1.
+ *
+ *           CPU0                            CPU1
+ *
+ *           membarrier():
+ *             a: smp_mb()
+ *                                           exit_mm():
+ *                                             d: smp_mb()
+ *                                             e: current->mm = NULL
+ *             b: read rq->curr->mm == NULL
+ *             c: smp_mb()
+ *
+ * Using scenario (B), we can show that (c) needs to be paired with (d).
+ *
+ * E) kthread_{use,unuse}_mm vs membarrier
+ *
+ *           CPU0                            CPU1
+ *
+ *           membarrier():
+ *           a: smp_mb()
+ *                                           kthread_unuse_mm()
+ *                                             d: smp_mb()
+ *                                             e: current->mm = NULL
+ *           b: read rq->curr->mm == NULL
+ *                                           kthread_use_mm()
+ *                                             f: current->mm = mm
+ *                                             g: smp_mb()
+ *           c: smp_mb()
+ *
+ * Using the scenario from (A), we can show that (a) needs to be paired
+ * with (g). Using the scenario from (B), we can show that (c) needs to
+ * be paired with (d).
+ */
+
+/*
  * Bitmask made from a "or" of all commands within enum membarrier_cmd,
  * except MEMBARRIER_CMD_QUERY.
  */
-- 
2.11.0



* Re: [RFC PATCH 1/3] sched: fix exit_mm vs membarrier (v2)
  2020-08-14 16:43 ` [RFC PATCH 1/3] sched: fix exit_mm vs membarrier (v2) Mathieu Desnoyers
@ 2020-08-16 15:23   ` Boqun Feng
  2020-09-24 15:01     ` Mathieu Desnoyers
  0 siblings, 1 reply; 11+ messages in thread
From: Boqun Feng @ 2020-08-16 15:23 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, linux-kernel, Will Deacon, Paul E . McKenney,
	Andy Lutomirski, Andrew Morton, Alan Stern, Nicholas Piggin,
	Thomas Gleixner, Linus Torvalds, linux-mm

Hi Mathieu,

On Fri, Aug 14, 2020 at 12:43:56PM -0400, Mathieu Desnoyers wrote:
> exit_mm should issue memory barriers after user-space memory accesses,
> before clearing current->mm, to order user-space memory accesses
> performed prior to exit_mm before clearing tsk->mm, which has the
> effect of skipping the membarrier private expedited IPIs.
> 
> The membarrier system call can be issued concurrently with do_exit
> if we have thread groups created with CLONE_VM but not CLONE_THREAD.
> 
> Here is the scenario I have in mind:
> 
> Two thread groups are created, A and B. Thread group B is created by
> issuing clone from group A with flag CLONE_VM set, but not CLONE_THREAD.
> Let's assume we have a single thread within each thread group (Thread A
> and Thread B).
> 
> The AFAIU we can have:
> 
> Userspace variables:
> 
> int x = 0, y = 0;
> 
> CPU 0                   CPU 1
> Thread A                Thread B
> (in thread group A)     (in thread group B)
> 
> x = 1
> barrier()
> y = 1
> exit()
> exit_mm()
> current->mm = NULL;
>                         r1 = load y
>                         membarrier()
>                           skips CPU 0 (no IPI) because its current mm is NULL
>                         r2 = load x
>                         BUG_ON(r1 == 1 && r2 == 0)
> 
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
> Cc: Will Deacon <will@kernel.org>
> Cc: Paul E. McKenney <paulmck@kernel.org>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Andy Lutomirski <luto@amacapital.net>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Alan Stern <stern@rowland.harvard.edu>
> Cc: linux-mm@kvack.org
> ---
> Changes since v1:
> - Use smp_mb__after_spinlock rather than smp_mb.
> - Document race scenario in commit message.
> ---
>  kernel/exit.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 733e80f334e7..fe64e6e28dd5 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -475,6 +475,14 @@ static void exit_mm(void)
>  	BUG_ON(mm != current->active_mm);
>  	/* more a memory barrier than a real lock */
>  	task_lock(current);
> +	/*
> +	 * When a thread stops operating on an address space, the loop
> +	 * in membarrier_{private,global}_expedited() may not observe

Is it accurate to say that the correctness of
membarrier_global_expedited() relies on the observation of ->mm? Because
IIUC the membarrier_global_expedited() loop doesn't check ->mm.

Regards,
Boqun

> +	 * that tsk->mm, and not issue an IPI. Membarrier requires a
> +	 * memory barrier after accessing user-space memory, before
> +	 * clearing tsk->mm.
> +	 */
> +	smp_mb__after_spinlock();
>  	current->mm = NULL;
>  	mmap_read_unlock(mm);
>  	enter_lazy_tlb(mm, current);
> -- 
> 2.11.0
> 


* Re: [RFC PATCH 2/3] sched: membarrier: cover kthread_use_mm (v2)
  2020-08-14 16:43 ` [RFC PATCH 2/3] sched: membarrier: cover kthread_use_mm (v2) Mathieu Desnoyers
@ 2020-08-16 15:29   ` Boqun Feng
  2020-08-24 15:27     ` Mathieu Desnoyers
  0 siblings, 1 reply; 11+ messages in thread
From: Boqun Feng @ 2020-08-16 15:29 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, linux-kernel, Will Deacon, Paul E . McKenney,
	Andy Lutomirski, Andrew Morton, Alan Stern, Nicholas Piggin

On Fri, Aug 14, 2020 at 12:43:57PM -0400, Mathieu Desnoyers wrote:
> Add comments and memory barrier to kthread_use_mm and kthread_unuse_mm
> to allow the effect of membarrier(2) to apply to kthreads accessing
> user-space memory as well.
> 
> Given that no prior kthread use this guarantee and that it only affects
> kthreads, adding this guarantee does not affect user-space ABI.
> 
> Refine the check in membarrier_global_expedited to exclude runqueues
> running the idle thread rather than all kthreads from the IPI cpumask.
> 
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
> Cc: Will Deacon <will@kernel.org>
> Cc: Paul E. McKenney <paulmck@kernel.org>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Andy Lutomirski <luto@amacapital.net>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> ---
> Changes since v1:
> - Add WARN_ON_ONCE(current->mm) in play_idle_precise (PeterZ),
> - Use smp_mb__after_spinlock rather than smp_mb after task_lock.
> ---
>  kernel/kthread.c          | 19 +++++++++++++++++++
>  kernel/sched/idle.c       |  1 +
>  kernel/sched/membarrier.c |  8 ++------
>  3 files changed, 22 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/kthread.c b/kernel/kthread.c
> index 3edaa380dc7b..77aaaa7bc8d9 100644
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -1255,8 +1255,19 @@ void kthread_use_mm(struct mm_struct *mm)
>  	finish_arch_post_lock_switch();
>  #endif
>  
> +	/*
> +	 * When a kthread starts operating on an address space, the loop
> +	 * in membarrier_{private,global}_expedited() may not observe
> +	 * that tsk->mm, and not issue an IPI. Membarrier requires a
> +	 * memory barrier after storing to tsk->mm, before accessing
> +	 * user-space memory. A full memory barrier for membarrier
> +	 * {PRIVATE,GLOBAL}_EXPEDITED is implicitly provided by
> +	 * mmdrop(), or explicitly with smp_mb().
> +	 */
>  	if (active_mm != mm)
>  		mmdrop(active_mm);
> +	else
> +		smp_mb();

Similar question here: could smp_mb() guarantee the correctness of
GLOBAL_EXPEDITED? Don't you need membarrier_switch_mm() here and in
kthread_unuse_mm(), too?

Am I missing something here?

Regards,
Boqun

>  
>  	to_kthread(tsk)->oldfs = force_uaccess_begin();
>  }
> @@ -1276,6 +1287,14 @@ void kthread_unuse_mm(struct mm_struct *mm)
>  	force_uaccess_end(to_kthread(tsk)->oldfs);
>  
>  	task_lock(tsk);
> +	/*
> +	 * When a kthread stops operating on an address space, the loop
> +	 * in membarrier_{private,global}_expedited() may not observe
> +	 * that tsk->mm, and not issue an IPI. Membarrier requires a
> +	 * memory barrier after accessing user-space memory, before
> +	 * clearing tsk->mm.
> +	 */
> +	smp_mb__after_spinlock();
>  	sync_mm_rss(mm);
>  	local_irq_disable();
>  	tsk->mm = NULL;
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index 6bf34986f45c..3443ee8335d0 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -341,6 +341,7 @@ void play_idle_precise(u64 duration_ns, u64 latency_ns)
>  	WARN_ON_ONCE(!(current->flags & PF_KTHREAD));
>  	WARN_ON_ONCE(!(current->flags & PF_NO_SETAFFINITY));
>  	WARN_ON_ONCE(!duration_ns);
> +	WARN_ON_ONCE(current->mm);
>  
>  	rcu_sleep_check();
>  	preempt_disable();
> diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
> index 168479a7d61b..8a294483074d 100644
> --- a/kernel/sched/membarrier.c
> +++ b/kernel/sched/membarrier.c
> @@ -100,13 +100,9 @@ static int membarrier_global_expedited(void)
>  		    MEMBARRIER_STATE_GLOBAL_EXPEDITED))
>  			continue;
>  
> -		/*
> -		 * Skip the CPU if it runs a kernel thread. The scheduler
> -		 * leaves the prior task mm in place as an optimization when
> -		 * scheduling a kthread.
> -		 */
> +		/* Skip the CPU if it runs the idle thread. */
>  		p = rcu_dereference(cpu_rq(cpu)->curr);
> -		if (p->flags & PF_KTHREAD)
> +		if (is_idle_task(p))
>  			continue;
>  
>  		__cpumask_set_cpu(cpu, tmpmask);
> -- 
> 2.11.0
> 


* Re: [RFC PATCH 2/3] sched: membarrier: cover kthread_use_mm (v2)
       [not found] ` <20200816070932.10752-1-hdanton@sina.com>
@ 2020-08-16 21:05   ` Mathieu Desnoyers
  0 siblings, 0 replies; 11+ messages in thread
From: Mathieu Desnoyers @ 2020-08-16 21:05 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Peter Zijlstra, linux-kernel, Will Deacon, paulmck,
	Andy Lutomirski, Andrew Morton, Alan Stern, Nicholas Piggin

----- On Aug 16, 2020, at 3:09 AM, Hillf Danton hdanton@sina.com wrote:

> On Fri, 14 Aug 2020 12:43:57 -0400 Mathieu Desnoyers wrote:
>> 
>> Given that no prior kthread use this guarantee and that it only affects
>> kthreads, adding this guarantee does not affect user-space ABI.
> 
> Can you expand a bit on why kthreads like ksoftirqd have to ack the
> IPIs from Dave, who does not have CAP_SYS_ADMIN?

Do ksoftirqd kthreads ever use kthread_use_mm() to access user
processes' memory ? If not, then they won't be disturbed by any IPI.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: [RFC PATCH 2/3] sched: membarrier: cover kthread_use_mm (v2)
  2020-08-16 15:29   ` Boqun Feng
@ 2020-08-24 15:27     ` Mathieu Desnoyers
  2020-08-25  2:06       ` Boqun Feng
  0 siblings, 1 reply; 11+ messages in thread
From: Mathieu Desnoyers @ 2020-08-24 15:27 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Peter Zijlstra, linux-kernel, Will Deacon, paulmck,
	Andy Lutomirski, Andrew Morton, Alan Stern, Nicholas Piggin

----- On Aug 16, 2020, at 11:29 AM, Boqun Feng boqun.feng@gmail.com wrote:

> On Fri, Aug 14, 2020 at 12:43:57PM -0400, Mathieu Desnoyers wrote:
>> Add comments and memory barrier to kthread_use_mm and kthread_unuse_mm
>> to allow the effect of membarrier(2) to apply to kthreads accessing
>> user-space memory as well.
>> 
>> Given that no prior kthread use this guarantee and that it only affects
>> kthreads, adding this guarantee does not affect user-space ABI.
>> 
>> Refine the check in membarrier_global_expedited to exclude runqueues
>> running the idle thread rather than all kthreads from the IPI cpumask.
>> 
>> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>> Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
>> Cc: Will Deacon <will@kernel.org>
>> Cc: Paul E. McKenney <paulmck@kernel.org>
>> Cc: Nicholas Piggin <npiggin@gmail.com>
>> Cc: Andy Lutomirski <luto@amacapital.net>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> ---
>> Changes since v1:
>> - Add WARN_ON_ONCE(current->mm) in play_idle_precise (PeterZ),
>> - Use smp_mb__after_spinlock rather than smp_mb after task_lock.
>> ---
>>  kernel/kthread.c          | 19 +++++++++++++++++++
>>  kernel/sched/idle.c       |  1 +
>>  kernel/sched/membarrier.c |  8 ++------
>>  3 files changed, 22 insertions(+), 6 deletions(-)
>> 
>> diff --git a/kernel/kthread.c b/kernel/kthread.c
>> index 3edaa380dc7b..77aaaa7bc8d9 100644
>> --- a/kernel/kthread.c
>> +++ b/kernel/kthread.c
>> @@ -1255,8 +1255,19 @@ void kthread_use_mm(struct mm_struct *mm)
>>  	finish_arch_post_lock_switch();
>>  #endif
>>  
>> +	/*
>> +	 * When a kthread starts operating on an address space, the loop
>> +	 * in membarrier_{private,global}_expedited() may not observe
>> +	 * that tsk->mm, and not issue an IPI. Membarrier requires a
>> +	 * memory barrier after storing to tsk->mm, before accessing
>> +	 * user-space memory. A full memory barrier for membarrier
>> +	 * {PRIVATE,GLOBAL}_EXPEDITED is implicitly provided by
>> +	 * mmdrop(), or explicitly with smp_mb().
>> +	 */
>>  	if (active_mm != mm)
>>  		mmdrop(active_mm);
>> +	else
>> +		smp_mb();
> 
> Similar question here: could smp_mb() guarantee the correctness of
> GLOBAL_EXPEDITED? Don't you need membarrier_switch_mm() here and in
> kthread_unuse_mm(), too?
> 
> Am I miss something here?

I think you have a good point there. Which brings me to wonder why we
don't have membarrier_switch_mm() when entering/leaving lazy TLB mode.
This means an IPI can be sent to a kthread even if it does not use an
mm, just because the membarrier state in the runqueue is that of the
active_mm.

Thoughts ?

Thanks,

Mathieu

> 
> Regards,
> Boqun
> 
>>  
>>  	to_kthread(tsk)->oldfs = force_uaccess_begin();
>>  }
>> @@ -1276,6 +1287,14 @@ void kthread_unuse_mm(struct mm_struct *mm)
>>  	force_uaccess_end(to_kthread(tsk)->oldfs);
>>  
>>  	task_lock(tsk);
>> +	/*
>> +	 * When a kthread stops operating on an address space, the loop
>> +	 * in membarrier_{private,global}_expedited() may not observe
>> +	 * that tsk->mm, and not issue an IPI. Membarrier requires a
>> +	 * memory barrier after accessing user-space memory, before
>> +	 * clearing tsk->mm.
>> +	 */
>> +	smp_mb__after_spinlock();
>>  	sync_mm_rss(mm);
>>  	local_irq_disable();
>>  	tsk->mm = NULL;
>> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
>> index 6bf34986f45c..3443ee8335d0 100644
>> --- a/kernel/sched/idle.c
>> +++ b/kernel/sched/idle.c
>> @@ -341,6 +341,7 @@ void play_idle_precise(u64 duration_ns, u64 latency_ns)
>>  	WARN_ON_ONCE(!(current->flags & PF_KTHREAD));
>>  	WARN_ON_ONCE(!(current->flags & PF_NO_SETAFFINITY));
>>  	WARN_ON_ONCE(!duration_ns);
>> +	WARN_ON_ONCE(current->mm);
>>  
>>  	rcu_sleep_check();
>>  	preempt_disable();
>> diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
>> index 168479a7d61b..8a294483074d 100644
>> --- a/kernel/sched/membarrier.c
>> +++ b/kernel/sched/membarrier.c
>> @@ -100,13 +100,9 @@ static int membarrier_global_expedited(void)
>>  		    MEMBARRIER_STATE_GLOBAL_EXPEDITED))
>>  			continue;
>>  
>> -		/*
>> -		 * Skip the CPU if it runs a kernel thread. The scheduler
>> -		 * leaves the prior task mm in place as an optimization when
>> -		 * scheduling a kthread.
>> -		 */
>> +		/* Skip the CPU if it runs the idle thread. */
>>  		p = rcu_dereference(cpu_rq(cpu)->curr);
>> -		if (p->flags & PF_KTHREAD)
>> +		if (is_idle_task(p))
>>  			continue;
>>  
>>  		__cpumask_set_cpu(cpu, tmpmask);
>> --
>> 2.11.0

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: [RFC PATCH 2/3] sched: membarrier: cover kthread_use_mm (v2)
  2020-08-24 15:27     ` Mathieu Desnoyers
@ 2020-08-25  2:06       ` Boqun Feng
  2020-09-24 15:26         ` Mathieu Desnoyers
  0 siblings, 1 reply; 11+ messages in thread
From: Boqun Feng @ 2020-08-25  2:06 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, linux-kernel, Will Deacon, paulmck,
	Andy Lutomirski, Andrew Morton, Alan Stern, Nicholas Piggin

On Mon, Aug 24, 2020 at 11:27:49AM -0400, Mathieu Desnoyers wrote:
> ----- On Aug 16, 2020, at 11:29 AM, Boqun Feng boqun.feng@gmail.com wrote:
> 
> > On Fri, Aug 14, 2020 at 12:43:57PM -0400, Mathieu Desnoyers wrote:
> >> Add comments and memory barrier to kthread_use_mm and kthread_unuse_mm
> >> to allow the effect of membarrier(2) to apply to kthreads accessing
> >> user-space memory as well.
> >> 
> >> Given that no prior kthread use this guarantee and that it only affects
> >> kthreads, adding this guarantee does not affect user-space ABI.
> >> 
> >> Refine the check in membarrier_global_expedited to exclude runqueues
> >> running the idle thread rather than all kthreads from the IPI cpumask.
> >> 
> >> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> >> Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
> >> Cc: Will Deacon <will@kernel.org>
> >> Cc: Paul E. McKenney <paulmck@kernel.org>
> >> Cc: Nicholas Piggin <npiggin@gmail.com>
> >> Cc: Andy Lutomirski <luto@amacapital.net>
> >> Cc: Andrew Morton <akpm@linux-foundation.org>
> >> ---
> >> Changes since v1:
> >> - Add WARN_ON_ONCE(current->mm) in play_idle_precise (PeterZ),
> >> - Use smp_mb__after_spinlock rather than smp_mb after task_lock.
> >> ---
> >>  kernel/kthread.c          | 19 +++++++++++++++++++
> >>  kernel/sched/idle.c       |  1 +
> >>  kernel/sched/membarrier.c |  8 ++------
> >>  3 files changed, 22 insertions(+), 6 deletions(-)
> >> 
> >> diff --git a/kernel/kthread.c b/kernel/kthread.c
> >> index 3edaa380dc7b..77aaaa7bc8d9 100644
> >> --- a/kernel/kthread.c
> >> +++ b/kernel/kthread.c
> >> @@ -1255,8 +1255,19 @@ void kthread_use_mm(struct mm_struct *mm)
> >>  	finish_arch_post_lock_switch();
> >>  #endif
> >>  
> >> +	/*
> >> +	 * When a kthread starts operating on an address space, the loop
> >> +	 * in membarrier_{private,global}_expedited() may not observe
> >> +	 * that tsk->mm, and not issue an IPI. Membarrier requires a
> >> +	 * memory barrier after storing to tsk->mm, before accessing
> >> +	 * user-space memory. A full memory barrier for membarrier
> >> +	 * {PRIVATE,GLOBAL}_EXPEDITED is implicitly provided by
> >> +	 * mmdrop(), or explicitly with smp_mb().
> >> +	 */
> >>  	if (active_mm != mm)
> >>  		mmdrop(active_mm);
> >> +	else
> >> +		smp_mb();
> > 
> > Similar question here: could smp_mb() guarantee the correctness of
> > GLOBAL_EXPEDITED? Don't you need membarrier_switch_mm() here and in
> > kthread_unuse_mm(), too?
> > 
> > Am I miss something here?
> 
> I think you have a good point there. Which brings me to wonder why we
> don't have membarrier_switch_mm() when entering/leaving lazy TLB mode.
> This means an IPI can be sent to a kthread even if it does not use an
> mm, just because the membarrier state in the runqueue is that of the
> active_mm.
> 
> Thoughts ?
> 

Right, I think we should also handle the percpu membarrier_state. The
basic rule is that whenever we change current->mm or current (i.e. rq->curr)
itself, we need to update the percpu membarrier_state accordingly.

Regards,
Boqun

> Thanks,
> 
> Mathieu
> 
> > 
> > Regards,
> > Boqun
> > 
> >>  
> >>  	to_kthread(tsk)->oldfs = force_uaccess_begin();
> >>  }
> >> @@ -1276,6 +1287,14 @@ void kthread_unuse_mm(struct mm_struct *mm)
> >>  	force_uaccess_end(to_kthread(tsk)->oldfs);
> >>  
> >>  	task_lock(tsk);
> >> +	/*
> >> +	 * When a kthread stops operating on an address space, the loop
> >> +	 * in membarrier_{private,global}_expedited() may not observe
> >> +	 * that tsk->mm, and not issue an IPI. Membarrier requires a
> >> +	 * memory barrier after accessing user-space memory, before
> >> +	 * clearing tsk->mm.
> >> +	 */
> >> +	smp_mb__after_spinlock();
> >>  	sync_mm_rss(mm);
> >>  	local_irq_disable();
> >>  	tsk->mm = NULL;
> >> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> >> index 6bf34986f45c..3443ee8335d0 100644
> >> --- a/kernel/sched/idle.c
> >> +++ b/kernel/sched/idle.c
> >> @@ -341,6 +341,7 @@ void play_idle_precise(u64 duration_ns, u64 latency_ns)
> >>  	WARN_ON_ONCE(!(current->flags & PF_KTHREAD));
> >>  	WARN_ON_ONCE(!(current->flags & PF_NO_SETAFFINITY));
> >>  	WARN_ON_ONCE(!duration_ns);
> >> +	WARN_ON_ONCE(current->mm);
> >>  
> >>  	rcu_sleep_check();
> >>  	preempt_disable();
> >> diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
> >> index 168479a7d61b..8a294483074d 100644
> >> --- a/kernel/sched/membarrier.c
> >> +++ b/kernel/sched/membarrier.c
> >> @@ -100,13 +100,9 @@ static int membarrier_global_expedited(void)
> >>  		    MEMBARRIER_STATE_GLOBAL_EXPEDITED))
> >>  			continue;
> >>  
> >> -		/*
> >> -		 * Skip the CPU if it runs a kernel thread. The scheduler
> >> -		 * leaves the prior task mm in place as an optimization when
> >> -		 * scheduling a kthread.
> >> -		 */
> >> +		/* Skip the CPU if it runs the idle thread. */
> >>  		p = rcu_dereference(cpu_rq(cpu)->curr);
> >> -		if (p->flags & PF_KTHREAD)
> >> +		if (is_idle_task(p))
> >>  			continue;
> >>  
> >>  		__cpumask_set_cpu(cpu, tmpmask);
> >> --
> >> 2.11.0
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com


* Re: [RFC PATCH 1/3] sched: fix exit_mm vs membarrier (v2)
  2020-08-16 15:23   ` Boqun Feng
@ 2020-09-24 15:01     ` Mathieu Desnoyers
  0 siblings, 0 replies; 11+ messages in thread
From: Mathieu Desnoyers @ 2020-09-24 15:01 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Peter Zijlstra, linux-kernel, Will Deacon, paulmck,
	Andy Lutomirski, Andrew Morton, Alan Stern, Nicholas Piggin,
	Thomas Gleixner, Linus Torvalds, linux-mm

----- On Aug 16, 2020, at 11:23 AM, Boqun Feng boqun.feng@gmail.com wrote:

> Hi Mathieu,
> 
> On Fri, Aug 14, 2020 at 12:43:56PM -0400, Mathieu Desnoyers wrote:
>> exit_mm should issue memory barriers after user-space memory accesses,
>> before clearing current->mm, to order user-space memory accesses
>> performed prior to exit_mm before clearing tsk->mm, which has the
>> effect of skipping the membarrier private expedited IPIs.
>> 
>> The membarrier system call can be issued concurrently with do_exit
>> if we have thread groups created with CLONE_VM but not CLONE_THREAD.
>> 
>> Here is the scenario I have in mind:
>> 
>> Two thread groups are created, A and B. Thread group B is created by
>> issuing clone from group A with flag CLONE_VM set, but not CLONE_THREAD.
>> Let's assume we have a single thread within each thread group (Thread A
>> and Thread B).
>> 
>> The AFAIU we can have:
>> 
>> Userspace variables:
>> 
>> int x = 0, y = 0;
>> 
>> CPU 0                   CPU 1
>> Thread A                Thread B
>> (in thread group A)     (in thread group B)
>> 
>> x = 1
>> barrier()
>> y = 1
>> exit()
>> exit_mm()
>> current->mm = NULL;
>>                         r1 = load y
>>                         membarrier()
>>                           skips CPU 0 (no IPI) because its current mm is NULL
>>                         r2 = load x
>>                         BUG_ON(r1 == 1 && r2 == 0)
>> 
>> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>> Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
>> Cc: Will Deacon <will@kernel.org>
>> Cc: Paul E. McKenney <paulmck@kernel.org>
>> Cc: Nicholas Piggin <npiggin@gmail.com>
>> Cc: Andy Lutomirski <luto@amacapital.net>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Linus Torvalds <torvalds@linux-foundation.org>
>> Cc: Alan Stern <stern@rowland.harvard.edu>
>> Cc: linux-mm@kvack.org
>> ---
>> Changes since v1:
>> - Use smp_mb__after_spinlock rather than smp_mb.
>> - Document race scenario in commit message.
>> ---
>>  kernel/exit.c | 8 ++++++++
>>  1 file changed, 8 insertions(+)
>> 
>> diff --git a/kernel/exit.c b/kernel/exit.c
>> index 733e80f334e7..fe64e6e28dd5 100644
>> --- a/kernel/exit.c
>> +++ b/kernel/exit.c
>> @@ -475,6 +475,14 @@ static void exit_mm(void)
>>  	BUG_ON(mm != current->active_mm);
>>  	/* more a memory barrier than a real lock */
>>  	task_lock(current);
>> +	/*
>> +	 * When a thread stops operating on an address space, the loop
>> +	 * in membarrier_{private,global}_expedited() may not observe
> 
> Is it accurate to say that the correctness of
> membarrier_global_expedited() relies on the observation of ->mm? Because
> IIUC membarrier_global_expedited() loop doesn't check ->mm.

Good point, I was wrong. Will instead reword as:

        /*
         * When a thread stops operating on an address space, the loop
         * in membarrier_private_expedited() may not observe that
         * tsk->mm, and the loop in membarrier_global_expedited() may
         * not observe a MEMBARRIER_STATE_GLOBAL_EXPEDITED
         * rq->membarrier_state, so those would not issue an IPI.
         * Membarrier requires a memory barrier after accessing
         * user-space memory, before clearing tsk->mm or the
         * rq->membarrier_state.
         */

And I'll make sure exit_mm clears this_rq()->membarrier_state as well.
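
For illustration, one possible shape for that follow-up would be a small
helper along the following lines (hypothetical name, sketch only; it assumes
it is called with irqs or preemption disabled so that this_rq() is stable,
and it would need to be exposed to kernel/exit.c through a header):

void membarrier_update_current_mm(struct mm_struct *next_mm)
{
	struct rq *rq = this_rq();
	int membarrier_state = 0;

	if (next_mm)
		membarrier_state = atomic_read(&next_mm->membarrier_state);
	if (READ_ONCE(rq->membarrier_state) == membarrier_state)
		return;
	WRITE_ONCE(rq->membarrier_state, membarrier_state);
}

exit_mm() would then call it with a NULL argument right where it clears
current->mm, so that membarrier_global_expedited() no longer sees a stale
MEMBARRIER_STATE_GLOBAL_EXPEDITED value for that runqueue.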

Thanks,

Mathieu

> 
> Regards,
> Boqun
> 
>> +	 * that tsk->mm, and not issue an IPI. Membarrier requires a
>> +	 * memory barrier after accessing user-space memory, before
>> +	 * clearing tsk->mm.
>> +	 */
>> +	smp_mb__after_spinlock();
>>  	current->mm = NULL;
>>  	mmap_read_unlock(mm);
>>  	enter_lazy_tlb(mm, current);
>> --
>> 2.11.0

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: [RFC PATCH 2/3] sched: membarrier: cover kthread_use_mm (v2)
  2020-08-25  2:06       ` Boqun Feng
@ 2020-09-24 15:26         ` Mathieu Desnoyers
  0 siblings, 0 replies; 11+ messages in thread
From: Mathieu Desnoyers @ 2020-09-24 15:26 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Peter Zijlstra, linux-kernel, Will Deacon, paulmck,
	Andy Lutomirski, Andrew Morton, Alan Stern, Nicholas Piggin

----- On Aug 24, 2020, at 10:06 PM, Boqun Feng boqun.feng@gmail.com wrote:

> On Mon, Aug 24, 2020 at 11:27:49AM -0400, Mathieu Desnoyers wrote:
>> ----- On Aug 16, 2020, at 11:29 AM, Boqun Feng boqun.feng@gmail.com wrote:
>> 
>> > On Fri, Aug 14, 2020 at 12:43:57PM -0400, Mathieu Desnoyers wrote:
>> >> Add comments and memory barrier to kthread_use_mm and kthread_unuse_mm
>> >> to allow the effect of membarrier(2) to apply to kthreads accessing
>> >> user-space memory as well.
>> >> 
>> >> Given that no prior kthread use this guarantee and that it only affects
>> >> kthreads, adding this guarantee does not affect user-space ABI.
>> >> 
>> >> Refine the check in membarrier_global_expedited to exclude runqueues
>> >> running the idle thread rather than all kthreads from the IPI cpumask.
>> >> 
>> >> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>> >> Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
>> >> Cc: Will Deacon <will@kernel.org>
>> >> Cc: Paul E. McKenney <paulmck@kernel.org>
>> >> Cc: Nicholas Piggin <npiggin@gmail.com>
>> >> Cc: Andy Lutomirski <luto@amacapital.net>
>> >> Cc: Andrew Morton <akpm@linux-foundation.org>
>> >> ---
>> >> Changes since v1:
>> >> - Add WARN_ON_ONCE(current->mm) in play_idle_precise (PeterZ),
>> >> - Use smp_mb__after_spinlock rather than smp_mb after task_lock.
>> >> ---
>> >>  kernel/kthread.c          | 19 +++++++++++++++++++
>> >>  kernel/sched/idle.c       |  1 +
>> >>  kernel/sched/membarrier.c |  8 ++------
>> >>  3 files changed, 22 insertions(+), 6 deletions(-)
>> >> 
>> >> diff --git a/kernel/kthread.c b/kernel/kthread.c
>> >> index 3edaa380dc7b..77aaaa7bc8d9 100644
>> >> --- a/kernel/kthread.c
>> >> +++ b/kernel/kthread.c
>> >> @@ -1255,8 +1255,19 @@ void kthread_use_mm(struct mm_struct *mm)
>> >>  	finish_arch_post_lock_switch();
>> >>  #endif
>> >>  
>> >> +	/*
>> >> +	 * When a kthread starts operating on an address space, the loop
>> >> +	 * in membarrier_{private,global}_expedited() may not observe
>> >> +	 * that tsk->mm, and not issue an IPI. Membarrier requires a
>> >> +	 * memory barrier after storing to tsk->mm, before accessing
>> >> +	 * user-space memory. A full memory barrier for membarrier
>> >> +	 * {PRIVATE,GLOBAL}_EXPEDITED is implicitly provided by
>> >> +	 * mmdrop(), or explicitly with smp_mb().
>> >> +	 */
>> >>  	if (active_mm != mm)
>> >>  		mmdrop(active_mm);
>> >> +	else
>> >> +		smp_mb();
>> > 
>> > Similar question here: could smp_mb() guarantee the correctness of
>> > GLOBAL_EXPEDITED? Don't you need membarrier_switch_mm() here and in
>> > kthread_unuse_mm(), too?
>> > 
>> > Am I miss something here?
>> 
>> I think you have a good point there. Which brings me to wonder why we
>> don't have membarrier_switch_mm() when entering/leaving lazy TLB mode.
>> This means an IPI can be sent to a kthread even if it does not use an
>> mm, just because the membarrier state in the runqueue is that of the
>> active_mm.
>> 
>> Thoughts ?
>> 
> 
> Right, I think we should also handle the percpu membarrier_state. The
> basic rule is whenever we change current->mm or current (i.e. rq->curr)
> itself, we need to update the percpu membarrier_state accordingly.

OK, so as we introduce IPIs to kthreads that use kthread_use_mm, we need to
reconsider how the scheduler deals with runqueues entering lazy TLB state.
Currently, membarrier_switch_mm() is not called when entering lazy TLB state.
But as we start considering kthreads as candidates for receiving IPIs, we
need to update rq->membarrier_state whenever we enter lazy TLB state as well.

So I plan to make the following change to cover this:

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 84758f34cdb0..44521dc5602a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3736,6 +3736,8 @@ context_switch(struct rq *rq, struct task_struct *prev,
         */
        arch_start_context_switch(prev);
 
+       membarrier_switch_mm(rq, prev->mm, next->mm);
+
        /*
         * kernel -> kernel   lazy + transfer active
         *   user -> kernel   lazy + mmgrab() active
@@ -3752,7 +3754,6 @@ context_switch(struct rq *rq, struct task_struct *prev,
                else
                        prev->active_mm = NULL;
        } else {                                        // to user
-               membarrier_switch_mm(rq, prev->active_mm, next->mm);
                /*
                 * sys_membarrier() requires an smp_mb() between setting
                 * rq->curr / membarrier_switch_mm() and returning to userspace.
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3fd283892761..481149066086 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2592,12 +2592,13 @@ static inline void membarrier_switch_mm(struct rq *rq,
                                        struct mm_struct *prev_mm,
                                        struct mm_struct *next_mm)
 {
-       int membarrier_state;
+       int membarrier_state = 0;
 
        if (prev_mm == next_mm)
                return;
 
-       membarrier_state = atomic_read(&next_mm->membarrier_state);
+       if (next_mm)
+               membarrier_state = atomic_read(&next_mm->membarrier_state);
        if (READ_ONCE(rq->membarrier_state) == membarrier_state)
                return;
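
If we go in that direction, kthread_use_mm() and kthread_unuse_mm() would
presumably need a matching update as well, since they change current->mm
without going through context_switch(). A rough sketch of the call sites I
have in mind (hypothetical; it assumes preemption is disabled around them so
that this_rq() is stable, and it would likely go through a small wrapper
because membarrier_switch_mm() lives in kernel/sched/sched.h):

	/* kthread_use_mm(), once tsk->mm = mm is set: */
	membarrier_switch_mm(this_rq(), NULL, mm);

	/* kthread_unuse_mm(), before clearing tsk->mm: */
	membarrier_switch_mm(this_rq(), mm, NULL);

so that rq->membarrier_state tracks the mm the kthread is actually using,
rather than whatever active_mm is lazily kept around.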

Thoughts ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

