* [RFC PATCH 0/3] sched: membarrier updates
From: Mathieu Desnoyers @ 2020-08-14 16:43 UTC
To: Peter Zijlstra
Cc: linux-kernel, Will Deacon, Paul E. McKenney, Andy Lutomirski,
Andrew Morton, Alan Stern, Nicholas Piggin, Mathieu Desnoyers
Hi,
Please find these three changes based on our recent discussions about the
interaction between membarrier, exit_mm, and kthread_use_mm. I added
documentation of memory ordering scenarios to membarrier.c as well.
Thanks,
Mathieu
Mathieu Desnoyers (3):
sched: fix exit_mm vs membarrier (v2)
sched: membarrier: cover kthread_use_mm (v2)
sched: membarrier: document memory ordering scenarios
kernel/exit.c | 8 +++
kernel/kthread.c | 19 +++++++
kernel/sched/idle.c | 1 +
kernel/sched/membarrier.c | 136 ++++++++++++++++++++++++++++++++++++++++++++--
4 files changed, 158 insertions(+), 6 deletions(-)
--
2.11.0
* [RFC PATCH 1/3] sched: fix exit_mm vs membarrier (v2)
From: Mathieu Desnoyers @ 2020-08-14 16:43 UTC
To: Peter Zijlstra
Cc: linux-kernel, Will Deacon, Paul E. McKenney, Andy Lutomirski,
Andrew Morton, Alan Stern, Nicholas Piggin, Mathieu Desnoyers,
Thomas Gleixner, Linus Torvalds, linux-mm
exit_mm should issue memory barriers after user-space memory accesses
and before clearing current->mm, so that accesses performed prior to
exit_mm are ordered before tsk->mm is cleared; once tsk->mm is NULL,
the membarrier private expedited IPIs skip that CPU.
The membarrier system call can be issued concurrently with do_exit
if we have thread groups created with CLONE_VM but not CLONE_THREAD.
Here is the scenario I have in mind:
Two thread groups are created, A and B. Thread group B is created by
issuing clone from group A with flag CLONE_VM set, but not CLONE_THREAD.
Let's assume we have a single thread within each thread group (Thread A
and Thread B).
AFAIU, we can have:
Userspace variables:
int x = 0, y = 0;
CPU 0 CPU 1
Thread A Thread B
(in thread group A) (in thread group B)
x = 1
barrier()
y = 1
exit()
exit_mm()
current->mm = NULL;
r1 = load y
membarrier()
skips CPU 0 (no IPI) because its current mm is NULL
r2 = load x
BUG_ON(r1 == 1 && r2 == 0)
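
For illustration, here is a minimal userspace sketch of this setup. It is
not a reliable reproducer (the race window is tiny), and the helper name
and stack size are arbitrary; it only shows the clone flags and the
membarrier(2) calls involved:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/membarrier.h>

static volatile int x, y;

/* Thread B: its own thread group, but sharing the mm with group A. */
static int thread_b(void *arg)
{
	int r1, r2;

	r1 = y;
	/* Should IPI every CPU running a thread which uses this mm. */
	syscall(__NR_membarrier, MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0);
	r2 = x;
	if (r1 == 1 && r2 == 0)
		printf("ordering violation observed\n");
	return 0;
}

int main(void)
{
	static char stack[65536];

	if (syscall(__NR_membarrier,
		    MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0))
		return 1;
	/* CLONE_VM without CLONE_THREAD: new thread group, same mm. */
	if (clone(thread_b, stack + sizeof(stack), CLONE_VM, NULL) < 0)
		return 1;
	x = 1;                                  /* Thread A */
	__asm__ __volatile__("" ::: "memory");  /* barrier() */
	y = 1;
	_exit(0);  /* -> do_exit() -> exit_mm() clears current->mm */
}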
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: linux-mm@kvack.org
---
Changes since v1:
- Use smp_mb__after_spinlock rather than smp_mb.
- Document race scenario in commit message.
---
kernel/exit.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/kernel/exit.c b/kernel/exit.c
index 733e80f334e7..fe64e6e28dd5 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -475,6 +475,14 @@ static void exit_mm(void)
BUG_ON(mm != current->active_mm);
/* more a memory barrier than a real lock */
task_lock(current);
+ /*
+ * When a thread stops operating on an address space, the loop
+ * in membarrier_{private,global}_expedited() may not observe
+ * that tsk->mm, and not issue an IPI. Membarrier requires a
+ * memory barrier after accessing user-space memory, before
+ * clearing tsk->mm.
+ */
+ smp_mb__after_spinlock();
current->mm = NULL;
mmap_read_unlock(mm);
enter_lazy_tlb(mm, current);
--
2.11.0
* [RFC PATCH 2/3] sched: membarrier: cover kthread_use_mm (v2)
From: Mathieu Desnoyers @ 2020-08-14 16:43 UTC
To: Peter Zijlstra
Cc: linux-kernel, Will Deacon, Paul E. McKenney, Andy Lutomirski,
Andrew Morton, Alan Stern, Nicholas Piggin, Mathieu Desnoyers
Add comments and memory barrier to kthread_use_mm and kthread_unuse_mm
to allow the effect of membarrier(2) to apply to kthreads accessing
user-space memory as well.
Given that no prior kthread uses this guarantee and that it only affects
kthreads, adding this guarantee does not affect the user-space ABI.
Refine the check in membarrier_global_expedited to exclude runqueues
running the idle thread rather than all kthreads from the IPI cpumask.
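
For illustration, the kind of kthread code this guarantee is meant to
cover could look like the following sketch (a hypothetical worker,
modelled loosely on vhost-style kthreads; error handling elided):

#include <linux/kthread.h>
#include <linux/sched/mm.h>
#include <linux/uaccess.h>

struct my_work {
	struct mm_struct *mm;	/* caller took a reference with mmget() */
	u64 __user *uaddr;	/* user address to read from */
};

static int my_worker_fn(void *data)
{
	struct my_work *w = data;
	u64 val = 0;

	/* Publishes tsk->mm, then issues mmdrop() or smp_mb(). */
	kthread_use_mm(w->mm);
	/*
	 * From here on, user-space accesses by this kthread are ordered
	 * against concurrent membarrier(2) callers sharing w->mm, just
	 * like accesses performed by an ordinary user thread.
	 */
	if (copy_from_user(&val, w->uaddr, sizeof(val)))
		val = 0;
	/* Issues smp_mb__after_spinlock() before clearing tsk->mm. */
	kthread_unuse_mm(w->mm);
	mmput(w->mm);
	return 0;
}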
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
Changes since v1:
- Add WARN_ON_ONCE(current->mm) in play_idle_precise (PeterZ),
- Use smp_mb__after_spinlock rather than smp_mb after task_lock.
---
kernel/kthread.c | 19 +++++++++++++++++++
kernel/sched/idle.c | 1 +
kernel/sched/membarrier.c | 8 ++------
3 files changed, 22 insertions(+), 6 deletions(-)
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 3edaa380dc7b..77aaaa7bc8d9 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -1255,8 +1255,19 @@ void kthread_use_mm(struct mm_struct *mm)
finish_arch_post_lock_switch();
#endif
+ /*
+ * When a kthread starts operating on an address space, the loop
+ * in membarrier_{private,global}_expedited() may not observe
+ * that tsk->mm, and not issue an IPI. Membarrier requires a
+ * memory barrier after storing to tsk->mm, before accessing
+ * user-space memory. A full memory barrier for membarrier
+ * {PRIVATE,GLOBAL}_EXPEDITED is implicitly provided by
+ * mmdrop(), or explicitly with smp_mb().
+ */
if (active_mm != mm)
mmdrop(active_mm);
+ else
+ smp_mb();
to_kthread(tsk)->oldfs = force_uaccess_begin();
}
@@ -1276,6 +1287,14 @@ void kthread_unuse_mm(struct mm_struct *mm)
force_uaccess_end(to_kthread(tsk)->oldfs);
task_lock(tsk);
+ /*
+ * When a kthread stops operating on an address space, the loop
+ * in membarrier_{private,global}_expedited() may not observe
+ * that tsk->mm, and not issue an IPI. Membarrier requires a
+ * memory barrier after accessing user-space memory, before
+ * clearing tsk->mm.
+ */
+ smp_mb__after_spinlock();
sync_mm_rss(mm);
local_irq_disable();
tsk->mm = NULL;
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 6bf34986f45c..3443ee8335d0 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -341,6 +341,7 @@ void play_idle_precise(u64 duration_ns, u64 latency_ns)
WARN_ON_ONCE(!(current->flags & PF_KTHREAD));
WARN_ON_ONCE(!(current->flags & PF_NO_SETAFFINITY));
WARN_ON_ONCE(!duration_ns);
+ WARN_ON_ONCE(current->mm);
rcu_sleep_check();
preempt_disable();
diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index 168479a7d61b..8a294483074d 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -100,13 +100,9 @@ static int membarrier_global_expedited(void)
MEMBARRIER_STATE_GLOBAL_EXPEDITED))
continue;
- /*
- * Skip the CPU if it runs a kernel thread. The scheduler
- * leaves the prior task mm in place as an optimization when
- * scheduling a kthread.
- */
+ /* Skip the CPU if it runs the idle thread. */
p = rcu_dereference(cpu_rq(cpu)->curr);
- if (p->flags & PF_KTHREAD)
+ if (is_idle_task(p))
continue;
__cpumask_set_cpu(cpu, tmpmask);
--
2.11.0
* [RFC PATCH 3/3] sched: membarrier: document memory ordering scenarios
From: Mathieu Desnoyers @ 2020-08-14 16:43 UTC
To: Peter Zijlstra
Cc: linux-kernel, Will Deacon, Paul E. McKenney, Andy Lutomirski,
Andrew Morton, Alan Stern, Nicholas Piggin, Mathieu Desnoyers
Document membarrier ordering scenarios in membarrier.c. Thanks to Alan
Stern for refreshing my memory. Now that I have those in mind, it seems
appropriate to serialize them to comments for posterity.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
---
kernel/sched/membarrier.c | 128 ++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 128 insertions(+)
diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index 8a294483074d..103f5edb8ba5 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -7,6 +7,134 @@
#include "sched.h"
/*
+ * For documentation purposes, here are some membarrier ordering
+ * scenarios to keep in mind:
+ *
+ * A) Userspace thread execution after IPI vs membarrier's memory
+ * barrier before sending the IPI
+ *
+ * Userspace variables:
+ *
+ * int x = 0, y = 0;
+ *
+ * The memory barrier at the start of membarrier() on CPU0 is necessary in
+ * order to enforce the guarantee that any writes occurring on CPU0 before
+ * the membarrier() is executed will be visible to any code executing on
+ * CPU1 after the IPI-induced memory barrier:
+ *
+ * CPU0 CPU1
+ *
+ * x = 1
+ * membarrier():
+ * a: smp_mb()
+ * b: send IPI IPI-induced mb
+ * c: smp_mb()
+ * r2 = y
+ * y = 1
+ * barrier()
+ * r1 = x
+ *
+ * BUG_ON(r1 == 0 && r2 == 0)
+ *
+ * The write to y and load from x by CPU1 are unordered by the hardware,
+ * so it's possible to have "r1 = x" reordered before "y = 1" at any
+ * point after (b). If the memory barrier at (a) is omitted, then "x = 1"
+ * can be reordered after (a) (although not after (c)), so we get r1 == 0
+ * and r2 == 0. This violates the guarantee that membarrier() is
+ * supposed to provide.
+ *
+ * The timing of the memory barrier at (a) has to ensure that it executes
+ * before the IPI-induced memory barrier on CPU1.
+ *
+ * B) Userspace thread execution before IPI vs membarrier's memory
+ * barrier after completing the IPI
+ *
+ * Userspace variables:
+ *
+ * int x = 0, y = 0;
+ *
+ * The memory barrier at the end of membarrier() on CPU0 is necessary in
+ * order to enforce the guarantee that any writes occurring on CPU1 before
+ * the membarrier() is executed will be visible to any code executing on
+ * CPU0 after the membarrier():
+ *
+ * CPU0 CPU1
+ *
+ * x = 1
+ * barrier()
+ * y = 1
+ * r2 = y
+ * membarrier():
+ * a: smp_mb()
+ * b: send IPI IPI-induced mb
+ * c: smp_mb()
+ * r1 = x
+ * BUG_ON(r1 == 0 && r2 == 1)
+ *
+ * The writes to x and y are unordered by the hardware, so it's possible to
+ * have "r2 = 1" even though the write to x doesn't execute until (b). If
+ * the memory barrier at (c) is omitted then "r1 = x" can be reordered
+ * before (b) (although not before (a)), so we get "r1 = 0". This violates
+ * the guarantee that membarrier() is supposed to provide.
+ *
+ * The timing of the memory barrier at (c) has to ensure that it executes
+ * after the IPI-induced memory barrier on CPU1.
+ *
+ * C) Scheduling userspace thread -> kthread -> userspace thread vs membarrier
+ *
+ * CPU0 CPU1
+ *
+ * membarrier():
+ * a: smp_mb()
+ * d: switch to kthread (includes mb)
+ * b: read rq->curr->mm == NULL
+ * e: switch to user (includes mb)
+ * c: smp_mb()
+ *
+ * Using the scenario from (A), we can show that (a) needs to be paired
+ * with (e). Using the scenario from (B), we can show that (c) needs to
+ * be paired with (d).
+ *
+ * D) exit_mm vs membarrier
+ *
+ * Two thread groups are created, A and B. Thread group B is created by
+ * issuing clone from group A with flag CLONE_VM set, but not CLONE_THREAD.
+ * Let's assume we have a single thread within each thread group (Thread A
+ * and Thread B). Thread A runs on CPU0, Thread B runs on CPU1.
+ *
+ * CPU0 CPU1
+ *
+ * membarrier():
+ * a: smp_mb()
+ * exit_mm():
+ * d: smp_mb()
+ * e: current->mm = NULL
+ * b: read rq->curr->mm == NULL
+ * c: smp_mb()
+ *
+ * Using scenario (B), we can show that (c) needs to be paired with (d).
+ *
+ * E) kthread_{use,unuse}_mm vs membarrier
+ *
+ * CPU0 CPU1
+ *
+ * membarrier():
+ * a: smp_mb()
+ * kthread_unuse_mm()
+ * d: smp_mb()
+ * e: current->mm = NULL
+ * b: read rq->curr->mm == NULL
+ * kthread_use_mm()
+ * f: current->mm = mm
+ * g: smp_mb()
+ * c: smp_mb()
+ *
+ * Using the scenario from (A), we can show that (a) needs to be paired
+ * with (g). Using the scenario from (B), we can show that (c) needs to
+ * be paired with (d).
+ */
+
+/*
* Bitmask made from a "or" of all commands within enum membarrier_cmd,
* except MEMBARRIER_CMD_QUERY.
*/
--
2.11.0
* Re: [RFC PATCH 1/3] sched: fix exit_mm vs membarrier (v2)
From: Boqun Feng @ 2020-08-16 15:23 UTC
To: Mathieu Desnoyers
Cc: Peter Zijlstra, linux-kernel, Will Deacon, Paul E. McKenney,
Andy Lutomirski, Andrew Morton, Alan Stern, Nicholas Piggin,
Thomas Gleixner, Linus Torvalds, linux-mm
Hi Mathieu,
On Fri, Aug 14, 2020 at 12:43:56PM -0400, Mathieu Desnoyers wrote:
> exit_mm should issue memory barriers after user-space memory accesses
> and before clearing current->mm, so that accesses performed prior to
> exit_mm are ordered before tsk->mm is cleared; once tsk->mm is NULL,
> the membarrier private expedited IPIs skip that CPU.
>
> The membarrier system call can be issued concurrently with do_exit
> if we have thread groups created with CLONE_VM but not CLONE_THREAD.
>
> Here is the scenario I have in mind:
>
> Two thread groups are created, A and B. Thread group B is created by
> issuing clone from group A with flag CLONE_VM set, but not CLONE_THREAD.
> Let's assume we have a single thread within each thread group (Thread A
> and Thread B).
>
> AFAIU, we can have:
>
> Userspace variables:
>
> int x = 0, y = 0;
>
> CPU 0 CPU 1
> Thread A Thread B
> (in thread group A) (in thread group B)
>
> x = 1
> barrier()
> y = 1
> exit()
> exit_mm()
> current->mm = NULL;
> r1 = load y
> membarrier()
> skips CPU 0 (no IPI) because its current mm is NULL
> r2 = load x
> BUG_ON(r1 == 1 && r2 == 0)
>
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
> Cc: Will Deacon <will@kernel.org>
> Cc: Paul E. McKenney <paulmck@kernel.org>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Andy Lutomirski <luto@amacapital.net>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Alan Stern <stern@rowland.harvard.edu>
> Cc: linux-mm@kvack.org
> ---
> Changes since v1:
> - Use smp_mb__after_spinlock rather than smp_mb.
> - Document race scenario in commit message.
> ---
> kernel/exit.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 733e80f334e7..fe64e6e28dd5 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -475,6 +475,14 @@ static void exit_mm(void)
> BUG_ON(mm != current->active_mm);
> /* more a memory barrier than a real lock */
> task_lock(current);
> + /*
> + * When a thread stops operating on an address space, the loop
> + * in membarrier_{private,global}_expedited() may not observe
Is it accurate to say that the correctness of
membarrier_global_expedited() relies on the observation of ->mm? Because
IIUC the membarrier_global_expedited() loop doesn't check ->mm.
Regards,
Boqun
> + * that tsk->mm, and not issue an IPI. Membarrier requires a
> + * memory barrier after accessing user-space memory, before
> + * clearing tsk->mm.
> + */
> + smp_mb__after_spinlock();
> current->mm = NULL;
> mmap_read_unlock(mm);
> enter_lazy_tlb(mm, current);
> --
> 2.11.0
>
* Re: [RFC PATCH 2/3] sched: membarrier: cover kthread_use_mm (v2)
From: Boqun Feng @ 2020-08-16 15:29 UTC
To: Mathieu Desnoyers
Cc: Peter Zijlstra, linux-kernel, Will Deacon, Paul E. McKenney,
Andy Lutomirski, Andrew Morton, Alan Stern, Nicholas Piggin
On Fri, Aug 14, 2020 at 12:43:57PM -0400, Mathieu Desnoyers wrote:
> Add comments and memory barrier to kthread_use_mm and kthread_unuse_mm
> to allow the effect of membarrier(2) to apply to kthreads accessing
> user-space memory as well.
>
> Given that no prior kthread uses this guarantee and that it only affects
> kthreads, adding this guarantee does not affect the user-space ABI.
>
> Refine the check in membarrier_global_expedited to exclude runqueues
> running the idle thread rather than all kthreads from the IPI cpumask.
>
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
> Cc: Will Deacon <will@kernel.org>
> Cc: Paul E. McKenney <paulmck@kernel.org>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Andy Lutomirski <luto@amacapital.net>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> ---
> Changes since v1:
> - Add WARN_ON_ONCE(current->mm) in play_idle_precise (PeterZ),
> - Use smp_mb__after_spinlock rather than smp_mb after task_lock.
> ---
> kernel/kthread.c | 19 +++++++++++++++++++
> kernel/sched/idle.c | 1 +
> kernel/sched/membarrier.c | 8 ++------
> 3 files changed, 22 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/kthread.c b/kernel/kthread.c
> index 3edaa380dc7b..77aaaa7bc8d9 100644
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -1255,8 +1255,19 @@ void kthread_use_mm(struct mm_struct *mm)
> finish_arch_post_lock_switch();
> #endif
>
> + /*
> + * When a kthread starts operating on an address space, the loop
> + * in membarrier_{private,global}_expedited() may not observe
> + * that tsk->mm, and not issue an IPI. Membarrier requires a
> + * memory barrier after storing to tsk->mm, before accessing
> + * user-space memory. A full memory barrier for membarrier
> + * {PRIVATE,GLOBAL}_EXPEDITED is implicitly provided by
> + * mmdrop(), or explicitly with smp_mb().
> + */
> if (active_mm != mm)
> mmdrop(active_mm);
> + else
> + smp_mb();
Similar question here: could smp_mb() guarantee the correctness of
GLOBAL_EXPEDITED? Don't you need membarrier_switch_mm() here and in
kthread_unuse_mm(), too?
Am I missing something here?
Regards,
Boqun
>
> to_kthread(tsk)->oldfs = force_uaccess_begin();
> }
> @@ -1276,6 +1287,14 @@ void kthread_unuse_mm(struct mm_struct *mm)
> force_uaccess_end(to_kthread(tsk)->oldfs);
>
> task_lock(tsk);
> + /*
> + * When a kthread stops operating on an address space, the loop
> + * in membarrier_{private,global}_expedited() may not observe
> + * that tsk->mm, and not issue an IPI. Membarrier requires a
> + * memory barrier after accessing user-space memory, before
> + * clearing tsk->mm.
> + */
> + smp_mb__after_spinlock();
> sync_mm_rss(mm);
> local_irq_disable();
> tsk->mm = NULL;
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index 6bf34986f45c..3443ee8335d0 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -341,6 +341,7 @@ void play_idle_precise(u64 duration_ns, u64 latency_ns)
> WARN_ON_ONCE(!(current->flags & PF_KTHREAD));
> WARN_ON_ONCE(!(current->flags & PF_NO_SETAFFINITY));
> WARN_ON_ONCE(!duration_ns);
> + WARN_ON_ONCE(current->mm);
>
> rcu_sleep_check();
> preempt_disable();
> diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
> index 168479a7d61b..8a294483074d 100644
> --- a/kernel/sched/membarrier.c
> +++ b/kernel/sched/membarrier.c
> @@ -100,13 +100,9 @@ static int membarrier_global_expedited(void)
> MEMBARRIER_STATE_GLOBAL_EXPEDITED))
> continue;
>
> - /*
> - * Skip the CPU if it runs a kernel thread. The scheduler
> - * leaves the prior task mm in place as an optimization when
> - * scheduling a kthread.
> - */
> + /* Skip the CPU if it runs the idle thread. */
> p = rcu_dereference(cpu_rq(cpu)->curr);
> - if (p->flags & PF_KTHREAD)
> + if (is_idle_task(p))
> continue;
>
> __cpumask_set_cpu(cpu, tmpmask);
> --
> 2.11.0
>
* Re: [RFC PATCH 2/3] sched: membarrier: cover kthread_use_mm (v2)
From: Mathieu Desnoyers @ 2020-08-16 21:05 UTC
To: Hillf Danton
Cc: Peter Zijlstra, linux-kernel, Will Deacon, paulmck,
Andy Lutomirski, Andrew Morton, Alan Stern, Nicholas Piggin
----- On Aug 16, 2020, at 3:09 AM, Hillf Danton hdanton@sina.com wrote:
> On Fri, 14 Aug 2020 12:43:57 -0400 Mathieu Desnoyers wrote:
>>
>> Given that no prior kthread uses this guarantee and that it only affects
>> kthreads, adding this guarantee does not affect the user-space ABI.
>
> Can you expand a bit on why kthreads like ksoftirqd have to ack the
> IPIs from Dave, who does not have CAP_SYS_ADMIN?
Do ksoftirqd kthreads ever use kthread_use_mm() to access user
processes' memory? If not, then they won't be disturbed by any IPI.
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
* Re: [RFC PATCH 2/3] sched: membarrier: cover kthread_use_mm (v2)
From: Mathieu Desnoyers @ 2020-08-24 15:27 UTC
To: Boqun Feng
Cc: Peter Zijlstra, linux-kernel, Will Deacon, paulmck,
Andy Lutomirski, Andrew Morton, Alan Stern, Nicholas Piggin
----- On Aug 16, 2020, at 11:29 AM, Boqun Feng boqun.feng@gmail.com wrote:
> On Fri, Aug 14, 2020 at 12:43:57PM -0400, Mathieu Desnoyers wrote:
>> Add comments and memory barrier to kthread_use_mm and kthread_unuse_mm
>> to allow the effect of membarrier(2) to apply to kthreads accessing
>> user-space memory as well.
>>
>> Given that no prior kthread uses this guarantee and that it only affects
>> kthreads, adding this guarantee does not affect the user-space ABI.
>>
>> Refine the check in membarrier_global_expedited to exclude runqueues
>> running the idle thread rather than all kthreads from the IPI cpumask.
>>
>> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>> Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
>> Cc: Will Deacon <will@kernel.org>
>> Cc: Paul E. McKenney <paulmck@kernel.org>
>> Cc: Nicholas Piggin <npiggin@gmail.com>
>> Cc: Andy Lutomirski <luto@amacapital.net>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> ---
>> Changes since v1:
>> - Add WARN_ON_ONCE(current->mm) in play_idle_precise (PeterZ),
>> - Use smp_mb__after_spinlock rather than smp_mb after task_lock.
>> ---
>> kernel/kthread.c | 19 +++++++++++++++++++
>> kernel/sched/idle.c | 1 +
>> kernel/sched/membarrier.c | 8 ++------
>> 3 files changed, 22 insertions(+), 6 deletions(-)
>>
>> diff --git a/kernel/kthread.c b/kernel/kthread.c
>> index 3edaa380dc7b..77aaaa7bc8d9 100644
>> --- a/kernel/kthread.c
>> +++ b/kernel/kthread.c
>> @@ -1255,8 +1255,19 @@ void kthread_use_mm(struct mm_struct *mm)
>> finish_arch_post_lock_switch();
>> #endif
>>
>> + /*
>> + * When a kthread starts operating on an address space, the loop
>> + * in membarrier_{private,global}_expedited() may not observe
>> + * that tsk->mm, and not issue an IPI. Membarrier requires a
>> + * memory barrier after storing to tsk->mm, before accessing
>> + * user-space memory. A full memory barrier for membarrier
>> + * {PRIVATE,GLOBAL}_EXPEDITED is implicitly provided by
>> + * mmdrop(), or explicitly with smp_mb().
>> + */
>> if (active_mm != mm)
>> mmdrop(active_mm);
>> + else
>> + smp_mb();
>
> Similar question here: could smp_mb() guarantee the correctness of
> GLOBAL_EXPEDITED? Don't you need membarrier_switch_mm() here and in
> kthread_unuse_mm(), too?
>
> Am I missing something here?
I think you have a good point there, which brings me to wonder why we
don't have membarrier_switch_mm() when entering/leaving lazy TLB mode.
This means an IPI can be sent to a kthread even if it does not use an
mm, just because the membarrier state in the runqueue is that of the
active_mm.
Thoughts?
Thanks,
Mathieu
>
> Regards,
> Boqun
>
>>
>> to_kthread(tsk)->oldfs = force_uaccess_begin();
>> }
>> @@ -1276,6 +1287,14 @@ void kthread_unuse_mm(struct mm_struct *mm)
>> force_uaccess_end(to_kthread(tsk)->oldfs);
>>
>> task_lock(tsk);
>> + /*
>> + * When a kthread stops operating on an address space, the loop
>> + * in membarrier_{private,global}_expedited() may not observe
>> + * that tsk->mm, and not issue an IPI. Membarrier requires a
>> + * memory barrier after accessing user-space memory, before
>> + * clearing tsk->mm.
>> + */
>> + smp_mb__after_spinlock();
>> sync_mm_rss(mm);
>> local_irq_disable();
>> tsk->mm = NULL;
>> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
>> index 6bf34986f45c..3443ee8335d0 100644
>> --- a/kernel/sched/idle.c
>> +++ b/kernel/sched/idle.c
>> @@ -341,6 +341,7 @@ void play_idle_precise(u64 duration_ns, u64 latency_ns)
>> WARN_ON_ONCE(!(current->flags & PF_KTHREAD));
>> WARN_ON_ONCE(!(current->flags & PF_NO_SETAFFINITY));
>> WARN_ON_ONCE(!duration_ns);
>> + WARN_ON_ONCE(current->mm);
>>
>> rcu_sleep_check();
>> preempt_disable();
>> diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
>> index 168479a7d61b..8a294483074d 100644
>> --- a/kernel/sched/membarrier.c
>> +++ b/kernel/sched/membarrier.c
>> @@ -100,13 +100,9 @@ static int membarrier_global_expedited(void)
>> MEMBARRIER_STATE_GLOBAL_EXPEDITED))
>> continue;
>>
>> - /*
>> - * Skip the CPU if it runs a kernel thread. The scheduler
>> - * leaves the prior task mm in place as an optimization when
>> - * scheduling a kthread.
>> - */
>> + /* Skip the CPU if it runs the idle thread. */
>> p = rcu_dereference(cpu_rq(cpu)->curr);
>> - if (p->flags & PF_KTHREAD)
>> + if (is_idle_task(p))
>> continue;
>>
>> __cpumask_set_cpu(cpu, tmpmask);
>> --
>> 2.11.0
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
* Re: [RFC PATCH 2/3] sched: membarrier: cover kthread_use_mm (v2)
From: Boqun Feng @ 2020-08-25 2:06 UTC
To: Mathieu Desnoyers
Cc: Peter Zijlstra, linux-kernel, Will Deacon, paulmck,
Andy Lutomirski, Andrew Morton, Alan Stern, Nicholas Piggin
On Mon, Aug 24, 2020 at 11:27:49AM -0400, Mathieu Desnoyers wrote:
> ----- On Aug 16, 2020, at 11:29 AM, Boqun Feng boqun.feng@gmail.com wrote:
>
> > On Fri, Aug 14, 2020 at 12:43:57PM -0400, Mathieu Desnoyers wrote:
> >> Add comments and memory barrier to kthread_use_mm and kthread_unuse_mm
> >> to allow the effect of membarrier(2) to apply to kthreads accessing
> >> user-space memory as well.
> >>
> >> Given that no prior kthread uses this guarantee and that it only affects
> >> kthreads, adding this guarantee does not affect the user-space ABI.
> >>
> >> Refine the check in membarrier_global_expedited to exclude runqueues
> >> running the idle thread rather than all kthreads from the IPI cpumask.
> >>
> >> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> >> Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
> >> Cc: Will Deacon <will@kernel.org>
> >> Cc: Paul E. McKenney <paulmck@kernel.org>
> >> Cc: Nicholas Piggin <npiggin@gmail.com>
> >> Cc: Andy Lutomirski <luto@amacapital.net>
> >> Cc: Andrew Morton <akpm@linux-foundation.org>
> >> ---
> >> Changes since v1:
> >> - Add WARN_ON_ONCE(current->mm) in play_idle_precise (PeterZ),
> >> - Use smp_mb__after_spinlock rather than smp_mb after task_lock.
> >> ---
> >> kernel/kthread.c | 19 +++++++++++++++++++
> >> kernel/sched/idle.c | 1 +
> >> kernel/sched/membarrier.c | 8 ++------
> >> 3 files changed, 22 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/kernel/kthread.c b/kernel/kthread.c
> >> index 3edaa380dc7b..77aaaa7bc8d9 100644
> >> --- a/kernel/kthread.c
> >> +++ b/kernel/kthread.c
> >> @@ -1255,8 +1255,19 @@ void kthread_use_mm(struct mm_struct *mm)
> >> finish_arch_post_lock_switch();
> >> #endif
> >>
> >> + /*
> >> + * When a kthread starts operating on an address space, the loop
> >> + * in membarrier_{private,global}_expedited() may not observe
> >> + * that tsk->mm, and not issue an IPI. Membarrier requires a
> >> + * memory barrier after storing to tsk->mm, before accessing
> >> + * user-space memory. A full memory barrier for membarrier
> >> + * {PRIVATE,GLOBAL}_EXPEDITED is implicitly provided by
> >> + * mmdrop(), or explicitly with smp_mb().
> >> + */
> >> if (active_mm != mm)
> >> mmdrop(active_mm);
> >> + else
> >> + smp_mb();
> >
> > Similar question here: could smp_mb() guarantee the correctness of
> > GLOBAL_EXPEDITED? Don't you need membarrier_switch_mm() here and in
> > kthread_unuse_mm(), too?
> >
> > Am I missing something here?
>
> I think you have a good point there, which brings me to wonder why we
> don't have membarrier_switch_mm() when entering/leaving lazy TLB mode.
> This means an IPI can be sent to a kthread even if it does not use an
> mm, just because the membarrier state in the runqueue is that of the
> active_mm.
>
> Thoughts?
>
Right, I think we should also handle the percpu membarrier_state. The
basic rule is whenever we change current->mm or current (i.e. rq->curr)
itself, we need to update the percpu membarrier_state accordingly.
Regards,
Boqun
> Thanks,
>
> Mathieu
>
> >
> > Regards,
> > Boqun
> >
> >>
> >> to_kthread(tsk)->oldfs = force_uaccess_begin();
> >> }
> >> @@ -1276,6 +1287,14 @@ void kthread_unuse_mm(struct mm_struct *mm)
> >> force_uaccess_end(to_kthread(tsk)->oldfs);
> >>
> >> task_lock(tsk);
> >> + /*
> >> + * When a kthread stops operating on an address space, the loop
> >> + * in membarrier_{private,global}_expedited() may not observe
> >> + * that tsk->mm, and not issue an IPI. Membarrier requires a
> >> + * memory barrier after accessing user-space memory, before
> >> + * clearing tsk->mm.
> >> + */
> >> + smp_mb__after_spinlock();
> >> sync_mm_rss(mm);
> >> local_irq_disable();
> >> tsk->mm = NULL;
> >> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> >> index 6bf34986f45c..3443ee8335d0 100644
> >> --- a/kernel/sched/idle.c
> >> +++ b/kernel/sched/idle.c
> >> @@ -341,6 +341,7 @@ void play_idle_precise(u64 duration_ns, u64 latency_ns)
> >> WARN_ON_ONCE(!(current->flags & PF_KTHREAD));
> >> WARN_ON_ONCE(!(current->flags & PF_NO_SETAFFINITY));
> >> WARN_ON_ONCE(!duration_ns);
> >> + WARN_ON_ONCE(current->mm);
> >>
> >> rcu_sleep_check();
> >> preempt_disable();
> >> diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
> >> index 168479a7d61b..8a294483074d 100644
> >> --- a/kernel/sched/membarrier.c
> >> +++ b/kernel/sched/membarrier.c
> >> @@ -100,13 +100,9 @@ static int membarrier_global_expedited(void)
> >> MEMBARRIER_STATE_GLOBAL_EXPEDITED))
> >> continue;
> >>
> >> - /*
> >> - * Skip the CPU if it runs a kernel thread. The scheduler
> >> - * leaves the prior task mm in place as an optimization when
> >> - * scheduling a kthread.
> >> - */
> >> + /* Skip the CPU if it runs the idle thread. */
> >> p = rcu_dereference(cpu_rq(cpu)->curr);
> >> - if (p->flags & PF_KTHREAD)
> >> + if (is_idle_task(p))
> >> continue;
> >>
> >> __cpumask_set_cpu(cpu, tmpmask);
> >> --
> >> 2.11.0
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com
* Re: [RFC PATCH 1/3] sched: fix exit_mm vs membarrier (v2)
From: Mathieu Desnoyers @ 2020-09-24 15:01 UTC
To: Boqun Feng
Cc: Peter Zijlstra, linux-kernel, Will Deacon, paulmck,
Andy Lutomirski, Andrew Morton, Alan Stern, Nicholas Piggin,
Thomas Gleixner, Linus Torvalds, linux-mm
----- On Aug 16, 2020, at 11:23 AM, Boqun Feng boqun.feng@gmail.com wrote:
> Hi Mathieu,
>
> On Fri, Aug 14, 2020 at 12:43:56PM -0400, Mathieu Desnoyers wrote:
>> exit_mm should issue memory barriers after user-space memory accesses
>> and before clearing current->mm, so that accesses performed prior to
>> exit_mm are ordered before tsk->mm is cleared; once tsk->mm is NULL,
>> the membarrier private expedited IPIs skip that CPU.
>>
>> The membarrier system call can be issued concurrently with do_exit
>> if we have thread groups created with CLONE_VM but not CLONE_THREAD.
>>
>> Here is the scenario I have in mind:
>>
>> Two thread groups are created, A and B. Thread group B is created by
>> issuing clone from group A with flag CLONE_VM set, but not CLONE_THREAD.
>> Let's assume we have a single thread within each thread group (Thread A
>> and Thread B).
>>
>> AFAIU, we can have:
>>
>> Userspace variables:
>>
>> int x = 0, y = 0;
>>
>> CPU 0 CPU 1
>> Thread A Thread B
>> (in thread group A) (in thread group B)
>>
>> x = 1
>> barrier()
>> y = 1
>> exit()
>> exit_mm()
>> current->mm = NULL;
>> r1 = load y
>> membarrier()
>> skips CPU 0 (no IPI) because its current mm is NULL
>> r2 = load x
>> BUG_ON(r1 == 1 && r2 == 0)
>>
>> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>> Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
>> Cc: Will Deacon <will@kernel.org>
>> Cc: Paul E. McKenney <paulmck@kernel.org>
>> Cc: Nicholas Piggin <npiggin@gmail.com>
>> Cc: Andy Lutomirski <luto@amacapital.net>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Linus Torvalds <torvalds@linux-foundation.org>
>> Cc: Alan Stern <stern@rowland.harvard.edu>
>> Cc: linux-mm@kvack.org
>> ---
>> Changes since v1:
>> - Use smp_mb__after_spinlock rather than smp_mb.
>> - Document race scenario in commit message.
>> ---
>> kernel/exit.c | 8 ++++++++
>> 1 file changed, 8 insertions(+)
>>
>> diff --git a/kernel/exit.c b/kernel/exit.c
>> index 733e80f334e7..fe64e6e28dd5 100644
>> --- a/kernel/exit.c
>> +++ b/kernel/exit.c
>> @@ -475,6 +475,14 @@ static void exit_mm(void)
>> BUG_ON(mm != current->active_mm);
>> /* more a memory barrier than a real lock */
>> task_lock(current);
>> + /*
>> + * When a thread stops operating on an address space, the loop
>> + * in membarrier_{private,global}_expedited() may not observe
>
> Is it accurate to say that the correctness of
> membarrier_global_expedited() relies on the observation of ->mm? Because
> IIUC the membarrier_global_expedited() loop doesn't check ->mm.
Good point, I was wrong. Will instead reword as:
/*
* When a thread stops operating on an address space, the loop
* in membarrier_private_expedited() may not observe that
* tsk->mm, and the loop in membarrier_global_expedited() may
* not observe a MEMBARRIER_STATE_GLOBAL_EXPEDITED
* rq->membarrier_state, so those would not issue an IPI.
* Membarrier requires a memory barrier after accessing
* user-space memory, before clearing tsk->mm or the
* rq->membarrier_state.
*/
And I'll make sure exit_mm clears this_rq()->membarrier_state as well.
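
Concretely, I am thinking of something along the following lines (helper
name and exact placement tentative):

/* Tentative sketch; called from exit_mm() around clearing tsk->mm. */
void membarrier_update_current_mm(struct mm_struct *next_mm)
{
	struct rq *rq = this_rq();
	int membarrier_state = 0;

	if (next_mm)
		membarrier_state = atomic_read(&next_mm->membarrier_state);
	if (READ_ONCE(rq->membarrier_state) == membarrier_state)
		return;
	WRITE_ONCE(rq->membarrier_state, membarrier_state);
}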
Thanks,
Mathieu
>
> Regards,
> Boqun
>
>> + * that tsk->mm, and not issue an IPI. Membarrier requires a
>> + * memory barrier after accessing user-space memory, before
>> + * clearing tsk->mm.
>> + */
>> + smp_mb__after_spinlock();
>> current->mm = NULL;
>> mmap_read_unlock(mm);
>> enter_lazy_tlb(mm, current);
>> --
>> 2.11.0
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
* Re: [RFC PATCH 2/3] sched: membarrier: cover kthread_use_mm (v2)
From: Mathieu Desnoyers @ 2020-09-24 15:26 UTC
To: Boqun Feng
Cc: Peter Zijlstra, linux-kernel, Will Deacon, paulmck,
Andy Lutomirski, Andrew Morton, Alan Stern, Nicholas Piggin
----- On Aug 24, 2020, at 10:06 PM, Boqun Feng boqun.feng@gmail.com wrote:
> On Mon, Aug 24, 2020 at 11:27:49AM -0400, Mathieu Desnoyers wrote:
>> ----- On Aug 16, 2020, at 11:29 AM, Boqun Feng boqun.feng@gmail.com wrote:
>>
>> > On Fri, Aug 14, 2020 at 12:43:57PM -0400, Mathieu Desnoyers wrote:
>> >> Add comments and memory barrier to kthread_use_mm and kthread_unuse_mm
>> >> to allow the effect of membarrier(2) to apply to kthreads accessing
>> >> user-space memory as well.
>> >>
>> >> Given that no prior kthread uses this guarantee and that it only affects
>> >> kthreads, adding this guarantee does not affect the user-space ABI.
>> >>
>> >> Refine the check in membarrier_global_expedited to exclude runqueues
>> >> running the idle thread rather than all kthreads from the IPI cpumask.
>> >>
>> >> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>> >> Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
>> >> Cc: Will Deacon <will@kernel.org>
>> >> Cc: Paul E. McKenney <paulmck@kernel.org>
>> >> Cc: Nicholas Piggin <npiggin@gmail.com>
>> >> Cc: Andy Lutomirski <luto@amacapital.net>
>> >> Cc: Andrew Morton <akpm@linux-foundation.org>
>> >> ---
>> >> Changes since v1:
>> >> - Add WARN_ON_ONCE(current->mm) in play_idle_precise (PeterZ),
>> >> - Use smp_mb__after_spinlock rather than smp_mb after task_lock.
>> >> ---
>> >> kernel/kthread.c | 19 +++++++++++++++++++
>> >> kernel/sched/idle.c | 1 +
>> >> kernel/sched/membarrier.c | 8 ++------
>> >> 3 files changed, 22 insertions(+), 6 deletions(-)
>> >>
>> >> diff --git a/kernel/kthread.c b/kernel/kthread.c
>> >> index 3edaa380dc7b..77aaaa7bc8d9 100644
>> >> --- a/kernel/kthread.c
>> >> +++ b/kernel/kthread.c
>> >> @@ -1255,8 +1255,19 @@ void kthread_use_mm(struct mm_struct *mm)
>> >> finish_arch_post_lock_switch();
>> >> #endif
>> >>
>> >> + /*
>> >> + * When a kthread starts operating on an address space, the loop
>> >> + * in membarrier_{private,global}_expedited() may not observe
>> >> + * that tsk->mm, and not issue an IPI. Membarrier requires a
>> >> + * memory barrier after storing to tsk->mm, before accessing
>> >> + * user-space memory. A full memory barrier for membarrier
>> >> + * {PRIVATE,GLOBAL}_EXPEDITED is implicitly provided by
>> >> + * mmdrop(), or explicitly with smp_mb().
>> >> + */
>> >> if (active_mm != mm)
>> >> mmdrop(active_mm);
>> >> + else
>> >> + smp_mb();
>> >
>> > Similar question here: could smp_mb() guarantee the correctness of
>> > GLOBAL_EXPEDITED? Don't you need membarrier_switch_mm() here and in
>> > kthread_unuse_mm(), too?
>> >
>> > Am I missing something here?
>>
>> I think you have a good point there, which brings me to wonder why we
>> don't have membarrier_switch_mm() when entering/leaving lazy TLB mode.
>> This means an IPI can be sent to a kthread even if it does not use an
>> mm, just because the membarrier state in the runqueue is that of the
>> active_mm.
>>
>> Thoughts?
>>
>
> Right, I think we should also handle the percpu membarrier_state. The
> basic rule is whenever we change current->mm or current (i.e. rq->curr)
> itself, we need to update the percpu membarrier_state accordingly.
OK, so as we introduce IPIs to kthreads which are using kthread_use_mm, we need to
reconsider how the scheduler deals with runqueues entering lazy TLB state. Currently,
membarrier_switch_mm() is not issued when entering lazy TLB state. But as we
start considering kthreads as candidates for sending IPIs, we need to update
the rq->membarrier_state whenever we enter lazy TLB state as well.
So I plan to do the following change to cover this:
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 84758f34cdb0..44521dc5602a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3736,6 +3736,8 @@ context_switch(struct rq *rq, struct task_struct *prev,
*/
arch_start_context_switch(prev);
+ membarrier_switch_mm(rq, prev->mm, next->mm);
+
/*
* kernel -> kernel lazy + transfer active
* user -> kernel lazy + mmgrab() active
@@ -3752,7 +3754,6 @@ context_switch(struct rq *rq, struct task_struct *prev,
else
prev->active_mm = NULL;
} else { // to user
- membarrier_switch_mm(rq, prev->active_mm, next->mm);
/*
* sys_membarrier() requires an smp_mb() between setting
* rq->curr / membarrier_switch_mm() and returning to userspace.
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3fd283892761..481149066086 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2592,12 +2592,13 @@ static inline void membarrier_switch_mm(struct rq *rq,
struct mm_struct *prev_mm,
struct mm_struct *next_mm)
{
- int membarrier_state;
+ int membarrier_state = 0;
if (prev_mm == next_mm)
return;
- membarrier_state = atomic_read(&next_mm->membarrier_state);
+ if (next_mm)
+ membarrier_state = atomic_read(&next_mm->membarrier_state);
if (READ_ONCE(rq->membarrier_state) == membarrier_state)
return;
Thoughts?
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com