linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH 0/4] Membarrier fixes/cleanups
@ 2019-09-06  3:12 Mathieu Desnoyers
  2019-09-06  3:12 ` [RFC PATCH 1/4] Fix: sched/membarrier: private expedited registration check Mathieu Desnoyers
                   ` (3 more replies)
  0 siblings, 4 replies; 21+ messages in thread
From: Mathieu Desnoyers @ 2019-09-06  3:12 UTC (permalink / raw)
  To: Peter Zijlstra, Paul E. McKenney
  Cc: Ingo Molnar, linux-kernel, Oleg Nesterov, Eric W. Biederman,
	Linus Torvalds, Russell King - ARM Linux admin, Chris Metcalf,
	Christoph Lameter, Kirill Tkhai, Mike Galbraith, Thomas Gleixner,
	Mathieu Desnoyers

This series of fixes/cleanups is submitted for feedback. It takes care
of membarrier issues recently discussed.

Thanks,

Mathieu

Mathieu Desnoyers (4):
  Fix: sched/membarrier: private expedited registration check
  Cleanup: sched/membarrier: remove redundant check
  Cleanup: sched/membarrier: only sync_core before usermode for same mm
  Fix: sched/membarrier: p->mm->membarrier_state racy load

 include/linux/mm_types.h  |   7 +-
 include/linux/sched/mm.h  |   8 +--
 kernel/sched/core.c       |   1 +
 kernel/sched/membarrier.c | 141 +++++++++++++++++++++++++++-----------
 kernel/sched/sched.h      |  33 +++++++++
 5 files changed, 143 insertions(+), 47 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [RFC PATCH 1/4] Fix: sched/membarrier: private expedited registration check
  2019-09-06  3:12 [RFC PATCH 0/4] Membarrier fixes/cleanups Mathieu Desnoyers
@ 2019-09-06  3:12 ` Mathieu Desnoyers
  2019-09-06  3:12 ` [RFC PATCH 2/4] Cleanup: sched/membarrier: remove redundant check Mathieu Desnoyers
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 21+ messages in thread
From: Mathieu Desnoyers @ 2019-09-06  3:12 UTC (permalink / raw)
  To: Peter Zijlstra, Paul E. McKenney
  Cc: Ingo Molnar, linux-kernel, Oleg Nesterov, Eric W. Biederman,
	Linus Torvalds, Russell King - ARM Linux admin, Chris Metcalf,
	Christoph Lameter, Kirill Tkhai, Mike Galbraith, Thomas Gleixner,
	Mathieu Desnoyers

Fix a logic flaw in the way membarrier_register_private_expedited()
handles ready state checks for private expedited sync core and private
expedited registrations.

If a private expedited membarrier registration is first performed, and
then a private expedited sync_core registration is performed, the ready
state check will skip the second registration when it really should not.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Paul E. McKenney" <paulmck@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/membarrier.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index aa8d75804108..5110d91b1b0e 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -226,7 +226,7 @@ static int membarrier_register_private_expedited(int flags)
 	 * groups, which use the same mm. (CLONE_VM but not
 	 * CLONE_THREAD).
 	 */
-	if (atomic_read(&mm->membarrier_state) & state)
+	if ((atomic_read(&mm->membarrier_state) & state) == state)
 		return 0;
 	atomic_or(MEMBARRIER_STATE_PRIVATE_EXPEDITED, &mm->membarrier_state);
 	if (flags & MEMBARRIER_FLAG_SYNC_CORE)
-- 
2.17.1


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [RFC PATCH 2/4] Cleanup: sched/membarrier: remove redundant check
  2019-09-06  3:12 [RFC PATCH 0/4] Membarrier fixes/cleanups Mathieu Desnoyers
  2019-09-06  3:12 ` [RFC PATCH 1/4] Fix: sched/membarrier: private expedited registration check Mathieu Desnoyers
@ 2019-09-06  3:12 ` Mathieu Desnoyers
  2019-09-06  3:12 ` [RFC PATCH 3/4] Cleanup: sched/membarrier: only sync_core before usermode for same mm Mathieu Desnoyers
  2019-09-06  3:13 ` [RFC PATCH 4/4] Fix: sched/membarrier: p->mm->membarrier_state racy load Mathieu Desnoyers
  3 siblings, 0 replies; 21+ messages in thread
From: Mathieu Desnoyers @ 2019-09-06  3:12 UTC (permalink / raw)
  To: Peter Zijlstra, Paul E. McKenney
  Cc: Ingo Molnar, linux-kernel, Oleg Nesterov, Eric W. Biederman,
	Linus Torvalds, Russell King - ARM Linux admin, Chris Metcalf,
	Christoph Lameter, Kirill Tkhai, Mike Galbraith, Thomas Gleixner,
	Mathieu Desnoyers

Checking that the number of threads is 1 is redundant with checking
mm_users == 1.

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Paul E. McKenney" <paulmck@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/membarrier.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index 5110d91b1b0e..7e0a0d6535f3 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -186,7 +186,7 @@ static int membarrier_register_global_expedited(void)
 	    MEMBARRIER_STATE_GLOBAL_EXPEDITED_READY)
 		return 0;
 	atomic_or(MEMBARRIER_STATE_GLOBAL_EXPEDITED, &mm->membarrier_state);
-	if (atomic_read(&mm->mm_users) == 1 && get_nr_threads(p) == 1) {
+	if (atomic_read(&mm->mm_users) == 1) {
 		/*
 		 * For single mm user, single threaded process, we can
 		 * simply issue a memory barrier after setting
@@ -232,7 +232,7 @@ static int membarrier_register_private_expedited(int flags)
 	if (flags & MEMBARRIER_FLAG_SYNC_CORE)
 		atomic_or(MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE,
 			  &mm->membarrier_state);
-	if (!(atomic_read(&mm->mm_users) == 1 && get_nr_threads(p) == 1)) {
+	if (atomic_read(&mm->mm_users) != 1) {
 		/*
 		 * Ensure all future scheduler executions will observe the
 		 * new thread flag state for this process.
-- 
2.17.1


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [RFC PATCH 3/4] Cleanup: sched/membarrier: only sync_core before usermode for same mm
  2019-09-06  3:12 [RFC PATCH 0/4] Membarrier fixes/cleanups Mathieu Desnoyers
  2019-09-06  3:12 ` [RFC PATCH 1/4] Fix: sched/membarrier: private expedited registration check Mathieu Desnoyers
  2019-09-06  3:12 ` [RFC PATCH 2/4] Cleanup: sched/membarrier: remove redundant check Mathieu Desnoyers
@ 2019-09-06  3:12 ` Mathieu Desnoyers
  2019-09-06  7:41   ` Peter Zijlstra
  2019-09-06  3:13 ` [RFC PATCH 4/4] Fix: sched/membarrier: p->mm->membarrier_state racy load Mathieu Desnoyers
  3 siblings, 1 reply; 21+ messages in thread
From: Mathieu Desnoyers @ 2019-09-06  3:12 UTC (permalink / raw)
  To: Peter Zijlstra, Paul E. McKenney
  Cc: Ingo Molnar, linux-kernel, Oleg Nesterov, Eric W. Biederman,
	Linus Torvalds, Russell King - ARM Linux admin, Chris Metcalf,
	Christoph Lameter, Kirill Tkhai, Mike Galbraith, Thomas Gleixner,
	Mathieu Desnoyers

When the prev and next task's mm change, switch_mm() provides the core
serializing guarantees before returning to usermode. The only case
where an explicit core serialization is needed is when the scheduler
keeps the same mm for prev and next.

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Paul E. McKenney" <paulmck@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched/mm.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 4a7944078cc3..8557ec664213 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -362,6 +362,8 @@ enum {
 
 static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
 {
+	if (current->mm != mm)
+		return;
 	if (likely(!(atomic_read(&mm->membarrier_state) &
 		     MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE)))
 		return;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [RFC PATCH 4/4] Fix: sched/membarrier: p->mm->membarrier_state racy load
  2019-09-06  3:12 [RFC PATCH 0/4] Membarrier fixes/cleanups Mathieu Desnoyers
                   ` (2 preceding siblings ...)
  2019-09-06  3:12 ` [RFC PATCH 3/4] Cleanup: sched/membarrier: only sync_core before usermode for same mm Mathieu Desnoyers
@ 2019-09-06  3:13 ` Mathieu Desnoyers
  2019-09-06  8:23   ` Peter Zijlstra
  3 siblings, 1 reply; 21+ messages in thread
From: Mathieu Desnoyers @ 2019-09-06  3:13 UTC (permalink / raw)
  To: Peter Zijlstra, Paul E. McKenney
  Cc: Ingo Molnar, linux-kernel, Oleg Nesterov, Eric W. Biederman,
	Linus Torvalds, Russell King - ARM Linux admin, Chris Metcalf,
	Christoph Lameter, Kirill Tkhai, Mike Galbraith, Thomas Gleixner,
	Mathieu Desnoyers

The membarrier_state field is located within the mm_struct, which
is not guaranteed to exist when used from runqueue-lock-free iteration
on runqueues by the membarrier system call.

Copy the membarrier_state from the mm_struct into the scheduler runqueue
in the scheduler prepare task switch when the scheduler switches between
mm.

When registering membarrier for mm, after setting the registration bit
in the mm membarrier state, issue a synchronize_rcu() to ensure the
scheduler observes the change. In order to take care of the case
where a runqueue keeps executing the target mm without swapping to
other mm, iterate over each runqueue and issue an IPI to copy the
membarrier_state from the mm_struct into each runqueue which have the
same mm which state has just been modified.

Move the mm membarrier_state field closer to pgd in mm_struct so sched
switch prepare hits a cache line which is already touched by the
scheduler switch_mm.

membarrier_execve() hook now needs to clear the runqueue's membarrier
state in addition to clear the mm membarrier state, so move its
implementation into the scheduler membarrier code so it can access the
runqueue structure.

As suggested by Linus, move all membarrier.c RCU read-side locks outside
of the for each cpu loops. In fact, it turns out that
membarrier_global_expedited() does not need the RCU read-side critical
section because it does not need to use task_rcu_dereference() anymore,
relying instead on the runqueue's membarrier_state.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Paul E. McKenney" <paulmck@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
---
 include/linux/mm_types.h  |   7 +-
 include/linux/sched/mm.h  |   6 +-
 kernel/sched/core.c       |   1 +
 kernel/sched/membarrier.c | 141 +++++++++++++++++++++++++++-----------
 kernel/sched/sched.h      |  33 +++++++++
 5 files changed, 141 insertions(+), 47 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6a7a1083b6fb..7020572eb605 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -382,6 +382,9 @@ struct mm_struct {
 		unsigned long task_size;	/* size of task vm space */
 		unsigned long highest_vm_end;	/* highest vma end address */
 		pgd_t * pgd;
+#ifdef CONFIG_MEMBARRIER
+		atomic_t membarrier_state;
+#endif
 
 		/**
 		 * @mm_users: The number of users including userspace.
@@ -452,9 +455,7 @@ struct mm_struct {
 		unsigned long flags; /* Must use atomic bitops to access */
 
 		struct core_state *core_state; /* coredumping support */
-#ifdef CONFIG_MEMBARRIER
-		atomic_t membarrier_state;
-#endif
+
 #ifdef CONFIG_AIO
 		spinlock_t			ioctx_lock;
 		struct kioctx_table __rcu	*ioctx_table;
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 8557ec664213..7070dbef8066 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -370,10 +370,8 @@ static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
 	sync_core_before_usermode();
 }
 
-static inline void membarrier_execve(struct task_struct *t)
-{
-	atomic_set(&t->mm->membarrier_state, 0);
-}
+extern void membarrier_execve(struct task_struct *t);
+
 #else
 #ifdef CONFIG_ARCH_HAS_MEMBARRIER_CALLBACKS
 static inline void membarrier_arch_switch_mm(struct mm_struct *prev,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 010d578118d6..1cffc1aa403c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3038,6 +3038,7 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
 	perf_event_task_sched_out(prev, next);
 	rseq_preempt(prev);
 	fire_sched_out_preempt_notifiers(prev, next);
+	membarrier_prepare_task_switch(rq, prev, next);
 	prepare_task(next);
 	prepare_arch_switch(next);
 }
diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index 7e0a0d6535f3..5744c300d29e 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -30,6 +30,28 @@ static void ipi_mb(void *info)
 	smp_mb();	/* IPIs should be serializing but paranoid. */
 }
 
+static void ipi_sync_rq_state(void *info)
+{
+	struct mm_struct *mm = (struct mm_struct *) info;
+
+	if (current->mm != mm)
+		return;
+	WRITE_ONCE(this_rq()->membarrier_state,
+		   atomic_read(&mm->membarrier_state));
+	/*
+	 * Issue a memory barrier after setting MEMBARRIER_STATE_GLOBAL_EXPEDITED
+	 * in the current runqueue to guarantee that no memory access following
+	 * registration is reordered before registration.
+	 */
+	smp_mb();
+}
+
+void membarrier_execve(struct task_struct *t)
+{
+	atomic_set(&t->mm->membarrier_state, 0);
+	WRITE_ONCE(this_rq()->membarrier_state, 0);
+}
+
 static int membarrier_global_expedited(void)
 {
 	int cpu;
@@ -57,8 +79,6 @@ static int membarrier_global_expedited(void)
 
 	cpus_read_lock();
 	for_each_online_cpu(cpu) {
-		struct task_struct *p;
-
 		/*
 		 * Skipping the current CPU is OK even through we can be
 		 * migrated at any point. The current CPU, at the point
@@ -70,16 +90,13 @@ static int membarrier_global_expedited(void)
 		if (cpu == raw_smp_processor_id())
 			continue;
 
-		rcu_read_lock();
-		p = task_rcu_dereference(&cpu_rq(cpu)->curr);
-		if (p && p->mm && (atomic_read(&p->mm->membarrier_state) &
-				   MEMBARRIER_STATE_GLOBAL_EXPEDITED)) {
+		if (READ_ONCE(cpu_rq(cpu)->membarrier_state) &
+		    MEMBARRIER_STATE_GLOBAL_EXPEDITED) {
 			if (!fallback)
 				__cpumask_set_cpu(cpu, tmpmask);
 			else
 				smp_call_function_single(cpu, ipi_mb, NULL, 1);
 		}
-		rcu_read_unlock();
 	}
 	if (!fallback) {
 		preempt_disable();
@@ -136,6 +153,7 @@ static int membarrier_private_expedited(int flags)
 	}
 
 	cpus_read_lock();
+	rcu_read_lock();
 	for_each_online_cpu(cpu) {
 		struct task_struct *p;
 
@@ -149,7 +167,6 @@ static int membarrier_private_expedited(int flags)
 		 */
 		if (cpu == raw_smp_processor_id())
 			continue;
-		rcu_read_lock();
 		p = task_rcu_dereference(&cpu_rq(cpu)->curr);
 		if (p && p->mm == current->mm) {
 			if (!fallback)
@@ -157,8 +174,8 @@ static int membarrier_private_expedited(int flags)
 			else
 				smp_call_function_single(cpu, ipi_mb, NULL, 1);
 		}
-		rcu_read_unlock();
 	}
+	rcu_read_unlock();
 	if (!fallback) {
 		preempt_disable();
 		smp_call_function_many(tmpmask, ipi_mb, NULL, 1);
@@ -177,6 +194,71 @@ static int membarrier_private_expedited(int flags)
 	return 0;
 }
 
+static void sync_runqueues_membarrier_state(struct mm_struct *mm)
+{
+	int membarrier_state = atomic_read(&mm->membarrier_state);
+	bool fallback = false;
+	cpumask_var_t tmpmask;
+	int cpu;
+
+	if (atomic_read(&mm->mm_users) == 1 || num_online_cpus() == 1) {
+		WRITE_ONCE(this_rq()->membarrier_state, membarrier_state);
+
+		/*
+		 * For single mm user, we can simply issue a memory barrier
+		 * after setting MEMBARRIER_STATE_GLOBAL_EXPEDITED in the
+		 * mm and in the current runqueue to guarantee that no memory
+		 * access following registration is reordered before
+		 * registration.
+		 */
+		smp_mb();
+		return;
+	}
+
+	/*
+	 * For mm with multiple users, we need to ensure all future
+	 * scheduler executions will observe @mm's new membarrier
+	 * state.
+	 */
+	synchronize_rcu();
+
+	if (!zalloc_cpumask_var(&tmpmask, GFP_NOWAIT)) {
+		/* Fallback for OOM. */
+		fallback = true;
+	}
+
+	/*
+	 * For each cpu runqueue, if the task's mm match @mm, ensure that all
+	 * @mm's membarrier state set bits are also set in in the runqueue's
+	 * membarrier state. This ensures that a runqueue scheduling
+	 * between threads which are users of @mm has its membarrier state
+	 * updated.
+	 */
+	cpus_read_lock();
+	rcu_read_lock();
+	for_each_online_cpu(cpu) {
+		struct rq *rq = cpu_rq(cpu);
+		struct task_struct *p;
+
+		p = task_rcu_dereference(&rq->curr);
+		if (p && p->mm == mm) {
+			if (!fallback)
+				__cpumask_set_cpu(cpu, tmpmask);
+			else
+				smp_call_function_single(cpu, ipi_sync_rq_state,
+							 mm, 1);
+		}
+	}
+	rcu_read_unlock();
+	if (!fallback) {
+		preempt_disable();
+		smp_call_function_many(tmpmask, ipi_sync_rq_state, mm, 1);
+		preempt_enable();
+		free_cpumask_var(tmpmask);
+	}
+	cpus_read_unlock();
+}
+
 static int membarrier_register_global_expedited(void)
 {
 	struct task_struct *p = current;
@@ -186,23 +268,7 @@ static int membarrier_register_global_expedited(void)
 	    MEMBARRIER_STATE_GLOBAL_EXPEDITED_READY)
 		return 0;
 	atomic_or(MEMBARRIER_STATE_GLOBAL_EXPEDITED, &mm->membarrier_state);
-	if (atomic_read(&mm->mm_users) == 1) {
-		/*
-		 * For single mm user, single threaded process, we can
-		 * simply issue a memory barrier after setting
-		 * MEMBARRIER_STATE_GLOBAL_EXPEDITED to guarantee that
-		 * no memory access following registration is reordered
-		 * before registration.
-		 */
-		smp_mb();
-	} else {
-		/*
-		 * For multi-mm user threads, we need to ensure all
-		 * future scheduler executions will observe the new
-		 * thread flag state for this mm.
-		 */
-		synchronize_rcu();
-	}
+	sync_runqueues_membarrier_state(mm);
 	atomic_or(MEMBARRIER_STATE_GLOBAL_EXPEDITED_READY,
 		  &mm->membarrier_state);
 
@@ -213,12 +279,14 @@ static int membarrier_register_private_expedited(int flags)
 {
 	struct task_struct *p = current;
 	struct mm_struct *mm = p->mm;
-	int state = MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY;
+	int ready_state = MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY,
+	    set_state = MEMBARRIER_STATE_PRIVATE_EXPEDITED;
 
 	if (flags & MEMBARRIER_FLAG_SYNC_CORE) {
 		if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE))
 			return -EINVAL;
-		state = MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE_READY;
+		ready_state =
+			MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE_READY;
 	}
 
 	/*
@@ -226,20 +294,13 @@ static int membarrier_register_private_expedited(int flags)
 	 * groups, which use the same mm. (CLONE_VM but not
 	 * CLONE_THREAD).
 	 */
-	if ((atomic_read(&mm->membarrier_state) & state) == state)
+	if ((atomic_read(&mm->membarrier_state) & ready_state) == ready_state)
 		return 0;
-	atomic_or(MEMBARRIER_STATE_PRIVATE_EXPEDITED, &mm->membarrier_state);
 	if (flags & MEMBARRIER_FLAG_SYNC_CORE)
-		atomic_or(MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE,
-			  &mm->membarrier_state);
-	if (atomic_read(&mm->mm_users) != 1) {
-		/*
-		 * Ensure all future scheduler executions will observe the
-		 * new thread flag state for this process.
-		 */
-		synchronize_rcu();
-	}
-	atomic_or(state, &mm->membarrier_state);
+		set_state |= MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE;
+	atomic_or(set_state, &mm->membarrier_state);
+	sync_runqueues_membarrier_state(mm);
+	atomic_or(ready_state, &mm->membarrier_state);
 
 	return 0;
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 802b1f3405f2..ac26baa855e8 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -907,6 +907,10 @@ struct rq {
 
 	atomic_t		nr_iowait;
 
+#ifdef CONFIG_MEMBARRIER
+	int membarrier_state;
+#endif
+
 #ifdef CONFIG_SMP
 	struct root_domain		*rd;
 	struct sched_domain __rcu	*sd;
@@ -2423,3 +2427,32 @@ static inline bool sched_energy_enabled(void)
 static inline bool sched_energy_enabled(void) { return false; }
 
 #endif /* CONFIG_ENERGY_MODEL && CONFIG_CPU_FREQ_GOV_SCHEDUTIL */
+
+#ifdef CONFIG_MEMBARRIER
+/*
+ * The scheduler provides memory barriers required by membarrier between:
+ * - prior user-space memory accesses and store to rq->membarrier_state,
+ * - store to rq->membarrier_state and following user-space memory accesses.
+ * In the same way it provides those guarantees around store to rq->curr.
+ */
+static inline void membarrier_prepare_task_switch(struct rq *rq,
+						  struct task_struct *prev,
+						  struct task_struct *next)
+{
+	int membarrier_state = 0;
+	struct mm_struct *next_mm = next->mm;
+
+	if (prev->mm == next_mm)
+		return;
+	if (next_mm)
+		membarrier_state = atomic_read(&next_mm->membarrier_state);
+	if (READ_ONCE(rq->membarrier_state) != membarrier_state)
+		WRITE_ONCE(rq->membarrier_state, membarrier_state);
+}
+#else
+static inline void membarrier_prepare_task_switch(struct rq *rq,
+						  struct task_struct *prev,
+						  struct task_struct *next)
+{
+}
+#endif
-- 
2.17.1


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 3/4] Cleanup: sched/membarrier: only sync_core before usermode for same mm
  2019-09-06  3:12 ` [RFC PATCH 3/4] Cleanup: sched/membarrier: only sync_core before usermode for same mm Mathieu Desnoyers
@ 2019-09-06  7:41   ` Peter Zijlstra
  2019-09-06 13:40     ` Mathieu Desnoyers
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2019-09-06  7:41 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Paul E. McKenney, Ingo Molnar, linux-kernel, Oleg Nesterov,
	Eric W. Biederman, Linus Torvalds,
	Russell King - ARM Linux admin, Chris Metcalf, Christoph Lameter,
	Kirill Tkhai, Mike Galbraith, Thomas Gleixner

On Thu, Sep 05, 2019 at 11:12:59PM -0400, Mathieu Desnoyers wrote:
> When the prev and next task's mm change, switch_mm() provides the core
> serializing guarantees before returning to usermode. The only case
> where an explicit core serialization is needed is when the scheduler
> keeps the same mm for prev and next.
> 
> Suggested-by: Oleg Nesterov <oleg@redhat.com>
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: "Paul E. McKenney" <paulmck@linux.ibm.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: "Eric W. Biederman" <ebiederm@xmission.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
> Cc: Chris Metcalf <cmetcalf@ezchip.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Kirill Tkhai <tkhai@yandex.ru>
> Cc: Mike Galbraith <efault@gmx.de>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@kernel.org>
> ---
>  include/linux/sched/mm.h | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> index 4a7944078cc3..8557ec664213 100644
> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -362,6 +362,8 @@ enum {
>  
>  static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
>  {
> +	if (current->mm != mm)
> +		return;
>  	if (likely(!(atomic_read(&mm->membarrier_state) &
>  		     MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE)))
>  		return;

So SYNC_CORE is about I$ coherency and funny thing like that. Now it
seems 'natural' that if we flip the address space, that I$ also gets
wiped/updated, because the whole text mapping changes.

But did we just assume that, or did we verify the truth of this? (I'm
just being paranoid here)

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 4/4] Fix: sched/membarrier: p->mm->membarrier_state racy load
  2019-09-06  3:13 ` [RFC PATCH 4/4] Fix: sched/membarrier: p->mm->membarrier_state racy load Mathieu Desnoyers
@ 2019-09-06  8:23   ` Peter Zijlstra
  2019-09-08 13:49     ` [RFC PATCH 4/4] Fix: sched/membarrier: p->mm->membarrier_state racy load (v2) Mathieu Desnoyers
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2019-09-06  8:23 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Paul E. McKenney, Ingo Molnar, linux-kernel, Oleg Nesterov,
	Eric W. Biederman, Linus Torvalds,
	Russell King - ARM Linux admin, Chris Metcalf, Christoph Lameter,
	Kirill Tkhai, Mike Galbraith, Thomas Gleixner

On Thu, Sep 05, 2019 at 11:13:00PM -0400, Mathieu Desnoyers wrote:

> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 6a7a1083b6fb..7020572eb605 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -382,6 +382,9 @@ struct mm_struct {
>  		unsigned long task_size;	/* size of task vm space */
>  		unsigned long highest_vm_end;	/* highest vma end address */
>  		pgd_t * pgd;
> +#ifdef CONFIG_MEMBARRIER

Stick in a comment, on why here. To be close to data already used by
switch_mm().

> +		atomic_t membarrier_state;
> +#endif
>  
>  		/**
>  		 * @mm_users: The number of users including userspace.

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 010d578118d6..1cffc1aa403c 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3038,6 +3038,7 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
>  	perf_event_task_sched_out(prev, next);
>  	rseq_preempt(prev);
>  	fire_sched_out_preempt_notifiers(prev, next);
> +	membarrier_prepare_task_switch(rq, prev, next);

This had me confused for a while, because I initially thought we'd only
do this for switch_mm(), but you're made it agressive and track kernel
threads too.

I think we can do that slightly different. See below...

>  	prepare_task(next);
>  	prepare_arch_switch(next);
>  }

> diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
> index 7e0a0d6535f3..5744c300d29e 100644
> --- a/kernel/sched/membarrier.c
> +++ b/kernel/sched/membarrier.c
> @@ -30,6 +30,28 @@ static void ipi_mb(void *info)

> +void membarrier_execve(struct task_struct *t)
> +{
> +	atomic_set(&t->mm->membarrier_state, 0);
> +	WRITE_ONCE(this_rq()->membarrier_state, 0);

It is the callsite of this one that had me puzzled and confused. I
think it works by accident more than anything else.

You see; I thought the rules were that we'd change it near/before
switch_mm(), and this is quite a way _after_.

I think it might be best to place the call in exec_mmap(), right before
activate_mm().

But that then had me wonder about the membarrier_prepate_task_switch()
thing...

> +/*
> + * The scheduler provides memory barriers required by membarrier between:
> + * - prior user-space memory accesses and store to rq->membarrier_state,
> + * - store to rq->membarrier_state and following user-space memory accesses.
> + * In the same way it provides those guarantees around store to rq->curr.
> + */
> +static inline void membarrier_prepare_task_switch(struct rq *rq,
> +						  struct task_struct *prev,
> +						  struct task_struct *next)
> +{
> +	int membarrier_state = 0;
> +	struct mm_struct *next_mm = next->mm;
> +
> +	if (prev->mm == next_mm)
> +		return;
> +	if (next_mm)
> +		membarrier_state = atomic_read(&next_mm->membarrier_state);
> +	if (READ_ONCE(rq->membarrier_state) != membarrier_state)
> +		WRITE_ONCE(rq->membarrier_state, membarrier_state);
> +}

So if you make the above something like:

static inline void
membarrier_switch_mm(struct rq *rq, struct mm_struct *prev_mm, struct mm_struct *next_mm)
{
	int membarrier_state;

	if (prev_mm == next_mm)
		return;

	membarrier_state = atomic_read(&next_mm->membarrier_state);
	if (READ_ONCE(rq->membarrier_state) == membarrier_state)
		return;

	WRITE_ONCE(rq->membarrier_state, membarrier_state);
}

And put it right in front of switch_mm() in context_switch() then we'll
deal with kernel on the other side, like so:

> @@ -70,16 +90,13 @@ static int membarrier_global_expedited(void)
>  		if (cpu == raw_smp_processor_id())
>  			continue;
>  
> -		rcu_read_lock();
> -		p = task_rcu_dereference(&cpu_rq(cpu)->curr);
> -		if (p && p->mm && (atomic_read(&p->mm->membarrier_state) &
> -				   MEMBARRIER_STATE_GLOBAL_EXPEDITED)) {
> +		if (READ_ONCE(cpu_rq(cpu)->membarrier_state) &
> +		    MEMBARRIER_STATE_GLOBAL_EXPEDITED) {

		p = rcu_dereference(rq->curr);
		if ((READ_ONCE(cpu_rq(cpu)->membarrier_state) & MEMBARRIER_STATE_GLOBAL_EXPEDITED) &&
		    !(p->flags & PF_KTHREAD))

>  			if (!fallback)
>  				__cpumask_set_cpu(cpu, tmpmask);
>  			else
>  				smp_call_function_single(cpu, ipi_mb, NULL, 1);
>  		}
> -		rcu_read_unlock();
>  	}
>  	if (!fallback) {
>  		preempt_disable();

does that make sense?

(also, I hate how long all these membarrier names are)

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 3/4] Cleanup: sched/membarrier: only sync_core before usermode for same mm
  2019-09-06  7:41   ` Peter Zijlstra
@ 2019-09-06 13:40     ` Mathieu Desnoyers
  0 siblings, 0 replies; 21+ messages in thread
From: Mathieu Desnoyers @ 2019-09-06 13:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: paulmck, Ingo Molnar, linux-kernel, Oleg Nesterov,
	Eric W. Biederman, Linus Torvalds, Russell King, ARM Linux,
	Chris Metcalf, Chris Lameter, Kirill Tkhai, Mike Galbraith,
	Thomas Gleixner

----- On Sep 6, 2019, at 3:41 AM, Peter Zijlstra peterz@infradead.org wrote:

> On Thu, Sep 05, 2019 at 11:12:59PM -0400, Mathieu Desnoyers wrote:
>> When the prev and next task's mm change, switch_mm() provides the core
>> serializing guarantees before returning to usermode. The only case
>> where an explicit core serialization is needed is when the scheduler
>> keeps the same mm for prev and next.
>> 
>> Suggested-by: Oleg Nesterov <oleg@redhat.com>
>> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>> Cc: "Paul E. McKenney" <paulmck@linux.ibm.com>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Oleg Nesterov <oleg@redhat.com>
>> Cc: "Eric W. Biederman" <ebiederm@xmission.com>
>> Cc: Linus Torvalds <torvalds@linux-foundation.org>
>> Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
>> Cc: Chris Metcalf <cmetcalf@ezchip.com>
>> Cc: Christoph Lameter <cl@linux.com>
>> Cc: Kirill Tkhai <tkhai@yandex.ru>
>> Cc: Mike Galbraith <efault@gmx.de>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Ingo Molnar <mingo@kernel.org>
>> ---
>>  include/linux/sched/mm.h | 2 ++
>>  1 file changed, 2 insertions(+)
>> 
>> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
>> index 4a7944078cc3..8557ec664213 100644
>> --- a/include/linux/sched/mm.h
>> +++ b/include/linux/sched/mm.h
>> @@ -362,6 +362,8 @@ enum {
>>  
>>  static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
>>  {
>> +	if (current->mm != mm)
>> +		return;
>>  	if (likely(!(atomic_read(&mm->membarrier_state) &
>>  		     MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE)))
>>  		return;
> 
> So SYNC_CORE is about I$ coherency and funny thing like that. Now it
> seems 'natural' that if we flip the address space, that I$ also gets
> wiped/updated, because the whole text mapping changes.
> 
> But did we just assume that, or did we verify the truth of this? (I'm
> just being paranoid here)

We have documented those here:

Documentation/features/sched/membarrier-sync-core/arch-support.txt

For instance, x86:

# * x86
#
# x86-32 uses IRET as return from interrupt, which takes care of the IPI.
# However, it uses both IRET and SYSEXIT to go back to user-space. The IRET
# instruction is core serializing, but not SYSEXIT.
#
# x86-64 uses IRET as return from interrupt, which takes care of the IPI.
# However, it can return to user-space through either SYSRETL (compat code),
# SYSRETQ, or IRET.
#
# Given that neither SYSRET{L,Q}, nor SYSEXIT, are core serializing, we rely
# instead on write_cr3() performed by switch_mm() to provide core serialization
# after changing the current mm, and deal with the special case of kthread ->
# uthread (temporarily keeping current mm into active_mm) by issuing a
# sync_core_before_usermode() in that specific case.

I've also made sure x86 switch_mm_irqs_off() has the following comment,
just in case someone too keen on optimizing away the CR3 write would
forget to look at arch-support.txt:

        /*
         * The membarrier system call requires a full memory barrier and
         * core serialization before returning to user-space, after
         * storing to rq->curr. Writing to CR3 provides that full
         * memory barrier and core serializing instruction.
         */

In the case of arm/arm64, they have no requirement on switch_mm():

# * arm/arm64
#
# Rely on implicit context synchronization as a result of exception return
# when returning from IPI handler, and when returning to user-space.


Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [RFC PATCH 4/4] Fix: sched/membarrier: p->mm->membarrier_state racy load (v2)
  2019-09-06  8:23   ` Peter Zijlstra
@ 2019-09-08 13:49     ` Mathieu Desnoyers
  2019-09-08 16:51       ` Linus Torvalds
  2019-09-09 11:00       ` Oleg Nesterov
  0 siblings, 2 replies; 21+ messages in thread
From: Mathieu Desnoyers @ 2019-09-08 13:49 UTC (permalink / raw)
  To: Peter Zijlstra, Paul E. McKenney
  Cc: Ingo Molnar, linux-kernel, Oleg Nesterov, Eric W. Biederman,
	Linus Torvalds, Russell King - ARM Linux admin, Chris Metcalf,
	Christoph Lameter, Kirill Tkhai, Mike Galbraith, Thomas Gleixner,
	Mathieu Desnoyers

The membarrier_state field is located within the mm_struct, which
is not guaranteed to exist when used from runqueue-lock-free iteration
on runqueues by the membarrier system call.

Copy the membarrier_state from the mm_struct into the scheduler runqueue
when the scheduler switches between mm.

When registering membarrier for mm, after setting the registration bit
in the mm membarrier state, issue a synchronize_rcu() to ensure the
scheduler observes the change. In order to take care of the case
where a runqueue keeps executing the target mm without swapping to
other mm, iterate over each runqueue and issue an IPI to copy the
membarrier_state from the mm_struct into each runqueue which have the
same mm which state has just been modified.

Move the mm membarrier_state field closer to pgd in mm_struct to use
a cache line already touched by the scheduler switch_mm.

The membarrier_execve() (now membarrier_exec_mmap) hook now needs to
clear the runqueue's membarrier state in addition to clear the mm
membarrier state, so move its implementation into the scheduler
membarrier code so it can access the runqueue structure.

Add memory barrier in membarrier_exec_mmap() prior to clearing
the membarrier state, ensuring memory accesses executed prior to exec
are not reordered with the stores clearing the membarrier state.

As suggested by Linus, move all membarrier.c RCU read-side locks outside
of the for each cpu loops.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Paul E. McKenney" <paulmck@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
---
Changes since v1:
- Take care of Peter Zijlstra's feedback, moving callsites closer
  to switch_mm() (scheduler) and activate_mm() (execve).
- Add memory barrier in membarrier_exec_mmap() prior to clearing
  the membarrier state.
---
 fs/exec.c                 |   2 +-
 include/linux/mm_types.h  |  14 +++-
 include/linux/sched/mm.h  |   8 +-
 kernel/sched/core.c       |   4 +-
 kernel/sched/membarrier.c | 170 ++++++++++++++++++++++++++++----------
 kernel/sched/sched.h      |  34 ++++++++
 6 files changed, 180 insertions(+), 52 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index f7f6a140856a..ed39e2c81338 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1035,6 +1035,7 @@ static int exec_mmap(struct mm_struct *mm)
 	active_mm = tsk->active_mm;
 	tsk->mm = mm;
 	tsk->active_mm = mm;
+	membarrier_exec_mmap(mm);
 	activate_mm(active_mm, mm);
 	tsk->mm->vmacache_seqnum = 0;
 	vmacache_flush(tsk);
@@ -1825,7 +1826,6 @@ static int __do_execve_file(int fd, struct filename *filename,
 	/* execve succeeded */
 	current->fs->in_exec = 0;
 	current->in_execve = 0;
-	membarrier_execve(current);
 	rseq_execve(current);
 	acct_update_integrals(current);
 	task_numa_free(current, false);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6a7a1083b6fb..ec9bd3a6c827 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -383,6 +383,16 @@ struct mm_struct {
 		unsigned long highest_vm_end;	/* highest vma end address */
 		pgd_t * pgd;
 
+#ifdef CONFIG_MEMBARRIER
+		/**
+		 * @membarrier_state: Flags controlling membarrier behavior.
+		 *
+		 * This field is close to @pgd to hopefully fit in the same
+		 * cache-line, which needs to be touched by switch_mm().
+		 */
+		atomic_t membarrier_state;
+#endif
+
 		/**
 		 * @mm_users: The number of users including userspace.
 		 *
@@ -452,9 +462,7 @@ struct mm_struct {
 		unsigned long flags; /* Must use atomic bitops to access */
 
 		struct core_state *core_state; /* coredumping support */
-#ifdef CONFIG_MEMBARRIER
-		atomic_t membarrier_state;
-#endif
+
 #ifdef CONFIG_AIO
 		spinlock_t			ioctx_lock;
 		struct kioctx_table __rcu	*ioctx_table;
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 8557ec664213..e6770012db18 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -370,10 +370,8 @@ static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
 	sync_core_before_usermode();
 }
 
-static inline void membarrier_execve(struct task_struct *t)
-{
-	atomic_set(&t->mm->membarrier_state, 0);
-}
+extern void membarrier_exec_mmap(struct mm_struct *mm);
+
 #else
 #ifdef CONFIG_ARCH_HAS_MEMBARRIER_CALLBACKS
 static inline void membarrier_arch_switch_mm(struct mm_struct *prev,
@@ -382,7 +380,7 @@ static inline void membarrier_arch_switch_mm(struct mm_struct *prev,
 {
 }
 #endif
-static inline void membarrier_execve(struct task_struct *t)
+static inline void membarrier_exec_mmap(struct mm_struct *mm)
 {
 }
 static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 010d578118d6..7b118be5cc88 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3238,8 +3238,10 @@ context_switch(struct rq *rq, struct task_struct *prev,
 		next->active_mm = oldmm;
 		mmgrab(oldmm);
 		enter_lazy_tlb(oldmm, next);
-	} else
+	} else {
+		membarrier_switch_mm(rq, oldmm, mm);
 		switch_mm_irqs_off(oldmm, mm, next);
+	}
 
 	if (!prev->mm) {
 		prev->active_mm = NULL;
diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index 7e0a0d6535f3..805a621bbf88 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -30,6 +30,39 @@ static void ipi_mb(void *info)
 	smp_mb();	/* IPIs should be serializing but paranoid. */
 }
 
+static void ipi_sync_rq_state(void *info)
+{
+	struct mm_struct *mm = (struct mm_struct *) info;
+
+	if (current->mm != mm)
+		return;
+	WRITE_ONCE(this_rq()->membarrier_state,
+		   atomic_read(&mm->membarrier_state));
+	/*
+	 * Issue a memory barrier after setting
+	 * MEMBARRIER_STATE_GLOBAL_EXPEDITED in the current runqueue to
+	 * guarantee that no memory access following registration is reordered
+	 * before registration.
+	 */
+	smp_mb();
+}
+
+void membarrier_exec_mmap(struct mm_struct *mm)
+{
+	/*
+	 * Issue a memory barrier before clearing membarrier_state to
+	 * guarantee that no memory access prior to exec is reordered after
+	 * clearing this state.
+	 */
+	smp_mb();
+	atomic_set(&mm->membarrier_state, 0);
+	/*
+	 * Keep the runqueue membarrier_state in sync with this mm
+	 * membarrier_state.
+	 */
+	WRITE_ONCE(this_rq()->membarrier_state, 0);
+}
+
 static int membarrier_global_expedited(void)
 {
 	int cpu;
@@ -56,6 +89,7 @@ static int membarrier_global_expedited(void)
 	}
 
 	cpus_read_lock();
+	rcu_read_lock();
 	for_each_online_cpu(cpu) {
 		struct task_struct *p;
 
@@ -70,17 +104,25 @@ static int membarrier_global_expedited(void)
 		if (cpu == raw_smp_processor_id())
 			continue;
 
-		rcu_read_lock();
+		if (!(READ_ONCE(cpu_rq(cpu)->membarrier_state) &
+		    MEMBARRIER_STATE_GLOBAL_EXPEDITED))
+			continue;
+
+		/*
+		 * Skip the CPU if it runs a kernel thread. The scheduler
+		 * leaves the prior task mm in place as an optimization when
+		 * scheduling a kthread.
+		 */
 		p = task_rcu_dereference(&cpu_rq(cpu)->curr);
-		if (p && p->mm && (atomic_read(&p->mm->membarrier_state) &
-				   MEMBARRIER_STATE_GLOBAL_EXPEDITED)) {
-			if (!fallback)
-				__cpumask_set_cpu(cpu, tmpmask);
-			else
-				smp_call_function_single(cpu, ipi_mb, NULL, 1);
-		}
-		rcu_read_unlock();
+		if (p->flags & PF_KTHREAD)
+			continue;
+
+		if (!fallback)
+			__cpumask_set_cpu(cpu, tmpmask);
+		else
+			smp_call_function_single(cpu, ipi_mb, NULL, 1);
 	}
+	rcu_read_unlock();
 	if (!fallback) {
 		preempt_disable();
 		smp_call_function_many(tmpmask, ipi_mb, NULL, 1);
@@ -136,6 +178,7 @@ static int membarrier_private_expedited(int flags)
 	}
 
 	cpus_read_lock();
+	rcu_read_lock();
 	for_each_online_cpu(cpu) {
 		struct task_struct *p;
 
@@ -149,7 +192,6 @@ static int membarrier_private_expedited(int flags)
 		 */
 		if (cpu == raw_smp_processor_id())
 			continue;
-		rcu_read_lock();
 		p = task_rcu_dereference(&cpu_rq(cpu)->curr);
 		if (p && p->mm == current->mm) {
 			if (!fallback)
@@ -157,8 +199,8 @@ static int membarrier_private_expedited(int flags)
 			else
 				smp_call_function_single(cpu, ipi_mb, NULL, 1);
 		}
-		rcu_read_unlock();
 	}
+	rcu_read_unlock();
 	if (!fallback) {
 		preempt_disable();
 		smp_call_function_many(tmpmask, ipi_mb, NULL, 1);
@@ -177,6 +219,71 @@ static int membarrier_private_expedited(int flags)
 	return 0;
 }
 
+static void sync_runqueues_membarrier_state(struct mm_struct *mm)
+{
+	int membarrier_state = atomic_read(&mm->membarrier_state);
+	bool fallback = false;
+	cpumask_var_t tmpmask;
+	int cpu;
+
+	if (atomic_read(&mm->mm_users) == 1 || num_online_cpus() == 1) {
+		WRITE_ONCE(this_rq()->membarrier_state, membarrier_state);
+
+		/*
+		 * For single mm user, we can simply issue a memory barrier
+		 * after setting MEMBARRIER_STATE_GLOBAL_EXPEDITED in the
+		 * mm and in the current runqueue to guarantee that no memory
+		 * access following registration is reordered before
+		 * registration.
+		 */
+		smp_mb();
+		return;
+	}
+
+	/*
+	 * For mm with multiple users, we need to ensure all future
+	 * scheduler executions will observe @mm's new membarrier
+	 * state.
+	 */
+	synchronize_rcu();
+
+	if (!zalloc_cpumask_var(&tmpmask, GFP_NOWAIT)) {
+		/* Fallback for OOM. */
+		fallback = true;
+	}
+
+	/*
+	 * For each cpu runqueue, if the task's mm match @mm, ensure that all
+	 * @mm's membarrier state set bits are also set in in the runqueue's
+	 * membarrier state. This ensures that a runqueue scheduling
+	 * between threads which are users of @mm has its membarrier state
+	 * updated.
+	 */
+	cpus_read_lock();
+	rcu_read_lock();
+	for_each_online_cpu(cpu) {
+		struct rq *rq = cpu_rq(cpu);
+		struct task_struct *p;
+
+		p = task_rcu_dereference(&rq->curr);
+		if (p && p->mm == mm) {
+			if (!fallback)
+				__cpumask_set_cpu(cpu, tmpmask);
+			else
+				smp_call_function_single(cpu, ipi_sync_rq_state,
+							 mm, 1);
+		}
+	}
+	rcu_read_unlock();
+	if (!fallback) {
+		preempt_disable();
+		smp_call_function_many(tmpmask, ipi_sync_rq_state, mm, 1);
+		preempt_enable();
+		free_cpumask_var(tmpmask);
+	}
+	cpus_read_unlock();
+}
+
 static int membarrier_register_global_expedited(void)
 {
 	struct task_struct *p = current;
@@ -186,23 +293,7 @@ static int membarrier_register_global_expedited(void)
 	    MEMBARRIER_STATE_GLOBAL_EXPEDITED_READY)
 		return 0;
 	atomic_or(MEMBARRIER_STATE_GLOBAL_EXPEDITED, &mm->membarrier_state);
-	if (atomic_read(&mm->mm_users) == 1) {
-		/*
-		 * For single mm user, single threaded process, we can
-		 * simply issue a memory barrier after setting
-		 * MEMBARRIER_STATE_GLOBAL_EXPEDITED to guarantee that
-		 * no memory access following registration is reordered
-		 * before registration.
-		 */
-		smp_mb();
-	} else {
-		/*
-		 * For multi-mm user threads, we need to ensure all
-		 * future scheduler executions will observe the new
-		 * thread flag state for this mm.
-		 */
-		synchronize_rcu();
-	}
+	sync_runqueues_membarrier_state(mm);
 	atomic_or(MEMBARRIER_STATE_GLOBAL_EXPEDITED_READY,
 		  &mm->membarrier_state);
 
@@ -213,12 +304,14 @@ static int membarrier_register_private_expedited(int flags)
 {
 	struct task_struct *p = current;
 	struct mm_struct *mm = p->mm;
-	int state = MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY;
+	int ready_state = MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY,
+	    set_state = MEMBARRIER_STATE_PRIVATE_EXPEDITED;
 
 	if (flags & MEMBARRIER_FLAG_SYNC_CORE) {
 		if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE))
 			return -EINVAL;
-		state = MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE_READY;
+		ready_state =
+			MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE_READY;
 	}
 
 	/*
@@ -226,20 +319,13 @@ static int membarrier_register_private_expedited(int flags)
 	 * groups, which use the same mm. (CLONE_VM but not
 	 * CLONE_THREAD).
 	 */
-	if ((atomic_read(&mm->membarrier_state) & state) == state)
+	if ((atomic_read(&mm->membarrier_state) & ready_state) == ready_state)
 		return 0;
-	atomic_or(MEMBARRIER_STATE_PRIVATE_EXPEDITED, &mm->membarrier_state);
 	if (flags & MEMBARRIER_FLAG_SYNC_CORE)
-		atomic_or(MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE,
-			  &mm->membarrier_state);
-	if (atomic_read(&mm->mm_users) != 1) {
-		/*
-		 * Ensure all future scheduler executions will observe the
-		 * new thread flag state for this process.
-		 */
-		synchronize_rcu();
-	}
-	atomic_or(state, &mm->membarrier_state);
+		set_state |= MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE;
+	atomic_or(set_state, &mm->membarrier_state);
+	sync_runqueues_membarrier_state(mm);
+	atomic_or(ready_state, &mm->membarrier_state);
 
 	return 0;
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 802b1f3405f2..85cb30b1b6b7 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -907,6 +907,10 @@ struct rq {
 
 	atomic_t		nr_iowait;
 
+#ifdef CONFIG_MEMBARRIER
+	int membarrier_state;
+#endif
+
 #ifdef CONFIG_SMP
 	struct root_domain		*rd;
 	struct sched_domain __rcu	*sd;
@@ -2423,3 +2427,33 @@ static inline bool sched_energy_enabled(void)
 static inline bool sched_energy_enabled(void) { return false; }
 
 #endif /* CONFIG_ENERGY_MODEL && CONFIG_CPU_FREQ_GOV_SCHEDUTIL */
+
+#ifdef CONFIG_MEMBARRIER
+/*
+ * The scheduler provides memory barriers required by membarrier between:
+ * - prior user-space memory accesses and store to rq->membarrier_state,
+ * - store to rq->membarrier_state and following user-space memory accesses.
+ * In the same way it provides those guarantees around store to rq->curr.
+ */
+static inline void membarrier_switch_mm(struct rq *rq,
+					struct mm_struct *prev_mm,
+					struct mm_struct *next_mm)
+{
+	int membarrier_state;
+
+	if (prev_mm == next_mm)
+		return;
+
+	membarrier_state = atomic_read(&next_mm->membarrier_state);
+	if (READ_ONCE(rq->membarrier_state) == membarrier_state)
+		return;
+
+	WRITE_ONCE(rq->membarrier_state, membarrier_state);
+}
+#else
+static inline void membarrier_switch_mm(struct rq *rq,
+					struct mm_struct *prev_mm,
+					struct mm_struct *next_mm)
+{
+}
+#endif
-- 
2.17.1


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 4/4] Fix: sched/membarrier: p->mm->membarrier_state racy load (v2)
  2019-09-08 13:49     ` [RFC PATCH 4/4] Fix: sched/membarrier: p->mm->membarrier_state racy load (v2) Mathieu Desnoyers
@ 2019-09-08 16:51       ` Linus Torvalds
  2019-09-10  9:48         ` Mathieu Desnoyers
  2019-09-09 11:00       ` Oleg Nesterov
  1 sibling, 1 reply; 21+ messages in thread
From: Linus Torvalds @ 2019-09-08 16:51 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Paul E. McKenney, Ingo Molnar,
	Linux List Kernel Mailing, Oleg Nesterov, Eric W. Biederman,
	Russell King - ARM Linux admin, Chris Metcalf, Christoph Lameter,
	Kirill Tkhai, Mike Galbraith, Thomas Gleixner

On Sun, Sep 8, 2019 at 6:49 AM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> +static void sync_runqueues_membarrier_state(struct mm_struct *mm)
> +{
> +       int membarrier_state = atomic_read(&mm->membarrier_state);
> +       bool fallback = false;
> +       cpumask_var_t tmpmask;
> +
> +       if (!zalloc_cpumask_var(&tmpmask, GFP_NOWAIT)) {
> +               /* Fallback for OOM. */
> +               fallback = true;
> +       }
> +
> +       /*
> +        * For each cpu runqueue, if the task's mm match @mm, ensure that all
> +        * @mm's membarrier state set bits are also set in in the runqueue's
> +        * membarrier state. This ensures that a runqueue scheduling
> +        * between threads which are users of @mm has its membarrier state
> +        * updated.
> +        */
> +       cpus_read_lock();
> +       rcu_read_lock();
> +       for_each_online_cpu(cpu) {
> +               struct rq *rq = cpu_rq(cpu);
> +               struct task_struct *p;
> +
> +               p = task_rcu_dereference(&rq->curr);
> +               if (p && p->mm == mm) {
> +                       if (!fallback)
> +                               __cpumask_set_cpu(cpu, tmpmask);
> +                       else
> +                               smp_call_function_single(cpu, ipi_sync_rq_state,
> +                                                        mm, 1);
> +               }
> +       }

I really absolutely detest this whole "fallback" code.

It will never get any real testing, and the code is just broken.

Why don't you just use the mm_cpumask(mm) unconditionally? Yes, it
will possibly call too many CPU's, but this fallback code is just
completely disgusting.

Do a simple and clean implementation. Then, if you can show real
performance issues (which I doubt), maybe do something else, but even
then you should never do something that will effectively create cases
that have absolutely zero test-coverage.

                      Linus

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 4/4] Fix: sched/membarrier: p->mm->membarrier_state racy load (v2)
  2019-09-08 13:49     ` [RFC PATCH 4/4] Fix: sched/membarrier: p->mm->membarrier_state racy load (v2) Mathieu Desnoyers
  2019-09-08 16:51       ` Linus Torvalds
@ 2019-09-09 11:00       ` Oleg Nesterov
  2019-09-13 15:20         ` Mathieu Desnoyers
  1 sibling, 1 reply; 21+ messages in thread
From: Oleg Nesterov @ 2019-09-09 11:00 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Paul E. McKenney, Ingo Molnar, linux-kernel,
	Eric W. Biederman, Linus Torvalds,
	Russell King - ARM Linux admin, Chris Metcalf, Christoph Lameter,
	Kirill Tkhai, Mike Galbraith, Thomas Gleixner

On 09/08, Mathieu Desnoyers wrote:
>
> +static void sync_runqueues_membarrier_state(struct mm_struct *mm)
> +{
> +	int membarrier_state = atomic_read(&mm->membarrier_state);
> +	bool fallback = false;
> +	cpumask_var_t tmpmask;
> +	int cpu;
> +
> +	if (atomic_read(&mm->mm_users) == 1 || num_online_cpus() == 1) {
> +		WRITE_ONCE(this_rq()->membarrier_state, membarrier_state);

This doesn't look safe, this caller can migrate to another CPU after
it calculates the per-cpu ptr.

I think you need do disable preemption or simply use this_cpu_write().

Oleg.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 4/4] Fix: sched/membarrier: p->mm->membarrier_state racy load (v2)
  2019-09-08 16:51       ` Linus Torvalds
@ 2019-09-10  9:48         ` Mathieu Desnoyers
  2019-09-12 13:48           ` Will Deacon
  0 siblings, 1 reply; 21+ messages in thread
From: Mathieu Desnoyers @ 2019-09-10  9:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, paulmck, Ingo Molnar, linux-kernel,
	Oleg Nesterov, Eric W. Biederman, Russell King, ARM Linux,
	Chris Metcalf, Chris Lameter, Kirill Tkhai, Mike Galbraith,
	Thomas Gleixner, Will Deacon

----- On Sep 8, 2019, at 5:51 PM, Linus Torvalds torvalds@linux-foundation.org wrote:

> On Sun, Sep 8, 2019 at 6:49 AM Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
>>
>> +static void sync_runqueues_membarrier_state(struct mm_struct *mm)
>> +{
>> +       int membarrier_state = atomic_read(&mm->membarrier_state);
>> +       bool fallback = false;
>> +       cpumask_var_t tmpmask;
>> +
>> +       if (!zalloc_cpumask_var(&tmpmask, GFP_NOWAIT)) {
>> +               /* Fallback for OOM. */
>> +               fallback = true;
>> +       }
>> +
>> +       /*
>> +        * For each cpu runqueue, if the task's mm match @mm, ensure that all
>> +        * @mm's membarrier state set bits are also set in in the runqueue's
>> +        * membarrier state. This ensures that a runqueue scheduling
>> +        * between threads which are users of @mm has its membarrier state
>> +        * updated.
>> +        */
>> +       cpus_read_lock();
>> +       rcu_read_lock();
>> +       for_each_online_cpu(cpu) {
>> +               struct rq *rq = cpu_rq(cpu);
>> +               struct task_struct *p;
>> +
>> +               p = task_rcu_dereference(&rq->curr);
>> +               if (p && p->mm == mm) {
>> +                       if (!fallback)
>> +                               __cpumask_set_cpu(cpu, tmpmask);
>> +                       else
>> +                               smp_call_function_single(cpu, ipi_sync_rq_state,
>> +                                                        mm, 1);
>> +               }
>> +       }
> 
> I really absolutely detest this whole "fallback" code.
> 
> It will never get any real testing, and the code is just broken.
> 
> Why don't you just use the mm_cpumask(mm) unconditionally? Yes, it
> will possibly call too many CPU's, but this fallback code is just
> completely disgusting.
> 
> Do a simple and clean implementation. Then, if you can show real
> performance issues (which I doubt), maybe do something else, but even
> then you should never do something that will effectively create cases
> that have absolutely zero test-coverage.

A few points worth mentioning here:

1) As I stated earlier, using mm_cpumask in its current form is not
   an option for membarrier. For two reasons:

   A) The mask is not populated on all architectures (e.g. arm64 does
      not populate it),

   B) Even if it was populated on all architectures, we would need to
      carefully audit and document every spot where this mm_cpumask
      is set or cleared within each architecture code, and ensure we
      have the required memory barriers between user-space memory
      accesses and those stores, documenting those requirements into
      each architecture code in the process. This seems to be a lot of
      useless error-prone code churn.

2) I should actually use GFP_KERNEL rather than GFP_NOWAIT in this
   membarrier registration code. But it can still fail. However, the other
   membarrier code using the same fallback pattern (private and global
   expedited) documents that those membarrier commands do not block in
   the membarrier(2) man page, so GFP_NOWAIT is appropriate in those cases.

3) Testing-wise, I fully agree with your argument of lacking test coverage.
   One option I'm considering would be to add a selftest based on the
   fault-injection infrastructure, which would ensure that we have coverage
   of the failure case in the kernel selftests.

Thoughts ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 4/4] Fix: sched/membarrier: p->mm->membarrier_state racy load (v2)
  2019-09-10  9:48         ` Mathieu Desnoyers
@ 2019-09-12 13:48           ` Will Deacon
  2019-09-12 14:24             ` Linus Torvalds
  0 siblings, 1 reply; 21+ messages in thread
From: Will Deacon @ 2019-09-12 13:48 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Linus Torvalds, Peter Zijlstra, paulmck, Ingo Molnar,
	linux-kernel, Oleg Nesterov, Eric W. Biederman, Russell King,
	ARM Linux, Chris Metcalf, Chris Lameter, Kirill Tkhai,
	Mike Galbraith, Thomas Gleixner

On Tue, Sep 10, 2019 at 05:48:02AM -0400, Mathieu Desnoyers wrote:
> ----- On Sep 8, 2019, at 5:51 PM, Linus Torvalds torvalds@linux-foundation.org wrote:
> 
> > On Sun, Sep 8, 2019 at 6:49 AM Mathieu Desnoyers
> > <mathieu.desnoyers@efficios.com> wrote:
> >>
> >> +static void sync_runqueues_membarrier_state(struct mm_struct *mm)
> >> +{
> >> +       int membarrier_state = atomic_read(&mm->membarrier_state);
> >> +       bool fallback = false;
> >> +       cpumask_var_t tmpmask;
> >> +
> >> +       if (!zalloc_cpumask_var(&tmpmask, GFP_NOWAIT)) {
> >> +               /* Fallback for OOM. */
> >> +               fallback = true;
> >> +       }
> >> +
> >> +       /*
> >> +        * For each cpu runqueue, if the task's mm match @mm, ensure that all
> >> +        * @mm's membarrier state set bits are also set in in the runqueue's
> >> +        * membarrier state. This ensures that a runqueue scheduling
> >> +        * between threads which are users of @mm has its membarrier state
> >> +        * updated.
> >> +        */
> >> +       cpus_read_lock();
> >> +       rcu_read_lock();
> >> +       for_each_online_cpu(cpu) {
> >> +               struct rq *rq = cpu_rq(cpu);
> >> +               struct task_struct *p;
> >> +
> >> +               p = task_rcu_dereference(&rq->curr);
> >> +               if (p && p->mm == mm) {
> >> +                       if (!fallback)
> >> +                               __cpumask_set_cpu(cpu, tmpmask);
> >> +                       else
> >> +                               smp_call_function_single(cpu, ipi_sync_rq_state,
> >> +                                                        mm, 1);
> >> +               }
> >> +       }
> > 
> > I really absolutely detest this whole "fallback" code.
> > 
> > It will never get any real testing, and the code is just broken.
> > 
> > Why don't you just use the mm_cpumask(mm) unconditionally? Yes, it
> > will possibly call too many CPU's, but this fallback code is just
> > completely disgusting.
> > 
> > Do a simple and clean implementation. Then, if you can show real
> > performance issues (which I doubt), maybe do something else, but even
> > then you should never do something that will effectively create cases
> > that have absolutely zero test-coverage.
> 
> A few points worth mentioning here:
> 
> 1) As I stated earlier, using mm_cpumask in its current form is not
>    an option for membarrier. For two reasons:
> 
>    A) The mask is not populated on all architectures (e.g. arm64 does
>       not populate it),
> 
>    B) Even if it was populated on all architectures, we would need to
>       carefully audit and document every spot where this mm_cpumask
>       is set or cleared within each architecture code, and ensure we
>       have the required memory barriers between user-space memory
>       accesses and those stores, documenting those requirements into
>       each architecture code in the process. This seems to be a lot of
>       useless error-prone code churn.
> 
> 2) I should actually use GFP_KERNEL rather than GFP_NOWAIT in this
>    membarrier registration code. But it can still fail. However, the other
>    membarrier code using the same fallback pattern (private and global
>    expedited) documents that those membarrier commands do not block in
>    the membarrier(2) man page, so GFP_NOWAIT is appropriate in those cases.
> 
> 3) Testing-wise, I fully agree with your argument of lacking test coverage.
>    One option I'm considering would be to add a selftest based on the
>    fault-injection infrastructure, which would ensure that we have coverage
>    of the failure case in the kernel selftests.
> 
> Thoughts ?

So the man page for sys_membarrier states that the expedited variants "never
block", which feels pretty strong. Do any other system calls claim to
provide this guarantee without a failure path if blocking is necessary?
Given that the whole thing is preemptible, I'm also curious as to how
exactly userspace relies on this non-blocking guarantee.  I'd have thought
that you could either just bite the bullet and block in the rare case that
you need to when allocating the cpumask, or you could just return
-EWOULDBLOCK on allocation failure, given that I suspect there are very few
users of this system call right now and it's not yet supported by glibc
afaik.

Will

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 4/4] Fix: sched/membarrier: p->mm->membarrier_state racy load (v2)
  2019-09-12 13:48           ` Will Deacon
@ 2019-09-12 14:24             ` Linus Torvalds
  2019-09-12 15:47               ` Will Deacon
  0 siblings, 1 reply; 21+ messages in thread
From: Linus Torvalds @ 2019-09-12 14:24 UTC (permalink / raw)
  To: Will Deacon
  Cc: Mathieu Desnoyers, Peter Zijlstra, paulmck, Ingo Molnar,
	linux-kernel, Oleg Nesterov, Eric W. Biederman, Russell King,
	ARM Linux, Chris Metcalf, Chris Lameter, Kirill Tkhai,
	Mike Galbraith, Thomas Gleixner

On Thu, Sep 12, 2019 at 2:48 PM Will Deacon <will@kernel.org> wrote:
>
> So the man page for sys_membarrier states that the expedited variants "never
> block", which feels pretty strong. Do any other system calls claim to
> provide this guarantee without a failure path if blocking is necessary?

The traditional semantics for "we don't block" is that "we block on
memory allocations and locking and user accesses etc, but  we don't
wait for our own IO".

So there may be new IO started (and waited on) as part of allocating
new memory etc, or in just paging in user memory, but the IO that the
operation _itself_ explicitly starts is not waited on.

No system call should ever be considered "atomic" in any sense. If
you're doing RT, you should maybe expect "getpid()" to not ever block,
but that's just about the exclusive list of truly nonblocking system
calls, and even that can be preempted.

            Linus

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 4/4] Fix: sched/membarrier: p->mm->membarrier_state racy load (v2)
  2019-09-12 14:24             ` Linus Torvalds
@ 2019-09-12 15:47               ` Will Deacon
  2019-09-13 14:22                 ` Mathieu Desnoyers
  0 siblings, 1 reply; 21+ messages in thread
From: Will Deacon @ 2019-09-12 15:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mathieu Desnoyers, Peter Zijlstra, paulmck, Ingo Molnar,
	linux-kernel, Oleg Nesterov, Eric W. Biederman, Russell King,
	ARM Linux, Chris Metcalf, Chris Lameter, Kirill Tkhai,
	Mike Galbraith, Thomas Gleixner

On Thu, Sep 12, 2019 at 03:24:35PM +0100, Linus Torvalds wrote:
> On Thu, Sep 12, 2019 at 2:48 PM Will Deacon <will@kernel.org> wrote:
> >
> > So the man page for sys_membarrier states that the expedited variants "never
> > block", which feels pretty strong. Do any other system calls claim to
> > provide this guarantee without a failure path if blocking is necessary?
> 
> The traditional semantics for "we don't block" is that "we block on
> memory allocations and locking and user accesses etc, but  we don't
> wait for our own IO".
> 
> So there may be new IO started (and waited on) as part of allocating
> new memory etc, or in just paging in user memory, but the IO that the
> operation _itself_ explicitly starts is not waited on.

Thanks, that makes sense, and I'd be inclined to suggest an update to the
sys_membarrier manpage to make this more clear since the "never blocks"
phrasing doesn't seem to be used like this for other system calls.

> No system call should ever be considered "atomic" in any sense. If
> you're doing RT, you should maybe expect "getpid()" to not ever block,
> but that's just about the exclusive list of truly nonblocking system
> calls, and even that can be preempted.

In which case, why can't we just use GFP_KERNEL for the cpumask allocation
instead of GFP_NOWAIT and then remove the failure path altogether? Mathieu?

Will

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 4/4] Fix: sched/membarrier: p->mm->membarrier_state racy load (v2)
  2019-09-12 15:47               ` Will Deacon
@ 2019-09-13 14:22                 ` Mathieu Desnoyers
  2019-09-19 16:26                   ` Will Deacon
  0 siblings, 1 reply; 21+ messages in thread
From: Mathieu Desnoyers @ 2019-09-13 14:22 UTC (permalink / raw)
  To: Will Deacon, Michael Kerrisk, Linus Torvalds
  Cc: Peter Zijlstra, paulmck, Ingo Molnar, linux-kernel,
	Oleg Nesterov, Eric W. Biederman, Russell King, ARM Linux,
	Chris Metcalf, Chris Lameter, Kirill Tkhai, Mike Galbraith,
	Thomas Gleixner

----- On Sep 12, 2019, at 11:47 AM, Will Deacon will@kernel.org wrote:

> On Thu, Sep 12, 2019 at 03:24:35PM +0100, Linus Torvalds wrote:
>> On Thu, Sep 12, 2019 at 2:48 PM Will Deacon <will@kernel.org> wrote:
>> >
>> > So the man page for sys_membarrier states that the expedited variants "never
>> > block", which feels pretty strong. Do any other system calls claim to
>> > provide this guarantee without a failure path if blocking is necessary?
>> 
>> The traditional semantics for "we don't block" is that "we block on
>> memory allocations and locking and user accesses etc, but  we don't
>> wait for our own IO".
>> 
>> So there may be new IO started (and waited on) as part of allocating
>> new memory etc, or in just paging in user memory, but the IO that the
>> operation _itself_ explicitly starts is not waited on.
> 
> Thanks, that makes sense, and I'd be inclined to suggest an update to the
> sys_membarrier manpage to make this more clear since the "never blocks"
> phrasing doesn't seem to be used like this for other system calls.

The current wording from membarrier(2) is:

              The  "expedited" commands complete faster than the non-expedited
              ones; they never block, but have the downside of  causing  extra
              overhead.

We could simply remove the "; they never block" part then ?

> 
>> No system call should ever be considered "atomic" in any sense. If
>> you're doing RT, you should maybe expect "getpid()" to not ever block,
>> but that's just about the exclusive list of truly nonblocking system
>> calls, and even that can be preempted.
> 
> In which case, why can't we just use GFP_KERNEL for the cpumask allocation
> instead of GFP_NOWAIT and then remove the failure path altogether? Mathieu?

Looking at:

#define GFP_KERNEL      (__GFP_RECLAIM | __GFP_IO | __GFP_FS)

I notice that it does not include __GFP_NOFAIL. What prevents GFP_KERNEL from
failing, and where is this guarantee documented ?

Regarding __GFP_NOFAIL, its use seems to be discouraged in linux/gfp.h:

 * %__GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
 * cannot handle allocation failures. The allocation could block
 * indefinitely but will never return with failure. Testing for
 * failure is pointless.
 * New users should be evaluated carefully (and the flag should be
 * used only when there is no reasonable failure policy) but it is
 * definitely preferable to use the flag rather than opencode endless
 * loop around allocator.
 * Using this flag for costly allocations is _highly_ discouraged.

So I am reluctant to use it.

But if we can agree on the right combination of flags that guarantees there
is no failure, I would be perfectly fine with using them to remove the fallback
code.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 4/4] Fix: sched/membarrier: p->mm->membarrier_state racy load (v2)
  2019-09-09 11:00       ` Oleg Nesterov
@ 2019-09-13 15:20         ` Mathieu Desnoyers
  2019-09-13 16:04           ` Oleg Nesterov
  0 siblings, 1 reply; 21+ messages in thread
From: Mathieu Desnoyers @ 2019-09-13 15:20 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, paulmck, Ingo Molnar, linux-kernel,
	Eric W. Biederman, Linus Torvalds, Russell King, ARM Linux,
	Chris Metcalf, Chris Lameter, Kirill Tkhai, Mike Galbraith,
	Thomas Gleixner

----- On Sep 9, 2019, at 7:00 AM, Oleg Nesterov oleg@redhat.com wrote:

> On 09/08, Mathieu Desnoyers wrote:
>>
>> +static void sync_runqueues_membarrier_state(struct mm_struct *mm)
>> +{
>> +	int membarrier_state = atomic_read(&mm->membarrier_state);
>> +	bool fallback = false;
>> +	cpumask_var_t tmpmask;
>> +	int cpu;
>> +
>> +	if (atomic_read(&mm->mm_users) == 1 || num_online_cpus() == 1) {
>> +		WRITE_ONCE(this_rq()->membarrier_state, membarrier_state);
> 
> This doesn't look safe, this caller can migrate to another CPU after
> it calculates the per-cpu ptr.
> 
> I think you need do disable preemption or simply use this_cpu_write().

Good point! I'll use this_cpu_write() there and within
membarrier_exec_mmap(), which seems to be affected by the same problem.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 4/4] Fix: sched/membarrier: p->mm->membarrier_state racy load (v2)
  2019-09-13 15:20         ` Mathieu Desnoyers
@ 2019-09-13 16:04           ` Oleg Nesterov
  2019-09-13 17:07             ` Mathieu Desnoyers
  0 siblings, 1 reply; 21+ messages in thread
From: Oleg Nesterov @ 2019-09-13 16:04 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, paulmck, Ingo Molnar, linux-kernel,
	Eric W. Biederman, Linus Torvalds, Russell King, ARM Linux,
	Chris Metcalf, Chris Lameter, Kirill Tkhai, Mike Galbraith,
	Thomas Gleixner

On 09/13, Mathieu Desnoyers wrote:
>
> membarrier_exec_mmap(), which seems to be affected by the same problem.

IIRC, in the last version it is called by exec_mmap() undef task_lock(),
so it should fine.

Oleg.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 4/4] Fix: sched/membarrier: p->mm->membarrier_state racy load (v2)
  2019-09-13 16:04           ` Oleg Nesterov
@ 2019-09-13 17:07             ` Mathieu Desnoyers
  0 siblings, 0 replies; 21+ messages in thread
From: Mathieu Desnoyers @ 2019-09-13 17:07 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, paulmck, Ingo Molnar, linux-kernel,
	Eric W. Biederman, Linus Torvalds, Russell King, ARM Linux,
	Chris Metcalf, Chris Lameter, Kirill Tkhai, Mike Galbraith,
	Thomas Gleixner



----- On Sep 13, 2019, at 12:04 PM, Oleg Nesterov oleg@redhat.com wrote:

> On 09/13, Mathieu Desnoyers wrote:
>>
>> membarrier_exec_mmap(), which seems to be affected by the same problem.
> 
> IIRC, in the last version it is called by exec_mmap() undef task_lock(),
> so it should fine.

Fair point. Although it seems rather cleaner to use this_cpu_write() in
all 3 sites updating this variable rather than a mix of this_cpu_write
and WRITE_ONCE(), unless anyone objects.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 4/4] Fix: sched/membarrier: p->mm->membarrier_state racy load (v2)
  2019-09-13 14:22                 ` Mathieu Desnoyers
@ 2019-09-19 16:26                   ` Will Deacon
  2019-09-19 17:33                     ` Mathieu Desnoyers
  0 siblings, 1 reply; 21+ messages in thread
From: Will Deacon @ 2019-09-19 16:26 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Michael Kerrisk, Linus Torvalds, Peter Zijlstra, paulmck,
	Ingo Molnar, linux-kernel, Oleg Nesterov, Eric W. Biederman,
	Russell King, ARM Linux, Chris Metcalf, Chris Lameter,
	Kirill Tkhai, Mike Galbraith, Thomas Gleixner

Hi Mathieu,

Sorry for the delay in responding.

On Fri, Sep 13, 2019 at 10:22:28AM -0400, Mathieu Desnoyers wrote:
> ----- On Sep 12, 2019, at 11:47 AM, Will Deacon will@kernel.org wrote:
> 
> > On Thu, Sep 12, 2019 at 03:24:35PM +0100, Linus Torvalds wrote:
> >> On Thu, Sep 12, 2019 at 2:48 PM Will Deacon <will@kernel.org> wrote:
> >> >
> >> > So the man page for sys_membarrier states that the expedited variants "never
> >> > block", which feels pretty strong. Do any other system calls claim to
> >> > provide this guarantee without a failure path if blocking is necessary?
> >> 
> >> The traditional semantics for "we don't block" is that "we block on
> >> memory allocations and locking and user accesses etc, but  we don't
> >> wait for our own IO".
> >> 
> >> So there may be new IO started (and waited on) as part of allocating
> >> new memory etc, or in just paging in user memory, but the IO that the
> >> operation _itself_ explicitly starts is not waited on.
> > 
> > Thanks, that makes sense, and I'd be inclined to suggest an update to the
> > sys_membarrier manpage to make this more clear since the "never blocks"
> > phrasing doesn't seem to be used like this for other system calls.
> 
> The current wording from membarrier(2) is:
> 
>               The  "expedited" commands complete faster than the non-expedited
>               ones; they never block, but have the downside of  causing  extra
>               overhead.
> 
> We could simply remove the "; they never block" part then ?

I think so, yes. That or, "; they do not voluntarily block" or something
like that. Maybe look at other man pages for inspiration ;)

> >> No system call should ever be considered "atomic" in any sense. If
> >> you're doing RT, you should maybe expect "getpid()" to not ever block,
> >> but that's just about the exclusive list of truly nonblocking system
> >> calls, and even that can be preempted.
> > 
> > In which case, why can't we just use GFP_KERNEL for the cpumask allocation
> > instead of GFP_NOWAIT and then remove the failure path altogether? Mathieu?
> 
> Looking at:
> 
> #define GFP_KERNEL      (__GFP_RECLAIM | __GFP_IO | __GFP_FS)
> 
> I notice that it does not include __GFP_NOFAIL. What prevents GFP_KERNEL from
> failing, and where is this guarantee documented ?

There was an lwn article a little while ago about this:

https://lwn.net/Articles/723317/

I'm not sure what (if anything) has changed in this regard since then,
however.

> Regarding __GFP_NOFAIL, its use seems to be discouraged in linux/gfp.h:
> 
>  * %__GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
>  * cannot handle allocation failures. The allocation could block
>  * indefinitely but will never return with failure. Testing for
>  * failure is pointless.
>  * New users should be evaluated carefully (and the flag should be
>  * used only when there is no reasonable failure policy) but it is
>  * definitely preferable to use the flag rather than opencode endless
>  * loop around allocator.
>  * Using this flag for costly allocations is _highly_ discouraged.
> 
> So I am reluctant to use it.
> 
> But if we can agree on the right combination of flags that guarantees there
> is no failure, I would be perfectly fine with using them to remove the fallback
> code.

I reckon you'll be fine using GFP_KERNEL and returning -ENOMEM on allocation
failure. This shouldn't happen in practice and it removes the fallback
path.

Will

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 4/4] Fix: sched/membarrier: p->mm->membarrier_state racy load (v2)
  2019-09-19 16:26                   ` Will Deacon
@ 2019-09-19 17:33                     ` Mathieu Desnoyers
  0 siblings, 0 replies; 21+ messages in thread
From: Mathieu Desnoyers @ 2019-09-19 17:33 UTC (permalink / raw)
  To: Will Deacon
  Cc: Michael Kerrisk, Linus Torvalds, Peter Zijlstra, paulmck,
	Ingo Molnar, linux-kernel, Oleg Nesterov, Eric W. Biederman,
	Russell King, ARM Linux, Chris Metcalf, Chris Lameter,
	Kirill Tkhai, Mike Galbraith, Thomas Gleixner

----- On Sep 19, 2019, at 12:26 PM, Will Deacon will@kernel.org wrote:

[...]
>> 
>> The current wording from membarrier(2) is:
>> 
>>               The  "expedited" commands complete faster than the non-expedited
>>               ones; they never block, but have the downside of  causing  extra
>>               overhead.
>> 
>> We could simply remove the "; they never block" part then ?
> 
> I think so, yes. That or, "; they do not voluntarily block" or something
> like that. Maybe look at other man pages for inspiration ;)

OK, let's tackle the man-pages part after the fix reaches mainline though.

[...]
> 
> I reckon you'll be fine using GFP_KERNEL and returning -ENOMEM on allocation
> failure. This shouldn't happen in practice and it removes the fallback
> path.

Works for me! I'll prepare an updated patchset.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2019-09-19 17:33 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-09-06  3:12 [RFC PATCH 0/4] Membarrier fixes/cleanups Mathieu Desnoyers
2019-09-06  3:12 ` [RFC PATCH 1/4] Fix: sched/membarrier: private expedited registration check Mathieu Desnoyers
2019-09-06  3:12 ` [RFC PATCH 2/4] Cleanup: sched/membarrier: remove redundant check Mathieu Desnoyers
2019-09-06  3:12 ` [RFC PATCH 3/4] Cleanup: sched/membarrier: only sync_core before usermode for same mm Mathieu Desnoyers
2019-09-06  7:41   ` Peter Zijlstra
2019-09-06 13:40     ` Mathieu Desnoyers
2019-09-06  3:13 ` [RFC PATCH 4/4] Fix: sched/membarrier: p->mm->membarrier_state racy load Mathieu Desnoyers
2019-09-06  8:23   ` Peter Zijlstra
2019-09-08 13:49     ` [RFC PATCH 4/4] Fix: sched/membarrier: p->mm->membarrier_state racy load (v2) Mathieu Desnoyers
2019-09-08 16:51       ` Linus Torvalds
2019-09-10  9:48         ` Mathieu Desnoyers
2019-09-12 13:48           ` Will Deacon
2019-09-12 14:24             ` Linus Torvalds
2019-09-12 15:47               ` Will Deacon
2019-09-13 14:22                 ` Mathieu Desnoyers
2019-09-19 16:26                   ` Will Deacon
2019-09-19 17:33                     ` Mathieu Desnoyers
2019-09-09 11:00       ` Oleg Nesterov
2019-09-13 15:20         ` Mathieu Desnoyers
2019-09-13 16:04           ` Oleg Nesterov
2019-09-13 17:07             ` Mathieu Desnoyers

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).