* [RFC PATCH 0/4] Membarrier fixes/cleanups @ 2019-09-06 3:12 Mathieu Desnoyers 2019-09-06 3:12 ` [RFC PATCH 1/4] Fix: sched/membarrier: private expedited registration check Mathieu Desnoyers ` (3 more replies) 0 siblings, 4 replies; 21+ messages in thread From: Mathieu Desnoyers @ 2019-09-06 3:12 UTC (permalink / raw) To: Peter Zijlstra, Paul E. McKenney Cc: Ingo Molnar, linux-kernel, Oleg Nesterov, Eric W. Biederman, Linus Torvalds, Russell King - ARM Linux admin, Chris Metcalf, Christoph Lameter, Kirill Tkhai, Mike Galbraith, Thomas Gleixner, Mathieu Desnoyers This series of fixes/cleanups is submitted for feedback. It takes care of membarrier issues recently discussed. Thanks, Mathieu Mathieu Desnoyers (4): Fix: sched/membarrier: private expedited registration check Cleanup: sched/membarrier: remove redundant check Cleanup: sched/membarrier: only sync_core before usermode for same mm Fix: sched/membarrier: p->mm->membarrier_state racy load include/linux/mm_types.h | 7 +- include/linux/sched/mm.h | 8 +-- kernel/sched/core.c | 1 + kernel/sched/membarrier.c | 141 +++++++++++++++++++++++++++----------- kernel/sched/sched.h | 33 +++++++++ 5 files changed, 143 insertions(+), 47 deletions(-) -- 2.17.1 ^ permalink raw reply [flat|nested] 21+ messages in thread
* [RFC PATCH 1/4] Fix: sched/membarrier: private expedited registration check 2019-09-06 3:12 [RFC PATCH 0/4] Membarrier fixes/cleanups Mathieu Desnoyers @ 2019-09-06 3:12 ` Mathieu Desnoyers 2019-09-06 3:12 ` [RFC PATCH 2/4] Cleanup: sched/membarrier: remove redundant check Mathieu Desnoyers ` (2 subsequent siblings) 3 siblings, 0 replies; 21+ messages in thread From: Mathieu Desnoyers @ 2019-09-06 3:12 UTC (permalink / raw) To: Peter Zijlstra, Paul E. McKenney Cc: Ingo Molnar, linux-kernel, Oleg Nesterov, Eric W. Biederman, Linus Torvalds, Russell King - ARM Linux admin, Chris Metcalf, Christoph Lameter, Kirill Tkhai, Mike Galbraith, Thomas Gleixner, Mathieu Desnoyers Fix a logic flaw in the way membarrier_register_private_expedited() handles ready state checks for private expedited sync core and private expedited registrations. If a private expedited membarrier registration is first performed, and then a private expedited sync_core registration is performed, the ready state check will skip the second registration when it really should not. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: "Paul E. McKenney" <paulmck@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Christoph Lameter <cl@linux.com> Cc: Kirill Tkhai <tkhai@yandex.ru> Cc: Mike Galbraith <efault@gmx.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> --- kernel/sched/membarrier.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c index aa8d75804108..5110d91b1b0e 100644 --- a/kernel/sched/membarrier.c +++ b/kernel/sched/membarrier.c @@ -226,7 +226,7 @@ static int membarrier_register_private_expedited(int flags) * groups, which use the same mm. 
(CLONE_VM but not * CLONE_THREAD). */ - if (atomic_read(&mm->membarrier_state) & state) + if ((atomic_read(&mm->membarrier_state) & state) == state) return 0; atomic_or(MEMBARRIER_STATE_PRIVATE_EXPEDITED, &mm->membarrier_state); if (flags & MEMBARRIER_FLAG_SYNC_CORE) -- 2.17.1 ^ permalink raw reply related [flat|nested] 21+ messages in thread
* [RFC PATCH 2/4] Cleanup: sched/membarrier: remove redundant check 2019-09-06 3:12 [RFC PATCH 0/4] Membarrier fixes/cleanups Mathieu Desnoyers 2019-09-06 3:12 ` [RFC PATCH 1/4] Fix: sched/membarrier: private expedited registration check Mathieu Desnoyers @ 2019-09-06 3:12 ` Mathieu Desnoyers 2019-09-06 3:12 ` [RFC PATCH 3/4] Cleanup: sched/membarrier: only sync_core before usermode for same mm Mathieu Desnoyers 2019-09-06 3:13 ` [RFC PATCH 4/4] Fix: sched/membarrier: p->mm->membarrier_state racy load Mathieu Desnoyers 3 siblings, 0 replies; 21+ messages in thread From: Mathieu Desnoyers @ 2019-09-06 3:12 UTC (permalink / raw) To: Peter Zijlstra, Paul E. McKenney Cc: Ingo Molnar, linux-kernel, Oleg Nesterov, Eric W. Biederman, Linus Torvalds, Russell King - ARM Linux admin, Chris Metcalf, Christoph Lameter, Kirill Tkhai, Mike Galbraith, Thomas Gleixner, Mathieu Desnoyers Checking that the number of threads is 1 is redundant with checking mm_users == 1. Suggested-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: "Paul E. McKenney" <paulmck@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. 
Biederman" <ebiederm@xmission.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Christoph Lameter <cl@linux.com> Cc: Kirill Tkhai <tkhai@yandex.ru> Cc: Mike Galbraith <efault@gmx.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> --- kernel/sched/membarrier.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c index 5110d91b1b0e..7e0a0d6535f3 100644 --- a/kernel/sched/membarrier.c +++ b/kernel/sched/membarrier.c @@ -186,7 +186,7 @@ static int membarrier_register_global_expedited(void) MEMBARRIER_STATE_GLOBAL_EXPEDITED_READY) return 0; atomic_or(MEMBARRIER_STATE_GLOBAL_EXPEDITED, &mm->membarrier_state); - if (atomic_read(&mm->mm_users) == 1 && get_nr_threads(p) == 1) { + if (atomic_read(&mm->mm_users) == 1) { /* * For single mm user, single threaded process, we can * simply issue a memory barrier after setting @@ -232,7 +232,7 @@ static int membarrier_register_private_expedited(int flags) if (flags & MEMBARRIER_FLAG_SYNC_CORE) atomic_or(MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE, &mm->membarrier_state); - if (!(atomic_read(&mm->mm_users) == 1 && get_nr_threads(p) == 1)) { + if (atomic_read(&mm->mm_users) != 1) { /* * Ensure all future scheduler executions will observe the * new thread flag state for this process. -- 2.17.1 ^ permalink raw reply related [flat|nested] 21+ messages in thread
* [RFC PATCH 3/4] Cleanup: sched/membarrier: only sync_core before usermode for same mm 2019-09-06 3:12 [RFC PATCH 0/4] Membarrier fixes/cleanups Mathieu Desnoyers 2019-09-06 3:12 ` [RFC PATCH 1/4] Fix: sched/membarrier: private expedited registration check Mathieu Desnoyers 2019-09-06 3:12 ` [RFC PATCH 2/4] Cleanup: sched/membarrier: remove redundant check Mathieu Desnoyers @ 2019-09-06 3:12 ` Mathieu Desnoyers 2019-09-06 7:41 ` Peter Zijlstra 2019-09-06 3:13 ` [RFC PATCH 4/4] Fix: sched/membarrier: p->mm->membarrier_state racy load Mathieu Desnoyers 3 siblings, 1 reply; 21+ messages in thread From: Mathieu Desnoyers @ 2019-09-06 3:12 UTC (permalink / raw) To: Peter Zijlstra, Paul E. McKenney Cc: Ingo Molnar, linux-kernel, Oleg Nesterov, Eric W. Biederman, Linus Torvalds, Russell King - ARM Linux admin, Chris Metcalf, Christoph Lameter, Kirill Tkhai, Mike Galbraith, Thomas Gleixner, Mathieu Desnoyers When the prev and next task's mm change, switch_mm() provides the core serializing guarantees before returning to usermode. The only case where an explicit core serialization is needed is when the scheduler keeps the same mm for prev and next. Suggested-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: "Paul E. McKenney" <paulmck@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. 
Biederman" <ebiederm@xmission.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Christoph Lameter <cl@linux.com> Cc: Kirill Tkhai <tkhai@yandex.ru> Cc: Mike Galbraith <efault@gmx.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> --- include/linux/sched/mm.h | 2 ++ 1 file changed, 2 insertions(+) diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h index 4a7944078cc3..8557ec664213 100644 --- a/include/linux/sched/mm.h +++ b/include/linux/sched/mm.h @@ -362,6 +362,8 @@ enum { static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm) { + if (current->mm != mm) + return; if (likely(!(atomic_read(&mm->membarrier_state) & MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE))) return; -- 2.17.1 ^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [RFC PATCH 3/4] Cleanup: sched/membarrier: only sync_core before usermode for same mm 2019-09-06 3:12 ` [RFC PATCH 3/4] Cleanup: sched/membarrier: only sync_core before usermode for same mm Mathieu Desnoyers @ 2019-09-06 7:41 ` Peter Zijlstra 2019-09-06 13:40 ` Mathieu Desnoyers 0 siblings, 1 reply; 21+ messages in thread From: Peter Zijlstra @ 2019-09-06 7:41 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Paul E. McKenney, Ingo Molnar, linux-kernel, Oleg Nesterov, Eric W. Biederman, Linus Torvalds, Russell King - ARM Linux admin, Chris Metcalf, Christoph Lameter, Kirill Tkhai, Mike Galbraith, Thomas Gleixner On Thu, Sep 05, 2019 at 11:12:59PM -0400, Mathieu Desnoyers wrote: > When the prev and next task's mm change, switch_mm() provides the core > serializing guarantees before returning to usermode. The only case > where an explicit core serialization is needed is when the scheduler > keeps the same mm for prev and next. > > Suggested-by: Oleg Nesterov <oleg@redhat.com> > Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> > Cc: "Paul E. McKenney" <paulmck@linux.ibm.com> > Cc: Peter Zijlstra <peterz@infradead.org> > Cc: Oleg Nesterov <oleg@redhat.com> > Cc: "Eric W. 
Biederman" <ebiederm@xmission.com> > Cc: Linus Torvalds <torvalds@linux-foundation.org> > Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk> > Cc: Chris Metcalf <cmetcalf@ezchip.com> > Cc: Christoph Lameter <cl@linux.com> > Cc: Kirill Tkhai <tkhai@yandex.ru> > Cc: Mike Galbraith <efault@gmx.de> > Cc: Thomas Gleixner <tglx@linutronix.de> > Cc: Ingo Molnar <mingo@kernel.org> > --- > include/linux/sched/mm.h | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h > index 4a7944078cc3..8557ec664213 100644 > --- a/include/linux/sched/mm.h > +++ b/include/linux/sched/mm.h > @@ -362,6 +362,8 @@ enum { > > static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm) > { > + if (current->mm != mm) > + return; > if (likely(!(atomic_read(&mm->membarrier_state) & > MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE))) > return; So SYNC_CORE is about I$ coherency and funny thing like that. Now it seems 'natural' that if we flip the address space, that I$ also gets wiped/updated, because the whole text mapping changes. But did we just assume that, or did we verify the truth of this? (I'm just being paranoid here) ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC PATCH 3/4] Cleanup: sched/membarrier: only sync_core before usermode for same mm 2019-09-06 7:41 ` Peter Zijlstra @ 2019-09-06 13:40 ` Mathieu Desnoyers 0 siblings, 0 replies; 21+ messages in thread From: Mathieu Desnoyers @ 2019-09-06 13:40 UTC (permalink / raw) To: Peter Zijlstra Cc: paulmck, Ingo Molnar, linux-kernel, Oleg Nesterov, Eric W. Biederman, Linus Torvalds, Russell King, ARM Linux, Chris Metcalf, Chris Lameter, Kirill Tkhai, Mike Galbraith, Thomas Gleixner ----- On Sep 6, 2019, at 3:41 AM, Peter Zijlstra peterz@infradead.org wrote: > On Thu, Sep 05, 2019 at 11:12:59PM -0400, Mathieu Desnoyers wrote: >> When the prev and next task's mm change, switch_mm() provides the core >> serializing guarantees before returning to usermode. The only case >> where an explicit core serialization is needed is when the scheduler >> keeps the same mm for prev and next. >> >> Suggested-by: Oleg Nesterov <oleg@redhat.com> >> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> >> Cc: "Paul E. McKenney" <paulmck@linux.ibm.com> >> Cc: Peter Zijlstra <peterz@infradead.org> >> Cc: Oleg Nesterov <oleg@redhat.com> >> Cc: "Eric W. 
Biederman" <ebiederm@xmission.com> >> Cc: Linus Torvalds <torvalds@linux-foundation.org> >> Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk> >> Cc: Chris Metcalf <cmetcalf@ezchip.com> >> Cc: Christoph Lameter <cl@linux.com> >> Cc: Kirill Tkhai <tkhai@yandex.ru> >> Cc: Mike Galbraith <efault@gmx.de> >> Cc: Thomas Gleixner <tglx@linutronix.de> >> Cc: Ingo Molnar <mingo@kernel.org> >> --- >> include/linux/sched/mm.h | 2 ++ >> 1 file changed, 2 insertions(+) >> >> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h >> index 4a7944078cc3..8557ec664213 100644 >> --- a/include/linux/sched/mm.h >> +++ b/include/linux/sched/mm.h >> @@ -362,6 +362,8 @@ enum { >> >> static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm) >> { >> + if (current->mm != mm) >> + return; >> if (likely(!(atomic_read(&mm->membarrier_state) & >> MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE))) >> return; > > So SYNC_CORE is about I$ coherency and funny thing like that. Now it > seems 'natural' that if we flip the address space, that I$ also gets > wiped/updated, because the whole text mapping changes. > > But did we just assume that, or did we verify the truth of this? (I'm > just being paranoid here) We have documented those here: Documentation/features/sched/membarrier-sync-core/arch-support.txt For instance, x86: # * x86 # # x86-32 uses IRET as return from interrupt, which takes care of the IPI. # However, it uses both IRET and SYSEXIT to go back to user-space. The IRET # instruction is core serializing, but not SYSEXIT. # # x86-64 uses IRET as return from interrupt, which takes care of the IPI. # However, it can return to user-space through either SYSRETL (compat code), # SYSRETQ, or IRET. 
# # Given that neither SYSRET{L,Q}, nor SYSEXIT, are core serializing, we rely # instead on write_cr3() performed by switch_mm() to provide core serialization # after changing the current mm, and deal with the special case of kthread -> # uthread (temporarily keeping current mm into active_mm) by issuing a # sync_core_before_usermode() in that specific case. I've also made sure x86 switch_mm_irqs_off() has the following comment, just in case someone too keen on optimizing away the CR3 write would forget to look at arch-support.txt: /* * The membarrier system call requires a full memory barrier and * core serialization before returning to user-space, after * storing to rq->curr. Writing to CR3 provides that full * memory barrier and core serializing instruction. */ In the case of arm/arm64, they have no requirement on switch_mm(): # * arm/arm64 # # Rely on implicit context synchronization as a result of exception return # when returning from IPI handler, and when returning to user-space. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com ^ permalink raw reply [flat|nested] 21+ messages in thread
* [RFC PATCH 4/4] Fix: sched/membarrier: p->mm->membarrier_state racy load 2019-09-06 3:12 [RFC PATCH 0/4] Membarrier fixes/cleanups Mathieu Desnoyers ` (2 preceding siblings ...) 2019-09-06 3:12 ` [RFC PATCH 3/4] Cleanup: sched/membarrier: only sync_core before usermode for same mm Mathieu Desnoyers @ 2019-09-06 3:13 ` Mathieu Desnoyers 2019-09-06 8:23 ` Peter Zijlstra 3 siblings, 1 reply; 21+ messages in thread From: Mathieu Desnoyers @ 2019-09-06 3:13 UTC (permalink / raw) To: Peter Zijlstra, Paul E. McKenney Cc: Ingo Molnar, linux-kernel, Oleg Nesterov, Eric W. Biederman, Linus Torvalds, Russell King - ARM Linux admin, Chris Metcalf, Christoph Lameter, Kirill Tkhai, Mike Galbraith, Thomas Gleixner, Mathieu Desnoyers The membarrier_state field is located within the mm_struct, which is not guaranteed to exist when used from runqueue-lock-free iteration on runqueues by the membarrier system call. Copy the membarrier_state from the mm_struct into the scheduler runqueue in prepare_task_switch() when the scheduler switches between mms. When registering membarrier for an mm, after setting the registration bit in the mm membarrier state, issue a synchronize_rcu() to ensure the scheduler observes the change. In order to take care of the case where a runqueue keeps executing the target mm without switching to another mm, iterate over each runqueue and issue an IPI to copy the membarrier_state from the mm_struct into each runqueue whose current task uses the mm whose state has just been modified. Move the mm membarrier_state field closer to pgd in mm_struct so the task-switch preparation hits a cache line already touched by the scheduler's switch_mm(). The membarrier_execve() hook now needs to clear the runqueue's membarrier state in addition to clearing the mm membarrier state, so move its implementation into the scheduler membarrier code so it can access the runqueue structure. As suggested by Linus, move all membarrier.c RCU read-side locks outside of the for-each-cpu loops. 
In fact, it turns out that membarrier_global_expedited() does not need the RCU read-side critical section because it does not need to use task_rcu_dereference() anymore, relying instead on the runqueue's membarrier_state. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: "Paul E. McKenney" <paulmck@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Christoph Lameter <cl@linux.com> Cc: Kirill Tkhai <tkhai@yandex.ru> Cc: Mike Galbraith <efault@gmx.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> --- include/linux/mm_types.h | 7 +- include/linux/sched/mm.h | 6 +- kernel/sched/core.c | 1 + kernel/sched/membarrier.c | 141 +++++++++++++++++++++++++++----------- kernel/sched/sched.h | 33 +++++++++ 5 files changed, 141 insertions(+), 47 deletions(-) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 6a7a1083b6fb..7020572eb605 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -382,6 +382,9 @@ struct mm_struct { unsigned long task_size; /* size of task vm space */ unsigned long highest_vm_end; /* highest vma end address */ pgd_t * pgd; +#ifdef CONFIG_MEMBARRIER + atomic_t membarrier_state; +#endif /** * @mm_users: The number of users including userspace. 
@@ -452,9 +455,7 @@ struct mm_struct { unsigned long flags; /* Must use atomic bitops to access */ struct core_state *core_state; /* coredumping support */ -#ifdef CONFIG_MEMBARRIER - atomic_t membarrier_state; -#endif + #ifdef CONFIG_AIO spinlock_t ioctx_lock; struct kioctx_table __rcu *ioctx_table; diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h index 8557ec664213..7070dbef8066 100644 --- a/include/linux/sched/mm.h +++ b/include/linux/sched/mm.h @@ -370,10 +370,8 @@ static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm) sync_core_before_usermode(); } -static inline void membarrier_execve(struct task_struct *t) -{ - atomic_set(&t->mm->membarrier_state, 0); -} +extern void membarrier_execve(struct task_struct *t); + #else #ifdef CONFIG_ARCH_HAS_MEMBARRIER_CALLBACKS static inline void membarrier_arch_switch_mm(struct mm_struct *prev, diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 010d578118d6..1cffc1aa403c 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3038,6 +3038,7 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev, perf_event_task_sched_out(prev, next); rseq_preempt(prev); fire_sched_out_preempt_notifiers(prev, next); + membarrier_prepare_task_switch(rq, prev, next); prepare_task(next); prepare_arch_switch(next); } diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c index 7e0a0d6535f3..5744c300d29e 100644 --- a/kernel/sched/membarrier.c +++ b/kernel/sched/membarrier.c @@ -30,6 +30,28 @@ static void ipi_mb(void *info) smp_mb(); /* IPIs should be serializing but paranoid. 
*/ } +static void ipi_sync_rq_state(void *info) +{ + struct mm_struct *mm = (struct mm_struct *) info; + + if (current->mm != mm) + return; + WRITE_ONCE(this_rq()->membarrier_state, + atomic_read(&mm->membarrier_state)); + /* + * Issue a memory barrier after setting MEMBARRIER_STATE_GLOBAL_EXPEDITED + * in the current runqueue to guarantee that no memory access following + * registration is reordered before registration. + */ + smp_mb(); +} + +void membarrier_execve(struct task_struct *t) +{ + atomic_set(&t->mm->membarrier_state, 0); + WRITE_ONCE(this_rq()->membarrier_state, 0); +} + static int membarrier_global_expedited(void) { int cpu; @@ -57,8 +79,6 @@ static int membarrier_global_expedited(void) cpus_read_lock(); for_each_online_cpu(cpu) { - struct task_struct *p; - /* * Skipping the current CPU is OK even through we can be * migrated at any point. The current CPU, at the point @@ -70,16 +90,13 @@ static int membarrier_global_expedited(void) if (cpu == raw_smp_processor_id()) continue; - rcu_read_lock(); - p = task_rcu_dereference(&cpu_rq(cpu)->curr); - if (p && p->mm && (atomic_read(&p->mm->membarrier_state) & - MEMBARRIER_STATE_GLOBAL_EXPEDITED)) { + if (READ_ONCE(cpu_rq(cpu)->membarrier_state) & + MEMBARRIER_STATE_GLOBAL_EXPEDITED) { if (!fallback) __cpumask_set_cpu(cpu, tmpmask); else smp_call_function_single(cpu, ipi_mb, NULL, 1); } - rcu_read_unlock(); } if (!fallback) { preempt_disable(); @@ -136,6 +153,7 @@ static int membarrier_private_expedited(int flags) } cpus_read_lock(); + rcu_read_lock(); for_each_online_cpu(cpu) { struct task_struct *p; @@ -149,7 +167,6 @@ static int membarrier_private_expedited(int flags) */ if (cpu == raw_smp_processor_id()) continue; - rcu_read_lock(); p = task_rcu_dereference(&cpu_rq(cpu)->curr); if (p && p->mm == current->mm) { if (!fallback) @@ -157,8 +174,8 @@ static int membarrier_private_expedited(int flags) else smp_call_function_single(cpu, ipi_mb, NULL, 1); } - rcu_read_unlock(); } + rcu_read_unlock(); if 
(!fallback) { preempt_disable(); smp_call_function_many(tmpmask, ipi_mb, NULL, 1); @@ -177,6 +194,71 @@ static int membarrier_private_expedited(int flags) return 0; } +static void sync_runqueues_membarrier_state(struct mm_struct *mm) +{ + int membarrier_state = atomic_read(&mm->membarrier_state); + bool fallback = false; + cpumask_var_t tmpmask; + int cpu; + + if (atomic_read(&mm->mm_users) == 1 || num_online_cpus() == 1) { + WRITE_ONCE(this_rq()->membarrier_state, membarrier_state); + + /* + * For single mm user, we can simply issue a memory barrier + * after setting MEMBARRIER_STATE_GLOBAL_EXPEDITED in the + * mm and in the current runqueue to guarantee that no memory + * access following registration is reordered before + * registration. + */ + smp_mb(); + return; + } + + /* + * For mm with multiple users, we need to ensure all future + * scheduler executions will observe @mm's new membarrier + * state. + */ + synchronize_rcu(); + + if (!zalloc_cpumask_var(&tmpmask, GFP_NOWAIT)) { + /* Fallback for OOM. */ + fallback = true; + } + + /* + * For each cpu runqueue, if the task's mm match @mm, ensure that all + * @mm's membarrier state set bits are also set in in the runqueue's + * membarrier state. This ensures that a runqueue scheduling + * between threads which are users of @mm has its membarrier state + * updated. 
+ */ + cpus_read_lock(); + rcu_read_lock(); + for_each_online_cpu(cpu) { + struct rq *rq = cpu_rq(cpu); + struct task_struct *p; + + p = task_rcu_dereference(&rq->curr); + if (p && p->mm == mm) { + if (!fallback) + __cpumask_set_cpu(cpu, tmpmask); + else + smp_call_function_single(cpu, ipi_sync_rq_state, + mm, 1); + } + } + rcu_read_unlock(); + if (!fallback) { + preempt_disable(); + smp_call_function_many(tmpmask, ipi_sync_rq_state, mm, 1); + preempt_enable(); + free_cpumask_var(tmpmask); + } + cpus_read_unlock(); +} + static int membarrier_register_global_expedited(void) { struct task_struct *p = current; @@ -186,23 +268,7 @@ static int membarrier_register_global_expedited(void) MEMBARRIER_STATE_GLOBAL_EXPEDITED_READY) return 0; atomic_or(MEMBARRIER_STATE_GLOBAL_EXPEDITED, &mm->membarrier_state); - if (atomic_read(&mm->mm_users) == 1) { - /* - * For single mm user, single threaded process, we can - * simply issue a memory barrier after setting - * MEMBARRIER_STATE_GLOBAL_EXPEDITED to guarantee that - * no memory access following registration is reordered - * before registration. - */ - smp_mb(); - } else { - /* - * For multi-mm user threads, we need to ensure all - * future scheduler executions will observe the new - * thread flag state for this mm. 
- */ - synchronize_rcu(); - } + sync_runqueues_membarrier_state(mm); atomic_or(MEMBARRIER_STATE_GLOBAL_EXPEDITED_READY, &mm->membarrier_state); @@ -213,12 +279,14 @@ static int membarrier_register_private_expedited(int flags) { struct task_struct *p = current; struct mm_struct *mm = p->mm; - int state = MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY; + int ready_state = MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY, + set_state = MEMBARRIER_STATE_PRIVATE_EXPEDITED; if (flags & MEMBARRIER_FLAG_SYNC_CORE) { if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE)) return -EINVAL; - state = MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE_READY; + ready_state = + MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE_READY; } /* @@ -226,20 +294,13 @@ static int membarrier_register_private_expedited(int flags) * groups, which use the same mm. (CLONE_VM but not * CLONE_THREAD). */ - if ((atomic_read(&mm->membarrier_state) & state) == state) + if ((atomic_read(&mm->membarrier_state) & ready_state) == ready_state) return 0; - atomic_or(MEMBARRIER_STATE_PRIVATE_EXPEDITED, &mm->membarrier_state); if (flags & MEMBARRIER_FLAG_SYNC_CORE) - atomic_or(MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE, - &mm->membarrier_state); - if (atomic_read(&mm->mm_users) != 1) { - /* - * Ensure all future scheduler executions will observe the - * new thread flag state for this process. 
- */ - synchronize_rcu(); - } - atomic_or(state, &mm->membarrier_state); + set_state |= MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE; + atomic_or(set_state, &mm->membarrier_state); + sync_runqueues_membarrier_state(mm); + atomic_or(ready_state, &mm->membarrier_state); return 0; } diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 802b1f3405f2..ac26baa855e8 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -907,6 +907,10 @@ struct rq { atomic_t nr_iowait; +#ifdef CONFIG_MEMBARRIER + int membarrier_state; +#endif + #ifdef CONFIG_SMP struct root_domain *rd; struct sched_domain __rcu *sd; @@ -2423,3 +2427,32 @@ static inline bool sched_energy_enabled(void) static inline bool sched_energy_enabled(void) { return false; } #endif /* CONFIG_ENERGY_MODEL && CONFIG_CPU_FREQ_GOV_SCHEDUTIL */ + +#ifdef CONFIG_MEMBARRIER +/* + * The scheduler provides memory barriers required by membarrier between: + * - prior user-space memory accesses and store to rq->membarrier_state, + * - store to rq->membarrier_state and following user-space memory accesses. + * In the same way it provides those guarantees around store to rq->curr. + */ +static inline void membarrier_prepare_task_switch(struct rq *rq, + struct task_struct *prev, + struct task_struct *next) +{ + int membarrier_state = 0; + struct mm_struct *next_mm = next->mm; + + if (prev->mm == next_mm) + return; + if (next_mm) + membarrier_state = atomic_read(&next_mm->membarrier_state); + if (READ_ONCE(rq->membarrier_state) != membarrier_state) + WRITE_ONCE(rq->membarrier_state, membarrier_state); +} +#else +static inline void membarrier_prepare_task_switch(struct rq *rq, + struct task_struct *prev, + struct task_struct *next) +{ +} +#endif -- 2.17.1 ^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [RFC PATCH 4/4] Fix: sched/membarrier: p->mm->membarrier_state racy load 2019-09-06 3:13 ` [RFC PATCH 4/4] Fix: sched/membarrier: p->mm->membarrier_state racy load Mathieu Desnoyers @ 2019-09-06 8:23 ` Peter Zijlstra 2019-09-08 13:49 ` [RFC PATCH 4/4] Fix: sched/membarrier: p->mm->membarrier_state racy load (v2) Mathieu Desnoyers 0 siblings, 1 reply; 21+ messages in thread From: Peter Zijlstra @ 2019-09-06 8:23 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Paul E. McKenney, Ingo Molnar, linux-kernel, Oleg Nesterov, Eric W. Biederman, Linus Torvalds, Russell King - ARM Linux admin, Chris Metcalf, Christoph Lameter, Kirill Tkhai, Mike Galbraith, Thomas Gleixner On Thu, Sep 05, 2019 at 11:13:00PM -0400, Mathieu Desnoyers wrote: > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > index 6a7a1083b6fb..7020572eb605 100644 > --- a/include/linux/mm_types.h > +++ b/include/linux/mm_types.h > @@ -382,6 +382,9 @@ struct mm_struct { > unsigned long task_size; /* size of task vm space */ > unsigned long highest_vm_end; /* highest vma end address */ > pgd_t * pgd; > +#ifdef CONFIG_MEMBARRIER Stick in a comment, on why here. To be close to data already used by switch_mm(). > + atomic_t membarrier_state; > +#endif > > /** > * @mm_users: The number of users including userspace. > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index 010d578118d6..1cffc1aa403c 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -3038,6 +3038,7 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev, > perf_event_task_sched_out(prev, next); > rseq_preempt(prev); > fire_sched_out_preempt_notifiers(prev, next); > + membarrier_prepare_task_switch(rq, prev, next); This had me confused for a while, because I initially thought we'd only do this for switch_mm(), but you're made it agressive and track kernel threads too. I think we can do that slightly different. See below... 
> prepare_task(next); > prepare_arch_switch(next); > } > diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c > index 7e0a0d6535f3..5744c300d29e 100644 > --- a/kernel/sched/membarrier.c > +++ b/kernel/sched/membarrier.c > @@ -30,6 +30,28 @@ static void ipi_mb(void *info) > +void membarrier_execve(struct task_struct *t) > +{ > + atomic_set(&t->mm->membarrier_state, 0); > + WRITE_ONCE(this_rq()->membarrier_state, 0); It is the callsite of this one that had me puzzled and confused. I think it works by accident more than anything else. You see, I thought the rules were that we'd change it near/before switch_mm(), and this is quite a way _after_. I think it might be best to place the call in exec_mmap(), right before activate_mm(). But that then had me wondering about the membarrier_prepare_task_switch() thing... > +/* > + * The scheduler provides memory barriers required by membarrier between: > + * - prior user-space memory accesses and store to rq->membarrier_state, > + * - store to rq->membarrier_state and following user-space memory accesses. > + * In the same way it provides those guarantees around store to rq->curr. 
> + */ > +static inline void membarrier_prepare_task_switch(struct rq *rq, > + struct task_struct *prev, > + struct task_struct *next) > +{ > + int membarrier_state = 0; > + struct mm_struct *next_mm = next->mm; > + > + if (prev->mm == next_mm) > + return; > + if (next_mm) > + membarrier_state = atomic_read(&next_mm->membarrier_state); > + if (READ_ONCE(rq->membarrier_state) != membarrier_state) > + WRITE_ONCE(rq->membarrier_state, membarrier_state); > +} So if you make the above something like: static inline void membarrier_switch_mm(struct rq *rq, struct mm_struct *prev_mm, struct mm_struct *next_mm) { int membarrier_state; if (prev_mm == next_mm) return; membarrier_state = atomic_read(&next_mm->membarrier_state); if (READ_ONCE(rq->membarrier_state) == membarrier_state) return; WRITE_ONCE(rq->membarrier_state, membarrier_state); } And put it right in front of switch_mm() in context_switch() then we'll deal with kernel on the other side, like so: > @@ -70,16 +90,13 @@ static int membarrier_global_expedited(void) > if (cpu == raw_smp_processor_id()) > continue; > > - rcu_read_lock(); > - p = task_rcu_dereference(&cpu_rq(cpu)->curr); > - if (p && p->mm && (atomic_read(&p->mm->membarrier_state) & > - MEMBARRIER_STATE_GLOBAL_EXPEDITED)) { > + if (READ_ONCE(cpu_rq(cpu)->membarrier_state) & > + MEMBARRIER_STATE_GLOBAL_EXPEDITED) { p = rcu_dereference(rq->curr); if ((READ_ONCE(cpu_rq(cpu)->membarrier_state) & MEMBARRIER_STATE_GLOBAL_EXPEDITED) && !(p->flags & PF_KTHREAD)) > if (!fallback) > __cpumask_set_cpu(cpu, tmpmask); > else > smp_call_function_single(cpu, ipi_mb, NULL, 1); > } > - rcu_read_unlock(); > } > if (!fallback) { > preempt_disable(); does that make sense? (also, I hate how long all these membarrier names are) ^ permalink raw reply [flat|nested] 21+ messages in thread
* [RFC PATCH 4/4] Fix: sched/membarrier: p->mm->membarrier_state racy load (v2) 2019-09-06 8:23 ` Peter Zijlstra @ 2019-09-08 13:49 ` Mathieu Desnoyers 2019-09-08 16:51 ` Linus Torvalds 2019-09-09 11:00 ` Oleg Nesterov 0 siblings, 2 replies; 21+ messages in thread From: Mathieu Desnoyers @ 2019-09-08 13:49 UTC (permalink / raw) To: Peter Zijlstra, Paul E. McKenney Cc: Ingo Molnar, linux-kernel, Oleg Nesterov, Eric W. Biederman, Linus Torvalds, Russell King - ARM Linux admin, Chris Metcalf, Christoph Lameter, Kirill Tkhai, Mike Galbraith, Thomas Gleixner, Mathieu Desnoyers The membarrier_state field is located within the mm_struct, which is not guaranteed to exist when used from runqueue-lock-free iteration on runqueues by the membarrier system call. Copy the membarrier_state from the mm_struct into the scheduler runqueue when the scheduler switches between mms. When registering membarrier for a mm, after setting the registration bit in the mm membarrier state, issue a synchronize_rcu() to ensure the scheduler observes the change. In order to take care of the case where a runqueue keeps executing the target mm without switching to another mm, iterate over each runqueue and issue an IPI to copy the membarrier_state from the mm_struct into each runqueue whose current task uses the mm whose state has just been modified. Move the mm membarrier_state field closer to pgd in mm_struct to use a cache line already touched by the scheduler's switch_mm. The membarrier_execve() (now membarrier_exec_mmap) hook now needs to clear the runqueue's membarrier state in addition to clearing the mm membarrier state, so move its implementation into the scheduler membarrier code so it can access the runqueue structure. Add a memory barrier in membarrier_exec_mmap() prior to clearing the membarrier state, ensuring memory accesses executed prior to exec are not reordered with the stores clearing the membarrier state. 
As suggested by Linus, move all membarrier.c RCU read-side locks outside of the for each cpu loops. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: "Paul E. McKenney" <paulmck@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Christoph Lameter <cl@linux.com> Cc: Kirill Tkhai <tkhai@yandex.ru> Cc: Mike Galbraith <efault@gmx.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> --- Changes since v1: - Take care of Peter Zijlstra's feedback, moving callsites closer to switch_mm() (scheduler) and activate_mm() (execve). - Add memory barrier in membarrier_exec_mmap() prior to clearing the membarrier state. --- fs/exec.c | 2 +- include/linux/mm_types.h | 14 +++- include/linux/sched/mm.h | 8 +- kernel/sched/core.c | 4 +- kernel/sched/membarrier.c | 170 ++++++++++++++++++++++++++++---------- kernel/sched/sched.h | 34 ++++++++ 6 files changed, 180 insertions(+), 52 deletions(-) diff --git a/fs/exec.c b/fs/exec.c index f7f6a140856a..ed39e2c81338 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1035,6 +1035,7 @@ static int exec_mmap(struct mm_struct *mm) active_mm = tsk->active_mm; tsk->mm = mm; tsk->active_mm = mm; + membarrier_exec_mmap(mm); activate_mm(active_mm, mm); tsk->mm->vmacache_seqnum = 0; vmacache_flush(tsk); @@ -1825,7 +1826,6 @@ static int __do_execve_file(int fd, struct filename *filename, /* execve succeeded */ current->fs->in_exec = 0; current->in_execve = 0; - membarrier_execve(current); rseq_execve(current); acct_update_integrals(current); task_numa_free(current, false); diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 6a7a1083b6fb..ec9bd3a6c827 100644 --- a/include/linux/mm_types.h +++ 
b/include/linux/mm_types.h @@ -383,6 +383,16 @@ struct mm_struct { unsigned long highest_vm_end; /* highest vma end address */ pgd_t * pgd; +#ifdef CONFIG_MEMBARRIER + /** + * @membarrier_state: Flags controlling membarrier behavior. + * + * This field is close to @pgd to hopefully fit in the same + * cache-line, which needs to be touched by switch_mm(). + */ + atomic_t membarrier_state; +#endif + /** * @mm_users: The number of users including userspace. * @@ -452,9 +462,7 @@ struct mm_struct { unsigned long flags; /* Must use atomic bitops to access */ struct core_state *core_state; /* coredumping support */ -#ifdef CONFIG_MEMBARRIER - atomic_t membarrier_state; -#endif + #ifdef CONFIG_AIO spinlock_t ioctx_lock; struct kioctx_table __rcu *ioctx_table; diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h index 8557ec664213..e6770012db18 100644 --- a/include/linux/sched/mm.h +++ b/include/linux/sched/mm.h @@ -370,10 +370,8 @@ static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm) sync_core_before_usermode(); } -static inline void membarrier_execve(struct task_struct *t) -{ - atomic_set(&t->mm->membarrier_state, 0); -} +extern void membarrier_exec_mmap(struct mm_struct *mm); + #else #ifdef CONFIG_ARCH_HAS_MEMBARRIER_CALLBACKS static inline void membarrier_arch_switch_mm(struct mm_struct *prev, @@ -382,7 +380,7 @@ static inline void membarrier_arch_switch_mm(struct mm_struct *prev, { } #endif -static inline void membarrier_execve(struct task_struct *t) +static inline void membarrier_exec_mmap(struct mm_struct *mm) { } static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 010d578118d6..7b118be5cc88 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3238,8 +3238,10 @@ context_switch(struct rq *rq, struct task_struct *prev, next->active_mm = oldmm; mmgrab(oldmm); enter_lazy_tlb(oldmm, next); - } else + } else { + 
membarrier_switch_mm(rq, oldmm, mm); switch_mm_irqs_off(oldmm, mm, next); + } if (!prev->mm) { prev->active_mm = NULL; diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c index 7e0a0d6535f3..805a621bbf88 100644 --- a/kernel/sched/membarrier.c +++ b/kernel/sched/membarrier.c @@ -30,6 +30,39 @@ static void ipi_mb(void *info) smp_mb(); /* IPIs should be serializing but paranoid. */ } +static void ipi_sync_rq_state(void *info) +{ + struct mm_struct *mm = (struct mm_struct *) info; + + if (current->mm != mm) + return; + WRITE_ONCE(this_rq()->membarrier_state, + atomic_read(&mm->membarrier_state)); + /* + * Issue a memory barrier after setting + * MEMBARRIER_STATE_GLOBAL_EXPEDITED in the current runqueue to + * guarantee that no memory access following registration is reordered + * before registration. + */ + smp_mb(); +} + +void membarrier_exec_mmap(struct mm_struct *mm) +{ + /* + * Issue a memory barrier before clearing membarrier_state to + * guarantee that no memory access prior to exec is reordered after + * clearing this state. + */ + smp_mb(); + atomic_set(&mm->membarrier_state, 0); + /* + * Keep the runqueue membarrier_state in sync with this mm + * membarrier_state. + */ + WRITE_ONCE(this_rq()->membarrier_state, 0); +} + static int membarrier_global_expedited(void) { int cpu; @@ -56,6 +89,7 @@ static int membarrier_global_expedited(void) } cpus_read_lock(); + rcu_read_lock(); for_each_online_cpu(cpu) { struct task_struct *p; @@ -70,17 +104,25 @@ static int membarrier_global_expedited(void) if (cpu == raw_smp_processor_id()) continue; - rcu_read_lock(); + if (!(READ_ONCE(cpu_rq(cpu)->membarrier_state) & + MEMBARRIER_STATE_GLOBAL_EXPEDITED)) + continue; + + /* + * Skip the CPU if it runs a kernel thread. The scheduler + * leaves the prior task mm in place as an optimization when + * scheduling a kthread. 
+ */ p = task_rcu_dereference(&cpu_rq(cpu)->curr); - if (p && p->mm && (atomic_read(&p->mm->membarrier_state) & - MEMBARRIER_STATE_GLOBAL_EXPEDITED)) { - if (!fallback) - __cpumask_set_cpu(cpu, tmpmask); - else - smp_call_function_single(cpu, ipi_mb, NULL, 1); - } - rcu_read_unlock(); + if (p->flags & PF_KTHREAD) + continue; + + if (!fallback) + __cpumask_set_cpu(cpu, tmpmask); + else + smp_call_function_single(cpu, ipi_mb, NULL, 1); } + rcu_read_unlock(); if (!fallback) { preempt_disable(); smp_call_function_many(tmpmask, ipi_mb, NULL, 1); @@ -136,6 +178,7 @@ static int membarrier_private_expedited(int flags) } cpus_read_lock(); + rcu_read_lock(); for_each_online_cpu(cpu) { struct task_struct *p; @@ -149,7 +192,6 @@ static int membarrier_private_expedited(int flags) */ if (cpu == raw_smp_processor_id()) continue; - rcu_read_lock(); p = task_rcu_dereference(&cpu_rq(cpu)->curr); if (p && p->mm == current->mm) { if (!fallback) @@ -157,8 +199,8 @@ static int membarrier_private_expedited(int flags) else smp_call_function_single(cpu, ipi_mb, NULL, 1); } - rcu_read_unlock(); } + rcu_read_unlock(); if (!fallback) { preempt_disable(); smp_call_function_many(tmpmask, ipi_mb, NULL, 1); @@ -177,6 +219,71 @@ static int membarrier_private_expedited(int flags) return 0; } +static void sync_runqueues_membarrier_state(struct mm_struct *mm) +{ + int membarrier_state = atomic_read(&mm->membarrier_state); + bool fallback = false; + cpumask_var_t tmpmask; + int cpu; + + if (atomic_read(&mm->mm_users) == 1 || num_online_cpus() == 1) { + WRITE_ONCE(this_rq()->membarrier_state, membarrier_state); + + /* + * For single mm user, we can simply issue a memory barrier + * after setting MEMBARRIER_STATE_GLOBAL_EXPEDITED in the + * mm and in the current runqueue to guarantee that no memory + * access following registration is reordered before + * registration. 
+ */ + smp_mb(); + return; + } + + /* + * For mm with multiple users, we need to ensure all future + * scheduler executions will observe @mm's new membarrier + * state. + */ + synchronize_rcu(); + + if (!zalloc_cpumask_var(&tmpmask, GFP_NOWAIT)) { + /* Fallback for OOM. */ + fallback = true; + } + + /* + * For each cpu runqueue, if the task's mm matches @mm, ensure that all + * @mm's membarrier state set bits are also set in the runqueue's + * membarrier state. This ensures that a runqueue scheduling + * between threads which are users of @mm has its membarrier state + * updated. + */ + cpus_read_lock(); + rcu_read_lock(); + for_each_online_cpu(cpu) { + struct rq *rq = cpu_rq(cpu); + struct task_struct *p; + + p = task_rcu_dereference(&rq->curr); + if (p && p->mm == mm) { + if (!fallback) + __cpumask_set_cpu(cpu, tmpmask); + else + smp_call_function_single(cpu, ipi_sync_rq_state, + mm, 1); + } + } + rcu_read_unlock(); + if (!fallback) { + preempt_disable(); + smp_call_function_many(tmpmask, ipi_sync_rq_state, mm, 1); + preempt_enable(); + free_cpumask_var(tmpmask); + } + cpus_read_unlock(); +} + static int membarrier_register_global_expedited(void) { struct task_struct *p = current; @@ -186,23 +293,7 @@ static int membarrier_register_global_expedited(void) MEMBARRIER_STATE_GLOBAL_EXPEDITED_READY) return 0; atomic_or(MEMBARRIER_STATE_GLOBAL_EXPEDITED, &mm->membarrier_state); - if (atomic_read(&mm->mm_users) == 1) { - /* - * For single mm user, single threaded process, we can - * simply issue a memory barrier after setting - * MEMBARRIER_STATE_GLOBAL_EXPEDITED to guarantee that - * no memory access following registration is reordered - * before registration. 
- */ - synchronize_rcu(); - } + sync_runqueues_membarrier_state(mm); atomic_or(MEMBARRIER_STATE_GLOBAL_EXPEDITED_READY, &mm->membarrier_state); @@ -213,12 +304,14 @@ static int membarrier_register_private_expedited(int flags) { struct task_struct *p = current; struct mm_struct *mm = p->mm; - int state = MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY; + int ready_state = MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY, + set_state = MEMBARRIER_STATE_PRIVATE_EXPEDITED; if (flags & MEMBARRIER_FLAG_SYNC_CORE) { if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE)) return -EINVAL; - state = MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE_READY; + ready_state = + MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE_READY; } /* @@ -226,20 +319,13 @@ static int membarrier_register_private_expedited(int flags) * groups, which use the same mm. (CLONE_VM but not * CLONE_THREAD). */ - if ((atomic_read(&mm->membarrier_state) & state) == state) + if ((atomic_read(&mm->membarrier_state) & ready_state) == ready_state) return 0; - atomic_or(MEMBARRIER_STATE_PRIVATE_EXPEDITED, &mm->membarrier_state); if (flags & MEMBARRIER_FLAG_SYNC_CORE) - atomic_or(MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE, - &mm->membarrier_state); - if (atomic_read(&mm->mm_users) != 1) { - /* - * Ensure all future scheduler executions will observe the - * new thread flag state for this process. 
- */ - synchronize_rcu(); - } - atomic_or(state, &mm->membarrier_state); + set_state |= MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE; + atomic_or(set_state, &mm->membarrier_state); + sync_runqueues_membarrier_state(mm); + atomic_or(ready_state, &mm->membarrier_state); return 0; } diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 802b1f3405f2..85cb30b1b6b7 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -907,6 +907,10 @@ struct rq { atomic_t nr_iowait; +#ifdef CONFIG_MEMBARRIER + int membarrier_state; +#endif + #ifdef CONFIG_SMP struct root_domain *rd; struct sched_domain __rcu *sd; @@ -2423,3 +2427,33 @@ static inline bool sched_energy_enabled(void) static inline bool sched_energy_enabled(void) { return false; } #endif /* CONFIG_ENERGY_MODEL && CONFIG_CPU_FREQ_GOV_SCHEDUTIL */ + +#ifdef CONFIG_MEMBARRIER +/* + * The scheduler provides memory barriers required by membarrier between: + * - prior user-space memory accesses and store to rq->membarrier_state, + * - store to rq->membarrier_state and following user-space memory accesses. + * In the same way it provides those guarantees around store to rq->curr. + */ +static inline void membarrier_switch_mm(struct rq *rq, + struct mm_struct *prev_mm, + struct mm_struct *next_mm) +{ + int membarrier_state; + + if (prev_mm == next_mm) + return; + + membarrier_state = atomic_read(&next_mm->membarrier_state); + if (READ_ONCE(rq->membarrier_state) == membarrier_state) + return; + + WRITE_ONCE(rq->membarrier_state, membarrier_state); +} +#else +static inline void membarrier_switch_mm(struct rq *rq, + struct mm_struct *prev_mm, + struct mm_struct *next_mm) +{ +} +#endif -- 2.17.1 ^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [RFC PATCH 4/4] Fix: sched/membarrier: p->mm->membarrier_state racy load (v2) 2019-09-08 13:49 ` [RFC PATCH 4/4] Fix: sched/membarrier: p->mm->membarrier_state racy load (v2) Mathieu Desnoyers @ 2019-09-08 16:51 ` Linus Torvalds 2019-09-10 9:48 ` Mathieu Desnoyers 2019-09-09 11:00 ` Oleg Nesterov 1 sibling, 1 reply; 21+ messages in thread From: Linus Torvalds @ 2019-09-08 16:51 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Peter Zijlstra, Paul E. McKenney, Ingo Molnar, Linux List Kernel Mailing, Oleg Nesterov, Eric W. Biederman, Russell King - ARM Linux admin, Chris Metcalf, Christoph Lameter, Kirill Tkhai, Mike Galbraith, Thomas Gleixner On Sun, Sep 8, 2019 at 6:49 AM Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > > +static void sync_runqueues_membarrier_state(struct mm_struct *mm) > +{ > + int membarrier_state = atomic_read(&mm->membarrier_state); > + bool fallback = false; > + cpumask_var_t tmpmask; > + > + if (!zalloc_cpumask_var(&tmpmask, GFP_NOWAIT)) { > + /* Fallback for OOM. */ > + fallback = true; > + } > + > + /* > + * For each cpu runqueue, if the task's mm match @mm, ensure that all > + * @mm's membarrier state set bits are also set in in the runqueue's > + * membarrier state. This ensures that a runqueue scheduling > + * between threads which are users of @mm has its membarrier state > + * updated. > + */ > + cpus_read_lock(); > + rcu_read_lock(); > + for_each_online_cpu(cpu) { > + struct rq *rq = cpu_rq(cpu); > + struct task_struct *p; > + > + p = task_rcu_dereference(&rq->curr); > + if (p && p->mm == mm) { > + if (!fallback) > + __cpumask_set_cpu(cpu, tmpmask); > + else > + smp_call_function_single(cpu, ipi_sync_rq_state, > + mm, 1); > + } > + } I really absolutely detest this whole "fallback" code. It will never get any real testing, and the code is just broken. Why don't you just use the mm_cpumask(mm) unconditionally? Yes, it will possibly call too many CPU's, but this fallback code is just completely disgusting. 
Do a simple and clean implementation. Then, if you can show real performance issues (which I doubt), maybe do something else, but even then you should never do something that will effectively create cases that have absolutely zero test-coverage. Linus ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC PATCH 4/4] Fix: sched/membarrier: p->mm->membarrier_state racy load (v2) 2019-09-08 16:51 ` Linus Torvalds @ 2019-09-10 9:48 ` Mathieu Desnoyers 2019-09-12 13:48 ` Will Deacon 0 siblings, 1 reply; 21+ messages in thread From: Mathieu Desnoyers @ 2019-09-10 9:48 UTC (permalink / raw) To: Linus Torvalds Cc: Peter Zijlstra, paulmck, Ingo Molnar, linux-kernel, Oleg Nesterov, Eric W. Biederman, Russell King, ARM Linux, Chris Metcalf, Chris Lameter, Kirill Tkhai, Mike Galbraith, Thomas Gleixner, Will Deacon ----- On Sep 8, 2019, at 5:51 PM, Linus Torvalds torvalds@linux-foundation.org wrote: > On Sun, Sep 8, 2019 at 6:49 AM Mathieu Desnoyers > <mathieu.desnoyers@efficios.com> wrote: >> >> +static void sync_runqueues_membarrier_state(struct mm_struct *mm) >> +{ >> + int membarrier_state = atomic_read(&mm->membarrier_state); >> + bool fallback = false; >> + cpumask_var_t tmpmask; >> + >> + if (!zalloc_cpumask_var(&tmpmask, GFP_NOWAIT)) { >> + /* Fallback for OOM. */ >> + fallback = true; >> + } >> + >> + /* >> + * For each cpu runqueue, if the task's mm match @mm, ensure that all >> + * @mm's membarrier state set bits are also set in in the runqueue's >> + * membarrier state. This ensures that a runqueue scheduling >> + * between threads which are users of @mm has its membarrier state >> + * updated. >> + */ >> + cpus_read_lock(); >> + rcu_read_lock(); >> + for_each_online_cpu(cpu) { >> + struct rq *rq = cpu_rq(cpu); >> + struct task_struct *p; >> + >> + p = task_rcu_dereference(&rq->curr); >> + if (p && p->mm == mm) { >> + if (!fallback) >> + __cpumask_set_cpu(cpu, tmpmask); >> + else >> + smp_call_function_single(cpu, ipi_sync_rq_state, >> + mm, 1); >> + } >> + } > > I really absolutely detest this whole "fallback" code. > > It will never get any real testing, and the code is just broken. > > Why don't you just use the mm_cpumask(mm) unconditionally? Yes, it > will possibly call too many CPU's, but this fallback code is just > completely disgusting. 
> > Do a simple and clean implementation. Then, if you can show real > performance issues (which I doubt), maybe do something else, but even > then you should never do something that will effectively create cases > that have absolutely zero test-coverage. A few points worth mentioning here: 1) As I stated earlier, using mm_cpumask in its current form is not an option for membarrier. For two reasons: A) The mask is not populated on all architectures (e.g. arm64 does not populate it), B) Even if it was populated on all architectures, we would need to carefully audit and document every spot where this mm_cpumask is set or cleared within each architecture code, and ensure we have the required memory barriers between user-space memory accesses and those stores, documenting those requirements into each architecture code in the process. This seems to be a lot of useless error-prone code churn. 2) I should actually use GFP_KERNEL rather than GFP_NOWAIT in this membarrier registration code. But it can still fail. However, the other membarrier code using the same fallback pattern (private and global expedited) documents that those membarrier commands do not block in the membarrier(2) man page, so GFP_NOWAIT is appropriate in those cases. 3) Testing-wise, I fully agree with your argument of lacking test coverage. One option I'm considering would be to add a selftest based on the fault-injection infrastructure, which would ensure that we have coverage of the failure case in the kernel selftests. Thoughts ? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC PATCH 4/4] Fix: sched/membarrier: p->mm->membarrier_state racy load (v2) 2019-09-10 9:48 ` Mathieu Desnoyers @ 2019-09-12 13:48 ` Will Deacon 2019-09-12 14:24 ` Linus Torvalds 0 siblings, 1 reply; 21+ messages in thread From: Will Deacon @ 2019-09-12 13:48 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Linus Torvalds, Peter Zijlstra, paulmck, Ingo Molnar, linux-kernel, Oleg Nesterov, Eric W. Biederman, Russell King, ARM Linux, Chris Metcalf, Chris Lameter, Kirill Tkhai, Mike Galbraith, Thomas Gleixner On Tue, Sep 10, 2019 at 05:48:02AM -0400, Mathieu Desnoyers wrote: > ----- On Sep 8, 2019, at 5:51 PM, Linus Torvalds torvalds@linux-foundation.org wrote: > > > On Sun, Sep 8, 2019 at 6:49 AM Mathieu Desnoyers > > <mathieu.desnoyers@efficios.com> wrote: > >> > >> +static void sync_runqueues_membarrier_state(struct mm_struct *mm) > >> +{ > >> + int membarrier_state = atomic_read(&mm->membarrier_state); > >> + bool fallback = false; > >> + cpumask_var_t tmpmask; > >> + > >> + if (!zalloc_cpumask_var(&tmpmask, GFP_NOWAIT)) { > >> + /* Fallback for OOM. */ > >> + fallback = true; > >> + } > >> + > >> + /* > >> + * For each cpu runqueue, if the task's mm match @mm, ensure that all > >> + * @mm's membarrier state set bits are also set in in the runqueue's > >> + * membarrier state. This ensures that a runqueue scheduling > >> + * between threads which are users of @mm has its membarrier state > >> + * updated. > >> + */ > >> + cpus_read_lock(); > >> + rcu_read_lock(); > >> + for_each_online_cpu(cpu) { > >> + struct rq *rq = cpu_rq(cpu); > >> + struct task_struct *p; > >> + > >> + p = task_rcu_dereference(&rq->curr); > >> + if (p && p->mm == mm) { > >> + if (!fallback) > >> + __cpumask_set_cpu(cpu, tmpmask); > >> + else > >> + smp_call_function_single(cpu, ipi_sync_rq_state, > >> + mm, 1); > >> + } > >> + } > > > > I really absolutely detest this whole "fallback" code. > > > > It will never get any real testing, and the code is just broken. 
> > > > Why don't you just use the mm_cpumask(mm) unconditionally? Yes, it > > will possibly call too many CPU's, but this fallback code is just > > completely disgusting. > > > > Do a simple and clean implementation. Then, if you can show real > > performance issues (which I doubt), maybe do something else, but even > > then you should never do something that will effectively create cases > > that have absolutely zero test-coverage. > > A few points worth mentioning here: > > 1) As I stated earlier, using mm_cpumask in its current form is not > an option for membarrier. For two reasons: > > A) The mask is not populated on all architectures (e.g. arm64 does > not populate it), > > B) Even if it was populated on all architectures, we would need to > carefully audit and document every spot where this mm_cpumask > is set or cleared within each architecture code, and ensure we > have the required memory barriers between user-space memory > accesses and those stores, documenting those requirements into > each architecture code in the process. This seems to be a lot of > useless error-prone code churn. > > 2) I should actually use GFP_KERNEL rather than GFP_NOWAIT in this > membarrier registration code. But it can still fail. However, the other > membarrier code using the same fallback pattern (private and global > expedited) documents that those membarrier commands do not block in > the membarrier(2) man page, so GFP_NOWAIT is appropriate in those cases. > > 3) Testing-wise, I fully agree with your argument of lacking test coverage. > One option I'm considering would be to add a selftest based on the > fault-injection infrastructure, which would ensure that we have coverage > of the failure case in the kernel selftests. > > Thoughts ? So the man page for sys_membarrier states that the expedited variants "never block", which feels pretty strong. Do any other system calls claim to provide this guarantee without a failure path if blocking is necessary? 
Given that the whole thing is preemptible, I'm also curious as to how exactly userspace relies on this non-blocking guarantee. I'd have thought that you could either just bite the bullet and block in the rare case that you need to when allocating the cpumask, or you could just return -EWOULDBLOCK on allocation failure, given that I suspect there are very few users of this system call right now and it's not yet supported by glibc afaik. Will ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC PATCH 4/4] Fix: sched/membarrier: p->mm->membarrier_state racy load (v2) 2019-09-12 13:48 ` Will Deacon @ 2019-09-12 14:24 ` Linus Torvalds 2019-09-12 15:47 ` Will Deacon 0 siblings, 1 reply; 21+ messages in thread From: Linus Torvalds @ 2019-09-12 14:24 UTC (permalink / raw) To: Will Deacon Cc: Mathieu Desnoyers, Peter Zijlstra, paulmck, Ingo Molnar, linux-kernel, Oleg Nesterov, Eric W. Biederman, Russell King, ARM Linux, Chris Metcalf, Chris Lameter, Kirill Tkhai, Mike Galbraith, Thomas Gleixner On Thu, Sep 12, 2019 at 2:48 PM Will Deacon <will@kernel.org> wrote: > > So the man page for sys_membarrier states that the expedited variants "never > block", which feels pretty strong. Do any other system calls claim to > provide this guarantee without a failure path if blocking is necessary? The traditional semantics for "we don't block" is that "we block on memory allocations and locking and user accesses etc, but we don't wait for our own IO". So there may be new IO started (and waited on) as part of allocating new memory etc, or in just paging in user memory, but the IO that the operation _itself_ explicitly starts is not waited on. No system call should ever be considered "atomic" in any sense. If you're doing RT, you should maybe expect "getpid()" to not ever block, but that's just about the exclusive list of truly nonblocking system calls, and even that can be preempted. Linus ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC PATCH 4/4] Fix: sched/membarrier: p->mm->membarrier_state racy load (v2) 2019-09-12 14:24 ` Linus Torvalds @ 2019-09-12 15:47 ` Will Deacon 2019-09-13 14:22 ` Mathieu Desnoyers 0 siblings, 1 reply; 21+ messages in thread From: Will Deacon @ 2019-09-12 15:47 UTC (permalink / raw) To: Linus Torvalds Cc: Mathieu Desnoyers, Peter Zijlstra, paulmck, Ingo Molnar, linux-kernel, Oleg Nesterov, Eric W. Biederman, Russell King, ARM Linux, Chris Metcalf, Chris Lameter, Kirill Tkhai, Mike Galbraith, Thomas Gleixner On Thu, Sep 12, 2019 at 03:24:35PM +0100, Linus Torvalds wrote: > On Thu, Sep 12, 2019 at 2:48 PM Will Deacon <will@kernel.org> wrote: > > > > So the man page for sys_membarrier states that the expedited variants "never > > block", which feels pretty strong. Do any other system calls claim to > > provide this guarantee without a failure path if blocking is necessary? > > The traditional semantics for "we don't block" is that "we block on > memory allocations and locking and user accesses etc, but we don't > wait for our own IO". > > So there may be new IO started (and waited on) as part of allocating > new memory etc, or in just paging in user memory, but the IO that the > operation _itself_ explicitly starts is not waited on. Thanks, that makes sense, and I'd be inclined to suggest an update to the sys_membarrier manpage to make this more clear since the "never blocks" phrasing doesn't seem to be used like this for other system calls. > No system call should ever be considered "atomic" in any sense. If > you're doing RT, you should maybe expect "getpid()" to not ever block, > but that's just about the exclusive list of truly nonblocking system > calls, and even that can be preempted. In which case, why can't we just use GFP_KERNEL for the cpumask allocation instead of GFP_NOWAIT and then remove the failure path altogether? Mathieu? Will ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC PATCH 4/4] Fix: sched/membarrier: p->mm->membarrier_state racy load (v2) 2019-09-12 15:47 ` Will Deacon @ 2019-09-13 14:22 ` Mathieu Desnoyers 2019-09-19 16:26 ` Will Deacon 0 siblings, 1 reply; 21+ messages in thread From: Mathieu Desnoyers @ 2019-09-13 14:22 UTC (permalink / raw) To: Will Deacon, Michael Kerrisk, Linus Torvalds Cc: Peter Zijlstra, paulmck, Ingo Molnar, linux-kernel, Oleg Nesterov, Eric W. Biederman, Russell King, ARM Linux, Chris Metcalf, Chris Lameter, Kirill Tkhai, Mike Galbraith, Thomas Gleixner ----- On Sep 12, 2019, at 11:47 AM, Will Deacon will@kernel.org wrote: > On Thu, Sep 12, 2019 at 03:24:35PM +0100, Linus Torvalds wrote: >> On Thu, Sep 12, 2019 at 2:48 PM Will Deacon <will@kernel.org> wrote: >> > >> > So the man page for sys_membarrier states that the expedited variants "never >> > block", which feels pretty strong. Do any other system calls claim to >> > provide this guarantee without a failure path if blocking is necessary? >> >> The traditional semantics for "we don't block" is that "we block on >> memory allocations and locking and user accesses etc, but we don't >> wait for our own IO". >> >> So there may be new IO started (and waited on) as part of allocating >> new memory etc, or in just paging in user memory, but the IO that the >> operation _itself_ explicitly starts is not waited on. > > Thanks, that makes sense, and I'd be inclined to suggest an update to the > sys_membarrier manpage to make this more clear since the "never blocks" > phrasing doesn't seem to be used like this for other system calls. The current wording from membarrier(2) is: The "expedited" commands complete faster than the non-expedited ones; they never block, but have the downside of causing extra overhead. We could simply remove the "; they never block" part then ? > >> No system call should ever be considered "atomic" in any sense. 
If >> you're doing RT, you should maybe expect "getpid()" to not ever block, >> but that's just about the exclusive list of truly nonblocking system >> calls, and even that can be preempted. > > In which case, why can't we just use GFP_KERNEL for the cpumask allocation > instead of GFP_NOWAIT and then remove the failure path altogether? Mathieu? Looking at: #define GFP_KERNEL (__GFP_RECLAIM | __GFP_IO | __GFP_FS) I notice that it does not include __GFP_NOFAIL. What prevents GFP_KERNEL from failing, and where is this guarantee documented ? Regarding __GFP_NOFAIL, its use seems to be discouraged in linux/gfp.h: * %__GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller * cannot handle allocation failures. The allocation could block * indefinitely but will never return with failure. Testing for * failure is pointless. * New users should be evaluated carefully (and the flag should be * used only when there is no reasonable failure policy) but it is * definitely preferable to use the flag rather than opencode endless * loop around allocator. * Using this flag for costly allocations is _highly_ discouraged. So I am reluctant to use it. But if we can agree on the right combination of flags that guarantees there is no failure, I would be perfectly fine with using them to remove the fallback code. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC PATCH 4/4] Fix: sched/membarrier: p->mm->membarrier_state racy load (v2) 2019-09-13 14:22 ` Mathieu Desnoyers @ 2019-09-19 16:26 ` Will Deacon 2019-09-19 17:33 ` Mathieu Desnoyers 0 siblings, 1 reply; 21+ messages in thread From: Will Deacon @ 2019-09-19 16:26 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Michael Kerrisk, Linus Torvalds, Peter Zijlstra, paulmck, Ingo Molnar, linux-kernel, Oleg Nesterov, Eric W. Biederman, Russell King, ARM Linux, Chris Metcalf, Chris Lameter, Kirill Tkhai, Mike Galbraith, Thomas Gleixner Hi Mathieu, Sorry for the delay in responding. On Fri, Sep 13, 2019 at 10:22:28AM -0400, Mathieu Desnoyers wrote: > ----- On Sep 12, 2019, at 11:47 AM, Will Deacon will@kernel.org wrote: > > > On Thu, Sep 12, 2019 at 03:24:35PM +0100, Linus Torvalds wrote: > >> On Thu, Sep 12, 2019 at 2:48 PM Will Deacon <will@kernel.org> wrote: > >> > > >> > So the man page for sys_membarrier states that the expedited variants "never > >> > block", which feels pretty strong. Do any other system calls claim to > >> > provide this guarantee without a failure path if blocking is necessary? > >> > >> The traditional semantics for "we don't block" is that "we block on > >> memory allocations and locking and user accesses etc, but we don't > >> wait for our own IO". > >> > >> So there may be new IO started (and waited on) as part of allocating > >> new memory etc, or in just paging in user memory, but the IO that the > >> operation _itself_ explicitly starts is not waited on. > > > > Thanks, that makes sense, and I'd be inclined to suggest an update to the > > sys_membarrier manpage to make this more clear since the "never blocks" > > phrasing doesn't seem to be used like this for other system calls. > > The current wording from membarrier(2) is: > > The "expedited" commands complete faster than the non-expedited > ones; they never block, but have the downside of causing extra > overhead. > > We could simply remove the "; they never block" part then ? 
I think so, yes. That or, "; they do not voluntarily block" or something
like that. Maybe look at other man pages for inspiration ;)

> >> No system call should ever be considered "atomic" in any sense. If
> >> you're doing RT, you should maybe expect "getpid()" to not ever block,
> >> but that's just about the exclusive list of truly nonblocking system
> >> calls, and even that can be preempted.
> >
> > In which case, why can't we just use GFP_KERNEL for the cpumask allocation
> > instead of GFP_NOWAIT and then remove the failure path altogether? Mathieu?
>
> Looking at:
>
> #define GFP_KERNEL (__GFP_RECLAIM | __GFP_IO | __GFP_FS)
>
> I notice that it does not include __GFP_NOFAIL. What prevents GFP_KERNEL
> from failing, and where is this guarantee documented?

There was an lwn article a little while ago about this:

https://lwn.net/Articles/723317/

I'm not sure what (if anything) has changed in this regard since then,
however.

> Regarding __GFP_NOFAIL, its use seems to be discouraged in linux/gfp.h:
>
> * %__GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
> * cannot handle allocation failures. The allocation could block
> * indefinitely but will never return with failure. Testing for
> * failure is pointless.
> * New users should be evaluated carefully (and the flag should be
> * used only when there is no reasonable failure policy) but it is
> * definitely preferable to use the flag rather than opencode endless
> * loop around allocator.
> * Using this flag for costly allocations is _highly_ discouraged.
>
> So I am reluctant to use it.
>
> But if we can agree on the right combination of flags that guarantees there
> is no failure, I would be perfectly fine with using them to remove the
> fallback code.

I reckon you'll be fine using GFP_KERNEL and returning -ENOMEM on allocation
failure. This shouldn't happen in practice and it removes the fallback
path.

Will
* Re: [RFC PATCH 4/4] Fix: sched/membarrier: p->mm->membarrier_state racy load (v2)
  2019-09-19 16:26 ` Will Deacon
@ 2019-09-19 17:33 ` Mathieu Desnoyers
  0 siblings, 0 replies; 21+ messages in thread
From: Mathieu Desnoyers @ 2019-09-19 17:33 UTC (permalink / raw)
To: Will Deacon
Cc: Michael Kerrisk, Linus Torvalds, Peter Zijlstra, paulmck,
    Ingo Molnar, linux-kernel, Oleg Nesterov, Eric W. Biederman,
    Russell King, ARM Linux, Chris Metcalf, Chris Lameter,
    Kirill Tkhai, Mike Galbraith, Thomas Gleixner

----- On Sep 19, 2019, at 12:26 PM, Will Deacon will@kernel.org wrote:

[...]

>> The current wording from membarrier(2) is:
>>
>>        The "expedited" commands complete faster than the non-expedited
>>        ones; they never block, but have the downside of causing extra
>>        overhead.
>>
>> We could simply remove the "; they never block" part then?
>
> I think so, yes. That or, "; they do not voluntarily block" or something
> like that. Maybe look at other man pages for inspiration ;)

OK, let's tackle the man-pages part after the fix reaches mainline though.

[...]

> I reckon you'll be fine using GFP_KERNEL and returning -ENOMEM on allocation
> failure. This shouldn't happen in practice and it removes the fallback
> path.

Works for me! I'll prepare an updated patchset.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
* Re: [RFC PATCH 4/4] Fix: sched/membarrier: p->mm->membarrier_state racy load (v2)
  2019-09-08 13:49 ` [RFC PATCH 4/4] Fix: sched/membarrier: p->mm->membarrier_state racy load (v2) Mathieu Desnoyers
  2019-09-08 16:51 ` Linus Torvalds
@ 2019-09-09 11:00 ` Oleg Nesterov
  2019-09-13 15:20   ` Mathieu Desnoyers
  1 sibling, 1 reply; 21+ messages in thread
From: Oleg Nesterov @ 2019-09-09 11:00 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Peter Zijlstra, Paul E. McKenney, Ingo Molnar, linux-kernel,
    Eric W. Biederman, Linus Torvalds, Russell King - ARM Linux admin,
    Chris Metcalf, Christoph Lameter, Kirill Tkhai, Mike Galbraith,
    Thomas Gleixner

On 09/08, Mathieu Desnoyers wrote:
>
> +static void sync_runqueues_membarrier_state(struct mm_struct *mm)
> +{
> +	int membarrier_state = atomic_read(&mm->membarrier_state);
> +	bool fallback = false;
> +	cpumask_var_t tmpmask;
> +	int cpu;
> +
> +	if (atomic_read(&mm->mm_users) == 1 || num_online_cpus() == 1) {
> +		WRITE_ONCE(this_rq()->membarrier_state, membarrier_state);

This doesn't look safe, this caller can migrate to another CPU after
it calculates the per-cpu ptr.

I think you need to disable preemption or simply use this_cpu_write().

Oleg.
* Re: [RFC PATCH 4/4] Fix: sched/membarrier: p->mm->membarrier_state racy load (v2)
  2019-09-09 11:00 ` Oleg Nesterov
@ 2019-09-13 15:20 ` Mathieu Desnoyers
  2019-09-13 16:04   ` Oleg Nesterov
  0 siblings, 1 reply; 21+ messages in thread
From: Mathieu Desnoyers @ 2019-09-13 15:20 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Peter Zijlstra, paulmck, Ingo Molnar, linux-kernel,
    Eric W. Biederman, Linus Torvalds, Russell King, ARM Linux,
    Chris Metcalf, Chris Lameter, Kirill Tkhai, Mike Galbraith,
    Thomas Gleixner

----- On Sep 9, 2019, at 7:00 AM, Oleg Nesterov oleg@redhat.com wrote:

> On 09/08, Mathieu Desnoyers wrote:
>>
>> +static void sync_runqueues_membarrier_state(struct mm_struct *mm)
>> +{
>> +	int membarrier_state = atomic_read(&mm->membarrier_state);
>> +	bool fallback = false;
>> +	cpumask_var_t tmpmask;
>> +	int cpu;
>> +
>> +	if (atomic_read(&mm->mm_users) == 1 || num_online_cpus() == 1) {
>> +		WRITE_ONCE(this_rq()->membarrier_state, membarrier_state);
>
> This doesn't look safe, this caller can migrate to another CPU after
> it calculates the per-cpu ptr.
>
> I think you need to disable preemption or simply use this_cpu_write().

Good point! I'll use this_cpu_write() there and within
membarrier_exec_mmap(), which seems to be affected by the same problem.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
* Re: [RFC PATCH 4/4] Fix: sched/membarrier: p->mm->membarrier_state racy load (v2)
  2019-09-13 15:20 ` Mathieu Desnoyers
@ 2019-09-13 16:04 ` Oleg Nesterov
  2019-09-13 17:07   ` Mathieu Desnoyers
  0 siblings, 1 reply; 21+ messages in thread
From: Oleg Nesterov @ 2019-09-13 16:04 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Peter Zijlstra, paulmck, Ingo Molnar, linux-kernel,
    Eric W. Biederman, Linus Torvalds, Russell King, ARM Linux,
    Chris Metcalf, Chris Lameter, Kirill Tkhai, Mike Galbraith,
    Thomas Gleixner

On 09/13, Mathieu Desnoyers wrote:
>
> membarrier_exec_mmap(), which seems to be affected by the same problem.

IIRC, in the last version it is called by exec_mmap() under task_lock(),
so it should be fine.

Oleg.
* Re: [RFC PATCH 4/4] Fix: sched/membarrier: p->mm->membarrier_state racy load (v2)
  2019-09-13 16:04 ` Oleg Nesterov
@ 2019-09-13 17:07 ` Mathieu Desnoyers
  0 siblings, 0 replies; 21+ messages in thread
From: Mathieu Desnoyers @ 2019-09-13 17:07 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Peter Zijlstra, paulmck, Ingo Molnar, linux-kernel,
    Eric W. Biederman, Linus Torvalds, Russell King, ARM Linux,
    Chris Metcalf, Chris Lameter, Kirill Tkhai, Mike Galbraith,
    Thomas Gleixner

----- On Sep 13, 2019, at 12:04 PM, Oleg Nesterov oleg@redhat.com wrote:

> On 09/13, Mathieu Desnoyers wrote:
>>
>> membarrier_exec_mmap(), which seems to be affected by the same problem.
>
> IIRC, in the last version it is called by exec_mmap() under task_lock(),
> so it should be fine.

Fair point. Although it seems cleaner to use this_cpu_write() in all three
sites updating this variable rather than a mix of this_cpu_write() and
WRITE_ONCE(), unless anyone objects.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
end of thread, other threads:[~2019-09-19 17:33 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-09-06  3:12 [RFC PATCH 0/4] Membarrier fixes/cleanups Mathieu Desnoyers
2019-09-06  3:12 ` [RFC PATCH 1/4] Fix: sched/membarrier: private expedited registration check Mathieu Desnoyers
2019-09-06  3:12 ` [RFC PATCH 2/4] Cleanup: sched/membarrier: remove redundant check Mathieu Desnoyers
2019-09-06  3:12 ` [RFC PATCH 3/4] Cleanup: sched/membarrier: only sync_core before usermode for same mm Mathieu Desnoyers
2019-09-06  7:41 ` Peter Zijlstra
2019-09-06 13:40 ` Mathieu Desnoyers
2019-09-06  3:13 ` [RFC PATCH 4/4] Fix: sched/membarrier: p->mm->membarrier_state racy load Mathieu Desnoyers
2019-09-06  8:23 ` Peter Zijlstra
2019-09-08 13:49 ` [RFC PATCH 4/4] Fix: sched/membarrier: p->mm->membarrier_state racy load (v2) Mathieu Desnoyers
2019-09-08 16:51 ` Linus Torvalds
2019-09-10  9:48 ` Mathieu Desnoyers
2019-09-12 13:48 ` Will Deacon
2019-09-12 14:24 ` Linus Torvalds
2019-09-12 15:47 ` Will Deacon
2019-09-13 14:22 ` Mathieu Desnoyers
2019-09-19 16:26 ` Will Deacon
2019-09-19 17:33 ` Mathieu Desnoyers
2019-09-09 11:00 ` Oleg Nesterov
2019-09-13 15:20 ` Mathieu Desnoyers
2019-09-13 16:04 ` Oleg Nesterov
2019-09-13 17:07 ` Mathieu Desnoyers