* [PATCH -tip 0/3] locking/percpu-rwsem: writer-side optimizations @ 2016-11-18 18:54 Davidlohr Bueso 2016-11-18 18:54 ` [PATCH 1/3] locking/percpu-rwsem: Move text file into Documentation/locking/ Davidlohr Bueso ` (2 more replies) 0 siblings, 3 replies; 19+ messages in thread From: Davidlohr Bueso @ 2016-11-18 18:54 UTC (permalink / raw) To: mingo, peterz, oleg; +Cc: john.stultz, dimitrysh, linux-kernel, dave Hi, Here are two optimizations intended for the rare writer (updater) cases. Wrt numbers, I'm still trying to figure out an Android boot environment (that exercises the cgroup_proc_write and ipc paths) which might care about these changes. In the meantime, I'd like to know if anyone has any objections to these optimizations. They pass kernel builds with lockdep, as well as an overnight locktorture run. Thanks. Davidlohr Bueso (3): locking/percpu-rwsem: Move text file into Documentation/locking/ locking/percpu-rwsem: Replace bulky wait-queues with swait locking/percpu-rwsem: Avoid unnecessary writer wakeups Documentation/locking/percpu-rw-semaphore.txt | 27 ++++++++++ Documentation/percpu-rw-semaphore.txt | 27 ---------- include/linux/percpu-rwsem.h | 6 +-- kernel/locking/percpu-rwsem.c | 78 +++++++++++++++------------ 4 files changed, 73 insertions(+), 65 deletions(-) create mode 100644 Documentation/locking/percpu-rw-semaphore.txt delete mode 100644 Documentation/percpu-rw-semaphore.txt -- 2.6.6 ^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH 1/3] locking/percpu-rwsem: Move text file into Documentation/locking/ 2016-11-18 18:54 [PATCH -tip 0/3] locking/percpu-rwsem: writer-side optimizations Davidlohr Bueso @ 2016-11-18 18:54 ` Davidlohr Bueso 2016-11-18 18:54 ` [PATCH 2/3] locking/percpu-rwsem: Replace bulky wait-queues with swait Davidlohr Bueso 2016-11-18 18:54 ` [PATCH 3/3] locking/percpu-rwsem: Avoid unnecessary writer wakeups Davidlohr Bueso 2 siblings, 0 replies; 19+ messages in thread From: Davidlohr Bueso @ 2016-11-18 18:54 UTC (permalink / raw) To: mingo, peterz, oleg Cc: john.stultz, dimitrysh, linux-kernel, dave, Davidlohr Bueso Although this document is rather sparse, and the actual code is orders of magnitude more detailed and informative, we might as well keep it: it's already there and can give a user a quick idea of why/when to use percpu-rwsems vs regular rwsems. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> --- Documentation/locking/percpu-rw-semaphore.txt | 27 +++++++++++++++++++++++++++ Documentation/percpu-rw-semaphore.txt | 27 --------------------------- 2 files changed, 27 insertions(+), 27 deletions(-) create mode 100644 Documentation/locking/percpu-rw-semaphore.txt delete mode 100644 Documentation/percpu-rw-semaphore.txt diff --git a/Documentation/locking/percpu-rw-semaphore.txt b/Documentation/locking/percpu-rw-semaphore.txt new file mode 100644 index 000000000000..7d3c82431909 --- /dev/null +++ b/Documentation/locking/percpu-rw-semaphore.txt @@ -0,0 +1,27 @@ +Percpu rw semaphores +-------------------- + +Percpu rw semaphores is a new read-write semaphore design that is +optimized for locking for reading. + +The problem with traditional read-write semaphores is that when multiple +cores take the lock for reading, the cache line containing the semaphore +is bouncing between L1 caches of the cores, causing performance +degradation. + +Locking for reading is very fast, it uses RCU and it avoids any atomic +instruction in the lock and unlock path. 
On the other hand, locking for +writing is very expensive, it calls synchronize_rcu() that can take +hundreds of milliseconds. + +The lock is declared with "struct percpu_rw_semaphore" type. +The lock is initialized percpu_init_rwsem, it returns 0 on success and +-ENOMEM on allocation failure. +The lock must be freed with percpu_free_rwsem to avoid memory leak. + +The lock is locked for read with percpu_down_read, percpu_up_read and +for write with percpu_down_write, percpu_up_write. + +The idea of using RCU for optimized rw-lock was introduced by +Eric Dumazet <eric.dumazet@gmail.com>. +The code was written by Mikulas Patocka <mpatocka@redhat.com> diff --git a/Documentation/percpu-rw-semaphore.txt b/Documentation/percpu-rw-semaphore.txt deleted file mode 100644 index 7d3c82431909..000000000000 --- a/Documentation/percpu-rw-semaphore.txt +++ /dev/null @@ -1,27 +0,0 @@ -Percpu rw semaphores --------------------- - -Percpu rw semaphores is a new read-write semaphore design that is -optimized for locking for reading. - -The problem with traditional read-write semaphores is that when multiple -cores take the lock for reading, the cache line containing the semaphore -is bouncing between L1 caches of the cores, causing performance -degradation. - -Locking for reading is very fast, it uses RCU and it avoids any atomic -instruction in the lock and unlock path. On the other hand, locking for -writing is very expensive, it calls synchronize_rcu() that can take -hundreds of milliseconds. - -The lock is declared with "struct percpu_rw_semaphore" type. -The lock is initialized percpu_init_rwsem, it returns 0 on success and --ENOMEM on allocation failure. -The lock must be freed with percpu_free_rwsem to avoid memory leak. - -The lock is locked for read with percpu_down_read, percpu_up_read and -for write with percpu_down_write, percpu_up_write. - -The idea of using RCU for optimized rw-lock was introduced by -Eric Dumazet <eric.dumazet@gmail.com>. 
-The code was written by Mikulas Patocka <mpatocka@redhat.com> -- 2.6.6 ^ permalink raw reply related [flat|nested] 19+ messages in thread
* [PATCH 2/3] locking/percpu-rwsem: Replace bulky wait-queues with swait 2016-11-18 18:54 [PATCH -tip 0/3] locking/percpu-rwsem: writer-side optimizations Davidlohr Bueso 2016-11-18 18:54 ` [PATCH 1/3] locking/percpu-rwsem: Move text file into Documentation/locking/ Davidlohr Bueso @ 2016-11-18 18:54 ` Davidlohr Bueso 2016-11-21 12:55 ` Oleg Nesterov 2016-12-03 2:18 ` [PATCH v2 2/3] locking/percpu-rwsem: Rework writer block/wake to not use wait-queues Davidlohr Bueso 2016-11-18 18:54 ` [PATCH 3/3] locking/percpu-rwsem: Avoid unnecessary writer wakeups Davidlohr Bueso 2 siblings, 2 replies; 19+ messages in thread From: Davidlohr Bueso @ 2016-11-18 18:54 UTC (permalink / raw) To: mingo, peterz, oleg Cc: john.stultz, dimitrysh, linux-kernel, dave, Davidlohr Bueso The percpu-rwsem doesn't need any of the fancy/bulky wait-queue features, such as custom wake callbacks or fine-grained wakeups. Users that can convert to simple wait-queues (swait) are encouraged to do so for the various rt and (indirect) performance benefits. 
Signed-off-by: Davidlohr Bueso <dbueso@suse.de> --- include/linux/percpu-rwsem.h | 6 +++--- kernel/locking/percpu-rwsem.c | 8 ++++---- 2 files changed, 7 insertions(+), 7 deletions(-) diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h index 5b2e6159b744..82d54a4b9988 100644 --- a/include/linux/percpu-rwsem.h +++ b/include/linux/percpu-rwsem.h @@ -4,7 +4,7 @@ #include <linux/atomic.h> #include <linux/rwsem.h> #include <linux/percpu.h> -#include <linux/wait.h> +#include <linux/swait.h> #include <linux/rcu_sync.h> #include <linux/lockdep.h> @@ -12,7 +12,7 @@ struct percpu_rw_semaphore { struct rcu_sync rss; unsigned int __percpu *read_count; struct rw_semaphore rw_sem; - wait_queue_head_t writer; + struct swait_queue_head writer; int readers_block; }; @@ -22,7 +22,7 @@ static struct percpu_rw_semaphore name = { \ .rss = __RCU_SYNC_INITIALIZER(name.rss, RCU_SCHED_SYNC), \ .read_count = &__percpu_rwsem_rc_##name, \ .rw_sem = __RWSEM_INITIALIZER(name.rw_sem), \ - .writer = __WAIT_QUEUE_HEAD_INITIALIZER(name.writer), \ + .writer = __SWAIT_QUEUE_HEAD_INITIALIZER(name.writer), \ } extern int __percpu_down_read(struct percpu_rw_semaphore *, int); diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c index ce182599cf2e..cb71201855f2 100644 --- a/kernel/locking/percpu-rwsem.c +++ b/kernel/locking/percpu-rwsem.c @@ -1,7 +1,7 @@ #include <linux/atomic.h> #include <linux/rwsem.h> #include <linux/percpu.h> -#include <linux/wait.h> +#include <linux/swait.h> #include <linux/lockdep.h> #include <linux/percpu-rwsem.h> #include <linux/rcupdate.h> @@ -18,7 +18,7 @@ int __percpu_init_rwsem(struct percpu_rw_semaphore *sem, /* ->rw_sem represents the whole percpu_rw_semaphore for lockdep */ rcu_sync_init(&sem->rss, RCU_SCHED_SYNC); __init_rwsem(&sem->rw_sem, name, rwsem_key); - init_waitqueue_head(&sem->writer); + init_swait_queue_head(&sem->writer); sem->readers_block = 0; return 0; } @@ -103,7 +103,7 @@ void __percpu_up_read(struct 
percpu_rw_semaphore *sem) __this_cpu_dec(*sem->read_count); /* Prod writer to recheck readers_active */ - wake_up(&sem->writer); + swake_up(&sem->writer); } EXPORT_SYMBOL_GPL(__percpu_up_read); @@ -160,7 +160,7 @@ void percpu_down_write(struct percpu_rw_semaphore *sem) */ /* Wait for all now active readers to complete. */ - wait_event(sem->writer, readers_active_check(sem)); + swait_event(sem->writer, readers_active_check(sem)); } EXPORT_SYMBOL_GPL(percpu_down_write); -- 2.6.6 ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH 2/3] locking/percpu-rwsem: Replace bulky wait-queues with swait 2016-11-18 18:54 ` [PATCH 2/3] locking/percpu-rwsem: Replace bulky wait-queues with swait Davidlohr Bueso @ 2016-11-21 12:55 ` Oleg Nesterov 2016-11-21 17:26 ` Davidlohr Bueso 2016-12-03 2:18 ` [PATCH v2 2/3] locking/percpu-rwsem: Rework writer block/wake to not use wait-queues Davidlohr Bueso 1 sibling, 1 reply; 19+ messages in thread From: Oleg Nesterov @ 2016-11-21 12:55 UTC (permalink / raw) To: Davidlohr Bueso Cc: mingo, peterz, john.stultz, dimitrysh, linux-kernel, Davidlohr Bueso On 11/18, Davidlohr Bueso wrote: > > @@ -12,7 +12,7 @@ struct percpu_rw_semaphore { > struct rcu_sync rss; > unsigned int __percpu *read_count; > struct rw_semaphore rw_sem; > - wait_queue_head_t writer; > + struct swait_queue_head writer; I won't argue, but even swait_queue_head is overkill in this case. We can just add "struct task_struct *writer" into percpu_rw_semaphore, __percpu_up_read: rcu_read_lock(); writer = task_rcu_dereference(&sem->writer); if (writer) wake_up_process(writer); rcu_read_unlock(); percpu_down_write() can set sem->writer == current and do the simple while-not-condition-schedule() loop. But this probably needs a couple of new helpers, and probably they can have more users. Oleg. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 2/3] locking/percpu-rwsem: Replace bulky wait-queues with swait 2016-11-21 12:55 ` Oleg Nesterov @ 2016-11-21 17:26 ` Davidlohr Bueso 0 siblings, 0 replies; 19+ messages in thread From: Davidlohr Bueso @ 2016-11-21 17:26 UTC (permalink / raw) To: Oleg Nesterov Cc: mingo, peterz, john.stultz, dimitrysh, linux-kernel, Davidlohr Bueso On Mon, 21 Nov 2016, Oleg Nesterov wrote: >On 11/18, Davidlohr Bueso wrote: >> >> @@ -12,7 +12,7 @@ struct percpu_rw_semaphore { >> struct rcu_sync rss; >> unsigned int __percpu *read_count; >> struct rw_semaphore rw_sem; >> - wait_queue_head_t writer; >> + struct swait_queue_head writer; > >I won't argue, but even swait_queue_head is overkill in this case. > >We can just add "struct task_struct *writer" into percpu_rw_semaphore, Given that this is how all things locking/ work, I very much agree with you, lemme send a v2. ^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH v2 2/3] locking/percpu-rwsem: Rework writer block/wake to not use wait-queues 2016-11-18 18:54 ` [PATCH 2/3] locking/percpu-rwsem: Replace bulky wait-queues with swait Davidlohr Bueso 2016-11-21 12:55 ` Oleg Nesterov @ 2016-12-03 2:18 ` Davidlohr Bueso 2016-12-05 8:36 ` Peter Zijlstra 2016-12-05 17:13 ` Oleg Nesterov 1 sibling, 2 replies; 19+ messages in thread From: Davidlohr Bueso @ 2016-12-03 2:18 UTC (permalink / raw) To: mingo, peterz, oleg; +Cc: john.stultz, dimitrysh, linux-kernel, Davidlohr Bueso The use of any kind of wait queue is overkill for pcpu-rwsems. While one option would be to use the less heavy simple (swait) flavor, even that is still too much for what pcpu-rwsems need. For one, we do not care about any sort of queuing: the only (rare) time writers (and readers, for that matter) are queued is when trying to acquire the regular contended rw_sem. There cannot be any further queuing as writers are serialized by the rw_sem in the first place. This patch, therefore, implements a custom wait/wake scheme with an rcu-aware writer task pointer. The only time this is !nil is when a writer is determining if it is going to block, and it is reset as soon as we know that the percpu_down_write() call has succeeded. All this is obviously done while holding the regular rw_sem. As such, we can avoid the queue handling and locking overhead (although we currently end up taking the waitqueue spinlock in the fastpath, so it wouldn't be a very big impact). 
Signed-off-by: Davidlohr Bueso <dbueso@suse.de> --- include/linux/percpu-rwsem.h | 5 ++--- kernel/locking/percpu-rwsem.c | 26 +++++++++++++++++++++----- 2 files changed, 23 insertions(+), 8 deletions(-) diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h index 5b2e6159b744..9942b7e8bde8 100644 --- a/include/linux/percpu-rwsem.h +++ b/include/linux/percpu-rwsem.h @@ -4,7 +4,6 @@ #include <linux/atomic.h> #include <linux/rwsem.h> #include <linux/percpu.h> -#include <linux/wait.h> #include <linux/rcu_sync.h> #include <linux/lockdep.h> @@ -12,7 +11,7 @@ struct percpu_rw_semaphore { struct rcu_sync rss; unsigned int __percpu *read_count; struct rw_semaphore rw_sem; - wait_queue_head_t writer; + struct task_struct *writer; /* blocked writer */ int readers_block; }; @@ -22,7 +21,7 @@ static struct percpu_rw_semaphore name = { \ .rss = __RCU_SYNC_INITIALIZER(name.rss, RCU_SCHED_SYNC), \ .read_count = &__percpu_rwsem_rc_##name, \ .rw_sem = __RWSEM_INITIALIZER(name.rw_sem), \ - .writer = __WAIT_QUEUE_HEAD_INITIALIZER(name.writer), \ + .writer = NULL, \ } extern int __percpu_down_read(struct percpu_rw_semaphore *, int); diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c index ce182599cf2e..7856a77396d3 100644 --- a/kernel/locking/percpu-rwsem.c +++ b/kernel/locking/percpu-rwsem.c @@ -1,7 +1,6 @@ #include <linux/atomic.h> #include <linux/rwsem.h> #include <linux/percpu.h> -#include <linux/wait.h> #include <linux/lockdep.h> #include <linux/percpu-rwsem.h> #include <linux/rcupdate.h> @@ -18,7 +17,7 @@ int __percpu_init_rwsem(struct percpu_rw_semaphore *sem, /* ->rw_sem represents the whole percpu_rw_semaphore for lockdep */ rcu_sync_init(&sem->rss, RCU_SCHED_SYNC); __init_rwsem(&sem->rw_sem, name, rwsem_key); - init_waitqueue_head(&sem->writer); + sem->writer = NULL; sem->readers_block = 0; return 0; } @@ -94,6 +93,8 @@ EXPORT_SYMBOL_GPL(__percpu_down_read); void __percpu_up_read(struct percpu_rw_semaphore *sem) { + struct task_struct 
*writer; + smp_mb(); /* B matches C */ /* * In other words, if they see our decrement (presumably to aggregate @@ -102,8 +103,13 @@ void __percpu_up_read(struct percpu_rw_semaphore *sem) */ __this_cpu_dec(*sem->read_count); + rcu_read_lock(); + writer = rcu_dereference(sem->writer); + /* Prod writer to recheck readers_active */ - wake_up(&sem->writer); + if (writer) + wake_up_process(writer); + rcu_read_unlock(); } EXPORT_SYMBOL_GPL(__percpu_up_read); @@ -159,8 +165,18 @@ void percpu_down_write(struct percpu_rw_semaphore *sem) * will wait for them. */ - /* Wait for all now active readers to complete. */ - wait_event(sem->writer, readers_active_check(sem)); + WRITE_ONCE(sem->writer, current); + for (;;) { + set_current_state(TASK_UNINTERRUPTIBLE); + + if (readers_active_check(sem)) + break; + + schedule(); + } + + rcu_assign_pointer(sem->writer, NULL); + __set_current_state(TASK_RUNNING); } EXPORT_SYMBOL_GPL(percpu_down_write); -- 2.6.6 ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH v2 2/3] locking/percpu-rwsem: Rework writer block/wake to not use wait-queues 2016-12-03 2:18 ` [PATCH v2 2/3] locking/percpu-rwsem: Rework writer block/wake to not use wait-queues Davidlohr Bueso @ 2016-12-05 8:36 ` Peter Zijlstra 2016-12-05 11:26 ` Oleg Nesterov 2016-12-05 17:19 ` Oleg Nesterov 2016-12-05 17:13 ` Oleg Nesterov 1 sibling, 2 replies; 19+ messages in thread From: Peter Zijlstra @ 2016-12-05 8:36 UTC (permalink / raw) To: Davidlohr Bueso Cc: mingo, oleg, john.stultz, dimitrysh, linux-kernel, Davidlohr Bueso On Fri, Dec 02, 2016 at 06:18:39PM -0800, Davidlohr Bueso wrote: > @@ -102,8 +103,13 @@ void __percpu_up_read(struct percpu_rw_semaphore *sem) > */ > __this_cpu_dec(*sem->read_count); > > + rcu_read_lock(); > + writer = rcu_dereference(sem->writer); Don't think this is correct, I think Oleg suggested using task_rcu_dereference(), which is a giant pile of magic. The problem is that task_struct isn't RCU protected as such. > + > /* Prod writer to recheck readers_active */ > - wake_up(&sem->writer); > + if (writer) > + wake_up_process(writer); > + rcu_read_unlock(); > } > EXPORT_SYMBOL_GPL(__percpu_up_read); > > @@ -159,8 +165,18 @@ void percpu_down_write(struct percpu_rw_semaphore *sem) > * will wait for them. > */ > > - /* Wait for all now active readers to complete. */ > - wait_event(sem->writer, readers_active_check(sem)); > + WRITE_ONCE(sem->writer, current); So this one matches rcu_dereference(), which is weird, because you now have unmatched barriers. > + for (;;) { > + set_current_state(TASK_UNINTERRUPTIBLE); > + > + if (readers_active_check(sem)) > + break; > + > + schedule(); > + } > + > + rcu_assign_pointer(sem->writer, NULL); And this one does not, and the value being NULL this actually reverts to WRITE_ONCE(). > + __set_current_state(TASK_RUNNING); > } > EXPORT_SYMBOL_GPL(percpu_down_write); ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v2 2/3] locking/percpu-rwsem: Rework writer block/wake to not use wait-queues 2016-12-05 8:36 ` Peter Zijlstra @ 2016-12-05 11:26 ` Oleg Nesterov 2016-12-05 11:32 ` Oleg Nesterov 2016-12-05 17:37 ` Davidlohr Bueso 2016-12-05 17:19 ` Oleg Nesterov 1 sibling, 2 replies; 19+ messages in thread From: Oleg Nesterov @ 2016-12-05 11:26 UTC (permalink / raw) To: Peter Zijlstra Cc: Davidlohr Bueso, mingo, john.stultz, dimitrysh, linux-kernel, Davidlohr Bueso Davidlohr, Peter, I'll try to read this patch later, just one note. On 12/05, Peter Zijlstra wrote: > > On Fri, Dec 02, 2016 at 06:18:39PM -0800, Davidlohr Bueso wrote: > > @@ -102,8 +103,13 @@ void __percpu_up_read(struct percpu_rw_semaphore *sem) > > */ > > __this_cpu_dec(*sem->read_count); > > > > + rcu_read_lock(); > > + writer = rcu_dereference(sem->writer); > > Don't think this is correct, I think Oleg suggested using > task_rcu_dereference(), which is a giant pile of magic. Yes, but on a second thought task_rcu_dereference() won't really help, but we can just use rcu_dereference(). > The problem is that task_struct isn't RCU protected as such. Yes. But percpu_down_write() should not be used after exit_notify(), so we can rely on rcu_read_lock(), release_task()->call_rcu(delayed_put_task_struct) can't be called until an exiting task passes exit_notify(). But then we probably need WARN_ON(current->exit_state) in percpu_down_write(). And personally I think this change should add the new helpers, they can have more users. Something like struct xxx { struct task_struct *task; }; xxx_wake_up(struct xxx *xxx) { rcu_read_lock(); task = rcu_dereference(xxx->task); if (task) wake_up_process(task); rcu_read_unlock(); } #define xxx_wait_event(xxx, event) { // comment to explain why WARN_ON(current->exit_state); xxx->task = current; ... } Oleg. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v2 2/3] locking/percpu-rwsem: Rework writer block/wake to not use wait-queues 2016-12-05 11:26 ` Oleg Nesterov @ 2016-12-05 11:32 ` Oleg Nesterov 2016-12-05 17:37 ` Davidlohr Bueso 1 sibling, 0 replies; 19+ messages in thread From: Oleg Nesterov @ 2016-12-05 11:32 UTC (permalink / raw) To: Peter Zijlstra Cc: Davidlohr Bueso, mingo, john.stultz, dimitrysh, linux-kernel, Davidlohr Bueso On 12/05, Oleg Nesterov wrote: > > Yes, but on a second thought task_rcu_dereference() won't really help, I forgot to explain why, see below. > #define xxx_wait_event(xxx, event) { > // comment to explain why > WARN_ON(current->exit_state); Otherwise this process/thread can be already (auto)reaped and wakeup can't rely on rcu. And task_rcu_dereference() can't help because it can return NULL in this case. Oleg. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v2 2/3] locking/percpu-rwsem: Rework writer block/wake to not use wait-queues 2016-12-05 11:26 ` Oleg Nesterov 2016-12-05 11:32 ` Oleg Nesterov @ 2016-12-05 17:37 ` Davidlohr Bueso 1 sibling, 0 replies; 19+ messages in thread From: Davidlohr Bueso @ 2016-12-05 17:37 UTC (permalink / raw) To: Oleg Nesterov Cc: Peter Zijlstra, mingo, john.stultz, dimitrysh, linux-kernel, Davidlohr Bueso On Mon, 05 Dec 2016, Oleg Nesterov wrote: >Yes. But percpu_down_write() should not be used after exit_notify(), so we >can rely on rcu_read_lock(), release_task()->call_rcu(delayed_put_task_struct) >can't be called until an exiting task passes exit_notify(). > >But then we probably need WARN_ON(current->exit_state) in percpu_down_write(). Hmm, my immediate thought would have been doing a PF_EXITING check, but of course this enlarges the window for the warn being triggered, yet maintains what you are saying in that percpu_down_write should not be used after do_exit/exit_notify. Furthermore, reading the comment in task_rcu_dereference, I get your point: we can lose the reference to the task, iow a busted rcu read critical region. > >And personally I think this change should add the new helpers, they can have >more users. Something like > > struct xxx { What do you think of s/xxx/rcuwait? Thanks, Davidlohr ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v2 2/3] locking/percpu-rwsem: Rework writer block/wake to not use wait-queues 2016-12-05 8:36 ` Peter Zijlstra 2016-12-05 11:26 ` Oleg Nesterov @ 2016-12-05 17:19 ` Oleg Nesterov 1 sibling, 0 replies; 19+ messages in thread From: Oleg Nesterov @ 2016-12-05 17:19 UTC (permalink / raw) To: Peter Zijlstra Cc: Davidlohr Bueso, mingo, john.stultz, dimitrysh, linux-kernel, Davidlohr Bueso On 12/05, Peter Zijlstra wrote: > > > + for (;;) { > > + set_current_state(TASK_UNINTERRUPTIBLE); > > + > > + if (readers_active_check(sem)) > > + break; > > + > > + schedule(); > > + } > > + > > + rcu_assign_pointer(sem->writer, NULL); > > And this one does not, and the value being NULL this actually reverts to > WRITE_ONCE(). Do we really care? We do not even need WRITE_ONCE() afaics, this is like __set_current_state(TASK_RUNNING) after the main loop. We can't avoid the spurious wakeups anyway after return from percpu_down_write(). Oleg. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v2 2/3] locking/percpu-rwsem: Rework writer block/wake to not use wait-queues 2016-12-03 2:18 ` [PATCH v2 2/3] locking/percpu-rwsem: Rework writer block/wake to not use wait-queues Davidlohr Bueso 2016-12-05 8:36 ` Peter Zijlstra @ 2016-12-05 17:13 ` Oleg Nesterov 1 sibling, 0 replies; 19+ messages in thread From: Oleg Nesterov @ 2016-12-05 17:13 UTC (permalink / raw) To: Davidlohr Bueso Cc: mingo, peterz, john.stultz, dimitrysh, linux-kernel, Davidlohr Bueso On 12/02, Davidlohr Bueso wrote: > > @@ -102,8 +103,13 @@ void __percpu_up_read(struct percpu_rw_semaphore *sem) > */ > __this_cpu_dec(*sem->read_count); > > + rcu_read_lock(); > + writer = rcu_dereference(sem->writer); > + > /* Prod writer to recheck readers_active */ > - wake_up(&sem->writer); > + if (writer) > + wake_up_process(writer); > + rcu_read_unlock(); This needs a barrier between __this_cpu_dec() and rcu_dereference(), I think. > @@ -159,8 +165,18 @@ void percpu_down_write(struct percpu_rw_semaphore *sem) > * will wait for them. > */ > > - /* Wait for all now active readers to complete. */ > - wait_event(sem->writer, readers_active_check(sem)); > + WRITE_ONCE(sem->writer, current); > + for (;;) { > + set_current_state(TASK_UNINTERRUPTIBLE); > + > + if (readers_active_check(sem)) > + break; This looks fine, we can rely on set_current_state() which inserts a barrier between WRITE_ONCE() and readers_active_check(). So we do not even need WRITE_ONCE(). And the fact this needs the barriers and the comments makes me think again you should add the new helpers. Oleg. ^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH 3/3] locking/percpu-rwsem: Avoid unnecessary writer wakeups 2016-11-18 18:54 [PATCH -tip 0/3] locking/percpu-rwsem: writer-side optimizations Davidlohr Bueso 2016-11-18 18:54 ` [PATCH 1/3] locking/percpu-rwsem: Move text file into Documentation/locking/ Davidlohr Bueso 2016-11-18 18:54 ` [PATCH 2/3] locking/percpu-rwsem: Replace bulky wait-queues with swait Davidlohr Bueso @ 2016-11-18 18:54 ` Davidlohr Bueso 2016-11-21 12:23 ` Oleg Nesterov 2 siblings, 1 reply; 19+ messages in thread From: Davidlohr Bueso @ 2016-11-18 18:54 UTC (permalink / raw) To: mingo, peterz, oleg Cc: john.stultz, dimitrysh, linux-kernel, dave, Davidlohr Bueso There is obviously no point in doing the wakeup if the unlocking reader can do the active-readers check on behalf of the writer (the wakee) on its way out. The downside is that when readers_active_check() does return true we end up iterating the per-CPU counters twice. This trade-off, however, does seem reasonable in that if we are here, (i) we have already lost any hope for reader-side performance; and (ii) because this lock is used mainly for sharing, it is not crazy to expect readers to keep coming in for the lock during this window -- which would otherwise send the writer right back to sleep for every reader up(). Signed-off-by: Davidlohr Bueso <dbueso@suse.de> --- kernel/locking/percpu-rwsem.c | 72 ++++++++++++++++++++++++------------------- 1 file changed, 40 insertions(+), 32 deletions(-) diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c index cb71201855f2..8e6fbf117f14 100644 --- a/kernel/locking/percpu-rwsem.c +++ b/kernel/locking/percpu-rwsem.c @@ -8,6 +8,44 @@ #include <linux/sched.h> #include <linux/errno.h> +#define per_cpu_sum(var) \ +({ \ + typeof(var) __sum = 0; \ + int cpu; \ + compiletime_assert_atomic_type(__sum); \ + for_each_possible_cpu(cpu) \ + __sum += per_cpu(var, cpu); \ + __sum; \ +}) + +/* + * Return true if the modular sum of the sem->read_count per-CPU variable is + * zero. 
If this sum is zero, then it is stable due to the fact that if any + * newly arriving readers increment a given counter, they will immediately + * decrement that same counter. + */ +static bool readers_active_check(struct percpu_rw_semaphore *sem) +{ + if (per_cpu_sum(*sem->read_count) != 0) + return false; + + /* + * If we observed the decrement; ensure we see the entire critical + * section. In the case of __readers_active_check we avoid the + * critical section sync, as the writer wakee will fully re-check + * to continue. + */ + + smp_mb(); /* C matches B */ + + return true; +} + +static bool __readers_active_check(struct percpu_rw_semaphore *sem) +{ + return !(per_cpu_sum(*sem->read_count) !=0); +} + int __percpu_init_rwsem(struct percpu_rw_semaphore *sem, const char *name, struct lock_class_key *rwsem_key) { @@ -103,41 +141,11 @@ void __percpu_up_read(struct percpu_rw_semaphore *sem) __this_cpu_dec(*sem->read_count); /* Prod writer to recheck readers_active */ - swake_up(&sem->writer); + if (__readers_active_check(sem)) + swake_up(&sem->writer); } EXPORT_SYMBOL_GPL(__percpu_up_read); -#define per_cpu_sum(var) \ -({ \ - typeof(var) __sum = 0; \ - int cpu; \ - compiletime_assert_atomic_type(__sum); \ - for_each_possible_cpu(cpu) \ - __sum += per_cpu(var, cpu); \ - __sum; \ -}) - -/* - * Return true if the modular sum of the sem->read_count per-CPU variable is - * zero. If this sum is zero, then it is stable due to the fact that if any - * newly arriving readers increment a given counter, they will immediately - * decrement that same counter. - */ -static bool readers_active_check(struct percpu_rw_semaphore *sem) -{ - if (per_cpu_sum(*sem->read_count) != 0) - return false; - - /* - * If we observed the decrement; ensure we see the entire critical - * section. - */ - - smp_mb(); /* C matches B */ - - return true; -} - void percpu_down_write(struct percpu_rw_semaphore *sem) { /* Notify readers to take the slow path. 
*/ -- 2.6.6 ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH 3/3] locking/percpu-rwsem: Avoid unnecessary writer wakeups 2016-11-18 18:54 ` [PATCH 3/3] locking/percpu-rwsem: Avoid unnecessary writer wakeups Davidlohr Bueso @ 2016-11-21 12:23 ` Oleg Nesterov 2016-11-21 12:29 ` Peter Zijlstra 0 siblings, 1 reply; 19+ messages in thread From: Oleg Nesterov @ 2016-11-21 12:23 UTC (permalink / raw) To: Davidlohr Bueso Cc: mingo, peterz, john.stultz, dimitrysh, linux-kernel, Davidlohr Bueso On 11/18, Davidlohr Bueso wrote: > > +static bool __readers_active_check(struct percpu_rw_semaphore *sem) > +{ > + return !(per_cpu_sum(*sem->read_count) !=0); > +} Hmm, return per_cpu_sum(*sem->read_count) == 0; looks more clear, but this is minor, > int __percpu_init_rwsem(struct percpu_rw_semaphore *sem, > const char *name, struct lock_class_key *rwsem_key) > { > @@ -103,41 +141,11 @@ void __percpu_up_read(struct percpu_rw_semaphore *sem) > __this_cpu_dec(*sem->read_count); > > /* Prod writer to recheck readers_active */ > - swake_up(&sem->writer); > + if (__readers_active_check(sem)) > + swake_up(&sem->writer); Suppose we have 2 active readers which call __percpu_up_read() at the same time and the pending writer sleeps. What guarantees that one of these readers will observe per_cpu_sum() == 0 ? They both can read the old value of the remote per-cpu counter, no? Oleg. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 3/3] locking/percpu-rwsem: Avoid unnecessary writer wakeups 2016-11-21 12:23 ` Oleg Nesterov @ 2016-11-21 12:29 ` Peter Zijlstra 2016-11-21 12:47 ` Oleg Nesterov 0 siblings, 1 reply; 19+ messages in thread From: Peter Zijlstra @ 2016-11-21 12:29 UTC (permalink / raw) To: Oleg Nesterov Cc: Davidlohr Bueso, mingo, john.stultz, dimitrysh, linux-kernel, Davidlohr Bueso On Mon, Nov 21, 2016 at 01:23:44PM +0100, Oleg Nesterov wrote: > On 11/18, Davidlohr Bueso wrote: > > > > +static bool __readers_active_check(struct percpu_rw_semaphore *sem) > > +{ > > + return !(per_cpu_sum(*sem->read_count) !=0); > > +} > > Hmm, > > return per_cpu_sum(*sem->read_count) == 0; > > looks more clear, but this is minor, Very much so; that must be one of the most convoluted statements possible :-). > > > int __percpu_init_rwsem(struct percpu_rw_semaphore *sem, > > const char *name, struct lock_class_key *rwsem_key) > > { > > @@ -103,41 +141,11 @@ void __percpu_up_read(struct percpu_rw_semaphore *sem) > > __this_cpu_dec(*sem->read_count); > > > > /* Prod writer to recheck readers_active */ > > - swake_up(&sem->writer); > > + if (__readers_active_check(sem)) > > + swake_up(&sem->writer); > > Suppose we have 2 active readers which call __percpu_up_read() at the same > time and the pending writer sleeps. > > What guarantees that one of these readers will observe per_cpu_sum() == 0 ? > They both can read the old value of the remote per-cpu counter, no? In particular, you're thinking of what provides the guarantee that the woken CPU observes the same state the waker saw? Isn't this one of the Program-Order guarantees the scheduler _should_ provide? ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 3/3] locking/percpu-rwsem: Avoid unnecessary writer wakeups 2016-11-21 12:29 ` Peter Zijlstra @ 2016-11-21 12:47 ` Oleg Nesterov 2016-11-21 15:07 ` Oleg Nesterov 0 siblings, 1 reply; 19+ messages in thread From: Oleg Nesterov @ 2016-11-21 12:47 UTC (permalink / raw) To: Peter Zijlstra Cc: Davidlohr Bueso, mingo, john.stultz, dimitrysh, linux-kernel, Davidlohr Bueso On 11/21, Peter Zijlstra wrote: > > On Mon, Nov 21, 2016 at 01:23:44PM +0100, Oleg Nesterov wrote: > > > int __percpu_init_rwsem(struct percpu_rw_semaphore *sem, > > > const char *name, struct lock_class_key *rwsem_key) > > > { > > > @@ -103,41 +141,11 @@ void __percpu_up_read(struct percpu_rw_semaphore *sem) > > > __this_cpu_dec(*sem->read_count); > > > > > > /* Prod writer to recheck readers_active */ > > > - swake_up(&sem->writer); > > > + if (__readers_active_check(sem)) > > > + swake_up(&sem->writer); > > > > Suppose we have 2 active readers which call __percpu_up_read() at the same > > time and the pending writer sleeps. > > > > What guarantees that one of these readers will observe per_cpu_sum() == 0 ? > > They both can read the old value of the remote per-cpu counter, no? > > In particular, you're thinking of what provides the guarantee that the > woken CPU observes the same state the wakee saw? No, no, I meant that afaics both readers can see per_cpu_sum() != 0 and thus the writer won't be woken up. Till the next down_read/up_read. Suppose that we have 2 CPU's, both counters == 1, both readers decrement. its counter at the same time. READER_ON_CPU_0 READER_ON_CPU_1 --ctr_0; --ctr_1; if (ctr_0 + ctr_1) if (ctr_0 + ctr_1) wakeup(); wakeup(); Why we can't miss a wakeup? This patch doesn't even add a barrier, but I think wmb() won't be enough anyway. Oleg. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 3/3] locking/percpu-rwsem: Avoid unnecessary writer wakeups 2016-11-21 12:47 ` Oleg Nesterov @ 2016-11-21 15:07 ` Oleg Nesterov 2016-11-22 3:59 ` Davidlohr Bueso 0 siblings, 1 reply; 19+ messages in thread From: Oleg Nesterov @ 2016-11-21 15:07 UTC (permalink / raw) To: Peter Zijlstra Cc: Davidlohr Bueso, mingo, john.stultz, dimitrysh, linux-kernel, Davidlohr Bueso On 11/21, Oleg Nesterov wrote: > > No, no, I meant that afaics both readers can see per_cpu_sum() != 0 and > thus the writer won't be woken up. Till the next down_read/up_read. > > Suppose that we have 2 CPUs, both counters == 1, both readers decrement > their counters at the same time. > > READER_ON_CPU_0 READER_ON_CPU_1 > > --ctr_0; --ctr_1; > > if (ctr_0 + ctr_1) if (ctr_0 + ctr_1) > wakeup(); wakeup(); > > Why can't we miss a wakeup? > > This patch doesn't even add a barrier, but I think wmb() won't be enough > anyway. And in fact I am not sure this optimization makes sense... But it would be nice to avoid wake_up() when the writer sleeps in rcu_sync_enter(). Or when this is the "slow mode" sem (cgroup_threadgroup_rwsem). I need to re-check, but what do you think about the change below? Oleg. --- x/kernel/locking/percpu-rwsem.c +++ x/kernel/locking/percpu-rwsem.c @@ -103,7 +103,9 @@ void __percpu_up_read(struct percpu_rw_s __this_cpu_dec(*sem->read_count); /* Prod writer to recheck readers_active */ - wake_up(&sem->writer); + smp_mb(); + if (sem->readers_block) + wake_up(&sem->writer); } EXPORT_SYMBOL_GPL(__percpu_up_read); ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 3/3] locking/percpu-rwsem: Avoid unnecessary writer wakeups 2016-11-21 15:07 ` Oleg Nesterov @ 2016-11-22 3:59 ` Davidlohr Bueso 2016-11-23 14:43 ` Oleg Nesterov 0 siblings, 1 reply; 19+ messages in thread From: Davidlohr Bueso @ 2016-11-22 3:59 UTC (permalink / raw) To: Oleg Nesterov Cc: Peter Zijlstra, mingo, john.stultz, dimitrysh, linux-kernel, Davidlohr Bueso On Mon, 21 Nov 2016, Oleg Nesterov wrote: >On 11/21, Oleg Nesterov wrote: >> >> No, no, I meant that afaics both readers can see per_cpu_sum() != 0 and >> thus the writer won't be woken up. Till the next down_read/up_read. >> >> Suppose that we have 2 CPUs, both counters == 1, both readers decrement >> their counters at the same time. >> >> READER_ON_CPU_0 READER_ON_CPU_1 >> >> --ctr_0; --ctr_1; >> >> if (ctr_0 + ctr_1) if (ctr_0 + ctr_1) >> wakeup(); wakeup(); >> >> Why can't we miss a wakeup? But the patch is really: if (!(ctr_0 + ctr_1)). Regarding stale values: is this because, absent ordering, a reader is only guaranteed to see the real value of its own CPU's ctr, with no such guarantee for the other CPUs'? If so, I had not considered that scenario, and yes we'd need stronger guarantees. I'd have to wonder if other users of per_cpu_sum() would fall into a similar trap. Hmm, and each user seems to implement its own copy of the same thing. >And in fact I am not sure this optimization makes sense... But it would be >nice to avoid wake_up() when the writer sleeps in rcu_sync_enter(). Or when >this is the "slow mode" sem (cgroup_threadgroup_rwsem). Why do you think using per_cpu_sum() does not make sense? As mentioned in the changelog it optimizes for incoming readers while the writer is doing sync_enter and getting the regular rwsem. What am I missing? > >I need to re-check, but what do you think about the change below? While optimizing for multiple writers (rcu_sync_enter) is certainly valid (at least considering the cgroups rwsem you mention), I think that my heuristic covers the otherwise more common case. 
Could both optimizations not work together? Of course, the window where readers_block == 1 is quite large, so there can be a lot of false positives. Thanks, Davidlohr ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 3/3] locking/percpu-rwsem: Avoid unnecessary writer wakeups 2016-11-22 3:59 ` Davidlohr Bueso @ 2016-11-23 14:43 ` Oleg Nesterov 0 siblings, 0 replies; 19+ messages in thread From: Oleg Nesterov @ 2016-11-23 14:43 UTC (permalink / raw) To: Davidlohr Bueso Cc: Peter Zijlstra, mingo, john.stultz, dimitrysh, linux-kernel, Davidlohr Bueso On 11/21, Davidlohr Bueso wrote: > > On Mon, 21 Nov 2016, Oleg Nesterov wrote: > >> On 11/21, Oleg Nesterov wrote: >>> >>> No, no, I meant that afaics both readers can see per_cpu_sum() != 0 and >>> thus the writer won't be woken up. Till the next down_read/up_read. >>> >>> Suppose that we have 2 CPUs, both counters == 1, both readers decrement >>> their counters at the same time. >>> >>> READER_ON_CPU_0 READER_ON_CPU_1 >>> >>> --ctr_0; --ctr_1; >>> >>> if (ctr_0 + ctr_1) if (ctr_0 + ctr_1) >>> wakeup(); wakeup(); >>> >>> Why can't we miss a wakeup? > > But the patch is really: if (!(ctr_0 + ctr_1)). Of course, I meant if (ctr_0 + ctr_1 == 0). >> And in fact I am not sure this optimization makes sense... But it would be >> nice to avoid wake_up() when the writer sleeps in rcu_sync_enter(). Or when >> this is the "slow mode" sem (cgroup_threadgroup_rwsem). > > Why do you think using per_cpu_sum() does not make sense? As mentioned in the > changelog it optimizes for incoming readers while the writer is doing sync_enter > and getting the regular rwsem. What am I missing? And this does make sense, but see below, >> I need to re-check, but what do you think about the change below? > > While optimizing for multiple writers (rcu_sync_enter) is certainly valid > (at least considering the cgroups rwsem you mention), No, it is not for multiple writers. rcu_sync_enter() is slow; new readers can come in and acquire/release this lock. And if it is a "slow mode" sem, then every up() does a wakeup, which we want to eliminate. 
But once sem->readers_block is already true, I am not sure the additional per_cpu_sum() is a win (even if it were correct); new readers can't come in. Except for the __percpu_down_read()->__percpu_up_read() path, which we want to optimize too, but in that case we do not need per_cpu_sum() either. I'll try to make a patch this week... I had this optimization in mind from the very beginning, I even mentioned it during the last discussion, but never had time. Basically we should not inc if readers_block == T. Oleg. ^ permalink raw reply [flat|nested] 19+ messages in thread
Thread overview: 19+ messages -- 2016-11-18 18:54 [PATCH -tip 0/3] locking/percpu-rwsem: writer-side optimizations Davidlohr Bueso 2016-11-18 18:54 ` [PATCH 1/3] locking/percpu-rwsem: Move text file into Documentation/locking/ Davidlohr Bueso 2016-11-18 18:54 ` [PATCH 2/3] locking/percpu-rwsem: Replace bulky wait-queues with swait Davidlohr Bueso 2016-11-21 12:55 ` Oleg Nesterov 2016-11-21 17:26 ` Davidlohr Bueso 2016-12-03 2:18 ` [PATCH v2 2/3] locking/percpu-rwsem: Rework writer block/wake to not use wait-queues Davidlohr Bueso 2016-12-05 8:36 ` Peter Zijlstra 2016-12-05 11:26 ` Oleg Nesterov 2016-12-05 11:32 ` Oleg Nesterov 2016-12-05 17:37 ` Davidlohr Bueso 2016-12-05 17:19 ` Oleg Nesterov 2016-12-05 17:13 ` Oleg Nesterov 2016-11-18 18:54 ` [PATCH 3/3] locking/percpu-rwsem: Avoid unnecessary writer wakeups Davidlohr Bueso 2016-11-21 12:23 ` Oleg Nesterov 2016-11-21 12:29 ` Peter Zijlstra 2016-11-21 12:47 ` Oleg Nesterov 2016-11-21 15:07 ` Oleg Nesterov 2016-11-22 3:59 ` Davidlohr Bueso 2016-11-23 14:43 ` Oleg Nesterov