* [RFC PATCH 0/2] uprobes: register/unregister can race with fork @ 2012-10-15 19:09 Oleg Nesterov 2012-10-15 19:10 ` [PATCH 1/2] brw_mutex: big read-write mutex Oleg Nesterov 2012-10-15 19:10 ` [PATCH 2/2] uprobes: Use brw_mutex to fix register/unregister vs dup_mmap() race Oleg Nesterov 0 siblings, 2 replies; 103+ messages in thread From: Oleg Nesterov @ 2012-10-15 19:09 UTC (permalink / raw) To: Ingo Molnar, Linus Torvalds, Paul E. McKenney, Peter Zijlstra, Srikar Dronamraju Cc: Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel Hello. Well. The very fact this series adds the new locking primitive probably means we should try to find another fix. And yes, it is possible to fix this differently, afaics. But this will need more complications, I think. So please review. As for 1/2: - I really hope paulmck/peterz will tell me if it is correct or not - The naming sucks, and I agree with any suggestions - Probably this code should be compiled only if CONFIG_UPROBES Oleg. ^ permalink raw reply [flat|nested] 103+ messages in thread
* [PATCH 1/2] brw_mutex: big read-write mutex 2012-10-15 19:09 [RFC PATCH 0/2] uprobes: register/unregister can race with fork Oleg Nesterov @ 2012-10-15 19:10 ` Oleg Nesterov 2012-10-15 23:28 ` Paul E. McKenney 2012-10-16 19:56 ` Linus Torvalds 2012-10-15 19:10 ` [PATCH 2/2] uprobes: Use brw_mutex to fix register/unregister vs dup_mmap() race Oleg Nesterov 1 sibling, 2 replies; 103+ messages in thread From: Oleg Nesterov @ 2012-10-15 19:10 UTC (permalink / raw) To: Ingo Molnar, Linus Torvalds, Paul E. McKenney, Peter Zijlstra, Srikar Dronamraju Cc: Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel This patch adds the new sleeping lock, brw_mutex. Unlike rw_semaphore it allows multiple writers too, just "read" and "write" are mutually exclusive. brw_start_read() and brw_end_read() are extremely cheap, they only do this_cpu_inc(read_ctr) + atomic_read() if there are no waiting writers. OTOH it is write-biased, any brw_start_write() blocks the new readers. But "write" is slow, it does synchronize_sched() to serialize with preempt_disable() in brw_start_read(), and wait_event(write_waitq) can have a lot of extra wakeups before percpu-counter-sum becomes zero. 
Signed-off-by: Oleg Nesterov <oleg@redhat.com> --- include/linux/brw_mutex.h | 22 +++++++++++++++ lib/Makefile | 2 +- lib/brw_mutex.c | 67 +++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 90 insertions(+), 1 deletions(-) create mode 100644 include/linux/brw_mutex.h create mode 100644 lib/brw_mutex.c diff --git a/include/linux/brw_mutex.h b/include/linux/brw_mutex.h new file mode 100644 index 0000000..16b8d5f --- /dev/null +++ b/include/linux/brw_mutex.h @@ -0,0 +1,22 @@ +#ifndef _LINUX_BRW_MUTEX_H +#define _LINUX_BRW_MUTEX_H + +#include <linux/percpu.h> +#include <linux/wait.h> + +struct brw_mutex { + long __percpu *read_ctr; + atomic_t write_ctr; + wait_queue_head_t read_waitq; + wait_queue_head_t write_waitq; +}; + +extern int brw_mutex_init(struct brw_mutex *brw); + +extern void brw_start_read(struct brw_mutex *brw); +extern void brw_end_read(struct brw_mutex *brw); + +extern void brw_start_write(struct brw_mutex *brw); +extern void brw_end_write(struct brw_mutex *brw); + +#endif diff --git a/lib/Makefile b/lib/Makefile index 3128e35..18f2876 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -12,7 +12,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \ idr.o int_sqrt.o extable.o \ sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \ proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \ - is_single_threaded.o plist.o decompress.o + is_single_threaded.o plist.o decompress.o brw_mutex.o lib-$(CONFIG_MMU) += ioremap.o lib-$(CONFIG_SMP) += cpumask.o diff --git a/lib/brw_mutex.c b/lib/brw_mutex.c new file mode 100644 index 0000000..41984a6 --- /dev/null +++ b/lib/brw_mutex.c @@ -0,0 +1,67 @@ +#include <linux/brw_mutex.h> +#include <linux/rcupdate.h> +#include <linux/sched.h> + +int brw_mutex_init(struct brw_mutex *brw) +{ + atomic_set(&brw->write_ctr, 0); + init_waitqueue_head(&brw->read_waitq); + init_waitqueue_head(&brw->write_waitq); + brw->read_ctr = alloc_percpu(long); + return brw->read_ctr ? 
0 : -ENOMEM; +} + +void brw_start_read(struct brw_mutex *brw) +{ + for (;;) { + bool done = false; + + preempt_disable(); + if (likely(!atomic_read(&brw->write_ctr))) { + __this_cpu_inc(*brw->read_ctr); + done = true; + } + preempt_enable(); + + if (likely(done)) + break; + + __wait_event(brw->read_waitq, !atomic_read(&brw->write_ctr)); + } +} + +void brw_end_read(struct brw_mutex *brw) +{ + this_cpu_dec(*brw->read_ctr); + + if (unlikely(atomic_read(&brw->write_ctr))) + wake_up_all(&brw->write_waitq); +} + +static inline long brw_read_ctr(struct brw_mutex *brw) +{ + long sum = 0; + int cpu; + + for_each_possible_cpu(cpu) + sum += per_cpu(*brw->read_ctr, cpu); + + return sum; +} + +void brw_start_write(struct brw_mutex *brw) +{ + atomic_inc(&brw->write_ctr); + synchronize_sched(); + /* + * Thereafter brw_*_read() must see write_ctr != 0, + * and we should see the result of __this_cpu_inc(). + */ + wait_event(brw->write_waitq, brw_read_ctr(brw) == 0); +} + +void brw_end_write(struct brw_mutex *brw) +{ + if (atomic_dec_and_test(&brw->write_ctr)) + wake_up_all(&brw->read_waitq); +} -- 1.5.5.1 ^ permalink raw reply related [flat|nested] 103+ messages in thread
* Re: [PATCH 1/2] brw_mutex: big read-write mutex 2012-10-15 19:10 ` [PATCH 1/2] brw_mutex: big read-write mutex Oleg Nesterov @ 2012-10-15 23:28 ` Paul E. McKenney 2012-10-16 15:56 ` Oleg Nesterov 2012-10-16 19:56 ` Linus Torvalds 1 sibling, 1 reply; 103+ messages in thread From: Paul E. McKenney @ 2012-10-15 23:28 UTC (permalink / raw) To: Oleg Nesterov Cc: Ingo Molnar, Linus Torvalds, Peter Zijlstra, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On Mon, Oct 15, 2012 at 09:10:18PM +0200, Oleg Nesterov wrote: > This patch adds the new sleeping lock, brw_mutex. Unlike rw_semaphore > it allows multiple writers too, just "read" and "write" are mutually > exclusive. > > brw_start_read() and brw_end_read() are extremely cheap, they only do > this_cpu_inc(read_ctr) + atomic_read() if there are no waiting writers. > > OTOH it is write-biased, any brw_start_write() blocks the new readers. > But "write" is slow, it does synchronize_sched() to serialize with > preempt_disable() in brw_start_read(), and wait_event(write_waitq) can > have a lot of extra wakeups before percpu-counter-sum becomes zero. A few questions and comments below, as always. 
Thanx, Paul > Signed-off-by: Oleg Nesterov <oleg@redhat.com> > --- > include/linux/brw_mutex.h | 22 +++++++++++++++ > lib/Makefile | 2 +- > lib/brw_mutex.c | 67 +++++++++++++++++++++++++++++++++++++++++++++ > 3 files changed, 90 insertions(+), 1 deletions(-) > create mode 100644 include/linux/brw_mutex.h > create mode 100644 lib/brw_mutex.c > > diff --git a/include/linux/brw_mutex.h b/include/linux/brw_mutex.h > new file mode 100644 > index 0000000..16b8d5f > --- /dev/null > +++ b/include/linux/brw_mutex.h > @@ -0,0 +1,22 @@ > +#ifndef _LINUX_BRW_MUTEX_H > +#define _LINUX_BRW_MUTEX_H > + > +#include <linux/percpu.h> > +#include <linux/wait.h> > + > +struct brw_mutex { > + long __percpu *read_ctr; > + atomic_t write_ctr; > + wait_queue_head_t read_waitq; > + wait_queue_head_t write_waitq; > +}; > + > +extern int brw_mutex_init(struct brw_mutex *brw); > + > +extern void brw_start_read(struct brw_mutex *brw); > +extern void brw_end_read(struct brw_mutex *brw); > + > +extern void brw_start_write(struct brw_mutex *brw); > +extern void brw_end_write(struct brw_mutex *brw); > + > +#endif > diff --git a/lib/Makefile b/lib/Makefile > index 3128e35..18f2876 100644 > --- a/lib/Makefile > +++ b/lib/Makefile > @@ -12,7 +12,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \ > idr.o int_sqrt.o extable.o \ > sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \ > proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \ > - is_single_threaded.o plist.o decompress.o > + is_single_threaded.o plist.o decompress.o brw_mutex.o > > lib-$(CONFIG_MMU) += ioremap.o > lib-$(CONFIG_SMP) += cpumask.o > diff --git a/lib/brw_mutex.c b/lib/brw_mutex.c > new file mode 100644 > index 0000000..41984a6 > --- /dev/null > +++ b/lib/brw_mutex.c > @@ -0,0 +1,67 @@ > +#include <linux/brw_mutex.h> > +#include <linux/rcupdate.h> > +#include <linux/sched.h> > + > +int brw_mutex_init(struct brw_mutex *brw) > +{ > + atomic_set(&brw->write_ctr, 0); > + init_waitqueue_head(&brw->read_waitq); > 
+ init_waitqueue_head(&brw->write_waitq); > + brw->read_ctr = alloc_percpu(long); > + return brw->read_ctr ? 0 : -ENOMEM; > +} > + > +void brw_start_read(struct brw_mutex *brw) > +{ > + for (;;) { > + bool done = false; > + > + preempt_disable(); > + if (likely(!atomic_read(&brw->write_ctr))) { > + __this_cpu_inc(*brw->read_ctr); > + done = true; > + } brw_start_read() is not recursive -- attempting to call it recursively can result in deadlock if a writer has shown up in the meantime. Which is often OK, but not sure what you intended. > + preempt_enable(); > + > + if (likely(done)) > + break; > + > + __wait_event(brw->read_waitq, !atomic_read(&brw->write_ctr)); > + } > +} > + > +void brw_end_read(struct brw_mutex *brw) > +{ I believe that you need smp_mb() here. The wake_up_all()'s memory barriers do not suffice because some other reader might have awakened the writer between this_cpu_dec() and wake_up_all(). IIRC, this smp_mb() is also needed if the timing is such that the writer does not actually block. > + this_cpu_dec(*brw->read_ctr); > + > + if (unlikely(atomic_read(&brw->write_ctr))) > + wake_up_all(&brw->write_waitq); > +} Of course, it would be good to avoid smp_mb on the fast path. Here is one way to avoid it: void brw_end_read(struct brw_mutex *brw) { if (unlikely(atomic_read(&brw->write_ctr))) { smp_mb(); this_cpu_dec(*brw->read_ctr); wake_up_all(&brw->write_waitq); } else { this_cpu_dec(*brw->read_ctr); } } > +static inline long brw_read_ctr(struct brw_mutex *brw) > +{ > + long sum = 0; > + int cpu; > + > + for_each_possible_cpu(cpu) > + sum += per_cpu(*brw->read_ctr, cpu); > + > + return sum; > +} > + > +void brw_start_write(struct brw_mutex *brw) > +{ > + atomic_inc(&brw->write_ctr); > + synchronize_sched(); > + /* > + * Thereafter brw_*_read() must see write_ctr != 0, > + * and we should see the result of __this_cpu_inc(). 
> + */ > + wait_event(brw->write_waitq, brw_read_ctr(brw) == 0); This looks like it allows multiple writers to proceed concurrently. They both increment, do a synchronize_sched(), do the wait_event(), and then are both awakened by the last reader. Was that the intent? (The implementation of brw_end_write() makes it look like it is in fact the intent.) > +} > + > +void brw_end_write(struct brw_mutex *brw) > +{ > + if (atomic_dec_and_test(&brw->write_ctr)) > + wake_up_all(&brw->read_waitq); > +} > -- > 1.5.5.1 > ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH 1/2] brw_mutex: big read-write mutex 2012-10-15 23:28 ` Paul E. McKenney @ 2012-10-16 15:56 ` Oleg Nesterov 2012-10-16 18:58 ` Paul E. McKenney 0 siblings, 1 reply; 103+ messages in thread From: Oleg Nesterov @ 2012-10-16 15:56 UTC (permalink / raw) To: Paul E. McKenney Cc: Ingo Molnar, Linus Torvalds, Peter Zijlstra, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel Paul, thanks for looking! On 10/15, Paul E. McKenney wrote: > > > +void brw_start_read(struct brw_mutex *brw) > > +{ > > + for (;;) { > > + bool done = false; > > + > > + preempt_disable(); > > + if (likely(!atomic_read(&brw->write_ctr))) { > > + __this_cpu_inc(*brw->read_ctr); > > + done = true; > > + } > > brw_start_read() is not recursive -- attempting to call it recursively > can result in deadlock if a writer has shown up in the meantime. Yes, yes, it is not recursive. Like rw_semaphore. > Which is often OK, but not sure what you intended. I forgot to document this in the changelog. > > +void brw_end_read(struct brw_mutex *brw) > > +{ > > I believe that you need smp_mb() here. I don't understand why... > The wake_up_all()'s memory barriers > do not suffice because some other reader might have awakened the writer > between this_cpu_dec() and wake_up_all(). But __wake_up(q) takes q->lock? And the same lock is taken by prepare_to_wait(), so how can the writer miss the result of _dec? > > + this_cpu_dec(*brw->read_ctr); > > + > > + if (unlikely(atomic_read(&brw->write_ctr))) > > + wake_up_all(&brw->write_waitq); > > +} > > Of course, it would be good to avoid smp_mb on the fast path. Here is > one way to avoid it: > > void brw_end_read(struct brw_mutex *brw) > { > if (unlikely(atomic_read(&brw->write_ctr))) { > smp_mb(); > this_cpu_dec(*brw->read_ctr); > wake_up_all(&brw->write_waitq); Hmm... still can't understand. It seems that this mb() is needed to ensure that brw_end_read() can't miss write_ctr != 0. 
But we do not care unless the writer already does wait_event(). And before it does wait_event() it calls synchronize_sched() after it sets write_ctr != 0. Doesn't this mean that after that any preempt-disabled section must see write_ctr != 0 ? This code actually checks write_ctr after preempt_disable + enable, but I think this doesn't matter? Paul, most probably I misunderstood you. Could you spell please? > > +void brw_start_write(struct brw_mutex *brw) > > +{ > > + atomic_inc(&brw->write_ctr); > > + synchronize_sched(); > > + /* > > + * Thereafter brw_*_read() must see write_ctr != 0, > > + * and we should see the result of __this_cpu_inc(). > > + */ > > + wait_event(brw->write_waitq, brw_read_ctr(brw) == 0); > > This looks like it allows multiple writers to proceed concurrently. > They both increment, do a synchronize_sched(), do the wait_event(), > and then are both awakened by the last reader. Yes. From the changelog: Unlike rw_semaphore it allows multiple writers too, just "read" and "write" are mutually exclusive. > Was that the intent? (The implementation of brw_end_write() makes > it look like it is in fact the intent.) Please look at 2/2. Multiple uprobe_register() or uprobe_unregister() can run at the same time to install/remove the system-wide breakpoint, and brw_start_write() is used to block dup_mmap() to avoid the race. But they do not block each other. Thanks! Oleg. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH 1/2] brw_mutex: big read-write mutex 2012-10-16 15:56 ` Oleg Nesterov @ 2012-10-16 18:58 ` Paul E. McKenney 2012-10-17 16:37 ` Oleg Nesterov 0 siblings, 1 reply; 103+ messages in thread From: Paul E. McKenney @ 2012-10-16 18:58 UTC (permalink / raw) To: Oleg Nesterov Cc: Ingo Molnar, Linus Torvalds, Peter Zijlstra, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On Tue, Oct 16, 2012 at 05:56:23PM +0200, Oleg Nesterov wrote: > Paul, thanks for looking! > > On 10/15, Paul E. McKenney wrote: > > > > > +void brw_start_read(struct brw_mutex *brw) > > > +{ > > > + for (;;) { > > > + bool done = false; > > > + > > > + preempt_disable(); > > > + if (likely(!atomic_read(&brw->write_ctr))) { > > > + __this_cpu_inc(*brw->read_ctr); > > > + done = true; > > > + } > > > > brw_start_read() is not recursive -- attempting to call it recursively > > can result in deadlock if a writer has shown up in the meantime. > > Yes, yes, it is not recursive. Like rw_semaphore. > > > Which is often OK, but not sure what you intended. > > I forgot to document this in the changelog. Hey, I had to ask. ;-) > > > +void brw_end_read(struct brw_mutex *brw) > > > +{ > > > > I believe that you need smp_mb() here. > > I don't understand why... > > > The wake_up_all()'s memory barriers > > do not suffice because some other reader might have awakened the writer > > between this_cpu_dec() and wake_up_all(). > > But __wake_up(q) takes q->lock? And the same lock is taken by > prepare_to_wait(), so how can the writer miss the result of _dec? Suppose that the writer arrives and sees that the value of the counter is zero, and thus never sleeps, and so is also not awakened? Unless I am missing something, there are no memory barriers in that case. Which means that you also need an smp_mb() after the wait_event() in the writer, now that I think on it. 
> > > + this_cpu_dec(*brw->read_ctr); > > > + > > > + if (unlikely(atomic_read(&brw->write_ctr))) > > > + wake_up_all(&brw->write_waitq); > > > +} > > > > Of course, it would be good to avoid smp_mb on the fast path. Here is > > one way to avoid it: > > > > void brw_end_read(struct brw_mutex *brw) > > { > > if (unlikely(atomic_read(&brw->write_ctr))) { > > smp_mb(); > > this_cpu_dec(*brw->read_ctr); > > wake_up_all(&brw->write_waitq); > > Hmm... still can't understand. > > It seems that this mb() is needed to ensure that brw_end_read() can't > miss write_ctr != 0. > > But we do not care unless the writer already does wait_event(). And > before it does wait_event() it calls synchronize_sched() after it sets > write_ctr != 0. Doesn't this mean that after that any preempt-disabled > section must see write_ctr != 0 ? > > This code actually checks write_ctr after preempt_disable + enable, > but I think this doesn't matter? > > Paul, most probably I misunderstood you. Could you spell please? Let me try outlining the sequence of events that I am worried about... 1. Task A invokes brw_start_read(). There is no writer, so it takes the fastpath. 2. Task B invokes brw_start_write(), atomically increments &brw->write_ctr, and executes synchronize_sched(). 3. Task A invokes brw_end_read() and does this_cpu_dec(). 4. Task B invokes wait_event(), which invokes brw_read_ctr() and sees the result as zero. Therefore, Task B does not sleep, does not acquire locks, and does not execute any memory barriers. As a result, ordering is not guaranteed between Task A's read-side critical section and Task B's upcoming write-side critical section. So I believe that you need smp_mb() in both brw_end_read() and brw_start_write(). Sigh... It is quite possible that you also need an smp_mb() in brw_start_read(), but let's start with just the scenario above. So, does the above scenario show a problem, or am I confused? 
> > > +void brw_start_write(struct brw_mutex *brw) > > > +{ > > > + atomic_inc(&brw->write_ctr); > > > + synchronize_sched(); > > > + /* > > > + * Thereafter brw_*_read() must see write_ctr != 0, > > > + * and we should see the result of __this_cpu_inc(). > > > + */ > > > + wait_event(brw->write_waitq, brw_read_ctr(brw) == 0); > > > > This looks like it allows multiple writers to proceed concurrently. > > They both increment, do a synchronize_sched(), do the wait_event(), > > and then are both awakened by the last reader. > > Yes. From the changelog: > > Unlike rw_semaphore it allows multiple writers too, > just "read" and "write" are mutually exclusive. OK, color me blind! ;-) > > Was that the intent? (The implementation of brw_end_write() makes > > it look like it is in fact the intent.) > > Please look at 2/2. > > Multiple uprobe_register() or uprobe_unregister() can run at the > same time to install/remove the system-wide breakpoint, and > brw_start_write() is used to block dup_mmap() to avoid the race. > But they do not block each other. Ah, makes sense, thank you! Thanx, Paul ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH 1/2] brw_mutex: big read-write mutex 2012-10-16 18:58 ` Paul E. McKenney @ 2012-10-17 16:37 ` Oleg Nesterov 2012-10-17 22:28 ` Paul E. McKenney 0 siblings, 1 reply; 103+ messages in thread From: Oleg Nesterov @ 2012-10-17 16:37 UTC (permalink / raw) To: Paul E. McKenney Cc: Ingo Molnar, Linus Torvalds, Peter Zijlstra, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On 10/16, Paul E. McKenney wrote: > > On Tue, Oct 16, 2012 at 05:56:23PM +0200, Oleg Nesterov wrote: > > > > > > I believe that you need smp_mb() here. > > > > I don't understand why... > > > > > The wake_up_all()'s memory barriers > > > do not suffice because some other reader might have awakened the writer > > > between this_cpu_dec() and wake_up_all(). > > > > But __wake_up(q) takes q->lock? And the same lock is taken by > > prepare_to_wait(), so how can the writer miss the result of _dec? > > Suppose that the writer arrives and sees that the value of the counter > is zero, after synchronize_sched(). So there are no readers (but perhaps there are brw_end_read's in flight which already decremented read_ctr) > and thus never sleeps, and so is also not awakened? and why do we need wakeup in this case? > > > void brw_end_read(struct brw_mutex *brw) > > > { > > > if (unlikely(atomic_read(&brw->write_ctr))) { > > > smp_mb(); > > > this_cpu_dec(*brw->read_ctr); > > > wake_up_all(&brw->write_waitq); > > > > Hmm... still can't understand. > > > > It seems that this mb() is needed to ensure that brw_end_read() can't > > miss write_ctr != 0. > > > > But we do not care unless the writer already does wait_event(). And > > before it does wait_event() it calls synchronize_sched() after it sets > > write_ctr != 0. Doesn't this mean that after that any preempt-disabled > > section must see write_ctr != 0 ? > > > > This code actually checks write_ctr after preempt_disable + enable, > > but I think this doesn't matter? > > > > Paul, most probably I misunderstood you. 
Could you spell please? > > Let me try outlining the sequence of events that I am worried about... > > 1. Task A invokes brw_start_read(). There is no writer, so it > takes the fastpath. > > 2. Task B invokes brw_start_write(), atomically increments > &brw->write_ctr, and executes synchronize_sched(). > > 3. Task A invokes brw_end_read() and does this_cpu_dec(). OK. And to simplify this discussion, suppose that A invoked brw_start_read() on CPU_0 and thus incremented read_ctr[0], and then it migrates to CPU_1 and brw_end_read() uses read_ctr[1]. My understanding was, brw_start_write() must see read_ctr[0] == 1 after synchronize_sched(). > 4. Task B invokes wait_event(), which invokes brw_read_ctr() > and sees the result as zero. So my understanding is completely wrong? I thought that after synchronize_sched() we should see the result of any operation which were done inside the preempt-disable section. No? Hmm. Suppose that we have long A = B = STOP = 0, and void func(void) { preempt_disable(); if (!STOP) { A = 1; B = 1; } preempt_enable(); } Now, you are saying that this code STOP = 1; synchronize_sched(); BUG_ON(A != B); is not correct? (yes, yes, this example is not very good). The comment above synchronize_sched() says: return ... after all currently executing rcu-sched read-side critical sections have completed. But if this code is wrong, then what "completed" actually means? I thought that it also means "all memory operations have completed", but this is not true? Oleg. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH 1/2] brw_mutex: big read-write mutex 2012-10-17 16:37 ` Oleg Nesterov @ 2012-10-17 22:28 ` Paul E. McKenney 0 siblings, 0 replies; 103+ messages in thread From: Paul E. McKenney @ 2012-10-17 22:28 UTC (permalink / raw) To: Oleg Nesterov Cc: Ingo Molnar, Linus Torvalds, Peter Zijlstra, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On Wed, Oct 17, 2012 at 06:37:02PM +0200, Oleg Nesterov wrote: > On 10/16, Paul E. McKenney wrote: > > > > On Tue, Oct 16, 2012 at 05:56:23PM +0200, Oleg Nesterov wrote: > > > > > > > > I believe that you need smp_mb() here. > > > > > > I don't understand why... > > > > > > > The wake_up_all()'s memory barriers > > > > do not suffice because some other reader might have awakened the writer > > > > between this_cpu_dec() and wake_up_all(). > > > > > > But __wake_up(q) takes q->lock? And the same lock is taken by > > > prepare_to_wait(), so how can the writer miss the result of _dec? > > > > Suppose that the writer arrives and sees that the value of the counter > > is zero, > > after synchronize_sched(). So there are no readers (but perhaps there > are brw_end_read's in flight which already decremented read_ctr) But the preempt_disable() region only covers read acquisition. So synchronize_sched() waits only for all the brw_start_read() calls to reach the preempt_enable() -- it cannot wait for all the resulting readers to reach the corresponding brw_end_read(). > > and thus never sleeps, and so is also not awakened? > > and why do we need wakeup in this case? To get the memory barriers required to keep the critical sections ordered -- to ensure that everyone sees the reader's critical section as ending before the writer's critical section starts. > > > > void brw_end_read(struct brw_mutex *brw) > > > > { > > > > if (unlikely(atomic_read(&brw->write_ctr))) { > > > > smp_mb(); > > > > this_cpu_dec(*brw->read_ctr); > > > > wake_up_all(&brw->write_waitq); > > > > > > Hmm... still can't understand. 
> > > > > > It seems that this mb() is needed to ensure that brw_end_read() can't > > > miss write_ctr != 0. > > > > > > But we do not care unless the writer already does wait_event(). And > > > before it does wait_event() it calls synchronize_sched() after it sets > > > write_ctr != 0. Doesn't this mean that after that any preempt-disabled > > > section must see write_ctr != 0 ? > > > > > > This code actually checks write_ctr after preempt_disable + enable, > > > but I think this doesn't matter? > > > > > > Paul, most probably I misunderstood you. Could you spell please? > > > > Let me try outlining the sequence of events that I am worried about... > > > > 1. Task A invokes brw_start_read(). There is no writer, so it > > takes the fastpath. > > > > 2. Task B invokes brw_start_write(), atomically increments > > &brw->write_ctr, and executes synchronize_sched(). > > > > 3. Task A invokes brw_end_read() and does this_cpu_dec(). > > OK. And to simplify this discussion, suppose that A invoked > brw_start_read() on CPU_0 and thus incremented read_ctr[0], and > then it migrates to CPU_1 and brw_end_read() uses read_ctr[1]. > > My understanding was, brw_start_write() must see read_ctr[0] == 1 > after synchronize_sched(). Yep. But it makes absolutely no guarantee about ordering of the decrement of read_ctr[1]. > > 4. Task B invokes wait_event(), which invokes brw_read_ctr() > > and sees the result as zero. > > So my understanding is completely wrong? I thought that after > synchronize_sched() we should see the result of any operation > which were done inside the preempt-disable section. We should indeed. But the decrement of read_ctr[1] is not done within the preempt_disable() section, and the guarantee therefore does not apply to it. This means that there is no guarantee that Task A's read-side critical section will be ordered before Task B's write-side critical section. 
Now, maybe you don't need that guarantee, but if you don't, I am missing what exactly these primitives are doing for you. > No? > > Hmm. Suppose that we have long A = B = STOP = 0, and > > void func(void) > { > preempt_disable(); > if (!STOP) { > A = 1; > B = 1; > } > preempt_enable(); > } > > Now, you are saying that this code > > STOP = 1; > > synchronize_sched(); > > BUG_ON(A != B); > > is not correct? (yes, yes, this example is not very good). Yep. Assuming no other modifications to A and B, at the point of the BUG_ON(), we should have A==1 and B==1. The thing is that the preempt_disable() in your patch only covers brw_start_read(), but not brw_end_read(). So the decrement (along with the rest of the read-side critical section) is unordered with respect to the write-side critical section started by the brw_start_write(). > The comment above synchronize_sched() says: > > return ... after all currently executing > rcu-sched read-side critical sections have completed. > > But if this code is wrong, then what "completed" actually means? > I thought that it also means "all memory operations have completed", > but this is not true? From what I can see, your interpretation of synchronize_sched() is correct. The problem is that brw_end_read() isn't within the relevant rcu-sched read-side critical section. Or that I am confused.... Thanx, Paul > Oleg. > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH 1/2] brw_mutex: big read-write mutex 2012-10-15 19:10 ` [PATCH 1/2] brw_mutex: big read-write mutex Oleg Nesterov 2012-10-15 23:28 ` Paul E. McKenney @ 2012-10-16 19:56 ` Linus Torvalds 2012-10-17 16:59 ` Oleg Nesterov 1 sibling, 1 reply; 103+ messages in thread From: Linus Torvalds @ 2012-10-16 19:56 UTC (permalink / raw) To: Oleg Nesterov Cc: Ingo Molnar, Paul E. McKenney, Peter Zijlstra, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On Mon, Oct 15, 2012 at 12:10 PM, Oleg Nesterov <oleg@redhat.com> wrote: > This patch adds the new sleeping lock, brw_mutex. Unlike rw_semaphore > it allows multiple writers too, just "read" and "write" are mutually > exclusive. So those semantics just don't sound sane. It's also not what any kind of normal "rw" lock ever does. So can you explain why these particular insane semantics are useful, and what for? Linus ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH 1/2] brw_mutex: big read-write mutex 2012-10-16 19:56 ` Linus Torvalds @ 2012-10-17 16:59 ` Oleg Nesterov 2012-10-17 22:44 ` Paul E. McKenney 0 siblings, 1 reply; 103+ messages in thread From: Oleg Nesterov @ 2012-10-17 16:59 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, Paul E. McKenney, Peter Zijlstra, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On 10/16, Linus Torvalds wrote: > > On Mon, Oct 15, 2012 at 12:10 PM, Oleg Nesterov <oleg@redhat.com> wrote: > > This patch adds the new sleeping lock, brw_mutex. Unlike rw_semaphore > > it allows multiple writers too, just "read" and "write" are mutually > > exclusive. > > So those semantics just don't sound sane. It's also not what any kind > of normal "rw" lock ever does. Yes, this is not usual. And initially I made brw_sem which allows only 1 writer, but then I changed this patch. > So can you explain why these particular insane semantics are useful, > and what for? To allow multiple uprobe_register/unregister at the same time. Mostly to not add the "regression", currently this is possible. It is not that I think this is terribly important, but still. And personally I think that "multiple writers" is not necessarily insane in general. Suppose you have a complex object/subsystem, the readers can use a single brw_mutex to access it "lockless", start_read() is very cheap. But start_write() is slow. Multiple writers can use fine-grained locking inside the start_write/end_write section and do not block each other. Oleg. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH 1/2] brw_mutex: big read-write mutex 2012-10-17 16:59 ` Oleg Nesterov @ 2012-10-17 22:44 ` Paul E. McKenney 2012-10-18 16:24 ` Oleg Nesterov 0 siblings, 1 reply; 103+ messages in thread From: Paul E. McKenney @ 2012-10-17 22:44 UTC (permalink / raw) To: Oleg Nesterov Cc: Linus Torvalds, Ingo Molnar, Peter Zijlstra, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On Wed, Oct 17, 2012 at 06:59:02PM +0200, Oleg Nesterov wrote: > On 10/16, Linus Torvalds wrote: > > > > On Mon, Oct 15, 2012 at 12:10 PM, Oleg Nesterov <oleg@redhat.com> wrote: > > > This patch adds the new sleeping lock, brw_mutex. Unlike rw_semaphore > > > it allows multiple writers too, just "read" and "write" are mutually > > > exclusive. > > > > So those semantics just don't sound sane. It's also not what any kind > > of normal "rw" lock ever does. > > Yes, this is not usual. > > And initially I made brw_sem which allows only 1 writer, but then > I changed this patch. > > > So can you explain why these particular insane semantics are useful, > > and what for? > > To allow multiple uprobe_register/unregister at the same time. Mostly > to not add the "regression", currently this is possible. > > It is not that I think this is terribly important, but still. And > personally I think that "multiple writers" is not necessarily insane > in general. Suppose you have the complex object/subsystem, the readers > can use a single brw_mutex to access it "lockless", start_read() is > very cheap. > > But start_write() is slow. Multiple writes can use the fine-grained > inside the start_write/end_write section and do not block each other. Strangely enough, the old VAXCluster locking primitives allowed this sort of thing. The brw_start_read() would be a "protected read", and brw_start_write() would be a "concurrent write". Even more interesting, they gave the same advice you give -- concurrent writes should use fine-grained locking to protect the actual accesses. 
It seems like it should be possible to come up with better names, but I cannot think of any at the moment. Thanx, Paul PS. For the sufficiently masochistic, here is the exclusion table for the six VAXCluster locking modes:

         NL  CR  CW  PR  PW  EX
     NL
     CR                       X
     CW               X   X   X
     PR           X       X   X
     PW           X   X   X   X
     EX       X   X   X   X   X

"X" means that the pair of modes exclude each other, otherwise the lock may be held in both of the modes simultaneously. Modes:

     NL: Null, or "not held".
     CR: Concurrent read.
     CW: Concurrent write.
     PR: Protected read.
     PW: Protected write.
     EX: Exclusive.

A reader-writer lock could use protected read for readers and either of protected write or exclusive for writers, the difference between protected write and exclusive being irrelevant in the absence of concurrent readers. ^ permalink raw reply [flat|nested] 103+ messages in thread
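Paul's exclusion table can be transcribed into a small compatibility matrix and sanity-checked mechanically. This is a hypothetical transcription for illustration (true means the two modes exclude each other):

```c
#include <stdbool.h>

/* The six VAXCluster lock modes, in the order used in the table above */
enum vms_mode { NL, CR, CW, PR, PW, EX, NMODES };

/* excl[a][b] is true when holding mode a excludes mode b
 * (the "X" entries of Paul's table) */
static const bool excl[NMODES][NMODES] = {
    /*        NL     CR     CW     PR     PW     EX   */
    /*NL*/ { false, false, false, false, false, false },
    /*CR*/ { false, false, false, false, false, true  },
    /*CW*/ { false, false, false, true,  true,  true  },
    /*PR*/ { false, false, true,  false, true,  true  },
    /*PW*/ { false, false, true,  true,  true,  true  },
    /*EX*/ { false, true,  true,  true,  true,  true  },
};
```

Note how CW/CW is compatible: that is exactly the "multiple concurrent writers" shape of brw_mutex, with PR playing the role of the readers.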
* Re: [PATCH 1/2] brw_mutex: big read-write mutex 2012-10-17 22:44 ` Paul E. McKenney @ 2012-10-18 16:24 ` Oleg Nesterov 2012-10-18 16:38 ` Paul E. McKenney 0 siblings, 1 reply; 103+ messages in thread From: Oleg Nesterov @ 2012-10-18 16:24 UTC (permalink / raw) To: Paul E. McKenney Cc: Linus Torvalds, Ingo Molnar, Peter Zijlstra, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On 10/17, Paul E. McKenney wrote: > > On Wed, Oct 17, 2012 at 06:37:02PM +0200, Oleg Nesterov wrote: > > On 10/16, Paul E. McKenney wrote: > > > > > > Suppose that the writer arrives and sees that the value of the counter > > > is zero, > > > > after synchronize_sched(). So there are no readers (but perhaps there > > are brw_end_read's in flight which already decremented read_ctr) > > But the preempt_disable() region only covers read acquisition. So > synchronize_sched() waits only for all the brw_start_read() calls to > reach the preempt_enable() Yes. > -- it cannot wait for all the resulting > readers to reach the corresponding brw_end_read(). Indeed. > > > and thus never sleeps, and so is also not awakened? > > > > and why do we need wakeup in this case? > > To get the memory barriers required to keep the critical sections > ordered -- to ensure that everyone sees the reader's critical section > as ending before the writer's critical section starts. And now I am starting to think I misunderstood your concern from the very beginning. I thought that you meant that without mb() brw_start_write() can race with brw_end_read() and hang forever. But probably you meant that we need the barriers to ensure that, say, if the reader does

	brw_start_read();
	CONDITION = 1;
	brw_end_read();

then the writer must see CONDITION != 0 after brw_start_write() ? (or vice-versa) In this case we need the barrier, yes. Obviously brw_start_write() can return right after this_cpu_dec() and before wake_up_all(). 2/2 doesn't need this guarantee but I agree, this doesn't look sane in general...
Or did I misunderstand you again? Oleg. ^ permalink raw reply [flat|nested] 103+ messages in thread
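The ordering requirement in the CONDITION example above can be sketched with C11 atomics standing in for the kernel primitives: atomic_thread_fence() plays the role of smp_mb(), and the spin loop stands in for wait_event(). This is a userspace illustration, not the patch's code:

```c
#include <stdatomic.h>

static atomic_int read_ctr;   /* stands in for the per-CPU counter sum */
static int CONDITION;

/* Reader: the store to CONDITION must not be reordered past the
 * counter decrement, hence the release fence before the dec. */
static void reader(void)
{
    atomic_fetch_add_explicit(&read_ctr, 1, memory_order_relaxed);
    CONDITION = 1;                       /* inside the critical section */
    atomic_thread_fence(memory_order_release);
    atomic_fetch_sub_explicit(&read_ctr, 1, memory_order_relaxed);
}

/* Writer: once it observes read_ctr == 0 it must also observe
 * CONDITION == 1, hence the acquire fence after the wait. */
static int writer_sees_condition(void)
{
    while (atomic_load_explicit(&read_ctr, memory_order_relaxed))
        ;                                /* wait_event() stand-in */
    atomic_thread_fence(memory_order_acquire);
    return CONDITION;
}
```

Without the release/acquire pair, nothing stops the CONDITION store from drifting past the decrement, which is exactly the hazard being discussed.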
* Re: [PATCH 1/2] brw_mutex: big read-write mutex 2012-10-18 16:24 ` Oleg Nesterov @ 2012-10-18 16:38 ` Paul E. McKenney 2012-10-18 17:57 ` Oleg Nesterov 0 siblings, 1 reply; 103+ messages in thread From: Paul E. McKenney @ 2012-10-18 16:38 UTC (permalink / raw) To: Oleg Nesterov Cc: Linus Torvalds, Ingo Molnar, Peter Zijlstra, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On Thu, Oct 18, 2012 at 06:24:09PM +0200, Oleg Nesterov wrote: > On 10/17, Paul E. McKenney wrote: > > > > On Wed, Oct 17, 2012 at 06:37:02PM +0200, Oleg Nesterov wrote: > > > On 10/16, Paul E. McKenney wrote: > > > > > > > > Suppose that the writer arrives and sees that the value of the counter > > > > is zero, > > > > > > after synchronize_sched(). So there are no readers (but perhaps there > > > are brw_end_read's in flight which already decremented read_ctr) > > > > But the preempt_disable() region only covers read acquisition. So > > synchronize_sched() waits only for all the brw_start_read() calls to > > reach the preempt_enable() > > Yes. > > > -- it cannot wait for all the resulting > > readers to reach the corresponding brw_end_read(). > > Indeed. > > > > > and thus never sleeps, and so is also not awakened? > > > > > > and why do we need wakeup in this case? > > > > To get the memory barriers required to keep the critical sections > > ordered -- to ensure that everyone sees the reader's critical section > > as ending before the writer's critical section starts. > > And now I am starting to think I misunderstood your concern from > the very beginning. > > I thought that you meant that without mb() brw_start_write() can > race with brw_end_read() and hang forever. > > But probably you meant that we need the barriers to ensure that, > say, if the reader does > > brw_start_read(); > CONDITION = 1; > brw_end_read(); > > then the writer must see CONDITION != 0 after brw_start_write() ? > (or vice-versa) Yes, this is exactly my concern. 
> In this case we need the barrier, yes. Obviously brw_start_write() > can return right after this_cpu_dec() and before wake_up_all(). > > 2/2 doesn't need this guarantee but I agree, this doesn't look > sane in gerenal... Or name it something not containing "lock". And clearly document the behavior and how it is to be used. ;-) Otherwise, someone will get confused and introduce bugs. > Or I misunderstood you again? No, this was indeed my concern. Thanx, Paul ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH 1/2] brw_mutex: big read-write mutex 2012-10-18 16:38 ` Paul E. McKenney @ 2012-10-18 17:57 ` Oleg Nesterov 2012-10-18 19:28 ` Mikulas Patocka 2012-10-19 19:28 ` Paul E. McKenney 0 siblings, 2 replies; 103+ messages in thread From: Oleg Nesterov @ 2012-10-18 17:57 UTC (permalink / raw) To: Paul E. McKenney, Mikulas Patocka Cc: Linus Torvalds, Ingo Molnar, Peter Zijlstra, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On 10/18, Paul E. McKenney wrote: > > On Thu, Oct 18, 2012 at 06:24:09PM +0200, Oleg Nesterov wrote: > > > > I thought that you meant that without mb() brw_start_write() can > > race with brw_end_read() and hang forever. > > > > But probably you meant that we need the barriers to ensure that, > > say, if the reader does > > > > brw_start_read(); > > CONDITION = 1; > > brw_end_read(); > > > > then the writer must see CONDITION != 0 after brw_start_write() ? > > (or vice-versa) > > Yes, this is exactly my concern. Oh, thanks a lot, Paul (as always). > > In this case we need the barrier, yes. Obviously brw_start_write() > > can return right after this_cpu_dec() and before wake_up_all(). > > > > 2/2 doesn't need this guarantee but I agree, this doesn't look > > sane in gerenal... > > Or name it something not containing "lock". And clearly document > the behavior and how it is to be used. ;-) this would be insane, I guess ;) So. Ignoring the possible optimization you mentioned before, brw_end_read() should do:

	smp_mb();
	this_cpu_dec();
	wake_up_all();

And yes, we need the full mb(). wmb() is enough to ensure that the writer will see the memory modifications done by the reader. But we also need to ensure that any LOAD inside start_read/end_read cannot be moved outside of the critical section. But we should also ensure that "read" will see all modifications which were done under start_write/end_write.
This means that brw_end_write() needs another synchronize_sched() before atomic_dec_and_test(), or brw_start_read() needs mb() in the fast path. Correct? Ooooh. And I just noticed include/linux/percpu-rwsem.h which does something similar. Certainly it was not in my tree when I started this patch... percpu_down_write() doesn't allow multiple writers, but the main problem is that it uses msleep(1). It should not, I think. But. It seems that percpu_up_write() is equally wrong? Doesn't it need synchronize_rcu() before "p->locked = false" ? (add Mikulas) Oleg. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH 1/2] brw_mutex: big read-write mutex 2012-10-18 17:57 ` Oleg Nesterov @ 2012-10-18 19:28 ` Mikulas Patocka 2012-10-19 12:38 ` Peter Zijlstra 2012-10-19 19:28 ` Paul E. McKenney 1 sibling, 1 reply; 103+ messages in thread From: Mikulas Patocka @ 2012-10-18 19:28 UTC (permalink / raw) To: Oleg Nesterov Cc: Paul E. McKenney, Linus Torvalds, Ingo Molnar, Peter Zijlstra, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On Thu, 18 Oct 2012, Oleg Nesterov wrote: > Ooooh. And I just noticed include/linux/percpu-rwsem.h which does > something similar. Certainly it was not in my tree when I started > this patch... percpu_down_write() doesn't allow multiple writers, > but the main problem it uses msleep(1). It should not, I think. synchronize_rcu() can sleep for a hundred milliseconds, so msleep(1) is not a big problem. > But. It seems that percpu_up_write() is equally wrong? Doesn't > it need synchronize_rcu() before "p->locked = false" ? Yes, it does ... and I sent a patch for that to Linus. > (add Mikulas) > > Oleg. Mikulas ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH 1/2] brw_mutex: big read-write mutex 2012-10-18 19:28 ` Mikulas Patocka @ 2012-10-19 12:38 ` Peter Zijlstra 2012-10-19 15:32 ` Mikulas Patocka 0 siblings, 1 reply; 103+ messages in thread From: Peter Zijlstra @ 2012-10-19 12:38 UTC (permalink / raw) To: Mikulas Patocka Cc: Oleg Nesterov, Paul E. McKenney, Linus Torvalds, Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel, Thomas Gleixner On Thu, 2012-10-18 at 15:28 -0400, Mikulas Patocka wrote: > > On Thu, 18 Oct 2012, Oleg Nesterov wrote: > > > Ooooh. And I just noticed include/linux/percpu-rwsem.h which does > > something similar. Certainly it was not in my tree when I started > > this patch... percpu_down_write() doesn't allow multiple writers, > > but the main problem it uses msleep(1). It should not, I think. > > synchronize_rcu() can sleep for hundred milliseconds, so msleep(1) is not > a big problem. That code is beyond ugly though... it should really not have been merged. There's absolutely no reason for it to use RCU except to make it more complicated. And as Oleg pointed out, that msleep() is very ill-considered. The very worst part of it seems to be that nobody who's usually involved with locking primitives was ever consulted (Linus, PaulMck, Oleg, Ingo, tglx, dhowells and me). It doesn't even have lockdep annotations :/ So the only reason you appear to use RCU is because you don't actually have a sane way to wait for count==0. And I'm contesting that rcu_sync() is sane here -- for the very simple reason that you still need a while (count) loop right after it. So it appears you want a totally reader-biased, sleepable rw-lock-like thing? So did you consider keeping the inc/dec on the same per-cpu variable? Yes, this adds a potential remote access to dec and requires you to use atomics, but I would not be surprised if the inc and dec were mostly on the same cpu most of the time -- which might be plenty fast for what you want.
If you've got coherent per-cpu counts, you can better do the waitqueue/wake condition for write_down. It might also make sense to do away with the mutex; there's no point in serializing the wakeups in the p->locked case of down_read. Furthermore, p->locked seems a complete duplicate of the mutex state, so removing the mutex also removes that duplication. Also, that CONFIG_x86 thing.. *shudder*... ^ permalink raw reply [flat|nested] 103+ messages in thread
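Peter's point about waiting for count == 0 on a waitqueue rather than polling can be sketched in userspace, with a pthread condition variable standing in for the kernel waitqueue and wake_up_all(). Illustrative only; the names are invented:

```c
#include <pthread.h>

/* A counter a writer can wait on without polling: the last
 * reader to drop the count broadcasts, like wake_up_all(). */
struct drain {
    pthread_mutex_t lock;
    pthread_cond_t  drained;
    int count;
};

static void drain_get(struct drain *d)
{
    pthread_mutex_lock(&d->lock);
    d->count++;
    pthread_mutex_unlock(&d->lock);
}

static void drain_put(struct drain *d)
{
    pthread_mutex_lock(&d->lock);
    if (--d->count == 0)
        pthread_cond_broadcast(&d->drained);   /* wake_up_all() stand-in */
    pthread_mutex_unlock(&d->lock);
}

static void drain_wait(struct drain *d)
{
    pthread_mutex_lock(&d->lock);
    while (d->count)                           /* no msleep(1) polling */
        pthread_cond_wait(&d->drained, &d->lock);
    pthread_mutex_unlock(&d->lock);
}
```

The kernel equivalent would be wait_event()/wake_up_all() keyed on the counter sum, which is what the brw_mutex patch in this thread does.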
* Re: [PATCH 1/2] brw_mutex: big read-write mutex 2012-10-19 12:38 ` Peter Zijlstra @ 2012-10-19 15:32 ` Mikulas Patocka 2012-10-19 17:40 ` Peter Zijlstra 2012-10-19 17:49 ` Oleg Nesterov 0 siblings, 2 replies; 103+ messages in thread From: Mikulas Patocka @ 2012-10-19 15:32 UTC (permalink / raw) To: Peter Zijlstra Cc: Oleg Nesterov, Paul E. McKenney, Linus Torvalds, Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel, Thomas Gleixner On Fri, 19 Oct 2012, Peter Zijlstra wrote: > On Thu, 2012-10-18 at 15:28 -0400, Mikulas Patocka wrote: > > > > On Thu, 18 Oct 2012, Oleg Nesterov wrote: > > > > > Ooooh. And I just noticed include/linux/percpu-rwsem.h which does > > > something similar. Certainly it was not in my tree when I started > > > this patch... percpu_down_write() doesn't allow multiple writers, > > > but the main problem it uses msleep(1). It should not, I think. > > > > synchronize_rcu() can sleep for hundred milliseconds, so msleep(1) is not > > a big problem. > > That code is beyond ugly though.. it should really not have been merged. > > There's absolutely no reason for it to use RCU except to make it more So if you can do an alternative implementation without RCU, show it. The goal is - there should be no LOCK instructions on the read path and as few barriers as possible. > complicated. And as Oleg pointed out that msleep() is very ill > considered. > > The very worst part of it seems to be that nobody who's usually involved > with locking primitives was ever consulted (Linus, PaulMck, Oleg, Ingo, > tglx, dhowells and me). It doesn't even have lockdep annotations :/ > > So the only reason you appear to use RCU is because you don't actually > have a sane way to wait for count==0. And I'm contesting rcu_sync() is > sane here -- for the very simple reason you still need while (count) > loop right after it. > > So it appears you want an totally reader biased, sleepable rw-lock like > thing? Yes. 
> So did you consider keeping the inc/dec on the same per-cpu variable? > Yes this adds a potential remote access to dec and requires you to use > atomics, but I would not be surprised if the inc/dec were mostly on the > same cpu most of the times -- which might be plenty fast for what you > want. Yes, I tried this approach - it involves doing a LOCK instruction on read lock, remembering the cpu and doing another LOCK instruction on read unlock (which will hopefully be on the same CPU, so no cacheline bouncing happens in the common case). It was slower than the approach without any LOCK instructions (43.3 seconds for the implementation with per-cpu LOCKed access, 42.7 seconds for this implementation without atomic instructions; the benchmark involved doing 512-byte direct-io reads and writes on a ramdisk with 8 processes on an 8-core machine). > If you've got coherent per-cpu counts, you can better do the > waitqueue/wake condition for write_down. synchronize_rcu() is way slower than msleep(1) - so I don't see a reason why it should be complicated to avoid msleep(1). > It might also make sense to do away with the mutex, there's no point in > serializing the wakeups in the p->locked case of down_read. The advantage of a mutex is that it is already protected against starvation. If I replace the mutex with a wait queue and retry, there is no starvation protection. > Furthermore, > p->locked seems a complete duplicate of the mutex state, so removing the > mutex also removes that duplication. We could replace if (p->locked) with if (mutex_is_locked(p->mtx)) > Also, that CONFIG_x86 thing.. *shudder*... Mikulas ^ permalink raw reply [flat|nested] 103+ messages in thread
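The alternative Mikulas benchmarked - an atomic per-cpu counter, remembering the cpu at lock time and decrementing that same counter at unlock - might look roughly like this. C11 atomics stand in for the LOCK-prefixed instructions; the names are invented for illustration:

```c
#include <stdatomic.h>

#define NR_CPUS 8

struct percpu_atomic_rwsem {
    atomic_int cpu_ctr[NR_CPUS];   /* one counter per cpu */
};

/* Read lock: atomic inc on this cpu's counter; the caller remembers
 * which cpu it used and passes it back to the unlock. */
static int pa_down_read(struct percpu_atomic_rwsem *s, int this_cpu)
{
    atomic_fetch_add(&s->cpu_ctr[this_cpu], 1);   /* LOCK inc */
    return this_cpu;
}

/* Usually still on the same cpu, so no cacheline bounce in the
 * common case - but a remote atomic dec when the task migrated. */
static void pa_up_read(struct percpu_atomic_rwsem *s, int cpu)
{
    atomic_fetch_sub(&s->cpu_ctr[cpu], 1);        /* LOCK dec */
}

/* The writer waits for this sum to drop to zero. */
static int pa_readers(struct percpu_atomic_rwsem *s)
{
    int sum = 0;
    for (int i = 0; i < NR_CPUS; i++)
        sum += atomic_load(&s->cpu_ctr[i]);
    return sum;
}
```

The benchmark numbers quoted above are the cost of those two atomic operations relative to the LOCK-free per-cpu scheme that was merged.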
* Re: [PATCH 1/2] brw_mutex: big read-write mutex 2012-10-19 15:32 ` Mikulas Patocka @ 2012-10-19 17:40 ` Peter Zijlstra 2012-10-19 17:57 ` Oleg Nesterov 2012-10-19 22:54 ` Mikulas Patocka 2012-10-19 17:49 ` Oleg Nesterov 1 sibling, 2 replies; 103+ messages in thread From: Peter Zijlstra @ 2012-10-19 17:40 UTC (permalink / raw) To: Mikulas Patocka Cc: Oleg Nesterov, Paul E. McKenney, Linus Torvalds, Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel, Thomas Gleixner On Fri, 2012-10-19 at 11:32 -0400, Mikulas Patocka wrote: > So if you can do an alternative implementation without RCU, show it. Uhm,,. no that's not how it works. You just don't push through crap like this and then demand someone else does it better. But using preempt_{disable,enable} and using synchronize_sched() would be better (for PREEMPT_RCU) although it wouldn't fix anything fundamental. > The > goal is - there should be no LOCK instructions on the read path and as > few barriers as possible. Fine goal, although somewhat arch specific. Also note that there's a relation between atomics and memory barriers, one isn't necessarily worse than the other, they all require synchronization of sorts. > > So did you consider keeping the inc/dec on the same per-cpu variable? > > Yes this adds a potential remote access to dec and requires you to use > > atomics, but I would not be surprised if the inc/dec were mostly on the > > same cpu most of the times -- which might be plenty fast for what you > > want. > > Yes, I tried this approach - it involves doing LOCK instruction on read > lock, remembering the cpu and doing another LOCK instruction on read > unlock (which will hopefully be on the same CPU, so no cacheline bouncing > happens in the common case). 
> It was slower than the approach without any > LOCK instructions (43.3 seconds seconds for the implementation with > per-cpu LOCKed access, 42.7 seconds for this implementation without atomic > instruction; the benchmark involved doing 512-byte direct-io reads and > writes on a ramdisk with 8 processes on 8-core machine). So why is that a problem? Surely that's already tons better than what you've currently got. Also, uncontended LOCK is something all x86 vendors keep optimizing; they'll have to if they want to keep adding CPUs. > > If you've got coherent per-cpu counts, you can better do the > > waitqueue/wake condition for write_down. > > synchronize_rcu() is way slower than msleep(1) - so I don't see a reason > > why should it be complicated to avoid msleep(1). It's not about being slow; a polling write side is just fscking ugly. Also, if you're already polling, that *_sync() is bloody pointless. > > It might also make sense to do away with the mutex, there's no point in > > serializing the wakeups in the p->locked case of down_read. > > The advantage of a mutex is that it is already protected against > starvation. If I replace the mutex with a wait queue and retry, there is > no starvation protection. Which starvation? Writer-writer order? What stops you from adding a list there yourself? Also, writers had better be rare for this thing, so who gives a crap? > > Furthermore, > > p->locked seems a complete duplicate of the mutex state, so removing the > > mutex also removes that duplication. > > We could replace if (p->locked) with if (mutex_is_locked(p->mtx)) Quite so. You're also still lacking lockdep annotations... ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH 1/2] brw_mutex: big read-write mutex 2012-10-19 17:40 ` Peter Zijlstra @ 2012-10-19 17:57 ` Oleg Nesterov 2012-10-19 22:54 ` Mikulas Patocka 1 sibling, 0 replies; 103+ messages in thread From: Oleg Nesterov @ 2012-10-19 17:57 UTC (permalink / raw) To: Peter Zijlstra Cc: Mikulas Patocka, Paul E. McKenney, Linus Torvalds, Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel, Thomas Gleixner On 10/19, Peter Zijlstra wrote: > > But using preempt_{disable,enable} and using synchronize_sched() would > be better (for PREEMPT_RCU) although it wouldn't fix anything > fundamental. BTW, I agree. I didn't even notice percpu-rwsem.h uses _rcu, not _sched. > Fine goal, although somewhat arch specific. Also note that there's a > relation between atomics and memory barriers, one isn't necessarily > worse than the other, they all require synchronization of sorts. As Paul pointed out, the fast path can avoid mb(). It is only needed when "up_read" detects the writer. Oleg. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH 1/2] brw_mutex: big read-write mutex 2012-10-19 17:40 ` Peter Zijlstra 2012-10-19 17:57 ` Oleg Nesterov @ 2012-10-19 22:54 ` Mikulas Patocka 2012-10-24 3:08 ` Dave Chinner 1 sibling, 1 reply; 103+ messages in thread From: Mikulas Patocka @ 2012-10-19 22:54 UTC (permalink / raw) To: Peter Zijlstra Cc: Oleg Nesterov, Paul E. McKenney, Linus Torvalds, Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel, Thomas Gleixner On Fri, 19 Oct 2012, Peter Zijlstra wrote: > > Yes, I tried this approach - it involves doing LOCK instruction on read > > lock, remembering the cpu and doing another LOCK instruction on read > > unlock (which will hopefully be on the same CPU, so no cacheline bouncing > > happens in the common case). It was slower than the approach without any > > LOCK instructions (43.3 seconds seconds for the implementation with > > per-cpu LOCKed access, 42.7 seconds for this implementation without atomic > > instruction; the benchmark involved doing 512-byte direct-io reads and > > writes on a ramdisk with 8 processes on 8-core machine). > > So why is that a problem? Surely that's already tons better then what > you've currently got. Percpu rw-semaphores do not improve performance at all. I put them there to avoid a performance regression, not to improve performance. All Linux kernels have a race condition: when you change the block size of a block device while reading or writing the device at the same time, a crash may happen. This bug has been there since forever. Recently, it started to cause major trouble - multiple high-profile business sites have reported crashes because of this race condition. You can fix this race by using a read lock around the I/O paths and a write lock around block size changing, but a normal rw semaphore causes cache line bouncing when taken for read by multiple processors, and the I/O performance degradation it causes is measurable.
So I put this percpu-rw-semaphore there to fix the crashes and minimize the performance impact - on x86 it doesn't take any interlocked instructions in the read path. I don't quite understand why people are opposing this and what they want to do instead. If you pull percpu-rw-semaphores out of the kernel, you introduce a performance regression (raw device i/o will be slower on 3.7 than on 3.6, because on 3.6 it doesn't take any lock at all and on 3.7 it takes a read lock). So you have these options:

1) don't lock i/o just like on 3.6 and previous versions - you get a fast kernel that randomly crashes

2) lock i/o with a normal rw semaphore - you get a kernel that doesn't crash, but that is slower than previous versions

3) lock i/o with a percpu rw semaphore - you get a kernel that is almost as fast as previous kernels and that doesn't crash

For the users, option 3) is the best. The users don't care whether it looks ugly or not; they care about correctness and performance, that's all. Obviously, you can improve the rw semaphores by adding lockdep annotations, or by other things (turning rcu_read_lock/synchronize_rcu into preempt_disable/synchronize_sched, using barrier()-synchronize_sched() instead of smp_mb()...), but I don't see a reason why you would want to hurt users' experience by pulling it out, reverting to state 1) or 2), and then, two kernel cycles later, coming up with percpu-rw-semaphores again. Mikulas ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH 1/2] brw_mutex: big read-write mutex 2012-10-19 22:54 ` Mikulas Patocka @ 2012-10-24 3:08 ` Dave Chinner 2012-10-25 14:09 ` Mikulas Patocka 0 siblings, 1 reply; 103+ messages in thread From: Dave Chinner @ 2012-10-24 3:08 UTC (permalink / raw) To: Mikulas Patocka Cc: Peter Zijlstra, Oleg Nesterov, Paul E. McKenney, Linus Torvalds, Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel, Thomas Gleixner On Fri, Oct 19, 2012 at 06:54:41PM -0400, Mikulas Patocka wrote: > > > On Fri, 19 Oct 2012, Peter Zijlstra wrote: > > > > Yes, I tried this approach - it involves doing LOCK instruction on read > > > lock, remembering the cpu and doing another LOCK instruction on read > > > unlock (which will hopefully be on the same CPU, so no cacheline bouncing > > > happens in the common case). It was slower than the approach without any > > > LOCK instructions (43.3 seconds seconds for the implementation with > > > per-cpu LOCKed access, 42.7 seconds for this implementation without atomic > > > instruction; the benchmark involved doing 512-byte direct-io reads and > > > writes on a ramdisk with 8 processes on 8-core machine). > > > > So why is that a problem? Surely that's already tons better then what > > you've currently got. > > Percpu rw-semaphores do not improve performance at all. I put them there > to avoid performance regression, not to improve performance. > > All Linux kernels have a race condition - when you change block size of a > block device and you read or write the device at the same time, a crash > may happen. This bug is there since ever. Recently, this bug started to > cause major trouble - multiple high profile business sites report crashes > because of this race condition. 
> > You can fix this race by using a read lock around I/O paths and write lock > around block size changing, but normal rw semaphore cause cache line > bouncing when taken for read by multiple processors and I/O performance > degradation because of it is measurable. This doesn't sound like a new problem. Hasn't this global access, single modifier exclusion problem been solved before in the VFS? e.g. mnt_want_write()/mnt_make_readonly() Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH 1/2] brw_mutex: big read-write mutex 2012-10-24 3:08 ` Dave Chinner @ 2012-10-25 14:09 ` Mikulas Patocka 2012-10-25 23:40 ` Dave Chinner 0 siblings, 1 reply; 103+ messages in thread From: Mikulas Patocka @ 2012-10-25 14:09 UTC (permalink / raw) To: Dave Chinner Cc: Peter Zijlstra, Oleg Nesterov, Paul E. McKenney, Linus Torvalds, Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel, Thomas Gleixner On Wed, 24 Oct 2012, Dave Chinner wrote: > On Fri, Oct 19, 2012 at 06:54:41PM -0400, Mikulas Patocka wrote: > > > > > > On Fri, 19 Oct 2012, Peter Zijlstra wrote: > > > > > > Yes, I tried this approach - it involves doing LOCK instruction on read > > > > lock, remembering the cpu and doing another LOCK instruction on read > > > > unlock (which will hopefully be on the same CPU, so no cacheline bouncing > > > > happens in the common case). It was slower than the approach without any > > > > LOCK instructions (43.3 seconds seconds for the implementation with > > > > per-cpu LOCKed access, 42.7 seconds for this implementation without atomic > > > > instruction; the benchmark involved doing 512-byte direct-io reads and > > > > writes on a ramdisk with 8 processes on 8-core machine). > > > > > > So why is that a problem? Surely that's already tons better then what > > > you've currently got. > > > > Percpu rw-semaphores do not improve performance at all. I put them there > > to avoid performance regression, not to improve performance. > > > > All Linux kernels have a race condition - when you change block size of a > > block device and you read or write the device at the same time, a crash > > may happen. This bug is there since ever. Recently, this bug started to > > cause major trouble - multiple high profile business sites report crashes > > because of this race condition. 
> > > > You can fix this race by using a read lock around I/O paths and write lock > > around block size changing, but normal rw semaphore cause cache line > > bouncing when taken for read by multiple processors and I/O performance > > degradation because of it is measurable. > > This doesn't sound like a new problem. Hasn't this global access, > single modifier exclusion problem been solved before in the VFS? > e.g. mnt_want_write()/mnt_make_readonly() > > Cheers, > > Dave. Yes, mnt_want_write()/mnt_make_readonly() do the same thing as percpu rw semaphores. I think you can convert mnt_want_write()/mnt_make_readonly() to use percpu rw semaphores and remove the duplicated code. Mikulas ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH 1/2] brw_mutex: big read-write mutex 2012-10-25 14:09 ` Mikulas Patocka @ 2012-10-25 23:40 ` Dave Chinner 2012-10-26 12:06 ` Oleg Nesterov 0 siblings, 1 reply; 103+ messages in thread From: Dave Chinner @ 2012-10-25 23:40 UTC (permalink / raw) To: Mikulas Patocka Cc: Peter Zijlstra, Oleg Nesterov, Paul E. McKenney, Linus Torvalds, Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel, Thomas Gleixner On Thu, Oct 25, 2012 at 10:09:31AM -0400, Mikulas Patocka wrote: > > > On Wed, 24 Oct 2012, Dave Chinner wrote: > > > On Fri, Oct 19, 2012 at 06:54:41PM -0400, Mikulas Patocka wrote: > > > > > > > > > On Fri, 19 Oct 2012, Peter Zijlstra wrote: > > > > > > > > Yes, I tried this approach - it involves doing LOCK instruction on read > > > > > lock, remembering the cpu and doing another LOCK instruction on read > > > > > unlock (which will hopefully be on the same CPU, so no cacheline bouncing > > > > > happens in the common case). It was slower than the approach without any > > > > > LOCK instructions (43.3 seconds seconds for the implementation with > > > > > per-cpu LOCKed access, 42.7 seconds for this implementation without atomic > > > > > instruction; the benchmark involved doing 512-byte direct-io reads and > > > > > writes on a ramdisk with 8 processes on 8-core machine). > > > > > > > > So why is that a problem? Surely that's already tons better then what > > > > you've currently got. > > > > > > Percpu rw-semaphores do not improve performance at all. I put them there > > > to avoid performance regression, not to improve performance. > > > > > > All Linux kernels have a race condition - when you change block size of a > > > block device and you read or write the device at the same time, a crash > > > may happen. This bug is there since ever. Recently, this bug started to > > > cause major trouble - multiple high profile business sites report crashes > > > because of this race condition. 
> > > > > > You can fix this race by using a read lock around I/O paths and write lock > > > around block size changing, but normal rw semaphore cause cache line > > > bouncing when taken for read by multiple processors and I/O performance > > > degradation because of it is measurable. > > > > This doesn't sound like a new problem. Hasn't this global access, > > single modifier exclusion problem been solved before in the VFS? > > e.g. mnt_want_write()/mnt_make_readonly() > > > > Cheers, > > > > Dave. > > Yes, mnt_want_write()/mnt_make_readonly() do the same thing as percpu rw > semaphores. I think you can convert mnt_want_write()/mnt_make_readonly() > to use percpu rw semaphores and remove the duplicated code. I think you misunderstood my point - that rather than re-inventing the wheel, why didn't you just copy something that is known to work? Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH 1/2] brw_mutex: big read-write mutex 2012-10-25 23:40 ` Dave Chinner @ 2012-10-26 12:06 ` Oleg Nesterov 2012-10-26 13:22 ` Mikulas Patocka 0 siblings, 1 reply; 103+ messages in thread From: Oleg Nesterov @ 2012-10-26 12:06 UTC (permalink / raw) To: Dave Chinner Cc: Mikulas Patocka, Peter Zijlstra, Paul E. McKenney, Linus Torvalds, Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel, Thomas Gleixner On 10/26, Dave Chinner wrote: > > On Thu, Oct 25, 2012 at 10:09:31AM -0400, Mikulas Patocka wrote: > > > > Yes, mnt_want_write()/mnt_make_readonly() do the same thing as percpu rw > > semaphores. I think you can convert mnt_want_write()/mnt_make_readonly() > > to use percpu rw semaphores and remove the duplicated code. > > I think you misunderstood my point - that rather than re-inventing > the wheel, why didn't you just copy something that is known to > work? I don't understand why you both think that __mnt_want_write() and mnt_make_readonly() provide the same functionality. I looked at this code before I started this patch, and unless I completely misread it, this does very different things. It is not a "lock" at all. Oleg. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH 1/2] brw_mutex: big read-write mutex 2012-10-26 12:06 ` Oleg Nesterov @ 2012-10-26 13:22 ` Mikulas Patocka 2012-10-26 14:12 ` Oleg Nesterov 0 siblings, 1 reply; 103+ messages in thread From: Mikulas Patocka @ 2012-10-26 13:22 UTC (permalink / raw) To: Oleg Nesterov Cc: Dave Chinner, Peter Zijlstra, Paul E. McKenney, Linus Torvalds, Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel, Thomas Gleixner On Fri, 26 Oct 2012, Oleg Nesterov wrote: > On 10/26, Dave Chinner wrote: > > > > On Thu, Oct 25, 2012 at 10:09:31AM -0400, Mikulas Patocka wrote: > > > > > > Yes, mnt_want_write()/mnt_make_readonly() do the same thing as percpu rw > > > semaphores. I think you can convert mnt_want_write()/mnt_make_readonly() > > > to use percpu rw semaphores and remove the duplicated code. > > > > I think you misunderstood my point - rather than re-inventing > > the wheel, why didn't you just copy something that is known to > > work? I didn't know about it. The code is not reusable, and it doesn't really do locking. And it has two barriers on the read path, while percpu rw semaphores have none. > I don't understand why you both think that __mnt_want_write() > and mnt_make_readonly() provide the same functionality. I looked > at this code before I started this patch, and unless I completely > misread it, this does very different things. It is not a "lock" at all. > > Oleg. mnt_want_write uses a percpu array of counters, just like percpu rw semaphores. The code is different, but it can be changed to use percpu rw semaphores (if we add percpu_down_write_trylock). __mnt_want_write could call percpu_down_read and check if it is readonly (if it is, drop the lock and return -EROFS). __mnt_drop_write could call percpu_up_read. mnt_make_readonly and sb_prepare_remount_readonly could call percpu_down_write_trylock instead of mnt_get_writers (if they get the write lock, set it to readonly and drop the write lock) ... 
and that's it; then you can remove MNT_WRITE_HOLD, the barriers, spinning and other complexity from fs/namespace.c. Mikulas ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH 1/2] brw_mutex: big read-write mutex 2012-10-26 13:22 ` Mikulas Patocka @ 2012-10-26 14:12 ` Oleg Nesterov 2012-10-26 15:23 ` mark_files_ro && sb_end_write Oleg Nesterov 2012-10-26 16:09 ` [PATCH 1/2] brw_mutex: big read-write mutex Mikulas Patocka 0 siblings, 2 replies; 103+ messages in thread From: Oleg Nesterov @ 2012-10-26 14:12 UTC (permalink / raw) To: Mikulas Patocka Cc: Dave Chinner, Peter Zijlstra, Paul E. McKenney, Linus Torvalds, Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel, Thomas Gleixner On 10/26, Mikulas Patocka wrote: > > On Fri, 26 Oct 2012, Oleg Nesterov wrote: > > I didn't know about it. The code is not reusable, and it doesn't really do > locking. That was my main point. As for changing fs/namespace.c to use percpu_rwsem, I am not sure it is that simple or even worthwhile, but I won't argue; I do not pretend I understand this code. > > I don't understand why you both think that __mnt_want_write() > > and mnt_make_readonly() provide the same functionality. I looked > > at this code before I started this patch, and unless I completely > > misread it, this does very different things. It is not a "lock" at all. > > > > Oleg. > > mnt_want_write uses a percpu array of counters, just like percpu rw semaphores. and this is all imo ;) > The code is different, but it can be changed to use percpu rw semaphores > (if we add percpu_down_write_trylock). I don't really understand how you can make percpu_down_write_trylock() atomic so that it can be called under br_write_lock(vfsmount_lock) in sb_prepare_remount_readonly(). So I guess you also need to replace vfsmount_lock at least. Or _trylock needs the barriers in _down_read. Or I missed something. Oleg. ^ permalink raw reply [flat|nested] 103+ messages in thread
* mark_files_ro && sb_end_write 2012-10-26 14:12 ` Oleg Nesterov @ 2012-10-26 15:23 ` Oleg Nesterov 2012-10-26 16:09 ` [PATCH 1/2] brw_mutex: big read-write mutex Mikulas Patocka 1 sibling, 0 replies; 103+ messages in thread From: Oleg Nesterov @ 2012-10-26 15:23 UTC (permalink / raw) To: Mikulas Patocka, Al Viro Cc: Dave Chinner, Peter Zijlstra, Paul E. McKenney, Linus Torvalds, Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel, Thomas Gleixner On 10/26, Oleg Nesterov wrote: > > As for changing fs/namespace.c to use percpu_rwsem, I am not sure > it is that simple or even worthwhile, but I won't argue; I do not > pretend I understand this code. BTW, speaking about these counters... Is mark_files_ro()->mnt_drop_write_file() properly balanced? __mnt_drop_write() looks fine, but sb_end_write? Oleg. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH 1/2] brw_mutex: big read-write mutex 2012-10-26 14:12 ` Oleg Nesterov 2012-10-26 15:23 ` mark_files_ro && sb_end_write Oleg Nesterov @ 2012-10-26 16:09 ` Mikulas Patocka 1 sibling, 0 replies; 103+ messages in thread From: Mikulas Patocka @ 2012-10-26 16:09 UTC (permalink / raw) To: Oleg Nesterov Cc: Dave Chinner, Peter Zijlstra, Paul E. McKenney, Linus Torvalds, Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel, Thomas Gleixner On Fri, 26 Oct 2012, Oleg Nesterov wrote: > > The code is different, but it can be changed to use percpu rw semaphores > > (if we add percpu_down_write_trylock). > > I don't really understand how you can make percpu_down_write_trylock() > atomic so that it can be called under br_write_lock(vfsmount_lock) in > sb_prepare_remount_readonly(). So I guess you also need to replace > vfsmount_lock at least. Or _trylock needs the barriers in _down_read. > Or I missed something. > > Oleg. That's true - that code is under spinlock and you can't implement non-blocking percpu_down_write_trylock. Mikulas ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH 1/2] brw_mutex: big read-write mutex 2012-10-19 15:32 ` Mikulas Patocka 2012-10-19 17:40 ` Peter Zijlstra @ 2012-10-19 17:49 ` Oleg Nesterov 2012-10-22 23:09 ` Mikulas Patocka 1 sibling, 1 reply; 103+ messages in thread From: Oleg Nesterov @ 2012-10-19 17:49 UTC (permalink / raw) To: Mikulas Patocka Cc: Peter Zijlstra, Paul E. McKenney, Linus Torvalds, Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel, Thomas Gleixner On 10/19, Mikulas Patocka wrote: > > synchronize_rcu() is way slower than msleep(1) - This depends, I guess. But this doesn't matter, > so I don't see a reason > why should it be complicated to avoid msleep(1). I don't think this really needs complications. Please look at this patch for example. Or initial (single writer) version below. It is not finished and lacks the barriers too, but I do not think it is more complex. Oleg. struct brw_sem { long __percpu *read_ctr; wait_queue_head_t read_waitq; struct mutex writer_mutex; struct task_struct *writer; }; int brw_init(struct brw_sem *brw) { brw->writer = NULL; mutex_init(&brw->writer_mutex); init_waitqueue_head(&brw->read_waitq); brw->read_ctr = alloc_percpu(long); return brw->read_ctr ? 
0 : -ENOMEM; } void brw_down_read(struct brw_sem *brw) { for (;;) { bool done = false; preempt_disable(); if (likely(!brw->writer)) { __this_cpu_inc(*brw->read_ctr); done = true; } preempt_enable(); if (likely(done)) break; __wait_event(brw->read_waitq, !brw->writer); } } void brw_up_read(struct brw_sem *brw) { struct task_struct *writer; preempt_disable(); __this_cpu_dec(*brw->read_ctr); writer = ACCESS_ONCE(brw->writer); if (unlikely(writer)) wake_up_process(writer); preempt_enable(); } static inline long brw_read_ctr(struct brw_sem *brw) { long sum = 0; int cpu; for_each_possible_cpu(cpu) sum += per_cpu(*brw->read_ctr, cpu); return sum; } void brw_down_write(struct brw_sem *brw) { mutex_lock(&brw->writer_mutex); brw->writer = current; synchronize_sched(); /* * Thereafter brw_*_read() must see ->writer != NULL, * and we should see the result of __this_cpu_inc(). */ for (;;) { set_current_state(TASK_UNINTERRUPTIBLE); if (brw_read_ctr(brw) == 0) break; schedule(); } __set_current_state(TASK_RUNNING); /* * We can add another synchronize_sched() to avoid the * spurious wakeups from brw_up_read() after return. */ } void brw_up_write(struct brw_sem *brw) { brw->writer = NULL; synchronize_sched(); wake_up_all(&brw->read_waitq); mutex_unlock(&brw->writer_mutex); } ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH 1/2] brw_mutex: big read-write mutex 2012-10-19 17:49 ` Oleg Nesterov @ 2012-10-22 23:09 ` Mikulas Patocka 2012-10-23 15:12 ` Oleg Nesterov 0 siblings, 1 reply; 103+ messages in thread From: Mikulas Patocka @ 2012-10-22 23:09 UTC (permalink / raw) To: Oleg Nesterov Cc: Peter Zijlstra, Paul E. McKenney, Linus Torvalds, Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel, Thomas Gleixner On Fri, 19 Oct 2012, Oleg Nesterov wrote: > On 10/19, Mikulas Patocka wrote: > > > > synchronize_rcu() is way slower than msleep(1) - > > This depends, I guess. But this doesn't matter, > > > so I don't see a reason > > why should it be complicated to avoid msleep(1). > > I don't think this really needs complications. Please look at this > patch for example. Or initial (single writer) version below. It is > not finished and lacks the barriers too, but I do not think it is > more complex. Hi My implementation has a smaller structure (it doesn't have wait_queue_head_t). Using preempt_disable()/synchronize_sched() instead of RCU seems like a good idea. Here, the locked region is so small that it doesn't make sense to play tricks with preemptible RCU. Your implementation is prone to starvation - if the writer has a high priority and if it is doing back-to-back write unlocks/locks, it may happen that the readers have no chance to run. The use of a mutex instead of a wait queue in my implementation is unusual, but I don't see anything wrong with it - it makes the structure smaller and it solves the starvation problem (which would otherwise be complicated to solve). Mikulas > Oleg. > > struct brw_sem { > long __percpu *read_ctr; > wait_queue_head_t read_waitq; > struct mutex writer_mutex; > struct task_struct *writer; > }; > > int brw_init(struct brw_sem *brw) > { > brw->writer = NULL; > mutex_init(&brw->writer_mutex); > init_waitqueue_head(&brw->read_waitq); > brw->read_ctr = alloc_percpu(long); > return brw->read_ctr ? 
0 : -ENOMEM; > } > > void brw_down_read(struct brw_sem *brw) > { > for (;;) { > bool done = false; > > preempt_disable(); > if (likely(!brw->writer)) { > __this_cpu_inc(*brw->read_ctr); > done = true; > } > preempt_enable(); > > if (likely(done)) > break; > > __wait_event(brw->read_waitq, !brw->writer); > } > } > > void brw_up_read(struct brw_sem *brw) > { > struct task_struct *writer; > > preempt_disable(); > __this_cpu_dec(*brw->read_ctr); > writer = ACCESS_ONCE(brw->writer); > if (unlikely(writer)) > wake_up_process(writer); > preempt_enable(); > } > > static inline long brw_read_ctr(struct brw_sem *brw) > { > long sum = 0; > int cpu; > > for_each_possible_cpu(cpu) > sum += per_cpu(*brw->read_ctr, cpu); Integer overflow on signed types is undefined - you should use unsigned long - you can use -fwrapv option to gcc to make signed overflow defined, but Linux doesn't use it. > > return sum; > } > > void brw_down_write(struct brw_sem *brw) > { > mutex_lock(&brw->writer_mutex); > brw->writer = current; > synchronize_sched(); > /* > * Thereafter brw_*_read() must see ->writer != NULL, > * and we should see the result of __this_cpu_inc(). > */ > for (;;) { > set_current_state(TASK_UNINTERRUPTIBLE); > if (brw_read_ctr(brw) == 0) > break; > schedule(); > } > __set_current_state(TASK_RUNNING); > /* > * We can add another synchronize_sched() to avoid the > * spurious wakeups from brw_up_read() after return. > */ > } > > void brw_up_write(struct brw_sem *brw) > { > brw->writer = NULL; > synchronize_sched(); That synchronize_sched should be put before brw->writer = NULL. This is incorrect, because brw->writer = NULL may be reordered with previous writes done by this process and the other CPU may see brw->writer == NULL (and think that the lock is unlocked) while it doesn't see previous writes done by the writer. I had this bug in my implementation too. 
> wake_up_all(&brw->read_waitq); > mutex_unlock(&brw->writer_mutex); > } Mikulas ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH 1/2] brw_mutex: big read-write mutex 2012-10-22 23:09 ` Mikulas Patocka @ 2012-10-23 15:12 ` Oleg Nesterov 0 siblings, 0 replies; 103+ messages in thread From: Oleg Nesterov @ 2012-10-23 15:12 UTC (permalink / raw) To: Mikulas Patocka Cc: Peter Zijlstra, Paul E. McKenney, Linus Torvalds, Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel, Thomas Gleixner Hi Mikulas, On 10/22, Mikulas Patocka wrote: > > On Fri, 19 Oct 2012, Oleg Nesterov wrote: > > > On 10/19, Mikulas Patocka wrote: > > > > > > synchronize_rcu() is way slower than msleep(1) - > > > > This depends, I guess. But this doesn't matter, > > > > > so I don't see a reason > > > why should it be complicated to avoid msleep(1). > > > > I don't think this really needs complications. Please look at this > > patch for example. Or initial (single writer) version below. It is > > not finished and lacks the barriers too, but I do not think it is ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ please note the comment above ;) > > more complex. > > Hi > > My implementation has a smaller structure (it doesn't have > wait_queue_head_t). Oh, I don't think sizeof() really matters in this case. > Your implementation is prone to starvation - if the writer has a high > priority and if it is doing back-to-back write unlocks/locks, it may > happen that the readers have no chance to run. Yes, it is write-biased, this was the intent. Writers should be rare. > The use of a mutex instead of a wait queue in my implementation is unusual, > but I don't see anything wrong with it Neither do I. Mikulas, apart from the _rcu/_sched change, my only point was that msleep() can (and imho should) be avoided. 
> > static inline long brw_read_ctr(struct brw_sem *brw) > > { > > long sum = 0; > > int cpu; > > > > for_each_possible_cpu(cpu) > > sum += per_cpu(*brw->read_ctr, cpu); > > Integer overflow on signed types is undefined - you should use unsigned > long - you can use -fwrapv option to gcc to make signed overflow defined, > but Linux doesn't use it. I don't think -fwrapv can make any difference in this case, but I agree that "unsigned long" makes more sense. > > void brw_up_write(struct brw_sem *brw) > > { > > brw->writer = NULL; > > synchronize_sched(); > > That synchronize_sched should be put before brw->writer = NULL. Yes, I know. I mentioned this at the start, this lacks the necessary barrier between this writer and the next reader. > I had this bug in my implementation too. Yes, exactly. And this is why I cc'ed you initially ;) Oleg. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH 1/2] brw_mutex: big read-write mutex 2012-10-18 17:57 ` Oleg Nesterov 2012-10-18 19:28 ` Mikulas Patocka @ 2012-10-19 19:28 ` Paul E. McKenney 2012-10-22 23:36 ` [PATCH 0/2] fix and improvements for percpu-rw-semaphores (was: brw_mutex: big read-write mutex) Mikulas Patocka 1 sibling, 1 reply; 103+ messages in thread From: Paul E. McKenney @ 2012-10-19 19:28 UTC (permalink / raw) To: Oleg Nesterov Cc: Mikulas Patocka, Linus Torvalds, Ingo Molnar, Peter Zijlstra, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On Thu, Oct 18, 2012 at 07:57:47PM +0200, Oleg Nesterov wrote: > On 10/18, Paul E. McKenney wrote: > > > > On Thu, Oct 18, 2012 at 06:24:09PM +0200, Oleg Nesterov wrote: > > > > > > I thought that you meant that without mb() brw_start_write() can > > > race with brw_end_read() and hang forever. > > > > > > But probably you meant that we need the barriers to ensure that, > > > say, if the reader does > > > > > > brw_start_read(); > > > CONDITION = 1; > > > brw_end_read(); > > > > > > then the writer must see CONDITION != 0 after brw_start_write() ? > > > (or vice-versa) > > > > Yes, this is exactly my concern. > > Oh, thanks a lot Paul (as always). Glad it helped. ;-) > > > In this case we need the barrier, yes. Obviously brw_start_write() > > > can return right after this_cpu_dec() and before wake_up_all(). > > > > > > 2/2 doesn't need this guarantee but I agree, this doesn't look > > > sane in general... > > > > Or name it something not containing "lock". And clearly document > > the behavior and how it is to be used. ;-) > > this would be insane, I guess ;) Well, I suppose you could call it a "phase": brw_start_phase_1() and so on. > So. Ignoring the possible optimization you mentioned before, > brw_end_read() should do: > > smp_mb(); > this_cpu_dec(); > > wake_up_all(); > > And yes, we need the full mb(). wmb() is enough to ensure that the > writer will see the memory modifications done by the reader. 
But we > also need to ensure that any LOAD inside start_read/end_read cannot > be moved outside of the critical section. > > But we should also ensure that "read" will see all modifications > which were done under start_write/end_write. This means that > brw_end_write() needs another synchronize_sched() before > atomic_dec_and_test(), or brw_start_read() needs mb() in the > fast-path. > > Correct? Good point, I missed the need for synchronize_sched() to avoid readers sleeping through the next write cycle due to racing with an exiting writer. But yes, this sounds correct. > Ooooh. And I just noticed include/linux/percpu-rwsem.h which does > something similar. Certainly it was not in my tree when I started > this patch... percpu_down_write() doesn't allow multiple writers, > but the main problem is that it uses msleep(1). It should not, I think. > > But. It seems that percpu_up_write() is equally wrong? Doesn't > it need synchronize_rcu() before "p->locked = false" ? > > (add Mikulas) Mikulas said something about doing an updated patch, so I figured I would look at his next version. Thanx, Paul ^ permalink raw reply [flat|nested] 103+ messages in thread
* [PATCH 0/2] fix and improvements for percpu-rw-semaphores (was: brw_mutex: big read-write mutex) 2012-10-19 19:28 ` Paul E. McKenney @ 2012-10-22 23:36 ` Mikulas Patocka 2012-10-22 23:37 ` [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers Mikulas Patocka 2012-10-30 18:48 ` [PATCH 0/2] fix and improvements for percpu-rw-semaphores (was: brw_mutex: big read-write mutex) Oleg Nesterov 0 siblings, 2 replies; 103+ messages in thread From: Mikulas Patocka @ 2012-10-22 23:36 UTC (permalink / raw) To: Linus Torvalds Cc: Oleg Nesterov, Ingo Molnar, Peter Zijlstra, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel, Paul E. McKenney > > Ooooh. And I just noticed include/linux/percpu-rwsem.h which does > > something similar. Certainly it was not in my tree when I started > > this patch... percpu_down_write() doesn't allow multiple writers, > > but the main problem is that it uses msleep(1). It should not, I think. > > > > But. It seems that percpu_up_write() is equally wrong? Doesn't > > it need synchronize_rcu() before "p->locked = false" ? > > > > (add Mikulas) > > Mikulas said something about doing an updated patch, so I figured I > would look at his next version. > > Thanx, Paul The best ideas proposed in this thread are: Using heavy/light barriers, by Lai Jiangshan. This fixes the missing barrier bug, removes the ugly test "#if defined(X86) ..." and makes the read path use no barrier instruction on all architectures. Instead of rcu_read_lock, we can use rcu_read_lock_sched (or preempt_disable) - the resulting code is smaller. The critical section is so small that there is no problem disabling preemption. I am sending these two patches. Linus, please apply them if there are no objections. Mikulas ^ permalink raw reply [flat|nested] 103+ messages in thread
* [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers 2012-10-22 23:36 ` [PATCH 0/2] fix and improvements for percpu-rw-semaphores (was: brw_mutex: big read-write mutex) Mikulas Patocka @ 2012-10-22 23:37 ` Mikulas Patocka 2012-10-22 23:39 ` [PATCH 2/2] percpu-rw-semaphores: use rcu_read_lock_sched Mikulas Patocka ` (2 more replies) 2012-10-30 18:48 ` [PATCH 0/2] fix and improvements for percpu-rw-semaphores (was: brw_mutex: big read-write mutex) Oleg Nesterov 1 sibling, 3 replies; 103+ messages in thread From: Mikulas Patocka @ 2012-10-22 23:37 UTC (permalink / raw) To: Linus Torvalds Cc: Oleg Nesterov, Ingo Molnar, Peter Zijlstra, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel, Paul E. McKenney This patch introduces a new barrier pair, light_mb() and heavy_mb(), for percpu rw semaphores. This patch fixes a bug in percpu-rw-semaphores where a barrier was missing in percpu_up_write. This patch improves performance on the read path of percpu-rw-semaphores: on non-x86 cpus, there was a smp_mb() in percpu_up_read. This patch changes it to a compiler barrier and removes the "#if defined(X86) ..." condition. 
From: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> --- include/linux/percpu-rwsem.h | 20 +++++++------------- 1 file changed, 7 insertions(+), 13 deletions(-) Index: linux-3.6.3-fast/include/linux/percpu-rwsem.h =================================================================== --- linux-3.6.3-fast.orig/include/linux/percpu-rwsem.h 2012-10-22 23:37:57.000000000 +0200 +++ linux-3.6.3-fast/include/linux/percpu-rwsem.h 2012-10-23 01:21:23.000000000 +0200 @@ -12,6 +12,9 @@ struct percpu_rw_semaphore { struct mutex mtx; }; +#define light_mb() barrier() +#define heavy_mb() synchronize_sched() + static inline void percpu_down_read(struct percpu_rw_semaphore *p) { rcu_read_lock(); @@ -24,22 +27,12 @@ static inline void percpu_down_read(stru } this_cpu_inc(*p->counters); rcu_read_unlock(); + light_mb(); /* A, between read of p->locked and read of data, paired with D */ } static inline void percpu_up_read(struct percpu_rw_semaphore *p) { - /* - * On X86, write operation in this_cpu_dec serves as a memory unlock - * barrier (i.e. memory accesses may be moved before the write, but - * no memory accesses are moved past the write). - * On other architectures this may not be the case, so we need smp_mb() - * there. 
- */ -#if defined(CONFIG_X86) && (!defined(CONFIG_X86_PPRO_FENCE) && !defined(CONFIG_X86_OOSTORE)) - barrier(); -#else - smp_mb(); -#endif + light_mb(); /* B, between read of the data and write to p->counter, paired with C */ this_cpu_dec(*p->counters); } @@ -61,11 +54,12 @@ static inline void percpu_down_write(str synchronize_rcu(); while (__percpu_count(p->counters)) msleep(1); - smp_rmb(); /* paired with smp_mb() in percpu_sem_up_read() */ + heavy_mb(); /* C, between read of p->counter and write to data, paired with B */ } static inline void percpu_up_write(struct percpu_rw_semaphore *p) { + heavy_mb(); /* D, between write to data and write to p->locked, paired with A */ p->locked = false; mutex_unlock(&p->mtx); } ^ permalink raw reply [flat|nested] 103+ messages in thread
* [PATCH 2/2] percpu-rw-semaphores: use rcu_read_lock_sched 2012-10-22 23:37 ` [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers Mikulas Patocka @ 2012-10-22 23:39 ` Mikulas Patocka 2012-10-24 16:16 ` Paul E. McKenney 2012-10-23 16:59 ` [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers Oleg Nesterov 2012-10-23 20:32 ` Peter Zijlstra 2 siblings, 1 reply; 103+ messages in thread From: Mikulas Patocka @ 2012-10-22 23:39 UTC (permalink / raw) To: Linus Torvalds Cc: Oleg Nesterov, Ingo Molnar, Peter Zijlstra, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel, Paul E. McKenney Use rcu_read_lock_sched / rcu_read_unlock_sched / synchronize_sched instead of rcu_read_lock / rcu_read_unlock / synchronize_rcu. This is an optimization. The RCU-protected region is very small, so there will be no latency problems if we disable preemption in this region. So we use rcu_read_lock_sched / rcu_read_unlock_sched, which translate to preempt_disable / preempt_enable. It is smaller (and supposedly faster) than preemptible rcu_read_lock / rcu_read_unlock. 
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> --- include/linux/percpu-rwsem.h | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) Index: linux-3.6.3-fast/include/linux/percpu-rwsem.h =================================================================== --- linux-3.6.3-fast.orig/include/linux/percpu-rwsem.h 2012-10-23 01:21:49.000000000 +0200 +++ linux-3.6.3-fast/include/linux/percpu-rwsem.h 2012-10-23 01:36:23.000000000 +0200 @@ -17,16 +17,16 @@ struct percpu_rw_semaphore { static inline void percpu_down_read(struct percpu_rw_semaphore *p) { - rcu_read_lock(); + rcu_read_lock_sched(); if (unlikely(p->locked)) { - rcu_read_unlock(); + rcu_read_unlock_sched(); mutex_lock(&p->mtx); this_cpu_inc(*p->counters); mutex_unlock(&p->mtx); return; } this_cpu_inc(*p->counters); - rcu_read_unlock(); + rcu_read_unlock_sched(); light_mb(); /* A, between read of p->locked and read of data, paired with D */ } @@ -51,7 +51,7 @@ static inline void percpu_down_write(str { mutex_lock(&p->mtx); p->locked = true; - synchronize_rcu(); + synchronize_sched(); /* make sure that all readers exit the rcu_read_lock_sched region */ while (__percpu_count(p->counters)) msleep(1); heavy_mb(); /* C, between read of p->counter and write to data, paired with B */ ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH 2/2] percpu-rw-semaphores: use rcu_read_lock_sched 2012-10-22 23:39 ` [PATCH 2/2] percpu-rw-semaphores: use rcu_read_lock_sched Mikulas Patocka @ 2012-10-24 16:16 ` Paul E. McKenney 2012-10-24 17:18 ` Oleg Nesterov 2012-10-25 14:54 ` Mikulas Patocka 0 siblings, 2 replies; 103+ messages in thread From: Paul E. McKenney @ 2012-10-24 16:16 UTC (permalink / raw) To: Mikulas Patocka Cc: Linus Torvalds, Oleg Nesterov, Ingo Molnar, Peter Zijlstra, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On Mon, Oct 22, 2012 at 07:39:16PM -0400, Mikulas Patocka wrote: > Use rcu_read_lock_sched / rcu_read_unlock_sched / synchronize_sched > instead of rcu_read_lock / rcu_read_unlock / synchronize_rcu. > > This is an optimization. The RCU-protected region is very small, so > there will be no latency problems if we disable preemption in this region. > > So we use rcu_read_lock_sched / rcu_read_unlock_sched, which translate > to preempt_disable / preempt_enable. It is smaller (and supposedly > faster) than preemptible rcu_read_lock / rcu_read_unlock. > > Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> OK, as promised/threatened, I finally got a chance to take a closer look. The light_mb() and heavy_mb() definitions aren't doing much for me; the code would be clearer with them expanded inline. And while the approach of pairing barrier() with synchronize_sched() is interesting, it would be simpler to rely on RCU's properties. The key point is that if RCU cannot prove that a given RCU-sched read-side critical section is seen by all CPUs to have started after a given synchronize_sched(), then that synchronize_sched() must wait for that RCU-sched read-side critical section to complete. This means, as discussed earlier, that there will be a memory barrier somewhere following the end of that RCU-sched read-side critical section, and that this memory barrier executes before the completion of the synchronize_sched(). 
So I suggest something like the following (untested!) implementation: ------------------------------------------------------------------------ struct percpu_rw_semaphore { unsigned __percpu *counters; bool locked; struct mutex mtx; wait_queue_head_t wq; }; static inline void percpu_down_read(struct percpu_rw_semaphore *p) { rcu_read_lock_sched(); if (unlikely(p->locked)) { rcu_read_unlock_sched(); /* * There might (or might not) be a writer. Acquire &p->mtx, * it is always safe (if a bit slow) to do so. */ mutex_lock(&p->mtx); this_cpu_inc(*p->counters); mutex_unlock(&p->mtx); return; } /* No writer, proceed locklessly. */ this_cpu_inc(*p->counters); rcu_read_unlock_sched(); } static inline void percpu_up_read(struct percpu_rw_semaphore *p) { /* * Decrement our count, but protected by RCU-sched so that * the writer can force proper serialization. */ rcu_read_lock_sched(); this_cpu_dec(*p->counters); rcu_read_unlock_sched(); } static inline unsigned __percpu_count(unsigned __percpu *counters) { unsigned total = 0; int cpu; for_each_possible_cpu(cpu) total += ACCESS_ONCE(*per_cpu_ptr(counters, cpu)); return total; } static inline void percpu_down_write(struct percpu_rw_semaphore *p) { mutex_lock(&p->mtx); /* Wait for a previous writer, if necessary. */ wait_event(p->wq, !ACCESS_ONCE(p->locked)); /* Force the readers to acquire the lock when manipulating counts. */ ACCESS_ONCE(p->locked) = true; /* Wait for all pre-existing readers' checks of ->locked to finish. */ synchronize_sched(); /* * At this point, all percpu_down_read() invocations will * acquire p->mtx. */ /* * Wait for all pre-existing readers to complete their * percpu_up_read() calls. Because ->locked is set and * because we hold ->mtx, there cannot be any new readers. * ->counters will therefore monotonically decrement to zero. 
*/ while (__percpu_count(p->counters)) msleep(1); /* * Invoke synchronize_sched() in order to force the last * caller of percpu_up_read() to exit its RCU-sched read-side * critical section. On SMP systems, this also forces the CPU * that invoked that percpu_up_read() to execute a full memory * barrier between the time it exited the RCU-sched read-side * critical section and the time that synchronize_sched() returns, * so that the critical section begun by this invocation of * percpu_down_write() will happen after the critical section * ended by percpu_up_read(). */ synchronize_sched(); } static inline void percpu_up_write(struct percpu_rw_semaphore *p) { /* Allow others to proceed, but not yet locklessly. */ mutex_unlock(&p->mtx); /* * Ensure that all calls to percpu_down_read() that did not * start unambiguously after the above mutex_unlock() still * acquire the lock, forcing their critical sections to be * serialized with the one terminated by this call to * percpu_up_write(). */ synchronize_sched(); /* Now it is safe to allow readers to proceed locklessly. */ ACCESS_ONCE(p->locked) = false; /* * If there is another writer waiting, wake it up. Note that * p->mtx properly serializes its critical section with the * critical section terminated by this call to percpu_up_write(). */ wake_up(&p->wq); } static inline int percpu_init_rwsem(struct percpu_rw_semaphore *p) { p->counters = alloc_percpu(unsigned); if (unlikely(!p->counters)) return -ENOMEM; p->locked = false; mutex_init(&p->mtx); init_waitqueue_head(&p->wq); return 0; } static inline void percpu_free_rwsem(struct percpu_rw_semaphore *p) { free_percpu(p->counters); p->counters = NULL; /* catch use after free bugs */ } ------------------------------------------------------------------------ Of course, it would be nice to get rid of the extra synchronize_sched(). 
One way to do this is to use SRCU, which allows blocking operations in its read-side critical sections (though also increasing read-side overhead a bit, and also untested): ------------------------------------------------------------------------ struct percpu_rw_semaphore { bool locked; struct mutex mtx; /* Could also be rw_semaphore. */ struct srcu_struct s; wait_queue_head_t wq; }; static inline int percpu_down_read(struct percpu_rw_semaphore *p) { int idx; idx = srcu_read_lock(&p->s); if (unlikely(p->locked)) { srcu_read_unlock(&p->s, idx); /* * There might (or might not) be a writer. Acquire &p->mtx, * it is always safe (if a bit slow) to do so. */ mutex_lock(&p->mtx); return -1; /* srcu_read_lock() cannot return -1. */ } return idx; } static inline void percpu_up_read(struct percpu_rw_semaphore *p, int idx) { if (idx == -1) mutex_unlock(&p->mtx); else srcu_read_unlock(&p->s, idx); } static inline void percpu_down_write(struct percpu_rw_semaphore *p) { mutex_lock(&p->mtx); /* Wait for a previous writer, if necessary. */ wait_event(p->wq, !ACCESS_ONCE(p->locked)); /* Force new readers to acquire the lock when manipulating counts. */ ACCESS_ONCE(p->locked) = true; /* Wait for all pre-existing readers' checks of ->locked to finish. */ synchronize_srcu(&p->s); /* At this point, all lockless readers have completed. */ } static inline void percpu_up_write(struct percpu_rw_semaphore *p) { /* Allow others to proceed, but not yet locklessly. */ mutex_unlock(&p->mtx); /* * Ensure that all calls to percpu_down_read() that did not * start unambiguously after the above mutex_unlock() still * acquire the lock, forcing their critical sections to be * serialized with the one terminated by this call to * percpu_up_write(). */ synchronize_sched(); /* Now it is safe to allow readers to proceed locklessly. */ ACCESS_ONCE(p->locked) = false; /* * If there is another writer waiting, wake it up. 
Note that * p->mtx properly serializes its critical section with the * critical section terminated by this call to percpu_up_write(). */ wake_up(&p->wq); } static inline int percpu_init_rwsem(struct percpu_rw_semaphore *p) { p->locked = false; mutex_init(&p->mtx); if (unlikely(init_srcu_struct(&p->s))) return -ENOMEM; init_waitqueue_head(&p->wq); return 0; } static inline void percpu_free_rwsem(struct percpu_rw_semaphore *p) { cleanup_srcu_struct(&p->s); } ------------------------------------------------------------------------ Of course, there was a question raised as to whether something already exists that does this job... And you guys did ask! Thanx, Paul ^ permalink raw reply [flat|nested] 103+ messages in thread
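Paul's SRCU variant hinges on srcu_read_lock() returning an index that the matching unlock must hand back. A minimal userspace sketch of that two-epoch idea (this is not real SRCU — the flip-and-drain scheme here is a simplified illustration of the structure, with hypothetical names):

```c
#include <stdatomic.h>
#include <assert.h>

/* Hypothetical, minimal analogue of SRCU's reader side: two epoch
 * counters, an active index, and a "grace period" that flips the index
 * and waits for the retiring epoch to drain.  It shows why the read
 * lock must return an index for the unlock to pass back: the unlock
 * has to decrement the epoch that was active at lock time, not the
 * one that is current at unlock time. */
struct srcu_sketch {
    atomic_int ctr[2];
    atomic_int active;   /* which ctr new readers use */
};

static void srcu_sketch_init(struct srcu_sketch *s)
{
    atomic_init(&s->ctr[0], 0);
    atomic_init(&s->ctr[1], 0);
    atomic_init(&s->active, 0);
}

static int srcu_sketch_read_lock(struct srcu_sketch *s)
{
    int idx = atomic_load(&s->active);
    atomic_fetch_add(&s->ctr[idx], 1);
    return idx;                       /* caller must hand this back */
}

static void srcu_sketch_read_unlock(struct srcu_sketch *s, int idx)
{
    atomic_fetch_sub(&s->ctr[idx], 1);
}

/* Grace period: flip the active index, then wait until every reader
 * that started in the old epoch has unlocked. */
static void srcu_sketch_synchronize(struct srcu_sketch *s)
{
    int old = atomic_fetch_xor(&s->active, 1);
    while (atomic_load(&s->ctr[old]))
        ;                             /* real SRCU sleeps here */
}
```

Readers that entered under the old index keep decrementing the old counter after the flip, which is exactly what lets the writer wait for them without blocking new readers forever.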
* Re: [PATCH 2/2] percpu-rw-semaphores: use rcu_read_lock_sched 2012-10-24 16:16 ` Paul E. McKenney @ 2012-10-24 17:18 ` Oleg Nesterov 2012-10-24 18:20 ` Paul E. McKenney 2012-10-25 14:54 ` Mikulas Patocka 1 sibling, 1 reply; 103+ messages in thread From: Oleg Nesterov @ 2012-10-24 17:18 UTC (permalink / raw) To: Paul E. McKenney Cc: Mikulas Patocka, Linus Torvalds, Ingo Molnar, Peter Zijlstra, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On 10/24, Paul E. McKenney wrote: > > static inline void percpu_up_read(struct percpu_rw_semaphore *p) > { > /* > * Decrement our count, but protected by RCU-sched so that > * the writer can force proper serialization. > */ > rcu_read_lock_sched(); > this_cpu_dec(*p->counters); > rcu_read_unlock_sched(); > } Yes, the explicit lock/unlock makes the new assumptions about synchronize_sched && barriers unnecessary. And iiuc this could even be written as rcu_read_lock_sched(); rcu_read_unlock_sched(); this_cpu_dec(*p->counters); > Of course, it would be nice to get rid of the extra synchronize_sched(). > One way to do this is to use SRCU, which allows blocking operations in > its read-side critical sections (though also increasing read-side overhead > a bit, and also untested): > > ------------------------------------------------------------------------ > > struct percpu_rw_semaphore { > bool locked; > struct mutex mtx; /* Could also be rw_semaphore. */ > struct srcu_struct s; > wait_queue_head_t wq; > }; but in this case I don't understand > static inline void percpu_up_write(struct percpu_rw_semaphore *p) > { > /* Allow others to proceed, but not yet locklessly. */ > mutex_unlock(&p->mtx); > > /* > * Ensure that all calls to percpu_down_read() that did not > * start unambiguously after the above mutex_unlock() still > * acquire the lock, forcing their critical sections to be > * serialized with the one terminated by this call to > * percpu_up_write(). 
> */ > synchronize_sched(); how this synchronize_sched() can help... Oleg. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH 2/2] percpu-rw-semaphores: use rcu_read_lock_sched 2012-10-24 17:18 ` Oleg Nesterov @ 2012-10-24 18:20 ` Paul E. McKenney 2012-10-24 18:43 ` Oleg Nesterov 0 siblings, 1 reply; 103+ messages in thread From: Paul E. McKenney @ 2012-10-24 18:20 UTC (permalink / raw) To: Oleg Nesterov Cc: Mikulas Patocka, Linus Torvalds, Ingo Molnar, Peter Zijlstra, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On Wed, Oct 24, 2012 at 07:18:55PM +0200, Oleg Nesterov wrote: > On 10/24, Paul E. McKenney wrote: > > > > static inline void percpu_up_read(struct percpu_rw_semaphore *p) > > { > > /* > > * Decrement our count, but protected by RCU-sched so that > > * the writer can force proper serialization. > > */ > > rcu_read_lock_sched(); > > this_cpu_dec(*p->counters); > > rcu_read_unlock_sched(); > > } > > Yes, the explicit lock/unlock makes the new assumptions about > synchronize_sched && barriers unnecessary. And iiuc this could > even written as > > rcu_read_lock_sched(); > rcu_read_unlock_sched(); > > this_cpu_dec(*p->counters); But this would lose the memory barrier that is inserted by synchronize_sched() after the CPU's last RCU-sched read-side critical section. > > Of course, it would be nice to get rid of the extra synchronize_sched(). > > One way to do this is to use SRCU, which allows blocking operations in > > its read-side critical sections (though also increasing read-side overhead > > a bit, and also untested): > > > > ------------------------------------------------------------------------ > > > > struct percpu_rw_semaphore { > > bool locked; > > struct mutex mtx; /* Could also be rw_semaphore. */ > > struct srcu_struct s; > > wait_queue_head_t wq; > > }; > > but in this case I don't understand > > > static inline void percpu_up_write(struct percpu_rw_semaphore *p) > > { > > /* Allow others to proceed, but not yet locklessly. 
*/ > > mutex_unlock(&p->mtx); > > > > /* > > * Ensure that all calls to percpu_down_read() that did not > > * start unambiguously after the above mutex_unlock() still > > * acquire the lock, forcing their critical sections to be > > * serialized with the one terminated by this call to > > * percpu_up_write(). > > */ > > synchronize_sched(); > > how this synchronize_sched() can help... Indeed it cannot! It should instead be synchronize_srcu(&p->s). I guess that I really meant it when I said it was untested. ;-) Thanx, Paul ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH 2/2] percpu-rw-semaphores: use rcu_read_lock_sched 2012-10-24 18:20 ` Paul E. McKenney @ 2012-10-24 18:43 ` Oleg Nesterov 2012-10-24 19:43 ` Paul E. McKenney 0 siblings, 1 reply; 103+ messages in thread From: Oleg Nesterov @ 2012-10-24 18:43 UTC (permalink / raw) To: Paul E. McKenney Cc: Mikulas Patocka, Linus Torvalds, Ingo Molnar, Peter Zijlstra, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On 10/24, Paul E. McKenney wrote: > > On Wed, Oct 24, 2012 at 07:18:55PM +0200, Oleg Nesterov wrote: > > On 10/24, Paul E. McKenney wrote: > > > > > > static inline void percpu_up_read(struct percpu_rw_semaphore *p) > > > { > > > /* > > > * Decrement our count, but protected by RCU-sched so that > > > * the writer can force proper serialization. > > > */ > > > rcu_read_lock_sched(); > > > this_cpu_dec(*p->counters); > > > rcu_read_unlock_sched(); > > > } > > > > Yes, the explicit lock/unlock makes the new assumptions about > > synchronize_sched && barriers unnecessary. And iiuc this could > > even written as > > > > rcu_read_lock_sched(); > > rcu_read_unlock_sched(); > > > > this_cpu_dec(*p->counters); > > But this would lose the memory barrier that is inserted by > synchronize_sched() after the CPU's last RCU-sched read-side critical > section. How? Afaics there is no need to synchronize with this_cpu_dec(), its result was already seen before the 2nd synchronize_sched() was called in percpu_down_write(). IOW, this memory barrier is only needed to synchronize with memory changes inside down_read/up_read. To clarify, of course I do not suggest writing it this way. I am just trying to check my understanding. Oleg. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH 2/2] percpu-rw-semaphores: use rcu_read_lock_sched 2012-10-24 18:43 ` Oleg Nesterov @ 2012-10-24 19:43 ` Paul E. McKenney 0 siblings, 0 replies; 103+ messages in thread From: Paul E. McKenney @ 2012-10-24 19:43 UTC (permalink / raw) To: Oleg Nesterov Cc: Mikulas Patocka, Linus Torvalds, Ingo Molnar, Peter Zijlstra, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On Wed, Oct 24, 2012 at 08:43:11PM +0200, Oleg Nesterov wrote: > On 10/24, Paul E. McKenney wrote: > > > > On Wed, Oct 24, 2012 at 07:18:55PM +0200, Oleg Nesterov wrote: > > > On 10/24, Paul E. McKenney wrote: > > > > > > > > static inline void percpu_up_read(struct percpu_rw_semaphore *p) > > > > { > > > > /* > > > > * Decrement our count, but protected by RCU-sched so that > > > > * the writer can force proper serialization. > > > > */ > > > > rcu_read_lock_sched(); > > > > this_cpu_dec(*p->counters); > > > > rcu_read_unlock_sched(); > > > > } > > > > > > Yes, the explicit lock/unlock makes the new assumptions about > > > synchronize_sched && barriers unnecessary. And iiuc this could > > > even written as > > > > > > rcu_read_lock_sched(); > > > rcu_read_unlock_sched(); > > > > > > this_cpu_dec(*p->counters); > > > > But this would lose the memory barrier that is inserted by > > synchronize_sched() after the CPU's last RCU-sched read-side critical > > section. > > How? Afaics there is no need to synchronize with this_cpu_dec(), its > result was already seen before the 2nd synchronize_sched() was called > in percpu_down_write(). > > IOW, this memory barrier is only needed to synchronize with memory > changes inside down_read/up_read. > > To clarify, of course I do not suggest to write is this way. I am just > trying to check my understanding. 
You are quite correct -- once the writer has seen the change in the counter, it knows that the reader's empty RCU-sched read-side critical section must have at least started, and the writer can thus rely on the following memory barrier to guarantee that it sees the reader's critical section. But that code really does look strange, I will grant you that! ;-) Thanx, Paul ^ permalink raw reply [flat|nested] 103+ messages in thread
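The two shapes of percpu_up_read() being compared here can be put side by side in a userspace sketch. Compiler fences stand in for rcu_read_lock_sched()/rcu_read_unlock_sched() purely as markers, so this shows only that the counter bookkeeping is identical and where the dec sits relative to the section — not the grace-period ordering argument itself:

```c
#include <stdatomic.h>
#include <assert.h>

static _Atomic int counter;   /* stands in for *p->counters */

/* Mikulas's version: the dec sits inside the read-side section. */
static void up_read_wrapped(void)
{
    atomic_signal_fence(memory_order_seq_cst);  /* rcu_read_lock_sched() */
    atomic_fetch_sub(&counter, 1);              /* this_cpu_dec() */
    atomic_signal_fence(memory_order_seq_cst);  /* rcu_read_unlock_sched() */
}

/* Oleg's deliberately strange rewrite: an empty read-side section,
 * then the dec.  As Paul explains, once the writer observes the dec
 * it knows the empty section at least started, which is all the
 * grace-period guarantee needs. */
static void up_read_empty_section(void)
{
    atomic_signal_fence(memory_order_seq_cst);  /* rcu_read_lock_sched() */
    atomic_signal_fence(memory_order_seq_cst);  /* rcu_read_unlock_sched() */
    atomic_fetch_sub(&counter, 1);              /* this_cpu_dec() */
}
```

Both leave the counter in the same state; the entire discussion is about which memory barriers the surrounding grace periods can be made to supply for free.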
* Re: [PATCH 2/2] percpu-rw-semaphores: use rcu_read_lock_sched 2012-10-24 16:16 ` Paul E. McKenney 2012-10-24 17:18 ` Oleg Nesterov @ 2012-10-25 14:54 ` Mikulas Patocka 2012-10-25 15:07 ` Paul E. McKenney 1 sibling, 1 reply; 103+ messages in thread From: Mikulas Patocka @ 2012-10-25 14:54 UTC (permalink / raw) To: Paul E. McKenney Cc: Linus Torvalds, Oleg Nesterov, Ingo Molnar, Peter Zijlstra, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On Wed, 24 Oct 2012, Paul E. McKenney wrote: > On Mon, Oct 22, 2012 at 07:39:16PM -0400, Mikulas Patocka wrote: > > Use rcu_read_lock_sched / rcu_read_unlock_sched / synchronize_sched > > instead of rcu_read_lock / rcu_read_unlock / synchronize_rcu. > > > > This is an optimization. The RCU-protected region is very small, so > > there will be no latency problems if we disable preempt in this region. > > > > So we use rcu_read_lock_sched / rcu_read_unlock_sched that translates > > to preempt_disable / preempt_disable. It is smaller (and supposedly > > faster) than preemptible rcu_read_lock / rcu_read_unlock. > > > > Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> > > OK, as promised/threatened, I finally got a chance to take a closer look. > > The light_mb() and heavy_mb() definitions aren't doing much for me, > the code would be cleared with them expanded inline. And while the > approach of pairing barrier() with synchronize_sched() is interesting, > it would be simpler to rely on RCU's properties. The key point is that > if RCU cannot prove that a given RCU-sched read-side critical section > is seen by all CPUs to have started after a given synchronize_sched(), > then that synchronize_sched() must wait for that RCU-sched read-side > critical section to complete. Also note that you can define both light_mb() and heavy_mb() to be smp_mb() and slow down the reader path a bit and speed up the writer path. 
On architectures with in-order memory access (and thus smp_mb() equals barrier()), it doesn't hurt the reader but helps the writer, for example: #ifdef ARCH_HAS_INORDER_MEMORY_ACCESS #define light_mb() smp_mb() #define heavy_mb() smp_mb() #else #define light_mb() barrier() #define heavy_mb() synchronize_sched() #endif Mikulas ^ permalink raw reply [flat|nested] 103+ messages in thread
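Mikulas's #ifdef can be mirrored in portable C11, with atomic_signal_fence() playing the role of barrier() and atomic_thread_fence() standing in, very loosely, for the heavy side. ARCH_HAS_INORDER_MEMORY_ACCESS and this whole mapping are illustrative assumptions — in the kernel patch the heavy side is synchronize_sched(), far costlier than any single fence:

```c
#include <stdatomic.h>
#include <assert.h>

#ifdef ARCH_HAS_INORDER_MEMORY_ACCESS
#define light_mb() atomic_thread_fence(memory_order_seq_cst) /* smp_mb() */
#define heavy_mb() atomic_thread_fence(memory_order_seq_cst) /* smp_mb() */
#else
#define light_mb() atomic_signal_fence(memory_order_seq_cst) /* barrier(): compiler-only */
#define heavy_mb() atomic_thread_fence(memory_order_seq_cst) /* stand-in for synchronize_sched() */
#endif

/* The asymmetry the thread is weighing: the cheap fence sits on the
 * hot reader side, the expensive one on the rare writer side. */
static int publish_then_read(_Atomic int *data, _Atomic int *flag)
{
    atomic_store_explicit(data, 42, memory_order_relaxed);
    heavy_mb();                                 /* writer side: rare */
    atomic_store_explicit(flag, 1, memory_order_relaxed);

    int seen = 0;
    if (atomic_load_explicit(flag, memory_order_relaxed)) {
        light_mb();                             /* reader side: hot */
        seen = atomic_load_explicit(data, memory_order_relaxed);
    }
    return seen;
}
```

Defining ARCH_HAS_INORDER_MEMORY_ACCESS flips both sides to a full fence, which is the "slow down the reader a bit, speed up the writer a lot" trade Mikulas describes.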
* Re: [PATCH 2/2] percpu-rw-semaphores: use rcu_read_lock_sched 2012-10-25 14:54 ` Mikulas Patocka @ 2012-10-25 15:07 ` Paul E. McKenney 2012-10-25 16:15 ` Mikulas Patocka 0 siblings, 1 reply; 103+ messages in thread From: Paul E. McKenney @ 2012-10-25 15:07 UTC (permalink / raw) To: Mikulas Patocka Cc: Linus Torvalds, Oleg Nesterov, Ingo Molnar, Peter Zijlstra, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On Thu, Oct 25, 2012 at 10:54:11AM -0400, Mikulas Patocka wrote: > > > On Wed, 24 Oct 2012, Paul E. McKenney wrote: > > > On Mon, Oct 22, 2012 at 07:39:16PM -0400, Mikulas Patocka wrote: > > > Use rcu_read_lock_sched / rcu_read_unlock_sched / synchronize_sched > > > instead of rcu_read_lock / rcu_read_unlock / synchronize_rcu. > > > > > > This is an optimization. The RCU-protected region is very small, so > > > there will be no latency problems if we disable preempt in this region. > > > > > > So we use rcu_read_lock_sched / rcu_read_unlock_sched that translates > > > to preempt_disable / preempt_disable. It is smaller (and supposedly > > > faster) than preemptible rcu_read_lock / rcu_read_unlock. > > > > > > Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> > > > > OK, as promised/threatened, I finally got a chance to take a closer look. > > > > The light_mb() and heavy_mb() definitions aren't doing much for me, > > the code would be cleared with them expanded inline. And while the > > approach of pairing barrier() with synchronize_sched() is interesting, > > it would be simpler to rely on RCU's properties. The key point is that > > if RCU cannot prove that a given RCU-sched read-side critical section > > is seen by all CPUs to have started after a given synchronize_sched(), > > then that synchronize_sched() must wait for that RCU-sched read-side > > critical section to complete. > > Also note that you can define both light_mb() and heavy_mb() to be > smp_mb() and slow down the reader path a bit and speed up the writer path. 
> > On architectures with in-order memory access (and thus smp_mb() equals > barrier()), it doesn't hurt the reader but helps the writer, for example: > #ifdef ARCH_HAS_INORDER_MEMORY_ACCESS > #define light_mb() smp_mb() > #define heavy_mb() smp_mb() > #else > #define light_mb() barrier() > #define heavy_mb() synchronize_sched() > #endif Except that there are no systems running Linux with in-order memory access. Even x86 and s390 require a barrier instruction for smp_mb() on SMP=y builds. Thanx, Paul ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH 2/2] percpu-rw-semaphores: use rcu_read_lock_sched 2012-10-25 15:07 ` Paul E. McKenney @ 2012-10-25 16:15 ` Mikulas Patocka 0 siblings, 0 replies; 103+ messages in thread From: Mikulas Patocka @ 2012-10-25 16:15 UTC (permalink / raw) To: Paul E. McKenney Cc: Linus Torvalds, Oleg Nesterov, Ingo Molnar, Peter Zijlstra, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On Thu, 25 Oct 2012, Paul E. McKenney wrote: > On Thu, Oct 25, 2012 at 10:54:11AM -0400, Mikulas Patocka wrote: > > > > > > On Wed, 24 Oct 2012, Paul E. McKenney wrote: > > > > > On Mon, Oct 22, 2012 at 07:39:16PM -0400, Mikulas Patocka wrote: > > > > Use rcu_read_lock_sched / rcu_read_unlock_sched / synchronize_sched > > > > instead of rcu_read_lock / rcu_read_unlock / synchronize_rcu. > > > > > > > > This is an optimization. The RCU-protected region is very small, so > > > > there will be no latency problems if we disable preempt in this region. > > > > > > > > So we use rcu_read_lock_sched / rcu_read_unlock_sched that translates > > > > to preempt_disable / preempt_disable. It is smaller (and supposedly > > > > faster) than preemptible rcu_read_lock / rcu_read_unlock. > > > > > > > > Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> > > > > > > OK, as promised/threatened, I finally got a chance to take a closer look. > > > > > > The light_mb() and heavy_mb() definitions aren't doing much for me, > > > the code would be cleared with them expanded inline. And while the > > > approach of pairing barrier() with synchronize_sched() is interesting, > > > it would be simpler to rely on RCU's properties. The key point is that > > > if RCU cannot prove that a given RCU-sched read-side critical section > > > is seen by all CPUs to have started after a given synchronize_sched(), > > > then that synchronize_sched() must wait for that RCU-sched read-side > > > critical section to complete. 
> > > > Also note that you can define both light_mb() and heavy_mb() to be > > smp_mb() and slow down the reader path a bit and speed up the writer path. > > > > On architectures with in-order memory access (and thus smp_mb() equals > > barrier()), it doesn't hurt the reader but helps the writer, for example: > > #ifdef ARCH_HAS_INORDER_MEMORY_ACCESS > > #define light_mb() smp_mb() > > #define heavy_mb() smp_mb() > > #else > > #define light_mb() barrier() > > #define heavy_mb() synchronize_sched() > > #endif > > Except that there are no systems running Linux with in-order memory > access. Even x86 and s390 require a barrier instruction for smp_mb() > on SMP=y builds. > > Thanx, Paul PA-RISC is in-order. But it is used very rarely. Mikulas ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers 2012-10-22 23:37 ` [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers Mikulas Patocka 2012-10-22 23:39 ` [PATCH 2/2] percpu-rw-semaphores: use rcu_read_lock_sched Mikulas Patocka @ 2012-10-23 16:59 ` Oleg Nesterov 2012-10-23 18:05 ` Paul E. McKenney 2012-10-23 19:23 ` Oleg Nesterov 2012-10-23 20:32 ` Peter Zijlstra 2 siblings, 2 replies; 103+ messages in thread From: Oleg Nesterov @ 2012-10-23 16:59 UTC (permalink / raw) To: Mikulas Patocka Cc: Linus Torvalds, Ingo Molnar, Peter Zijlstra, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel, Paul E. McKenney Not really the comment, but the question... On 10/22, Mikulas Patocka wrote: > > static inline void percpu_down_read(struct percpu_rw_semaphore *p) > { > rcu_read_lock(); > @@ -24,22 +27,12 @@ static inline void percpu_down_read(stru > } > this_cpu_inc(*p->counters); > rcu_read_unlock(); > + light_mb(); /* A, between read of p->locked and read of data, paired with D */ > } rcu_read_unlock() (or even preempt_enable) should have compiler barrier semantics... But I agree, this adds more documentation for free. > static inline void percpu_up_read(struct percpu_rw_semaphore *p) > { > - /* > - * On X86, write operation in this_cpu_dec serves as a memory unlock > - * barrier (i.e. memory accesses may be moved before the write, but > - * no memory accesses are moved past the write). > - * On other architectures this may not be the case, so we need smp_mb() > - * there. 
> - */ > -#if defined(CONFIG_X86) && (!defined(CONFIG_X86_PPRO_FENCE) && !defined(CONFIG_X86_OOSTORE)) > - barrier(); > -#else > - smp_mb(); > -#endif > + light_mb(); /* B, between read of the data and write to p->counter, paired with C */ > this_cpu_dec(*p->counters); > } > > @@ -61,11 +54,12 @@ static inline void percpu_down_write(str > synchronize_rcu(); > while (__percpu_count(p->counters)) > msleep(1); > - smp_rmb(); /* paired with smp_mb() in percpu_sem_up_read() */ > + heavy_mb(); /* C, between read of p->counter and write to data, paired with B */ I _think_ this is correct. I am just wondering whether it is strictly correct in theory; I would really like to know what Paul thinks. Ignoring the current implementation, according to the documentation synchronize_sched() has every right to return immediately if there is no active rcu_read_lock_sched() section. If this were possible, then percpu_up_read() lacks mb. So _perhaps_ it makes sense to document that synchronize_sched() also guarantees that all pending loads/stores on other CPUs should be completed upon return? Or have I misunderstood the patch? Oleg. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers 2012-10-23 16:59 ` [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers Oleg Nesterov @ 2012-10-23 18:05 ` Paul E. McKenney 2012-10-23 18:27 ` Oleg Nesterov 2012-10-23 18:41 ` Oleg Nesterov 2012-10-23 19:23 ` Oleg Nesterov 1 sibling, 2 replies; 103+ messages in thread From: Paul E. McKenney @ 2012-10-23 18:05 UTC (permalink / raw) To: Oleg Nesterov Cc: Mikulas Patocka, Linus Torvalds, Ingo Molnar, Peter Zijlstra, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On Tue, Oct 23, 2012 at 06:59:12PM +0200, Oleg Nesterov wrote: > Not really the comment, but the question... > > On 10/22, Mikulas Patocka wrote: > > > > static inline void percpu_down_read(struct percpu_rw_semaphore *p) > > { > > rcu_read_lock(); > > @@ -24,22 +27,12 @@ static inline void percpu_down_read(stru > > } > > this_cpu_inc(*p->counters); > > rcu_read_unlock(); > > + light_mb(); /* A, between read of p->locked and read of data, paired with D */ > > } > > rcu_read_unlock() (or even preempt_enable) should have compiler barrier > semantics... But I agree, this adds more documentation for free. Although rcu_read_lock() does have compiler-barrier semantics if CONFIG_PREEMPT=y, it does not for CONFIG_PREEMPT=n. So the light_mb() (which appears to be barrier()) is needed in that case. > > static inline void percpu_up_read(struct percpu_rw_semaphore *p) > > { > > - /* > > - * On X86, write operation in this_cpu_dec serves as a memory unlock > > - * barrier (i.e. memory accesses may be moved before the write, but > > - * no memory accesses are moved past the write). > > - * On other architectures this may not be the case, so we need smp_mb() > > - * there. 
> > - */ > > -#if defined(CONFIG_X86) && (!defined(CONFIG_X86_PPRO_FENCE) && !defined(CONFIG_X86_OOSTORE)) > > - barrier(); > > -#else > > - smp_mb(); > > -#endif > > + light_mb(); /* B, between read of the data and write to p->counter, paired with C */ > > this_cpu_dec(*p->counters); > > } > > > > @@ -61,11 +54,12 @@ static inline void percpu_down_write(str > > synchronize_rcu(); > > while (__percpu_count(p->counters)) > > msleep(1); > > - smp_rmb(); /* paired with smp_mb() in percpu_sem_up_read() */ > > + heavy_mb(); /* C, between read of p->counter and write to data, paired with B */ > > I _think_ this is correct. > > > Just I am wondering if this is strongly correct in theory, I would > really like to know what Paul thinks. I need to take a closer look. > Ignoring the current implementation, according to the documentation > synchronize_sched() has all rights to return immediately if there is > no active rcu_read_lock_sched() section. If this were possible, than > percpu_up_read() lacks mb. Even if there happen to be no RCU-sched read-side critical sections at the current instant, synchronize_sched() is required to make sure that everyone agrees that whatever code is executed by the caller after synchronize_sched() returns happens after any of the preceding RCU read-side critical sections. So, if we have this, with x==0 initially: Task 0 Task 1 rcu_read_lock_sched(); x = 1; rcu_read_unlock_sched(); synchronize_sched(); r1 = x; Then the value of r1 had better be one. Of course, the above code fragment is doing absolutely nothing to ensure that the synchronize_sched() really does start after Task 1's very strange RCU read-side critical section, but if things did happen in that order, synchronize_sched() would be required to make this guarantee. > So _perhaps_ it makes sense to document that synchronize_sched() also > guarantees that all pending loads/stores on other CPUs should be > completed upon return? Or I misunderstood the patch? Good point. 
The current documentation implies that it does make that guarantee, but it would be good for it to be explicit. Queued for 3.8 is the following addition: * Note that this guarantee implies a further memory-ordering guarantee. * On systems with more than one CPU, when synchronize_sched() returns, * each CPU is guaranteed to have executed a full memory barrier since * the end of its last RCU read-side critical section whose beginning * preceded the call to synchronize_sched(). Note that this guarantee * includes CPUs that are offline, idle, or executing in user mode, as * well as CPUs that are executing in the kernel. Furthermore, if CPU A * invoked synchronize_sched(), which returned to its caller on CPU B, * then both CPU A and CPU B are guaranteed to have executed a full memory * barrier during the execution of synchronize_sched(). The full comment block now reads: /** * synchronize_sched - wait until an rcu-sched grace period has elapsed. * * Control will return to the caller some time after a full rcu-sched * grace period has elapsed, in other words after all currently executing * rcu-sched read-side critical sections have completed. These read-side * critical sections are delimited by rcu_read_lock_sched() and * rcu_read_unlock_sched(), and may be nested. Note that preempt_disable(), * local_irq_disable(), and so on may be used in place of * rcu_read_lock_sched(). * * This means that all preempt_disable code sequences, including NMI and * hardware-interrupt handlers, in progress on entry will have completed * before this primitive returns. However, this does not guarantee that * softirq handlers will have completed, since in some kernels, these * handlers can run in process context, and can block. * * Note that this guarantee implies a further memory-ordering guarantee. 
* On systems with more than one CPU, when synchronize_sched() returns, * each CPU is guaranteed to have executed a full memory barrier since * the end of its last RCU read-side critical section whose beginning * preceded the call to synchronize_sched(). Note that this guarantee * includes CPUs that are offline, idle, or executing in user mode, as * well as CPUs that are executing in the kernel. Furthermore, if CPU A * invoked synchronize_sched(), which returned to its caller on CPU B, * then both CPU A and CPU B are guaranteed to have executed a full memory * barrier during the execution of synchronize_sched(). * * This primitive provides the guarantees made by the (now removed) * synchronize_kernel() API. In contrast, synchronize_rcu() only * guarantees that rcu_read_lock() sections will have completed. * In "classic RCU", these two guarantees happen to be one and * the same, but can differ in realtime RCU implementations. */ If this wording looks good to you, I will apply it to the other grace-period primitives as well. Thanx, Paul ^ permalink raw reply [flat|nested] 103+ messages in thread
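Paul's Task 0 / Task 1 fragment can be modelled in ordinary pthreads, with pthread_join() playing the role of synchronize_sched(): both provide the same style of edge — everything the other task did before finishing happens-before whatever we do after the "grace period" returns. This is an analogy for the ordering guarantee only, not for how RCU is implemented:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <assert.h>

static _Atomic int x;

static void *task1(void *arg)
{
    (void)arg;
    /* rcu_read_lock_sched();   -- the read-side critical section */
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    /* rcu_read_unlock_sched(); */
    return NULL;
}

static int task0_after_grace_period(void)
{
    pthread_t t;
    pthread_create(&t, NULL, task1, NULL);
    /* pthread_join() stands in for synchronize_sched(): the reader's
     * critical section is guaranteed to be over, and fully visible,
     * before the next statement runs. */
    pthread_join(t, NULL);
    return atomic_load_explicit(&x, memory_order_relaxed);   /* r1 */
}
```

As in the mail, nothing here forces the grace period to start after the read-side section; the claim is only that if it does, r1 must be 1.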
* Re: [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers 2012-10-23 18:05 ` Paul E. McKenney @ 2012-10-23 18:27 ` Oleg Nesterov 2012-10-23 18:41 ` Oleg Nesterov 1 sibling, 0 replies; 103+ messages in thread From: Oleg Nesterov @ 2012-10-23 18:27 UTC (permalink / raw) To: Paul E. McKenney Cc: Mikulas Patocka, Linus Torvalds, Ingo Molnar, Peter Zijlstra, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On 10/23, Paul E. McKenney wrote: > > On Tue, Oct 23, 2012 at 06:59:12PM +0200, Oleg Nesterov wrote: > > Not really the comment, but the question... > > > > On 10/22, Mikulas Patocka wrote: > > > > > > static inline void percpu_down_read(struct percpu_rw_semaphore *p) > > > { > > > rcu_read_lock(); > > > @@ -24,22 +27,12 @@ static inline void percpu_down_read(stru > > > } > > > this_cpu_inc(*p->counters); > > > rcu_read_unlock(); > > > + light_mb(); /* A, between read of p->locked and read of data, paired with D */ > > > } > > > > rcu_read_unlock() (or even preempt_enable) should have compiler barrier > > semantics... But I agree, this adds more documentation for free. > > Although rcu_read_lock() does have compiler-barrier semantics if > CONFIG_PREEMPT=y, it does not for CONFIG_PREEMPT=n. So the > light_mb() (which appears to be barrier()) is needed in that case. Indeed, I missed this. > > Ignoring the current implementation, according to the documentation > > synchronize_sched() has all rights to return immediately if there is > > no active rcu_read_lock_sched() section. If this were possible, than > > percpu_up_read() lacks mb. > > Even if there happen to be no RCU-sched read-side critical sections > at the current instant, synchronize_sched() is required to make sure > that everyone agrees that whatever code is executed by the caller after > synchronize_sched() returns happens after any of the preceding RCU > read-side critical sections. 
> > So, if we have this, with x==0 initially: > > Task 0 Task 1 > > rcu_read_lock_sched(); > x = 1; > rcu_read_unlock_sched(); > synchronize_sched(); > r1 = x; > > Then the value of r1 had better be one. Yes, yes, this too. ("active rcu_read_lock_sched() section" above was confusing, I agree). > * Note that this guarantee implies a further memory-ordering guarantee. > * On systems with more than one CPU, when synchronize_sched() returns, > * each CPU is guaranteed to have executed a full memory barrier since > * the end of its last RCU read-side critical section whose beginning > * preceded the call to synchronize_sched(). Note that this guarantee > * includes CPUs that are offline, idle, or executing in user mode, as > * well as CPUs that are executing in the kernel. Furthermore, if CPU A > * invoked synchronize_sched(), which returned to its caller on CPU B, > * then both CPU A and CPU B are guaranteed to have executed a full memory > * barrier during the execution of synchronize_sched(). Great! Thanks Paul. Oleg. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers 2012-10-23 18:05 ` Paul E. McKenney 2012-10-23 18:27 ` Oleg Nesterov @ 2012-10-23 18:41 ` Oleg Nesterov 2012-10-23 20:29 ` Paul E. McKenney 1 sibling, 1 reply; 103+ messages in thread From: Oleg Nesterov @ 2012-10-23 18:41 UTC (permalink / raw) To: Paul E. McKenney Cc: Mikulas Patocka, Linus Torvalds, Ingo Molnar, Peter Zijlstra, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On 10/23, Paul E. McKenney wrote: > > * Note that this guarantee implies a further memory-ordering guarantee. > * On systems with more than one CPU, when synchronize_sched() returns, > * each CPU is guaranteed to have executed a full memory barrier since > * the end of its last RCU read-side critical section ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Ah wait... I misread this comment. But this patch needs more? Or I misunderstood. There is no RCU unlock in percpu_up_read(). IOW. Suppose the code does percpu_down_read(); x = PROTECTED_BY_THIS_RW_SEM; percpu_up_read(); Without mb() the load above can be reordered with this_cpu_dec() in percpu_up_read(). However, we do not care if we can guarantee that the next percpu_down_write() can not return (iow, the next "write" section can not start) until this load is complete. And I _think_ that another synchronize_sched() in percpu_down_write() added by this patch should work. But, "since the end of its last RCU read-side critical section" does not look sufficient. Or have I misunderstood you/Mikulas/both? Oleg. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers 2012-10-23 18:41 ` Oleg Nesterov @ 2012-10-23 20:29 ` Paul E. McKenney 2012-10-23 20:32 ` Paul E. McKenney 0 siblings, 1 reply; 103+ messages in thread From: Paul E. McKenney @ 2012-10-23 20:29 UTC (permalink / raw) To: Oleg Nesterov Cc: Mikulas Patocka, Linus Torvalds, Ingo Molnar, Peter Zijlstra, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On Tue, Oct 23, 2012 at 08:41:23PM +0200, Oleg Nesterov wrote: > On 10/23, Paul E. McKenney wrote: > > > > * Note that this guarantee implies a further memory-ordering guarantee. > > * On systems with more than one CPU, when synchronize_sched() returns, > > * each CPU is guaranteed to have executed a full memory barrier since > > * the end of its last RCU read-side critical section > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > Ah wait... I misread this comment. And I miswrote it. It should say "since the end of its last RCU-sched read-side critical section." So, for example, RCU-sched need not force a CPU that is idle, offline, or (eventually) executing in user mode to execute a memory barrier. Fixed this. > But this patch needs more? Or I misunderstood. There is no RCU unlock > in percpu_up_read(). > > IOW. Suppose the code does > > percpu_down_read(); > x = PROTECTED_BY_THIS_RW_SEM; > percpu_up_read(); > > Withoit mb() the load above can be reordered with this_cpu_dec() in > percpu_up_read(). > > However, we do not care if we can guarantee that the next > percpu_down_write() can not return (iow, the next "write" section can > not start) until this load is complete. > > And I _think_ that another synchronize_sched() in percpu_down_write() > added by this patch should work. > > But, "since the end of its last RCU read-side critical section" > does not look enough. > > Or I misundersood you/Mikulas/both ? I clearly need to look more carefully at Mikulas's code... 
Thanx, Paul ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers 2012-10-23 20:29 ` Paul E. McKenney @ 2012-10-23 20:32 ` Paul E. McKenney 2012-10-23 21:39 ` Mikulas Patocka 0 siblings, 1 reply; 103+ messages in thread From: Paul E. McKenney @ 2012-10-23 20:32 UTC (permalink / raw) To: Oleg Nesterov Cc: Mikulas Patocka, Linus Torvalds, Ingo Molnar, Peter Zijlstra, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On Tue, Oct 23, 2012 at 01:29:02PM -0700, Paul E. McKenney wrote: > On Tue, Oct 23, 2012 at 08:41:23PM +0200, Oleg Nesterov wrote: > > On 10/23, Paul E. McKenney wrote: > > > > > > * Note that this guarantee implies a further memory-ordering guarantee. > > > * On systems with more than one CPU, when synchronize_sched() returns, > > > * each CPU is guaranteed to have executed a full memory barrier since > > > * the end of its last RCU read-side critical section > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > > > Ah wait... I misread this comment. > > And I miswrote it. It should say "since the end of its last RCU-sched > read-side critical section." So, for example, RCU-sched need not force > a CPU that is idle, offline, or (eventually) executing in user mode to > execute a memory barrier. Fixed this. And I should hasten to add that for synchronize_sched(), disabling preemption (including disabling irqs, further including NMI handlers) acts as an RCU-sched read-side critical section. (This is in the comment header for synchronize_sched() up above my addition to it.) Thanx, Paul > > But this patch needs more? Or I misunderstood. There is no RCU unlock > > in percpu_up_read(). > > > > IOW. Suppose the code does > > > > percpu_down_read(); > > x = PROTECTED_BY_THIS_RW_SEM; > > percpu_up_read(); > > > > Withoit mb() the load above can be reordered with this_cpu_dec() in > > percpu_up_read(). 
> > > > However, we do not care if we can guarantee that the next > > percpu_down_write() can not return (iow, the next "write" section can > > not start) until this load is complete. > > > > And I _think_ that another synchronize_sched() in percpu_down_write() > > added by this patch should work. > > > > But, "since the end of its last RCU read-side critical section" > > does not look enough. > > > > Or I misundersood you/Mikulas/both ? > > I clearly need to look more carefully at Mikulas's code... > > Thanx, Paul ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers 2012-10-23 20:32 ` Paul E. McKenney @ 2012-10-23 21:39 ` Mikulas Patocka 2012-10-24 16:23 ` Paul E. McKenney 0 siblings, 1 reply; 103+ messages in thread From: Mikulas Patocka @ 2012-10-23 21:39 UTC (permalink / raw) To: Paul E. McKenney Cc: Oleg Nesterov, Linus Torvalds, Ingo Molnar, Peter Zijlstra, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On Tue, 23 Oct 2012, Paul E. McKenney wrote: > On Tue, Oct 23, 2012 at 01:29:02PM -0700, Paul E. McKenney wrote: > > On Tue, Oct 23, 2012 at 08:41:23PM +0200, Oleg Nesterov wrote: > > > On 10/23, Paul E. McKenney wrote: > > > > > > > > * Note that this guarantee implies a further memory-ordering guarantee. > > > > * On systems with more than one CPU, when synchronize_sched() returns, > > > > * each CPU is guaranteed to have executed a full memory barrier since > > > > * the end of its last RCU read-side critical section > > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > > > > > Ah wait... I misread this comment. > > > > And I miswrote it. It should say "since the end of its last RCU-sched > > read-side critical section." So, for example, RCU-sched need not force > > a CPU that is idle, offline, or (eventually) executing in user mode to > > execute a memory barrier. Fixed this. Or you can write "each CPU that is executing kernel code is guaranteed to have executed a full memory barrier". It would be consistent with the current implementation and it would make it possible to use barrier()-synchronize_sched() as biased memory barriers. --- In percpu-rwlocks, CPU 1 executes ...make some writes in the critical section... barrier(); this_cpu_dec(*p->counters); and CPU 2 executes while (__percpu_count(p->counters)) msleep(1); synchronize_sched(); So, when CPU 2 finishes synchronize_sched(), we must make sure that all writes done by CPU 1 are visible to CPU 2.
The current implementation fulfills this requirement; you can just add it to the specification so that whoever changes the implementation keeps it. Mikulas > And I should hasten to add that for synchronize_sched(), disabling > preemption (including disabling irqs, further including NMI handlers) > acts as an RCU-sched read-side critical section. (This is in the > comment header for synchronize_sched() up above my addition to it.) > > Thanx, Paul ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers 2012-10-23 21:39 ` Mikulas Patocka @ 2012-10-24 16:23 ` Paul E. McKenney 2012-10-24 20:22 ` Mikulas Patocka 0 siblings, 1 reply; 103+ messages in thread From: Paul E. McKenney @ 2012-10-24 16:23 UTC (permalink / raw) To: Mikulas Patocka Cc: Oleg Nesterov, Linus Torvalds, Ingo Molnar, Peter Zijlstra, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On Tue, Oct 23, 2012 at 05:39:43PM -0400, Mikulas Patocka wrote: > > > On Tue, 23 Oct 2012, Paul E. McKenney wrote: > > > On Tue, Oct 23, 2012 at 01:29:02PM -0700, Paul E. McKenney wrote: > > > On Tue, Oct 23, 2012 at 08:41:23PM +0200, Oleg Nesterov wrote: > > > > On 10/23, Paul E. McKenney wrote: > > > > > > > > > > * Note that this guarantee implies a further memory-ordering guarantee. > > > > > * On systems with more than one CPU, when synchronize_sched() returns, > > > > > * each CPU is guaranteed to have executed a full memory barrier since > > > > > * the end of its last RCU read-side critical section > > > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > > > > > > > Ah wait... I misread this comment. > > > > > > And I miswrote it. It should say "since the end of its last RCU-sched > > > read-side critical section." So, for example, RCU-sched need not force > > > a CPU that is idle, offline, or (eventually) executing in user mode to > > > execute a memory barrier. Fixed this. > > Or you can write "each CPU that is executing a kernel code is guaranteed > to have executed a full memory barrier". Perhaps I could, but it isn't needed, nor is it particularly helpful. Please see suggestions in preceding email. > It would be consistent with the current implementation and it would make > it possible to use > > barrier()-synchronize_sched() as biased memory barriers. But it is simpler to rely on the properties of RCU. We really should avoid memory barriers where possible, as they are way too easy to get wrong. 
> --- > > In percpu-rwlocks, CPU 1 executes > > ...make some writes in the critical section... > barrier(); > this_cpu_dec(*p->counters); > > and the CPU 2 executes > > while (__percpu_count(p->counters)) > msleep(1); > synchronize_sched(); > > So, when CPU 2 finishes synchronize_sched(), we must make sure that > all writes done by CPU 1 are visible to CPU 2. > > The current implementation fulfills this requirement, you can just add it > to the specification so that whoever changes the implementation keeps it. I will consider doing that if and when someone shows me a situation where adding that requirement makes things simpler and/or faster. From what I can see, your example does not do so. Thanx, Paul > Mikulas > > > And I should hasten to add that for synchronize_sched(), disabling > > preemption (including disabling irqs, further including NMI handlers) > > acts as an RCU-sched read-side critical section. (This is in the > > comment header for synchronize_sched() up above my addition to it.) > > > > Thanx, Paul > ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers 2012-10-24 16:23 ` Paul E. McKenney @ 2012-10-24 20:22 ` Mikulas Patocka 2012-10-24 20:36 ` Paul E. McKenney 0 siblings, 1 reply; 103+ messages in thread From: Mikulas Patocka @ 2012-10-24 20:22 UTC (permalink / raw) To: Paul E. McKenney Cc: Oleg Nesterov, Linus Torvalds, Ingo Molnar, Peter Zijlstra, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On Wed, 24 Oct 2012, Paul E. McKenney wrote: > On Tue, Oct 23, 2012 at 05:39:43PM -0400, Mikulas Patocka wrote: > > > > > > On Tue, 23 Oct 2012, Paul E. McKenney wrote: > > > > > On Tue, Oct 23, 2012 at 01:29:02PM -0700, Paul E. McKenney wrote: > > > > On Tue, Oct 23, 2012 at 08:41:23PM +0200, Oleg Nesterov wrote: > > > > > On 10/23, Paul E. McKenney wrote: > > > > > > > > > > > > * Note that this guarantee implies a further memory-ordering guarantee. > > > > > > * On systems with more than one CPU, when synchronize_sched() returns, > > > > > > * each CPU is guaranteed to have executed a full memory barrier since > > > > > > * the end of its last RCU read-side critical section > > > > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > > > > > > > > > Ah wait... I misread this comment. > > > > > > > > And I miswrote it. It should say "since the end of its last RCU-sched > > > > read-side critical section." So, for example, RCU-sched need not force > > > > a CPU that is idle, offline, or (eventually) executing in user mode to > > > > execute a memory barrier. Fixed this. > > > > Or you can write "each CPU that is executing a kernel code is guaranteed > > to have executed a full memory barrier". > > Perhaps I could, but it isn't needed, nor is it particularly helpful. > Please see suggestions in preceding email. It is helpful, because if you add this requirement (that already holds for the current implementation), you can drop rcu_read_lock_sched() and rcu_read_unlock_sched() from the following code that you submitted. 
static inline void percpu_up_read(struct percpu_rw_semaphore *p) { /* * Decrement our count, but protected by RCU-sched so that * the writer can force proper serialization. */ rcu_read_lock_sched(); this_cpu_dec(*p->counters); rcu_read_unlock_sched(); } > > The current implementation fulfills this requirement, you can just add it > > to the specification so that whoever changes the implementation keeps it. > > I will consider doing that if and when someone shows me a situation where > adding that requirement makes things simpler and/or faster. From what I > can see, your example does not do so. > > Thanx, Paul If you do, the above code can be simplified to: { barrier(); this_cpu_dec(*p->counters); } Mikulas ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers 2012-10-24 20:22 ` Mikulas Patocka @ 2012-10-24 20:36 ` Paul E. McKenney 2012-10-24 20:44 ` Mikulas Patocka 0 siblings, 1 reply; 103+ messages in thread From: Paul E. McKenney @ 2012-10-24 20:36 UTC (permalink / raw) To: Mikulas Patocka Cc: Oleg Nesterov, Linus Torvalds, Ingo Molnar, Peter Zijlstra, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On Wed, Oct 24, 2012 at 04:22:17PM -0400, Mikulas Patocka wrote: > > > On Wed, 24 Oct 2012, Paul E. McKenney wrote: > > > On Tue, Oct 23, 2012 at 05:39:43PM -0400, Mikulas Patocka wrote: > > > > > > > > > On Tue, 23 Oct 2012, Paul E. McKenney wrote: > > > > > > > On Tue, Oct 23, 2012 at 01:29:02PM -0700, Paul E. McKenney wrote: > > > > > On Tue, Oct 23, 2012 at 08:41:23PM +0200, Oleg Nesterov wrote: > > > > > > On 10/23, Paul E. McKenney wrote: > > > > > > > > > > > > > > * Note that this guarantee implies a further memory-ordering guarantee. > > > > > > > * On systems with more than one CPU, when synchronize_sched() returns, > > > > > > > * each CPU is guaranteed to have executed a full memory barrier since > > > > > > > * the end of its last RCU read-side critical section > > > > > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > > > > > > > > > > > Ah wait... I misread this comment. > > > > > > > > > > And I miswrote it. It should say "since the end of its last RCU-sched > > > > > read-side critical section." So, for example, RCU-sched need not force > > > > > a CPU that is idle, offline, or (eventually) executing in user mode to > > > > > execute a memory barrier. Fixed this. > > > > > > Or you can write "each CPU that is executing a kernel code is guaranteed > > > to have executed a full memory barrier". > > > > Perhaps I could, but it isn't needed, nor is it particularly helpful. > > Please see suggestions in preceding email. 
> > It is helpful, because if you add this requirement (that already holds for > the current implementation), you can drop rcu_read_lock_sched() and > rcu_read_unlock_sched() from the following code that you submitted. > > static inline void percpu_up_read(struct percpu_rw_semaphore *p) > { > /* > * Decrement our count, but protected by RCU-sched so that > * the writer can force proper serialization. > */ > rcu_read_lock_sched(); > this_cpu_dec(*p->counters); > rcu_read_unlock_sched(); > } > > > > The current implementation fulfills this requirement, you can just add it > > > to the specification so that whoever changes the implementation keeps it. > > > > I will consider doing that if and when someone shows me a situation where > > adding that requirement makes things simpler and/or faster. From what I > > can see, your example does not do so. > > > > Thanx, Paul > > If you do, the above code can be simplified to: > { > barrier(); > this_cpu_dec(*p->counters); > } The readers are lightweight enough that you are worried about the overhead of rcu_read_lock_sched() and rcu_read_unlock_sched()? Really??? Thanx, Paul ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers 2012-10-24 20:36 ` Paul E. McKenney @ 2012-10-24 20:44 ` Mikulas Patocka 2012-10-24 23:57 ` Paul E. McKenney 0 siblings, 1 reply; 103+ messages in thread From: Mikulas Patocka @ 2012-10-24 20:44 UTC (permalink / raw) To: Paul E. McKenney Cc: Oleg Nesterov, Linus Torvalds, Ingo Molnar, Peter Zijlstra, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On Wed, 24 Oct 2012, Paul E. McKenney wrote: > On Wed, Oct 24, 2012 at 04:22:17PM -0400, Mikulas Patocka wrote: > > > > > > On Wed, 24 Oct 2012, Paul E. McKenney wrote: > > > > > On Tue, Oct 23, 2012 at 05:39:43PM -0400, Mikulas Patocka wrote: > > > > > > > > > > > > On Tue, 23 Oct 2012, Paul E. McKenney wrote: > > > > > > > > > On Tue, Oct 23, 2012 at 01:29:02PM -0700, Paul E. McKenney wrote: > > > > > > On Tue, Oct 23, 2012 at 08:41:23PM +0200, Oleg Nesterov wrote: > > > > > > > On 10/23, Paul E. McKenney wrote: > > > > > > > > > > > > > > > > * Note that this guarantee implies a further memory-ordering guarantee. > > > > > > > > * On systems with more than one CPU, when synchronize_sched() returns, > > > > > > > > * each CPU is guaranteed to have executed a full memory barrier since > > > > > > > > * the end of its last RCU read-side critical section > > > > > > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > > > > > > > > > > > > > Ah wait... I misread this comment. > > > > > > > > > > > > And I miswrote it. It should say "since the end of its last RCU-sched > > > > > > read-side critical section." So, for example, RCU-sched need not force > > > > > > a CPU that is idle, offline, or (eventually) executing in user mode to > > > > > > execute a memory barrier. Fixed this. > > > > > > > > Or you can write "each CPU that is executing a kernel code is guaranteed > > > > to have executed a full memory barrier". > > > > > > Perhaps I could, but it isn't needed, nor is it particularly helpful. 
> > > Please see suggestions in preceding email. > > > > It is helpful, because if you add this requirement (that already holds for > > the current implementation), you can drop rcu_read_lock_sched() and > > rcu_read_unlock_sched() from the following code that you submitted. > > > > static inline void percpu_up_read(struct percpu_rw_semaphore *p) > > { > > /* > > * Decrement our count, but protected by RCU-sched so that > > * the writer can force proper serialization. > > */ > > rcu_read_lock_sched(); > > this_cpu_dec(*p->counters); > > rcu_read_unlock_sched(); > > } > > > > > > The current implementation fulfills this requirement, you can just add it > > > > to the specification so that whoever changes the implementation keeps it. > > > > > > I will consider doing that if and when someone shows me a situation where > > > adding that requirement makes things simpler and/or faster. From what I > > > can see, your example does not do so. > > > > > > Thanx, Paul > > > > If you do, the above code can be simplified to: > > { > > barrier(); > > this_cpu_dec(*p->counters); > > } > > The readers are lightweight enough that you are worried about the overhead > of rcu_read_lock_sched() and rcu_read_unlock_sched()? Really??? > > Thanx, Paul There was no lock in previous kernels, so we should make it as simple as possible. Disabling and reenabling preemption is probably not a big deal, but if we don't have to do it, why do it? Mikulas ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers 2012-10-24 20:44 ` Mikulas Patocka @ 2012-10-24 23:57 ` Paul E. McKenney 2012-10-25 12:39 ` Paul E. McKenney 2012-10-25 13:48 ` Mikulas Patocka 0 siblings, 2 replies; 103+ messages in thread From: Paul E. McKenney @ 2012-10-24 23:57 UTC (permalink / raw) To: Mikulas Patocka Cc: Oleg Nesterov, Linus Torvalds, Ingo Molnar, Peter Zijlstra, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On Wed, Oct 24, 2012 at 04:44:14PM -0400, Mikulas Patocka wrote: > > > On Wed, 24 Oct 2012, Paul E. McKenney wrote: > > > On Wed, Oct 24, 2012 at 04:22:17PM -0400, Mikulas Patocka wrote: > > > > > > > > > On Wed, 24 Oct 2012, Paul E. McKenney wrote: > > > > > > > On Tue, Oct 23, 2012 at 05:39:43PM -0400, Mikulas Patocka wrote: > > > > > > > > > > > > > > > On Tue, 23 Oct 2012, Paul E. McKenney wrote: > > > > > > > > > > > On Tue, Oct 23, 2012 at 01:29:02PM -0700, Paul E. McKenney wrote: > > > > > > > On Tue, Oct 23, 2012 at 08:41:23PM +0200, Oleg Nesterov wrote: > > > > > > > > On 10/23, Paul E. McKenney wrote: > > > > > > > > > > > > > > > > > > * Note that this guarantee implies a further memory-ordering guarantee. > > > > > > > > > * On systems with more than one CPU, when synchronize_sched() returns, > > > > > > > > > * each CPU is guaranteed to have executed a full memory barrier since > > > > > > > > > * the end of its last RCU read-side critical section > > > > > > > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > > > > > > > > > > > > > > > Ah wait... I misread this comment. > > > > > > > > > > > > > > And I miswrote it. It should say "since the end of its last RCU-sched > > > > > > > read-side critical section." So, for example, RCU-sched need not force > > > > > > > a CPU that is idle, offline, or (eventually) executing in user mode to > > > > > > > execute a memory barrier. Fixed this. 
> > > > > > > > > > Or you can write "each CPU that is executing a kernel code is guaranteed > > > > > to have executed a full memory barrier". > > > > > > > > Perhaps I could, but it isn't needed, nor is it particularly helpful. > > > > Please see suggestions in preceding email. > > > > > > It is helpful, because if you add this requirement (that already holds for > > > the current implementation), you can drop rcu_read_lock_sched() and > > > rcu_read_unlock_sched() from the following code that you submitted. > > > > > > static inline void percpu_up_read(struct percpu_rw_semaphore *p) > > > { > > > /* > > > * Decrement our count, but protected by RCU-sched so that > > > * the writer can force proper serialization. > > > */ > > > rcu_read_lock_sched(); > > > this_cpu_dec(*p->counters); > > > rcu_read_unlock_sched(); > > > } > > > > > > > > The current implementation fulfills this requirement, you can just add it > > > > > to the specification so that whoever changes the implementation keeps it. > > > > > > > > I will consider doing that if and when someone shows me a situation where > > > > adding that requirement makes things simpler and/or faster. From what I > > > > can see, your example does not do so. > > > > > > > > Thanx, Paul > > > > > > If you do, the above code can be simplified to: > > > { > > > barrier(); > > > this_cpu_dec(*p->counters); > > > } > > > > The readers are lightweight enough that you are worried about the overhead > > of rcu_read_lock_sched() and rcu_read_unlock_sched()? Really??? > > > > Thanx, Paul > > There was no lock in previous kernels, so we should make it as simple as > possible. Disabling and reenabling preemption is probably not a big deal, > but if don't have to do it, why do it? Because I don't consider the barrier()-paired-with-synchronize_sched() to be a simplification. While we are discussing this, I have been assuming that readers must block from time to time. Is this the case? 
Thanx, Paul ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers 2012-10-24 23:57 ` Paul E. McKenney @ 2012-10-25 12:39 ` Paul E. McKenney 2012-10-25 13:48 ` Mikulas Patocka 1 sibling, 0 replies; 103+ messages in thread From: Paul E. McKenney @ 2012-10-25 12:39 UTC (permalink / raw) To: Mikulas Patocka Cc: Oleg Nesterov, Linus Torvalds, Ingo Molnar, Peter Zijlstra, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On Wed, Oct 24, 2012 at 04:57:35PM -0700, Paul E. McKenney wrote: > On Wed, Oct 24, 2012 at 04:44:14PM -0400, Mikulas Patocka wrote: > > > > > > On Wed, 24 Oct 2012, Paul E. McKenney wrote: > > > > > On Wed, Oct 24, 2012 at 04:22:17PM -0400, Mikulas Patocka wrote: > > > > > > > > > > > > On Wed, 24 Oct 2012, Paul E. McKenney wrote: > > > > > > > > > On Tue, Oct 23, 2012 at 05:39:43PM -0400, Mikulas Patocka wrote: > > > > > > > > > > > > > > > > > > On Tue, 23 Oct 2012, Paul E. McKenney wrote: > > > > > > > > > > > > > On Tue, Oct 23, 2012 at 01:29:02PM -0700, Paul E. McKenney wrote: > > > > > > > > On Tue, Oct 23, 2012 at 08:41:23PM +0200, Oleg Nesterov wrote: > > > > > > > > > On 10/23, Paul E. McKenney wrote: > > > > > > > > > > > > > > > > > > > > * Note that this guarantee implies a further memory-ordering guarantee. > > > > > > > > > > * On systems with more than one CPU, when synchronize_sched() returns, > > > > > > > > > > * each CPU is guaranteed to have executed a full memory barrier since > > > > > > > > > > * the end of its last RCU read-side critical section > > > > > > > > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > > > > > > > > > > > > > > > > > Ah wait... I misread this comment. > > > > > > > > > > > > > > > > And I miswrote it. It should say "since the end of its last RCU-sched > > > > > > > > read-side critical section." So, for example, RCU-sched need not force > > > > > > > > a CPU that is idle, offline, or (eventually) executing in user mode to > > > > > > > > execute a memory barrier. Fixed this. 
> > > > > > > > > > > > Or you can write "each CPU that is executing a kernel code is guaranteed > > > > > > to have executed a full memory barrier". > > > > > > > > > > Perhaps I could, but it isn't needed, nor is it particularly helpful. > > > > > Please see suggestions in preceding email. > > > > > > > > It is helpful, because if you add this requirement (that already holds for > > > > the current implementation), you can drop rcu_read_lock_sched() and > > > > rcu_read_unlock_sched() from the following code that you submitted. > > > > > > > > static inline void percpu_up_read(struct percpu_rw_semaphore *p) > > > > { > > > > /* > > > > * Decrement our count, but protected by RCU-sched so that > > > > * the writer can force proper serialization. > > > > */ > > > > rcu_read_lock_sched(); > > > > this_cpu_dec(*p->counters); > > > > rcu_read_unlock_sched(); > > > > } > > > > > > > > > > The current implementation fulfills this requirement, you can just add it > > > > > > to the specification so that whoever changes the implementation keeps it. > > > > > > > > > > I will consider doing that if and when someone shows me a situation where > > > > > adding that requirement makes things simpler and/or faster. From what I > > > > > can see, your example does not do so. > > > > > > > > > > Thanx, Paul > > > > > > > > If you do, the above code can be simplified to: > > > > { > > > > barrier(); > > > > this_cpu_dec(*p->counters); > > > > } > > > > > > The readers are lightweight enough that you are worried about the overhead > > > of rcu_read_lock_sched() and rcu_read_unlock_sched()? Really??? > > > > > > Thanx, Paul > > > > There was no lock in previous kernels, so we should make it as simple as > > possible. Disabling and reenabling preemption is probably not a big deal, > > but if don't have to do it, why do it? > > Because I don't consider the barrier()-paired-with-synchronize_sched() > to be a simplification. 
In addition, please note that synchronize_srcu() used to guarantee a memory barrier on all online non-idle CPUs, but that it no longer does after Lai Jiangshan's recent rewrite. Given this change, I would have to be quite foolish not to be very reluctant to make this guarantee for other flavors of RCU, unless there was an extremely good reason for it. Dropping a preempt_disable()/preempt_enable() pair doesn't even come close to being a good enough reason. > While we are discussing this, I have been assuming that readers must block > from time to time. Is this the case? And this really is a serious question. If the answer is "no", that readers never block, a much simpler and faster approach is possible. Thanx, Paul ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers 2012-10-24 23:57 ` Paul E. McKenney 2012-10-25 12:39 ` Paul E. McKenney @ 2012-10-25 13:48 ` Mikulas Patocka 1 sibling, 0 replies; 103+ messages in thread From: Mikulas Patocka @ 2012-10-25 13:48 UTC (permalink / raw) To: Paul E. McKenney Cc: Oleg Nesterov, Linus Torvalds, Ingo Molnar, Peter Zijlstra, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On Wed, 24 Oct 2012, Paul E. McKenney wrote: > On Wed, Oct 24, 2012 at 04:44:14PM -0400, Mikulas Patocka wrote: > > > > > > On Wed, 24 Oct 2012, Paul E. McKenney wrote: > > > > > On Wed, Oct 24, 2012 at 04:22:17PM -0400, Mikulas Patocka wrote: > > > > > > > > > > > > On Wed, 24 Oct 2012, Paul E. McKenney wrote: > > > > > > > > > On Tue, Oct 23, 2012 at 05:39:43PM -0400, Mikulas Patocka wrote: > > > > > > > > > > > > > > > > > > On Tue, 23 Oct 2012, Paul E. McKenney wrote: > > > > > > > > > > > > > On Tue, Oct 23, 2012 at 01:29:02PM -0700, Paul E. McKenney wrote: > > > > > > > > On Tue, Oct 23, 2012 at 08:41:23PM +0200, Oleg Nesterov wrote: > > > > > > > > > On 10/23, Paul E. McKenney wrote: > > > > > > > > > > > > > > > > > > > > * Note that this guarantee implies a further memory-ordering guarantee. > > > > > > > > > > * On systems with more than one CPU, when synchronize_sched() returns, > > > > > > > > > > * each CPU is guaranteed to have executed a full memory barrier since > > > > > > > > > > * the end of its last RCU read-side critical section > > > > > > > > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > > > > > > > > > > > > > > > > > Ah wait... I misread this comment. > > > > > > > > > > > > > > > > And I miswrote it. It should say "since the end of its last RCU-sched > > > > > > > > read-side critical section." So, for example, RCU-sched need not force > > > > > > > > a CPU that is idle, offline, or (eventually) executing in user mode to > > > > > > > > execute a memory barrier. Fixed this. 
> > > > > > > > > > > > Or you can write "each CPU that is executing a kernel code is guaranteed > > > > > > to have executed a full memory barrier". > > > > > > > > > > Perhaps I could, but it isn't needed, nor is it particularly helpful. > > > > > Please see suggestions in preceding email. > > > > > > > > It is helpful, because if you add this requirement (that already holds for > > > > the current implementation), you can drop rcu_read_lock_sched() and > > > > rcu_read_unlock_sched() from the following code that you submitted. > > > > > > > > static inline void percpu_up_read(struct percpu_rw_semaphore *p) > > > > { > > > > /* > > > > * Decrement our count, but protected by RCU-sched so that > > > > * the writer can force proper serialization. > > > > */ > > > > rcu_read_lock_sched(); > > > > this_cpu_dec(*p->counters); > > > > rcu_read_unlock_sched(); > > > > } > > > > > > > > > > The current implementation fulfills this requirement, you can just add it > > > > > > to the specification so that whoever changes the implementation keeps it. > > > > > > > > > > I will consider doing that if and when someone shows me a situation where > > > > > adding that requirement makes things simpler and/or faster. From what I > > > > > can see, your example does not do so. > > > > > > > > > > Thanx, Paul > > > > > > > > If you do, the above code can be simplified to: > > > > { > > > > barrier(); > > > > this_cpu_dec(*p->counters); > > > > } > > > > > > The readers are lightweight enough that you are worried about the overhead > > > of rcu_read_lock_sched() and rcu_read_unlock_sched()? Really??? > > > > > > Thanx, Paul > > > > There was no lock in previous kernels, so we should make it as simple as > > possible. Disabling and reenabling preemption is probably not a big deal, > > but if don't have to do it, why do it? > > Because I don't consider the barrier()-paired-with-synchronize_sched() > to be a simplification. 
It is a simplification because it makes the code smaller (just one instruction on x86): this_cpu_dec(*p->counters): 0: 64 ff 08 decl %fs:(%eax) preempt_disable() this_cpu_dec(*p->counters) preempt_enable(): 10: 89 e2 mov %esp,%edx 12: 81 e2 00 e0 ff ff and $0xffffe000,%edx 18: ff 42 14 incl 0x14(%edx) 1b: 64 ff 08 decl %fs:(%eax) 1e: ff 4a 14 decl 0x14(%edx) 21: 8b 42 08 mov 0x8(%edx),%eax 24: a8 08 test $0x8,%al 26: 75 03 jne 2b this_cpu_dec is uninterruptible, so there is no reason why you would want to put preempt_disable and preempt_enable around it. Disabling preemption may actually improve performance on RISC machines. RISC architectures have load/store instructions; they do not have a single instruction that loads a value from memory, decrements it, and writes it back. So, on RISC architectures, this_cpu_dec is implemented as: disable interrupts, load the value, decrement the value, write the value, restore interrupt state. Disabling interrupts is slow because it triggers microcode. For example, on PA-RISC preempt_disable(); (*this_cpu_ptr(counters))--; preempt_enable(); is faster than this_cpu_dec(*counters); But on x86, this_cpu_inc(*counters) is faster. > While we are discussing this, I have been assuming that readers must block > from time to time. Is this the case? > > Thanx, Paul Processes that hold the read lock block in the I/O path - they may block to wait until the data is read from disk. Or for other reasons. Mikulas ^ permalink raw reply [flat|nested] 103+ messages in thread
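Mikulas's comparison can be sketched as two userspace analogues; the names are illustrative and this is not kernel code. A thread-local slot stands in for the per-CPU counter, a plain int models the kernel preempt count, and atomic_signal_fence() plays the role of barrier():

```c
#include <stdatomic.h>

static _Thread_local long counter;       /* per-"CPU" reader count slot */
static _Thread_local int preempt_count;  /* models the kernel preempt count */

/* Variant Mikulas argues for: compiler barrier plus a bare decrement. */
static void up_read_bare(void)
{
    atomic_signal_fence(memory_order_seq_cst);  /* barrier() analogue */
    counter--;
}

/* Variant from the submitted patch, with the RCU-sched read-side
 * section reduced to its preempt-count bookkeeping. */
static void up_read_preempt(void)
{
    preempt_count++;                            /* rcu_read_lock_sched() */
    counter--;
    preempt_count--;                            /* rcu_read_unlock_sched() */
}
```

On x86 the decrement in up_read_bare() can compile to a single memory-operand instruction, which is Mikulas's point; the preempt-count bookkeeping in the second variant is exactly the extra work the bare version avoids.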
* Re: [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers 2012-10-23 16:59 ` [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers Oleg Nesterov 2012-10-23 18:05 ` Paul E. McKenney @ 2012-10-23 19:23 ` Oleg Nesterov 2012-10-23 20:45 ` Peter Zijlstra ` (2 more replies) 1 sibling, 3 replies; 103+ messages in thread From: Oleg Nesterov @ 2012-10-23 19:23 UTC (permalink / raw) To: Mikulas Patocka, Peter Zijlstra, Paul E. McKenney Cc: Linus Torvalds, Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On 10/23, Oleg Nesterov wrote: > > Not really the comment, but the question... Damn. And another question. Mikulas, I am sorry for this (almost) off-topic noise. Let me repeat just in case that I am not arguing with your patches. So write_lock/write_unlock needs to call synchronize_sched() 3 times. I am wondering if it makes any sense to try to make it a bit heavier but faster. What if we change the reader to use local_irq_disable/enable around this_cpu_inc/dec (instead of rcu read lock)? I have to admit, I have no idea how much slower cli/sti is compared to preempt_disable/enable. Then the writer can use static void mb_ipi(void *arg) { smp_mb(); /* unneeded ? */ } static void force_mb_on_each_cpu(void) { smp_mb(); smp_call_function(mb_ipi, NULL, 1); } to a) synchronize with irq_disable and b) insert the necessary mb's. Of course smp_call_function() means more work for each CPU, but write_lock() should be rare... This can also wake up idle CPUs, but probably we can do on_each_cpu_cond(cond_func => !idle_cpu). Perhaps cond_func() can also return false if rcu_user_enter() was called... Actually I was thinking about this from the very beginning, but I do not think this is a good idea. Still I'd like to ask what you think. Oleg. ^ permalink raw reply [flat|nested] 103+ messages in thread
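As a side note, the IPI-broadcast barrier sketched above is essentially what later appeared in Linux as the membarrier() system call. Below is a hedged userspace sketch, assuming a Linux system whose headers define SYS_membarrier; command value 1 is the documented MEMBARRIER_CMD_GLOBAL (originally named MEMBARRIER_CMD_SHARED), which needs no prior registration, and the function degrades to a purely local fence where the syscall is unavailable:

```c
#define _GNU_SOURCE
#include <stdatomic.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Returns 1 if the kernel broadcast the barrier to all running threads,
 * 0 if only the local fence ran (old kernel, or non-Linux headers). */
static int force_mb_on_each_thread(void)
{
    atomic_thread_fence(memory_order_seq_cst);  /* local smp_mb() analogue */
#ifdef SYS_membarrier
    if (syscall(SYS_membarrier, 1 /* MEMBARRIER_CMD_GLOBAL */, 0, 0) == 0)
        return 1;
#endif
    return 0;
}
```

As with Oleg's proposal, the cost model is asymmetric: the broadcast side is expensive (and disturbs every running CPU, the objection raised below), while the readers pay only a compiler barrier.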
* Re: [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers 2012-10-23 19:23 ` Oleg Nesterov @ 2012-10-23 20:45 ` Peter Zijlstra 2012-10-23 20:57 ` Peter Zijlstra 2012-10-23 21:26 ` Mikulas Patocka 2 siblings, 0 replies; 103+ messages in thread From: Peter Zijlstra @ 2012-10-23 20:45 UTC (permalink / raw) To: Oleg Nesterov Cc: Mikulas Patocka, Paul E. McKenney, Linus Torvalds, Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On Tue, 2012-10-23 at 21:23 +0200, Oleg Nesterov wrote: > I have to admit, I have > no idea how much cli/sti is slower compared to preempt_disable/enable. > A lot.. esp on stupid hardware (insert pentium-4 reference), but I think it's more expensive for pretty much all hardware; preempt_disable() is only a non-atomic cpu local increment and a compiler barrier, enable the same and a single conditional. ^ permalink raw reply [flat|nested] 103+ messages in thread
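Peter's point can be made concrete with a userspace model: preempt_disable()/preempt_enable() reduce to a CPU-local counter increment/decrement plus a compiler barrier, with no atomics and no interrupt-flag manipulation. The names below (thread_info_model, preempt_*_model) are illustrative, not the kernel's actual implementation:

```c
#include <assert.h>

/*
 * Userspace model of preempt_disable()/preempt_enable(): a plain,
 * non-atomic increment/decrement of a per-thread counter plus a
 * compiler barrier. The struct and names are hypothetical; the real
 * kernel keeps preempt_count in thread_info and also checks for a
 * pending reschedule on the enable path.
 */
struct thread_info_model {
	int preempt_count;	/* models current_thread_info()->preempt_count */
};

static struct thread_info_model ti_model;

#define compiler_barrier() __asm__ __volatile__("" ::: "memory")

static void preempt_disable_model(void)
{
	ti_model.preempt_count++;	/* plain CPU-local increment */
	compiler_barrier();
}

static void preempt_enable_model(void)
{
	compiler_barrier();
	ti_model.preempt_count--;	/* plain decrement */
	/* the real kernel would check for a pending reschedule here */
}

static int preempt_count_model(void)
{
	return ti_model.preempt_count;
}
```

Compare this with cli/sti, which serializes the pipeline; the model shows why the preempt variant is so much cheaper.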
* Re: [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers 2012-10-23 19:23 ` Oleg Nesterov 2012-10-23 20:45 ` Peter Zijlstra @ 2012-10-23 20:57 ` Peter Zijlstra 2012-10-24 15:11 ` Oleg Nesterov 2012-10-23 21:26 ` Mikulas Patocka 2 siblings, 1 reply; 103+ messages in thread From: Peter Zijlstra @ 2012-10-23 20:57 UTC (permalink / raw) To: Oleg Nesterov Cc: Mikulas Patocka, Paul E. McKenney, Linus Torvalds, Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On Tue, 2012-10-23 at 21:23 +0200, Oleg Nesterov wrote: > > static void mb_ipi(void *arg) > { > smp_mb(); /* unneeded ? */ > } > > static void force_mb_on_each_cpu(void) > { > smp_mb(); > smp_call_function(mb_ipi, NULL, 1); > } You know we're spending an awful lot of time and effort to get rid of such things, right? RT and HPC people absolutely hate these random IPI things. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers 2012-10-23 20:57 ` Peter Zijlstra @ 2012-10-24 15:11 ` Oleg Nesterov 0 siblings, 0 replies; 103+ messages in thread From: Oleg Nesterov @ 2012-10-24 15:11 UTC (permalink / raw) To: Peter Zijlstra Cc: Mikulas Patocka, Paul E. McKenney, Linus Torvalds, Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On 10/23, Peter Zijlstra wrote: > > On Tue, 2012-10-23 at 21:23 +0200, Oleg Nesterov wrote: > > > > static void mb_ipi(void *arg) > > { > > smp_mb(); /* unneeded ? */ > > } > > > > static void force_mb_on_each_cpu(void) > > { > > smp_mb(); > > smp_call_function(mb_ipi, NULL, 1); > > } > > You know we're spending an awful lot of time and effort to get rid of > such things, right? RT and HPC people absolutely hate these random IPI > things. No I do not know ;) but I am not surprised. And, > > I have to admit, I have > > no idea how much cli/sti is slower compared to preempt_disable/enable. > > > A lot.. esp on stupid hardware (insert pentium-4 reference), but I think > it's more expensive for pretty much all hardware, Thanks Peter, this alone answers my question. Oleg. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers 2012-10-23 19:23 ` Oleg Nesterov 2012-10-23 20:45 ` Peter Zijlstra 2012-10-23 20:57 ` Peter Zijlstra @ 2012-10-23 21:26 ` Mikulas Patocka 2 siblings, 0 replies; 103+ messages in thread From: Mikulas Patocka @ 2012-10-23 21:26 UTC (permalink / raw) To: Oleg Nesterov Cc: Peter Zijlstra, Paul E. McKenney, Linus Torvalds, Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On Tue, 23 Oct 2012, Oleg Nesterov wrote: > On 10/23, Oleg Nesterov wrote: > > > > Not really the comment, but the question... > > Damn. And another question. > > Mikulas, I am sorry for this (almost) off-topic noise. Let me repeat > just in case that I am not arguing with your patches. > > > > > So write_lock/write_unlock needs to call synchronize_sched() 3 times. > I am wondering if it makes any sense to try to make it a bit heavier > but faster. > > What if we change the reader to use local_irq_disable/enable around > this_cpu_inc/dec (instead of rcu read lock)? I have to admit, I have > no idea how much cli/sti is slower compared to preempt_disable/enable. > > Then the writer can use > > static void mb_ipi(void *arg) > { > smp_mb(); /* unneeded ? */ > } > > static void force_mb_on_each_cpu(void) > { > smp_mb(); > smp_call_function(mb_ipi, NULL, 1); > } > > to a) synchronise with irq_disable and b) to insert the necessary mb's. > > Of course smp_call_function() means more work for each CPU, but > write_lock() should be rare... > > This can also wake up the idle CPUs, but probably we can do > on_each_cpu_cond(cond_func => !idle_cpu). Perhaps cond_func() can > also return false if rcu_user_enter() was called... > > Actually I was thinking about this from the very beginning, but I do > not feel this looks like a good idea. Still I'd like to ask what > you think. > > Oleg. I think - if we can avoid local_irq_disable/enable, just avoid it (and use barrier-vs-synchronize_kernel). 
Mikulas ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers 2012-10-22 23:37 ` [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers Mikulas Patocka 2012-10-22 23:39 ` [PATCH 2/2] percpu-rw-semaphores: use rcu_read_lock_sched Mikulas Patocka 2012-10-23 16:59 ` [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers Oleg Nesterov @ 2012-10-23 20:32 ` Peter Zijlstra 2 siblings, 0 replies; 103+ messages in thread From: Peter Zijlstra @ 2012-10-23 20:32 UTC (permalink / raw) To: Mikulas Patocka Cc: Linus Torvalds, Oleg Nesterov, Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel, Paul E. McKenney On Mon, 2012-10-22 at 19:37 -0400, Mikulas Patocka wrote: > - /* > - * On X86, write operation in this_cpu_dec serves as a memory unlock > - * barrier (i.e. memory accesses may be moved before the write, but > - * no memory accesses are moved past the write). > - * On other architectures this may not be the case, so we need smp_mb() > - * there. > - */ > -#if defined(CONFIG_X86) && (!defined(CONFIG_X86_PPRO_FENCE) && !defined(CONFIG_X86_OOSTORE)) > - barrier(); > -#else > - smp_mb(); > -#endif > + light_mb(); /* B, between read of the data and write to p->counter, paired with C */ If we're going to invent new primitives for this, shouldn't we call this: smp_unlock_barrier() or something? That at least has well defined semantics. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH 0/2] fix and improvements for percpu-rw-semaphores (was: brw_mutex: big read-write mutex) 2012-10-22 23:36 ` [PATCH 0/2] fix and improvements for percpu-rw-semaphores (was: brw_mutex: big read-write mutex) Mikulas Patocka 2012-10-22 23:37 ` [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers Mikulas Patocka @ 2012-10-30 18:48 ` Oleg Nesterov 2012-10-31 19:41 ` [PATCH 0/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily Oleg Nesterov 1 sibling, 1 reply; 103+ messages in thread From: Oleg Nesterov @ 2012-10-30 18:48 UTC (permalink / raw) To: Mikulas Patocka, Paul E. McKenney, Peter Zijlstra Cc: Linus Torvalds, Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On 10/22, Mikulas Patocka wrote: > > > > Ooooh. And I just noticed include/linux/percpu-rwsem.h which does > > > something similar. Certainly it was not in my tree when I started > > > this patch... percpu_down_write() doesn't allow multiple writers, > > > but the main problem is that it uses msleep(1). It should not, I think. But, since we already have percpu_rw_semaphore, I do not think I can add another similar thing. However, percpu_rw_semaphore is sub-optimal; I am not sure uprobes can use it to block dup_mmap(). Perhaps we can improve it? > > > But. It seems that percpu_up_write() is equally wrong? Doesn't > > > it need synchronize_rcu() before "p->locked = false" ? > > > > > > (add Mikulas) > > > > Mikulas said something about doing an updated patch, so I figured I > > would look at his next version. > > > > Thanx, Paul > > The best ideas proposed in this thread are: > > > Using heavy/light barriers by Lai Jiangshan. So. down_write/up_write does msleep() and it needs to call synchronize_sched() 3 times. This looks like too much. It is not that I am worried about the writers, the problem is that the new readers are blocked completely while the writer sleeps in msleep/synchronize_sched. Paul, Mikulas, et al. 
Could you please look at the new implementation below? Completely untested/uncompiled, just for discussion. Compared to the current implementation, down_read() is still possible while the writer sleeps in synchronize_sched(), but the reader uses rw_semaphore/atomic_inc when it detects the waiting writer. Can this work? Do you think this is better than what we have now? Note: probably we can optimize percpu_down/up_write more, we can "factor out" synchronize_sched(), multiple writers can do this in parallel before they take ->writer_mutex to exclude each other. But this won't affect the readers, and this can be done later. Oleg. ------------------------------------------------------------------------------ struct percpu_rw_semaphore { long __percpu *fast_read_ctr; struct mutex writer_mutex; struct rw_semaphore rw_sem; atomic_t slow_read_ctr; wait_queue_head_t write_waitq; }; static bool update_fast_ctr(struct percpu_rw_semaphore *brw, long val) { bool success = false; preempt_disable(); if (likely(!mutex_is_locked(&brw->writer_mutex))) { __this_cpu_add(*brw->fast_read_ctr, val); success = true; } preempt_enable(); return success; } static long clear_fast_read_ctr(struct percpu_rw_semaphore *brw) { long sum = 0; int cpu; for_each_possible_cpu(cpu) { sum += per_cpu(*brw->fast_read_ctr, cpu); per_cpu(*brw->fast_read_ctr, cpu) = 0; } return sum; } void percpu_down_read(struct percpu_rw_semaphore *brw) { if (likely(update_fast_ctr(brw, +1))) return; down_read(&brw->rw_sem); atomic_inc(&brw->slow_read_ctr); up_read(&brw->rw_sem); } void percpu_up_read(struct percpu_rw_semaphore *brw) { if (likely(update_fast_ctr(brw, -1))) return; if (atomic_dec_and_test(&brw->slow_read_ctr)) wake_up_all(&brw->write_waitq); } void percpu_down_write(struct percpu_rw_semaphore *brw) { mutex_lock(&brw->writer_mutex); /* ensure mutex_is_locked() is visible to the readers */ synchronize_sched(); /* block the new readers */ down_write(&brw->rw_sem); atomic_add(clear_fast_read_ctr(brw), &brw->slow_read_ctr); wait_event(brw->write_waitq, 
!atomic_read(&brw->slow_read_ctr)); } void percpu_up_write(struct percpu_rw_semaphore *brw) { up_write(&brw->rw_sem); /* insert the barrier before the next fast-path in down_read */ synchronize_sched(); mutex_unlock(&brw->writer_mutex); } ^ permalink raw reply [flat|nested] 103+ messages in thread
* [PATCH 0/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily 2012-10-30 18:48 ` [PATCH 0/2] fix and improvements for percpu-rw-semaphores (was: brw_mutex: big read-write mutex) Oleg Nesterov @ 2012-10-31 19:41 ` Oleg Nesterov 2012-10-31 19:41 ` [PATCH 1/1] " Oleg Nesterov 0 siblings, 1 reply; 103+ messages in thread From: Oleg Nesterov @ 2012-10-31 19:41 UTC (permalink / raw) To: Mikulas Patocka, Paul E. McKenney, Peter Zijlstra, Linus Torvalds Cc: Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On 10/30, Oleg Nesterov wrote: > > So. down_write/up_write does msleep() and it needs to call > synchronize_sched() 3 times. > > This looks like too much. It is not that I am worried about the writers, > the problem is that the new readers are blocked completely while the > writer sleeps in msleep/synchronize_sched. > > Paul, Mikulas, et al. Could you please look at the new implementation > below? Completely untested/uncompiled, just for discussion. I tried to test it, it seems to work... But. I guess the only valid test is: pass the review from Paul/Peter. Todo: - add the lockdep annotations - we can speed up the down_write-right-after-up_write case What do you all think? Oleg. ^ permalink raw reply [flat|nested] 103+ messages in thread
* [PATCH 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily 2012-10-31 19:41 ` [PATCH 0/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily Oleg Nesterov @ 2012-10-31 19:41 ` Oleg Nesterov 2012-11-01 15:10 ` Linus Torvalds 2012-11-01 15:43 ` [PATCH 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily Paul E. McKenney 0 siblings, 2 replies; 103+ messages in thread From: Oleg Nesterov @ 2012-10-31 19:41 UTC (permalink / raw) To: Mikulas Patocka, Paul E. McKenney, Peter Zijlstra, Linus Torvalds Cc: Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel Currently the writer does msleep() plus synchronize_sched() 3 times to acquire/release the semaphore, and during this time the readers are blocked completely. Even if the "write" section was not actually started or if it was already finished. With this patch down_read/up_read does synchronize_sched() twice and down_read/up_read are still possible during this time, just they use the slow path. percpu_down_write() first forces the readers to use rw_semaphore and increment the "slow" counter to take the lock for reading, then it takes that rw_semaphore for writing and blocks the readers. Also. With this patch the code relies on the documented behaviour of synchronize_sched(), it doesn't try to pair synchronize_sched() with barrier. 
Signed-off-by: Oleg Nesterov <oleg@redhat.com> --- include/linux/percpu-rwsem.h | 83 +++++--------------------------- lib/Makefile | 2 +- lib/percpu-rwsem.c | 106 ++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 120 insertions(+), 71 deletions(-) create mode 100644 lib/percpu-rwsem.c diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h index 250a4ac..7f738ca 100644 --- a/include/linux/percpu-rwsem.h +++ b/include/linux/percpu-rwsem.h @@ -2,82 +2,25 @@ #define _LINUX_PERCPU_RWSEM_H #include <linux/mutex.h> +#include <linux/rwsem.h> #include <linux/percpu.h> -#include <linux/rcupdate.h> -#include <linux/delay.h> +#include <linux/wait.h> struct percpu_rw_semaphore { - unsigned __percpu *counters; - bool locked; - struct mutex mtx; + int __percpu *fast_read_ctr; + struct mutex writer_mutex; + struct rw_semaphore rw_sem; + atomic_t slow_read_ctr; + wait_queue_head_t write_waitq; }; -#define light_mb() barrier() -#define heavy_mb() synchronize_sched() +extern void percpu_down_read(struct percpu_rw_semaphore *); +extern void percpu_up_read(struct percpu_rw_semaphore *); -static inline void percpu_down_read(struct percpu_rw_semaphore *p) -{ - rcu_read_lock_sched(); - if (unlikely(p->locked)) { - rcu_read_unlock_sched(); - mutex_lock(&p->mtx); - this_cpu_inc(*p->counters); - mutex_unlock(&p->mtx); - return; - } - this_cpu_inc(*p->counters); - rcu_read_unlock_sched(); - light_mb(); /* A, between read of p->locked and read of data, paired with D */ -} +extern void percpu_down_write(struct percpu_rw_semaphore *); +extern void percpu_up_write(struct percpu_rw_semaphore *); -static inline void percpu_up_read(struct percpu_rw_semaphore *p) -{ - light_mb(); /* B, between read of the data and write to p->counter, paired with C */ - this_cpu_dec(*p->counters); -} - -static inline unsigned __percpu_count(unsigned __percpu *counters) -{ - unsigned total = 0; - int cpu; - - for_each_possible_cpu(cpu) - total += ACCESS_ONCE(*per_cpu_ptr(counters, cpu)); - 
- return total; -} - -static inline void percpu_down_write(struct percpu_rw_semaphore *p) -{ - mutex_lock(&p->mtx); - p->locked = true; - synchronize_sched(); /* make sure that all readers exit the rcu_read_lock_sched region */ - while (__percpu_count(p->counters)) - msleep(1); - heavy_mb(); /* C, between read of p->counter and write to data, paired with B */ -} - -static inline void percpu_up_write(struct percpu_rw_semaphore *p) -{ - heavy_mb(); /* D, between write to data and write to p->locked, paired with A */ - p->locked = false; - mutex_unlock(&p->mtx); -} - -static inline int percpu_init_rwsem(struct percpu_rw_semaphore *p) -{ - p->counters = alloc_percpu(unsigned); - if (unlikely(!p->counters)) - return -ENOMEM; - p->locked = false; - mutex_init(&p->mtx); - return 0; -} - -static inline void percpu_free_rwsem(struct percpu_rw_semaphore *p) -{ - free_percpu(p->counters); - p->counters = NULL; /* catch use after free bugs */ -} +extern int percpu_init_rwsem(struct percpu_rw_semaphore *); +extern void percpu_free_rwsem(struct percpu_rw_semaphore *); #endif diff --git a/lib/Makefile b/lib/Makefile index 821a162..4dad4a7 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -12,7 +12,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \ idr.o int_sqrt.o extable.o \ sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \ proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \ - is_single_threaded.o plist.o decompress.o + is_single_threaded.o plist.o decompress.o percpu-rwsem.o lib-$(CONFIG_MMU) += ioremap.o lib-$(CONFIG_SMP) += cpumask.o diff --git a/lib/percpu-rwsem.c b/lib/percpu-rwsem.c new file mode 100644 index 0000000..40a415d --- /dev/null +++ b/lib/percpu-rwsem.c @@ -0,0 +1,106 @@ +#include <linux/percpu-rwsem.h> +#include <linux/rcupdate.h> +#include <linux/sched.h> + +int percpu_init_rwsem(struct percpu_rw_semaphore *brw) +{ + brw->fast_read_ctr = alloc_percpu(int); + if (unlikely(!brw->fast_read_ctr)) + return -ENOMEM; + + 
mutex_init(&brw->writer_mutex); + init_rwsem(&brw->rw_sem); + atomic_set(&brw->slow_read_ctr, 0); + init_waitqueue_head(&brw->write_waitq); + return 0; +} + +void percpu_free_rwsem(struct percpu_rw_semaphore *brw) +{ + free_percpu(brw->fast_read_ctr); + brw->fast_read_ctr = NULL; /* catch use after free bugs */ +} + +static bool update_fast_ctr(struct percpu_rw_semaphore *brw, int val) +{ + bool success = false; + + preempt_disable(); + if (likely(!mutex_is_locked(&brw->writer_mutex))) { + __this_cpu_add(*brw->fast_read_ctr, val); + success = true; + } + preempt_enable(); + + return success; +} + +void percpu_down_read(struct percpu_rw_semaphore *brw) +{ + if (likely(update_fast_ctr(brw, +1))) + return; + + down_read(&brw->rw_sem); + atomic_inc(&brw->slow_read_ctr); + up_read(&brw->rw_sem); +} + +void percpu_up_read(struct percpu_rw_semaphore *brw) +{ + if (likely(update_fast_ctr(brw, -1))) + return; + + /* false-positive is possible but harmless */ + if (atomic_dec_and_test(&brw->slow_read_ctr)) + wake_up_all(&brw->write_waitq); +} + +static int clear_fast_read_ctr(struct percpu_rw_semaphore *brw) +{ + int cpu, sum = 0; + + for_each_possible_cpu(cpu) { + sum += per_cpu(*brw->fast_read_ctr, cpu); + per_cpu(*brw->fast_read_ctr, cpu) = 0; + } + + return sum; +} + +void percpu_down_write(struct percpu_rw_semaphore *brw) +{ + /* also blocks update_fast_ctr() which checks mutex_is_locked() */ + mutex_lock(&brw->writer_mutex); + + /* + * 1. Ensures mutex_is_locked() is visible to any down_read/up_read + * so that update_fast_ctr() can't succeed. + * + * 2. Ensures we see the result of every previous this_cpu_add() in + * update_fast_ctr(). + * + * 3. Ensures that if any reader has exited its critical section via + * fast-path, it executes a full memory barrier before we return. 
+ */ + synchronize_sched(); + + /* nobody can use fast_read_ctr, move its sum into slow_read_ctr */ + atomic_add(clear_fast_read_ctr(brw), &brw->slow_read_ctr); + + /* block the new readers completely */ + down_write(&brw->rw_sem); + + /* wait for all readers to complete their percpu_up_read() */ + wait_event(brw->write_waitq, !atomic_read(&brw->slow_read_ctr)); +} + +void percpu_up_write(struct percpu_rw_semaphore *brw) +{ + /* allow the new readers, but only the slow-path */ + up_write(&brw->rw_sem); + + /* insert the barrier before the next fast-path in down_read */ + synchronize_sched(); + + mutex_unlock(&brw->writer_mutex); +} -- 1.5.5.1 ^ permalink raw reply related [flat|nested] 103+ messages in thread
* Re: [PATCH 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily 2012-10-31 19:41 ` [PATCH 1/1] " Oleg Nesterov @ 2012-11-01 15:10 ` Linus Torvalds 2012-11-01 15:34 ` Oleg Nesterov 2012-11-02 18:06 ` [PATCH v2 0/1] " Oleg Nesterov 2012-11-01 15:43 ` [PATCH 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily Paul E. McKenney 1 sibling, 2 replies; 103+ messages in thread From: Linus Torvalds @ 2012-11-01 15:10 UTC (permalink / raw) To: Oleg Nesterov Cc: Mikulas Patocka, Paul E. McKenney, Peter Zijlstra, Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On Wed, Oct 31, 2012 at 12:41 PM, Oleg Nesterov <oleg@redhat.com> wrote: > Currently the writer does msleep() plus synchronize_sched() 3 times > to acquire/release the semaphore, and during this time the readers > are blocked completely. Even if the "write" section was not actually > started or if it was already finished. > > With this patch down_read/up_read does synchronize_sched() twice and > down_read/up_read are still possible during this time, just they use > the slow path. The changelog is wrong (it's the write path, not read path, that does the synchronize_sched). > struct percpu_rw_semaphore { > - unsigned __percpu *counters; > - bool locked; > - struct mutex mtx; > + int __percpu *fast_read_ctr; This change is wrong. You must not make the 'fast_read_ctr' thing be an int. Or at least you need to be a hell of a lot more careful about it. Why? Because the readers update the counters while possibly moving around cpu's, the increment and decrement of the counters may be on different CPU's. But that means that when you add all the counters together, things can overflow (only the final sum is meaningful). And THAT in turn means that you should not use a signed count, for the simple reason that signed integers don't have well-behaved overflow behavior in C. 
Now, I doubt you'll find an architecture or C compiler where this will actually ever make a difference, but the fact remains that you shouldn't use signed integers for counters like this. You should use unsigned, and you should rely on the well-defined modulo-2**n semantics. I'd also like to see a comment somewhere in the source code about the whole algorithm and the rules. Other than that, I guess it looks ok. Linus ^ permalink raw reply [flat|nested] 103+ messages in thread
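Linus's overflow argument fits in a few lines of userspace C: a reader may do its increment on one CPU and its decrement on another, so an individual counter can wrap; only the modulo-2**n sum over all CPUs is meaningful, and that arithmetic is well-defined only for unsigned types. The array below stands in for real per-CPU data (a sketch, not kernel code):

```c
#include <assert.h>
#include <limits.h>

/*
 * Models the per-CPU fast counters: a reader that migrates between
 * down_read() and up_read() leaves +1 on one CPU and -1 on another.
 * Each counter individually "underflows"; only the unsigned,
 * modulo-2**32 sum of all of them counts the active readers.
 */
#define NCPUS 2

static unsigned int fast_read_ctr[NCPUS];

static unsigned int ctr_sum(void)
{
	unsigned int sum = 0;
	int cpu;

	for (cpu = 0; cpu < NCPUS; cpu++)
		sum += fast_read_ctr[cpu];	/* wraps mod 2**32, by design */
	return sum;
}

static unsigned int demo_migrating_reader(void)
{
	fast_read_ctr[0] += 1;	/* down_read() ran on CPU 0 */
	fast_read_ctr[1] -= 1;	/* up_read() ran on CPU 1 after migration */
	return ctr_sum();
}
```

With a signed counter the same wrap-around would be undefined behavior in C, which is exactly the objection above.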
* Re: [PATCH 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily 2012-11-01 15:10 ` Linus Torvalds @ 2012-11-01 15:34 ` Oleg Nesterov 2012-11-02 18:06 ` [PATCH v2 0/1] " Oleg Nesterov 1 sibling, 0 replies; 103+ messages in thread From: Oleg Nesterov @ 2012-11-01 15:34 UTC (permalink / raw) To: Linus Torvalds Cc: Mikulas Patocka, Paul E. McKenney, Peter Zijlstra, Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel Thanks! I'll send v2 tomorrow. On 11/01, Linus Torvalds wrote: > On Wed, Oct 31, 2012 at 12:41 PM, Oleg Nesterov <oleg@redhat.com> wrote: > > Currently the writer does msleep() plus synchronize_sched() 3 times > > to acquire/release the semaphore, and during this time the readers > > are blocked completely. Even if the "write" section was not actually > > started or if it was already finished. > > > > With this patch down_read/up_read does synchronize_sched() twice and > > down_read/up_read are still possible during this time, just they use > > the slow path. > > The changelog is wrong (it's the write path, not read path, that does > the synchronize_sched). > > > struct percpu_rw_semaphore { > > - unsigned __percpu *counters; > > - bool locked; > > - struct mutex mtx; > > + int __percpu *fast_read_ctr; > > This change is wrong. > > You must not make the 'fast_read_ctr' thing be an int. Or at least you > need to be a hell of a lot more careful about it. > > Why? > > Because the readers update the counters while possibly moving around > cpu's, the increment and decrement of the counters may be on different > CPU's. But that means that when you add all the counters together, > things can overflow (only the final sum is meaningful). And THAT in > turn means that you should not use a signed count, for the simple > reason that signed integers don't have well-behaved overflow behavior > in C. 
> > Now, I doubt you'll find an architecture or C compiler where this will > actually ever make a difference, but the fact remains that you > shouldn't use signed integers for counters like this. You should use > unsigned, and you should rely on the well-defined modulo-2**n > semantics. > > I'd also like to see a comment somewhere in the source code about the > whole algorithm and the rules. > > Other than that, I guess it looks ok. > > Linus ^ permalink raw reply [flat|nested] 103+ messages in thread
* [PATCH v2 0/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily 2012-11-01 15:10 ` Linus Torvalds 2012-11-01 15:34 ` Oleg Nesterov @ 2012-11-02 18:06 ` Oleg Nesterov 2012-11-02 18:06 ` [PATCH v2 1/1] " Oleg Nesterov 2012-11-08 13:48 ` [PATCH RESEND v2 0/1] " Oleg Nesterov 1 sibling, 2 replies; 103+ messages in thread From: Oleg Nesterov @ 2012-11-02 18:06 UTC (permalink / raw) To: Linus Torvalds, Paul E. McKenney Cc: Mikulas Patocka, Peter Zijlstra, Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On 11/01, Linus Torvalds wrote: > > On Wed, Oct 31, 2012 at 12:41 PM, Oleg Nesterov <oleg@redhat.com> wrote: > > > > With this patch down_read/up_read does synchronize_sched() twice and > > down_read/up_read are still possible during this time, just they use > > the slow path. > > The changelog is wrong (it's the write path, not read path, that does > the synchronize_sched). Fixed, thanks. > > struct percpu_rw_semaphore { > > - unsigned __percpu *counters; > > - bool locked; > > - struct mutex mtx; > > + int __percpu *fast_read_ctr; > > This change is wrong. > > You must not make the 'fast_read_ctr' thing be an int. Or at least you > need to be a hell of a lot more careful about it. > > Why? > > Because the readers update the counters while possibly moving around > cpu's, the increment and decrement of the counters may be on different > CPU's. But that means that when you add all the counters together, > things can overflow (only the final sum is meaningful). And THAT in > turn means that you should not use a signed count, for the simple > reason that signed integers don't have well-behaved overflow behavior > in C. Yes, Mikulas has pointed this out too, but I forgot to make it "unsigned". > Now, I doubt you'll find an architecture or C compiler where this will > actually ever make a difference, Yes. And we have other examples, say, mnt->mnt_pcp->mnt_writers is "int". 
> but the fact remains that you > shouldn't use signed integers for counters like this. You should use > unsigned, and you should rely on the well-defined modulo-2**n > semantics. OK, I changed this. But please note that clear_fast_ctr() still returns "int", even if it uses "unsigned" to calculate the result. Because we use this value for atomic_add(int i) and it can actually be negative, so to me it looks a bit better this way even if the generated code is the same. > I'd also like to see a comment somewhere in the source code about the > whole algorithm and the rules. Added the comments before down_read and down_write. > Other than that, I guess it looks ok. Great, please see v2. I am not sure I addressed Paul's concerns, so I guess I need his ack. Oleg. ^ permalink raw reply [flat|nested] 103+ messages in thread
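The clear_fast_ctr() point above can be sketched in userspace C: the per-CPU sum is accumulated in unsigned arithmetic, but converting it back to int recovers the possibly-negative delta that atomic_add() needs. The function name is illustrative, and note that out-of-range unsigned-to-int conversion is implementation-defined in C — two's complement on every target Linux supports:

```c
#include <assert.h>

/*
 * Models clear_fast_ctr(): sum the per-CPU fast counters in unsigned
 * (well-defined modulo-2**n) arithmetic, zero them, and return the
 * total as a signed delta. The delta is negative when, e.g., a reader
 * entered via the slow path (already counted in slow_read_ctr) but
 * exited via the fast path. Assumes two's complement, as Linux does.
 */
static int clear_fast_ctr_model(unsigned int *ctrs, int n)
{
	unsigned int sum = 0;
	int i;

	for (i = 0; i < n; i++) {
		sum += ctrs[i];
		ctrs[i] = 0;
	}
	return (int)sum;	/* may be negative: that is the point */
}
```

Passing this value to atomic_add() then subtracts the already-accounted readers from slow_read_ctr, which is why the signed return type is deliberate.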
* [PATCH v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily 2012-11-02 18:06 ` [PATCH v2 0/1] " Oleg Nesterov @ 2012-11-02 18:06 ` Oleg Nesterov 2012-11-07 17:04 ` [PATCH v3 " Mikulas Patocka 2012-11-08 1:16 ` [PATCH v2 " Paul E. McKenney 2012-11-08 13:48 ` [PATCH RESEND v2 0/1] " Oleg Nesterov 1 sibling, 2 replies; 103+ messages in thread From: Oleg Nesterov @ 2012-11-02 18:06 UTC (permalink / raw) To: Linus Torvalds, Paul E. McKenney Cc: Mikulas Patocka, Peter Zijlstra, Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel Currently the writer does msleep() plus synchronize_sched() 3 times to acquire/release the semaphore, and during this time the readers are blocked completely. Even if the "write" section was not actually started or if it was already finished. With this patch down_write/up_write does synchronize_sched() twice and down_read/up_read are still possible during this time, just they use the slow path. percpu_down_write() first forces the readers to use rw_semaphore and increment the "slow" counter to take the lock for reading, then it takes that rw_semaphore for writing and blocks the readers. Also. With this patch the code relies on the documented behaviour of synchronize_sched(), it doesn't try to pair synchronize_sched() with barrier. 
Signed-off-by: Oleg Nesterov <oleg@redhat.com> --- include/linux/percpu-rwsem.h | 83 +++++------------------------ lib/Makefile | 2 +- lib/percpu-rwsem.c | 123 ++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 137 insertions(+), 71 deletions(-) create mode 100644 lib/percpu-rwsem.c diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h index 250a4ac..592f0d6 100644 --- a/include/linux/percpu-rwsem.h +++ b/include/linux/percpu-rwsem.h @@ -2,82 +2,25 @@ #define _LINUX_PERCPU_RWSEM_H #include <linux/mutex.h> +#include <linux/rwsem.h> #include <linux/percpu.h> -#include <linux/rcupdate.h> -#include <linux/delay.h> +#include <linux/wait.h> struct percpu_rw_semaphore { - unsigned __percpu *counters; - bool locked; - struct mutex mtx; + unsigned int __percpu *fast_read_ctr; + struct mutex writer_mutex; + struct rw_semaphore rw_sem; + atomic_t slow_read_ctr; + wait_queue_head_t write_waitq; }; -#define light_mb() barrier() -#define heavy_mb() synchronize_sched() +extern void percpu_down_read(struct percpu_rw_semaphore *); +extern void percpu_up_read(struct percpu_rw_semaphore *); -static inline void percpu_down_read(struct percpu_rw_semaphore *p) -{ - rcu_read_lock_sched(); - if (unlikely(p->locked)) { - rcu_read_unlock_sched(); - mutex_lock(&p->mtx); - this_cpu_inc(*p->counters); - mutex_unlock(&p->mtx); - return; - } - this_cpu_inc(*p->counters); - rcu_read_unlock_sched(); - light_mb(); /* A, between read of p->locked and read of data, paired with D */ -} +extern void percpu_down_write(struct percpu_rw_semaphore *); +extern void percpu_up_write(struct percpu_rw_semaphore *); -static inline void percpu_up_read(struct percpu_rw_semaphore *p) -{ - light_mb(); /* B, between read of the data and write to p->counter, paired with C */ - this_cpu_dec(*p->counters); -} - -static inline unsigned __percpu_count(unsigned __percpu *counters) -{ - unsigned total = 0; - int cpu; - - for_each_possible_cpu(cpu) - total += ACCESS_ONCE(*per_cpu_ptr(counters, 
cpu)); - - return total; -} - -static inline void percpu_down_write(struct percpu_rw_semaphore *p) -{ - mutex_lock(&p->mtx); - p->locked = true; - synchronize_sched(); /* make sure that all readers exit the rcu_read_lock_sched region */ - while (__percpu_count(p->counters)) - msleep(1); - heavy_mb(); /* C, between read of p->counter and write to data, paired with B */ -} - -static inline void percpu_up_write(struct percpu_rw_semaphore *p) -{ - heavy_mb(); /* D, between write to data and write to p->locked, paired with A */ - p->locked = false; - mutex_unlock(&p->mtx); -} - -static inline int percpu_init_rwsem(struct percpu_rw_semaphore *p) -{ - p->counters = alloc_percpu(unsigned); - if (unlikely(!p->counters)) - return -ENOMEM; - p->locked = false; - mutex_init(&p->mtx); - return 0; -} - -static inline void percpu_free_rwsem(struct percpu_rw_semaphore *p) -{ - free_percpu(p->counters); - p->counters = NULL; /* catch use after free bugs */ -} +extern int percpu_init_rwsem(struct percpu_rw_semaphore *); +extern void percpu_free_rwsem(struct percpu_rw_semaphore *); #endif diff --git a/lib/Makefile b/lib/Makefile index 821a162..4dad4a7 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -12,7 +12,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \ idr.o int_sqrt.o extable.o \ sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \ proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \ - is_single_threaded.o plist.o decompress.o + is_single_threaded.o plist.o decompress.o percpu-rwsem.o lib-$(CONFIG_MMU) += ioremap.o lib-$(CONFIG_SMP) += cpumask.o diff --git a/lib/percpu-rwsem.c b/lib/percpu-rwsem.c new file mode 100644 index 0000000..0e3bc0f --- /dev/null +++ b/lib/percpu-rwsem.c @@ -0,0 +1,123 @@ +#include <linux/percpu-rwsem.h> +#include <linux/rcupdate.h> +#include <linux/sched.h> + +int percpu_init_rwsem(struct percpu_rw_semaphore *brw) +{ + brw->fast_read_ctr = alloc_percpu(unsigned int); + if (unlikely(!brw->fast_read_ctr)) + return -ENOMEM; + + 
mutex_init(&brw->writer_mutex); + init_rwsem(&brw->rw_sem); + atomic_set(&brw->slow_read_ctr, 0); + init_waitqueue_head(&brw->write_waitq); + return 0; +} + +void percpu_free_rwsem(struct percpu_rw_semaphore *brw) +{ + free_percpu(brw->fast_read_ctr); + brw->fast_read_ctr = NULL; /* catch use after free bugs */ +} + +static bool update_fast_ctr(struct percpu_rw_semaphore *brw, unsigned int val) +{ + bool success = false; + + preempt_disable(); + if (likely(!mutex_is_locked(&brw->writer_mutex))) { + __this_cpu_add(*brw->fast_read_ctr, val); + success = true; + } + preempt_enable(); + + return success; +} + +/* + * Like the normal down_read() this is not recursive, the writer can + * come after the first percpu_down_read() and create the deadlock. + */ +void percpu_down_read(struct percpu_rw_semaphore *brw) +{ + if (likely(update_fast_ctr(brw, +1))) + return; + + down_read(&brw->rw_sem); + atomic_inc(&brw->slow_read_ctr); + up_read(&brw->rw_sem); +} + +void percpu_up_read(struct percpu_rw_semaphore *brw) +{ + if (likely(update_fast_ctr(brw, -1))) + return; + + /* false-positive is possible but harmless */ + if (atomic_dec_and_test(&brw->slow_read_ctr)) + wake_up_all(&brw->write_waitq); +} + +static int clear_fast_ctr(struct percpu_rw_semaphore *brw) +{ + unsigned int sum = 0; + int cpu; + + for_each_possible_cpu(cpu) { + sum += per_cpu(*brw->fast_read_ctr, cpu); + per_cpu(*brw->fast_read_ctr, cpu) = 0; + } + + return sum; +} + +/* + * A writer takes ->writer_mutex to exclude other writers and to force the + * readers to switch to the slow mode, note the mutex_is_locked() check in + * update_fast_ctr(). + * + * After that the readers can only inc/dec the slow ->slow_read_ctr counter, + * ->fast_read_ctr is stable. Once the writer moves its sum into the slow + * counter it represents the number of active readers. + * + * Finally the writer takes ->rw_sem for writing and blocks the new readers, + * then waits until the slow counter becomes zero. 
+ */ +void percpu_down_write(struct percpu_rw_semaphore *brw) +{ + /* also blocks update_fast_ctr() which checks mutex_is_locked() */ + mutex_lock(&brw->writer_mutex); + + /* + * 1. Ensures mutex_is_locked() is visible to any down_read/up_read + * so that update_fast_ctr() can't succeed. + * + * 2. Ensures we see the result of every previous this_cpu_add() in + * update_fast_ctr(). + * + * 3. Ensures that if any reader has exited its critical section via + * fast-path, it executes a full memory barrier before we return. + */ + synchronize_sched(); + + /* nobody can use fast_read_ctr, move its sum into slow_read_ctr */ + atomic_add(clear_fast_ctr(brw), &brw->slow_read_ctr); + + /* block the new readers completely */ + down_write(&brw->rw_sem); + + /* wait for all readers to complete their percpu_up_read() */ + wait_event(brw->write_waitq, !atomic_read(&brw->slow_read_ctr)); +} + +void percpu_up_write(struct percpu_rw_semaphore *brw) +{ + /* allow the new readers, but only the slow-path */ + up_write(&brw->rw_sem); + + /* insert the barrier before the next fast-path in down_read */ + synchronize_sched(); + + mutex_unlock(&brw->writer_mutex); +} -- 1.5.5.1 ^ permalink raw reply related [flat|nested] 103+ messages in thread
* [PATCH v3 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily 2012-11-02 18:06 ` [PATCH v2 1/1] " Oleg Nesterov @ 2012-11-07 17:04 ` Mikulas Patocka 2012-11-07 17:47 ` Oleg Nesterov 2012-11-08 1:23 ` Paul E. McKenney 2012-11-08 1:16 ` [PATCH v2 " Paul E. McKenney 1 sibling, 2 replies; 103+ messages in thread From: Mikulas Patocka @ 2012-11-07 17:04 UTC (permalink / raw) To: Oleg Nesterov Cc: Linus Torvalds, Paul E. McKenney, Peter Zijlstra, Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel It looks sensible. Here I'm sending an improvement of the patch - I changed it so that there are not two-level nested functions for the fast path and so that both percpu_down_read and percpu_up_read use the same piece of code (to reduce cache footprint). --- Currently the writer does msleep() plus synchronize_sched() 3 times to acquire/release the semaphore, and during this time the readers are blocked completely. Even if the "write" section was not actually started or if it was already finished. With this patch down_write/up_write does synchronize_sched() twice and down_read/up_read are still possible during this time, just they use the slow path. percpu_down_write() first forces the readers to use rw_semaphore and increment the "slow" counter to take the lock for reading, then it takes that rw_semaphore for writing and blocks the readers. Also. With this patch the code relies on the documented behaviour of synchronize_sched(), it doesn't try to pair synchronize_sched() with barrier. 
Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> --- include/linux/percpu-rwsem.h | 80 ++++++------------------------- lib/Makefile | 2 lib/percpu-rwsem.c | 110 +++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 127 insertions(+), 65 deletions(-) create mode 100644 lib/percpu-rwsem.c Index: linux-3.6.6-fast/include/linux/percpu-rwsem.h =================================================================== --- linux-3.6.6-fast.orig/include/linux/percpu-rwsem.h 2012-11-05 16:21:29.000000000 +0100 +++ linux-3.6.6-fast/include/linux/percpu-rwsem.h 2012-11-07 16:44:04.000000000 +0100 @@ -2,82 +2,34 @@ #define _LINUX_PERCPU_RWSEM_H #include <linux/mutex.h> +#include <linux/rwsem.h> #include <linux/percpu.h> -#include <linux/rcupdate.h> -#include <linux/delay.h> +#include <linux/wait.h> struct percpu_rw_semaphore { - unsigned __percpu *counters; - bool locked; - struct mutex mtx; + unsigned int __percpu *fast_read_ctr; + struct mutex writer_mutex; + struct rw_semaphore rw_sem; + atomic_t slow_read_ctr; + wait_queue_head_t write_waitq; }; -#define light_mb() barrier() -#define heavy_mb() synchronize_sched() +extern void __percpu_down_up_read(struct percpu_rw_semaphore *, int); -static inline void percpu_down_read(struct percpu_rw_semaphore *p) -{ - rcu_read_lock_sched(); - if (unlikely(p->locked)) { - rcu_read_unlock_sched(); - mutex_lock(&p->mtx); - this_cpu_inc(*p->counters); - mutex_unlock(&p->mtx); - return; - } - this_cpu_inc(*p->counters); - rcu_read_unlock_sched(); - light_mb(); /* A, between read of p->locked and read of data, paired with D */ -} - -static inline void percpu_up_read(struct percpu_rw_semaphore *p) -{ - light_mb(); /* B, between read of the data and write to p->counter, paired with C */ - this_cpu_dec(*p->counters); -} - -static inline unsigned __percpu_count(unsigned __percpu *counters) -{ - unsigned total = 0; - int cpu; - - for_each_possible_cpu(cpu) - total += 
ACCESS_ONCE(*per_cpu_ptr(counters, cpu)); +extern void percpu_down_write(struct percpu_rw_semaphore *); +extern void percpu_up_write(struct percpu_rw_semaphore *); - return total; -} - -static inline void percpu_down_write(struct percpu_rw_semaphore *p) -{ - mutex_lock(&p->mtx); - p->locked = true; - synchronize_sched(); /* make sure that all readers exit the rcu_read_lock_sched region */ - while (__percpu_count(p->counters)) - msleep(1); - heavy_mb(); /* C, between read of p->counter and write to data, paired with B */ -} - -static inline void percpu_up_write(struct percpu_rw_semaphore *p) -{ - heavy_mb(); /* D, between write to data and write to p->locked, paired with A */ - p->locked = false; - mutex_unlock(&p->mtx); -} +extern int percpu_init_rwsem(struct percpu_rw_semaphore *); +extern void percpu_free_rwsem(struct percpu_rw_semaphore *); -static inline int percpu_init_rwsem(struct percpu_rw_semaphore *p) +static inline void percpu_down_read(struct percpu_rw_semaphore *s) { - p->counters = alloc_percpu(unsigned); - if (unlikely(!p->counters)) - return -ENOMEM; - p->locked = false; - mutex_init(&p->mtx); - return 0; + __percpu_down_up_read(s, 1); } -static inline void percpu_free_rwsem(struct percpu_rw_semaphore *p) +static inline void percpu_up_read(struct percpu_rw_semaphore *s) { - free_percpu(p->counters); - p->counters = NULL; /* catch use after free bugs */ + __percpu_down_up_read(s, -1); } #endif Index: linux-3.6.6-fast/lib/Makefile =================================================================== --- linux-3.6.6-fast.orig/lib/Makefile 2012-10-02 00:47:57.000000000 +0200 +++ linux-3.6.6-fast/lib/Makefile 2012-11-07 03:10:44.000000000 +0100 @@ -12,7 +12,7 @@ lib-y := ctype.o string.o vsprintf.o cmd idr.o int_sqrt.o extable.o prio_tree.o \ sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \ proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \ - is_single_threaded.o plist.o decompress.o + is_single_threaded.o plist.o decompress.o 
percpu-rwsem.o lib-$(CONFIG_MMU) += ioremap.o lib-$(CONFIG_SMP) += cpumask.o Index: linux-3.6.6-fast/lib/percpu-rwsem.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-3.6.6-fast/lib/percpu-rwsem.c 2012-11-07 16:43:27.000000000 +0100 @@ -0,0 +1,110 @@ +#include <linux/percpu-rwsem.h> +#include <linux/rcupdate.h> +#include <linux/sched.h> +#include <linux/module.h> + +int percpu_init_rwsem(struct percpu_rw_semaphore *brw) +{ + brw->fast_read_ctr = alloc_percpu(int); + if (unlikely(!brw->fast_read_ctr)) + return -ENOMEM; + + mutex_init(&brw->writer_mutex); + init_rwsem(&brw->rw_sem); + atomic_set(&brw->slow_read_ctr, 0); + init_waitqueue_head(&brw->write_waitq); + return 0; +} +EXPORT_SYMBOL(percpu_init_rwsem); + +void percpu_free_rwsem(struct percpu_rw_semaphore *brw) +{ + free_percpu(brw->fast_read_ctr); + brw->fast_read_ctr = NULL; /* catch use after free bugs */ +} +EXPORT_SYMBOL(percpu_free_rwsem); + +void __percpu_down_up_read(struct percpu_rw_semaphore *brw, int val) +{ + preempt_disable(); + if (likely(!mutex_is_locked(&brw->writer_mutex))) { + __this_cpu_add(*brw->fast_read_ctr, val); + preempt_enable(); + return; + } + preempt_enable(); + if (val >= 0) { + down_read(&brw->rw_sem); + atomic_inc(&brw->slow_read_ctr); + up_read(&brw->rw_sem); + } else { + if (atomic_dec_and_test(&brw->slow_read_ctr)) + wake_up_all(&brw->write_waitq); + } +} +EXPORT_SYMBOL(__percpu_down_up_read); + +static int clear_fast_ctr(struct percpu_rw_semaphore *brw) +{ + unsigned int sum = 0; + int cpu; + + for_each_possible_cpu(cpu) { + sum += per_cpu(*brw->fast_read_ctr, cpu); + per_cpu(*brw->fast_read_ctr, cpu) = 0; + } + + return sum; +} + +/* + * A writer takes ->writer_mutex to exclude other writers and to force the + * readers to switch to the slow mode, note the mutex_is_locked() check in + * update_fast_ctr(). 
+ * + * After that the readers can only inc/dec the slow ->slow_read_ctr counter, + * ->fast_read_ctr is stable. Once the writer moves its sum into the slow + * counter it represents the number of active readers. + * + * Finally the writer takes ->rw_sem for writing and blocks the new readers, + * then waits until the slow counter becomes zero. + */ +void percpu_down_write(struct percpu_rw_semaphore *brw) +{ + /* also blocks update_fast_ctr() which checks mutex_is_locked() */ + mutex_lock(&brw->writer_mutex); + + /* + * 1. Ensures mutex_is_locked() is visible to any down_read/up_read + * so that update_fast_ctr() can't succeed. + * + * 2. Ensures we see the result of every previous this_cpu_add() in + * update_fast_ctr(). + * + * 3. Ensures that if any reader has exited its critical section via + * fast-path, it executes a full memory barrier before we return. + */ + synchronize_sched(); + + /* nobody can use fast_read_ctr, move its sum into slow_read_ctr */ + atomic_add(clear_fast_ctr(brw), &brw->slow_read_ctr); + + /* block the new readers completely */ + down_write(&brw->rw_sem); + + /* wait for all readers to complete their percpu_up_read() */ + wait_event(brw->write_waitq, !atomic_read(&brw->slow_read_ctr)); +} +EXPORT_SYMBOL(percpu_down_write); + +void percpu_up_write(struct percpu_rw_semaphore *brw) +{ + /* allow the new readers, but only the slow-path */ + up_write(&brw->rw_sem); + + /* insert the barrier before the next fast-path in down_read */ + synchronize_sched(); + + mutex_unlock(&brw->writer_mutex); +} +EXPORT_SYMBOL(percpu_up_write); ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v3 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily 2012-11-07 17:04 ` [PATCH v3 " Mikulas Patocka @ 2012-11-07 17:47 ` Oleg Nesterov 2012-11-07 19:17 ` Mikulas Patocka 2012-11-08 1:23 ` Paul E. McKenney 1 sibling, 1 reply; 103+ messages in thread From: Oleg Nesterov @ 2012-11-07 17:47 UTC (permalink / raw) To: Mikulas Patocka Cc: Linus Torvalds, Paul E. McKenney, Peter Zijlstra, Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On 11/07, Mikulas Patocka wrote: > > It looks sensible. > > Here I'm sending an improvement of the patch - I changed it so that there > are not two-level nested functions for the fast path and so that both > percpu_down_read and percpu_up_read use the same piece of code (to reduce > cache footprint). IOW, the only change is that you eliminate "static update_fast_ctr()" and fold it into down/up_read which takes the additional argument. Honestly, personally I do not think this is better, but I won't argue. I agree with everything but I guess we need the ack from Paul. As for EXPORT_SYMBOL, I do not mind of course. But currently the only user is block_dev.c. > Currently the writer does msleep() plus synchronize_sched() 3 times > to acquire/release the semaphore, and during this time the readers > are blocked completely. Even if the "write" section was not actually > started or if it was already finished. > > With this patch down_write/up_write does synchronize_sched() twice > and down_read/up_read are still possible during this time, just they > use the slow path. > > percpu_down_write() first forces the readers to use rw_semaphore and > increment the "slow" counter to take the lock for reading, then it > takes that rw_semaphore for writing and blocks the readers. > > Also. With this patch the code relies on the documented behaviour of > synchronize_sched(), it doesn't try to pair synchronize_sched() with > barrier. 
> > Signed-off-by: Oleg Nesterov <oleg@redhat.com> > Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> > --- > include/linux/percpu-rwsem.h | 80 ++++++------------------------- > lib/Makefile | 2 > lib/percpu-rwsem.c | 110 +++++++++++++++++++++++++++++++++++++++++++ > 3 files changed, 127 insertions(+), 65 deletions(-) > create mode 100644 lib/percpu-rwsem.c > > Index: linux-3.6.6-fast/include/linux/percpu-rwsem.h > =================================================================== > --- linux-3.6.6-fast.orig/include/linux/percpu-rwsem.h 2012-11-05 16:21:29.000000000 +0100 > +++ linux-3.6.6-fast/include/linux/percpu-rwsem.h 2012-11-07 16:44:04.000000000 +0100 > @@ -2,82 +2,34 @@ > #define _LINUX_PERCPU_RWSEM_H > > #include <linux/mutex.h> > +#include <linux/rwsem.h> > #include <linux/percpu.h> > -#include <linux/rcupdate.h> > -#include <linux/delay.h> > +#include <linux/wait.h> > > struct percpu_rw_semaphore { > - unsigned __percpu *counters; > - bool locked; > - struct mutex mtx; > + unsigned int __percpu *fast_read_ctr; > + struct mutex writer_mutex; > + struct rw_semaphore rw_sem; > + atomic_t slow_read_ctr; > + wait_queue_head_t write_waitq; > }; > > -#define light_mb() barrier() > -#define heavy_mb() synchronize_sched() > +extern void __percpu_down_up_read(struct percpu_rw_semaphore *, int); > > -static inline void percpu_down_read(struct percpu_rw_semaphore *p) > -{ > - rcu_read_lock_sched(); > - if (unlikely(p->locked)) { > - rcu_read_unlock_sched(); > - mutex_lock(&p->mtx); > - this_cpu_inc(*p->counters); > - mutex_unlock(&p->mtx); > - return; > - } > - this_cpu_inc(*p->counters); > - rcu_read_unlock_sched(); > - light_mb(); /* A, between read of p->locked and read of data, paired with D */ > -} > - > -static inline void percpu_up_read(struct percpu_rw_semaphore *p) > -{ > - light_mb(); /* B, between read of the data and write to p->counter, paired with C */ > - this_cpu_dec(*p->counters); > -} > - > -static inline unsigned __percpu_count(unsigned 
__percpu *counters) > -{ > - unsigned total = 0; > - int cpu; > - > - for_each_possible_cpu(cpu) > - total += ACCESS_ONCE(*per_cpu_ptr(counters, cpu)); > +extern void percpu_down_write(struct percpu_rw_semaphore *); > +extern void percpu_up_write(struct percpu_rw_semaphore *); > > - return total; > -} > - > -static inline void percpu_down_write(struct percpu_rw_semaphore *p) > -{ > - mutex_lock(&p->mtx); > - p->locked = true; > - synchronize_sched(); /* make sure that all readers exit the rcu_read_lock_sched region */ > - while (__percpu_count(p->counters)) > - msleep(1); > - heavy_mb(); /* C, between read of p->counter and write to data, paired with B */ > -} > - > -static inline void percpu_up_write(struct percpu_rw_semaphore *p) > -{ > - heavy_mb(); /* D, between write to data and write to p->locked, paired with A */ > - p->locked = false; > - mutex_unlock(&p->mtx); > -} > +extern int percpu_init_rwsem(struct percpu_rw_semaphore *); > +extern void percpu_free_rwsem(struct percpu_rw_semaphore *); > > -static inline int percpu_init_rwsem(struct percpu_rw_semaphore *p) > +static inline void percpu_down_read(struct percpu_rw_semaphore *s) > { > - p->counters = alloc_percpu(unsigned); > - if (unlikely(!p->counters)) > - return -ENOMEM; > - p->locked = false; > - mutex_init(&p->mtx); > - return 0; > + __percpu_down_up_read(s, 1); > } > > -static inline void percpu_free_rwsem(struct percpu_rw_semaphore *p) > +static inline void percpu_up_read(struct percpu_rw_semaphore *s) > { > - free_percpu(p->counters); > - p->counters = NULL; /* catch use after free bugs */ > + __percpu_down_up_read(s, -1); > } > > #endif > Index: linux-3.6.6-fast/lib/Makefile > =================================================================== > --- linux-3.6.6-fast.orig/lib/Makefile 2012-10-02 00:47:57.000000000 +0200 > +++ linux-3.6.6-fast/lib/Makefile 2012-11-07 03:10:44.000000000 +0100 > @@ -12,7 +12,7 @@ lib-y := ctype.o string.o vsprintf.o cmd > idr.o int_sqrt.o extable.o prio_tree.o \ > 
sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \ > proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \ > - is_single_threaded.o plist.o decompress.o > + is_single_threaded.o plist.o decompress.o percpu-rwsem.o > > lib-$(CONFIG_MMU) += ioremap.o > lib-$(CONFIG_SMP) += cpumask.o > Index: linux-3.6.6-fast/lib/percpu-rwsem.c > =================================================================== > --- /dev/null 1970-01-01 00:00:00.000000000 +0000 > +++ linux-3.6.6-fast/lib/percpu-rwsem.c 2012-11-07 16:43:27.000000000 +0100 > @@ -0,0 +1,110 @@ > +#include <linux/percpu-rwsem.h> > +#include <linux/rcupdate.h> > +#include <linux/sched.h> > +#include <linux/module.h> > + > +int percpu_init_rwsem(struct percpu_rw_semaphore *brw) > +{ > + brw->fast_read_ctr = alloc_percpu(int); > + if (unlikely(!brw->fast_read_ctr)) > + return -ENOMEM; > + > + mutex_init(&brw->writer_mutex); > + init_rwsem(&brw->rw_sem); > + atomic_set(&brw->slow_read_ctr, 0); > + init_waitqueue_head(&brw->write_waitq); > + return 0; > +} > +EXPORT_SYMBOL(percpu_init_rwsem); > + > +void percpu_free_rwsem(struct percpu_rw_semaphore *brw) > +{ > + free_percpu(brw->fast_read_ctr); > + brw->fast_read_ctr = NULL; /* catch use after free bugs */ > +} > +EXPORT_SYMBOL(percpu_free_rwsem); > + > +void __percpu_down_up_read(struct percpu_rw_semaphore *brw, int val) > +{ > + preempt_disable(); > + if (likely(!mutex_is_locked(&brw->writer_mutex))) { > + __this_cpu_add(*brw->fast_read_ctr, val); > + preempt_enable(); > + return; > + } > + preempt_enable(); > + if (val >= 0) { > + down_read(&brw->rw_sem); > + atomic_inc(&brw->slow_read_ctr); > + up_read(&brw->rw_sem); > + } else { > + if (atomic_dec_and_test(&brw->slow_read_ctr)) > + wake_up_all(&brw->write_waitq); > + } > +} > +EXPORT_SYMBOL(__percpu_down_up_read); > + > +static int clear_fast_ctr(struct percpu_rw_semaphore *brw) > +{ > + unsigned int sum = 0; > + int cpu; > + > + for_each_possible_cpu(cpu) { > + sum += per_cpu(*brw->fast_read_ctr, 
cpu); > + per_cpu(*brw->fast_read_ctr, cpu) = 0; > + } > + > + return sum; > +} > + > +/* > + * A writer takes ->writer_mutex to exclude other writers and to force the > + * readers to switch to the slow mode, note the mutex_is_locked() check in > + * update_fast_ctr(). > + * > + * After that the readers can only inc/dec the slow ->slow_read_ctr counter, > + * ->fast_read_ctr is stable. Once the writer moves its sum into the slow > + * counter it represents the number of active readers. > + * > + * Finally the writer takes ->rw_sem for writing and blocks the new readers, > + * then waits until the slow counter becomes zero. > + */ > +void percpu_down_write(struct percpu_rw_semaphore *brw) > +{ > + /* also blocks update_fast_ctr() which checks mutex_is_locked() */ > + mutex_lock(&brw->writer_mutex); > + > + /* > + * 1. Ensures mutex_is_locked() is visible to any down_read/up_read > + * so that update_fast_ctr() can't succeed. > + * > + * 2. Ensures we see the result of every previous this_cpu_add() in > + * update_fast_ctr(). > + * > + * 3. Ensures that if any reader has exited its critical section via > + * fast-path, it executes a full memory barrier before we return. > + */ > + synchronize_sched(); > + > + /* nobody can use fast_read_ctr, move its sum into slow_read_ctr */ > + atomic_add(clear_fast_ctr(brw), &brw->slow_read_ctr); > + > + /* block the new readers completely */ > + down_write(&brw->rw_sem); > + > + /* wait for all readers to complete their percpu_up_read() */ > + wait_event(brw->write_waitq, !atomic_read(&brw->slow_read_ctr)); > +} > +EXPORT_SYMBOL(percpu_down_write); > + > +void percpu_up_write(struct percpu_rw_semaphore *brw) > +{ > + /* allow the new readers, but only the slow-path */ > + up_write(&brw->rw_sem); > + > + /* insert the barrier before the next fast-path in down_read */ > + synchronize_sched(); > + > + mutex_unlock(&brw->writer_mutex); > +} > +EXPORT_SYMBOL(percpu_up_write); ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v3 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily 2012-11-07 17:47 ` Oleg Nesterov @ 2012-11-07 19:17 ` Mikulas Patocka 2012-11-08 13:42 ` Oleg Nesterov 0 siblings, 1 reply; 103+ messages in thread From: Mikulas Patocka @ 2012-11-07 19:17 UTC (permalink / raw) To: Oleg Nesterov Cc: Linus Torvalds, Paul E. McKenney, Peter Zijlstra, Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On Wed, 7 Nov 2012, Oleg Nesterov wrote: > On 11/07, Mikulas Patocka wrote: > > > > It looks sensible. > > > > Here I'm sending an improvement of the patch - I changed it so that there > > are not two-level nested functions for the fast path and so that both > > percpu_down_read and percpu_up_read use the same piece of code (to reduce > > cache footprint). > > IOW, the only change is that you eliminate "static update_fast_ctr()" > and fold it into down/up_read which takes the additional argument. > > Honestly, personally I do not think this is better, but I won't argue. > I agree with everything but I guess we need the ack from Paul. If you look at generated assembly (for x86-64), the footprint of my patch is 78 bytes shared for both percpu_down_read and percpu_up_read. The footprint of your patch is 62 bytes for update_fast_ctr, 46 bytes for percpu_down_read and 20 bytes for percpu_up_read. Mikulas ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v3 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily 2012-11-07 19:17 ` Mikulas Patocka @ 2012-11-08 13:42 ` Oleg Nesterov 0 siblings, 0 replies; 103+ messages in thread From: Oleg Nesterov @ 2012-11-08 13:42 UTC (permalink / raw) To: Mikulas Patocka Cc: Linus Torvalds, Paul E. McKenney, Peter Zijlstra, Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On 11/07, Mikulas Patocka wrote: > > On Wed, 7 Nov 2012, Oleg Nesterov wrote: > > > On 11/07, Mikulas Patocka wrote: > > > > > > It looks sensible. > > > > > > Here I'm sending an improvement of the patch - I changed it so that there > > > are not two-level nested functions for the fast path and so that both > > > percpu_down_read and percpu_up_read use the same piece of code (to reduce > > > cache footprint). > > > > IOW, the only change is that you eliminate "static update_fast_ctr()" > > and fold it into down/up_read which takes the additional argument. > > > > Honestly, personally I do not think this is better, but I won't argue. > > I agree with everything but I guess we need the ack from Paul. > > If you look at generated assembly (for x86-64), the footprint of my patch > is 78 bytes shared for both percpu_down_read and percpu_up_read. > > The footprint of your patch is 62 bytes for update_fast_ctr, 46 bytes for > percpu_down_read and 20 bytes for percpu_up_read. Still I think the code looks more clean this way, and personally I think this is more important. Plus, this lessens the footprint for the caller although I agree this is minor. Please send the increnental patch if you wish, I won't argue. But note that with the lockdep annotations (and I'll send the patch soon) the code will look even worse. Either you need another "if (val > 0)" check or you need to add rwsem_acquire_read/rwsem_release into .h And if you do this change please also update the comments, they still refer to update_fast_ctr() you folded into down_up ;) Oleg. 
* Re: [PATCH v3 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily 2012-11-07 17:04 ` [PATCH v3 " Mikulas Patocka 2012-11-07 17:47 ` Oleg Nesterov @ 2012-11-08 1:23 ` Paul E. McKenney 1 sibling, 0 replies; 103+ messages in thread From: Paul E. McKenney @ 2012-11-08 1:23 UTC (permalink / raw) To: Mikulas Patocka Cc: Oleg Nesterov, Linus Torvalds, Peter Zijlstra, Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On Wed, Nov 07, 2012 at 12:04:48PM -0500, Mikulas Patocka wrote: > It looks sensible. > > Here I'm sending an improvement of the patch - I changed it so that there > are not two-level nested functions for the fast path and so that both > percpu_down_read and percpu_up_read use the same piece of code (to reduce > cache footprint). > > --- > > Currently the writer does msleep() plus synchronize_sched() 3 times > to acquire/release the semaphore, and during this time the readers > are blocked completely. Even if the "write" section was not actually > started or if it was already finished. > > With this patch down_write/up_write does synchronize_sched() twice > and down_read/up_read are still possible during this time, just they > use the slow path. > > percpu_down_write() first forces the readers to use rw_semaphore and > increment the "slow" counter to take the lock for reading, then it > takes that rw_semaphore for writing and blocks the readers. > > Also. With this patch the code relies on the documented behaviour of > synchronize_sched(), it doesn't try to pair synchronize_sched() with > barrier. > > Signed-off-by: Oleg Nesterov <oleg@redhat.com> > Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> >From a memory-ordering viewpoint, this looks to me to work the same way that Oleg's does. Oleg's approach looks better to me, though that might be because I have looked at it quite a few times over the past several days. 
Thanx, Paul > --- > include/linux/percpu-rwsem.h | 80 ++++++------------------------- > lib/Makefile | 2 > lib/percpu-rwsem.c | 110 +++++++++++++++++++++++++++++++++++++++++++ > 3 files changed, 127 insertions(+), 65 deletions(-) > create mode 100644 lib/percpu-rwsem.c > > Index: linux-3.6.6-fast/include/linux/percpu-rwsem.h > =================================================================== > --- linux-3.6.6-fast.orig/include/linux/percpu-rwsem.h 2012-11-05 16:21:29.000000000 +0100 > +++ linux-3.6.6-fast/include/linux/percpu-rwsem.h 2012-11-07 16:44:04.000000000 +0100 > @@ -2,82 +2,34 @@ > #define _LINUX_PERCPU_RWSEM_H > > #include <linux/mutex.h> > +#include <linux/rwsem.h> > #include <linux/percpu.h> > -#include <linux/rcupdate.h> > -#include <linux/delay.h> > +#include <linux/wait.h> > > struct percpu_rw_semaphore { > - unsigned __percpu *counters; > - bool locked; > - struct mutex mtx; > + unsigned int __percpu *fast_read_ctr; > + struct mutex writer_mutex; > + struct rw_semaphore rw_sem; > + atomic_t slow_read_ctr; > + wait_queue_head_t write_waitq; > }; > > -#define light_mb() barrier() > -#define heavy_mb() synchronize_sched() > +extern void __percpu_down_up_read(struct percpu_rw_semaphore *, int); > > -static inline void percpu_down_read(struct percpu_rw_semaphore *p) > -{ > - rcu_read_lock_sched(); > - if (unlikely(p->locked)) { > - rcu_read_unlock_sched(); > - mutex_lock(&p->mtx); > - this_cpu_inc(*p->counters); > - mutex_unlock(&p->mtx); > - return; > - } > - this_cpu_inc(*p->counters); > - rcu_read_unlock_sched(); > - light_mb(); /* A, between read of p->locked and read of data, paired with D */ > -} > - > -static inline void percpu_up_read(struct percpu_rw_semaphore *p) > -{ > - light_mb(); /* B, between read of the data and write to p->counter, paired with C */ > - this_cpu_dec(*p->counters); > -} > - > -static inline unsigned __percpu_count(unsigned __percpu *counters) > -{ > - unsigned total = 0; > - int cpu; > - > - for_each_possible_cpu(cpu) > 
- total += ACCESS_ONCE(*per_cpu_ptr(counters, cpu)); > +extern void percpu_down_write(struct percpu_rw_semaphore *); > +extern void percpu_up_write(struct percpu_rw_semaphore *); > > - return total; > -} > - > -static inline void percpu_down_write(struct percpu_rw_semaphore *p) > -{ > - mutex_lock(&p->mtx); > - p->locked = true; > - synchronize_sched(); /* make sure that all readers exit the rcu_read_lock_sched region */ > - while (__percpu_count(p->counters)) > - msleep(1); > - heavy_mb(); /* C, between read of p->counter and write to data, paired with B */ > -} > - > -static inline void percpu_up_write(struct percpu_rw_semaphore *p) > -{ > - heavy_mb(); /* D, between write to data and write to p->locked, paired with A */ > - p->locked = false; > - mutex_unlock(&p->mtx); > -} > +extern int percpu_init_rwsem(struct percpu_rw_semaphore *); > +extern void percpu_free_rwsem(struct percpu_rw_semaphore *); > > -static inline int percpu_init_rwsem(struct percpu_rw_semaphore *p) > +static inline void percpu_down_read(struct percpu_rw_semaphore *s) > { > - p->counters = alloc_percpu(unsigned); > - if (unlikely(!p->counters)) > - return -ENOMEM; > - p->locked = false; > - mutex_init(&p->mtx); > - return 0; > + __percpu_down_up_read(s, 1); > } > > -static inline void percpu_free_rwsem(struct percpu_rw_semaphore *p) > +static inline void percpu_up_read(struct percpu_rw_semaphore *s) > { > - free_percpu(p->counters); > - p->counters = NULL; /* catch use after free bugs */ > + __percpu_down_up_read(s, -1); > } > > #endif > Index: linux-3.6.6-fast/lib/Makefile > =================================================================== > --- linux-3.6.6-fast.orig/lib/Makefile 2012-10-02 00:47:57.000000000 +0200 > +++ linux-3.6.6-fast/lib/Makefile 2012-11-07 03:10:44.000000000 +0100 > @@ -12,7 +12,7 @@ lib-y := ctype.o string.o vsprintf.o cmd > idr.o int_sqrt.o extable.o prio_tree.o \ > sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \ > proportions.o flex_proportions.o 
prio_heap.o ratelimit.o show_mem.o \ > - is_single_threaded.o plist.o decompress.o > + is_single_threaded.o plist.o decompress.o percpu-rwsem.o > > lib-$(CONFIG_MMU) += ioremap.o > lib-$(CONFIG_SMP) += cpumask.o > Index: linux-3.6.6-fast/lib/percpu-rwsem.c > =================================================================== > --- /dev/null 1970-01-01 00:00:00.000000000 +0000 > +++ linux-3.6.6-fast/lib/percpu-rwsem.c 2012-11-07 16:43:27.000000000 +0100 > @@ -0,0 +1,110 @@ > +#include <linux/percpu-rwsem.h> > +#include <linux/rcupdate.h> > +#include <linux/sched.h> > +#include <linux/module.h> > + > +int percpu_init_rwsem(struct percpu_rw_semaphore *brw) > +{ > + brw->fast_read_ctr = alloc_percpu(int); > + if (unlikely(!brw->fast_read_ctr)) > + return -ENOMEM; > + > + mutex_init(&brw->writer_mutex); > + init_rwsem(&brw->rw_sem); > + atomic_set(&brw->slow_read_ctr, 0); > + init_waitqueue_head(&brw->write_waitq); > + return 0; > +} > +EXPORT_SYMBOL(percpu_init_rwsem); > + > +void percpu_free_rwsem(struct percpu_rw_semaphore *brw) > +{ > + free_percpu(brw->fast_read_ctr); > + brw->fast_read_ctr = NULL; /* catch use after free bugs */ > +} > +EXPORT_SYMBOL(percpu_free_rwsem); > + > +void __percpu_down_up_read(struct percpu_rw_semaphore *brw, int val) > +{ > + preempt_disable(); > + if (likely(!mutex_is_locked(&brw->writer_mutex))) { > + __this_cpu_add(*brw->fast_read_ctr, val); > + preempt_enable(); > + return; > + } > + preempt_enable(); > + if (val >= 0) { > + down_read(&brw->rw_sem); > + atomic_inc(&brw->slow_read_ctr); > + up_read(&brw->rw_sem); > + } else { > + if (atomic_dec_and_test(&brw->slow_read_ctr)) > + wake_up_all(&brw->write_waitq); > + } > +} > +EXPORT_SYMBOL(__percpu_down_up_read); > + > +static int clear_fast_ctr(struct percpu_rw_semaphore *brw) > +{ > + unsigned int sum = 0; > + int cpu; > + > + for_each_possible_cpu(cpu) { > + sum += per_cpu(*brw->fast_read_ctr, cpu); > + per_cpu(*brw->fast_read_ctr, cpu) = 0; > + } > + > + return sum; > +} > + > +/* 
> + * A writer takes ->writer_mutex to exclude other writers and to force the > + * readers to switch to the slow mode, note the mutex_is_locked() check in > + * update_fast_ctr(). > + * > + * After that the readers can only inc/dec the slow ->slow_read_ctr counter, > + * ->fast_read_ctr is stable. Once the writer moves its sum into the slow > + * counter it represents the number of active readers. > + * > + * Finally the writer takes ->rw_sem for writing and blocks the new readers, > + * then waits until the slow counter becomes zero. > + */ > +void percpu_down_write(struct percpu_rw_semaphore *brw) > +{ > + /* also blocks update_fast_ctr() which checks mutex_is_locked() */ > + mutex_lock(&brw->writer_mutex); > + > + /* > + * 1. Ensures mutex_is_locked() is visible to any down_read/up_read > + * so that update_fast_ctr() can't succeed. > + * > + * 2. Ensures we see the result of every previous this_cpu_add() in > + * update_fast_ctr(). > + * > + * 3. Ensures that if any reader has exited its critical section via > + * fast-path, it executes a full memory barrier before we return. > + */ > + synchronize_sched(); > + > + /* nobody can use fast_read_ctr, move its sum into slow_read_ctr */ > + atomic_add(clear_fast_ctr(brw), &brw->slow_read_ctr); > + > + /* block the new readers completely */ > + down_write(&brw->rw_sem); > + > + /* wait for all readers to complete their percpu_up_read() */ > + wait_event(brw->write_waitq, !atomic_read(&brw->slow_read_ctr)); > +} > +EXPORT_SYMBOL(percpu_down_write); > + > +void percpu_up_write(struct percpu_rw_semaphore *brw) > +{ > + /* allow the new readers, but only the slow-path */ > + up_write(&brw->rw_sem); > + > + /* insert the barrier before the next fast-path in down_read */ > + synchronize_sched(); > + > + mutex_unlock(&brw->writer_mutex); > +} > +EXPORT_SYMBOL(percpu_up_write); > ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily 2012-11-02 18:06 ` [PATCH v2 1/1] " Oleg Nesterov 2012-11-07 17:04 ` [PATCH v3 " Mikulas Patocka @ 2012-11-08 1:16 ` Paul E. McKenney 2012-11-08 13:33 ` Oleg Nesterov 1 sibling, 1 reply; 103+ messages in thread From: Paul E. McKenney @ 2012-11-08 1:16 UTC (permalink / raw) To: Oleg Nesterov Cc: Linus Torvalds, Mikulas Patocka, Peter Zijlstra, Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On Fri, Nov 02, 2012 at 07:06:29PM +0100, Oleg Nesterov wrote: > Currently the writer does msleep() plus synchronize_sched() 3 times > to acquire/release the semaphore, and during this time the readers > are blocked completely. Even if the "write" section was not actually > started or if it was already finished. > > With this patch down_write/up_write does synchronize_sched() twice > and down_read/up_read are still possible during this time, just they > use the slow path. > > percpu_down_write() first forces the readers to use rw_semaphore and > increment the "slow" counter to take the lock for reading, then it > takes that rw_semaphore for writing and blocks the readers. > > Also. With this patch the code relies on the documented behaviour of > synchronize_sched(), it doesn't try to pair synchronize_sched() with > barrier. 
> > Signed-off-by: Oleg Nesterov <oleg@redhat.com> > --- > include/linux/percpu-rwsem.h | 83 +++++------------------------ > lib/Makefile | 2 +- > lib/percpu-rwsem.c | 123 ++++++++++++++++++++++++++++++++++++++++++ > 3 files changed, 137 insertions(+), 71 deletions(-) > create mode 100644 lib/percpu-rwsem.c > > diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h > index 250a4ac..592f0d6 100644 > --- a/include/linux/percpu-rwsem.h > +++ b/include/linux/percpu-rwsem.h > @@ -2,82 +2,25 @@ > #define _LINUX_PERCPU_RWSEM_H > > #include <linux/mutex.h> > +#include <linux/rwsem.h> > #include <linux/percpu.h> > -#include <linux/rcupdate.h> > -#include <linux/delay.h> > +#include <linux/wait.h> > > struct percpu_rw_semaphore { > - unsigned __percpu *counters; > - bool locked; > - struct mutex mtx; > + unsigned int __percpu *fast_read_ctr; > + struct mutex writer_mutex; > + struct rw_semaphore rw_sem; > + atomic_t slow_read_ctr; > + wait_queue_head_t write_waitq; > }; > > -#define light_mb() barrier() > -#define heavy_mb() synchronize_sched() > +extern void percpu_down_read(struct percpu_rw_semaphore *); > +extern void percpu_up_read(struct percpu_rw_semaphore *); > > -static inline void percpu_down_read(struct percpu_rw_semaphore *p) > -{ > - rcu_read_lock_sched(); > - if (unlikely(p->locked)) { > - rcu_read_unlock_sched(); > - mutex_lock(&p->mtx); > - this_cpu_inc(*p->counters); > - mutex_unlock(&p->mtx); > - return; > - } > - this_cpu_inc(*p->counters); > - rcu_read_unlock_sched(); > - light_mb(); /* A, between read of p->locked and read of data, paired with D */ > -} > +extern void percpu_down_write(struct percpu_rw_semaphore *); > +extern void percpu_up_write(struct percpu_rw_semaphore *); > > -static inline void percpu_up_read(struct percpu_rw_semaphore *p) > -{ > - light_mb(); /* B, between read of the data and write to p->counter, paired with C */ > - this_cpu_dec(*p->counters); > -} > - > -static inline unsigned __percpu_count(unsigned __percpu 
*counters) > -{ > - unsigned total = 0; > - int cpu; > - > - for_each_possible_cpu(cpu) > - total += ACCESS_ONCE(*per_cpu_ptr(counters, cpu)); > - > - return total; > -} > - > -static inline void percpu_down_write(struct percpu_rw_semaphore *p) > -{ > - mutex_lock(&p->mtx); > - p->locked = true; > - synchronize_sched(); /* make sure that all readers exit the rcu_read_lock_sched region */ > - while (__percpu_count(p->counters)) > - msleep(1); > - heavy_mb(); /* C, between read of p->counter and write to data, paired with B */ > -} > - > -static inline void percpu_up_write(struct percpu_rw_semaphore *p) > -{ > - heavy_mb(); /* D, between write to data and write to p->locked, paired with A */ > - p->locked = false; > - mutex_unlock(&p->mtx); > -} > - > -static inline int percpu_init_rwsem(struct percpu_rw_semaphore *p) > -{ > - p->counters = alloc_percpu(unsigned); > - if (unlikely(!p->counters)) > - return -ENOMEM; > - p->locked = false; > - mutex_init(&p->mtx); > - return 0; > -} > - > -static inline void percpu_free_rwsem(struct percpu_rw_semaphore *p) > -{ > - free_percpu(p->counters); > - p->counters = NULL; /* catch use after free bugs */ > -} > +extern int percpu_init_rwsem(struct percpu_rw_semaphore *); > +extern void percpu_free_rwsem(struct percpu_rw_semaphore *); > > #endif > diff --git a/lib/Makefile b/lib/Makefile > index 821a162..4dad4a7 100644 > --- a/lib/Makefile > +++ b/lib/Makefile > @@ -12,7 +12,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \ > idr.o int_sqrt.o extable.o \ > sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \ > proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \ > - is_single_threaded.o plist.o decompress.o > + is_single_threaded.o plist.o decompress.o percpu-rwsem.o > > lib-$(CONFIG_MMU) += ioremap.o > lib-$(CONFIG_SMP) += cpumask.o > diff --git a/lib/percpu-rwsem.c b/lib/percpu-rwsem.c > new file mode 100644 > index 0000000..0e3bc0f > --- /dev/null > +++ b/lib/percpu-rwsem.c > @@ -0,0 +1,123 @@ > 
+#include <linux/percpu-rwsem.h> > +#include <linux/rcupdate.h> > +#include <linux/sched.h> > + > +int percpu_init_rwsem(struct percpu_rw_semaphore *brw) > +{ > + brw->fast_read_ctr = alloc_percpu(int); > + if (unlikely(!brw->fast_read_ctr)) > + return -ENOMEM; > + > + mutex_init(&brw->writer_mutex); > + init_rwsem(&brw->rw_sem); > + atomic_set(&brw->slow_read_ctr, 0); > + init_waitqueue_head(&brw->write_waitq); > + return 0; > +} > + > +void percpu_free_rwsem(struct percpu_rw_semaphore *brw) > +{ > + free_percpu(brw->fast_read_ctr); > + brw->fast_read_ctr = NULL; /* catch use after free bugs */ > +} > + > +static bool update_fast_ctr(struct percpu_rw_semaphore *brw, unsigned int val) > +{ > + bool success = false; > + > + preempt_disable(); > + if (likely(!mutex_is_locked(&brw->writer_mutex))) { > + __this_cpu_add(*brw->fast_read_ctr, val); > + success = true; > + } > + preempt_enable(); > + > + return success; > +} > + > +/* > + * Like the normal down_read() this is not recursive, the writer can > + * come after the first percpu_down_read() and create the deadlock. 
> + */ > +void percpu_down_read(struct percpu_rw_semaphore *brw) > +{ > + if (likely(update_fast_ctr(brw, +1))) > + return; > + > + down_read(&brw->rw_sem); > + atomic_inc(&brw->slow_read_ctr); > + up_read(&brw->rw_sem); > +} > + > +void percpu_up_read(struct percpu_rw_semaphore *brw) > +{ > + if (likely(update_fast_ctr(brw, -1))) > + return; > + > + /* false-positive is possible but harmless */ > + if (atomic_dec_and_test(&brw->slow_read_ctr)) > + wake_up_all(&brw->write_waitq); > +} > + > +static int clear_fast_ctr(struct percpu_rw_semaphore *brw) > +{ > + unsigned int sum = 0; > + int cpu; > + > + for_each_possible_cpu(cpu) { > + sum += per_cpu(*brw->fast_read_ctr, cpu); > + per_cpu(*brw->fast_read_ctr, cpu) = 0; > + } > + > + return sum; > +} > + > +/* > + * A writer takes ->writer_mutex to exclude other writers and to force the > + * readers to switch to the slow mode, note the mutex_is_locked() check in > + * update_fast_ctr(). > + * > + * After that the readers can only inc/dec the slow ->slow_read_ctr counter, > + * ->fast_read_ctr is stable. Once the writer moves its sum into the slow > + * counter it represents the number of active readers. > + * > + * Finally the writer takes ->rw_sem for writing and blocks the new readers, > + * then waits until the slow counter becomes zero. > + */ > +void percpu_down_write(struct percpu_rw_semaphore *brw) > +{ > + /* also blocks update_fast_ctr() which checks mutex_is_locked() */ > + mutex_lock(&brw->writer_mutex); > + > + /* > + * 1. Ensures mutex_is_locked() is visible to any down_read/up_read > + * so that update_fast_ctr() can't succeed. > + * > + * 2. Ensures we see the result of every previous this_cpu_add() in > + * update_fast_ctr(). > + * > + * 3. Ensures that if any reader has exited its critical section via > + * fast-path, it executes a full memory barrier before we return. 
> + */ > + synchronize_sched(); > + > + /* nobody can use fast_read_ctr, move its sum into slow_read_ctr */ > + atomic_add(clear_fast_ctr(brw), &brw->slow_read_ctr); > + > + /* block the new readers completely */ > + down_write(&brw->rw_sem); > + > + /* wait for all readers to complete their percpu_up_read() */ > + wait_event(brw->write_waitq, !atomic_read(&brw->slow_read_ctr)); > +} > + > +void percpu_up_write(struct percpu_rw_semaphore *brw) > +{ > + /* allow the new readers, but only the slow-path */ > + up_write(&brw->rw_sem); > + > + /* insert the barrier before the next fast-path in down_read */ > + synchronize_sched(); Ah, my added comments describing the memory-order properties of synchronize_sched() were incomplete. As you say in the comment above, a valid RCU implementation must ensure that each CPU executes a memory barrier between the time that synchronize_sched() starts executing and the time that this same CPU starts its first RCU read-side critical section that ends after synchronize_sched() finishes executing. (This is symmetric with the requirement discussed earlier.) This works for the user-level RCU implementations as well -- some of them supply the memory barriers under the control of the synchronize_rcu(), while others supply them at the beginnings and ends of the RCU read-side critical section. Either way works, as required. (Why do I care about potential implementations with memory barriers in the read-side primitives? Well, I hope that I never have reason to. But if memory barriers do some day become free and if energy efficiency continues to grow in importance, some hardware might prefer the memory barriers in rcu_read_lock() and rcu_read_unlock() to interrupting CPUs to force them to execute memory barriers.) This in turn means that if a given RCU read-side critical section is totally overlapped by a synchronize_sched(), there are no guarantees of any memory barriers. Which is OK, you don't rely on this. 
> + mutex_unlock(&brw->writer_mutex); And if a reader sees brw->writer_mutex as unlocked, then that reader's RCU read-side critical section must end after the above synchronize_sched() completes, which in turn means that there must have been a memory barrier on that reader's CPU after the synchronize_sched() started, so that the reader correctly sees the writer's updates. > +} Sorry to be such a pain (and a slow pain at that) on this one, but we really needed to get this right. But please let me know what you think of the added memory-order constraint. Note that a CPU that never ever executes any RCU read-side critical sections need not execute any synchronize_sched()-induced memory barriers. So: Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Thanx, Paul ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily 2012-11-08 1:16 ` [PATCH v2 " Paul E. McKenney @ 2012-11-08 13:33 ` Oleg Nesterov 2012-11-08 16:27 ` Paul E. McKenney 0 siblings, 1 reply; 103+ messages in thread From: Oleg Nesterov @ 2012-11-08 13:33 UTC (permalink / raw) To: Paul E. McKenney Cc: Linus Torvalds, Mikulas Patocka, Peter Zijlstra, Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On 11/07, Paul E. McKenney wrote: > > On Fri, Nov 02, 2012 at 07:06:29PM +0100, Oleg Nesterov wrote: > > +void percpu_down_write(struct percpu_rw_semaphore *brw) > > +{ > > + /* also blocks update_fast_ctr() which checks mutex_is_locked() */ > > + mutex_lock(&brw->writer_mutex); > > + > > + /* > > + * 1. Ensures mutex_is_locked() is visible to any down_read/up_read > > + * so that update_fast_ctr() can't succeed. > > + * > > + * 2. Ensures we see the result of every previous this_cpu_add() in > > + * update_fast_ctr(). > > + * > > + * 3. Ensures that if any reader has exited its critical section via > > + * fast-path, it executes a full memory barrier before we return. > > + */ > > + synchronize_sched(); > > + > > + /* nobody can use fast_read_ctr, move its sum into slow_read_ctr */ > > + atomic_add(clear_fast_ctr(brw), &brw->slow_read_ctr); > > + > > + /* block the new readers completely */ > > + down_write(&brw->rw_sem); > > + > > + /* wait for all readers to complete their percpu_up_read() */ > > + wait_event(brw->write_waitq, !atomic_read(&brw->slow_read_ctr)); > > +} > > + > > +void percpu_up_write(struct percpu_rw_semaphore *brw) > > +{ > > + /* allow the new readers, but only the slow-path */ > > + up_write(&brw->rw_sem); > > + > > + /* insert the barrier before the next fast-path in down_read */ > > + synchronize_sched(); > > Ah, my added comments describing the memory-order properties of > synchronize_sched() were incomplete. 
As you say in the comment above, > a valid RCU implementation must ensure that each CPU executes a memory > barrier between the time that synchronize_sched() starts executing and > the time that this same CPU starts its first RCU read-side critical > section that ends after synchronize_sched() finishes executing. (This > is symmetric with the requirement discussed earlier.) I think, yes. Let me repeat my example (changed a little bit). Suppose that we have int A = 0, B = 0, STOP = 0; // can be called at any time, and many times void func(void) { rcu_read_lock_sched(); if (!STOP) { A++; B++; } rcu_read_unlock_sched(); } Then I believe the following code should be correct: STOP = 1; synchronize_sched(); BUG_ON(A != B); We should see the result of the previous increments, and func() should see STOP != 0 if it races with BUG_ON(). > And if a reader sees brw->writer_mutex as unlocked, then that reader's > RCU read-side critical section must end after the above synchronize_sched() > completes, which in turn means that there must have been a memory barrier > on that reader's CPU after the synchronize_sched() started, so that the > reader correctly sees the writer's updates. Yes. > But please let me know what you > think of the added memory-order constraint. I am going to (try to) do other changes on top of this patch, and I'll certainly try to think more about this, thanks. > Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Great! thanks a lot Paul. Oleg. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily 2012-11-08 13:33 ` Oleg Nesterov @ 2012-11-08 16:27 ` Paul E. McKenney 0 siblings, 0 replies; 103+ messages in thread From: Paul E. McKenney @ 2012-11-08 16:27 UTC (permalink / raw) To: Oleg Nesterov Cc: Linus Torvalds, Mikulas Patocka, Peter Zijlstra, Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On Thu, Nov 08, 2012 at 02:33:27PM +0100, Oleg Nesterov wrote: > On 11/07, Paul E. McKenney wrote: > > > > On Fri, Nov 02, 2012 at 07:06:29PM +0100, Oleg Nesterov wrote: > > > +void percpu_down_write(struct percpu_rw_semaphore *brw) > > > +{ > > > + /* also blocks update_fast_ctr() which checks mutex_is_locked() */ > > > + mutex_lock(&brw->writer_mutex); > > > + > > > + /* > > > + * 1. Ensures mutex_is_locked() is visible to any down_read/up_read > > > + * so that update_fast_ctr() can't succeed. > > > + * > > > + * 2. Ensures we see the result of every previous this_cpu_add() in > > > + * update_fast_ctr(). > > > + * > > > + * 3. Ensures that if any reader has exited its critical section via > > > + * fast-path, it executes a full memory barrier before we return. 
> > > + */ > > > + synchronize_sched(); > > > + > > > + /* nobody can use fast_read_ctr, move its sum into slow_read_ctr */ > > > + atomic_add(clear_fast_ctr(brw), &brw->slow_read_ctr); > > > + > > > + /* block the new readers completely */ > > > + down_write(&brw->rw_sem); > > > + > > > + /* wait for all readers to complete their percpu_up_read() */ > > > + wait_event(brw->write_waitq, !atomic_read(&brw->slow_read_ctr)); > > > +} > > > + > > > +void percpu_up_write(struct percpu_rw_semaphore *brw) > > > +{ > > > + /* allow the new readers, but only the slow-path */ > > > + up_write(&brw->rw_sem); > > > + > > > + /* insert the barrier before the next fast-path in down_read */ > > > + synchronize_sched(); > > > > Ah, my added comments describing the memory-order properties of > > synchronize_sched() were incomplete. As you say in the comment above, > > a valid RCU implementation must ensure that each CPU executes a memory > > barrier between the time that synchronize_sched() starts executing and > > the time that this same CPU starts its first RCU read-side critical > > section that ends after synchronize_sched() finishes executing. (This > > is symmetric with the requirement discussed earlier.) > > I think, yes. Let me repeat my example (changed a little bit). Suppose > that we have > > int A = 0, B = 0, STOP = 0; > > // can be called at any time, and many times > void func(void) > { > rcu_read_lock_sched(); > if (!STOP) { > A++; > B++; > } > rcu_read_unlock_sched(); > } > > Then I believe the following code should be correct: > > STOP = 1; > > synchronize_sched(); > > BUG_ON(A != B); Agreed, but covered by my earlier definition. > We should see the result of the previous increments, and func() should > see STOP != 0 if it races with BUG_ON(). 
Alternatively, if we have something like: if (!STOP) { A++; B++; if (random() & 0xffff) { synchronize_sched(); STOP = 1; } } Then if we also have elsewhere: rcu_read_lock_sched(); if (STOP) BUG_ON(A != B); rcu_read_unlock_sched(); The BUG_ON() should never fire. This one requires the other guarantee, that if a given RCU read-side critical section ends after a given synchronize_sched(), then the CPU executing that RCU read-side critical section is guaranteed to have executed a memory barrier between the start of that synchronize_sched() and the start of that RCU read-side critical section. > > And if a reader sees brw->writer_mutex as unlocked, then that reader's > > RCU read-side critical section must end after the above synchronize_sched() > > completes, which in turn means that there must have been a memory barrier > > on that reader's CPU after the synchronize_sched() started, so that the > > reader correctly sees the writer's updates. > > Yes. > > > But please let me know what you > > think of the added memory-order constraint. > > I am going to (try to) do other changes on top of this patch, and I'll > certainly try to think more about this, thanks. Looking forward to hearing your thoughts! Thanx, Paul > > Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> > > Great! thanks a lot Paul. > > Oleg. > ^ permalink raw reply [flat|nested] 103+ messages in thread
* [PATCH RESEND v2 0/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily 2012-11-02 18:06 ` [PATCH v2 0/1] " Oleg Nesterov 2012-11-02 18:06 ` [PATCH v2 1/1] " Oleg Nesterov @ 2012-11-08 13:48 ` Oleg Nesterov 2012-11-08 13:48 ` [PATCH RESEND v2 1/1] " Oleg Nesterov 1 sibling, 1 reply; 103+ messages in thread From: Oleg Nesterov @ 2012-11-08 13:48 UTC (permalink / raw) To: Andrew Morton, Linus Torvalds, Paul E. McKenney Cc: Mikulas Patocka, Peter Zijlstra, Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On 11/02, Oleg Nesterov wrote: > > On 11/01, Linus Torvalds wrote: > > > > Other than that, I guess it looks ok. > > Great, please see v2. > > I am not sure I addressed Paul's concerns, so I guess I need his ack. And now I have it, so I think the patch is ready. Please see 1/1. No changes except I added Reviewed-by from Paul. If it is too late for 3.7, then may be Andrew can pick it. Oleg. ^ permalink raw reply [flat|nested] 103+ messages in thread
* [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily 2012-11-08 13:48 ` [PATCH RESEND v2 0/1] " Oleg Nesterov @ 2012-11-08 13:48 ` Oleg Nesterov 2012-11-08 20:07 ` Andrew Morton 0 siblings, 1 reply; 103+ messages in thread From: Oleg Nesterov @ 2012-11-08 13:48 UTC (permalink / raw) To: Andrew Morton, Linus Torvalds, Paul E. McKenney Cc: Mikulas Patocka, Peter Zijlstra, Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel Currently the writer does msleep() plus synchronize_sched() 3 times to acquire/release the semaphore, and during this time the readers are blocked completely. Even if the "write" section was not actually started or if it was already finished. With this patch down_write/up_write does synchronize_sched() twice and down_read/up_read are still possible during this time, just they use the slow path. percpu_down_write() first forces the readers to use rw_semaphore and increment the "slow" counter to take the lock for reading, then it takes that rw_semaphore for writing and blocks the readers. Also. With this patch the code relies on the documented behaviour of synchronize_sched(), it doesn't try to pair synchronize_sched() with barrier. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Paul E. 
McKenney <paulmck@linux.vnet.ibm.com> --- include/linux/percpu-rwsem.h | 83 +++++------------------------ lib/Makefile | 2 +- lib/percpu-rwsem.c | 123 ++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 137 insertions(+), 71 deletions(-) create mode 100644 lib/percpu-rwsem.c diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h index 250a4ac..592f0d6 100644 --- a/include/linux/percpu-rwsem.h +++ b/include/linux/percpu-rwsem.h @@ -2,82 +2,25 @@ #define _LINUX_PERCPU_RWSEM_H #include <linux/mutex.h> +#include <linux/rwsem.h> #include <linux/percpu.h> -#include <linux/rcupdate.h> -#include <linux/delay.h> +#include <linux/wait.h> struct percpu_rw_semaphore { - unsigned __percpu *counters; - bool locked; - struct mutex mtx; + unsigned int __percpu *fast_read_ctr; + struct mutex writer_mutex; + struct rw_semaphore rw_sem; + atomic_t slow_read_ctr; + wait_queue_head_t write_waitq; }; -#define light_mb() barrier() -#define heavy_mb() synchronize_sched() +extern void percpu_down_read(struct percpu_rw_semaphore *); +extern void percpu_up_read(struct percpu_rw_semaphore *); -static inline void percpu_down_read(struct percpu_rw_semaphore *p) -{ - rcu_read_lock_sched(); - if (unlikely(p->locked)) { - rcu_read_unlock_sched(); - mutex_lock(&p->mtx); - this_cpu_inc(*p->counters); - mutex_unlock(&p->mtx); - return; - } - this_cpu_inc(*p->counters); - rcu_read_unlock_sched(); - light_mb(); /* A, between read of p->locked and read of data, paired with D */ -} +extern void percpu_down_write(struct percpu_rw_semaphore *); +extern void percpu_up_write(struct percpu_rw_semaphore *); -static inline void percpu_up_read(struct percpu_rw_semaphore *p) -{ - light_mb(); /* B, between read of the data and write to p->counter, paired with C */ - this_cpu_dec(*p->counters); -} - -static inline unsigned __percpu_count(unsigned __percpu *counters) -{ - unsigned total = 0; - int cpu; - - for_each_possible_cpu(cpu) - total += ACCESS_ONCE(*per_cpu_ptr(counters, cpu)); - - 
return total; -} - -static inline void percpu_down_write(struct percpu_rw_semaphore *p) -{ - mutex_lock(&p->mtx); - p->locked = true; - synchronize_sched(); /* make sure that all readers exit the rcu_read_lock_sched region */ - while (__percpu_count(p->counters)) - msleep(1); - heavy_mb(); /* C, between read of p->counter and write to data, paired with B */ -} - -static inline void percpu_up_write(struct percpu_rw_semaphore *p) -{ - heavy_mb(); /* D, between write to data and write to p->locked, paired with A */ - p->locked = false; - mutex_unlock(&p->mtx); -} - -static inline int percpu_init_rwsem(struct percpu_rw_semaphore *p) -{ - p->counters = alloc_percpu(unsigned); - if (unlikely(!p->counters)) - return -ENOMEM; - p->locked = false; - mutex_init(&p->mtx); - return 0; -} - -static inline void percpu_free_rwsem(struct percpu_rw_semaphore *p) -{ - free_percpu(p->counters); - p->counters = NULL; /* catch use after free bugs */ -} +extern int percpu_init_rwsem(struct percpu_rw_semaphore *); +extern void percpu_free_rwsem(struct percpu_rw_semaphore *); #endif diff --git a/lib/Makefile b/lib/Makefile index 821a162..4dad4a7 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -12,7 +12,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \ idr.o int_sqrt.o extable.o \ sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \ proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \ - is_single_threaded.o plist.o decompress.o + is_single_threaded.o plist.o decompress.o percpu-rwsem.o lib-$(CONFIG_MMU) += ioremap.o lib-$(CONFIG_SMP) += cpumask.o diff --git a/lib/percpu-rwsem.c b/lib/percpu-rwsem.c new file mode 100644 index 0000000..0e3bc0f --- /dev/null +++ b/lib/percpu-rwsem.c @@ -0,0 +1,123 @@ +#include <linux/percpu-rwsem.h> +#include <linux/rcupdate.h> +#include <linux/sched.h> + +int percpu_init_rwsem(struct percpu_rw_semaphore *brw) +{ + brw->fast_read_ctr = alloc_percpu(int); + if (unlikely(!brw->fast_read_ctr)) + return -ENOMEM; + + 
mutex_init(&brw->writer_mutex); + init_rwsem(&brw->rw_sem); + atomic_set(&brw->slow_read_ctr, 0); + init_waitqueue_head(&brw->write_waitq); + return 0; +} + +void percpu_free_rwsem(struct percpu_rw_semaphore *brw) +{ + free_percpu(brw->fast_read_ctr); + brw->fast_read_ctr = NULL; /* catch use after free bugs */ +} + +static bool update_fast_ctr(struct percpu_rw_semaphore *brw, unsigned int val) +{ + bool success = false; + + preempt_disable(); + if (likely(!mutex_is_locked(&brw->writer_mutex))) { + __this_cpu_add(*brw->fast_read_ctr, val); + success = true; + } + preempt_enable(); + + return success; +} + +/* + * Like the normal down_read() this is not recursive, the writer can + * come after the first percpu_down_read() and create the deadlock. + */ +void percpu_down_read(struct percpu_rw_semaphore *brw) +{ + if (likely(update_fast_ctr(brw, +1))) + return; + + down_read(&brw->rw_sem); + atomic_inc(&brw->slow_read_ctr); + up_read(&brw->rw_sem); +} + +void percpu_up_read(struct percpu_rw_semaphore *brw) +{ + if (likely(update_fast_ctr(brw, -1))) + return; + + /* false-positive is possible but harmless */ + if (atomic_dec_and_test(&brw->slow_read_ctr)) + wake_up_all(&brw->write_waitq); +} + +static int clear_fast_ctr(struct percpu_rw_semaphore *brw) +{ + unsigned int sum = 0; + int cpu; + + for_each_possible_cpu(cpu) { + sum += per_cpu(*brw->fast_read_ctr, cpu); + per_cpu(*brw->fast_read_ctr, cpu) = 0; + } + + return sum; +} + +/* + * A writer takes ->writer_mutex to exclude other writers and to force the + * readers to switch to the slow mode, note the mutex_is_locked() check in + * update_fast_ctr(). + * + * After that the readers can only inc/dec the slow ->slow_read_ctr counter, + * ->fast_read_ctr is stable. Once the writer moves its sum into the slow + * counter it represents the number of active readers. + * + * Finally the writer takes ->rw_sem for writing and blocks the new readers, + * then waits until the slow counter becomes zero. 
+ */ +void percpu_down_write(struct percpu_rw_semaphore *brw) +{ + /* also blocks update_fast_ctr() which checks mutex_is_locked() */ + mutex_lock(&brw->writer_mutex); + + /* + * 1. Ensures mutex_is_locked() is visible to any down_read/up_read + * so that update_fast_ctr() can't succeed. + * + * 2. Ensures we see the result of every previous this_cpu_add() in + * update_fast_ctr(). + * + * 3. Ensures that if any reader has exited its critical section via + * fast-path, it executes a full memory barrier before we return. + */ + synchronize_sched(); + + /* nobody can use fast_read_ctr, move its sum into slow_read_ctr */ + atomic_add(clear_fast_ctr(brw), &brw->slow_read_ctr); + + /* block the new readers completely */ + down_write(&brw->rw_sem); + + /* wait for all readers to complete their percpu_up_read() */ + wait_event(brw->write_waitq, !atomic_read(&brw->slow_read_ctr)); +} + +void percpu_up_write(struct percpu_rw_semaphore *brw) +{ + /* allow the new readers, but only the slow-path */ + up_write(&brw->rw_sem); + + /* insert the barrier before the next fast-path in down_read */ + synchronize_sched(); + + mutex_unlock(&brw->writer_mutex); +} -- 1.5.5.1 ^ permalink raw reply related [flat|nested] 103+ messages in thread
* Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily 2012-11-08 13:48 ` [PATCH RESEND v2 1/1] " Oleg Nesterov @ 2012-11-08 20:07 ` Andrew Morton 2012-11-08 21:08 ` Paul E. McKenney ` (3 more replies) 0 siblings, 4 replies; 103+ messages in thread From: Andrew Morton @ 2012-11-08 20:07 UTC (permalink / raw) To: Oleg Nesterov Cc: Linus Torvalds, Paul E. McKenney, Mikulas Patocka, Peter Zijlstra, Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On Thu, 8 Nov 2012 14:48:49 +0100 Oleg Nesterov <oleg@redhat.com> wrote: > Currently the writer does msleep() plus synchronize_sched() 3 times > to acquire/release the semaphore, and during this time the readers > are blocked completely. Even if the "write" section was not actually > started or if it was already finished. > > With this patch down_write/up_write does synchronize_sched() twice > and down_read/up_read are still possible during this time, just they > use the slow path. > > percpu_down_write() first forces the readers to use rw_semaphore and > increment the "slow" counter to take the lock for reading, then it > takes that rw_semaphore for writing and blocks the readers. > > Also. With this patch the code relies on the documented behaviour of > synchronize_sched(), it doesn't try to pair synchronize_sched() with > barrier. > > ... > > include/linux/percpu-rwsem.h | 83 +++++------------------------ > lib/Makefile | 2 +- > lib/percpu-rwsem.c | 123 ++++++++++++++++++++++++++++++++++++++++++ The patch also uninlines everything. And it didn't export the resulting symbols to modules, so it isn't an equivalent. We can export thing later if needed I guess. It adds percpu-rwsem.o to lib-y, so the CONFIG_BLOCK=n kernel will avoid including the code altogether, methinks? > > ... 
>
> --- /dev/null
> +++ b/lib/percpu-rwsem.c
> @@ -0,0 +1,123 @@

That was nice and terse ;)

> +#include <linux/percpu-rwsem.h>
> +#include <linux/rcupdate.h>
> +#include <linux/sched.h>

This list is nowhere near sufficient to support this file's
requirements.  atomic.h, percpu.h, rwsem.h, wait.h, errno.h and plenty
more.  IOW, if it compiles, it was sheer luck.

> +int percpu_init_rwsem(struct percpu_rw_semaphore *brw)
> +{
> +	brw->fast_read_ctr = alloc_percpu(int);
> +	if (unlikely(!brw->fast_read_ctr))
> +		return -ENOMEM;
> +
> +	mutex_init(&brw->writer_mutex);
> +	init_rwsem(&brw->rw_sem);
> +	atomic_set(&brw->slow_read_ctr, 0);
> +	init_waitqueue_head(&brw->write_waitq);
> +	return 0;
> +}
> +
> +void percpu_free_rwsem(struct percpu_rw_semaphore *brw)
> +{
> +	free_percpu(brw->fast_read_ctr);
> +	brw->fast_read_ctr = NULL; /* catch use after free bugs */
> +}
> +
> +static bool update_fast_ctr(struct percpu_rw_semaphore *brw, unsigned int val)
> +{
> +	bool success = false;
> +
> +	preempt_disable();
> +	if (likely(!mutex_is_locked(&brw->writer_mutex))) {
> +		__this_cpu_add(*brw->fast_read_ctr, val);
> +		success = true;
> +	}
> +	preempt_enable();
> +
> +	return success;
> +}
> +
> +/*
> + * Like the normal down_read() this is not recursive, the writer can
> + * come after the first percpu_down_read() and create the deadlock.
> + */
> +void percpu_down_read(struct percpu_rw_semaphore *brw)
> +{
> +	if (likely(update_fast_ctr(brw, +1)))
> +		return;
> +
> +	down_read(&brw->rw_sem);
> +	atomic_inc(&brw->slow_read_ctr);
> +	up_read(&brw->rw_sem);
> +}
> +
> +void percpu_up_read(struct percpu_rw_semaphore *brw)
> +{
> +	if (likely(update_fast_ctr(brw, -1)))
> +		return;
> +
> +	/* false-positive is possible but harmless */
> +	if (atomic_dec_and_test(&brw->slow_read_ctr))
> +		wake_up_all(&brw->write_waitq);
> +}
> +
> +static int clear_fast_ctr(struct percpu_rw_semaphore *brw)
> +{
> +	unsigned int sum = 0;
> +	int cpu;
> +
> +	for_each_possible_cpu(cpu) {
> +		sum += per_cpu(*brw->fast_read_ctr, cpu);
> +		per_cpu(*brw->fast_read_ctr, cpu) = 0;
> +	}
> +
> +	return sum;
> +}
> +
> +/*
> + * A writer takes ->writer_mutex to exclude other writers and to force the
> + * readers to switch to the slow mode, note the mutex_is_locked() check in
> + * update_fast_ctr().
> + *
> + * After that the readers can only inc/dec the slow ->slow_read_ctr counter,
> + * ->fast_read_ctr is stable. Once the writer moves its sum into the slow
> + * counter it represents the number of active readers.
> + *
> + * Finally the writer takes ->rw_sem for writing and blocks the new readers,
> + * then waits until the slow counter becomes zero.
> + */

Some overview of how fast/slow_read_ctr are supposed to work would be
useful.  This comment seems to assume that the reader already knew
that.

> +void percpu_down_write(struct percpu_rw_semaphore *brw)
> +{
> +	/* also blocks update_fast_ctr() which checks mutex_is_locked() */
> +	mutex_lock(&brw->writer_mutex);
> +
> +	/*
> +	 * 1. Ensures mutex_is_locked() is visible to any down_read/up_read
> +	 *    so that update_fast_ctr() can't succeed.
> +	 *
> +	 * 2. Ensures we see the result of every previous this_cpu_add() in
> +	 *    update_fast_ctr().
> +	 *
> +	 * 3. Ensures that if any reader has exited its critical section via
> +	 *    fast-path, it executes a full memory barrier before we return.
> +	 */
> +	synchronize_sched();

Here's where I get horridly confused.  Your patch completely deRCUifies
this code, yes?  Yet here we're using an RCU primitive.  And we seem to
be using it not as an RCU primitive but as a handy thing which happens
to have desirable side-effects.  But the implementation of
synchronize_sched() differs considerably according to which rcu
flavor-of-the-minute you're using.

And part 3 talks about the reader's critical section.  The only
critical sections I can see on the reader side are already covered by
mutex_lock() and preempt_disable().

I get this feeling I don't have a clue what's going on here and I think
I'll just retire hurt now.  If this code isn't as brain damaged as it
initially appears then please, go easy on us simpletons in the next
version?

> +	/* nobody can use fast_read_ctr, move its sum into slow_read_ctr */
> +	atomic_add(clear_fast_ctr(brw), &brw->slow_read_ctr);
> +
> +	/* block the new readers completely */
> +	down_write(&brw->rw_sem);
> +
> +	/* wait for all readers to complete their percpu_up_read() */
> +	wait_event(brw->write_waitq, !atomic_read(&brw->slow_read_ctr));
> +}
> +
> +void percpu_up_write(struct percpu_rw_semaphore *brw)
> +{
> +	/* allow the new readers, but only the slow-path */
> +	up_write(&brw->rw_sem);
> +
> +	/* insert the barrier before the next fast-path in down_read */
> +	synchronize_sched();
> +
> +	mutex_unlock(&brw->writer_mutex);
> +}

^ permalink raw reply	[flat|nested] 103+ messages in thread
* Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily 2012-11-08 20:07 ` Andrew Morton @ 2012-11-08 21:08 ` Paul E. McKenney 2012-11-08 23:41 ` Mikulas Patocka 2012-11-09 12:47 ` Mikulas Patocka ` (2 subsequent siblings) 3 siblings, 1 reply; 103+ messages in thread From: Paul E. McKenney @ 2012-11-08 21:08 UTC (permalink / raw) To: Andrew Morton Cc: Oleg Nesterov, Linus Torvalds, Mikulas Patocka, Peter Zijlstra, Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On Thu, Nov 08, 2012 at 12:07:00PM -0800, Andrew Morton wrote: > On Thu, 8 Nov 2012 14:48:49 +0100 > Oleg Nesterov <oleg@redhat.com> wrote: > > > Currently the writer does msleep() plus synchronize_sched() 3 times > > to acquire/release the semaphore, and during this time the readers > > are blocked completely. Even if the "write" section was not actually > > started or if it was already finished. > > > > With this patch down_write/up_write does synchronize_sched() twice > > and down_read/up_read are still possible during this time, just they > > use the slow path. > > > > percpu_down_write() first forces the readers to use rw_semaphore and > > increment the "slow" counter to take the lock for reading, then it > > takes that rw_semaphore for writing and blocks the readers. > > > > Also. With this patch the code relies on the documented behaviour of > > synchronize_sched(), it doesn't try to pair synchronize_sched() with > > barrier. > > > > ... > > > > include/linux/percpu-rwsem.h | 83 +++++------------------------ > > lib/Makefile | 2 +- > > lib/percpu-rwsem.c | 123 ++++++++++++++++++++++++++++++++++++++++++ > > The patch also uninlines everything. > > And it didn't export the resulting symbols to modules, so it isn't an > equivalent. We can export thing later if needed I guess. > > It adds percpu-rwsem.o to lib-y, so the CONFIG_BLOCK=n kernel will > avoid including the code altogether, methinks? > > > > > ... 
> > > > --- /dev/null > > +++ b/lib/percpu-rwsem.c > > @@ -0,0 +1,123 @@ > > That was nice and terse ;) > > > +#include <linux/percpu-rwsem.h> > > +#include <linux/rcupdate.h> > > +#include <linux/sched.h> > > This list is nowhere near sufficient to support this file's > requirements. atomic.h, percpu.h, rwsem.h, wait.h, errno.h and plenty > more. IOW, if it compiles, it was sheer luck. > > > +int percpu_init_rwsem(struct percpu_rw_semaphore *brw) > > +{ > > + brw->fast_read_ctr = alloc_percpu(int); > > + if (unlikely(!brw->fast_read_ctr)) > > + return -ENOMEM; > > + > > + mutex_init(&brw->writer_mutex); > > + init_rwsem(&brw->rw_sem); > > + atomic_set(&brw->slow_read_ctr, 0); > > + init_waitqueue_head(&brw->write_waitq); > > + return 0; > > +} > > + > > +void percpu_free_rwsem(struct percpu_rw_semaphore *brw) > > +{ > > + free_percpu(brw->fast_read_ctr); > > + brw->fast_read_ctr = NULL; /* catch use after free bugs */ > > +} > > + > > +static bool update_fast_ctr(struct percpu_rw_semaphore *brw, unsigned int val) > > +{ > > + bool success = false; > > + > > + preempt_disable(); > > + if (likely(!mutex_is_locked(&brw->writer_mutex))) { > > + __this_cpu_add(*brw->fast_read_ctr, val); > > + success = true; > > + } > > + preempt_enable(); > > + > > + return success; > > +} > > + > > +/* > > + * Like the normal down_read() this is not recursive, the writer can > > + * come after the first percpu_down_read() and create the deadlock. 
> > + */ > > +void percpu_down_read(struct percpu_rw_semaphore *brw) > > +{ > > + if (likely(update_fast_ctr(brw, +1))) > > + return; > > + > > + down_read(&brw->rw_sem); > > + atomic_inc(&brw->slow_read_ctr); > > + up_read(&brw->rw_sem); > > +} > > + > > +void percpu_up_read(struct percpu_rw_semaphore *brw) > > +{ > > + if (likely(update_fast_ctr(brw, -1))) > > + return; > > + > > + /* false-positive is possible but harmless */ > > + if (atomic_dec_and_test(&brw->slow_read_ctr)) > > + wake_up_all(&brw->write_waitq); > > +} > > + > > +static int clear_fast_ctr(struct percpu_rw_semaphore *brw) > > +{ > > + unsigned int sum = 0; > > + int cpu; > > + > > + for_each_possible_cpu(cpu) { > > + sum += per_cpu(*brw->fast_read_ctr, cpu); > > + per_cpu(*brw->fast_read_ctr, cpu) = 0; > > + } > > + > > + return sum; > > +} > > + > > +/* > > + * A writer takes ->writer_mutex to exclude other writers and to force the > > + * readers to switch to the slow mode, note the mutex_is_locked() check in > > + * update_fast_ctr(). > > + * > > + * After that the readers can only inc/dec the slow ->slow_read_ctr counter, > > + * ->fast_read_ctr is stable. Once the writer moves its sum into the slow > > + * counter it represents the number of active readers. > > + * > > + * Finally the writer takes ->rw_sem for writing and blocks the new readers, > > + * then waits until the slow counter becomes zero. > > + */ > > Some overview of how fast/slow_read_ctr are supposed to work would be > useful. This comment seems to assume that the reader already knew > that. > > > +void percpu_down_write(struct percpu_rw_semaphore *brw) > > +{ > > + /* also blocks update_fast_ctr() which checks mutex_is_locked() */ > > + mutex_lock(&brw->writer_mutex); > > + > > + /* > > + * 1. Ensures mutex_is_locked() is visible to any down_read/up_read > > + * so that update_fast_ctr() can't succeed. > > + * > > + * 2. Ensures we see the result of every previous this_cpu_add() in > > + * update_fast_ctr(). 
> > + * > > + * 3. Ensures that if any reader has exited its critical section via > > + * fast-path, it executes a full memory barrier before we return. > > + */ > > + synchronize_sched(); > > Here's where I get horridly confused. Your patch completely deRCUifies > this code, yes? Yet here we're using an RCU primitive. And we seem to > be using it not as an RCU primitive but as a handy thing which happens > to have desirable side-effects. But the implementation of > synchronize_sched() differs considerably according to which rcu > flavor-of-the-minute you're using. The trick is that the preempt_disable() call in update_fast_ctr() acts as an RCU read-side critical section WRT synchronize_sched(). The algorithm would work given rcu_read_lock()/rcu_read_unlock() and synchronize_rcu() in place of preempt_disable()/preempt_enable() and synchronize_sched(). The real-time guys would prefer the change to rcu_read_lock()/rcu_read_unlock() and synchronize_rcu(), now that you mention it. Oleg, Mikulas, any reason not to move to rcu_read_lock()/rcu_read_unlock() and synchronize_rcu()? Thanx, Paul > And part 3 talks about the reader's critical section. The only > critical sections I can see on the reader side are already covered by > mutex_lock() and preempt_diable(). > > I get this feeling I don't have clue what's going on here and I think > I'll just retire hurt now. If this code isn't as brain damaged as it > initially appears then please, go easy on us simpletons in the next > version? 
> > > + /* nobody can use fast_read_ctr, move its sum into slow_read_ctr */ > > + atomic_add(clear_fast_ctr(brw), &brw->slow_read_ctr); > > + > > + /* block the new readers completely */ > > + down_write(&brw->rw_sem); > > + > > + /* wait for all readers to complete their percpu_up_read() */ > > + wait_event(brw->write_waitq, !atomic_read(&brw->slow_read_ctr)); > > +} > > + > > +void percpu_up_write(struct percpu_rw_semaphore *brw) > > +{ > > + /* allow the new readers, but only the slow-path */ > > + up_write(&brw->rw_sem); > > + > > + /* insert the barrier before the next fast-path in down_read */ > > + synchronize_sched(); > > + > > + mutex_unlock(&brw->writer_mutex); > > +} > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > ^ permalink raw reply [flat|nested] 103+ messages in thread
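Paul's suggested substitution — rcu_read_lock()/rcu_read_unlock() plus synchronize_rcu() in place of preempt_disable()/preempt_enable() plus synchronize_sched() — would look roughly like this on the reader side. This is a sketch of the hypothetical variant he asks about, not code from any posted patch, and it is not meant to be compiled standalone:

```c
/* Sketch only: update_fast_ctr() rewritten per Paul's suggestion.
 * The rcu_read_lock()/rcu_read_unlock() pair forms the read-side
 * critical section, and the writer would then pair it with
 * synchronize_rcu() rather than synchronize_sched(). */
static bool update_fast_ctr_rcu(struct percpu_rw_semaphore *brw, unsigned int val)
{
	bool success = false;

	rcu_read_lock();	/* read-side critical section WRT synchronize_rcu() */
	if (likely(!mutex_is_locked(&brw->writer_mutex))) {
		__this_cpu_add(*brw->fast_read_ctr, val);
		success = true;
	}
	rcu_read_unlock();

	return success;
}
```

The correctness argument is unchanged: either variant guarantees that the writer's grace period cannot complete while a reader is between the mutex_is_locked() check and the counter update.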
* Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily 2012-11-08 21:08 ` Paul E. McKenney @ 2012-11-08 23:41 ` Mikulas Patocka 2012-11-09 0:41 ` Paul E. McKenney 0 siblings, 1 reply; 103+ messages in thread From: Mikulas Patocka @ 2012-11-08 23:41 UTC (permalink / raw) To: Paul E. McKenney Cc: Andrew Morton, Oleg Nesterov, Linus Torvalds, Peter Zijlstra, Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On Thu, 8 Nov 2012, Paul E. McKenney wrote: > On Thu, Nov 08, 2012 at 12:07:00PM -0800, Andrew Morton wrote: > > On Thu, 8 Nov 2012 14:48:49 +0100 > > Oleg Nesterov <oleg@redhat.com> wrote: > > > > > Currently the writer does msleep() plus synchronize_sched() 3 times > > > to acquire/release the semaphore, and during this time the readers > > > are blocked completely. Even if the "write" section was not actually > > > started or if it was already finished. > > > > > > With this patch down_write/up_write does synchronize_sched() twice > > > and down_read/up_read are still possible during this time, just they > > > use the slow path. > > > > > > percpu_down_write() first forces the readers to use rw_semaphore and > > > increment the "slow" counter to take the lock for reading, then it > > > takes that rw_semaphore for writing and blocks the readers. > > > > > > Also. With this patch the code relies on the documented behaviour of > > > synchronize_sched(), it doesn't try to pair synchronize_sched() with > > > barrier. > > > > > > ... > > > > > > include/linux/percpu-rwsem.h | 83 +++++------------------------ > > > lib/Makefile | 2 +- > > > lib/percpu-rwsem.c | 123 ++++++++++++++++++++++++++++++++++++++++++ > > > > The patch also uninlines everything. > > > > And it didn't export the resulting symbols to modules, so it isn't an > > equivalent. We can export thing later if needed I guess. 
> > > > It adds percpu-rwsem.o to lib-y, so the CONFIG_BLOCK=n kernel will > > avoid including the code altogether, methinks? > > > > > > > > ... > > > > > > --- /dev/null > > > +++ b/lib/percpu-rwsem.c > > > @@ -0,0 +1,123 @@ > > > > That was nice and terse ;) > > > > > +#include <linux/percpu-rwsem.h> > > > +#include <linux/rcupdate.h> > > > +#include <linux/sched.h> > > > > This list is nowhere near sufficient to support this file's > > requirements. atomic.h, percpu.h, rwsem.h, wait.h, errno.h and plenty > > more. IOW, if it compiles, it was sheer luck. > > > > > +int percpu_init_rwsem(struct percpu_rw_semaphore *brw) > > > +{ > > > + brw->fast_read_ctr = alloc_percpu(int); > > > + if (unlikely(!brw->fast_read_ctr)) > > > + return -ENOMEM; > > > + > > > + mutex_init(&brw->writer_mutex); > > > + init_rwsem(&brw->rw_sem); > > > + atomic_set(&brw->slow_read_ctr, 0); > > > + init_waitqueue_head(&brw->write_waitq); > > > + return 0; > > > +} > > > + > > > +void percpu_free_rwsem(struct percpu_rw_semaphore *brw) > > > +{ > > > + free_percpu(brw->fast_read_ctr); > > > + brw->fast_read_ctr = NULL; /* catch use after free bugs */ > > > +} > > > + > > > +static bool update_fast_ctr(struct percpu_rw_semaphore *brw, unsigned int val) > > > +{ > > > + bool success = false; > > > + > > > + preempt_disable(); > > > + if (likely(!mutex_is_locked(&brw->writer_mutex))) { > > > + __this_cpu_add(*brw->fast_read_ctr, val); > > > + success = true; > > > + } > > > + preempt_enable(); > > > + > > > + return success; > > > +} > > > + > > > +/* > > > + * Like the normal down_read() this is not recursive, the writer can > > > + * come after the first percpu_down_read() and create the deadlock. 
> > > + */ > > > +void percpu_down_read(struct percpu_rw_semaphore *brw) > > > +{ > > > + if (likely(update_fast_ctr(brw, +1))) > > > + return; > > > + > > > + down_read(&brw->rw_sem); > > > + atomic_inc(&brw->slow_read_ctr); > > > + up_read(&brw->rw_sem); > > > +} > > > + > > > +void percpu_up_read(struct percpu_rw_semaphore *brw) > > > +{ > > > + if (likely(update_fast_ctr(brw, -1))) > > > + return; > > > + > > > + /* false-positive is possible but harmless */ > > > + if (atomic_dec_and_test(&brw->slow_read_ctr)) > > > + wake_up_all(&brw->write_waitq); > > > +} > > > + > > > +static int clear_fast_ctr(struct percpu_rw_semaphore *brw) > > > +{ > > > + unsigned int sum = 0; > > > + int cpu; > > > + > > > + for_each_possible_cpu(cpu) { > > > + sum += per_cpu(*brw->fast_read_ctr, cpu); > > > + per_cpu(*brw->fast_read_ctr, cpu) = 0; > > > + } > > > + > > > + return sum; > > > +} > > > + > > > +/* > > > + * A writer takes ->writer_mutex to exclude other writers and to force the > > > + * readers to switch to the slow mode, note the mutex_is_locked() check in > > > + * update_fast_ctr(). > > > + * > > > + * After that the readers can only inc/dec the slow ->slow_read_ctr counter, > > > + * ->fast_read_ctr is stable. Once the writer moves its sum into the slow > > > + * counter it represents the number of active readers. > > > + * > > > + * Finally the writer takes ->rw_sem for writing and blocks the new readers, > > > + * then waits until the slow counter becomes zero. > > > + */ > > > > Some overview of how fast/slow_read_ctr are supposed to work would be > > useful. This comment seems to assume that the reader already knew > > that. > > > > > +void percpu_down_write(struct percpu_rw_semaphore *brw) > > > +{ > > > + /* also blocks update_fast_ctr() which checks mutex_is_locked() */ > > > + mutex_lock(&brw->writer_mutex); > > > + > > > + /* > > > + * 1. Ensures mutex_is_locked() is visible to any down_read/up_read > > > + * so that update_fast_ctr() can't succeed. 
> > > + * > > > + * 2. Ensures we see the result of every previous this_cpu_add() in > > > + * update_fast_ctr(). > > > + * > > > + * 3. Ensures that if any reader has exited its critical section via > > > + * fast-path, it executes a full memory barrier before we return. > > > + */ > > > + synchronize_sched(); > > > > Here's where I get horridly confused. Your patch completely deRCUifies > > this code, yes? Yet here we're using an RCU primitive. And we seem to > > be using it not as an RCU primitive but as a handy thing which happens > > to have desirable side-effects. But the implementation of > > synchronize_sched() differs considerably according to which rcu > > flavor-of-the-minute you're using. > > The trick is that the preempt_disable() call in update_fast_ctr() > acts as an RCU read-side critical section WRT synchronize_sched(). > > The algorithm would work given rcu_read_lock()/rcu_read_unlock() and > synchronize_rcu() in place of preempt_disable()/preempt_enable() and > synchronize_sched(). The real-time guys would prefer the change > to rcu_read_lock()/rcu_read_unlock() and synchronize_rcu(), now that > you mention it. > > Oleg, Mikulas, any reason not to move to rcu_read_lock()/rcu_read_unlock() > and synchronize_rcu()? > > Thanx, Paul preempt_disable/preempt_enable is faster than rcu_read_lock/rcu_read_unlock for preemptive kernels. Regarding real-time response - the region blocked with preempt_disable/preempt_enable contains a few instructions (one test for mutex_is_locked and one increment of percpu variable), so it isn't any threat to real time response. There are plenty of longer regions in the kernel that are executed with interrupts or preemption disabled. Mikulas ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily 2012-11-08 23:41 ` Mikulas Patocka @ 2012-11-09 0:41 ` Paul E. McKenney 2012-11-09 3:23 ` Paul E. McKenney 0 siblings, 1 reply; 103+ messages in thread From: Paul E. McKenney @ 2012-11-09 0:41 UTC (permalink / raw) To: Mikulas Patocka Cc: Andrew Morton, Oleg Nesterov, Linus Torvalds, Peter Zijlstra, Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On Thu, Nov 08, 2012 at 06:41:10PM -0500, Mikulas Patocka wrote: > > > On Thu, 8 Nov 2012, Paul E. McKenney wrote: > > > On Thu, Nov 08, 2012 at 12:07:00PM -0800, Andrew Morton wrote: > > > On Thu, 8 Nov 2012 14:48:49 +0100 > > > Oleg Nesterov <oleg@redhat.com> wrote: > > > > > > > Currently the writer does msleep() plus synchronize_sched() 3 times > > > > to acquire/release the semaphore, and during this time the readers > > > > are blocked completely. Even if the "write" section was not actually > > > > started or if it was already finished. > > > > > > > > With this patch down_write/up_write does synchronize_sched() twice > > > > and down_read/up_read are still possible during this time, just they > > > > use the slow path. > > > > > > > > percpu_down_write() first forces the readers to use rw_semaphore and > > > > increment the "slow" counter to take the lock for reading, then it > > > > takes that rw_semaphore for writing and blocks the readers. > > > > > > > > Also. With this patch the code relies on the documented behaviour of > > > > synchronize_sched(), it doesn't try to pair synchronize_sched() with > > > > barrier. > > > > > > > > ... > > > > > > > > include/linux/percpu-rwsem.h | 83 +++++------------------------ > > > > lib/Makefile | 2 +- > > > > lib/percpu-rwsem.c | 123 ++++++++++++++++++++++++++++++++++++++++++ > > > > > > The patch also uninlines everything. > > > > > > And it didn't export the resulting symbols to modules, so it isn't an > > > equivalent. 
We can export thing later if needed I guess. > > > > > > It adds percpu-rwsem.o to lib-y, so the CONFIG_BLOCK=n kernel will > > > avoid including the code altogether, methinks? > > > > > > > > > > > ... > > > > > > > > --- /dev/null > > > > +++ b/lib/percpu-rwsem.c > > > > @@ -0,0 +1,123 @@ > > > > > > That was nice and terse ;) > > > > > > > +#include <linux/percpu-rwsem.h> > > > > +#include <linux/rcupdate.h> > > > > +#include <linux/sched.h> > > > > > > This list is nowhere near sufficient to support this file's > > > requirements. atomic.h, percpu.h, rwsem.h, wait.h, errno.h and plenty > > > more. IOW, if it compiles, it was sheer luck. > > > > > > > +int percpu_init_rwsem(struct percpu_rw_semaphore *brw) > > > > +{ > > > > + brw->fast_read_ctr = alloc_percpu(int); > > > > + if (unlikely(!brw->fast_read_ctr)) > > > > + return -ENOMEM; > > > > + > > > > + mutex_init(&brw->writer_mutex); > > > > + init_rwsem(&brw->rw_sem); > > > > + atomic_set(&brw->slow_read_ctr, 0); > > > > + init_waitqueue_head(&brw->write_waitq); > > > > + return 0; > > > > +} > > > > + > > > > +void percpu_free_rwsem(struct percpu_rw_semaphore *brw) > > > > +{ > > > > + free_percpu(brw->fast_read_ctr); > > > > + brw->fast_read_ctr = NULL; /* catch use after free bugs */ > > > > +} > > > > + > > > > +static bool update_fast_ctr(struct percpu_rw_semaphore *brw, unsigned int val) > > > > +{ > > > > + bool success = false; > > > > + > > > > + preempt_disable(); > > > > + if (likely(!mutex_is_locked(&brw->writer_mutex))) { > > > > + __this_cpu_add(*brw->fast_read_ctr, val); > > > > + success = true; > > > > + } > > > > + preempt_enable(); > > > > + > > > > + return success; > > > > +} > > > > + > > > > +/* > > > > + * Like the normal down_read() this is not recursive, the writer can > > > > + * come after the first percpu_down_read() and create the deadlock. 
> > > > + */ > > > > +void percpu_down_read(struct percpu_rw_semaphore *brw) > > > > +{ > > > > + if (likely(update_fast_ctr(brw, +1))) > > > > + return; > > > > + > > > > + down_read(&brw->rw_sem); > > > > + atomic_inc(&brw->slow_read_ctr); > > > > + up_read(&brw->rw_sem); > > > > +} > > > > + > > > > +void percpu_up_read(struct percpu_rw_semaphore *brw) > > > > +{ > > > > + if (likely(update_fast_ctr(brw, -1))) > > > > + return; > > > > + > > > > + /* false-positive is possible but harmless */ > > > > + if (atomic_dec_and_test(&brw->slow_read_ctr)) > > > > + wake_up_all(&brw->write_waitq); > > > > +} > > > > + > > > > +static int clear_fast_ctr(struct percpu_rw_semaphore *brw) > > > > +{ > > > > + unsigned int sum = 0; > > > > + int cpu; > > > > + > > > > + for_each_possible_cpu(cpu) { > > > > + sum += per_cpu(*brw->fast_read_ctr, cpu); > > > > + per_cpu(*brw->fast_read_ctr, cpu) = 0; > > > > + } > > > > + > > > > + return sum; > > > > +} > > > > + > > > > +/* > > > > + * A writer takes ->writer_mutex to exclude other writers and to force the > > > > + * readers to switch to the slow mode, note the mutex_is_locked() check in > > > > + * update_fast_ctr(). > > > > + * > > > > + * After that the readers can only inc/dec the slow ->slow_read_ctr counter, > > > > + * ->fast_read_ctr is stable. Once the writer moves its sum into the slow > > > > + * counter it represents the number of active readers. > > > > + * > > > > + * Finally the writer takes ->rw_sem for writing and blocks the new readers, > > > > + * then waits until the slow counter becomes zero. > > > > + */ > > > > > > Some overview of how fast/slow_read_ctr are supposed to work would be > > > useful. This comment seems to assume that the reader already knew > > > that. > > > > > > > +void percpu_down_write(struct percpu_rw_semaphore *brw) > > > > +{ > > > > + /* also blocks update_fast_ctr() which checks mutex_is_locked() */ > > > > + mutex_lock(&brw->writer_mutex); > > > > + > > > > + /* > > > > + * 1. 
Ensures mutex_is_locked() is visible to any down_read/up_read > > > > + * so that update_fast_ctr() can't succeed. > > > > + * > > > > + * 2. Ensures we see the result of every previous this_cpu_add() in > > > > + * update_fast_ctr(). > > > > + * > > > > + * 3. Ensures that if any reader has exited its critical section via > > > > + * fast-path, it executes a full memory barrier before we return. > > > > + */ > > > > + synchronize_sched(); > > > > > > Here's where I get horridly confused. Your patch completely deRCUifies > > > this code, yes? Yet here we're using an RCU primitive. And we seem to > > > be using it not as an RCU primitive but as a handy thing which happens > > > to have desirable side-effects. But the implementation of > > > synchronize_sched() differs considerably according to which rcu > > > flavor-of-the-minute you're using. > > > > The trick is that the preempt_disable() call in update_fast_ctr() > > acts as an RCU read-side critical section WRT synchronize_sched(). > > > > The algorithm would work given rcu_read_lock()/rcu_read_unlock() and > > synchronize_rcu() in place of preempt_disable()/preempt_enable() and > > synchronize_sched(). The real-time guys would prefer the change > > to rcu_read_lock()/rcu_read_unlock() and synchronize_rcu(), now that > > you mention it. > > > > Oleg, Mikulas, any reason not to move to rcu_read_lock()/rcu_read_unlock() > > and synchronize_rcu()? > > preempt_disable/preempt_enable is faster than > rcu_read_lock/rcu_read_unlock for preemptive kernels. Significantly faster in this case? Can you measure the difference from a user-mode test? Hmmm. I have been avoiding moving the preemptible-RCU state from task_struct to thread_info, but if the difference really matters, perhaps that needs to be done. 
> Regarding real-time response - the region blocked with > preempt_disable/preempt_enable contains a few instructions (one test for > mutex_is_locked and one increment of percpu variable), so it isn't any > threat to real time response. There are plenty of longer regions in the > kernel that are executed with interrupts or preemption disabled. Careful. The real-time guys might take the same every-little-bit approach to latency that you seem to be taking for CPU cycles. ;-) Thanx, Paul ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily 2012-11-09 0:41 ` Paul E. McKenney @ 2012-11-09 3:23 ` Paul E. McKenney 2012-11-09 16:35 ` Oleg Nesterov 0 siblings, 1 reply; 103+ messages in thread From: Paul E. McKenney @ 2012-11-09 3:23 UTC (permalink / raw) To: Mikulas Patocka Cc: Andrew Morton, Oleg Nesterov, Linus Torvalds, Peter Zijlstra, Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On Thu, Nov 08, 2012 at 04:41:36PM -0800, Paul E. McKenney wrote: > On Thu, Nov 08, 2012 at 06:41:10PM -0500, Mikulas Patocka wrote: > > > > > > On Thu, 8 Nov 2012, Paul E. McKenney wrote: > > > > > On Thu, Nov 08, 2012 at 12:07:00PM -0800, Andrew Morton wrote: > > > > On Thu, 8 Nov 2012 14:48:49 +0100 > > > > Oleg Nesterov <oleg@redhat.com> wrote: > > > > > > > > > Currently the writer does msleep() plus synchronize_sched() 3 times > > > > > to acquire/release the semaphore, and during this time the readers > > > > > are blocked completely. Even if the "write" section was not actually > > > > > started or if it was already finished. > > > > > > > > > > With this patch down_write/up_write does synchronize_sched() twice > > > > > and down_read/up_read are still possible during this time, just they > > > > > use the slow path. > > > > > > > > > > percpu_down_write() first forces the readers to use rw_semaphore and > > > > > increment the "slow" counter to take the lock for reading, then it > > > > > takes that rw_semaphore for writing and blocks the readers. > > > > > > > > > > Also. With this patch the code relies on the documented behaviour of > > > > > synchronize_sched(), it doesn't try to pair synchronize_sched() with > > > > > barrier. > > > > > > > > > > ... 
> > > > > > > > > > include/linux/percpu-rwsem.h | 83 +++++------------------------ > > > > > lib/Makefile | 2 +- > > > > > lib/percpu-rwsem.c | 123 ++++++++++++++++++++++++++++++++++++++++++ > > > > > > > > The patch also uninlines everything. > > > > > > > > And it didn't export the resulting symbols to modules, so it isn't an > > > > equivalent. We can export thing later if needed I guess. > > > > > > > > It adds percpu-rwsem.o to lib-y, so the CONFIG_BLOCK=n kernel will > > > > avoid including the code altogether, methinks? > > > > > > > > > > > > > > ... > > > > > > > > > > --- /dev/null > > > > > +++ b/lib/percpu-rwsem.c > > > > > @@ -0,0 +1,123 @@ > > > > > > > > That was nice and terse ;) > > > > > > > > > +#include <linux/percpu-rwsem.h> > > > > > +#include <linux/rcupdate.h> > > > > > +#include <linux/sched.h> > > > > > > > > This list is nowhere near sufficient to support this file's > > > > requirements. atomic.h, percpu.h, rwsem.h, wait.h, errno.h and plenty > > > > more. IOW, if it compiles, it was sheer luck. 
> > > > > > > > > +int percpu_init_rwsem(struct percpu_rw_semaphore *brw) > > > > > +{ > > > > > + brw->fast_read_ctr = alloc_percpu(int); > > > > > + if (unlikely(!brw->fast_read_ctr)) > > > > > + return -ENOMEM; > > > > > + > > > > > + mutex_init(&brw->writer_mutex); > > > > > + init_rwsem(&brw->rw_sem); > > > > > + atomic_set(&brw->slow_read_ctr, 0); > > > > > + init_waitqueue_head(&brw->write_waitq); > > > > > + return 0; > > > > > +} > > > > > + > > > > > +void percpu_free_rwsem(struct percpu_rw_semaphore *brw) > > > > > +{ > > > > > + free_percpu(brw->fast_read_ctr); > > > > > + brw->fast_read_ctr = NULL; /* catch use after free bugs */ > > > > > +} > > > > > + > > > > > +static bool update_fast_ctr(struct percpu_rw_semaphore *brw, unsigned int val) > > > > > +{ > > > > > + bool success = false; > > > > > + > > > > > + preempt_disable(); > > > > > + if (likely(!mutex_is_locked(&brw->writer_mutex))) { > > > > > + __this_cpu_add(*brw->fast_read_ctr, val); > > > > > + success = true; > > > > > + } > > > > > + preempt_enable(); > > > > > + > > > > > + return success; > > > > > +} > > > > > + > > > > > +/* > > > > > + * Like the normal down_read() this is not recursive, the writer can > > > > > + * come after the first percpu_down_read() and create the deadlock. 
> > > > > + */ > > > > > +void percpu_down_read(struct percpu_rw_semaphore *brw) > > > > > +{ > > > > > + if (likely(update_fast_ctr(brw, +1))) > > > > > + return; > > > > > + > > > > > + down_read(&brw->rw_sem); > > > > > + atomic_inc(&brw->slow_read_ctr); > > > > > + up_read(&brw->rw_sem); > > > > > +} > > > > > + > > > > > +void percpu_up_read(struct percpu_rw_semaphore *brw) > > > > > +{ > > > > > + if (likely(update_fast_ctr(brw, -1))) > > > > > + return; > > > > > + > > > > > + /* false-positive is possible but harmless */ > > > > > + if (atomic_dec_and_test(&brw->slow_read_ctr)) > > > > > + wake_up_all(&brw->write_waitq); > > > > > +} > > > > > + > > > > > +static int clear_fast_ctr(struct percpu_rw_semaphore *brw) > > > > > +{ > > > > > + unsigned int sum = 0; > > > > > + int cpu; > > > > > + > > > > > + for_each_possible_cpu(cpu) { > > > > > + sum += per_cpu(*brw->fast_read_ctr, cpu); > > > > > + per_cpu(*brw->fast_read_ctr, cpu) = 0; > > > > > + } > > > > > + > > > > > + return sum; > > > > > +} > > > > > + > > > > > +/* > > > > > + * A writer takes ->writer_mutex to exclude other writers and to force the > > > > > + * readers to switch to the slow mode, note the mutex_is_locked() check in > > > > > + * update_fast_ctr(). > > > > > + * > > > > > + * After that the readers can only inc/dec the slow ->slow_read_ctr counter, > > > > > + * ->fast_read_ctr is stable. Once the writer moves its sum into the slow > > > > > + * counter it represents the number of active readers. > > > > > + * > > > > > + * Finally the writer takes ->rw_sem for writing and blocks the new readers, > > > > > + * then waits until the slow counter becomes zero. > > > > > + */ > > > > > > > > Some overview of how fast/slow_read_ctr are supposed to work would be > > > > useful. This comment seems to assume that the reader already knew > > > > that. 
> > > > > > > > > +void percpu_down_write(struct percpu_rw_semaphore *brw) > > > > > +{ > > > > > + /* also blocks update_fast_ctr() which checks mutex_is_locked() */ > > > > > + mutex_lock(&brw->writer_mutex); > > > > > + > > > > > + /* > > > > > + * 1. Ensures mutex_is_locked() is visible to any down_read/up_read > > > > > + * so that update_fast_ctr() can't succeed. > > > > > + * > > > > > + * 2. Ensures we see the result of every previous this_cpu_add() in > > > > > + * update_fast_ctr(). > > > > > + * > > > > > + * 3. Ensures that if any reader has exited its critical section via > > > > > + * fast-path, it executes a full memory barrier before we return. > > > > > + */ > > > > > + synchronize_sched(); > > > > > > > > Here's where I get horridly confused. Your patch completely deRCUifies > > > > this code, yes? Yet here we're using an RCU primitive. And we seem to > > > > be using it not as an RCU primitive but as a handy thing which happens > > > > to have desirable side-effects. But the implementation of > > > > synchronize_sched() differs considerably according to which rcu > > > > flavor-of-the-minute you're using. > > > > > > The trick is that the preempt_disable() call in update_fast_ctr() > > > acts as an RCU read-side critical section WRT synchronize_sched(). > > > > > > The algorithm would work given rcu_read_lock()/rcu_read_unlock() and > > > synchronize_rcu() in place of preempt_disable()/preempt_enable() and > > > synchronize_sched(). The real-time guys would prefer the change > > > to rcu_read_lock()/rcu_read_unlock() and synchronize_rcu(), now that > > > you mention it. > > > > > > Oleg, Mikulas, any reason not to move to rcu_read_lock()/rcu_read_unlock() > > > and synchronize_rcu()? > > > > preempt_disable/preempt_enable is faster than > > rcu_read_lock/rcu_read_unlock for preemptive kernels. > > Significantly faster in this case? Can you measure the difference > from a user-mode test? > > Hmmm. 
I have been avoiding moving the preemptible-RCU state from > task_struct to thread_info, but if the difference really matters, > perhaps that needs to be done. Actually, the fact that __this_cpu_add() will malfunction on some architectures if preemption is not disabled seems a more compelling reason to keep preempt_enable() than any performance improvement. ;-) Thanx, Paul > > Regarding real-time response - the region blocked with > > preempt_disable/preempt_enable contains a few instructions (one test for > > mutex_is_locked and one increment of percpu variable), so it isn't any > > threat to real time response. There are plenty of longer regions in the > > kernel that are executed with interrupts or preemption disabled. > > Careful. The real-time guys might take the same every-little-bit approach > to latency that you seem to be taking for CPU cycles. ;-) > > Thanx, Paul ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily 2012-11-09 3:23 ` Paul E. McKenney @ 2012-11-09 16:35 ` Oleg Nesterov 2012-11-09 16:59 ` Paul E. McKenney 0 siblings, 1 reply; 103+ messages in thread From: Oleg Nesterov @ 2012-11-09 16:35 UTC (permalink / raw) To: Paul E. McKenney Cc: Mikulas Patocka, Andrew Morton, Linus Torvalds, Peter Zijlstra, Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On 11/08, Paul E. McKenney wrote: > > On Thu, Nov 08, 2012 at 04:41:36PM -0800, Paul E. McKenney wrote: > > On Thu, Nov 08, 2012 at 06:41:10PM -0500, Mikulas Patocka wrote: > > > > > > On Thu, 8 Nov 2012, Paul E. McKenney wrote: > > > > > > > On Thu, Nov 08, 2012 at 12:07:00PM -0800, Andrew Morton wrote: > > > > > On Thu, 8 Nov 2012 14:48:49 +0100 > > > > > Oleg Nesterov <oleg@redhat.com> wrote: > > > > > > > > > > > > > The algorithm would work given rcu_read_lock()/rcu_read_unlock() and > > > > synchronize_rcu() in place of preempt_disable()/preempt_enable() and > > > > synchronize_sched(). The real-time guys would prefer the change > > > > to rcu_read_lock()/rcu_read_unlock() and synchronize_rcu(), now that > > > > you mention it. > > > > > > > > Oleg, Mikulas, any reason not to move to rcu_read_lock()/rcu_read_unlock() > > > > and synchronize_rcu()? > > > > > > preempt_disable/preempt_enable is faster than > > > rcu_read_lock/rcu_read_unlock for preemptive kernels. Yes, I chose preempt_disable() because it is the fastest/simplest primitive and the critical section is really tiny. But: > > Significantly faster in this case? Can you measure the difference > > from a user-mode test? I do not think rcu_read_lock() or rcu_read_lock_sched() can actually make a measurable difference. > Actually, the fact that __this_cpu_add() will malfunction on some > architectures if preemption is not disabled seems a more compelling > reason to keep preempt_enable() than any performance improvement. 
;-) Yes, but this_cpu_add() should work. > > Careful. The real-time guys might take the same every-little-bit approach > > to latency that you seem to be taking for CPU cycles. ;-) Understand... So I simply do not know. Please tell me if you think it would be better to use rcu_read_lock/synchronize_rcu or rcu_read_lock_sched, and I'll send the patch. Oleg. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily 2012-11-09 16:35 ` Oleg Nesterov @ 2012-11-09 16:59 ` Paul E. McKenney 0 siblings, 0 replies; 103+ messages in thread From: Paul E. McKenney @ 2012-11-09 16:59 UTC (permalink / raw) To: Oleg Nesterov Cc: Mikulas Patocka, Andrew Morton, Linus Torvalds, Peter Zijlstra, Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On Fri, Nov 09, 2012 at 05:35:38PM +0100, Oleg Nesterov wrote: > On 11/08, Paul E. McKenney wrote: > > > > On Thu, Nov 08, 2012 at 04:41:36PM -0800, Paul E. McKenney wrote: > > > On Thu, Nov 08, 2012 at 06:41:10PM -0500, Mikulas Patocka wrote: > > > > > > > > On Thu, 8 Nov 2012, Paul E. McKenney wrote: > > > > > > > > > On Thu, Nov 08, 2012 at 12:07:00PM -0800, Andrew Morton wrote: > > > > > > On Thu, 8 Nov 2012 14:48:49 +0100 > > > > > > Oleg Nesterov <oleg@redhat.com> wrote: > > > > > > > > > > > > > > > > The algorithm would work given rcu_read_lock()/rcu_read_unlock() and > > > > > synchronize_rcu() in place of preempt_disable()/preempt_enable() and > > > > > synchronize_sched(). The real-time guys would prefer the change > > > > > to rcu_read_lock()/rcu_read_unlock() and synchronize_rcu(), now that > > > > > you mention it. > > > > > > > > > > Oleg, Mikulas, any reason not to move to rcu_read_lock()/rcu_read_unlock() > > > > > and synchronize_rcu()? > > > > > > > > preempt_disable/preempt_enable is faster than > > > > rcu_read_lock/rcu_read_unlock for preemptive kernels. > > Yes, I chose preempt_disable() because it is the fastest/simplest > primitive and the critical section is really tiny. > > But: > > > > Significantly faster in this case? Can you measure the difference > > > from a user-mode test? > > I do not think rcu_read_lock() or rcu_read_lock_sched() can actually > make a measurable difference. 
> > > Actually, the fact that __this_cpu_add() will malfunction on some > > architectures if preemption is not disabled seems a more compelling > > reason to keep preempt_enable() than any performance improvement. ;-) > > Yes, but this_cpu_add() should work. Indeed! But this_cpu_add() just does the preempt_enable() under the covers, so not much difference from a latency viewpoint. > > > Careful. The real-time guys might take the same every-little-bit approach > > > to latency that you seem to be taking for CPU cycles. ;-) > > Understand... > > So I simply do not know. Please tell me if you think it would be > better to use rcu_read_lock/synchronize_rcu or rcu_read_lock_sched, > and I'll send the patch. I doubt if it makes a measurable difference for either throughput or latency. One could argue that rcu_read_lock() would be better for readability, but making sure that the preempt_disable() is clearly commented as starting an RCU-sched read-side critical section would be just as good. So I am OK with the current preempt_disable() approach. Thanx, Paul ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily 2012-11-08 20:07 ` Andrew Morton 2012-11-08 21:08 ` Paul E. McKenney @ 2012-11-09 12:47 ` Mikulas Patocka 2012-11-09 15:46 ` Oleg Nesterov 2012-11-11 18:27 ` [PATCH -mm] percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessari ly.fix Oleg Nesterov 3 siblings, 0 replies; 103+ messages in thread From: Mikulas Patocka @ 2012-11-09 12:47 UTC (permalink / raw) To: Andrew Morton Cc: Oleg Nesterov, Linus Torvalds, Paul E. McKenney, Peter Zijlstra, Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On Thu, 8 Nov 2012, Andrew Morton wrote: > On Thu, 8 Nov 2012 14:48:49 +0100 > Oleg Nesterov <oleg@redhat.com> wrote: > > > Currently the writer does msleep() plus synchronize_sched() 3 times > > to acquire/release the semaphore, and during this time the readers > > are blocked completely. Even if the "write" section was not actually > > started or if it was already finished. > > > > With this patch down_write/up_write does synchronize_sched() twice > > and down_read/up_read are still possible during this time, just they > > use the slow path. > > > > percpu_down_write() first forces the readers to use rw_semaphore and > > increment the "slow" counter to take the lock for reading, then it > > takes that rw_semaphore for writing and blocks the readers. > > > > Also. With this patch the code relies on the documented behaviour of > > synchronize_sched(), it doesn't try to pair synchronize_sched() with > > barrier. > > > > ... > > > > include/linux/percpu-rwsem.h | 83 +++++------------------------ > > lib/Makefile | 2 +- > > lib/percpu-rwsem.c | 123 ++++++++++++++++++++++++++++++++++++++++++ > > The patch also uninlines everything. > > And it didn't export the resulting symbols to modules, so it isn't an > equivalent. We can export thing later if needed I guess. 
> > It adds percpu-rwsem.o to lib-y, so the CONFIG_BLOCK=n kernel will > avoid including the code altogether, methinks? If you want to use percpu-rwsem only for block devices then you can drop Oleg's patch entirely. Oleg's optimizations are useless for the block device use case (the contention between readers and writers is very rare and it doesn't matter if readers are blocked in case of contention). I suppose that Oleg made the optimizations because he wants to use percpu-rwsem for something else - if not, you can drop the patch and revert to the previous version that is simpler. Mikulas ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily 2012-11-08 20:07 ` Andrew Morton 2012-11-08 21:08 ` Paul E. McKenney 2012-11-09 12:47 ` Mikulas Patocka @ 2012-11-09 15:46 ` Oleg Nesterov 2012-11-09 17:01 ` Paul E. McKenney 2012-11-11 18:27 ` [PATCH -mm] percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessari ly.fix Oleg Nesterov 3 siblings, 1 reply; 103+ messages in thread From: Oleg Nesterov @ 2012-11-09 15:46 UTC (permalink / raw) To: Andrew Morton Cc: Linus Torvalds, Paul E. McKenney, Mikulas Patocka, Peter Zijlstra, Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On 11/08, Andrew Morton wrote: > > On Thu, 8 Nov 2012 14:48:49 +0100 > Oleg Nesterov <oleg@redhat.com> wrote: > > > > > include/linux/percpu-rwsem.h | 83 +++++------------------------ > > lib/Makefile | 2 +- > > lib/percpu-rwsem.c | 123 ++++++++++++++++++++++++++++++++++++++++++ > > The patch also uninlines everything. > > And it didn't export the resulting symbols to modules, so it isn't an > equivalent. We can export thing later if needed I guess. Yes, currently it is only used by block_dev.c > It adds percpu-rwsem.o to lib-y, so the CONFIG_BLOCK=n kernel will > avoid including the code altogether, methinks? I am going to add another user (uprobes), this was my motivation for this patch. And perhaps it will have more users. But I agree, CONFIG_PERCPU_RWSEM makes sense at least now, I'll send the patch. > > +#include <linux/percpu-rwsem.h> > > +#include <linux/rcupdate.h> > > +#include <linux/sched.h> > > This list is nowhere near sufficient to support this file's > requirements. atomic.h, percpu.h, rwsem.h, wait.h, errno.h and plenty > more. IOW, if it compiles, it was sheer luck. 
OK, thanks, I'll send percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessarily.fix > > +/* > > + * A writer takes ->writer_mutex to exclude other writers and to force the > > + * readers to switch to the slow mode, note the mutex_is_locked() check in > > + * update_fast_ctr(). > > + * > > + * After that the readers can only inc/dec the slow ->slow_read_ctr counter, > > + * ->fast_read_ctr is stable. Once the writer moves its sum into the slow > > + * counter it represents the number of active readers. > > + * > > + * Finally the writer takes ->rw_sem for writing and blocks the new readers, > > + * then waits until the slow counter becomes zero. > > + */ > > Some overview of how fast/slow_read_ctr are supposed to work would be > useful. This comment seems to assume that the reader already knew > that. I hate to say this, but I'll try to update this comment too ;) > > +void percpu_down_write(struct percpu_rw_semaphore *brw) > > +{ > > + /* also blocks update_fast_ctr() which checks mutex_is_locked() */ > > + mutex_lock(&brw->writer_mutex); > > + > > + /* > > + * 1. Ensures mutex_is_locked() is visible to any down_read/up_read > > + * so that update_fast_ctr() can't succeed. > > + * > > + * 2. Ensures we see the result of every previous this_cpu_add() in > > + * update_fast_ctr(). > > + * > > + * 3. Ensures that if any reader has exited its critical section via > > + * fast-path, it executes a full memory barrier before we return. > > + */ > > + synchronize_sched(); > > Here's where I get horridly confused. Your patch completely deRCUifies > this code, yes? Yet here we're using an RCU primitive. And we seem to > be using it not as an RCU primitive but as a handy thing which happens > to have desirable side-effects. But the implementation of > synchronize_sched() differs considerably according to which rcu > flavor-of-the-minute you're using. It is documented that synchronize_sched() should play well with preempt_disable/enable. 
From the comment: Note that preempt_disable(), local_irq_disable(), and so on may be used in place of rcu_read_lock_sched(). But I guess this needs more discussion, I see other emails in this thread... > And part 3 talks about the reader's critical section. The only > critical sections I can see on the reader side are already covered by > mutex_lock() and preempt_disable(). Yes, but we need to ensure that if we take the lock for writing, we should see all memory modifications done under down_read/up_read(). IOW. Suppose that the reader does

	percpu_down_read();
	STORE;
	percpu_up_read(); // no barriers in the fast path

The writer should see the result of that STORE under percpu_down_write(). Part 3 tries to say that at this point we should already see the result, so we should not worry about acquire/release semantics. > If this code isn't as brain damaged as it > initially appears then please, I hope ;) > go easy on us simpletons in the next > version? Well, I'll try to update the comments... but the code is simple, I do not think I can simplify it more. The nontrivial part is the barriers, but this is always nontrivial. Contrary, I am going to try to add some complications later, so that it can have more users. In particular, I think it can replace get_online_cpus/cpu_hotplug_begin, just we need percpu_down_write_but_dont_deadlock_with_recursive_readers(). Oleg. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily 2012-11-09 15:46 ` Oleg Nesterov @ 2012-11-09 17:01 ` Paul E. McKenney 2012-11-09 18:10 ` Oleg Nesterov 0 siblings, 1 reply; 103+ messages in thread From: Paul E. McKenney @ 2012-11-09 17:01 UTC (permalink / raw) To: Oleg Nesterov Cc: Andrew Morton, Linus Torvalds, Mikulas Patocka, Peter Zijlstra, Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On Fri, Nov 09, 2012 at 04:46:56PM +0100, Oleg Nesterov wrote: > On 11/08, Andrew Morton wrote: > > > > On Thu, 8 Nov 2012 14:48:49 +0100 > > Oleg Nesterov <oleg@redhat.com> wrote: > > > > > > > > include/linux/percpu-rwsem.h | 83 +++++------------------------ > > > lib/Makefile | 2 +- > > > lib/percpu-rwsem.c | 123 ++++++++++++++++++++++++++++++++++++++++++ > > > > The patch also uninlines everything. > > > > And it didn't export the resulting symbols to modules, so it isn't an > > equivalent. We can export thing later if needed I guess. > > Yes, currently it is only used by block_dev.c > > > It adds percpu-rwsem.o to lib-y, so the CONFIG_BLOCK=n kernel will > > avoid including the code altogether, methinks? > > I am going to add another user (uprobes), this was my motivation for > this patch. And perhaps it will have more users. > > But I agree, CONFIG_PERCPU_RWSEM makes sense at least now, I'll send > the patch. > > > > +#include <linux/percpu-rwsem.h> > > > +#include <linux/rcupdate.h> > > > +#include <linux/sched.h> > > > > This list is nowhere near sufficient to support this file's > > requirements. atomic.h, percpu.h, rwsem.h, wait.h, errno.h and plenty > > more. IOW, if it compiles, it was sheer luck. 
> > OK, thanks, I'll send > send percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessarily.fix > > > > +/* > > > + * A writer takes ->writer_mutex to exclude other writers and to force the > > > + * readers to switch to the slow mode, note the mutex_is_locked() check in > > > + * update_fast_ctr(). > > > + * > > > + * After that the readers can only inc/dec the slow ->slow_read_ctr counter, > > > + * ->fast_read_ctr is stable. Once the writer moves its sum into the slow > > > + * counter it represents the number of active readers. > > > + * > > > + * Finally the writer takes ->rw_sem for writing and blocks the new readers, > > > + * then waits until the slow counter becomes zero. > > > + */ > > > > Some overview of how fast/slow_read_ctr are supposed to work would be > > useful. This comment seems to assume that the reader already knew > > that. > > I hate to say this, but I'll try to update this comment too ;) > > > > +void percpu_down_write(struct percpu_rw_semaphore *brw) > > > +{ > > > + /* also blocks update_fast_ctr() which checks mutex_is_locked() */ > > > + mutex_lock(&brw->writer_mutex); > > > + > > > + /* > > > + * 1. Ensures mutex_is_locked() is visible to any down_read/up_read > > > + * so that update_fast_ctr() can't succeed. > > > + * > > > + * 2. Ensures we see the result of every previous this_cpu_add() in > > > + * update_fast_ctr(). > > > + * > > > + * 3. Ensures that if any reader has exited its critical section via > > > + * fast-path, it executes a full memory barrier before we return. > > > + */ > > > + synchronize_sched(); > > > > Here's where I get horridly confused. Your patch completely deRCUifies > > this code, yes? Yet here we're using an RCU primitive. And we seem to > > be using it not as an RCU primitive but as a handy thing which happens > > to have desirable side-effects. But the implementation of > > synchronize_sched() differs considerably according to which rcu > > flavor-of-the-minute you're using. 
> > It is documented that synchronize_sched() should play well with > preempt_disable/enable. From the comment: > > Note that preempt_disable(), > local_irq_disable(), and so on may be used in place of > rcu_read_lock_sched(). > > But I guess this needs more discussion, I see other emails in this > thread... > > > And part 3 talks about the reader's critical section. The only > > critical sections I can see on the reader side are already covered by > > mutex_lock() and preempt_disable(). > > Yes, but we need to ensure that if we take the lock for writing, we > should see all memory modifications done under down_read/up_read(). > > IOW. Suppose that the reader does > > percpu_down_read(); > STORE; > percpu_up_read(); // no barriers in the fast path > > The writer should see the result of that STORE under percpu_down_write(). > > Part 3 tries to say that at this point we should already see the result, > so we should not worry about acquire/release semantics. > > > If this code isn't as brain damaged as it > > initially appears then please, > > I hope ;) > > > go easy on us simpletons in the next > > version? > > Well, I'll try to update the comments... but the code is simple, I do > not think I can simplify it more. The nontrivial part is the barriers, > but this is always nontrivial. > > Contrary, I am going to try to add some complications later, so that > it can have more users. In particular, I think it can replace > get_online_cpus/cpu_hotplug_begin, just we need > percpu_down_write_but_dont_deadlock_with_recursive_readers(). I must confess that I am a bit concerned about possible scalability bottlenecks in the current get_online_cpus(), so +1 from me on this one. Thanx, Paul ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily 2012-11-09 17:01 ` Paul E. McKenney @ 2012-11-09 18:10 ` Oleg Nesterov 2012-11-09 18:19 ` Oleg Nesterov 2012-11-10 0:55 ` Paul E. McKenney 0 siblings, 2 replies; 103+ messages in thread From: Oleg Nesterov @ 2012-11-09 18:10 UTC (permalink / raw) To: Paul E. McKenney Cc: Andrew Morton, Linus Torvalds, Mikulas Patocka, Peter Zijlstra, Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On 11/09, Paul E. McKenney wrote: > > On Fri, Nov 09, 2012 at 04:46:56PM +0100, Oleg Nesterov wrote: > > Contrary, I am going to try to add some complications later, so that > > it can have more users. In particular, I think it can replace > > get_online_cpus/cpu_hotplug_begin, just we need > > percpu_down_write_but_dont_deadlock_with_recursive_readers(). > > I must confess that I am a bit concerned about possible scalability > bottlenecks in the current get_online_cpus(), so +1 from me on this one. OK, thanks... And btw percpu_down_write_but_dont_deadlock_with_recursive_readers() is trivial, just it needs down_write(rw_sem) "inside" wait_event(), not before. But I'm afraid I will never manage to write the comments ;)

	static bool xxx(brw)
	{
		down_write(&brw->rw_sem);
		if (!atomic_read(&brw->slow_read_ctr))
			return true;

		up_write(&brw->rw_sem);
		return false;
	}

	static void __percpu_down_write(struct percpu_rw_semaphore *brw, bool recursive_readers)
	{
		mutex_lock(&brw->writer_mutex);

		synchronize_sched();

		atomic_add(clear_fast_ctr(brw), &brw->slow_read_ctr);

		if (recursive_readers) {
			wait_event(brw->write_waitq, xxx(brw));
		} else {
			down_write(&brw->rw_sem);

			wait_event(brw->write_waitq, !atomic_read(&brw->slow_read_ctr));
		}
	}

Of course, cpu.c still needs .active_writer to allow get_online_cpus() under cpu_hotplug_begin(), but this is simple. But first we should do other changes, I think. 
IMHO we should not do synchronize_sched() under mutex_lock() and this will add (a bit) more complications. We will see. Oleg. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily 2012-11-09 18:10 ` Oleg Nesterov @ 2012-11-09 18:19 ` Oleg Nesterov 2012-11-10 0:55 ` Paul E. McKenney 1 sibling, 0 replies; 103+ messages in thread From: Oleg Nesterov @ 2012-11-09 18:19 UTC (permalink / raw) To: Paul E. McKenney Cc: Andrew Morton, Linus Torvalds, Mikulas Patocka, Peter Zijlstra, Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On 11/09, Oleg Nesterov wrote: > > static bool xxx(brw) > { > down_write(&brw->rw_sem); > if (!atomic_read(&brw->slow_read_ctr)) > return true; I meant, try_to_down_write(). Otherwise this can obviously deadlock. Oleg. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily 2012-11-09 18:10 ` Oleg Nesterov 2012-11-09 18:19 ` Oleg Nesterov @ 2012-11-10 0:55 ` Paul E. McKenney 2012-11-11 15:45 ` Oleg Nesterov 1 sibling, 1 reply; 103+ messages in thread From: Paul E. McKenney @ 2012-11-10 0:55 UTC (permalink / raw) To: Oleg Nesterov Cc: Andrew Morton, Linus Torvalds, Mikulas Patocka, Peter Zijlstra, Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On Fri, Nov 09, 2012 at 07:10:48PM +0100, Oleg Nesterov wrote: > On 11/09, Paul E. McKenney wrote: > > > > On Fri, Nov 09, 2012 at 04:46:56PM +0100, Oleg Nesterov wrote: > > > Contrary, I am going to try to add some complications later, so that > > > it can have more users. In particular, I think it can replace > > > get_online_cpus/cpu_hotplug_begin, just we need > > > percpu_down_write_but_dont_deadlock_with_recursive_readers(). > > > > I must confess that I am a bit concerned about possible scalability > > bottlenecks in the current get_online_cpus(), so +1 from me on this one. > > OK, thanks... > > And btw percpu_down_write_but_dont_deadlock_with_recursive_readers() is > trivial, just it needs down_write(rw_sem) "inside" wait_event(), not > before. But I'm afraid I will never manage to write the comments ;) > > static bool xxx(brw) > { > down_write(&brw->rw_sem); down_write_trylock() As you noted in your later email. Presumably you return false if the attempt to acquire it fails. > if (!atomic_read(&brw->slow_read_ctr)) > return true; > > up_write(&brw->rw_sem); > return false; > } > > static void __percpu_down_write(struct percpu_rw_semaphore *brw, bool recursive_readers) > { > mutex_lock(&brw->writer_mutex); > > synchronize_sched(); > > atomic_add(clear_fast_ctr(brw), &brw->slow_read_ctr); > > if (recursive_readers) { > wait_event(brw->write_waitq, xxx(brw)); I see what you mean about acquiring brw->rw_sem inside of wait_event(). Cute trick! 
The "recursive_readers" is a global initialization-time thing, right? > } else { > down_write(&brw->rw_sem); > > wait_event(brw->write_waitq, !atomic_read(&brw->slow_read_ctr)); > } > } Looks like it should work, and would perform and scale nicely even if we end up having to greatly increase the number of calls to get_online_cpus(). > Of course, cpu.c still needs .active_writer to allow get_online_cpus() > under cpu_hotplug_begin(), but this is simple. Yep, same check as now. > But first we should do other changes, I think. IMHO we should not do > synchronize_sched() under mutex_lock() and this will add (a bit) more > complications. We will see. Indeed, that does put considerable delay on the writers. There is always synchronize_sched_expedited(), I suppose. Thanx, Paul ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily 2012-11-10 0:55 ` Paul E. McKenney @ 2012-11-11 15:45 ` Oleg Nesterov 2012-11-12 18:38 ` Paul E. McKenney 0 siblings, 1 reply; 103+ messages in thread From: Oleg Nesterov @ 2012-11-11 15:45 UTC (permalink / raw) To: Paul E. McKenney Cc: Andrew Morton, Linus Torvalds, Mikulas Patocka, Peter Zijlstra, Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On 11/09, Paul E. McKenney wrote: > > On Fri, Nov 09, 2012 at 07:10:48PM +0100, Oleg Nesterov wrote: > > > > static bool xxx(brw) > > { > > down_write(&brw->rw_sem); > > down_write_trylock() > > As you noted in your later email. Presumably you return false if > the attempt to acquire it fails. Yes, yes, thanks. > > But first we should do other changes, I think. IMHO we should not do > > synchronize_sched() under mutex_lock() and this will add (a bit) more > > complications. We will see. > > Indeed, that does put considerable delay on the writers. There is always > synchronize_sched_expedited(), I suppose. I am not sure about synchronize_sched_expedited() (at least unconditionally), but: only the 1st down_write() needs synchronize_, and up_write() does not need to sleep in synchronize_ at all. To simplify, let's ignore the fact that the writers need to serialize with each other. IOW, the pseudo-code below is obviously deadly wrong and racy, just to illustrate the idea.

1. We remove brw->writer_mutex and add "atomic_t writers_ctr". update_fast_ctr() uses atomic_read(brw->writers_ctr) == 0 instead of !mutex_is_locked().

2. down_write() does

	if (atomic_add_return(brw->writers_ctr) == 1) {
		// first writer
		synchronize_sched();
		...
	} else {
		... XXX: wait for percpu_up_write() from the first writer ...
	}

3. up_write() does

	if (atomic_dec_unless_one(brw->writers_ctr)) {
		... wake up XXX writers above ...
		return;
	} else {
		// the last writer
		call_rcu_sched( func => { atomic_dec(brw->writers_ctr) } );
	}

Once again, this all is racy, but hopefully the idea is clear:

- down_write(brw) sleeps in synchronize_sched() only if brw has already switched back to fast-path-mode

- up_write() never sleeps in synchronize_sched(), it uses call_rcu_sched() or wakes up the next writer.

Of course I am not sure this is all worth the trouble, this should be discussed. (and, cough, I'd like to add the multi-writers mode which I'm afraid nobody will like) But I am not going to even try to do this until the current patch is applied, I need it to fix the bug in uprobes and I think the current code is "good enough". These changes can't help to speed up the readers, and the writers are slow/rare anyway. Thanks! Oleg. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily 2012-11-11 15:45 ` Oleg Nesterov @ 2012-11-12 18:38 ` Paul E. McKenney 0 siblings, 0 replies; 103+ messages in thread From: Paul E. McKenney @ 2012-11-12 18:38 UTC (permalink / raw) To: Oleg Nesterov Cc: Andrew Morton, Linus Torvalds, Mikulas Patocka, Peter Zijlstra, Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On Sun, Nov 11, 2012 at 04:45:09PM +0100, Oleg Nesterov wrote: > On 11/09, Paul E. McKenney wrote: > > > > On Fri, Nov 09, 2012 at 07:10:48PM +0100, Oleg Nesterov wrote: > > > > > > static bool xxx(brw) > > > { > > > down_write(&brw->rw_sem); > > > > down_write_trylock() > > > > As you noted in your later email. Presumably you return false if > > the attempt to acquire it fails. > > Yes, yes, thanks. > > > > But first we should do other changes, I think. IMHO we should not do > > > synchronize_sched() under mutex_lock() and this will add (a bit) more > > > complications. We will see. > > > > Indeed, that does put considerable delay on the writers. There is always > > synchronize_sched_expedited(), I suppose. > > I am not sure about synchronize_sched_expedited() (at least unconditionally), > but: only the 1st down_write() needs synchronize_, and up_write() do not > need to sleep in synchronize_ at all. > > To simplify, lets ignore the fact that the writers need to serialize with > each other. IOW, the pseudo-code below is obviously deadly wrong and racy, > just to illustrate the idea. > > 1. We remove brw->writer_mutex and add "atomic_t writers_ctr". > > update_fast_ctr() uses atomic_read(brw->writers_ctr) == 0 instead > of !mutex_is_locked(). > > 2. down_write() does > > if (atomic_add_return(brw->writers_ctr) == 1) { > // first writer > synchronize_sched(); > ... > } else { > ... XXX: wait for percpu_up_write() from the first writer ... > } > > 3. 
up_write() does > > if (atomic_dec_unless_one(brw->writers_ctr)) { > ... wake up XXX writers above ... > return; > } else { > // the last writer > call_rcu_sched( func => { atomic_dec(brw->writers_ctr) } ); > } Agreed, an asynchronous callback can be used to switch the readers back onto the fastpath. Of course, as you say, getting it all working will require some care. ;-) > Once again, this all is racy, but hopefully the idea is clear: > > - down_write(brw) sleeps in synchronize_sched() only if brw > has already switched back to fast-path-mode > > - up_write() never sleeps in synchronize_sched(), it uses > call_rcu_sched() or wakes up the next writer. > > Of course I am not sure this all worth the trouble, this should be discussed. > (and, cough, I'd like to add the multi-writers mode which I'm afraid nobody > will like) But I am not going to even try to do this until the current patch > is applied, I need it to fix the bug in uprobes and I think the current code > is "good enough". These changes can't help to speedup the readers, and the > writers are slow/rare anyway. Probably best to wait for multi-writers until there is a measurable need, to be sure! ;-) Thanx, Paul ^ permalink raw reply [flat|nested] 103+ messages in thread
* [PATCH -mm] percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessari ly.fix 2012-11-08 20:07 ` Andrew Morton ` (2 preceding siblings ...) 2012-11-09 15:46 ` Oleg Nesterov @ 2012-11-11 18:27 ` Oleg Nesterov 2012-11-12 18:31 ` Paul E. McKenney 2012-11-16 23:22 ` Andrew Morton 3 siblings, 2 replies; 103+ messages in thread From: Oleg Nesterov @ 2012-11-11 18:27 UTC (permalink / raw) To: Andrew Morton Cc: Linus Torvalds, Paul E. McKenney, Mikulas Patocka, Peter Zijlstra, Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel More include's and more comments, no changes in code. To remind, once/if I am sure you agree with this patch I'll send 2 additional and simple patches: 1. lockdep annotations 2. CONFIG_PERCPU_RWSEM It seems that we can do much more improvements to a) speedup the writers and b) make percpu_rw_semaphore more useful, but not right now. Signed-off-by: Oleg Nesterov <oleg@redhat.com> --- lib/percpu-rwsem.c | 35 +++++++++++++++++++++++++++++++++-- 1 files changed, 33 insertions(+), 2 deletions(-) diff --git a/lib/percpu-rwsem.c b/lib/percpu-rwsem.c index 0e3bc0f..02bd157 100644 --- a/lib/percpu-rwsem.c +++ b/lib/percpu-rwsem.c @@ -1,6 +1,11 @@ +#include <linux/mutex.h> +#include <linux/rwsem.h> +#include <linux/percpu.h> +#include <linux/wait.h> #include <linux/percpu-rwsem.h> #include <linux/rcupdate.h> #include <linux/sched.h> +#include <linux/errno.h> int percpu_init_rwsem(struct percpu_rw_semaphore *brw) { @@ -21,6 +26,29 @@ void percpu_free_rwsem(struct percpu_rw_semaphore *brw) brw->fast_read_ctr = NULL; /* catch use after free bugs */ } +/* + * This is the fast-path for down_read/up_read, it only needs to ensure + * there is no pending writer (!mutex_is_locked() check) and inc/dec the + * fast per-cpu counter. The writer uses synchronize_sched() to serialize + * with the preempt-disabled section below. 
+ * + * The nontrivial part is that we should guarantee acquire/release semantics + * in case when + * + * R_W: down_write() comes after up_read(), the writer should see all + * changes done by the reader + * or + * W_R: down_read() comes after up_write(), the reader should see all + * changes done by the writer + * + * If this helper fails the callers rely on the normal rw_semaphore and + * atomic_dec_and_test(), so in this case we have the necessary barriers. + * + * But if it succeeds we do not have any barriers, mutex_is_locked() or + * __this_cpu_add() below can be reordered with any LOAD/STORE done by the + * reader inside the critical section. See the comments in down_write and + * up_write below. + */ static bool update_fast_ctr(struct percpu_rw_semaphore *brw, unsigned int val) { bool success = false; @@ -98,6 +126,7 @@ void percpu_down_write(struct percpu_rw_semaphore *brw) * * 3. Ensures that if any reader has exited its critical section via * fast-path, it executes a full memory barrier before we return. + * See R_W case in the comment above update_fast_ctr(). */ synchronize_sched(); @@ -116,8 +145,10 @@ void percpu_up_write(struct percpu_rw_semaphore *brw) /* allow the new readers, but only the slow-path */ up_write(&brw->rw_sem); - /* insert the barrier before the next fast-path in down_read */ + /* + * Insert the barrier before the next fast-path in down_read, + * see W_R case in the comment above update_fast_ctr(). + */ synchronize_sched(); - mutex_unlock(&brw->writer_mutex); } -- 1.5.5.1 ^ permalink raw reply related [flat|nested] 103+ messages in thread
* Re: [PATCH -mm] percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessari ly.fix 2012-11-11 18:27 ` [PATCH -mm] percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessari ly.fix Oleg Nesterov @ 2012-11-12 18:31 ` Paul E. McKenney 2012-11-16 23:22 ` Andrew Morton 1 sibling, 0 replies; 103+ messages in thread From: Paul E. McKenney @ 2012-11-12 18:31 UTC (permalink / raw) To: Oleg Nesterov Cc: Andrew Morton, Linus Torvalds, Mikulas Patocka, Peter Zijlstra, Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On Sun, Nov 11, 2012 at 07:27:44PM +0100, Oleg Nesterov wrote: > More include's and more comments, no changes in code. > > To remind, once/if I am sure you agree with this patch I'll send 2 additional > and simple patches: > > 1. lockdep annotations > > 2. CONFIG_PERCPU_RWSEM > > It seems that we can do much more improvements to a) speedup the writers and > b) make percpu_rw_semaphore more useful, but not right now. > > Signed-off-by: Oleg Nesterov <oleg@redhat.com> Looks good to me! Thanx, Paul > --- > lib/percpu-rwsem.c | 35 +++++++++++++++++++++++++++++++++-- > 1 files changed, 33 insertions(+), 2 deletions(-) > > diff --git a/lib/percpu-rwsem.c b/lib/percpu-rwsem.c > index 0e3bc0f..02bd157 100644 > --- a/lib/percpu-rwsem.c > +++ b/lib/percpu-rwsem.c > @@ -1,6 +1,11 @@ > +#include <linux/mutex.h> > +#include <linux/rwsem.h> > +#include <linux/percpu.h> > +#include <linux/wait.h> > #include <linux/percpu-rwsem.h> > #include <linux/rcupdate.h> > #include <linux/sched.h> > +#include <linux/errno.h> > > int percpu_init_rwsem(struct percpu_rw_semaphore *brw) > { > @@ -21,6 +26,29 @@ void percpu_free_rwsem(struct percpu_rw_semaphore *brw) > brw->fast_read_ctr = NULL; /* catch use after free bugs */ > } > > +/* > + * This is the fast-path for down_read/up_read, it only needs to ensure > + * there is no pending writer (!mutex_is_locked() check) and inc/dec the > + * fast per-cpu counter. 
The writer uses synchronize_sched() to serialize > + * with the preempt-disabled section below. > + * > + * The nontrivial part is that we should guarantee acquire/release semantics > + * in case when > + * > + * R_W: down_write() comes after up_read(), the writer should see all > + * changes done by the reader > + * or > + * W_R: down_read() comes after up_write(), the reader should see all > + * changes done by the writer > + * > + * If this helper fails the callers rely on the normal rw_semaphore and > + * atomic_dec_and_test(), so in this case we have the necessary barriers. > + * > + * But if it succeeds we do not have any barriers, mutex_is_locked() or > + * __this_cpu_add() below can be reordered with any LOAD/STORE done by the > + * reader inside the critical section. See the comments in down_write and > + * up_write below. > + */ > static bool update_fast_ctr(struct percpu_rw_semaphore *brw, unsigned int val) > { > bool success = false; > @@ -98,6 +126,7 @@ void percpu_down_write(struct percpu_rw_semaphore *brw) > * > * 3. Ensures that if any reader has exited its critical section via > * fast-path, it executes a full memory barrier before we return. > + * See R_W case in the comment above update_fast_ctr(). > */ > synchronize_sched(); > > @@ -116,8 +145,10 @@ void percpu_up_write(struct percpu_rw_semaphore *brw) > /* allow the new readers, but only the slow-path */ > up_write(&brw->rw_sem); > > - /* insert the barrier before the next fast-path in down_read */ > + /* > + * Insert the barrier before the next fast-path in down_read, > + * see W_R case in the comment above update_fast_ctr(). > + */ > synchronize_sched(); > - > mutex_unlock(&brw->writer_mutex); > } > -- > 1.5.5.1 > > ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH -mm] percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessari ly.fix 2012-11-11 18:27 ` [PATCH -mm] percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessari ly.fix Oleg Nesterov 2012-11-12 18:31 ` Paul E. McKenney @ 2012-11-16 23:22 ` Andrew Morton 2012-11-18 19:32 ` Oleg Nesterov 1 sibling, 1 reply; 103+ messages in thread From: Andrew Morton @ 2012-11-16 23:22 UTC (permalink / raw) To: Oleg Nesterov Cc: Linus Torvalds, Paul E. McKenney, Mikulas Patocka, Peter Zijlstra, Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On Sun, 11 Nov 2012 19:27:44 +0100 Oleg Nesterov <oleg@redhat.com> wrote: > lib/percpu-rwsem.c | 35 +++++++++++++++++++++++++++++++++-- y'know, this looks like a great pile of useless bloat for single-CPU machines. Maybe add a CONFIG_SMP=n variant which simply calls the regular rwsem operations? ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH -mm] percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessari ly.fix 2012-11-16 23:22 ` Andrew Morton @ 2012-11-18 19:32 ` Oleg Nesterov 0 siblings, 0 replies; 103+ messages in thread From: Oleg Nesterov @ 2012-11-18 19:32 UTC (permalink / raw) To: Andrew Morton Cc: Linus Torvalds, Paul E. McKenney, Mikulas Patocka, Peter Zijlstra, Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On 11/16, Andrew Morton wrote: > > On Sun, 11 Nov 2012 19:27:44 +0100 > Oleg Nesterov <oleg@redhat.com> wrote: > > > lib/percpu-rwsem.c | 35 +++++++++++++++++++++++++++++++++-- > > y'know, this looks like a great pile of useless bloat for single-CPU > machines. Maybe add a CONFIG_SMP=n variant which simply calls the > regular rwsem operations? Yes, I thought about this and probably I'll send the patch... But note that the regular down_read() won't be actually faster if there is no writer, and it doesn't allow to add other features. I'll try to think, perhaps it would be enough to add a couple of "ifdef CONFIG_SMP" into this code, say, to avoid __percpu. Oleg. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily 2012-10-31 19:41 ` [PATCH 1/1] " Oleg Nesterov 2012-11-01 15:10 ` Linus Torvalds @ 2012-11-01 15:43 ` Paul E. McKenney 2012-11-01 18:33 ` Oleg Nesterov 1 sibling, 1 reply; 103+ messages in thread From: Paul E. McKenney @ 2012-11-01 15:43 UTC (permalink / raw) To: Oleg Nesterov Cc: Mikulas Patocka, Peter Zijlstra, Linus Torvalds, Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On Wed, Oct 31, 2012 at 08:41:58PM +0100, Oleg Nesterov wrote: > Currently the writer does msleep() plus synchronize_sched() 3 times > to acquire/release the semaphore, and during this time the readers > are blocked completely. Even if the "write" section was not actually > started or if it was already finished. > > With this patch down_read/up_read does synchronize_sched() twice and > down_read/up_read are still possible during this time, just they use > the slow path. > > percpu_down_write() first forces the readers to use rw_semaphore and > increment the "slow" counter to take the lock for reading, then it > takes that rw_semaphore for writing and blocks the readers. > > Also. With this patch the code relies on the documented behaviour of > synchronize_sched(), it doesn't try to pair synchronize_sched() with > barrier. OK, so it looks to me that this code relies on synchronize_sched() forcing a memory barrier on each CPU executing in the kernel. I might well be confused, so here is the sequence of events that leads me to believe this: 1. A task running on CPU 0 currently write-holds the lock. 2. CPU 1 is running in the kernel, executing a longer-than-average loop of normal instructions (no atomic instructions or memory barriers). 3. CPU 0 invokes percpu_up_write(), calling up_write(), synchronize_sched(), and finally mutex_unlock(). 4. CPU 1 executes percpu_down_read(), which calls update_fast_ctr(), which finds that ->writer_mutex is not held. 
CPU 1 therefore increments ->fast_read_ctr and returns success. Of course, as Mikulas pointed out, the actual implementation will have forced CPU 1 to execute a memory barrier in the course of the synchronize_sched() implementation. However, if synchronize_sched() had been modified to act as synchronize_srcu() currently does, there would be no memory barrier, and thus no guarantee that CPU 1's subsequent read-side critical section would see the effect of CPU 0's previous write-side critical section. Fortunately, this is easy to fix, with zero added overhead on the read-side fastpath, as shown by the notes interspersed below. Thoughts? Thanx, Paul > Signed-off-by: Oleg Nesterov <oleg@redhat.com> > --- > include/linux/percpu-rwsem.h | 83 +++++--------------------------- > lib/Makefile | 2 +- > lib/percpu-rwsem.c | 106 ++++++++++++++++++++++++++++++++++++++++++ > 3 files changed, 120 insertions(+), 71 deletions(-) > create mode 100644 lib/percpu-rwsem.c > > diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h > index 250a4ac..7f738ca 100644 > --- a/include/linux/percpu-rwsem.h > +++ b/include/linux/percpu-rwsem.h > @@ -2,82 +2,25 @@ > #define _LINUX_PERCPU_RWSEM_H > > #include <linux/mutex.h> > +#include <linux/rwsem.h> > #include <linux/percpu.h> > -#include <linux/rcupdate.h> > -#include <linux/delay.h> > +#include <linux/wait.h> > > struct percpu_rw_semaphore { > - unsigned __percpu *counters; > - bool locked; > - struct mutex mtx; > + int __percpu *fast_read_ctr; > + struct mutex writer_mutex; > + struct rw_semaphore rw_sem; > + atomic_t slow_read_ctr; > + wait_queue_head_t write_waitq; int wstate; > }; > > -#define light_mb() barrier() > -#define heavy_mb() synchronize_sched() > +extern void percpu_down_read(struct percpu_rw_semaphore *); > +extern void percpu_up_read(struct percpu_rw_semaphore *); > > -static inline void percpu_down_read(struct percpu_rw_semaphore *p) > -{ > - rcu_read_lock_sched(); > - if (unlikely(p->locked)) { > -
rcu_read_unlock_sched(); > - mutex_lock(&p->mtx); > - this_cpu_inc(*p->counters); > - mutex_unlock(&p->mtx); > - return; > - } > - this_cpu_inc(*p->counters); > - rcu_read_unlock_sched(); > - light_mb(); /* A, between read of p->locked and read of data, paired with D */ > -} > +extern void percpu_down_write(struct percpu_rw_semaphore *); > +extern void percpu_up_write(struct percpu_rw_semaphore *); > > -static inline void percpu_up_read(struct percpu_rw_semaphore *p) > -{ > - light_mb(); /* B, between read of the data and write to p->counter, paired with C */ > - this_cpu_dec(*p->counters); > -} > - > -static inline unsigned __percpu_count(unsigned __percpu *counters) > -{ > - unsigned total = 0; > - int cpu; > - > - for_each_possible_cpu(cpu) > - total += ACCESS_ONCE(*per_cpu_ptr(counters, cpu)); > - > - return total; > -} > - > -static inline void percpu_down_write(struct percpu_rw_semaphore *p) > -{ > - mutex_lock(&p->mtx); > - p->locked = true; > - synchronize_sched(); /* make sure that all readers exit the rcu_read_lock_sched region */ > - while (__percpu_count(p->counters)) > - msleep(1); > - heavy_mb(); /* C, between read of p->counter and write to data, paired with B */ > -} > - > -static inline void percpu_up_write(struct percpu_rw_semaphore *p) > -{ > - heavy_mb(); /* D, between write to data and write to p->locked, paired with A */ > - p->locked = false; > - mutex_unlock(&p->mtx); > -} > - > -static inline int percpu_init_rwsem(struct percpu_rw_semaphore *p) > -{ > - p->counters = alloc_percpu(unsigned); > - if (unlikely(!p->counters)) > - return -ENOMEM; > - p->locked = false; > - mutex_init(&p->mtx); > - return 0; > -} > - > -static inline void percpu_free_rwsem(struct percpu_rw_semaphore *p) > -{ > - free_percpu(p->counters); > - p->counters = NULL; /* catch use after free bugs */ > -} > +extern int percpu_init_rwsem(struct percpu_rw_semaphore *); > +extern void percpu_free_rwsem(struct percpu_rw_semaphore *); > > #endif > diff --git a/lib/Makefile 
b/lib/Makefile > index 821a162..4dad4a7 100644 > --- a/lib/Makefile > +++ b/lib/Makefile > @@ -12,7 +12,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \ > idr.o int_sqrt.o extable.o \ > sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \ > proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \ > - is_single_threaded.o plist.o decompress.o > + is_single_threaded.o plist.o decompress.o percpu-rwsem.o > > lib-$(CONFIG_MMU) += ioremap.o > lib-$(CONFIG_SMP) += cpumask.o > diff --git a/lib/percpu-rwsem.c b/lib/percpu-rwsem.c > new file mode 100644 > index 0000000..40a415d > --- /dev/null > +++ b/lib/percpu-rwsem.c > @@ -0,0 +1,106 @@ > +#include <linux/percpu-rwsem.h> > +#include <linux/rcupdate.h> > +#include <linux/sched.h> #define WSTATE_NEED_LOCK 1 #define WSTATE_NEED_MB 2 > +int percpu_init_rwsem(struct percpu_rw_semaphore *brw) > +{ > + brw->fast_read_ctr = alloc_percpu(int); > + if (unlikely(!brw->fast_read_ctr)) > + return -ENOMEM; > + > + mutex_init(&brw->writer_mutex); > + init_rwsem(&brw->rw_sem); > + atomic_set(&brw->slow_read_ctr, 0); > + init_waitqueue_head(&brw->write_waitq); > + return 0; > +} > + > +void percpu_free_rwsem(struct percpu_rw_semaphore *brw) > +{ > + free_percpu(brw->fast_read_ctr); > + brw->fast_read_ctr = NULL; /* catch use after free bugs */ > +} > + > +static bool update_fast_ctr(struct percpu_rw_semaphore *brw, int val) > +{ > + bool success = false; int state; > + > + preempt_disable(); > + if (likely(!mutex_is_locked(&brw->writer_mutex))) { state = ACCESS_ONCE(brw->wstate); if (likely(!state)) { > + __this_cpu_add(*brw->fast_read_ctr, val); > + success = true; } else if (state & WSTATE_NEED_MB) { __this_cpu_add(*brw->fast_read_ctr, val); smp_mb(); /* Order increment against critical section.
*/ success = true; } > + preempt_enable(); > + > + return success; > +} > + > +void percpu_down_read(struct percpu_rw_semaphore *brw) > +{ > + if (likely(update_fast_ctr(brw, +1))) > + return; > + > + down_read(&brw->rw_sem); > + atomic_inc(&brw->slow_read_ctr); > + up_read(&brw->rw_sem); > +} > + > +void percpu_up_read(struct percpu_rw_semaphore *brw) > +{ > + if (likely(update_fast_ctr(brw, -1))) > + return; > + > + /* false-positive is possible but harmless */ > + if (atomic_dec_and_test(&brw->slow_read_ctr)) > + wake_up_all(&brw->write_waitq); > +} > + > +static int clear_fast_read_ctr(struct percpu_rw_semaphore *brw) > +{ > + int cpu, sum = 0; > + > + for_each_possible_cpu(cpu) { > + sum += per_cpu(*brw->fast_read_ctr, cpu); > + per_cpu(*brw->fast_read_ctr, cpu) = 0; > + } > + > + return sum; > +} > + > +void percpu_down_write(struct percpu_rw_semaphore *brw) > +{ > + /* also blocks update_fast_ctr() which checks mutex_is_locked() */ > + mutex_lock(&brw->writer_mutex); ACCESS_ONCE(brw->wstate) = WSTATE_NEED_LOCK; > + /* > + * 1. Ensures mutex_is_locked() is visible to any down_read/up_read > + * so that update_fast_ctr() can't succeed. > + * > + * 2. Ensures we see the result of every previous this_cpu_add() in > + * update_fast_ctr(). > + * > + * 3. Ensures that if any reader has exited its critical section via > + * fast-path, it executes a full memory barrier before we return. 
> + */ > + synchronize_sched(); > + > + /* nobody can use fast_read_ctr, move its sum into slow_read_ctr */ > + atomic_add(clear_fast_read_ctr(brw), &brw->slow_read_ctr); > + > + /* block the new readers completely */ > + down_write(&brw->rw_sem); > + > + /* wait for all readers to complete their percpu_up_read() */ > + wait_event(brw->write_waitq, !atomic_read(&brw->slow_read_ctr)); > +} > + > +void percpu_up_write(struct percpu_rw_semaphore *brw) > +{ > + /* allow the new readers, but only the slow-path */ > + up_write(&brw->rw_sem); ACCESS_ONCE(brw->wstate) = WSTATE_NEED_MB; > + > + /* insert the barrier before the next fast-path in down_read */ > + synchronize_sched(); ACCESS_ONCE(brw->wstate) = 0; > + mutex_unlock(&brw->writer_mutex); > +} ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily 2012-11-01 15:43 ` [PATCH 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily Paul E. McKenney @ 2012-11-01 18:33 ` Oleg Nesterov 2012-11-02 16:18 ` Oleg Nesterov 0 siblings, 1 reply; 103+ messages in thread From: Oleg Nesterov @ 2012-11-01 18:33 UTC (permalink / raw) To: Paul E. McKenney Cc: Mikulas Patocka, Peter Zijlstra, Linus Torvalds, Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel Paul, thanks. Sorry, I can't reply today, just one note... On 11/01, Paul E. McKenney wrote: > > OK, so it looks to me that this code relies on synchronize_sched() > forcing a memory barrier on each CPU executing in the kernel. No, the patch tries to avoid this assumption, but probably I missed something. > 1. A task running on CPU 0 currently write-holds the lock. > > 2. CPU 1 is running in the kernel, executing a longer-than-average > loop of normal instructions (no atomic instructions or memory > barriers). > > 3. CPU 0 invokes percpu_up_write(), calling up_write(), > synchronize_sched(), and finally mutex_unlock(). And my expectation was, this should be enough because ... > 4. CPU 1 executes percpu_down_read(), which calls update_fast_ctr(), since update_fast_ctr does preempt_disable/enable it should see all modifications done by CPU 0. IOW. Suppose that the writer (CPU 0) does percpu_down_write(); STORE; percpu_up_write(); This means STORE; synchronize_sched(); mutex_unlock(); Now. Do you mean that the next preempt_disable/enable can see the result of mutex_unlock() but not STORE? Oleg. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily 2012-11-01 18:33 ` Oleg Nesterov @ 2012-11-02 16:18 ` Oleg Nesterov 0 siblings, 0 replies; 103+ messages in thread From: Oleg Nesterov @ 2012-11-02 16:18 UTC (permalink / raw) To: Paul E. McKenney Cc: Mikulas Patocka, Peter Zijlstra, Linus Torvalds, Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel On 11/01, Oleg Nesterov wrote: > > On 11/01, Paul E. McKenney wrote: > > > > OK, so it looks to me that this code relies on synchronize_sched() > > forcing a memory barrier on each CPU executing in the kernel. > > No, the patch tries to avoid this assumption, but probably I missed > something. > > > 1. A task running on CPU 0 currently write-holds the lock. > > > > 2. CPU 1 is running in the kernel, executing a longer-than-average > > loop of normal instructions (no atomic instructions or memory > > barriers). > > > > 3. CPU 0 invokes percpu_up_write(), calling up_write(), > > synchronize_sched(), and finally mutex_unlock(). > > And my expectation was, this should be enough because ... > > > 4. CPU 1 executes percpu_down_read(), which calls update_fast_ctr(), > > since update_fast_ctr does preempt_disable/enable it should see all > modifications done by CPU 0. > > IOW. Suppose that the writer (CPU 0) does > > percpu_done_write(); > STORE; > percpu_up_write(); > > This means > > STORE; > synchronize_sched(); > mutex_unlock(); > > Now. Do you mean that the next preempt_disable/enable can see the > result of mutex_unlock() but not STORE? So far I think this is not possible, so the code doesn't need the additional wstate/barriers. 
> > +static bool update_fast_ctr(struct percpu_rw_semaphore *brw, int val) > > +{ > > + bool success = false; > > int state; > > > + > > + preempt_disable(); > > + if (likely(!mutex_is_locked(&brw->writer_mutex))) { > > state = ACCESS_ONCE(brw->wstate); > if (likely(!state)) { > > > + __this_cpu_add(*brw->fast_read_ctr, val); > > + success = true; > > } else if (state & WSTATE_NEED_MB) { > __this_cpu_add(*brw->fast_read_ctr, val); > smp_mb(); /* Order increment against critical section. */ > success = true; > } ... > > +void percpu_up_write(struct percpu_rw_semaphore *brw) > > +{ > > + /* allow the new readers, but only the slow-path */ > > + up_write(&brw->rw_sem); > > ACCESS_ONCE(brw->wstate) = WSTATE_NEED_MB; > > > + > > + /* insert the barrier before the next fast-path in down_read */ > > + synchronize_sched(); But update_fast_ctr() should see mutex_is_locked(), obviously down_write() must ensure this. So update_fast_ctr() can execute the WSTATE_NEED_MB code only if it races with > ACCESS_ONCE(brw->wstate) = 0; > > > + mutex_unlock(&brw->writer_mutex); these 2 stores and sees them in reverse order. I guess that mutex_is_locked() in update_fast_ctr() looks a bit confusing. It means no-fast-path for the reader, we could use ->state instead. And even ->writer_mutex should go away if we want to optimize the write-contended case, but I think this needs another patch on top of this initial implementation. Oleg. ^ permalink raw reply [flat|nested] 103+ messages in thread
* [PATCH 2/2] uprobes: Use brw_mutex to fix register/unregister vs dup_mmap() race 2012-10-15 19:09 [RFC PATCH 0/2] uprobes: register/unregister can race with fork Oleg Nesterov 2012-10-15 19:10 ` [PATCH 1/2] brw_mutex: big read-write mutex Oleg Nesterov @ 2012-10-15 19:10 ` Oleg Nesterov 2012-10-18 7:03 ` Srikar Dronamraju 1 sibling, 1 reply; 103+ messages in thread From: Oleg Nesterov @ 2012-10-15 19:10 UTC (permalink / raw) To: Ingo Molnar, Linus Torvalds, Paul E. McKenney, Peter Zijlstra, Srikar Dronamraju Cc: Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel This was always racy, but 268720903f87e0b84b161626c4447b81671b5d18 "uprobes: Rework register_for_each_vma() to make it O(n)" should be blamed anyway, it made everything worse and I didn't notice. register/unregister call build_map_info() and then do install/remove breakpoint for every mm which mmaps inode/offset. This can obviously race with fork()->dup_mmap() in between and we can miss the child. uprobe_register() could be easily fixed but unregister is much worse, the new mm inherits "int3" from parent and there is no way to detect this if uprobe goes away. So this patch simply adds brw_start/end_read() around dup_mmap(), and brw_start/end_write() into register_for_each_vma(). This adds 2 new hooks into dup_mmap() but we can kill uprobe_dup_mmap() and fold it into uprobe_end_dup_mmap(). 
Reported-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> --- include/linux/uprobes.h | 8 ++++++++ kernel/events/uprobes.c | 26 +++++++++++++++++++++++--- kernel/fork.c | 2 ++ 3 files changed, 33 insertions(+), 3 deletions(-) diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h index 2459457..80913e3 100644 --- a/include/linux/uprobes.h +++ b/include/linux/uprobes.h @@ -97,6 +97,8 @@ extern int uprobe_register(struct inode *inode, loff_t offset, struct uprobe_con extern void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc); extern int uprobe_mmap(struct vm_area_struct *vma); extern void uprobe_munmap(struct vm_area_struct *vma, unsigned long start, unsigned long end); +extern void uprobe_start_dup_mmap(void); +extern void uprobe_end_dup_mmap(void); extern void uprobe_dup_mmap(struct mm_struct *oldmm, struct mm_struct *newmm); extern void uprobe_free_utask(struct task_struct *t); extern void uprobe_copy_process(struct task_struct *t); @@ -129,6 +131,12 @@ static inline void uprobe_munmap(struct vm_area_struct *vma, unsigned long start, unsigned long end) { } +static inline void uprobe_start_dup_mmap(void) +{ +} +static inline void uprobe_end_dup_mmap(void) +{ +} static inline void uprobe_dup_mmap(struct mm_struct *oldmm, struct mm_struct *newmm) { diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 6a5b5a4..7aeb096 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -33,6 +33,7 @@ #include <linux/ptrace.h> /* user_enable_single_step */ #include <linux/kdebug.h> /* notifier mechanism */ #include "../../mm/internal.h" /* munlock_vma_page */ +#include <linux/brw_mutex.h> #include <linux/uprobes.h> @@ -71,6 +72,8 @@ static struct mutex uprobes_mutex[UPROBES_HASH_SZ]; static struct mutex uprobes_mmap_mutex[UPROBES_HASH_SZ]; #define uprobes_mmap_hash(v) (&uprobes_mmap_mutex[((unsigned long)(v)) % UPROBES_HASH_SZ]) +static struct brw_mutex 
dup_mmap_mutex; + /* * uprobe_events allows us to skip the uprobe_mmap if there are no uprobe * events active at this time. Probably a fine grained per inode count is @@ -766,10 +769,13 @@ static int register_for_each_vma(struct uprobe *uprobe, bool is_register) struct map_info *info; int err = 0; + brw_start_write(&dup_mmap_mutex); info = build_map_info(uprobe->inode->i_mapping, uprobe->offset, is_register); - if (IS_ERR(info)) - return PTR_ERR(info); + if (IS_ERR(info)) { + err = PTR_ERR(info); + goto out; + } while (info) { struct mm_struct *mm = info->mm; @@ -799,7 +805,8 @@ static int register_for_each_vma(struct uprobe *uprobe, bool is_register) mmput(mm); info = free_map_info(info); } - + out: + brw_end_write(&dup_mmap_mutex); return err; } @@ -1131,6 +1138,16 @@ void uprobe_clear_state(struct mm_struct *mm) kfree(area); } +void uprobe_start_dup_mmap(void) +{ + brw_start_read(&dup_mmap_mutex); +} + +void uprobe_end_dup_mmap(void) +{ + brw_end_read(&dup_mmap_mutex); +} + void uprobe_dup_mmap(struct mm_struct *oldmm, struct mm_struct *newmm) { newmm->uprobes_state.xol_area = NULL; @@ -1605,6 +1622,9 @@ static int __init init_uprobes(void) mutex_init(&uprobes_mmap_mutex[i]); } + if (brw_mutex_init(&dup_mmap_mutex)) + return -ENOMEM; + return register_die_notifier(&uprobe_exception_nb); } module_init(init_uprobes); diff --git a/kernel/fork.c b/kernel/fork.c index 8b20ab7..c497e57 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -352,6 +352,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm) unsigned long charge; struct mempolicy *pol; + uprobe_start_dup_mmap(); down_write(&oldmm->mmap_sem); flush_cache_dup_mm(oldmm); uprobe_dup_mmap(oldmm, mm); @@ -469,6 +470,7 @@ out: up_write(&mm->mmap_sem); flush_tlb_mm(oldmm); up_write(&oldmm->mmap_sem); + uprobe_end_dup_mmap(); return retval; fail_nomem_anon_vma_fork: mpol_put(pol); -- 1.5.5.1 ^ permalink raw reply related [flat|nested] 103+ messages in thread
* Re: [PATCH 2/2] uprobes: Use brw_mutex to fix register/unregister vs dup_mmap() race
  2012-10-15 19:10 ` [PATCH 2/2] uprobes: Use brw_mutex to fix register/unregister vs dup_mmap() race Oleg Nesterov
@ 2012-10-18  7:03   ` Srikar Dronamraju
  0 siblings, 0 replies; 103+ messages in thread
From: Srikar Dronamraju @ 2012-10-18  7:03 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Ingo Molnar, Linus Torvalds, Paul E. McKenney, Peter Zijlstra,
	Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel

> This was always racy, but commit 268720903f87e0b84b161626c4447b81671b5d18
> ("uprobes: Rework register_for_each_vma() to make it O(n)") should be
> blamed anyway: it made everything worse and I didn't notice.
>
> register/unregister call build_map_info() and then install/remove the
> breakpoint in every mm which mmaps inode/offset. This can obviously
> race with fork()->dup_mmap() in between, and we can miss the child.
>
> uprobe_register() could be easily fixed, but unregister is much worse:
> the new mm inherits "int3" from the parent and there is no way to detect
> this if the uprobe goes away.
>
> So this patch simply adds brw_start/end_read() around dup_mmap(), and
> brw_start/end_write() into register_for_each_vma().
>
> This adds 2 new hooks into dup_mmap(), but we can kill uprobe_dup_mmap()
> and fold it into uprobe_end_dup_mmap().
>
> Reported-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>

Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>

^ permalink raw reply	[flat|nested] 103+ messages in thread