* [RFC PATCH 0/2] uprobes: register/unregister can race with fork
@ 2012-10-15 19:09 Oleg Nesterov
  2012-10-15 19:10 ` [PATCH 1/2] brw_mutex: big read-write mutex Oleg Nesterov
  2012-10-15 19:10 ` [PATCH 2/2] uprobes: Use brw_mutex to fix register/unregister vs dup_mmap() race Oleg Nesterov
  0 siblings, 2 replies; 103+ messages in thread
From: Oleg Nesterov @ 2012-10-15 19:09 UTC (permalink / raw)
  To: Ingo Molnar, Linus Torvalds, Paul E. McKenney, Peter Zijlstra,
	Srikar Dronamraju
  Cc: Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel

Hello.

Well. The very fact that this series adds a new locking primitive
probably means we should try to find another fix. And yes, it is
possible to fix this differently, afaics. But that would be more
complicated, I think.

So please review. As for 1/2:

	- I really hope paulmck/peterz will tell me if it is
	  correct or not

	- The naming sucks, and I agree with any suggestions

	- Probably this code should be compiled only if
	  CONFIG_UPROBES is enabled

Oleg.



* [PATCH 1/2] brw_mutex: big read-write mutex
  2012-10-15 19:09 [RFC PATCH 0/2] uprobes: register/unregister can race with fork Oleg Nesterov
@ 2012-10-15 19:10 ` Oleg Nesterov
  2012-10-15 23:28   ` Paul E. McKenney
  2012-10-16 19:56   ` Linus Torvalds
  2012-10-15 19:10 ` [PATCH 2/2] uprobes: Use brw_mutex to fix register/unregister vs dup_mmap() race Oleg Nesterov
  1 sibling, 2 replies; 103+ messages in thread
From: Oleg Nesterov @ 2012-10-15 19:10 UTC (permalink / raw)
  To: Ingo Molnar, Linus Torvalds, Paul E. McKenney, Peter Zijlstra,
	Srikar Dronamraju
  Cc: Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel

This patch adds the new sleeping lock, brw_mutex. Unlike rw_semaphore
it allows multiple writers too, just "read" and "write" are mutually
exclusive.

brw_start_read() and brw_end_read() are extremely cheap, they only do
this_cpu_inc(read_ctr) + atomic_read() if there are no waiting writers.

OTOH it is write-biased, any brw_start_write() blocks the new readers.
But "write" is slow, it does synchronize_sched() to serialize with
preempt_disable() in brw_start_read(), and wait_event(write_waitq) can
have a lot of extra wakeups before percpu-counter-sum becomes zero.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 include/linux/brw_mutex.h |   22 +++++++++++++++
 lib/Makefile              |    2 +-
 lib/brw_mutex.c           |   67 +++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 90 insertions(+), 1 deletions(-)
 create mode 100644 include/linux/brw_mutex.h
 create mode 100644 lib/brw_mutex.c

diff --git a/include/linux/brw_mutex.h b/include/linux/brw_mutex.h
new file mode 100644
index 0000000..16b8d5f
--- /dev/null
+++ b/include/linux/brw_mutex.h
@@ -0,0 +1,22 @@
+#ifndef _LINUX_BRW_MUTEX_H
+#define _LINUX_BRW_MUTEX_H
+
+#include <linux/percpu.h>
+#include <linux/wait.h>
+
+struct brw_mutex {
+	long __percpu		*read_ctr;
+	atomic_t		write_ctr;
+	wait_queue_head_t	read_waitq;
+	wait_queue_head_t	write_waitq;
+};
+
+extern int brw_mutex_init(struct brw_mutex *brw);
+
+extern void brw_start_read(struct brw_mutex *brw);
+extern void brw_end_read(struct brw_mutex *brw);
+
+extern void brw_start_write(struct brw_mutex *brw);
+extern void brw_end_write(struct brw_mutex *brw);
+
+#endif
diff --git a/lib/Makefile b/lib/Makefile
index 3128e35..18f2876 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -12,7 +12,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
 	 idr.o int_sqrt.o extable.o \
 	 sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \
 	 proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \
-	 is_single_threaded.o plist.o decompress.o
+	 is_single_threaded.o plist.o decompress.o brw_mutex.o
 
 lib-$(CONFIG_MMU) += ioremap.o
 lib-$(CONFIG_SMP) += cpumask.o
diff --git a/lib/brw_mutex.c b/lib/brw_mutex.c
new file mode 100644
index 0000000..41984a6
--- /dev/null
+++ b/lib/brw_mutex.c
@@ -0,0 +1,67 @@
+#include <linux/brw_mutex.h>
+#include <linux/rcupdate.h>
+#include <linux/sched.h>
+
+int brw_mutex_init(struct brw_mutex *brw)
+{
+	atomic_set(&brw->write_ctr, 0);
+	init_waitqueue_head(&brw->read_waitq);
+	init_waitqueue_head(&brw->write_waitq);
+	brw->read_ctr = alloc_percpu(long);
+	return brw->read_ctr ? 0 : -ENOMEM;
+}
+
+void brw_start_read(struct brw_mutex *brw)
+{
+	for (;;) {
+		bool done = false;
+
+		preempt_disable();
+		if (likely(!atomic_read(&brw->write_ctr))) {
+			__this_cpu_inc(*brw->read_ctr);
+			done = true;
+		}
+		preempt_enable();
+
+		if (likely(done))
+			break;
+
+		__wait_event(brw->read_waitq, !atomic_read(&brw->write_ctr));
+	}
+}
+
+void brw_end_read(struct brw_mutex *brw)
+{
+	this_cpu_dec(*brw->read_ctr);
+
+	if (unlikely(atomic_read(&brw->write_ctr)))
+		wake_up_all(&brw->write_waitq);
+}
+
+static inline long brw_read_ctr(struct brw_mutex *brw)
+{
+	long sum = 0;
+	int cpu;
+
+	for_each_possible_cpu(cpu)
+		sum += per_cpu(*brw->read_ctr, cpu);
+
+	return sum;
+}
+
+void brw_start_write(struct brw_mutex *brw)
+{
+	atomic_inc(&brw->write_ctr);
+	synchronize_sched();
+	/*
+	 * Thereafter brw_*_read() must see write_ctr != 0,
+	 * and we should see the result of __this_cpu_inc().
+	 */
+	wait_event(brw->write_waitq, brw_read_ctr(brw) == 0);
+}
+
+void brw_end_write(struct brw_mutex *brw)
+{
+	if (atomic_dec_and_test(&brw->write_ctr))
+		wake_up_all(&brw->read_waitq);
+}
-- 
1.5.5.1
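
For reference, a minimal usage sketch of how these primitives are meant to
pair up. This is illustrative only -- the names below are hypothetical and
the init/error handling is an assumption, not part of the patch:

	#include <linux/brw_mutex.h>

	/* Hypothetical example lock, not from the patch. */
	static struct brw_mutex demo_brw;

	static int __init demo_init(void)
	{
		/* brw_mutex_init() allocates the per-cpu counter, can fail */
		return brw_mutex_init(&demo_brw);
	}

	void demo_reader(void)
	{
		brw_start_read(&demo_brw);  /* fast: per-cpu inc + atomic_read() */
		/* read-side section, may sleep, excluded from writers only */
		brw_end_read(&demo_brw);
	}

	void demo_writer(void)
	{
		brw_start_write(&demo_brw); /* slow: synchronize_sched() + wait */
		/* excludes all readers, but not other writers */
		brw_end_write(&demo_brw);
	}

Note that, unlike rw_semaphore, two demo_writer() calls do not exclude each
other; only "read" and "write" exclude each other.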



* [PATCH 2/2] uprobes: Use brw_mutex to fix register/unregister vs dup_mmap() race
  2012-10-15 19:09 [RFC PATCH 0/2] uprobes: register/unregister can race with fork Oleg Nesterov
  2012-10-15 19:10 ` [PATCH 1/2] brw_mutex: big read-write mutex Oleg Nesterov
@ 2012-10-15 19:10 ` Oleg Nesterov
  2012-10-18  7:03   ` Srikar Dronamraju
  1 sibling, 1 reply; 103+ messages in thread
From: Oleg Nesterov @ 2012-10-15 19:10 UTC (permalink / raw)
  To: Ingo Molnar, Linus Torvalds, Paul E. McKenney, Peter Zijlstra,
	Srikar Dronamraju
  Cc: Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel

This was always racy, but 268720903f87e0b84b161626c4447b81671b5d18
"uprobes: Rework register_for_each_vma() to make it O(n)" should be
blamed anyway, it made everything worse and I didn't notice.

register/unregister call build_map_info() and then do install/remove
breakpoint for every mm which mmaps inode/offset. This can obviously
race with fork()->dup_mmap() in between and we can miss the child.

uprobe_register() could be easily fixed but unregister is much worse,
the new mm inherits "int3" from parent and there is no way to detect
this if uprobe goes away.

So this patch simply adds brw_start/end_read() around dup_mmap(), and
brw_start/end_write() into register_for_each_vma().

This adds 2 new hooks into dup_mmap() but we can kill uprobe_dup_mmap()
and fold it into uprobe_end_dup_mmap().

Reported-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 include/linux/uprobes.h |    8 ++++++++
 kernel/events/uprobes.c |   26 +++++++++++++++++++++++---
 kernel/fork.c           |    2 ++
 3 files changed, 33 insertions(+), 3 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 2459457..80913e3 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -97,6 +97,8 @@ extern int uprobe_register(struct inode *inode, loff_t offset, struct uprobe_con
 extern void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc);
 extern int uprobe_mmap(struct vm_area_struct *vma);
 extern void uprobe_munmap(struct vm_area_struct *vma, unsigned long start, unsigned long end);
+extern void uprobe_start_dup_mmap(void);
+extern void uprobe_end_dup_mmap(void);
 extern void uprobe_dup_mmap(struct mm_struct *oldmm, struct mm_struct *newmm);
 extern void uprobe_free_utask(struct task_struct *t);
 extern void uprobe_copy_process(struct task_struct *t);
@@ -129,6 +131,12 @@ static inline void
 uprobe_munmap(struct vm_area_struct *vma, unsigned long start, unsigned long end)
 {
 }
+static inline void uprobe_start_dup_mmap(void)
+{
+}
+static inline void uprobe_end_dup_mmap(void)
+{
+}
 static inline void
 uprobe_dup_mmap(struct mm_struct *oldmm, struct mm_struct *newmm)
 {
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 6a5b5a4..7aeb096 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -33,6 +33,7 @@
 #include <linux/ptrace.h>	/* user_enable_single_step */
 #include <linux/kdebug.h>	/* notifier mechanism */
 #include "../../mm/internal.h"	/* munlock_vma_page */
+#include <linux/brw_mutex.h>
 
 #include <linux/uprobes.h>
 
@@ -71,6 +72,8 @@ static struct mutex uprobes_mutex[UPROBES_HASH_SZ];
 static struct mutex uprobes_mmap_mutex[UPROBES_HASH_SZ];
 #define uprobes_mmap_hash(v)	(&uprobes_mmap_mutex[((unsigned long)(v)) % UPROBES_HASH_SZ])
 
+static struct brw_mutex dup_mmap_mutex;
+
 /*
  * uprobe_events allows us to skip the uprobe_mmap if there are no uprobe
  * events active at this time.  Probably a fine grained per inode count is
@@ -766,10 +769,13 @@ static int register_for_each_vma(struct uprobe *uprobe, bool is_register)
 	struct map_info *info;
 	int err = 0;
 
+	brw_start_write(&dup_mmap_mutex);
 	info = build_map_info(uprobe->inode->i_mapping,
 					uprobe->offset, is_register);
-	if (IS_ERR(info))
-		return PTR_ERR(info);
+	if (IS_ERR(info)) {
+		err = PTR_ERR(info);
+		goto out;
+	}
 
 	while (info) {
 		struct mm_struct *mm = info->mm;
@@ -799,7 +805,8 @@ static int register_for_each_vma(struct uprobe *uprobe, bool is_register)
 		mmput(mm);
 		info = free_map_info(info);
 	}
-
+ out:
+	brw_end_write(&dup_mmap_mutex);
 	return err;
 }
 
@@ -1131,6 +1138,16 @@ void uprobe_clear_state(struct mm_struct *mm)
 	kfree(area);
 }
 
+void uprobe_start_dup_mmap(void)
+{
+	brw_start_read(&dup_mmap_mutex);
+}
+
+void uprobe_end_dup_mmap(void)
+{
+	brw_end_read(&dup_mmap_mutex);
+}
+
 void uprobe_dup_mmap(struct mm_struct *oldmm, struct mm_struct *newmm)
 {
 	newmm->uprobes_state.xol_area = NULL;
@@ -1605,6 +1622,9 @@ static int __init init_uprobes(void)
 		mutex_init(&uprobes_mmap_mutex[i]);
 	}
 
+	if (brw_mutex_init(&dup_mmap_mutex))
+		return -ENOMEM;
+
 	return register_die_notifier(&uprobe_exception_nb);
 }
 module_init(init_uprobes);
diff --git a/kernel/fork.c b/kernel/fork.c
index 8b20ab7..c497e57 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -352,6 +352,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
 	unsigned long charge;
 	struct mempolicy *pol;
 
+	uprobe_start_dup_mmap();
 	down_write(&oldmm->mmap_sem);
 	flush_cache_dup_mm(oldmm);
 	uprobe_dup_mmap(oldmm, mm);
@@ -469,6 +470,7 @@ out:
 	up_write(&mm->mmap_sem);
 	flush_tlb_mm(oldmm);
 	up_write(&oldmm->mmap_sem);
+	uprobe_end_dup_mmap();
 	return retval;
 fail_nomem_anon_vma_fork:
 	mpol_put(pol);
-- 
1.5.5.1



* Re: [PATCH 1/2] brw_mutex: big read-write mutex
  2012-10-15 19:10 ` [PATCH 1/2] brw_mutex: big read-write mutex Oleg Nesterov
@ 2012-10-15 23:28   ` Paul E. McKenney
  2012-10-16 15:56     ` Oleg Nesterov
  2012-10-16 19:56   ` Linus Torvalds
  1 sibling, 1 reply; 103+ messages in thread
From: Paul E. McKenney @ 2012-10-15 23:28 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Ingo Molnar, Linus Torvalds, Peter Zijlstra, Srikar Dronamraju,
	Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel

On Mon, Oct 15, 2012 at 09:10:18PM +0200, Oleg Nesterov wrote:
> This patch adds the new sleeping lock, brw_mutex. Unlike rw_semaphore
> it allows multiple writers too, just "read" and "write" are mutually
> exclusive.
> 
> brw_start_read() and brw_end_read() are extremely cheap, they only do
> this_cpu_inc(read_ctr) + atomic_read() if there are no waiting writers.
> 
> OTOH it is write-biased, any brw_start_write() blocks the new readers.
> But "write" is slow, it does synchronize_sched() to serialize with
> preempt_disable() in brw_start_read(), and wait_event(write_waitq) can
> have a lot of extra wakeups before percpu-counter-sum becomes zero.

A few questions and comments below, as always.

							Thanx, Paul

> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> ---
>  include/linux/brw_mutex.h |   22 +++++++++++++++
>  lib/Makefile              |    2 +-
>  lib/brw_mutex.c           |   67 +++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 90 insertions(+), 1 deletions(-)
>  create mode 100644 include/linux/brw_mutex.h
>  create mode 100644 lib/brw_mutex.c
> 
> diff --git a/include/linux/brw_mutex.h b/include/linux/brw_mutex.h
> new file mode 100644
> index 0000000..16b8d5f
> --- /dev/null
> +++ b/include/linux/brw_mutex.h
> @@ -0,0 +1,22 @@
> +#ifndef _LINUX_BRW_MUTEX_H
> +#define _LINUX_BRW_MUTEX_H
> +
> +#include <linux/percpu.h>
> +#include <linux/wait.h>
> +
> +struct brw_mutex {
> +	long __percpu		*read_ctr;
> +	atomic_t		write_ctr;
> +	wait_queue_head_t	read_waitq;
> +	wait_queue_head_t	write_waitq;
> +};
> +
> +extern int brw_mutex_init(struct brw_mutex *brw);
> +
> +extern void brw_start_read(struct brw_mutex *brw);
> +extern void brw_end_read(struct brw_mutex *brw);
> +
> +extern void brw_start_write(struct brw_mutex *brw);
> +extern void brw_end_write(struct brw_mutex *brw);
> +
> +#endif
> diff --git a/lib/Makefile b/lib/Makefile
> index 3128e35..18f2876 100644
> --- a/lib/Makefile
> +++ b/lib/Makefile
> @@ -12,7 +12,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
>  	 idr.o int_sqrt.o extable.o \
>  	 sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \
>  	 proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \
> -	 is_single_threaded.o plist.o decompress.o
> +	 is_single_threaded.o plist.o decompress.o brw_mutex.o
> 
>  lib-$(CONFIG_MMU) += ioremap.o
>  lib-$(CONFIG_SMP) += cpumask.o
> diff --git a/lib/brw_mutex.c b/lib/brw_mutex.c
> new file mode 100644
> index 0000000..41984a6
> --- /dev/null
> +++ b/lib/brw_mutex.c
> @@ -0,0 +1,67 @@
> +#include <linux/brw_mutex.h>
> +#include <linux/rcupdate.h>
> +#include <linux/sched.h>
> +
> +int brw_mutex_init(struct brw_mutex *brw)
> +{
> +	atomic_set(&brw->write_ctr, 0);
> +	init_waitqueue_head(&brw->read_waitq);
> +	init_waitqueue_head(&brw->write_waitq);
> +	brw->read_ctr = alloc_percpu(long);
> +	return brw->read_ctr ? 0 : -ENOMEM;
> +}
> +
> +void brw_start_read(struct brw_mutex *brw)
> +{
> +	for (;;) {
> +		bool done = false;
> +
> +		preempt_disable();
> +		if (likely(!atomic_read(&brw->write_ctr))) {
> +			__this_cpu_inc(*brw->read_ctr);
> +			done = true;
> +		}

brw_start_read() is not recursive -- attempting to call it recursively
can result in deadlock if a writer has shown up in the meantime.

Which is often OK, but not sure what you intended.

> +		preempt_enable();
> +
> +		if (likely(done))
> +			break;
> +
> +		__wait_event(brw->read_waitq, !atomic_read(&brw->write_ctr));
> +	}
> +}
> +
> +void brw_end_read(struct brw_mutex *brw)
> +{

I believe that you need smp_mb() here.  The wake_up_all()'s memory barriers
do not suffice because some other reader might have awakened the writer
between this_cpu_dec() and wake_up_all().  IIRC, this smp_mb() is also
needed if the timing is such that the writer does not actually block.

> +	this_cpu_dec(*brw->read_ctr);
> +
> +	if (unlikely(atomic_read(&brw->write_ctr)))
> +		wake_up_all(&brw->write_waitq);
> +}

Of course, it would be good to avoid smp_mb on the fast path.  Here is
one way to avoid it:

void brw_end_read(struct brw_mutex *brw)
{
	if (unlikely(atomic_read(&brw->write_ctr))) {
		smp_mb();
		this_cpu_dec(*brw->read_ctr);
		wake_up_all(&brw->write_waitq);
	} else {
		this_cpu_dec(*brw->read_ctr);
	}
}

> +static inline long brw_read_ctr(struct brw_mutex *brw)
> +{
> +	long sum = 0;
> +	int cpu;
> +
> +	for_each_possible_cpu(cpu)
> +		sum += per_cpu(*brw->read_ctr, cpu);
> +
> +	return sum;
> +}
> +
> +void brw_start_write(struct brw_mutex *brw)
> +{
> +	atomic_inc(&brw->write_ctr);
> +	synchronize_sched();
> +	/*
> +	 * Thereafter brw_*_read() must see write_ctr != 0,
> +	 * and we should see the result of __this_cpu_inc().
> +	 */
> +	wait_event(brw->write_waitq, brw_read_ctr(brw) == 0);

This looks like it allows multiple writers to proceed concurrently.
They both increment, do a synchronize_sched(), do the wait_event(),
and then are both awakened by the last reader.

Was that the intent?  (The implementation of brw_end_write() makes
it look like it is in fact the intent.)

> +}
> +
> +void brw_end_write(struct brw_mutex *brw)
> +{
> +	if (atomic_dec_and_test(&brw->write_ctr))
> +		wake_up_all(&brw->read_waitq);
> +}
> -- 
> 1.5.5.1
> 



* Re: [PATCH 1/2] brw_mutex: big read-write mutex
  2012-10-15 23:28   ` Paul E. McKenney
@ 2012-10-16 15:56     ` Oleg Nesterov
  2012-10-16 18:58       ` Paul E. McKenney
  0 siblings, 1 reply; 103+ messages in thread
From: Oleg Nesterov @ 2012-10-16 15:56 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Ingo Molnar, Linus Torvalds, Peter Zijlstra, Srikar Dronamraju,
	Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel

Paul, thanks for looking!

On 10/15, Paul E. McKenney wrote:
>
> > +void brw_start_read(struct brw_mutex *brw)
> > +{
> > +	for (;;) {
> > +		bool done = false;
> > +
> > +		preempt_disable();
> > +		if (likely(!atomic_read(&brw->write_ctr))) {
> > +			__this_cpu_inc(*brw->read_ctr);
> > +			done = true;
> > +		}
>
> brw_start_read() is not recursive -- attempting to call it recursively
> can result in deadlock if a writer has shown up in the meantime.

Yes, yes, it is not recursive. Like rw_semaphore.

> Which is often OK, but not sure what you intended.

I forgot to document this in the changelog.

> > +void brw_end_read(struct brw_mutex *brw)
> > +{
>
> I believe that you need smp_mb() here.

I don't understand why...

> The wake_up_all()'s memory barriers
> do not suffice because some other reader might have awakened the writer
> between this_cpu_dec() and wake_up_all().

But __wake_up(q) takes q->lock? And the same lock is taken by
prepare_to_wait(), so how can the writer miss the result of _dec?

> > +	this_cpu_dec(*brw->read_ctr);
> > +
> > +	if (unlikely(atomic_read(&brw->write_ctr)))
> > +		wake_up_all(&brw->write_waitq);
> > +}
>
> Of course, it would be good to avoid smp_mb on the fast path.  Here is
> one way to avoid it:
>
> void brw_end_read(struct brw_mutex *brw)
> {
> 	if (unlikely(atomic_read(&brw->write_ctr))) {
> 		smp_mb();
> 		this_cpu_dec(*brw->read_ctr);
> 		wake_up_all(&brw->write_waitq);

Hmm... still can't understand.

It seems that this mb() is needed to ensure that brw_end_read() can't
miss write_ctr != 0.

But we do not care unless the writer already does wait_event(). And
before it does wait_event() it calls synchronize_sched() after it sets
write_ctr != 0. Doesn't this mean that after that any preempt-disabled
section must see write_ctr != 0 ?

This code actually checks write_ctr after preempt_disable + enable,
but I think this doesn't matter?

Paul, most probably I misunderstood you. Could you spell it out, please?

> > +void brw_start_write(struct brw_mutex *brw)
> > +{
> > +	atomic_inc(&brw->write_ctr);
> > +	synchronize_sched();
> > +	/*
> > +	 * Thereafter brw_*_read() must see write_ctr != 0,
> > +	 * and we should see the result of __this_cpu_inc().
> > +	 */
> > +	wait_event(brw->write_waitq, brw_read_ctr(brw) == 0);
>
> This looks like it allows multiple writers to proceed concurrently.
> They both increment, do a synchronize_sched(), do the wait_event(),
> and then are both awakened by the last reader.

Yes. From the changelog:

	Unlike rw_semaphore it allows multiple writers too,
	just "read" and "write" are mutually exclusive.

> Was that the intent?  (The implementation of brw_end_write() makes
> it look like it is in fact the intent.)

Please look at 2/2.

Multiple uprobe_register() or uprobe_unregister() can run at the
same time to install/remove the system-wide breakpoint, and
brw_start_write() is used to block dup_mmap() to avoid the race.
But they do not block each other.

Thanks!

Oleg.



* Re: [PATCH 1/2] brw_mutex: big read-write mutex
  2012-10-16 15:56     ` Oleg Nesterov
@ 2012-10-16 18:58       ` Paul E. McKenney
  2012-10-17 16:37         ` Oleg Nesterov
  0 siblings, 1 reply; 103+ messages in thread
From: Paul E. McKenney @ 2012-10-16 18:58 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Ingo Molnar, Linus Torvalds, Peter Zijlstra, Srikar Dronamraju,
	Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel

On Tue, Oct 16, 2012 at 05:56:23PM +0200, Oleg Nesterov wrote:
> Paul, thanks for looking!
> 
> On 10/15, Paul E. McKenney wrote:
> >
> > > +void brw_start_read(struct brw_mutex *brw)
> > > +{
> > > +	for (;;) {
> > > +		bool done = false;
> > > +
> > > +		preempt_disable();
> > > +		if (likely(!atomic_read(&brw->write_ctr))) {
> > > +			__this_cpu_inc(*brw->read_ctr);
> > > +			done = true;
> > > +		}
> >
> > brw_start_read() is not recursive -- attempting to call it recursively
> > can result in deadlock if a writer has shown up in the meantime.
> 
> Yes, yes, it is not recursive. Like rw_semaphore.
> 
> > Which is often OK, but not sure what you intended.
> 
> I forgot to document this in the changelog.

Hey, I had to ask.  ;-)

> > > +void brw_end_read(struct brw_mutex *brw)
> > > +{
> >
> > I believe that you need smp_mb() here.
> 
> I don't understand why...
> 
> > The wake_up_all()'s memory barriers
> > do not suffice because some other reader might have awakened the writer
> > between this_cpu_dec() and wake_up_all().
> 
> But __wake_up(q) takes q->lock? And the same lock is taken by
> prepare_to_wait(), so how can the writer miss the result of _dec?

Suppose that the writer arrives and sees that the value of the counter
is zero, and thus never sleeps, and so is also not awakened?  Unless I
am missing something, there are no memory barriers in that case.

Which means that you also need an smp_mb() after the wait_event()
in the writer, now that I think on it.

> > > +	this_cpu_dec(*brw->read_ctr);
> > > +
> > > +	if (unlikely(atomic_read(&brw->write_ctr)))
> > > +		wake_up_all(&brw->write_waitq);
> > > +}
> >
> > Of course, it would be good to avoid smp_mb on the fast path.  Here is
> > one way to avoid it:
> >
> > void brw_end_read(struct brw_mutex *brw)
> > {
> > 	if (unlikely(atomic_read(&brw->write_ctr))) {
> > 		smp_mb();
> > 		this_cpu_dec(*brw->read_ctr);
> > 		wake_up_all(&brw->write_waitq);
> 
> Hmm... still can't understand.
> 
> It seems that this mb() is needed to ensure that brw_end_read() can't
> miss write_ctr != 0.
> 
> But we do not care unless the writer already does wait_event(). And
> before it does wait_event() it calls synchronize_sched() after it sets
> write_ctr != 0. Doesn't this mean that after that any preempt-disabled
> section must see write_ctr != 0 ?
> 
> This code actually checks write_ctr after preempt_disable + enable,
> but I think this doesn't matter?
> 
> Paul, most probably I misunderstood you. Could you spell it out, please?

Let me try outlining the sequence of events that I am worried about...

1.	Task A invokes brw_start_read().  There is no writer, so it
	takes the fastpath.

2.	Task B invokes brw_start_write(), atomically increments
	&brw->write_ctr, and executes synchronize_sched().

3.	Task A invokes brw_end_read() and does this_cpu_dec().

4.	Task B invokes wait_event(), which invokes brw_read_ctr()
	and sees the result as zero.  Therefore, Task B does
	not sleep, does not acquire locks, and does not execute
	any memory barriers.  As a result, ordering is not
	guaranteed between Task A's read-side critical section
	and Task B's upcoming write-side critical section.

So I believe that you need smp_mb() in both brw_end_read() and
brw_start_write().

Sigh...  It is quite possible that you also need an smp_mb() in
brw_start_read(), but let's start with just the scenario above.

So, does the above scenario show a problem, or am I confused?
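
A sketch of the placement being suggested, with smp_mb() added in both
brw_end_read() and brw_start_write(); illustrative only, not code from the
posted patch:

	void brw_end_read(struct brw_mutex *brw)
	{
		/*
		 * Assumed placement: order everything in the reader's
		 * critical section before the decrement that the writer's
		 * wait_event() condition reads.
		 */
		smp_mb();
		this_cpu_dec(*brw->read_ctr);

		if (unlikely(atomic_read(&brw->write_ctr)))
			wake_up_all(&brw->write_waitq);
	}

	void brw_start_write(struct brw_mutex *brw)
	{
		atomic_inc(&brw->write_ctr);
		synchronize_sched();
		wait_event(brw->write_waitq, brw_read_ctr(brw) == 0);
		/*
		 * Assumed placement: order the counter checks before the
		 * writer's critical section, covering the case where
		 * wait_event() never actually slept.
		 */
		smp_mb();
	}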

> > > +void brw_start_write(struct brw_mutex *brw)
> > > +{
> > > +	atomic_inc(&brw->write_ctr);
> > > +	synchronize_sched();
> > > +	/*
> > > +	 * Thereafter brw_*_read() must see write_ctr != 0,
> > > +	 * and we should see the result of __this_cpu_inc().
> > > +	 */
> > > +	wait_event(brw->write_waitq, brw_read_ctr(brw) == 0);
> >
> > This looks like it allows multiple writers to proceed concurrently.
> > They both increment, do a synchronize_sched(), do the wait_event(),
> > and then are both awakened by the last reader.
> 
> Yes. From the changelog:
> 
> 	Unlike rw_semaphore it allows multiple writers too,
> 	just "read" and "write" are mutually exclusive.

OK, color me blind!  ;-)

> > Was that the intent?  (The implementation of brw_end_write() makes
> > it look like it is in fact the intent.)
> 
> Please look at 2/2.
> 
> Multiple uprobe_register() or uprobe_unregister() can run at the
> same time to install/remove the system-wide breakpoint, and
> brw_start_write() is used to block dup_mmap() to avoid the race.
> But they do not block each other.

Ah, makes sense, thank you!

							Thanx, Paul



* Re: [PATCH 1/2] brw_mutex: big read-write mutex
  2012-10-15 19:10 ` [PATCH 1/2] brw_mutex: big read-write mutex Oleg Nesterov
  2012-10-15 23:28   ` Paul E. McKenney
@ 2012-10-16 19:56   ` Linus Torvalds
  2012-10-17 16:59     ` Oleg Nesterov
  1 sibling, 1 reply; 103+ messages in thread
From: Linus Torvalds @ 2012-10-16 19:56 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Ingo Molnar, Paul E. McKenney, Peter Zijlstra, Srikar Dronamraju,
	Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel

On Mon, Oct 15, 2012 at 12:10 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> This patch adds the new sleeping lock, brw_mutex. Unlike rw_semaphore
> it allows multiple writers too, just "read" and "write" are mutually
> exclusive.

So those semantics just don't sound sane. It's also not what any kind
of normal "rw" lock ever does.

So can you explain why these particular insane semantics are useful,
and what for?

                  Linus


* Re: [PATCH 1/2] brw_mutex: big read-write mutex
  2012-10-16 18:58       ` Paul E. McKenney
@ 2012-10-17 16:37         ` Oleg Nesterov
  2012-10-17 22:28           ` Paul E. McKenney
  0 siblings, 1 reply; 103+ messages in thread
From: Oleg Nesterov @ 2012-10-17 16:37 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Ingo Molnar, Linus Torvalds, Peter Zijlstra, Srikar Dronamraju,
	Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel

On 10/16, Paul E. McKenney wrote:
>
> On Tue, Oct 16, 2012 at 05:56:23PM +0200, Oleg Nesterov wrote:
> > >
> > > I believe that you need smp_mb() here.
> >
> > I don't understand why...
> >
> > > The wake_up_all()'s memory barriers
> > > do not suffice because some other reader might have awakened the writer
> > > between this_cpu_dec() and wake_up_all().
> >
> > But __wake_up(q) takes q->lock? And the same lock is taken by
> > prepare_to_wait(), so how can the writer miss the result of _dec?
>
> Suppose that the writer arrives and sees that the value of the counter
> is zero,

after synchronize_sched(). So there are no readers (but perhaps there
are brw_end_read's in flight which already decremented read_ctr)

> and thus never sleeps, and so is also not awakened?

and why do we need wakeup in this case?

> > > void brw_end_read(struct brw_mutex *brw)
> > > {
> > > 	if (unlikely(atomic_read(&brw->write_ctr))) {
> > > 		smp_mb();
> > > 		this_cpu_dec(*brw->read_ctr);
> > > 		wake_up_all(&brw->write_waitq);
> >
> > Hmm... still can't understand.
> >
> > It seems that this mb() is needed to ensure that brw_end_read() can't
> > miss write_ctr != 0.
> >
> > But we do not care unless the writer already does wait_event(). And
> > before it does wait_event() it calls synchronize_sched() after it sets
> > write_ctr != 0. Doesn't this mean that after that any preempt-disabled
> > section must see write_ctr != 0 ?
> >
> > This code actually checks write_ctr after preempt_disable + enable,
> > but I think this doesn't matter?
> >
> > > Paul, most probably I misunderstood you. Could you spell it out, please?
>
> Let me try outlining the sequence of events that I am worried about...
>
> 1.	Task A invokes brw_start_read().  There is no writer, so it
> 	takes the fastpath.
>
> 2.	Task B invokes brw_start_write(), atomically increments
> 	&brw->write_ctr, and executes synchronize_sched().
>
> 3.	Task A invokes brw_end_read() and does this_cpu_dec().

OK. And to simplify this discussion, suppose that A invoked
brw_start_read() on CPU_0 and thus incremented read_ctr[0], and
then it migrates to CPU_1 and brw_end_read() uses read_ctr[1].

My understanding was, brw_start_write() must see read_ctr[0] == 1
after synchronize_sched().

> 4.	Task B invokes wait_event(), which invokes brw_read_ctr()
> 	and sees the result as zero.

So my understanding is completely wrong? I thought that after
synchronize_sched() we should see the result of any operation
which was done inside the preempt-disable section.

No?

Hmm. Suppose that we have long A = B = STOP = 0, and

	void func(void)
	{
		preempt_disable();
		if (!STOP) {
			A = 1;
			B = 1;
		}
		preempt_enable();
	}

Now, you are saying that this code

	STOP = 1;

	synchronize_sched();

	BUG_ON(A != B);

is not correct? (yes, yes, this example is not very good).

The comment above synchronize_sched() says:

	return ... after all currently executing
	rcu-sched read-side critical sections have completed.

But if this code is wrong, then what does "completed" actually mean?
I thought that it also means "all memory operations have completed",
but this is not true?

Oleg.



* Re: [PATCH 1/2] brw_mutex: big read-write mutex
  2012-10-16 19:56   ` Linus Torvalds
@ 2012-10-17 16:59     ` Oleg Nesterov
  2012-10-17 22:44       ` Paul E. McKenney
  0 siblings, 1 reply; 103+ messages in thread
From: Oleg Nesterov @ 2012-10-17 16:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Paul E. McKenney, Peter Zijlstra, Srikar Dronamraju,
	Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel

On 10/16, Linus Torvalds wrote:
>
> On Mon, Oct 15, 2012 at 12:10 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> > This patch adds the new sleeping lock, brw_mutex. Unlike rw_semaphore
> > it allows multiple writers too, just "read" and "write" are mutually
> > exclusive.
>
> So those semantics just don't sound sane. It's also not what any kind
> of normal "rw" lock ever does.

Yes, this is not usual.

And initially I made brw_sem which allows only 1 writer, but then
I changed this patch.

> So can you explain why these particular insane semantics are useful,
> and what for?

To allow multiple uprobe_register/unregister calls at the same time.
Mostly to avoid adding a "regression": currently this is possible.

It is not that I think this is terribly important, but still. And
personally I think that "multiple writers" is not necessarily insane
in general. Suppose you have a complex object/subsystem; the readers
can use a single brw_mutex to access it "lockless", since start_read()
is very cheap.

But start_write() is slow. Multiple writers can use fine-grained locking
inside the start_write/end_write section and thus do not block each other.
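
Roughly the pattern being described, as a sketch with a hypothetical object
that carries its own fine-grained lock (all names here are made up for
illustration):

	struct my_object {
		spinlock_t	lock;	/* fine-grained, writer vs writer */
		/* ... data ... */
	};

	static struct brw_mutex obj_brw;	/* hypothetical */

	void object_reader(struct my_object *obj)
	{
		brw_start_read(&obj_brw);
		/* access obj "lockless"; no writer can be inside its section */
		brw_end_read(&obj_brw);
	}

	void object_writer(struct my_object *obj)
	{
		brw_start_write(&obj_brw);	/* excludes readers only */
		spin_lock(&obj->lock);		/* serializes with other writers */
		/* ... update obj ... */
		spin_unlock(&obj->lock);
		brw_end_write(&obj_brw);
	}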

Oleg.



* Re: [PATCH 1/2] brw_mutex: big read-write mutex
  2012-10-17 16:37         ` Oleg Nesterov
@ 2012-10-17 22:28           ` Paul E. McKenney
  0 siblings, 0 replies; 103+ messages in thread
From: Paul E. McKenney @ 2012-10-17 22:28 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Ingo Molnar, Linus Torvalds, Peter Zijlstra, Srikar Dronamraju,
	Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel

On Wed, Oct 17, 2012 at 06:37:02PM +0200, Oleg Nesterov wrote:
> On 10/16, Paul E. McKenney wrote:
> >
> > On Tue, Oct 16, 2012 at 05:56:23PM +0200, Oleg Nesterov wrote:
> > > >
> > > > I believe that you need smp_mb() here.
> > >
> > > I don't understand why...
> > >
> > > > The wake_up_all()'s memory barriers
> > > > do not suffice because some other reader might have awakened the writer
> > > > between this_cpu_dec() and wake_up_all().
> > >
> > > But __wake_up(q) takes q->lock? And the same lock is taken by
> > > prepare_to_wait(), so how can the writer miss the result of _dec?
> >
> > Suppose that the writer arrives and sees that the value of the counter
> > is zero,
> 
> after synchronize_sched(). So there are no readers (but perhaps there
> are brw_end_read's in flight which already decremented read_ctr)

But the preempt_disable() region only covers read acquisition.  So
synchronize_sched() waits only for all the brw_start_read() calls to
reach the preempt_enable() -- it cannot wait for all the resulting
readers to reach the corresponding brw_end_read().

> > and thus never sleeps, and so is also not awakened?
> 
> and why do we need wakeup in this case?

To get the memory barriers required to keep the critical sections
ordered -- to ensure that everyone sees the reader's critical section
as ending before the writer's critical section starts.

> > > > void brw_end_read(struct brw_mutex *brw)
> > > > {
> > > > 	if (unlikely(atomic_read(&brw->write_ctr))) {
> > > > 		smp_mb();
> > > > 		this_cpu_dec(*brw->read_ctr);
> > > > 		wake_up_all(&brw->write_waitq);
> > >
> > > Hmm... still can't understand.
> > >
> > > It seems that this mb() is needed to ensure that brw_end_read() can't
> > > miss write_ctr != 0.
> > >
> > > But we do not care unless the writer already does wait_event(). And
> > > before it does wait_event() it calls synchronize_sched() after it sets
> > > write_ctr != 0. Doesn't this mean that after that any preempt-disabled
> > > section must see write_ctr != 0 ?
> > >
> > > This code actually checks write_ctr after preempt_disable + enable,
> > > but I think this doesn't matter?
> > >
> > > Paul, most probably I misunderstood you. Could you spell it out, please?
> >
> > Let me try outlining the sequence of events that I am worried about...
> >
> > 1.	Task A invokes brw_start_read().  There is no writer, so it
> > 	takes the fastpath.
> >
> > 2.	Task B invokes brw_start_write(), atomically increments
> > 	&brw->write_ctr, and executes synchronize_sched().
> >
> > 3.	Task A invokes brw_end_read() and does this_cpu_dec().
> 
> OK. And to simplify this discussion, suppose that A invoked
> brw_start_read() on CPU_0 and thus incremented read_ctr[0], and
> then it migrates to CPU_1 and brw_end_read() uses read_ctr[1].
> 
> My understanding was, brw_start_write() must see read_ctr[0] == 1
> after synchronize_sched().

Yep.  But it makes absolutely no guarantee about ordering of the
decrement of read_ctr[1].

> > 4.	Task B invokes wait_event(), which invokes brw_read_ctr()
> > 	and sees the result as zero.
> 
> So my understanding is completely wrong? I thought that after
> synchronize_sched() we should see the result of any operation
> which was done inside the preempt-disable section.

We should indeed.  But the decrement of read_ctr[1] is not done within
the preempt_disable() section, and the guarantee therefore does not
apply to it.  This means that there is no guarantee that Task A's
read-side critical section will be ordered before Task B's read-side
critical section.

Now, maybe you don't need that guarantee, but if you don't, I am missing
what exactly these primitives are doing for you.

> No?
> 
> Hmm. Suppose that we have long A = B = STOP = 0, and
> 
> 	void func(void)
> 	{
> 		preempt_disable();
> 		if (!STOP) {
> 			A = 1;
> 			B = 1;
> 		}
> 		preempt_enable();
> 	}
> 
> Now, you are saying that this code
> 
> 	STOP = 1;
> 
> 	synchronize_sched();
> 
> 	BUG_ON(A != B);
> 
> is not correct? (yes, yes, this example is not very good).

Yep.  Assuming no other modifications to A and B, at the point of
the BUG_ON(), we should have A==1 and B==1.

The thing is that the preempt_disable() in your patch only covers
brw_start_read(), but not brw_end_read().  So the decrement (along with
the rest of the read-side critical section) is unordered with respect
to the write-side critical section started by the brw_start_write().

> The comment above synchronize_sched() says:
> 
> 	return ... after all currently executing
> 	rcu-sched read-side critical sections have completed.
> 
> But if this code is wrong, then what does "completed" actually mean?
> I thought that it also means "all memory operations have completed",
> but this is not true?

From what I can see, your interpretation of synchronize_sched() is
correct.

The problem is that brw_end_read() isn't within the relevant
rcu-sched read-side critical section.

Or that I am confused....
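
For reference, here is the reader-side code from the patch again, with
comments added to mark the span that synchronize_sched() actually waits for:

	void brw_start_read(struct brw_mutex *brw)
	{
		for (;;) {
			bool done = false;

			preempt_disable();	/* rcu-sched section begins */
			if (likely(!atomic_read(&brw->write_ctr))) {
				__this_cpu_inc(*brw->read_ctr);
				done = true;
			}
			preempt_enable();	/* ... and ends here */

			if (likely(done))
				break;

			__wait_event(brw->read_waitq, !atomic_read(&brw->write_ctr));
		}
	}

	void brw_end_read(struct brw_mutex *brw)
	{
		/*
		 * Runs outside any preempt-disabled region, so the
		 * synchronize_sched() in brw_start_write() does not order
		 * the writer against this decrement.
		 */
		this_cpu_dec(*brw->read_ctr);

		if (unlikely(atomic_read(&brw->write_ctr)))
			wake_up_all(&brw->write_waitq);
	}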

							Thanx, Paul

> Oleg.
> 



* Re: [PATCH 1/2] brw_mutex: big read-write mutex
  2012-10-17 16:59     ` Oleg Nesterov
@ 2012-10-17 22:44       ` Paul E. McKenney
  2012-10-18 16:24         ` Oleg Nesterov
  0 siblings, 1 reply; 103+ messages in thread
From: Paul E. McKenney @ 2012-10-17 22:44 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Ingo Molnar, Peter Zijlstra, Srikar Dronamraju,
	Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel

On Wed, Oct 17, 2012 at 06:59:02PM +0200, Oleg Nesterov wrote:
> On 10/16, Linus Torvalds wrote:
> >
> > On Mon, Oct 15, 2012 at 12:10 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> > > This patch adds the new sleeping lock, brw_mutex. Unlike rw_semaphore
> > > it allows multiple writers too, just "read" and "write" are mutually
> > > exclusive.
> >
> > So those semantics just don't sound sane. It's also not what any kind
> > of normal "rw" lock ever does.
> 
> Yes, this is not usual.
> 
> And initially I made brw_sem which allows only 1 writer, but then
> I changed this patch.
> 
> > So can you explain why these particular insane semantics are useful,
> > and what for?
> 
> To allow multiple uprobe_register/unregister calls at the same time.
> Mostly to avoid adding a "regression": currently this is possible.
> 
> It is not that I think this is terribly important, but still. And
> personally I think that "multiple writers" is not necessarily insane
> in general. Suppose you have a complex object/subsystem; the readers
> can use a single brw_mutex to access it "lockless", since start_read()
> is very cheap.
> 
> But start_write() is slow. Multiple writers can use fine-grained locking
> inside the start_write/end_write section and thus do not block each other.

Strangely enough, the old VAXCluster locking primitives allowed this
sort of thing.  The brw_start_read() would be a "protected read", and
brw_start_write() would be a "concurrent write".  Even more interesting,
they gave the same advice you give -- concurrent writes should use
fine-grained locking to protect the actual accesses.

It seems like it should be possible to come up with better names,
but I cannot think of any at the moment.

							Thanx, Paul

PS.  For the sufficiently masochistic, here is the exclusion table
     for the six VAXCluster locking modes:

		NL	CR	CW	PR	PW	EX
	NL
	CR						X
	CW				X	X	X
	PR			X		X	X
	PW			X	X	X	X
	EX		X	X	X	X	X

	"X" means that the pair of modes exclude each other,
	otherwise the lock may be held in both of the modes
	simultaneously.


Modes:

NL:	Null, or "not held".
CR:	Concurrent read.
CW:	Concurrent write.
PR:	Protected read.
PW:	Protected write.
EX:	Exclusive.


A reader-writer lock could use protected read for readers and either of
protected write or exclusive for writers, the difference between protected
write and exclusive being irrelevant in the absence of concurrent readers.



* Re: [PATCH 2/2] uprobes: Use brw_mutex to fix register/unregister vs dup_mmap() race
  2012-10-15 19:10 ` [PATCH 2/2] uprobes: Use brw_mutex to fix register/unregister vs dup_mmap() race Oleg Nesterov
@ 2012-10-18  7:03   ` Srikar Dronamraju
  0 siblings, 0 replies; 103+ messages in thread
From: Srikar Dronamraju @ 2012-10-18  7:03 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Ingo Molnar, Linus Torvalds, Paul E. McKenney, Peter Zijlstra,
	Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel

> This was always racy, but 268720903f87e0b84b161626c4447b81671b5d18
> "uprobes: Rework register_for_each_vma() to make it O(n)" should be
> blamed anyway, it made everything worse and I didn't notice.
> 
> register/unregister call build_map_info() and then do install/remove
> breakpoint for every mm which mmaps inode/offset. This can obviously
> race with fork()->dup_mmap() in between and we can miss the child.
> 
> uprobe_register() could be easily fixed but unregister is much worse,
> the new mm inherits "int3" from parent and there is no way to detect
> this if uprobe goes away.
> 
> So this patch simply adds brw_start/end_read() around dup_mmap(), and
> brw_start/end_write() into register_for_each_vma().
> 
> This adds 2 new hooks into dup_mmap() but we can kill uprobe_dup_mmap()
> and fold it into uprobe_end_dup_mmap().
> 
> Reported-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>

Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>



* Re: [PATCH 1/2] brw_mutex: big read-write mutex
  2012-10-17 22:44       ` Paul E. McKenney
@ 2012-10-18 16:24         ` Oleg Nesterov
  2012-10-18 16:38           ` Paul E. McKenney
  0 siblings, 1 reply; 103+ messages in thread
From: Oleg Nesterov @ 2012-10-18 16:24 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Linus Torvalds, Ingo Molnar, Peter Zijlstra, Srikar Dronamraju,
	Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel

On 10/17, Paul E. McKenney wrote:
>
> On Wed, Oct 17, 2012 at 06:37:02PM +0200, Oleg Nesterov wrote:
> > On 10/16, Paul E. McKenney wrote:
> > >
> > > Suppose that the writer arrives and sees that the value of the counter
> > > is zero,
> >
> > after synchronize_sched(). So there are no readers (but perhaps there
> > are brw_end_read's in flight which already decremented read_ctr)
>
> But the preempt_disable() region only covers read acquisition.  So
> synchronize_sched() waits only for all the brw_start_read() calls to
> reach the preempt_enable()

Yes.

> -- it cannot wait for all the resulting
> readers to reach the corresponding brw_end_read().

Indeed.

> > > and thus never sleeps, and so is also not awakened?
> >
> > and why do we need wakeup in this case?
>
> To get the memory barriers required to keep the critical sections
> ordered -- to ensure that everyone sees the reader's critical section
> as ending before the writer's critical section starts.

And now I am starting to think I misunderstood your concern from
the very beginning.

I thought that you meant that without mb() brw_start_write() can
race with brw_end_read() and hang forever.

But probably you meant that we need the barriers to ensure that,
say, if the reader does

	brw_start_read();
	CONDITION = 1;
	brw_end_read();

then the writer must see CONDITION != 0 after brw_start_write() ?
(or vice-versa)


In this case we need the barrier, yes. Obviously brw_start_write()
can return right after this_cpu_dec() and before wake_up_all().

2/2 doesn't need this guarantee but I agree, this doesn't look
sane in general...

Or I misunderstood you again?

Oleg.



* Re: [PATCH 1/2] brw_mutex: big read-write mutex
  2012-10-18 16:24         ` Oleg Nesterov
@ 2012-10-18 16:38           ` Paul E. McKenney
  2012-10-18 17:57             ` Oleg Nesterov
  0 siblings, 1 reply; 103+ messages in thread
From: Paul E. McKenney @ 2012-10-18 16:38 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Ingo Molnar, Peter Zijlstra, Srikar Dronamraju,
	Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel

On Thu, Oct 18, 2012 at 06:24:09PM +0200, Oleg Nesterov wrote:
> On 10/17, Paul E. McKenney wrote:
> >
> > On Wed, Oct 17, 2012 at 06:37:02PM +0200, Oleg Nesterov wrote:
> > > On 10/16, Paul E. McKenney wrote:
> > > >
> > > > Suppose that the writer arrives and sees that the value of the counter
> > > > is zero,
> > >
> > > after synchronize_sched(). So there are no readers (but perhaps there
> > > are brw_end_read's in flight which already decremented read_ctr)
> >
> > But the preempt_disable() region only covers read acquisition.  So
> > synchronize_sched() waits only for all the brw_start_read() calls to
> > reach the preempt_enable()
> 
> Yes.
> 
> > -- it cannot wait for all the resulting
> > readers to reach the corresponding brw_end_read().
> 
> Indeed.
> 
> > > > and thus never sleeps, and so is also not awakened?
> > >
> > > and why do we need wakeup in this case?
> >
> > To get the memory barriers required to keep the critical sections
> > ordered -- to ensure that everyone sees the reader's critical section
> > as ending before the writer's critical section starts.
> 
> And now I am starting to think I misunderstood your concern from
> the very beginning.
> 
> I thought that you meant that without mb() brw_start_write() can
> race with brw_end_read() and hang forever.
> 
> But probably you meant that we need the barriers to ensure that,
> say, if the reader does
> 
> 	brw_start_read();
> 	CONDITION = 1;
> 	brw_end_read();
> 
> then the writer must see CONDITION != 0 after brw_start_write() ?
> (or vice-versa)

Yes, this is exactly my concern.

> In this case we need the barrier, yes. Obviously brw_start_write()
> can return right after this_cpu_dec() and before wake_up_all().
> 
> 2/2 doesn't need this guarantee but I agree, this doesn't look
> sane in general...

Or name it something not containing "lock".  And clearly document
the behavior and how it is to be used.  ;-)

Otherwise, someone will get confused and introduce bugs.

> Or I misunderstood you again?

No, this was indeed my concern.

							Thanx, Paul



* Re: [PATCH 1/2] brw_mutex: big read-write mutex
  2012-10-18 16:38           ` Paul E. McKenney
@ 2012-10-18 17:57             ` Oleg Nesterov
  2012-10-18 19:28               ` Mikulas Patocka
  2012-10-19 19:28               ` Paul E. McKenney
  0 siblings, 2 replies; 103+ messages in thread
From: Oleg Nesterov @ 2012-10-18 17:57 UTC (permalink / raw)
  To: Paul E. McKenney, Mikulas Patocka
  Cc: Linus Torvalds, Ingo Molnar, Peter Zijlstra, Srikar Dronamraju,
	Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel

On 10/18, Paul E. McKenney wrote:
>
> On Thu, Oct 18, 2012 at 06:24:09PM +0200, Oleg Nesterov wrote:
> >
> > I thought that you meant that without mb() brw_start_write() can
> > race with brw_end_read() and hang forever.
> >
> > But probably you meant that we need the barriers to ensure that,
> > say, if the reader does
> >
> > 	brw_start_read();
> > 	CONDITION = 1;
> > 	brw_end_read();
> >
> > then the writer must see CONDITION != 0 after brw_start_write() ?
> > (or vice-versa)
>
> Yes, this is exactly my concern.

Oh, thanks a lot Paul (as always).

> > In this case we need the barrier, yes. Obviously brw_start_write()
> > can return right after this_cpu_dec() and before wake_up_all().
> >
> > 2/2 doesn't need this guarantee but I agree, this doesn't look
> > sane in general...
>
> Or name it something not containing "lock".  And clearly document
> the behavior and how it is to be used.  ;-)

this would be insane, I guess ;)

So. Ignoring the possible optimization you mentioned before,
brw_end_read() should do:

	smp_mb();
	this_cpu_dec();

	wake_up_all();

And yes, we need the full mb(). wmb() is enough to ensure that the
writer will see the memory modifications done by the reader. But we
also need to ensure that any LOAD inside start_read/end_read can not
be moved outside of the critical section.

But we should also ensure that "read" will see all modifications
which were done under start_write/end_write. This means that
brw_end_write() needs another synchronize_sched() before
atomic_dec_and_test(), or brw_start_read() needs mb() in the
fast-path.

Correct?
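
Spelled out as code, the end_write side of that requirement would look
roughly like this (a sketch of the alternative with the extra
synchronize_sched(); not a posted patch):

	void brw_end_write(struct brw_mutex *brw)
	{
		/*
		 * Sketch: either this synchronize_sched(), before write_ctr
		 * can drop to zero, or an mb() in the brw_start_read()
		 * fast path, so that the next reader is guaranteed to see
		 * the modifications done by the writer.
		 */
		synchronize_sched();
		if (atomic_dec_and_test(&brw->write_ctr))
			wake_up_all(&brw->read_waitq);
	}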



Ooooh. And I just noticed include/linux/percpu-rwsem.h which does
something similar. Certainly it was not in my tree when I started
this patch... percpu_down_write() doesn't allow multiple writers,
but the main problem is that it uses msleep(1). It should not, I think.

But. It seems that percpu_up_write() is equally wrong? Doesn't
it need synchronize_rcu() before "p->locked = false" ?

(add Mikulas)

Oleg.



* Re: [PATCH 1/2] brw_mutex: big read-write mutex
  2012-10-18 17:57             ` Oleg Nesterov
@ 2012-10-18 19:28               ` Mikulas Patocka
  2012-10-19 12:38                 ` Peter Zijlstra
  2012-10-19 19:28               ` Paul E. McKenney
  1 sibling, 1 reply; 103+ messages in thread
From: Mikulas Patocka @ 2012-10-18 19:28 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Paul E. McKenney, Linus Torvalds, Ingo Molnar, Peter Zijlstra,
	Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov,
	linux-kernel



On Thu, 18 Oct 2012, Oleg Nesterov wrote:

> Ooooh. And I just noticed include/linux/percpu-rwsem.h which does
> something similar. Certainly it was not in my tree when I started
> this patch... percpu_down_write() doesn't allow multiple writers,
> but the main problem is that it uses msleep(1). It should not, I think.

synchronize_rcu() can sleep for a hundred milliseconds, so msleep(1) is not
a big problem.

> But. It seems that percpu_up_write() is equally wrong? Doesn't
> it need synchronize_rcu() before "p->locked = false" ?

Yes, it does ... and I sent a patch for that to Linus.

> (add Mikulas)
> 
> Oleg.

Mikulas


* Re: [PATCH 1/2] brw_mutex: big read-write mutex
  2012-10-18 19:28               ` Mikulas Patocka
@ 2012-10-19 12:38                 ` Peter Zijlstra
  2012-10-19 15:32                   ` Mikulas Patocka
  0 siblings, 1 reply; 103+ messages in thread
From: Peter Zijlstra @ 2012-10-19 12:38 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Oleg Nesterov, Paul E. McKenney, Linus Torvalds, Ingo Molnar,
	Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov,
	linux-kernel, Thomas Gleixner

On Thu, 2012-10-18 at 15:28 -0400, Mikulas Patocka wrote:
> 
> On Thu, 18 Oct 2012, Oleg Nesterov wrote:
> 
> > Ooooh. And I just noticed include/linux/percpu-rwsem.h which does
> > something similar. Certainly it was not in my tree when I started
> > this patch... percpu_down_write() doesn't allow multiple writers,
> > > but the main problem is that it uses msleep(1). It should not, I think.
> 
> synchronize_rcu() can sleep for a hundred milliseconds, so msleep(1) is not
> a big problem.

That code is beyond ugly though.. it should really not have been merged.

There's absolutely no reason for it to use RCU except to make it more
complicated. And as Oleg pointed out, that msleep() is very ill
considered.

The very worst part of it seems to be that nobody who's usually involved
with locking primitives was ever consulted (Linus, PaulMck, Oleg, Ingo,
tglx, dhowells and me). It doesn't even have lockdep annotations :/

So the only reason you appear to use RCU is because you don't actually
have a sane way to wait for count==0. And I'm contesting rcu_sync() is
sane here -- for the very simple reason you still need while (count)
loop right after it.

So it appears you want a totally reader-biased, sleepable rw-lock like
thing?

So did you consider keeping the inc/dec on the same per-cpu variable?
Yes this adds a potential remote access to dec and requires you to use
atomics, but I would not be surprised if the inc/dec were mostly on the
same cpu most of the times -- which might be plenty fast for what you
want.

If you've got coherent per-cpu counts, you can better do the
waitqueue/wake condition for write_down.
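
A rough sketch of that suggestion, assuming a hypothetical per-cpu atomic
counter that the reader remembers across its critical section (names and
layout are made up; the writer-side blocking of new readers is omitted):

	struct pcpu_count_lock {		/* hypothetical */
		atomic_t __percpu	*cnt;
		wait_queue_head_t	writer_wq;
		/* writer-side state (pending flag, etc.) omitted */
	};

	static inline atomic_t *count_read_lock(struct pcpu_count_lock *l)
	{
		atomic_t *c = this_cpu_ptr(l->cnt);

		atomic_inc(c);		/* LOCK'd, but normally cache-hot */
		return c;		/* remember which CPU's counter we took */
	}

	static inline void count_read_unlock(struct pcpu_count_lock *l, atomic_t *c)
	{
		/* usually the same CPU, hence the same cacheline; possibly remote */
		if (atomic_dec_and_test(c))
			wake_up(&l->writer_wq);	/* writer waits for all counters == 0 */
	}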

It might also make sense to do away with the mutex, there's no point in
serializing the wakeups in the p->locked case of down_read. Furthermore,
p->locked seems a complete duplicate of the mutex state, so removing the
mutex also removes that duplication.

Also, that CONFIG_x86 thing.. *shudder*...


* Re: [PATCH 1/2] brw_mutex: big read-write mutex
  2012-10-19 12:38                 ` Peter Zijlstra
@ 2012-10-19 15:32                   ` Mikulas Patocka
  2012-10-19 17:40                     ` Peter Zijlstra
  2012-10-19 17:49                     ` Oleg Nesterov
  0 siblings, 2 replies; 103+ messages in thread
From: Mikulas Patocka @ 2012-10-19 15:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Paul E. McKenney, Linus Torvalds, Ingo Molnar,
	Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov,
	linux-kernel, Thomas Gleixner



On Fri, 19 Oct 2012, Peter Zijlstra wrote:

> On Thu, 2012-10-18 at 15:28 -0400, Mikulas Patocka wrote:
> > 
> > On Thu, 18 Oct 2012, Oleg Nesterov wrote:
> > 
> > > Ooooh. And I just noticed include/linux/percpu-rwsem.h which does
> > > something similar. Certainly it was not in my tree when I started
> > > this patch... percpu_down_write() doesn't allow multiple writers,
> > > but the main problem is that it uses msleep(1). It should not, I think.
> > 
> > synchronize_rcu() can sleep for a hundred milliseconds, so msleep(1) is not
> > a big problem.
> 
> That code is beyond ugly though.. it should really not have been merged.
> 
> There's absolutely no reason for it to use RCU except to make it more

So if you can do an alternative implementation without RCU, show it. The 
goal is - there should be no LOCK instructions on the read path and as 
few barriers as possible.

> complicated. And as Oleg pointed out, that msleep() is very ill
> considered.
> 
> The very worst part of it seems to be that nobody who's usually involved
> with locking primitives was ever consulted (Linus, PaulMck, Oleg, Ingo,
> tglx, dhowells and me). It doesn't even have lockdep annotations :/
> 
> So the only reason you appear to use RCU is because you don't actually
> have a sane way to wait for count==0. And I'm contesting rcu_sync() is
> sane here -- for the very simple reason you still need while (count)
> loop right after it.
> 
> So it appears you want a totally reader-biased, sleepable rw-lock like
> thing?

Yes.

> So did you consider keeping the inc/dec on the same per-cpu variable?
> Yes this adds a potential remote access to dec and requires you to use
> atomics, but I would not be surprised if the inc/dec were mostly on the
> same cpu most of the times -- which might be plenty fast for what you
> want.

Yes, I tried this approach - it involves doing LOCK instruction on read 
lock, remembering the cpu and doing another LOCK instruction on read 
unlock (which will hopefully be on the same CPU, so no cacheline bouncing 
happens in the common case). It was slower than the approach without any 
LOCK instructions (43.3 seconds for the implementation with
per-cpu LOCKed access, 42.7 seconds for this implementation without atomic 
instruction; the benchmark involved doing 512-byte direct-io reads and 
writes on a ramdisk with 8 processes on 8-core machine).

> If you've got coherent per-cpu counts, you can better do the
> waitqueue/wake condition for write_down.

synchronize_rcu() is way slower than msleep(1) - so I don't see a reason 
why it should be complicated to avoid msleep(1).

> It might also make sense to do away with the mutex, there's no point in
> serializing the wakeups in the p->locked case of down_read.

The advantage of a mutex is that it is already protected against 
starvation. If I replace the mutex with a wait queue and retry, there is 
no starvation protection.

> Furthermore,
> p->locked seems a complete duplicate of the mutex state, so removing the
> mutex also removes that duplication.

We could replace if (p->locked) with if (mutex_is_locked(p->mtx))

> Also, that CONFIG_x86 thing.. *shudder*...

Mikulas


* Re: [PATCH 1/2] brw_mutex: big read-write mutex
  2012-10-19 15:32                   ` Mikulas Patocka
@ 2012-10-19 17:40                     ` Peter Zijlstra
  2012-10-19 17:57                       ` Oleg Nesterov
  2012-10-19 22:54                       ` Mikulas Patocka
  2012-10-19 17:49                     ` Oleg Nesterov
  1 sibling, 2 replies; 103+ messages in thread
From: Peter Zijlstra @ 2012-10-19 17:40 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Oleg Nesterov, Paul E. McKenney, Linus Torvalds, Ingo Molnar,
	Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov,
	linux-kernel, Thomas Gleixner

On Fri, 2012-10-19 at 11:32 -0400, Mikulas Patocka wrote:

> So if you can do an alternative implementation without RCU, show it.

Uhm,,. no that's not how it works. You just don't push through crap like
this and then demand someone else does it better.

But using preempt_{disable,enable} and using synchronize_sched() would
be better (for PREEMPT_RCU) although it wouldn't fix anything
fundamental.

>  The 
> goal is - there should be no LOCK instructions on the read path and as 
> few barriers as possible.

Fine goal, although somewhat arch specific. Also note that there's a
relation between atomics and memory barriers, one isn't necessarily
worse than the other, they all require synchronization of sorts.  

> > So did you consider keeping the inc/dec on the same per-cpu variable?
> > Yes this adds a potential remote access to dec and requires you to use
> > atomics, but I would not be surprised if the inc/dec were mostly on the
> > same cpu most of the times -- which might be plenty fast for what you
> > want.
> 
> Yes, I tried this approach - it involves doing LOCK instruction on read 
> lock, remembering the cpu and doing another LOCK instruction on read 
> unlock (which will hopefully be on the same CPU, so no cacheline bouncing 
> happens in the common case). It was slower than the approach without any 
> LOCK instructions (43.3 seconds for the implementation with 
> per-cpu LOCKed access, 42.7 seconds for this implementation without atomic 
> instruction; the benchmark involved doing 512-byte direct-io reads and 
> writes on a ramdisk with 8 processes on 8-core machine).

So why is that a problem? Surely that's already tons better than what
you've currently got. Also uncontended LOCK is something all x86 vendors
keep optimizing, they'll have to if they want to keep adding CPUs.

> > If you've got coherent per-cpu counts, you can better do the
> > waitqueue/wake condition for write_down.
> 
> synchronize_rcu() is way slower than msleep(1) - so I don't see a reason 
> why it should be complicated to avoid msleep(1).

It's not about being slow, a polling write side is just fscking ugly. Also, if
you're already polling, that *_sync() is bloody pointless.

> > It might also make sense to do away with the mutex, there's no point in
> > serializing the wakeups in the p->locked case of down_read.
> 
> The advantage of a mutex is that it is already protected against 
> starvation. If I replace the mutex with a wait queue and retry, there is 
> no starvation protection.

Which starvation? writer-writer order? What stops you from adding a list
there yourself? Also, writers had better be rare for this thing, so who
gives a crap?

> > Furthermore,
> > p->locked seems a complete duplicate of the mutex state, so removing the
> > mutex also removes that duplication.
> 
> We could replace if (p->locked) with if (mutex_is_locked(&p->mtx))

Quite so.. 

You're also still lacking lockdep annotations...

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/2] brw_mutex: big read-write mutex
  2012-10-19 15:32                   ` Mikulas Patocka
  2012-10-19 17:40                     ` Peter Zijlstra
@ 2012-10-19 17:49                     ` Oleg Nesterov
  2012-10-22 23:09                       ` Mikulas Patocka
  1 sibling, 1 reply; 103+ messages in thread
From: Oleg Nesterov @ 2012-10-19 17:49 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Peter Zijlstra, Paul E. McKenney, Linus Torvalds, Ingo Molnar,
	Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov,
	linux-kernel, Thomas Gleixner

On 10/19, Mikulas Patocka wrote:
>
> synchronize_rcu() is way slower than msleep(1) -

This depends, I guess. But this doesn't matter,

> so I don't see a reason
> why it should be complicated to avoid msleep(1).

I don't think this really needs complications. Please look at this
patch for example. Or initial (single writer) version below. It is
not finished and lacks the barriers too, but I do not think it is
more complex.

Oleg.

struct brw_sem {
	long __percpu		*read_ctr;
	wait_queue_head_t	read_waitq;
	struct mutex		writer_mutex;
	struct task_struct	*writer;
};

int brw_init(struct brw_sem *brw)
{
	brw->writer = NULL;
	mutex_init(&brw->writer_mutex);
	init_waitqueue_head(&brw->read_waitq);
	brw->read_ctr = alloc_percpu(long);
	return brw->read_ctr ? 0 : -ENOMEM;
}

void brw_down_read(struct brw_sem *brw)
{
	for (;;) {
		bool done = false;

		preempt_disable();
		if (likely(!brw->writer)) {
			__this_cpu_inc(*brw->read_ctr);
			done = true;
		}
		preempt_enable();

		if (likely(done))
			break;

		__wait_event(brw->read_waitq, !brw->writer);
	}
}

void brw_up_read(struct brw_sem *brw)
{
	struct task_struct *writer;

	preempt_disable();
	__this_cpu_dec(*brw->read_ctr);
	writer = ACCESS_ONCE(brw->writer);
	if (unlikely(writer))
		wake_up_process(writer);
	preempt_enable();
}

static inline long brw_read_ctr(struct brw_sem *brw)
{
	long sum = 0;
	int cpu;

	for_each_possible_cpu(cpu)
		sum += per_cpu(*brw->read_ctr, cpu);

	return sum;
}

void brw_down_write(struct brw_sem *brw)
{
	mutex_lock(&brw->writer_mutex);
	brw->writer = current;
	synchronize_sched();
	/*
	 * Thereafter brw_*_read() must see ->writer != NULL,
	 * and we should see the result of __this_cpu_inc().
	 */
	for (;;) {
		set_current_state(TASK_UNINTERRUPTIBLE);
		if (brw_read_ctr(brw) == 0)
			break;
		schedule();
	}
	__set_current_state(TASK_RUNNING);
	/*
	 * We can add another synchronize_sched() to avoid the
	 * spurious wakeups from brw_up_read() after return.
	 */
}

void brw_up_write(struct brw_sem *brw)
{
	brw->writer = NULL;
	synchronize_sched();
	wake_up_all(&brw->read_waitq);
	mutex_unlock(&brw->writer_mutex);
}


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/2] brw_mutex: big read-write mutex
  2012-10-19 17:40                     ` Peter Zijlstra
@ 2012-10-19 17:57                       ` Oleg Nesterov
  2012-10-19 22:54                       ` Mikulas Patocka
  1 sibling, 0 replies; 103+ messages in thread
From: Oleg Nesterov @ 2012-10-19 17:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mikulas Patocka, Paul E. McKenney, Linus Torvalds, Ingo Molnar,
	Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov,
	linux-kernel, Thomas Gleixner

On 10/19, Peter Zijlstra wrote:
>
> But using preempt_{disable,enable} and using synchronize_sched() would
> be better (for PREEMPT_RCU) although it wouldn't fix anything
> fundamental.

BTW, I agree. I didn't even notice percpu-rwsem.h uses _rcu, not _sched.

> Fine goal, although somewhat arch specific. Also note that there's a
> relation between atomics and memory barriers, one isn't necessarily
> worse than the other, they all require synchronization of sorts.

As Paul pointed out, the fast path can avoid mb(). It is only needed
when "up_read" detects the writer.

Oleg.


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/2] brw_mutex: big read-write mutex
  2012-10-18 17:57             ` Oleg Nesterov
  2012-10-18 19:28               ` Mikulas Patocka
@ 2012-10-19 19:28               ` Paul E. McKenney
  2012-10-22 23:36                 ` [PATCH 0/2] fix and improvements for percpu-rw-semaphores (was: brw_mutex: big read-write mutex) Mikulas Patocka
  1 sibling, 1 reply; 103+ messages in thread
From: Paul E. McKenney @ 2012-10-19 19:28 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Mikulas Patocka, Linus Torvalds, Ingo Molnar, Peter Zijlstra,
	Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov,
	linux-kernel

On Thu, Oct 18, 2012 at 07:57:47PM +0200, Oleg Nesterov wrote:
> On 10/18, Paul E. McKenney wrote:
> >
> > On Thu, Oct 18, 2012 at 06:24:09PM +0200, Oleg Nesterov wrote:
> > >
> > > I thought that you meant that without mb() brw_start_write() can
> > > race with brw_end_read() and hang forever.
> > >
> > > But probably you meant that we need the barriers to ensure that,
> > > say, if the reader does
> > >
> > > 	brw_start_read();
> > > 	CONDITION = 1;
> > > 	brw_end_read();
> > >
> > > then the writer must see CONDITION != 0 after brw_start_write() ?
> > > (or vice-versa)
> >
> > Yes, this is exactly my concern.
> 
> Oh, thanks at lot Paul (as always).

Glad it helped.  ;-)

> > > In this case we need the barrier, yes. Obviously brw_start_write()
> > > can return right after this_cpu_dec() and before wake_up_all().
> > >
> > > 2/2 doesn't need this guarantee but I agree, this doesn't look
> > > sane in general...
> >
> > Or name it something not containing "lock".  And clearly document
> > the behavior and how it is to be used.  ;-)
> 
> this would be insane, I guess ;)

Well, I suppose you could call it a "phase": brw_start_phase_1() and
so on.

> So. Ignoring the possible optimization you mentioned before,
> brw_end_read() should do:
> 
> 	smp_mb();
> 	this_cpu_dec();
> 
> 	wake_up_all();
> 
> And yes, we need the full mb(). wmb() is enough to ensure that the
> writer will see the memory modifications done by the reader. But we
> also need to ensure that any LOAD inside start_read/end_read can not
> be moved outside of the critical section.
> 
> But we should also ensure that "read" will see all modifications
> which were done under start_write/end_write. This means that
> brw_end_write() needs another synchronize_sched() before
> atomic_dec_and_test(), or brw_start_read() needs mb() in the
> fast-path.
> 
> Correct?

Good point, I missed the need for synchronize_sched() to avoid
readers sleeping through the next write cycle due to racing with
an exiting writer.  But yes, this sounds correct.

> Ooooh. And I just noticed include/linux/percpu-rwsem.h which does
> something similar. Certainly it was not in my tree when I started
> this patch... percpu_down_write() doesn't allow multiple writers,
> but the main problem it uses msleep(1). It should not, I think.
> 
> But. It seems that percpu_up_write() is equally wrong? Doesn't
> it need synchronize_rcu() before "p->locked = false" ?
> 
> (add Mikulas)

Mikulas said something about doing an updated patch, so I figured I
would look at his next version.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/2] brw_mutex: big read-write mutex
  2012-10-19 17:40                     ` Peter Zijlstra
  2012-10-19 17:57                       ` Oleg Nesterov
@ 2012-10-19 22:54                       ` Mikulas Patocka
  2012-10-24  3:08                         ` Dave Chinner
  1 sibling, 1 reply; 103+ messages in thread
From: Mikulas Patocka @ 2012-10-19 22:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Paul E. McKenney, Linus Torvalds, Ingo Molnar,
	Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov,
	linux-kernel, Thomas Gleixner



On Fri, 19 Oct 2012, Peter Zijlstra wrote:

> > Yes, I tried this approach - it involves doing LOCK instruction on read 
> > lock, remembering the cpu and doing another LOCK instruction on read 
> > unlock (which will hopefully be on the same CPU, so no cacheline bouncing 
> > happens in the common case). It was slower than the approach without any 
> > LOCK instructions (43.3 seconds for the implementation with 
> > per-cpu LOCKed access, 42.7 seconds for this implementation without atomic 
> > instruction; the benchmark involved doing 512-byte direct-io reads and 
> > writes on a ramdisk with 8 processes on 8-core machine).
> 
> So why is that a problem? Surely that's already tons better than what
> you've currently got.

Percpu rw-semaphores do not improve performance at all. I put them there 
to avoid performance regression, not to improve performance.

All Linux kernels have a race condition - when you change block size of a 
block device and you read or write the device at the same time, a crash 
may happen. This bug has been there forever. Recently, this bug started to 
cause major trouble - multiple high-profile business sites report crashes 
because of this race condition.

You can fix this race by using a read lock around I/O paths and write lock 
around block size changing, but normal rw semaphore cause cache line 
bouncing when taken for read by multiple processors and I/O performance 
degradation because of it is measurable.

So I put this percpu-rw-semaphore there to fix the crashes and minimize 
performance impact - on x86 it doesn't take any interlocked instructions 
in the read path.
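
The usage pattern is roughly this (a sketch with made-up names, not the
actual block layer patch):

struct my_bdev {
	unsigned int			block_size;
	struct percpu_rw_semaphore	bs_sem;
};

static void my_bdev_do_io(struct my_bdev *b, void *buf, size_t len, loff_t pos)
{
	percpu_down_read(&b->bs_sem);	/* hot path: no LOCKed instruction on x86 */
	/* ... do the actual read/write; b->block_size cannot change here ... */
	percpu_up_read(&b->bs_sem);
}

static void my_bdev_set_block_size(struct my_bdev *b, unsigned int size)
{
	percpu_down_write(&b->bs_sem);	/* slow, but block size changes are rare */
	b->block_size = size;
	percpu_up_write(&b->bs_sem);
}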

I don't quite understand why people are opposing this and what they 
want to do instead. If you pull percpu-rw-semaphores out of the kernel, 
you introduce a performance regression (raw device i/o will be slower on 
3.7 than on 3.6, because on 3.6 it doesn't take any lock at all and on 3.7 
it takes a read lock).

So you have options:
1) don't lock i/o just like on 3.6 and previous versions - you get a fast 
	kernel that randomly crashes
2) lock i/o with normal rw semaphore - you get a kernel that doesn't 
	crash, but that is slower than previous versions
3) lock i/o with percpu rw semaphore - you get kernel that is almost as 
	fast as previous kernels and that doesn't crash

For the users, option 3) is the best. Users don't care whether it 
looks ugly or not, they care about correctness and performance, that's 
all.

Obviously, you can improve the rw semaphores by adding lockdep annotations, or 
by other things (turning rcu_read_lock/synchronize_rcu into 
preempt_disable/synchronize_sched, using barrier()-synchronize_sched() 
instead of smp_mb()...), but I don't see a reason why you would want to hurt 
users' experience by pulling it out, reverting to state 1) or 2) and 
then, two kernel cycles later, coming up with percpu-rw-semaphores again.

Mikulas

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/2] brw_mutex: big read-write mutex
  2012-10-19 17:49                     ` Oleg Nesterov
@ 2012-10-22 23:09                       ` Mikulas Patocka
  2012-10-23 15:12                         ` Oleg Nesterov
  0 siblings, 1 reply; 103+ messages in thread
From: Mikulas Patocka @ 2012-10-22 23:09 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, Paul E. McKenney, Linus Torvalds, Ingo Molnar,
	Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov,
	linux-kernel, Thomas Gleixner



On Fri, 19 Oct 2012, Oleg Nesterov wrote:

> On 10/19, Mikulas Patocka wrote:
> >
> > synchronize_rcu() is way slower than msleep(1) -
> 
> This depends, I guess. But this doesn't matter,
> 
> > so I don't see a reason
> > why it should be complicated to avoid msleep(1).
> 
> I don't think this really needs complications. Please look at this
> patch for example. Or initial (single writer) version below. It is
> not finished and lacks the barriers too, but I do not think it is
> more complex.

Hi

My implementation has a smaller structure (it doesn't have 
wait_queue_head_t).

Using preempt_disable()/synchronize_sched() instead of RCU seems like a 
good idea. Here, the locked region is so small that it doesn't make sense 
to play tricks with preemptible RCU.

Your implementation is prone to starvation - if the writer has a high 
priority and if it is doing back-to-back write unlocks/locks, it may 
happen that the readers have no chance to run.

The use of mutex instead of a wait queue in my implementation is unusual, 
but I don't see anything wrong with it - it makes the structure smaller 
and it solves the starvation problem (which would otherwise be complicated 
to solve).

Mikulas

> Oleg.
> 
> struct brw_sem {
> 	long __percpu		*read_ctr;
> 	wait_queue_head_t	read_waitq;
> 	struct mutex		writer_mutex;
> 	struct task_struct	*writer;
> };
> 
> int brw_init(struct brw_sem *brw)
> {
> 	brw->writer = NULL;
> 	mutex_init(&brw->writer_mutex);
> 	init_waitqueue_head(&brw->read_waitq);
> 	brw->read_ctr = alloc_percpu(long);
> 	return brw->read_ctr ? 0 : -ENOMEM;
> }
> 
> void brw_down_read(struct brw_sem *brw)
> {
> 	for (;;) {
> 		bool done = false;
> 
> 		preempt_disable();
> 		if (likely(!brw->writer)) {
> 			__this_cpu_inc(*brw->read_ctr);
> 			done = true;
> 		}
> 		preempt_enable();
> 
> 		if (likely(done))
> 			break;
> 
> 		__wait_event(brw->read_waitq, !brw->writer);
> 	}
> }
> 
> void brw_up_read(struct brw_sem *brw)
> {
> 	struct task_struct *writer;
> 
> 	preempt_disable();
> 	__this_cpu_dec(*brw->read_ctr);
> 	writer = ACCESS_ONCE(brw->writer);
> 	if (unlikely(writer))
> 		wake_up_process(writer);
> 	preempt_enable();
> }
> 
> static inline long brw_read_ctr(struct brw_sem *brw)
> {
> 	long sum = 0;
> 	int cpu;
> 
> 	for_each_possible_cpu(cpu)
> 		sum += per_cpu(*brw->read_ctr, cpu);

Integer overflow on signed types is undefined - you should use unsigned 
long. (You can use the -fwrapv option to gcc to make signed overflow 
defined, but Linux doesn't use it.)

> 
> 	return sum;
> }
> 
> void brw_down_write(struct brw_sem *brw)
> {
> 	mutex_lock(&brw->writer_mutex);
> 	brw->writer = current;
> 	synchronize_sched();
> 	/*
> 	 * Thereafter brw_*_read() must see ->writer != NULL,
> 	 * and we should see the result of __this_cpu_inc().
> 	 */
> 	for (;;) {
> 		set_current_state(TASK_UNINTERRUPTIBLE);
> 		if (brw_read_ctr(brw) == 0)
> 			break;
> 		schedule();
> 	}
> 	__set_current_state(TASK_RUNNING);
> 	/*
> 	 * We can add another synchronize_sched() to avoid the
> 	 * spurious wakeups from brw_up_read() after return.
> 	 */
> }
> 
> void brw_up_write(struct brw_sem *brw)
> {
> 	brw->writer = NULL;
> 	synchronize_sched();

That synchronize_sched() should be put before brw->writer = NULL. As written, 
this is incorrect, because the brw->writer = NULL store may be reordered with 
previous writes done by this process, so the other CPU may see brw->writer == 
NULL (and think that the lock is unlocked) while it doesn't yet see the 
previous writes done by the writer.

I had this bug in my implementation too.

> 	wake_up_all(&brw->read_waitq);
> 	mutex_unlock(&brw->writer_mutex);
> }
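
I.e. something like this (just a sketch of the reordering, not a complete
fix for the missing barriers):

void brw_up_write(struct brw_sem *brw)
{
	/*
	 * Make the writes done in the write-side critical section visible
	 * before any reader can observe ->writer == NULL.
	 */
	synchronize_sched();
	brw->writer = NULL;
	wake_up_all(&brw->read_waitq);
	mutex_unlock(&brw->writer_mutex);
}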

Mikulas

^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH 0/2] fix and improvements for percpu-rw-semaphores (was: brw_mutex: big read-write mutex)
  2012-10-19 19:28               ` Paul E. McKenney
@ 2012-10-22 23:36                 ` Mikulas Patocka
  2012-10-22 23:37                   ` [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers Mikulas Patocka
  2012-10-30 18:48                   ` [PATCH 0/2] fix and improvements for percpu-rw-semaphores (was: brw_mutex: big read-write mutex) Oleg Nesterov
  0 siblings, 2 replies; 103+ messages in thread
From: Mikulas Patocka @ 2012-10-22 23:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Ingo Molnar, Peter Zijlstra, Srikar Dronamraju,
	Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel,
	Paul E. McKenney

> > Ooooh. And I just noticed include/linux/percpu-rwsem.h which does
> > something similar. Certainly it was not in my tree when I started
> > this patch... percpu_down_write() doesn't allow multiple writers,
> > but the main problem it uses msleep(1). It should not, I think.
> > 
> > But. It seems that percpu_up_write() is equally wrong? Doesn't
> > it need synchronize_rcu() before "p->locked = false" ?
> > 
> > (add Mikulas)
> 
> Mikulas said something about doing an updated patch, so I figured I
> would look at his next version.
> 
> 							Thanx, Paul

The best ideas proposed in this thread are:

Using heavy/light barriers by Lai Jiangshan. This fixes the missing barrier 
bug, removes the ugly test "#if defined(X86) ..." and makes the read path 
use no barrier instruction on all architectures.

Instead of rcu_read_lock, we can use rcu_read_lock_sched (or 
preempt_disable) - the resulting code is smaller. The critical section is 
so small that there is no problem disabling preemption.

I am sending these two patches. Linus, please apply them if there are no 
objections.

Mikulas

^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers
  2012-10-22 23:36                 ` [PATCH 0/2] fix and improvements for percpu-rw-semaphores (was: brw_mutex: big read-write mutex) Mikulas Patocka
@ 2012-10-22 23:37                   ` Mikulas Patocka
  2012-10-22 23:39                     ` [PATCH 2/2] percpu-rw-semaphores: use rcu_read_lock_sched Mikulas Patocka
                                       ` (2 more replies)
  2012-10-30 18:48                   ` [PATCH 0/2] fix and improvements for percpu-rw-semaphores (was: brw_mutex: big read-write mutex) Oleg Nesterov
  1 sibling, 3 replies; 103+ messages in thread
From: Mikulas Patocka @ 2012-10-22 23:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Ingo Molnar, Peter Zijlstra, Srikar Dronamraju,
	Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel,
	Paul E. McKenney

This patch introduces new barrier pair light_mb() and heavy_mb() for
percpu rw semaphores.

This patch fixes a bug in percpu-rw-semaphores where a barrier was
missing in percpu_up_write.

This patch improves performance on the read path of
percpu-rw-semaphores: on non-x86 cpus, there was a smp_mb() in
percpu_up_read. This patch changes it to a compiler barrier and removes
the "#if defined(X86) ..." condition.

From: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 include/linux/percpu-rwsem.h |   20 +++++++-------------
 1 file changed, 7 insertions(+), 13 deletions(-)

Index: linux-3.6.3-fast/include/linux/percpu-rwsem.h
===================================================================
--- linux-3.6.3-fast.orig/include/linux/percpu-rwsem.h	2012-10-22 23:37:57.000000000 +0200
+++ linux-3.6.3-fast/include/linux/percpu-rwsem.h	2012-10-23 01:21:23.000000000 +0200
@@ -12,6 +12,9 @@ struct percpu_rw_semaphore {
 	struct mutex mtx;
 };
 
+#define light_mb()	barrier()
+#define heavy_mb()	synchronize_sched()
+
 static inline void percpu_down_read(struct percpu_rw_semaphore *p)
 {
 	rcu_read_lock();
@@ -24,22 +27,12 @@ static inline void percpu_down_read(stru
 	}
 	this_cpu_inc(*p->counters);
 	rcu_read_unlock();
+	light_mb(); /* A, between read of p->locked and read of data, paired with D */
 }
 
 static inline void percpu_up_read(struct percpu_rw_semaphore *p)
 {
-	/*
-	 * On X86, write operation in this_cpu_dec serves as a memory unlock
-	 * barrier (i.e. memory accesses may be moved before the write, but
-	 * no memory accesses are moved past the write).
-	 * On other architectures this may not be the case, so we need smp_mb()
-	 * there.
-	 */
-#if defined(CONFIG_X86) && (!defined(CONFIG_X86_PPRO_FENCE) && !defined(CONFIG_X86_OOSTORE))
-	barrier();
-#else
-	smp_mb();
-#endif
+	light_mb(); /* B, between read of the data and write to p->counter, paired with C */
 	this_cpu_dec(*p->counters);
 }
 
@@ -61,11 +54,12 @@ static inline void percpu_down_write(str
 	synchronize_rcu();
 	while (__percpu_count(p->counters))
 		msleep(1);
-	smp_rmb(); /* paired with smp_mb() in percpu_sem_up_read() */
+	heavy_mb(); /* C, between read of p->counter and write to data, paired with B */
 }
 
 static inline void percpu_up_write(struct percpu_rw_semaphore *p)
 {
+	heavy_mb(); /* D, between write to data and write to p->locked, paired with A */
 	p->locked = false;
 	mutex_unlock(&p->mtx);
 }


^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH 2/2] percpu-rw-semaphores: use rcu_read_lock_sched
  2012-10-22 23:37                   ` [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers Mikulas Patocka
@ 2012-10-22 23:39                     ` Mikulas Patocka
  2012-10-24 16:16                       ` Paul E. McKenney
  2012-10-23 16:59                     ` [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers Oleg Nesterov
  2012-10-23 20:32                     ` Peter Zijlstra
  2 siblings, 1 reply; 103+ messages in thread
From: Mikulas Patocka @ 2012-10-22 23:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Ingo Molnar, Peter Zijlstra, Srikar Dronamraju,
	Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel,
	Paul E. McKenney

Use rcu_read_lock_sched / rcu_read_unlock_sched / synchronize_sched
instead of rcu_read_lock / rcu_read_unlock / synchronize_rcu.

This is an optimization. The RCU-protected region is very small, so
there will be no latency problems if we disable preemption in this region.

So we use rcu_read_lock_sched / rcu_read_unlock_sched that translates
to preempt_disable / preempt_enable. It is smaller (and supposedly
faster) than preemptible rcu_read_lock / rcu_read_unlock.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 include/linux/percpu-rwsem.h |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

Index: linux-3.6.3-fast/include/linux/percpu-rwsem.h
===================================================================
--- linux-3.6.3-fast.orig/include/linux/percpu-rwsem.h	2012-10-23 01:21:49.000000000 +0200
+++ linux-3.6.3-fast/include/linux/percpu-rwsem.h	2012-10-23 01:36:23.000000000 +0200
@@ -17,16 +17,16 @@ struct percpu_rw_semaphore {
 
 static inline void percpu_down_read(struct percpu_rw_semaphore *p)
 {
-	rcu_read_lock();
+	rcu_read_lock_sched();
 	if (unlikely(p->locked)) {
-		rcu_read_unlock();
+		rcu_read_unlock_sched();
 		mutex_lock(&p->mtx);
 		this_cpu_inc(*p->counters);
 		mutex_unlock(&p->mtx);
 		return;
 	}
 	this_cpu_inc(*p->counters);
-	rcu_read_unlock();
+	rcu_read_unlock_sched();
 	light_mb(); /* A, between read of p->locked and read of data, paired with D */
 }
 
@@ -51,7 +51,7 @@ static inline void percpu_down_write(str
 {
 	mutex_lock(&p->mtx);
 	p->locked = true;
-	synchronize_rcu();
+	synchronize_sched(); /* make sure that all readers exit the rcu_read_lock_sched region */
 	while (__percpu_count(p->counters))
 		msleep(1);
 	heavy_mb(); /* C, between read of p->counter and write to data, paired with B */


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/2] brw_mutex: big read-write mutex
  2012-10-22 23:09                       ` Mikulas Patocka
@ 2012-10-23 15:12                         ` Oleg Nesterov
  0 siblings, 0 replies; 103+ messages in thread
From: Oleg Nesterov @ 2012-10-23 15:12 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Peter Zijlstra, Paul E. McKenney, Linus Torvalds, Ingo Molnar,
	Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov,
	linux-kernel, Thomas Gleixner

Hi Mikulas,

On 10/22, Mikulas Patocka wrote:
>
> On Fri, 19 Oct 2012, Oleg Nesterov wrote:
>
> > On 10/19, Mikulas Patocka wrote:
> > >
> > > synchronize_rcu() is way slower than msleep(1) -
> >
> > This depends, I guess. But this doesn't matter,
> >
> > > so I don't see a reason
> > > why it should be complicated to avoid msleep(1).
> >
> > I don't think this really needs complications. Please look at this
> > patch for example. Or initial (single writer) version below. It is
> > not finished and lacks the barriers too, but I do not think it is
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

please note the comment above ;)

> > more complex.
>
> Hi
>
> My implementation has a smaller structure (it doesn't have
> wait_queue_head_t).

Oh, I don't think sizeof() really matters in this case.

> Your implementation is prone to starvation - if the writer has a high
> priority and if it is doing back-to-back write unlocks/locks, it may
> happen that the readers have no chance to run.

Yes, it is write-biased, this was the intent. Writers should be rare.

> The use of mutex instead of a wait queue in my implementation is unusual,
> but I don't see anything wrong with it

Neither me.

Mikulas, apart from the _rcu/_sched change, my only point was that msleep()
can (and imho should) be avoided.

> > static inline long brw_read_ctr(struct brw_sem *brw)
> > {
> > 	long sum = 0;
> > 	int cpu;
> >
> > 	for_each_possible_cpu(cpu)
> > 		sum += per_cpu(*brw->read_ctr, cpu);
>
> Integer overflow on signed types is undefined - you should use unsigned
> long - you can use -fwrapv option to gcc to make signed overflow defined,
> but Linux doesn't use it.

I don't think -fwrapv can make any difference in this case, but I agree
that "unsigned long" makes more sense.

> > void brw_up_write(struct brw_sem *brw)
> > {
> > 	brw->writer = NULL;
> > 	synchronize_sched();
>
> That synchronize_sched should be put before brw->writer = NULL.

Yes, I know. I mentioned this at the start, this lacks the necessary
barrier between this writer and the next reader.

> I had this bug in my implementation too.

Yes, exactly. And this is why I cc'ed you initially ;)

Oleg.


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers
  2012-10-22 23:37                   ` [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers Mikulas Patocka
  2012-10-22 23:39                     ` [PATCH 2/2] percpu-rw-semaphores: use rcu_read_lock_sched Mikulas Patocka
@ 2012-10-23 16:59                     ` Oleg Nesterov
  2012-10-23 18:05                       ` Paul E. McKenney
  2012-10-23 19:23                       ` Oleg Nesterov
  2012-10-23 20:32                     ` Peter Zijlstra
  2 siblings, 2 replies; 103+ messages in thread
From: Oleg Nesterov @ 2012-10-23 16:59 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Linus Torvalds, Ingo Molnar, Peter Zijlstra, Srikar Dronamraju,
	Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel,
	Paul E. McKenney

Not really the comment, but the question...

On 10/22, Mikulas Patocka wrote:
>
>  static inline void percpu_down_read(struct percpu_rw_semaphore *p)
>  {
>  	rcu_read_lock();
> @@ -24,22 +27,12 @@ static inline void percpu_down_read(stru
>  	}
>  	this_cpu_inc(*p->counters);
>  	rcu_read_unlock();
> +	light_mb(); /* A, between read of p->locked and read of data, paired with D */
>  }

rcu_read_unlock() (or even preempt_enable) should have compiler barrier
semantics... But I agree, this adds more documentation for free.

>  static inline void percpu_up_read(struct percpu_rw_semaphore *p)
>  {
> -	/*
> -	 * On X86, write operation in this_cpu_dec serves as a memory unlock
> -	 * barrier (i.e. memory accesses may be moved before the write, but
> -	 * no memory accesses are moved past the write).
> -	 * On other architectures this may not be the case, so we need smp_mb()
> -	 * there.
> -	 */
> -#if defined(CONFIG_X86) && (!defined(CONFIG_X86_PPRO_FENCE) && !defined(CONFIG_X86_OOSTORE))
> -	barrier();
> -#else
> -	smp_mb();
> -#endif
> +	light_mb(); /* B, between read of the data and write to p->counter, paired with C */
>  	this_cpu_dec(*p->counters);
>  }
>
> @@ -61,11 +54,12 @@ static inline void percpu_down_write(str
>  	synchronize_rcu();
>  	while (__percpu_count(p->counters))
>  		msleep(1);
> -	smp_rmb(); /* paired with smp_mb() in percpu_sem_up_read() */
> +	heavy_mb(); /* C, between read of p->counter and write to data, paired with B */

I _think_ this is correct.


Just I am wondering if this is strongly correct in theory, I would
really like to know what Paul thinks.

Ignoring the current implementation, according to the documentation
synchronize_sched() has all rights to return immediately if there is
no active rcu_read_lock_sched() section. If this were possible, then
percpu_up_read() lacks mb.

So _perhaps_ it makes sense to document that synchronize_sched() also
guarantees that all pending loads/stores on other CPUs should be
completed upon return? Or I misunderstood the patch?

Oleg.


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers
  2012-10-23 16:59                     ` [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers Oleg Nesterov
@ 2012-10-23 18:05                       ` Paul E. McKenney
  2012-10-23 18:27                         ` Oleg Nesterov
  2012-10-23 18:41                         ` Oleg Nesterov
  2012-10-23 19:23                       ` Oleg Nesterov
  1 sibling, 2 replies; 103+ messages in thread
From: Paul E. McKenney @ 2012-10-23 18:05 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Mikulas Patocka, Linus Torvalds, Ingo Molnar, Peter Zijlstra,
	Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov,
	linux-kernel

On Tue, Oct 23, 2012 at 06:59:12PM +0200, Oleg Nesterov wrote:
> Not really the comment, but the question...
> 
> On 10/22, Mikulas Patocka wrote:
> >
> >  static inline void percpu_down_read(struct percpu_rw_semaphore *p)
> >  {
> >  	rcu_read_lock();
> > @@ -24,22 +27,12 @@ static inline void percpu_down_read(stru
> >  	}
> >  	this_cpu_inc(*p->counters);
> >  	rcu_read_unlock();
> > +	light_mb(); /* A, between read of p->locked and read of data, paired with D */
> >  }
> 
> rcu_read_unlock() (or even preempt_enable) should have compiler barrier
> semantics... But I agree, this adds more documentation for free.

Although rcu_read_lock() does have compiler-barrier semantics if
CONFIG_PREEMPT=y, it does not for CONFIG_PREEMPT=n.  So the
light_mb() (which appears to be barrier()) is needed in that case.

> >  static inline void percpu_up_read(struct percpu_rw_semaphore *p)
> >  {
> > -	/*
> > -	 * On X86, write operation in this_cpu_dec serves as a memory unlock
> > -	 * barrier (i.e. memory accesses may be moved before the write, but
> > -	 * no memory accesses are moved past the write).
> > -	 * On other architectures this may not be the case, so we need smp_mb()
> > -	 * there.
> > -	 */
> > -#if defined(CONFIG_X86) && (!defined(CONFIG_X86_PPRO_FENCE) && !defined(CONFIG_X86_OOSTORE))
> > -	barrier();
> > -#else
> > -	smp_mb();
> > -#endif
> > +	light_mb(); /* B, between read of the data and write to p->counter, paired with C */
> >  	this_cpu_dec(*p->counters);
> >  }
> >
> > @@ -61,11 +54,12 @@ static inline void percpu_down_write(str
> >  	synchronize_rcu();
> >  	while (__percpu_count(p->counters))
> >  		msleep(1);
> > -	smp_rmb(); /* paired with smp_mb() in percpu_sem_up_read() */
> > +	heavy_mb(); /* C, between read of p->counter and write to data, paired with B */
> 
> I _think_ this is correct.
> 
> 
> Just I am wondering if this is strongly correct in theory, I would
> really like to know what Paul thinks.

I need to take a closer look.

> Ignoring the current implementation, according to the documentation
> synchronize_sched() has all rights to return immediately if there is
> no active rcu_read_lock_sched() section. If this were possible, then
> percpu_up_read() lacks mb.

Even if there happen to be no RCU-sched read-side critical sections
at the current instant, synchronize_sched() is required to make sure
that everyone agrees that whatever code is executed by the caller after
synchronize_sched() returns happens after any of the preceding RCU
read-side critical sections.

So, if we have this, with x==0 initially:

	Task 0					Task 1

						rcu_read_lock_sched();
						x = 1;
						rcu_read_unlock_sched();
	synchronize_sched();
	r1 = x;

Then the value of r1 had better be one.

Of course, the above code fragment is doing absolutely nothing to ensure
that the synchronize_sched() really does start after Task 1's very strange
RCU read-side critical section, but if things did happen in that order,
synchronize_sched() would be required to make this guarantee.

> So _perhaps_ it makes sense to document that synchronize_sched() also
> guarantees that all pending loads/stores on other CPUs should be
> completed upon return? Or I misunderstood the patch?

Good point.  The current documentation implies that it does make that
guarantee, but it would be good for it to be explicit.  Queued for 3.8
is the following addition:

 * Note that this guarantee implies a further memory-ordering guarantee.
 * On systems with more than one CPU, when synchronize_sched() returns,
 * each CPU is guaranteed to have executed a full memory barrier since
 * the end of its last RCU read-side critical section whose beginning
 * preceded the call to synchronize_sched().  Note that this guarantee
 * includes CPUs that are offline, idle, or executing in user mode, as
 * well as CPUs that are executing in the kernel.  Furthermore, if CPU A
 * invoked synchronize_sched(), which returned to its caller on CPU B,
 * then both CPU A and CPU B are guaranteed to have executed a full memory
 * barrier during the execution of synchronize_sched().

The full comment block now reads:

/**
 * synchronize_sched - wait until an rcu-sched grace period has elapsed.
 *
 * Control will return to the caller some time after a full rcu-sched
 * grace period has elapsed, in other words after all currently executing
 * rcu-sched read-side critical sections have completed.   These read-side
 * critical sections are delimited by rcu_read_lock_sched() and
 * rcu_read_unlock_sched(), and may be nested.  Note that preempt_disable(),
 * local_irq_disable(), and so on may be used in place of
 * rcu_read_lock_sched().
 *
 * This means that all preempt_disable code sequences, including NMI and
 * hardware-interrupt handlers, in progress on entry will have completed
 * before this primitive returns.  However, this does not guarantee that
 * softirq handlers will have completed, since in some kernels, these
 * handlers can run in process context, and can block.
 *
 * Note that this guarantee implies a further memory-ordering guarantee.
 * On systems with more than one CPU, when synchronize_sched() returns,
 * each CPU is guaranteed to have executed a full memory barrier since
 * the end of its last RCU read-side critical section whose beginning
 * preceded the call to synchronize_sched().  Note that this guarantee
 * includes CPUs that are offline, idle, or executing in user mode, as
 * well as CPUs that are executing in the kernel.  Furthermore, if CPU A
 * invoked synchronize_sched(), which returned to its caller on CPU B,
 * then both CPU A and CPU B are guaranteed to have executed a full memory
 * barrier during the execution of synchronize_sched().
 *
 * This primitive provides the guarantees made by the (now removed)
 * synchronize_kernel() API.  In contrast, synchronize_rcu() only
 * guarantees that rcu_read_lock() sections will have completed.
 * In "classic RCU", these two guarantees happen to be one and
 * the same, but can differ in realtime RCU implementations.
 */

If this wording looks good to you, I will apply it to the other
grace-period primitives as well.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers
  2012-10-23 18:05                       ` Paul E. McKenney
@ 2012-10-23 18:27                         ` Oleg Nesterov
  2012-10-23 18:41                         ` Oleg Nesterov
  1 sibling, 0 replies; 103+ messages in thread
From: Oleg Nesterov @ 2012-10-23 18:27 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Mikulas Patocka, Linus Torvalds, Ingo Molnar, Peter Zijlstra,
	Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov,
	linux-kernel

On 10/23, Paul E. McKenney wrote:
>
> On Tue, Oct 23, 2012 at 06:59:12PM +0200, Oleg Nesterov wrote:
> > Not really the comment, but the question...
> >
> > On 10/22, Mikulas Patocka wrote:
> > >
> > >  static inline void percpu_down_read(struct percpu_rw_semaphore *p)
> > >  {
> > >  	rcu_read_lock();
> > > @@ -24,22 +27,12 @@ static inline void percpu_down_read(stru
> > >  	}
> > >  	this_cpu_inc(*p->counters);
> > >  	rcu_read_unlock();
> > > +	light_mb(); /* A, between read of p->locked and read of data, paired with D */
> > >  }
> >
> > rcu_read_unlock() (or even preempt_enable) should have compiler barrier
> > semantics... But I agree, this adds more documentation for free.
>
> Although rcu_read_lock() does have compiler-barrier semantics if
> CONFIG_PREEMPT=y, it does not for CONFIG_PREEMPT=n.  So the
> light_mb() (which appears to be barrier()) is needed in that case.

Indeed, I missed this.

> > Ignoring the current implementation, according to the documentation
> > synchronize_sched() has all rights to return immediately if there is
> > no active rcu_read_lock_sched() section. If this were possible, then
> > percpu_up_read() lacks mb.
>
> Even if there happen to be no RCU-sched read-side critical sections
> at the current instant, synchronize_sched() is required to make sure
> that everyone agrees that whatever code is executed by the caller after
> synchronize_sched() returns happens after any of the preceding RCU
> read-side critical sections.
>
> So, if we have this, with x==0 initially:
>
> 	Task 0					Task 1
>
> 						rcu_read_lock_sched();
> 						x = 1;
> 						rcu_read_unlock_sched();
> 	synchronize_sched();
> 	r1 = x;
>
> Then the value of r1 had better be one.

Yes, yes, this too. ("active rcu_read_lock_sched() section" above
was confusing, I agree).

>  * Note that this guarantee implies a further memory-ordering guarantee.
>  * On systems with more than one CPU, when synchronize_sched() returns,
>  * each CPU is guaranteed to have executed a full memory barrier since
>  * the end of its last RCU read-side critical section whose beginning
>  * preceded the call to synchronize_sched().  Note that this guarantee
>  * includes CPUs that are offline, idle, or executing in user mode, as
>  * well as CPUs that are executing in the kernel.  Furthermore, if CPU A
>  * invoked synchronize_sched(), which returned to its caller on CPU B,
>  * then both CPU A and CPU B are guaranteed to have executed a full memory
>  * barrier during the execution of synchronize_sched().

Great!

Thanks Paul.

Oleg.


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers
  2012-10-23 18:05                       ` Paul E. McKenney
  2012-10-23 18:27                         ` Oleg Nesterov
@ 2012-10-23 18:41                         ` Oleg Nesterov
  2012-10-23 20:29                           ` Paul E. McKenney
  1 sibling, 1 reply; 103+ messages in thread
From: Oleg Nesterov @ 2012-10-23 18:41 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Mikulas Patocka, Linus Torvalds, Ingo Molnar, Peter Zijlstra,
	Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov,
	linux-kernel

On 10/23, Paul E. McKenney wrote:
>
>  * Note that this guarantee implies a further memory-ordering guarantee.
>  * On systems with more than one CPU, when synchronize_sched() returns,
>  * each CPU is guaranteed to have executed a full memory barrier since
>  * the end of its last RCU read-side critical section
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Ah wait... I misread this comment.

But this patch needs more? Or I misunderstood. There is no RCU unlock
in percpu_up_read().

IOW. Suppose the code does

	percpu_down_read();
	x = PROTECTED_BY_THIS_RW_SEM;
	percpu_up_read();

Without mb() the load above can be reordered with this_cpu_dec() in
percpu_up_read().

However, we do not care, provided we can guarantee that the next
percpu_down_write() can not return (iow, the next "write" section can
not start) until this load is complete.

And I _think_ that another synchronize_sched() in percpu_down_write()
added by this patch should work.

But, "since the end of its last  RCU read-side critical section"
does not look enough.

Or I misunderstood you/Mikulas/both ?

Oleg.


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers
  2012-10-23 16:59                     ` [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers Oleg Nesterov
  2012-10-23 18:05                       ` Paul E. McKenney
@ 2012-10-23 19:23                       ` Oleg Nesterov
  2012-10-23 20:45                         ` Peter Zijlstra
                                           ` (2 more replies)
  1 sibling, 3 replies; 103+ messages in thread
From: Oleg Nesterov @ 2012-10-23 19:23 UTC (permalink / raw)
  To: Mikulas Patocka, Peter Zijlstra, Paul E. McKenney
  Cc: Linus Torvalds, Ingo Molnar, Srikar Dronamraju,
	Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel

On 10/23, Oleg Nesterov wrote:
>
> Not really the comment, but the question...

Damn. And another question.

Mikulas, I am sorry for this (almost) off-topic noise. Let me repeat, just
in case, that I am not arguing with your patches.




So write_lock/write_unlock needs to call synchronize_sched() 3 times.
I am wondering if it makes any sense to try to make it a bit heavier
but faster.

What if we change the reader to use local_irq_disable/enable around
this_cpu_inc/dec (instead of rcu read lock)? I have to admit, I have
no idea how much cli/sti is slower compared to preempt_disable/enable.

Then the writer can use

	static void mb_ipi(void *arg)
	{
		smp_mb(); /* unneeded ? */
	}

	static void force_mb_on_each_cpu(void)
	{
		smp_mb();
		smp_call_function(mb_ipi, NULL, 1);
	}

to a) synchronise with irq_disable and b) to insert the necessary mb's.

Of course smp_call_function() means more work for each CPU, but
write_lock() should be rare...

This can also wake up the idle CPUs, but probably we can do
on_each_cpu_cond(cond_func => !idle_cpu). Perhaps cond_func() can
also return false if rcu_user_enter() was called...
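
IOW, force_mb_on_each_cpu() above would become something like this (just
a sketch, it ignores the rcu_user_enter() part):

	static bool mb_cond(int cpu, void *info)
	{
		return !idle_cpu(cpu);	/* skip the idle CPUs */
	}

	static void force_mb_on_each_cpu(void)
	{
		smp_mb();
		on_each_cpu_cond(mb_cond, mb_ipi, NULL, true, GFP_KERNEL);
	}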

Actually I was thinking about this from the very beginning, but I do
not feel this looks like a good idea. Still I'd like to ask what you
think.

Oleg.


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers
  2012-10-23 18:41                         ` Oleg Nesterov
@ 2012-10-23 20:29                           ` Paul E. McKenney
  2012-10-23 20:32                             ` Paul E. McKenney
  0 siblings, 1 reply; 103+ messages in thread
From: Paul E. McKenney @ 2012-10-23 20:29 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Mikulas Patocka, Linus Torvalds, Ingo Molnar, Peter Zijlstra,
	Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov,
	linux-kernel

On Tue, Oct 23, 2012 at 08:41:23PM +0200, Oleg Nesterov wrote:
> On 10/23, Paul E. McKenney wrote:
> >
> >  * Note that this guarantee implies a further memory-ordering guarantee.
> >  * On systems with more than one CPU, when synchronize_sched() returns,
> >  * each CPU is guaranteed to have executed a full memory barrier since
> >  * the end of its last RCU read-side critical section
>          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 
> Ah wait... I misread this comment.

And I miswrote it.  It should say "since the end of its last RCU-sched
read-side critical section."  So, for example, RCU-sched need not force
a CPU that is idle, offline, or (eventually) executing in user mode to
execute a memory barrier.  Fixed this.

> But this patch needs more? Or I misunderstood. There is no RCU unlock
> in percpu_up_read().
> 
> IOW. Suppose the code does
> 
> 	percpu_down_read();
> 	x = PROTECTED_BY_THIS_RW_SEM;
> 	percpu_up_read();
> 
> > Without mb() the load above can be reordered with this_cpu_dec() in
> percpu_up_read().
> 
> However, we do not care if we can guarantee that the next
> percpu_down_write() can not return (iow, the next "write" section can
> not start) until this load is complete.
> 
> And I _think_ that another synchronize_sched() in percpu_down_write()
> added by this patch should work.
> 
> But, "since the end of its last  RCU read-side critical section"
> does not look enough.
> 
> > Or I misunderstood you/Mikulas/both ?

I clearly need to look more carefully at Mikulas's code...

							Thanx, Paul


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers
  2012-10-22 23:37                   ` [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers Mikulas Patocka
  2012-10-22 23:39                     ` [PATCH 2/2] percpu-rw-semaphores: use rcu_read_lock_sched Mikulas Patocka
  2012-10-23 16:59                     ` [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers Oleg Nesterov
@ 2012-10-23 20:32                     ` Peter Zijlstra
  2 siblings, 0 replies; 103+ messages in thread
From: Peter Zijlstra @ 2012-10-23 20:32 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Linus Torvalds, Oleg Nesterov, Ingo Molnar, Srikar Dronamraju,
	Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel,
	Paul E. McKenney

On Mon, 2012-10-22 at 19:37 -0400, Mikulas Patocka wrote:
> -       /*
> -        * On X86, write operation in this_cpu_dec serves as a memory unlock
> -        * barrier (i.e. memory accesses may be moved before the write, but
> -        * no memory accesses are moved past the write).
> -        * On other architectures this may not be the case, so we need smp_mb()
> -        * there.
> -        */
> -#if defined(CONFIG_X86) && (!defined(CONFIG_X86_PPRO_FENCE) && !defined(CONFIG_X86_OOSTORE))
> -       barrier();
> -#else
> -       smp_mb();
> -#endif
> +       light_mb(); /* B, between read of the data and write to p->counter, paired with C */ 

If we're going to invent new primitives for this, shouldn't we call
this smp_unlock_barrier() or something? That at least has well-defined
semantics.



^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers
  2012-10-23 20:29                           ` Paul E. McKenney
@ 2012-10-23 20:32                             ` Paul E. McKenney
  2012-10-23 21:39                               ` Mikulas Patocka
  0 siblings, 1 reply; 103+ messages in thread
From: Paul E. McKenney @ 2012-10-23 20:32 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Mikulas Patocka, Linus Torvalds, Ingo Molnar, Peter Zijlstra,
	Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov,
	linux-kernel

On Tue, Oct 23, 2012 at 01:29:02PM -0700, Paul E. McKenney wrote:
> On Tue, Oct 23, 2012 at 08:41:23PM +0200, Oleg Nesterov wrote:
> > On 10/23, Paul E. McKenney wrote:
> > >
> > >  * Note that this guarantee implies a further memory-ordering guarantee.
> > >  * On systems with more than one CPU, when synchronize_sched() returns,
> > >  * each CPU is guaranteed to have executed a full memory barrier since
> > >  * the end of its last RCU read-side critical section
> >          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > 
> > Ah wait... I misread this comment.
> 
> And I miswrote it.  It should say "since the end of its last RCU-sched
> read-side critical section."  So, for example, RCU-sched need not force
> a CPU that is idle, offline, or (eventually) executing in user mode to
> execute a memory barrier.  Fixed this.

And I should hasten to add that for synchronize_sched(), disabling
preemption (including disabling irqs, further including NMI handlers)
acts as an RCU-sched read-side critical section.  (This is in the
comment header for synchronize_sched() up above my addition to it.)
	
							Thanx, Paul

> > But this patch needs more? Or I misunderstood. There is no RCU unlock
> > in percpu_up_read().
> > 
> > IOW. Suppose the code does
> > 
> > 	percpu_down_read();
> > 	x = PROTECTED_BY_THIS_RW_SEM;
> > 	percpu_up_read();
> > 
> > Without mb() the load above can be reordered with this_cpu_dec() in
> > percpu_up_read().
> > 
> > However, we do not care if we can guarantee that the next
> > percpu_down_write() can not return (iow, the next "write" section can
> > not start) until this load is complete.
> > 
> > And I _think_ that another synchronize_sched() in percpu_down_write()
> > added by this patch should work.
> > 
> > But, "since the end of its last  RCU read-side critical section"
> > does not look enough.
> > 
> > Or I misunderstood you/Mikulas/both ?
> 
> I clearly need to look more carefully at Mikulas's code...
> 
> 							Thanx, Paul


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers
  2012-10-23 19:23                       ` Oleg Nesterov
@ 2012-10-23 20:45                         ` Peter Zijlstra
  2012-10-23 20:57                         ` Peter Zijlstra
  2012-10-23 21:26                         ` Mikulas Patocka
  2 siblings, 0 replies; 103+ messages in thread
From: Peter Zijlstra @ 2012-10-23 20:45 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Mikulas Patocka, Paul E. McKenney, Linus Torvalds, Ingo Molnar,
	Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov,
	linux-kernel

On Tue, 2012-10-23 at 21:23 +0200, Oleg Nesterov wrote:
> I have to admit, I have
> no idea how much cli/sti is slower compared to preempt_disable/enable.
> 
A lot.. esp on stupid hardware (insert pentium-4 reference), but I think
it's more expensive on pretty much all hardware: preempt_disable() is
only a non-atomic cpu-local increment and a compiler barrier, and enable is
the same plus a single conditional.
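
Roughly, as a simplified model with CONFIG_PREEMPT=y (not the actual
kernel macros, those live in include/linux/preempt.h):

	#define my_preempt_disable()					\
	do {								\
		current_thread_info()->preempt_count++;	/* plain inc */	\
		barrier();			/* compiler barrier */	\
	} while (0)

	#define my_preempt_enable()					\
	do {								\
		barrier();						\
		if (!--current_thread_info()->preempt_count &&		\
		    need_resched())					\
			preempt_schedule();	/* the conditional */	\
	} while (0)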

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers
  2012-10-23 19:23                       ` Oleg Nesterov
  2012-10-23 20:45                         ` Peter Zijlstra
@ 2012-10-23 20:57                         ` Peter Zijlstra
  2012-10-24 15:11                           ` Oleg Nesterov
  2012-10-23 21:26                         ` Mikulas Patocka
  2 siblings, 1 reply; 103+ messages in thread
From: Peter Zijlstra @ 2012-10-23 20:57 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Mikulas Patocka, Paul E. McKenney, Linus Torvalds, Ingo Molnar,
	Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov,
	linux-kernel

On Tue, 2012-10-23 at 21:23 +0200, Oleg Nesterov wrote:
> 
>         static void mb_ipi(void *arg)
>         {
>                 smp_mb(); /* unneeded ? */
>         }
> 
>         static void force_mb_on_each_cpu(void)
>         {
>                 smp_mb();
>                 smp_call_function(mb_ipi, NULL, 1);
>         } 

You know we're spending an awful lot of time and effort to get rid of
such things, right? RT and HPC people absolutely hate these random IPI
things.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers
  2012-10-23 19:23                       ` Oleg Nesterov
  2012-10-23 20:45                         ` Peter Zijlstra
  2012-10-23 20:57                         ` Peter Zijlstra
@ 2012-10-23 21:26                         ` Mikulas Patocka
  2 siblings, 0 replies; 103+ messages in thread
From: Mikulas Patocka @ 2012-10-23 21:26 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, Paul E. McKenney, Linus Torvalds, Ingo Molnar,
	Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov,
	linux-kernel



On Tue, 23 Oct 2012, Oleg Nesterov wrote:

> On 10/23, Oleg Nesterov wrote:
> >
> > Not really the comment, but the question...
> 
> Damn. And another question.
> 
> Mikulas, I am sorry for this (almost) off-topic noise. Let me repeat
> just in case that I am not arguing with your patches.
> 
> 
> 
> 
> So write_lock/write_unlock needs to call synchronize_sched() 3 times.
> I am wondering if it makes any sense to try to make it a bit heavier
> but faster.
> 
> What if we change the reader to use local_irq_disable/enable around
> this_cpu_inc/dec (instead of rcu read lock)? I have to admit, I have
> no idea how much cli/sti is slower compared to preempt_disable/enable.
> 
> Then the writer can use
> 
> 	static void mb_ipi(void *arg)
> 	{
> 		smp_mb(); /* unneeded ? */
> 	}
> 
> 	static void force_mb_on_each_cpu(void)
> 	{
> 		smp_mb();
> 		smp_call_function(mb_ipi, NULL, 1);
> 	}
> 
> to a) synchronise with irq_disable and b) to insert the necessary mb's.
> 
> Of course smp_call_function() means more work for each CPU, but
> write_lock() should be rare...
> 
> This can also wake up the idle CPUs, but probably we can do
> on_each_cpu_cond(cond_func => !idle_cpu). Perhaps cond_func() can
> also return false if rcu_user_enter() was called...
> 
> Actually I was thinking about this from the very beginning, but I do
> not feel this looks like a good idea. Still I'd like to ask what you
> think.
> 
> Oleg.

I think that if we can avoid local_irq_disable/enable, we should just avoid 
it (and use barrier()-vs-synchronize_sched()).

Mikulas

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers
  2012-10-23 20:32                             ` Paul E. McKenney
@ 2012-10-23 21:39                               ` Mikulas Patocka
  2012-10-24 16:23                                 ` Paul E. McKenney
  0 siblings, 1 reply; 103+ messages in thread
From: Mikulas Patocka @ 2012-10-23 21:39 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Oleg Nesterov, Linus Torvalds, Ingo Molnar, Peter Zijlstra,
	Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov,
	linux-kernel



On Tue, 23 Oct 2012, Paul E. McKenney wrote:

> On Tue, Oct 23, 2012 at 01:29:02PM -0700, Paul E. McKenney wrote:
> > On Tue, Oct 23, 2012 at 08:41:23PM +0200, Oleg Nesterov wrote:
> > > On 10/23, Paul E. McKenney wrote:
> > > >
> > > >  * Note that this guarantee implies a further memory-ordering guarantee.
> > > >  * On systems with more than one CPU, when synchronize_sched() returns,
> > > >  * each CPU is guaranteed to have executed a full memory barrier since
> > > >  * the end of its last RCU read-side critical section
> > >          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > 
> > > Ah wait... I misread this comment.
> > 
> > And I miswrote it.  It should say "since the end of its last RCU-sched
> > read-side critical section."  So, for example, RCU-sched need not force
> > a CPU that is idle, offline, or (eventually) executing in user mode to
> > execute a memory barrier.  Fixed this.

Or you can write "each CPU that is executing kernel code is guaranteed 
to have executed a full memory barrier".

It would be consistent with the current implementation and it would make 
it possible to use barrier()-synchronize_sched() as biased memory barriers.

---

In percpu-rwlocks, CPU 1 executes

...make some writes in the critical section...
barrier();
this_cpu_dec(*p->counters);

and the CPU 2 executes

while (__percpu_count(p->counters))
	msleep(1);
synchronize_sched();

So, when CPU 2 finishes synchronize_sched(), we must make sure that
all writes done by CPU 1 are visible to CPU 2.

The current implementation fulfills this requirement, you can just add it 
to the specification so that whoever changes the implementation keeps it.

Mikulas

> And I should hasten to add that for synchronize_sched(), disabling
> preemption (including disabling irqs, further including NMI handlers)
> acts as an RCU-sched read-side critical section.  (This is in the
> comment header for synchronize_sched() up above my addition to it.)
> 	
> 							Thanx, Paul

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/2] brw_mutex: big read-write mutex
  2012-10-19 22:54                       ` Mikulas Patocka
@ 2012-10-24  3:08                         ` Dave Chinner
  2012-10-25 14:09                           ` Mikulas Patocka
  0 siblings, 1 reply; 103+ messages in thread
From: Dave Chinner @ 2012-10-24  3:08 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Peter Zijlstra, Oleg Nesterov, Paul E. McKenney, Linus Torvalds,
	Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli,
	Anton Arapov, linux-kernel, Thomas Gleixner

On Fri, Oct 19, 2012 at 06:54:41PM -0400, Mikulas Patocka wrote:
> 
> 
> On Fri, 19 Oct 2012, Peter Zijlstra wrote:
> 
> > > Yes, I tried this approach - it involves doing LOCK instruction on read 
> > > lock, remembering the cpu and doing another LOCK instruction on read 
> > > unlock (which will hopefully be on the same CPU, so no cacheline bouncing 
> > > happens in the common case). It was slower than the approach without any 
> > > LOCK instructions (43.3 seconds for the implementation with 
> > > per-cpu LOCKed access, 42.7 seconds for this implementation without atomic 
> > > instruction; the benchmark involved doing 512-byte direct-io reads and 
> > > writes on a ramdisk with 8 processes on 8-core machine).
> > 
> > So why is that a problem? Surely that's already tons better then what
> > you've currently got.
> 
> Percpu rw-semaphores do not improve performance at all. I put them there 
> to avoid performance regression, not to improve performance.
> 
> All Linux kernels have a race condition - when you change block size of a 
> block device and you read or write the device at the same time, a crash 
> may happen. This bug is there since ever. Recently, this bug started to 
> cause major trouble - multiple high profile business sites report crashes 
> because of this race condition.
>
> You can fix this race by using a read lock around I/O paths and write lock 
> around block size changing, but normal rw semaphore cause cache line 
> bouncing when taken for read by multiple processors and I/O performance 
> degradation because of it is measurable.

This doesn't sound like a new problem.  Hasn't this global access,
single modifier exclusion problem been solved before in the VFS?
e.g. mnt_want_write()/mnt_make_readonly()
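
FWIW, the general shape of that pattern - from memory, heavily simplified
and with made-up names, so don't read it as the actual VFS code - is a
per-cpu count on the hot path plus a global flag that the rare exclusion
path sets while it sums the counts:

struct pcpu_excl {
	int __percpu	*count;		/* per-cpu count of active "writers" */
	int		blocked;	/* set while exclusion is requested */
};

static int want_access(struct pcpu_excl *e)
{
	preempt_disable();
	this_cpu_inc(*e->count);
	smp_mb();	/* pairs with the barrier in make_exclusive() */
	if (ACCESS_ONCE(e->blocked)) {
		this_cpu_dec(*e->count);
		preempt_enable();
		return -EBUSY;	/* the real code waits and retries here */
	}
	preempt_enable();
	return 0;
}

static void drop_access(struct pcpu_excl *e)
{
	preempt_disable();
	this_cpu_dec(*e->count);
	preempt_enable();
}

static int make_exclusive(struct pcpu_excl *e)
{
	int cpu, sum = 0;

	ACCESS_ONCE(e->blocked) = 1;
	smp_mb();	/* either they see ->blocked, or we see their count */
	for_each_possible_cpu(cpu)
		sum += *per_cpu_ptr(e->count, cpu);
	if (sum) {
		ACCESS_ONCE(e->blocked) = 0;	/* someone got in first */
		return -EBUSY;
	}
	return 0;	/* exclusion holds until ->blocked is cleared */
}

So the fast path is two per-cpu ops and a flag test, and all the heavy
lifting is on the rare exclusion side - the same trade-off as the percpu
rw-semaphore being discussed here.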

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers
  2012-10-23 20:57                         ` Peter Zijlstra
@ 2012-10-24 15:11                           ` Oleg Nesterov
  0 siblings, 0 replies; 103+ messages in thread
From: Oleg Nesterov @ 2012-10-24 15:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mikulas Patocka, Paul E. McKenney, Linus Torvalds, Ingo Molnar,
	Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov,
	linux-kernel

On 10/23, Peter Zijlstra wrote:
>
> On Tue, 2012-10-23 at 21:23 +0200, Oleg Nesterov wrote:
> >
> >         static void mb_ipi(void *arg)
> >         {
> >                 smp_mb(); /* unneeded ? */
> >         }
> >
> >         static void force_mb_on_each_cpu(void)
> >         {
> >                 smp_mb();
> >                 smp_call_function(mb_ipi, NULL, 1);
> >         }
>
> You know we're spending an awful lot of time and effort to get rid of
> such things, right? RT and HPC people absolutely hate these random IPI
> things.

No I do not know ;) but I am not surprised.

And,

> > I have to admit, I have
> > no idea how much cli/sti is slower compared to preempt_disable/enable.
> >
> A lot.. esp on stupid hardware (insert pentium-4 reference), but I think
> its more expensive for pretty much all hardware,

Thanks Peter, this alone answers my question.

Oleg.


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 2/2] percpu-rw-semaphores: use rcu_read_lock_sched
  2012-10-22 23:39                     ` [PATCH 2/2] percpu-rw-semaphores: use rcu_read_lock_sched Mikulas Patocka
@ 2012-10-24 16:16                       ` Paul E. McKenney
  2012-10-24 17:18                         ` Oleg Nesterov
  2012-10-25 14:54                         ` Mikulas Patocka
  0 siblings, 2 replies; 103+ messages in thread
From: Paul E. McKenney @ 2012-10-24 16:16 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Linus Torvalds, Oleg Nesterov, Ingo Molnar, Peter Zijlstra,
	Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov,
	linux-kernel

On Mon, Oct 22, 2012 at 07:39:16PM -0400, Mikulas Patocka wrote:
> Use rcu_read_lock_sched / rcu_read_unlock_sched / synchronize_sched
> instead of rcu_read_lock / rcu_read_unlock / synchronize_rcu.
> 
> This is an optimization. The RCU-protected region is very small, so
> there will be no latency problems if we disable preempt in this region.
> 
> So we use rcu_read_lock_sched / rcu_read_unlock_sched that translates
> to preempt_disable / preempt_enable. It is smaller (and supposedly
> faster) than preemptible rcu_read_lock / rcu_read_unlock.
> 
> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

OK, as promised/threatened, I finally got a chance to take a closer look.

The light_mb() and heavy_mb() definitions aren't doing much for me,
the code would be clearer with them expanded inline.  And while the
approach of pairing barrier() with synchronize_sched() is interesting,
it would be simpler to rely on RCU's properties.  The key point is that
if RCU cannot prove that a given RCU-sched read-side critical section
is seen by all CPUs to have started after a given synchronize_sched(),
then that synchronize_sched() must wait for that RCU-sched read-side
critical section to complete.

This means, as discussed earlier, that there will be a memory barrier
somewhere following the end of that RCU-sched read-side critical section,
and that this memory barrier executes before the completion of the
synchronize_sched().

So I suggest something like the following (untested!) implementation:

------------------------------------------------------------------------

struct percpu_rw_semaphore {
	unsigned __percpu *counters;
	bool locked;
	struct mutex mtx;
	wait_queue_head_t wq;
};

static inline void percpu_down_read(struct percpu_rw_semaphore *p)
{
	rcu_read_lock_sched();
	if (unlikely(p->locked)) {
		rcu_read_unlock_sched();

		/*
		 * There might (or might not) be a writer.  Acquire &p->mtx,
		 * it is always safe (if a bit slow) to do so.
		 */
		mutex_lock(&p->mtx);
		this_cpu_inc(*p->counters);
		mutex_unlock(&p->mtx);
		return;
	}

	/* No writer, proceed locklessly. */
	this_cpu_inc(*p->counters);
	rcu_read_unlock_sched();
}

static inline void percpu_up_read(struct percpu_rw_semaphore *p)
{
	/*
	 * Decrement our count, but protected by RCU-sched so that
	 * the writer can force proper serialization.
	 */
	rcu_read_lock_sched();
	this_cpu_dec(*p->counters);
	rcu_read_unlock_sched();
}

static inline unsigned __percpu_count(unsigned __percpu *counters)
{
	unsigned total = 0;
	int cpu;

	for_each_possible_cpu(cpu)
		total += ACCESS_ONCE(*per_cpu_ptr(counters, cpu));

	return total;
}

static inline void percpu_down_write(struct percpu_rw_semaphore *p)
{
	mutex_lock(&p->mtx);

	/* Wait for a previous writer, if necessary. */
	wait_event(p->wq, !ACCESS_ONCE(p->locked));

	/* Force the readers to acquire the lock when manipulating counts. */
	ACCESS_ONCE(p->locked) = true;

	/* Wait for all pre-existing readers' checks of ->locked to finish. */
	synchronize_sched();
	/*
	 * At this point, all percpu_down_read() invocations will
	 * acquire p->mtx.
	 */

	/*
	 * Wait for all pre-existing readers to complete their
	 * percpu_up_read() calls.  Because ->locked is set and
	 * because we hold ->mtx, there cannot be any new readers.
	 * ->counters will therefore monotonically decrement to zero.
	 */
	while (__percpu_count(p->counters))
		msleep(1);

	/*
	 * Invoke synchronize_sched() in order to force the last
	 * caller of percpu_up_read() to exit its RCU-sched read-side
	 * critical section.  On SMP systems, this also forces the CPU
	 * that invoked that percpu_up_read() to execute a full memory
	 * barrier between the time it exited the RCU-sched read-side
	 * critical section and the time that synchronize_sched() returns,
	 * so that the critical section begun by this invocation of
	 * percpu_down_write() will happen after the critical section
	 * ended by percpu_up_read().
	 */
	synchronize_sched();
}

static inline void percpu_up_write(struct percpu_rw_semaphore *p)
{
	/* Allow others to proceed, but not yet locklessly. */
	mutex_unlock(&p->mtx);

	/*
	 * Ensure that all calls to percpu_down_read() that did not
	 * start unambiguously after the above mutex_unlock() still
	 * acquire the lock, forcing their critical sections to be
	 * serialized with the one terminated by this call to
	 * percpu_up_write().
	 */
	synchronize_sched();

	/* Now it is safe to allow readers to proceed locklessly. */
	ACCESS_ONCE(p->locked) = false;

	/*
	 * If there is another writer waiting, wake it up.  Note that
	 * p->mtx properly serializes its critical section with the
	 * critical section terminated by this call to percpu_up_write().
	 */
	wake_up(&p->wq);
}

static inline int percpu_init_rwsem(struct percpu_rw_semaphore *p)
{
	p->counters = alloc_percpu(unsigned);
	if (unlikely(!p->counters))
		return -ENOMEM;
	p->locked = false;
	mutex_init(&p->mtx);
	init_waitqueue_head(&p->wq);
	return 0;
}

static inline void percpu_free_rwsem(struct percpu_rw_semaphore *p)
{
	free_percpu(p->counters);
	p->counters = NULL; /* catch use after free bugs */
}

------------------------------------------------------------------------

Of course, it would be nice to get rid of the extra synchronize_sched().
One way to do this is to use SRCU, which allows blocking operations in
its read-side critical sections (though also increasing read-side overhead
a bit, and also untested):

------------------------------------------------------------------------

struct percpu_rw_semaphore {
	bool locked;
	struct mutex mtx; /* Could also be rw_semaphore. */
	struct srcu_struct s;
	wait_queue_head_t wq;
};

static inline int percpu_down_read(struct percpu_rw_semaphore *p)
{
	int idx;

	idx = srcu_read_lock(&p->s);
	if (unlikely(p->locked)) {
		srcu_read_unlock(&p->s, idx);

		/*
		 * There might (or might not) be a writer.  Acquire &p->mtx,
		 * it is always safe (if a bit slow) to do so.
		 */
		mutex_lock(&p->mtx);
		return -1;  /* srcu_read_lock() cannot return -1. */
	}
	return idx;
}

static inline void percpu_up_read(struct percpu_rw_semaphore *p, int idx)
{
	if (idx == -1)
		mutex_unlock(&p->mtx);
	else
		srcu_read_unlock(&p->s, idx);
}

static inline void percpu_down_write(struct percpu_rw_semaphore *p)
{
	mutex_lock(&p->mtx);

	/* Wait for a previous writer, if necessary. */
	wait_event(p->wq, !ACCESS_ONCE(p->locked));

	/* Force new readers to acquire the lock when manipulating counts. */
	ACCESS_ONCE(p->locked) = true;

	/* Wait for all pre-existing readers' checks of ->locked to finish. */
	synchronize_srcu(&p->s);
	/* At this point, all lockless readers have completed. */
}

static inline void percpu_up_write(struct percpu_rw_semaphore *p)
{
	/* Allow others to proceed, but not yet locklessly. */
	mutex_unlock(&p->mtx);

	/*
	 * Ensure that all calls to percpu_down_read() that did not
	 * start unambiguously after the above mutex_unlock() still
	 * acquire the lock, forcing their critical sections to be
	 * serialized with the one terminated by this call to
	 * percpu_up_write().
	 */
	synchronize_sched();

	/* Now it is safe to allow readers to proceed locklessly. */
	ACCESS_ONCE(p->locked) = false;

	/*
	 * If there is another writer waiting, wake it up.  Note that
	 * p->mtx properly serializes its critical section with the
	 * critical section terminated by this call to percpu_up_write().
	 */
	wake_up(&p->wq);
}

static inline int percpu_init_rwsem(struct percpu_rw_semaphore *p)
{
	p->locked = false;
	mutex_init(&p->mtx);
	if (unlikely(init_srcu_struct(&p->s)))
		return -ENOMEM;
	init_waitqueue_head(&p->wq);
	return 0;
}

static inline void percpu_free_rwsem(struct percpu_rw_semaphore *p)
{
	cleanup_srcu_struct(&p->s);
}

------------------------------------------------------------------------

Of course, there was a question raised as to whether something already
exists that does this job...

And you guys did ask!

							Thanx, Paul


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers
  2012-10-23 21:39                               ` Mikulas Patocka
@ 2012-10-24 16:23                                 ` Paul E. McKenney
  2012-10-24 20:22                                   ` Mikulas Patocka
  0 siblings, 1 reply; 103+ messages in thread
From: Paul E. McKenney @ 2012-10-24 16:23 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Oleg Nesterov, Linus Torvalds, Ingo Molnar, Peter Zijlstra,
	Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov,
	linux-kernel

On Tue, Oct 23, 2012 at 05:39:43PM -0400, Mikulas Patocka wrote:
> 
> 
> On Tue, 23 Oct 2012, Paul E. McKenney wrote:
> 
> > On Tue, Oct 23, 2012 at 01:29:02PM -0700, Paul E. McKenney wrote:
> > > On Tue, Oct 23, 2012 at 08:41:23PM +0200, Oleg Nesterov wrote:
> > > > On 10/23, Paul E. McKenney wrote:
> > > > >
> > > > >  * Note that this guarantee implies a further memory-ordering guarantee.
> > > > >  * On systems with more than one CPU, when synchronize_sched() returns,
> > > > >  * each CPU is guaranteed to have executed a full memory barrier since
> > > > >  * the end of its last RCU read-side critical section
> > > >          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > > 
> > > > Ah wait... I misread this comment.
> > > 
> > > And I miswrote it.  It should say "since the end of its last RCU-sched
> > > read-side critical section."  So, for example, RCU-sched need not force
> > > a CPU that is idle, offline, or (eventually) executing in user mode to
> > > execute a memory barrier.  Fixed this.
> 
> Or you can write "each CPU that is executing a kernel code is guaranteed 
> to have executed a full memory barrier".

Perhaps I could, but it isn't needed, nor is it particularly helpful.
Please see suggestions in preceding email.

> It would be consistent with the current implementation and it would make 
> it possible to use
> 
> barrier()-synchronize_sched() as biased memory barriers.

But it is simpler to rely on the properties of RCU.  We really should
avoid memory barriers where possible, as they are way too easy to
get wrong.

> ---
> 
> In percpu-rwlocks, CPU 1 executes
> 
> ...make some writes in the critical section...
> barrier();
> this_cpu_dec(*p->counters);
> 
> and the CPU 2 executes
> 
> while (__percpu_count(p->counters))
> 	msleep(1);
> synchronize_sched();
> 
> So, when CPU 2 finishes synchronize_sched(), we must make sure that
> all writes done by CPU 1 are visible to CPU 2.
> 
> The current implementation fulfills this requirement, you can just add it 
> to the specification so that whoever changes the implementation keeps it.

I will consider doing that if and when someone shows me a situation where
adding that requirement makes things simpler and/or faster.  From what I
can see, your example does not do so.

							Thanx, Paul

> Mikulas
> 
> > And I should hasten to add that for synchronize_sched(), disabling
> > preemption (including disabling irqs, further including NMI handlers)
> > acts as an RCU-sched read-side critical section.  (This is in the
> > comment header for synchronize_sched() up above my addition to it.)
> > 	
> > 							Thanx, Paul
> 


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 2/2] percpu-rw-semaphores: use rcu_read_lock_sched
  2012-10-24 16:16                       ` Paul E. McKenney
@ 2012-10-24 17:18                         ` Oleg Nesterov
  2012-10-24 18:20                           ` Paul E. McKenney
  2012-10-25 14:54                         ` Mikulas Patocka
  1 sibling, 1 reply; 103+ messages in thread
From: Oleg Nesterov @ 2012-10-24 17:18 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Mikulas Patocka, Linus Torvalds, Ingo Molnar, Peter Zijlstra,
	Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov,
	linux-kernel

On 10/24, Paul E. McKenney wrote:
>
> static inline void percpu_up_read(struct percpu_rw_semaphore *p)
> {
> 	/*
> 	 * Decrement our count, but protected by RCU-sched so that
> 	 * the writer can force proper serialization.
> 	 */
> 	rcu_read_lock_sched();
> 	this_cpu_dec(*p->counters);
> 	rcu_read_unlock_sched();
> }

Yes, the explicit lock/unlock makes the new assumptions about
synchronize_sched && barriers unnecessary. And iiuc this could
even be written as

	rcu_read_lock_sched();
	rcu_read_unlock_sched();

	this_cpu_dec(*p->counters);


> Of course, it would be nice to get rid of the extra synchronize_sched().
> One way to do this is to use SRCU, which allows blocking operations in
> its read-side critical sections (though also increasing read-side overhead
> a bit, and also untested):
>
> ------------------------------------------------------------------------
>
> struct percpu_rw_semaphore {
> 	bool locked;
> 	struct mutex mtx; /* Could also be rw_semaphore. */
> 	struct srcu_struct s;
> 	wait_queue_head_t wq;
> };

but in this case I don't understand

> static inline void percpu_up_write(struct percpu_rw_semaphore *p)
> {
> 	/* Allow others to proceed, but not yet locklessly. */
> 	mutex_unlock(&p->mtx);
>
> 	/*
> 	 * Ensure that all calls to percpu_down_read() that did not
> 	 * start unambiguously after the above mutex_unlock() still
> 	 * acquire the lock, forcing their critical sections to be
> 	 * serialized with the one terminated by this call to
> 	 * percpu_up_write().
> 	 */
> 	synchronize_sched();

how this synchronize_sched() can help...

Oleg.


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 2/2] percpu-rw-semaphores: use rcu_read_lock_sched
  2012-10-24 17:18                         ` Oleg Nesterov
@ 2012-10-24 18:20                           ` Paul E. McKenney
  2012-10-24 18:43                             ` Oleg Nesterov
  0 siblings, 1 reply; 103+ messages in thread
From: Paul E. McKenney @ 2012-10-24 18:20 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Mikulas Patocka, Linus Torvalds, Ingo Molnar, Peter Zijlstra,
	Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov,
	linux-kernel

On Wed, Oct 24, 2012 at 07:18:55PM +0200, Oleg Nesterov wrote:
> On 10/24, Paul E. McKenney wrote:
> >
> > static inline void percpu_up_read(struct percpu_rw_semaphore *p)
> > {
> > 	/*
> > 	 * Decrement our count, but protected by RCU-sched so that
> > 	 * the writer can force proper serialization.
> > 	 */
> > 	rcu_read_lock_sched();
> > 	this_cpu_dec(*p->counters);
> > 	rcu_read_unlock_sched();
> > }
> 
> Yes, the explicit lock/unlock makes the new assumptions about
> synchronize_sched && barriers unnecessary. And iiuc this could
> even written as
> 
> 	rcu_read_lock_sched();
> 	rcu_read_unlock_sched();
> 
> 	this_cpu_dec(*p->counters);

But this would lose the memory barrier that is inserted by
synchronize_sched() after the CPU's last RCU-sched read-side critical
section.

> > Of course, it would be nice to get rid of the extra synchronize_sched().
> > One way to do this is to use SRCU, which allows blocking operations in
> > its read-side critical sections (though also increasing read-side overhead
> > a bit, and also untested):
> >
> > ------------------------------------------------------------------------
> >
> > struct percpu_rw_semaphore {
> > 	bool locked;
> > 	struct mutex mtx; /* Could also be rw_semaphore. */
> > 	struct srcu_struct s;
> > 	wait_queue_head_t wq;
> > };
> 
> but in this case I don't understand
> 
> > static inline void percpu_up_write(struct percpu_rw_semaphore *p)
> > {
> > 	/* Allow others to proceed, but not yet locklessly. */
> > 	mutex_unlock(&p->mtx);
> >
> > 	/*
> > 	 * Ensure that all calls to percpu_down_read() that did not
> > 	 * start unambiguously after the above mutex_unlock() still
> > 	 * acquire the lock, forcing their critical sections to be
> > 	 * serialized with the one terminated by this call to
> > 	 * percpu_up_write().
> > 	 */
> > 	synchronize_sched();
> 
> how this synchronize_sched() can help...

Indeed it cannot!  It should instead be synchronize_srcu(&p->s).  I guess that
I really meant it when I said it was untested.  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 2/2] percpu-rw-semaphores: use rcu_read_lock_sched
  2012-10-24 18:20                           ` Paul E. McKenney
@ 2012-10-24 18:43                             ` Oleg Nesterov
  2012-10-24 19:43                               ` Paul E. McKenney
  0 siblings, 1 reply; 103+ messages in thread
From: Oleg Nesterov @ 2012-10-24 18:43 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Mikulas Patocka, Linus Torvalds, Ingo Molnar, Peter Zijlstra,
	Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov,
	linux-kernel

On 10/24, Paul E. McKenney wrote:
>
> On Wed, Oct 24, 2012 at 07:18:55PM +0200, Oleg Nesterov wrote:
> > On 10/24, Paul E. McKenney wrote:
> > >
> > > static inline void percpu_up_read(struct percpu_rw_semaphore *p)
> > > {
> > > 	/*
> > > 	 * Decrement our count, but protected by RCU-sched so that
> > > 	 * the writer can force proper serialization.
> > > 	 */
> > > 	rcu_read_lock_sched();
> > > 	this_cpu_dec(*p->counters);
> > > 	rcu_read_unlock_sched();
> > > }
> >
> > Yes, the explicit lock/unlock makes the new assumptions about
> > synchronize_sched && barriers unnecessary. And iiuc this could
> > even written as
> >
> > 	rcu_read_lock_sched();
> > 	rcu_read_unlock_sched();
> >
> > 	this_cpu_dec(*p->counters);
>
> But this would lose the memory barrier that is inserted by
> synchronize_sched() after the CPU's last RCU-sched read-side critical
> section.

How? Afaics there is no need to synchronize with this_cpu_dec(), its
result was already seen before the 2nd synchronize_sched() was called
in percpu_down_write().

IOW, this memory barrier is only needed to synchronize with memory
changes inside down_read/up_read.

To clarify, of course I do not suggest writing it this way. I am just
trying to check my understanding.

Oleg.


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 2/2] percpu-rw-semaphores: use rcu_read_lock_sched
  2012-10-24 18:43                             ` Oleg Nesterov
@ 2012-10-24 19:43                               ` Paul E. McKenney
  0 siblings, 0 replies; 103+ messages in thread
From: Paul E. McKenney @ 2012-10-24 19:43 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Mikulas Patocka, Linus Torvalds, Ingo Molnar, Peter Zijlstra,
	Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov,
	linux-kernel

On Wed, Oct 24, 2012 at 08:43:11PM +0200, Oleg Nesterov wrote:
> On 10/24, Paul E. McKenney wrote:
> >
> > On Wed, Oct 24, 2012 at 07:18:55PM +0200, Oleg Nesterov wrote:
> > > On 10/24, Paul E. McKenney wrote:
> > > >
> > > > static inline void percpu_up_read(struct percpu_rw_semaphore *p)
> > > > {
> > > > 	/*
> > > > 	 * Decrement our count, but protected by RCU-sched so that
> > > > 	 * the writer can force proper serialization.
> > > > 	 */
> > > > 	rcu_read_lock_sched();
> > > > 	this_cpu_dec(*p->counters);
> > > > 	rcu_read_unlock_sched();
> > > > }
> > >
> > > Yes, the explicit lock/unlock makes the new assumptions about
> > > synchronize_sched && barriers unnecessary. And iiuc this could
> > > even written as
> > >
> > > 	rcu_read_lock_sched();
> > > 	rcu_read_unlock_sched();
> > >
> > > 	this_cpu_dec(*p->counters);
> >
> > But this would lose the memory barrier that is inserted by
> > synchronize_sched() after the CPU's last RCU-sched read-side critical
> > section.
> 
> How? Afaics there is no need to synchronize with this_cpu_dec(), its
> result was already seen before the 2nd synchronize_sched() was called
> in percpu_down_write().
> 
> IOW, this memory barrier is only needed to synchronize with memory
> changes inside down_read/up_read.
> 
> To clarify, of course I do not suggest to write is this way. I am just
> trying to check my understanding.

You are quite correct -- once the writer has seen the change in the
counter, it knows that the reader's empty RCU-sched read must have
at least started, and thus can rely on the following memory barrier
to guarantee that it sees the reader's critical section.

But that code really does look strange, I will grant you that!  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers
  2012-10-24 16:23                                 ` Paul E. McKenney
@ 2012-10-24 20:22                                   ` Mikulas Patocka
  2012-10-24 20:36                                     ` Paul E. McKenney
  0 siblings, 1 reply; 103+ messages in thread
From: Mikulas Patocka @ 2012-10-24 20:22 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Oleg Nesterov, Linus Torvalds, Ingo Molnar, Peter Zijlstra,
	Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov,
	linux-kernel



On Wed, 24 Oct 2012, Paul E. McKenney wrote:

> On Tue, Oct 23, 2012 at 05:39:43PM -0400, Mikulas Patocka wrote:
> > 
> > 
> > On Tue, 23 Oct 2012, Paul E. McKenney wrote:
> > 
> > > On Tue, Oct 23, 2012 at 01:29:02PM -0700, Paul E. McKenney wrote:
> > > > On Tue, Oct 23, 2012 at 08:41:23PM +0200, Oleg Nesterov wrote:
> > > > > On 10/23, Paul E. McKenney wrote:
> > > > > >
> > > > > >  * Note that this guarantee implies a further memory-ordering guarantee.
> > > > > >  * On systems with more than one CPU, when synchronize_sched() returns,
> > > > > >  * each CPU is guaranteed to have executed a full memory barrier since
> > > > > >  * the end of its last RCU read-side critical section
> > > > >          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > > > 
> > > > > Ah wait... I misread this comment.
> > > > 
> > > > And I miswrote it.  It should say "since the end of its last RCU-sched
> > > > read-side critical section."  So, for example, RCU-sched need not force
> > > > a CPU that is idle, offline, or (eventually) executing in user mode to
> > > > execute a memory barrier.  Fixed this.
> > 
> > Or you can write "each CPU that is executing a kernel code is guaranteed 
> > to have executed a full memory barrier".
> 
> Perhaps I could, but it isn't needed, nor is it particularly helpful.
> Please see suggestions in preceding email.

It is helpful, because if you add this requirement (that already holds for 
the current implementation), you can drop rcu_read_lock_sched() and 
rcu_read_unlock_sched() from the following code that you submitted.

static inline void percpu_up_read(struct percpu_rw_semaphore *p)
{
        /*
         * Decrement our count, but protected by RCU-sched so that
         * the writer can force proper serialization.
         */
        rcu_read_lock_sched();
        this_cpu_dec(*p->counters);
        rcu_read_unlock_sched();
}

> > The current implementation fulfills this requirement, you can just add it 
> > to the specification so that whoever changes the implementation keeps it.
> 
> I will consider doing that if and when someone shows me a situation where
> adding that requirement makes things simpler and/or faster.  From what I
> can see, your example does not do so.
> 
> 							Thanx, Paul

If you do, the above code can be simplified to:
{
	barrier();
	this_cpu_dec(*p->counters);
}

Mikulas

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers
  2012-10-24 20:22                                   ` Mikulas Patocka
@ 2012-10-24 20:36                                     ` Paul E. McKenney
  2012-10-24 20:44                                       ` Mikulas Patocka
  0 siblings, 1 reply; 103+ messages in thread
From: Paul E. McKenney @ 2012-10-24 20:36 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Oleg Nesterov, Linus Torvalds, Ingo Molnar, Peter Zijlstra,
	Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov,
	linux-kernel

On Wed, Oct 24, 2012 at 04:22:17PM -0400, Mikulas Patocka wrote:
> 
> 
> On Wed, 24 Oct 2012, Paul E. McKenney wrote:
> 
> > On Tue, Oct 23, 2012 at 05:39:43PM -0400, Mikulas Patocka wrote:
> > > 
> > > 
> > > On Tue, 23 Oct 2012, Paul E. McKenney wrote:
> > > 
> > > > On Tue, Oct 23, 2012 at 01:29:02PM -0700, Paul E. McKenney wrote:
> > > > > On Tue, Oct 23, 2012 at 08:41:23PM +0200, Oleg Nesterov wrote:
> > > > > > On 10/23, Paul E. McKenney wrote:
> > > > > > >
> > > > > > >  * Note that this guarantee implies a further memory-ordering guarantee.
> > > > > > >  * On systems with more than one CPU, when synchronize_sched() returns,
> > > > > > >  * each CPU is guaranteed to have executed a full memory barrier since
> > > > > > >  * the end of its last RCU read-side critical section
> > > > > >          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > > > > 
> > > > > > Ah wait... I misread this comment.
> > > > > 
> > > > > And I miswrote it.  It should say "since the end of its last RCU-sched
> > > > > read-side critical section."  So, for example, RCU-sched need not force
> > > > > a CPU that is idle, offline, or (eventually) executing in user mode to
> > > > > execute a memory barrier.  Fixed this.
> > > 
> > > Or you can write "each CPU that is executing a kernel code is guaranteed 
> > > to have executed a full memory barrier".
> > 
> > Perhaps I could, but it isn't needed, nor is it particularly helpful.
> > Please see suggestions in preceding email.
> 
> It is helpful, because if you add this requirement (that already holds for 
> the current implementation), you can drop rcu_read_lock_sched() and 
> rcu_read_unlock_sched() from the following code that you submitted.
> 
> static inline void percpu_up_read(struct percpu_rw_semaphore *p)
> {
>         /*
>          * Decrement our count, but protected by RCU-sched so that
>          * the writer can force proper serialization.
>          */
>         rcu_read_lock_sched();
>         this_cpu_dec(*p->counters);
>         rcu_read_unlock_sched();
> }
> 
> > > The current implementation fulfills this requirement, you can just add it 
> > > to the specification so that whoever changes the implementation keeps it.
> > 
> > I will consider doing that if and when someone shows me a situation where
> > adding that requirement makes things simpler and/or faster.  From what I
> > can see, your example does not do so.
> > 
> > 							Thanx, Paul
> 
> If you do, the above code can be simplified to:
> {
> 	barrier();
> 	this_cpu_dec(*p->counters);
> }

The readers are lightweight enough that you are worried about the overhead
of rcu_read_lock_sched() and rcu_read_unlock_sched()?  Really???

							Thanx, Paul


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers
  2012-10-24 20:36                                     ` Paul E. McKenney
@ 2012-10-24 20:44                                       ` Mikulas Patocka
  2012-10-24 23:57                                         ` Paul E. McKenney
  0 siblings, 1 reply; 103+ messages in thread
From: Mikulas Patocka @ 2012-10-24 20:44 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Oleg Nesterov, Linus Torvalds, Ingo Molnar, Peter Zijlstra,
	Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov,
	linux-kernel



On Wed, 24 Oct 2012, Paul E. McKenney wrote:

> On Wed, Oct 24, 2012 at 04:22:17PM -0400, Mikulas Patocka wrote:
> > 
> > 
> > On Wed, 24 Oct 2012, Paul E. McKenney wrote:
> > 
> > > On Tue, Oct 23, 2012 at 05:39:43PM -0400, Mikulas Patocka wrote:
> > > > 
> > > > 
> > > > On Tue, 23 Oct 2012, Paul E. McKenney wrote:
> > > > 
> > > > > On Tue, Oct 23, 2012 at 01:29:02PM -0700, Paul E. McKenney wrote:
> > > > > > On Tue, Oct 23, 2012 at 08:41:23PM +0200, Oleg Nesterov wrote:
> > > > > > > On 10/23, Paul E. McKenney wrote:
> > > > > > > >
> > > > > > > >  * Note that this guarantee implies a further memory-ordering guarantee.
> > > > > > > >  * On systems with more than one CPU, when synchronize_sched() returns,
> > > > > > > >  * each CPU is guaranteed to have executed a full memory barrier since
> > > > > > > >  * the end of its last RCU read-side critical section
> > > > > > >          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > > > > > 
> > > > > > > Ah wait... I misread this comment.
> > > > > > 
> > > > > > And I miswrote it.  It should say "since the end of its last RCU-sched
> > > > > > read-side critical section."  So, for example, RCU-sched need not force
> > > > > > a CPU that is idle, offline, or (eventually) executing in user mode to
> > > > > > execute a memory barrier.  Fixed this.
> > > > 
> > > > Or you can write "each CPU that is executing a kernel code is guaranteed 
> > > > to have executed a full memory barrier".
> > > 
> > > Perhaps I could, but it isn't needed, nor is it particularly helpful.
> > > Please see suggestions in preceding email.
> > 
> > It is helpful, because if you add this requirement (that already holds for 
> > the current implementation), you can drop rcu_read_lock_sched() and 
> > rcu_read_unlock_sched() from the following code that you submitted.
> > 
> > static inline void percpu_up_read(struct percpu_rw_semaphore *p)
> > {
> >         /*
> >          * Decrement our count, but protected by RCU-sched so that
> >          * the writer can force proper serialization.
> >          */
> >         rcu_read_lock_sched();
> >         this_cpu_dec(*p->counters);
> >         rcu_read_unlock_sched();
> > }
> > 
> > > > The current implementation fulfills this requirement, you can just add it 
> > > > to the specification so that whoever changes the implementation keeps it.
> > > 
> > > I will consider doing that if and when someone shows me a situation where
> > > adding that requirement makes things simpler and/or faster.  From what I
> > > can see, your example does not do so.
> > > 
> > > 							Thanx, Paul
> > 
> > If you do, the above code can be simplified to:
> > {
> > 	barrier();
> > 	this_cpu_dec(*p->counters);
> > }
> 
> The readers are lightweight enough that you are worried about the overhead
> of rcu_read_lock_sched() and rcu_read_unlock_sched()?  Really???
> 
> 							Thanx, Paul

There was no lock in previous kernels, so we should make it as simple as 
possible. Disabling and reenabling preemption is probably not a big deal, 
but if we don't have to do it, why do it?

Mikulas


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers
  2012-10-24 20:44                                       ` Mikulas Patocka
@ 2012-10-24 23:57                                         ` Paul E. McKenney
  2012-10-25 12:39                                           ` Paul E. McKenney
  2012-10-25 13:48                                           ` Mikulas Patocka
  0 siblings, 2 replies; 103+ messages in thread
From: Paul E. McKenney @ 2012-10-24 23:57 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Oleg Nesterov, Linus Torvalds, Ingo Molnar, Peter Zijlstra,
	Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov,
	linux-kernel

On Wed, Oct 24, 2012 at 04:44:14PM -0400, Mikulas Patocka wrote:
> 
> 
> On Wed, 24 Oct 2012, Paul E. McKenney wrote:
> 
> > On Wed, Oct 24, 2012 at 04:22:17PM -0400, Mikulas Patocka wrote:
> > > 
> > > 
> > > On Wed, 24 Oct 2012, Paul E. McKenney wrote:
> > > 
> > > > On Tue, Oct 23, 2012 at 05:39:43PM -0400, Mikulas Patocka wrote:
> > > > > 
> > > > > 
> > > > > On Tue, 23 Oct 2012, Paul E. McKenney wrote:
> > > > > 
> > > > > > On Tue, Oct 23, 2012 at 01:29:02PM -0700, Paul E. McKenney wrote:
> > > > > > > On Tue, Oct 23, 2012 at 08:41:23PM +0200, Oleg Nesterov wrote:
> > > > > > > > On 10/23, Paul E. McKenney wrote:
> > > > > > > > >
> > > > > > > > >  * Note that this guarantee implies a further memory-ordering guarantee.
> > > > > > > > >  * On systems with more than one CPU, when synchronize_sched() returns,
> > > > > > > > >  * each CPU is guaranteed to have executed a full memory barrier since
> > > > > > > > >  * the end of its last RCU read-side critical section
> > > > > > > >          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > > > > > > 
> > > > > > > > Ah wait... I misread this comment.
> > > > > > > 
> > > > > > > And I miswrote it.  It should say "since the end of its last RCU-sched
> > > > > > > read-side critical section."  So, for example, RCU-sched need not force
> > > > > > > a CPU that is idle, offline, or (eventually) executing in user mode to
> > > > > > > execute a memory barrier.  Fixed this.
> > > > > 
> > > > > Or you can write "each CPU that is executing a kernel code is guaranteed 
> > > > > to have executed a full memory barrier".
> > > > 
> > > > Perhaps I could, but it isn't needed, nor is it particularly helpful.
> > > > Please see suggestions in preceding email.
> > > 
> > > It is helpful, because if you add this requirement (that already holds for 
> > > the current implementation), you can drop rcu_read_lock_sched() and 
> > > rcu_read_unlock_sched() from the following code that you submitted.
> > > 
> > > static inline void percpu_up_read(struct percpu_rw_semaphore *p)
> > > {
> > >         /*
> > >          * Decrement our count, but protected by RCU-sched so that
> > >          * the writer can force proper serialization.
> > >          */
> > >         rcu_read_lock_sched();
> > >         this_cpu_dec(*p->counters);
> > >         rcu_read_unlock_sched();
> > > }
> > > 
> > > > > The current implementation fulfills this requirement, you can just add it 
> > > > > to the specification so that whoever changes the implementation keeps it.
> > > > 
> > > > I will consider doing that if and when someone shows me a situation where
> > > > adding that requirement makes things simpler and/or faster.  From what I
> > > > can see, your example does not do so.
> > > > 
> > > > 							Thanx, Paul
> > > 
> > > If you do, the above code can be simplified to:
> > > {
> > > 	barrier();
> > > 	this_cpu_dec(*p->counters);
> > > }
> > 
> > The readers are lightweight enough that you are worried about the overhead
> > of rcu_read_lock_sched() and rcu_read_unlock_sched()?  Really???
> > 
> > 							Thanx, Paul
> 
> There was no lock in previous kernels, so we should make it as simple as 
> possible. Disabling and reenabling preemption is probably not a big deal, 
> but if don't have to do it, why do it?

Because I don't consider the barrier()-paired-with-synchronize_sched()
to be a simplification.

While we are discussing this, I have been assuming that readers must block
from time to time.  Is this the case?

							Thanx, Paul


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers
  2012-10-24 23:57                                         ` Paul E. McKenney
@ 2012-10-25 12:39                                           ` Paul E. McKenney
  2012-10-25 13:48                                           ` Mikulas Patocka
  1 sibling, 0 replies; 103+ messages in thread
From: Paul E. McKenney @ 2012-10-25 12:39 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Oleg Nesterov, Linus Torvalds, Ingo Molnar, Peter Zijlstra,
	Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov,
	linux-kernel

On Wed, Oct 24, 2012 at 04:57:35PM -0700, Paul E. McKenney wrote:
> On Wed, Oct 24, 2012 at 04:44:14PM -0400, Mikulas Patocka wrote:
> > 
> > 
> > On Wed, 24 Oct 2012, Paul E. McKenney wrote:
> > 
> > > On Wed, Oct 24, 2012 at 04:22:17PM -0400, Mikulas Patocka wrote:
> > > > 
> > > > 
> > > > On Wed, 24 Oct 2012, Paul E. McKenney wrote:
> > > > 
> > > > > On Tue, Oct 23, 2012 at 05:39:43PM -0400, Mikulas Patocka wrote:
> > > > > > 
> > > > > > 
> > > > > > On Tue, 23 Oct 2012, Paul E. McKenney wrote:
> > > > > > 
> > > > > > > On Tue, Oct 23, 2012 at 01:29:02PM -0700, Paul E. McKenney wrote:
> > > > > > > > On Tue, Oct 23, 2012 at 08:41:23PM +0200, Oleg Nesterov wrote:
> > > > > > > > > On 10/23, Paul E. McKenney wrote:
> > > > > > > > > >
> > > > > > > > > >  * Note that this guarantee implies a further memory-ordering guarantee.
> > > > > > > > > >  * On systems with more than one CPU, when synchronize_sched() returns,
> > > > > > > > > >  * each CPU is guaranteed to have executed a full memory barrier since
> > > > > > > > > >  * the end of its last RCU read-side critical section
> > > > > > > > >          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > > > > > > > 
> > > > > > > > > Ah wait... I misread this comment.
> > > > > > > > 
> > > > > > > > And I miswrote it.  It should say "since the end of its last RCU-sched
> > > > > > > > read-side critical section."  So, for example, RCU-sched need not force
> > > > > > > > a CPU that is idle, offline, or (eventually) executing in user mode to
> > > > > > > > execute a memory barrier.  Fixed this.
> > > > > > 
> > > > > > Or you can write "each CPU that is executing a kernel code is guaranteed 
> > > > > > to have executed a full memory barrier".
> > > > > 
> > > > > Perhaps I could, but it isn't needed, nor is it particularly helpful.
> > > > > Please see suggestions in preceding email.
> > > > 
> > > > It is helpful, because if you add this requirement (that already holds for 
> > > > the current implementation), you can drop rcu_read_lock_sched() and 
> > > > rcu_read_unlock_sched() from the following code that you submitted.
> > > > 
> > > > static inline void percpu_up_read(struct percpu_rw_semaphore *p)
> > > > {
> > > >         /*
> > > >          * Decrement our count, but protected by RCU-sched so that
> > > >          * the writer can force proper serialization.
> > > >          */
> > > >         rcu_read_lock_sched();
> > > >         this_cpu_dec(*p->counters);
> > > >         rcu_read_unlock_sched();
> > > > }
> > > > 
> > > > > > The current implementation fulfills this requirement, you can just add it 
> > > > > > to the specification so that whoever changes the implementation keeps it.
> > > > > 
> > > > > I will consider doing that if and when someone shows me a situation where
> > > > > adding that requirement makes things simpler and/or faster.  From what I
> > > > > can see, your example does not do so.
> > > > > 
> > > > > 							Thanx, Paul
> > > > 
> > > > If you do, the above code can be simplified to:
> > > > {
> > > > 	barrier();
> > > > 	this_cpu_dec(*p->counters);
> > > > }
> > > 
> > > The readers are lightweight enough that you are worried about the overhead
> > > of rcu_read_lock_sched() and rcu_read_unlock_sched()?  Really???
> > > 
> > > 							Thanx, Paul
> > 
> > There was no lock in previous kernels, so we should make it as simple as 
> > possible. Disabling and reenabling preemption is probably not a big deal, 
> > but if don't have to do it, why do it?
> 
> Because I don't consider the barrier()-paired-with-synchronize_sched()
> to be a simplification.

In addition, please note that synchronize_srcu() used to guarantee a
memory barrier on all online non-idle CPUs, but that it no longer does
after Lai Jiangshan's recent rewrite.  Given this change, I would have
to be quite foolish not to be very reluctant to make this guarantee for
other flavors of RCU, unless there was an extremely good reason for it.
Dropping a preempt_disable()/preempt_enable() pair doesn't even come
close to being a good enough reason.

> While we are discussing this, I have been assuming that readers must block
> from time to time.  Is this the case?

And this really is a serious question.  If the answer is "no", that
readers never block, a much simpler and faster approach is possible.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers
  2012-10-24 23:57                                         ` Paul E. McKenney
  2012-10-25 12:39                                           ` Paul E. McKenney
@ 2012-10-25 13:48                                           ` Mikulas Patocka
  1 sibling, 0 replies; 103+ messages in thread
From: Mikulas Patocka @ 2012-10-25 13:48 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Oleg Nesterov, Linus Torvalds, Ingo Molnar, Peter Zijlstra,
	Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov,
	linux-kernel



On Wed, 24 Oct 2012, Paul E. McKenney wrote:

> On Wed, Oct 24, 2012 at 04:44:14PM -0400, Mikulas Patocka wrote:
> > 
> > 
> > On Wed, 24 Oct 2012, Paul E. McKenney wrote:
> > 
> > > On Wed, Oct 24, 2012 at 04:22:17PM -0400, Mikulas Patocka wrote:
> > > > 
> > > > 
> > > > On Wed, 24 Oct 2012, Paul E. McKenney wrote:
> > > > 
> > > > > On Tue, Oct 23, 2012 at 05:39:43PM -0400, Mikulas Patocka wrote:
> > > > > > 
> > > > > > 
> > > > > > On Tue, 23 Oct 2012, Paul E. McKenney wrote:
> > > > > > 
> > > > > > > On Tue, Oct 23, 2012 at 01:29:02PM -0700, Paul E. McKenney wrote:
> > > > > > > > On Tue, Oct 23, 2012 at 08:41:23PM +0200, Oleg Nesterov wrote:
> > > > > > > > > On 10/23, Paul E. McKenney wrote:
> > > > > > > > > >
> > > > > > > > > >  * Note that this guarantee implies a further memory-ordering guarantee.
> > > > > > > > > >  * On systems with more than one CPU, when synchronize_sched() returns,
> > > > > > > > > >  * each CPU is guaranteed to have executed a full memory barrier since
> > > > > > > > > >  * the end of its last RCU read-side critical section
> > > > > > > > >          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > > > > > > > 
> > > > > > > > > Ah wait... I misread this comment.
> > > > > > > > 
> > > > > > > > And I miswrote it.  It should say "since the end of its last RCU-sched
> > > > > > > > read-side critical section."  So, for example, RCU-sched need not force
> > > > > > > > a CPU that is idle, offline, or (eventually) executing in user mode to
> > > > > > > > execute a memory barrier.  Fixed this.
> > > > > > 
> > > > > > Or you can write "each CPU that is executing a kernel code is guaranteed 
> > > > > > to have executed a full memory barrier".
> > > > > 
> > > > > Perhaps I could, but it isn't needed, nor is it particularly helpful.
> > > > > Please see suggestions in preceding email.
> > > > 
> > > > It is helpful, because if you add this requirement (that already holds for 
> > > > the current implementation), you can drop rcu_read_lock_sched() and 
> > > > rcu_read_unlock_sched() from the following code that you submitted.
> > > > 
> > > > static inline void percpu_up_read(struct percpu_rw_semaphore *p)
> > > > {
> > > >         /*
> > > >          * Decrement our count, but protected by RCU-sched so that
> > > >          * the writer can force proper serialization.
> > > >          */
> > > >         rcu_read_lock_sched();
> > > >         this_cpu_dec(*p->counters);
> > > >         rcu_read_unlock_sched();
> > > > }
> > > > 
> > > > > > The current implementation fulfills this requirement, you can just add it 
> > > > > > to the specification so that whoever changes the implementation keeps it.
> > > > > 
> > > > > I will consider doing that if and when someone shows me a situation where
> > > > > adding that requirement makes things simpler and/or faster.  From what I
> > > > > can see, your example does not do so.
> > > > > 
> > > > > 							Thanx, Paul
> > > > 
> > > > If you do, the above code can be simplified to:
> > > > {
> > > > 	barrier();
> > > > 	this_cpu_dec(*p->counters);
> > > > }
> > > 
> > > The readers are lightweight enough that you are worried about the overhead
> > > of rcu_read_lock_sched() and rcu_read_unlock_sched()?  Really???
> > > 
> > > 							Thanx, Paul
> > 
> > There was no lock in previous kernels, so we should make it as simple as 
> > possible. Disabling and reenabling preemption is probably not a big deal, 
> > but if don't have to do it, why do it?
> 
> Because I don't consider the barrier()-paired-with-synchronize_sched()
> to be a simplification.

It is a simplification because it makes the code smaller (just one 
instruction on x86):
this_cpu_dec(*p->counters):
   0:   64 ff 08                decl   %fs:(%eax)
preempt_disable()
this_cpu_dec(*p->counters)
preempt_enable():
  10:   89 e2                   mov    %esp,%edx
  12:   81 e2 00 e0 ff ff       and    $0xffffe000,%edx
  18:   ff 42 14                incl   0x14(%edx)
  1b:   64 ff 08                decl   %fs:(%eax)
  1e:   ff 4a 14                decl   0x14(%edx)
  21:   8b 42 08                mov    0x8(%edx),%eax
  24:   a8 08                   test   $0x8,%al
  26:   75 03                   jne    2b

this_cpu_dec is uninterruptible, so there is no reason why you would want 
to put preempt_disable and preempt_enable around it.

Disabling preemption may actually improve performance on RISC machines. 
RISC architectures have load/store instructions and they do not have a 
single instruction to load a value from memory, decrement it and write it 
back. So, on RISC architectures, this_cpu_dec is implemented as: disable 
interrupts, load the value, decrement the value, write the value, restore 
interrupt state. Disabling interrupts slows down because it triggers 
microcode.

For example, on PA-RISC
                preempt_disable();
                (*this_cpu_ptr(counters))--;
                preempt_enable();
is faster than
                this_cpu_dec(*counters);

But on X86, this_cpu_dec(*counters) is faster.
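
(The generic fallback for architectures without a suitable single
instruction is roughly this shape - a simplified sketch; the exact names
and macro layering in the percpu headers differ:)

static inline void sketchy_this_cpu_dec(int __percpu *p)
{
	unsigned long flags;

	/* the irq disable/enable is the expensive part on many RISC CPUs */
	local_irq_save(flags);
	(*__this_cpu_ptr(p))--;
	local_irq_restore(flags);
}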

> While we are discussing this, I have been assuming that readers must block
> from time to time.  Is this the case?
> 
> 							Thanx, Paul

Processes that hold the read lock block in the i/o path - they may block 
to wait until the data is read from disk. Or for other reasons.

Mikulas

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/2] brw_mutex: big read-write mutex
  2012-10-24  3:08                         ` Dave Chinner
@ 2012-10-25 14:09                           ` Mikulas Patocka
  2012-10-25 23:40                             ` Dave Chinner
  0 siblings, 1 reply; 103+ messages in thread
From: Mikulas Patocka @ 2012-10-25 14:09 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Peter Zijlstra, Oleg Nesterov, Paul E. McKenney, Linus Torvalds,
	Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli,
	Anton Arapov, linux-kernel, Thomas Gleixner



On Wed, 24 Oct 2012, Dave Chinner wrote:

> On Fri, Oct 19, 2012 at 06:54:41PM -0400, Mikulas Patocka wrote:
> > 
> > 
> > On Fri, 19 Oct 2012, Peter Zijlstra wrote:
> > 
> > > > Yes, I tried this approach - it involves doing LOCK instruction on read 
> > > > lock, remembering the cpu and doing another LOCK instruction on read 
> > > > unlock (which will hopefully be on the same CPU, so no cacheline bouncing 
> > > > happens in the common case). It was slower than the approach without any 
> > > > LOCK instructions (43.3 seconds seconds for the implementation with 
> > > > per-cpu LOCKed access, 42.7 seconds for this implementation without atomic 
> > > > instruction; the benchmark involved doing 512-byte direct-io reads and 
> > > > writes on a ramdisk with 8 processes on 8-core machine).
> > > 
> > > So why is that a problem? Surely that's already tons better then what
> > > you've currently got.
> > 
> > Percpu rw-semaphores do not improve performance at all. I put them there 
> > to avoid performance regression, not to improve performance.
> > 
> > All Linux kernels have a race condition - when you change block size of a 
> > block device and you read or write the device at the same time, a crash 
> > may happen. This bug is there since ever. Recently, this bug started to 
> > cause major trouble - multiple high profile business sites report crashes 
> > because of this race condition.
> >
> > You can fix this race by using a read lock around I/O paths and write lock 
> > around block size changing, but normal rw semaphore cause cache line 
> > bouncing when taken for read by multiple processors and I/O performance 
> > degradation because of it is measurable.
> 
> This doesn't sound like a new problem.  Hasn't this global access,
> single modifier exclusion problem been solved before in the VFS?
> e.g. mnt_want_write()/mnt_make_readonly()
> 
> Cheers,
> 
> Dave.

Yes, mnt_want_write()/mnt_make_readonly() do the same thing as percpu rw 
semaphores. I think you can convert mnt_want_write()/mnt_make_readonly() 
to use percpu rw semaphores and remove the duplicated code.
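
Roughly, just to illustrate the idea (ignoring MNT_WRITE_HOLD, the r/o
remount error handling, and with a made-up mnt_writers_sem field):

int mnt_want_write(struct vfsmount *m)
{
	percpu_down_read(&m->mnt_writers_sem);
	if (__mnt_is_readonly(m)) {
		percpu_up_read(&m->mnt_writers_sem);
		return -EROFS;
	}
	return 0;
}

void mnt_drop_write(struct vfsmount *m)
{
	percpu_up_read(&m->mnt_writers_sem);
}

and mnt_make_readonly() would take the same semaphore for write around
setting MNT_READONLY.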

Mikulas

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 2/2] percpu-rw-semaphores: use rcu_read_lock_sched
  2012-10-24 16:16                       ` Paul E. McKenney
  2012-10-24 17:18                         ` Oleg Nesterov
@ 2012-10-25 14:54                         ` Mikulas Patocka
  2012-10-25 15:07                           ` Paul E. McKenney
  1 sibling, 1 reply; 103+ messages in thread
From: Mikulas Patocka @ 2012-10-25 14:54 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Linus Torvalds, Oleg Nesterov, Ingo Molnar, Peter Zijlstra,
	Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov,
	linux-kernel



On Wed, 24 Oct 2012, Paul E. McKenney wrote:

> On Mon, Oct 22, 2012 at 07:39:16PM -0400, Mikulas Patocka wrote:
> > Use rcu_read_lock_sched / rcu_read_unlock_sched / synchronize_sched
> > instead of rcu_read_lock / rcu_read_unlock / synchronize_rcu.
> > 
> > This is an optimization. The RCU-protected region is very small, so
> > there will be no latency problems if we disable preempt in this region.
> > 
> > So we use rcu_read_lock_sched / rcu_read_unlock_sched that translates
> > to preempt_disable / preempt_enable. It is smaller (and supposedly
> > faster) than preemptible rcu_read_lock / rcu_read_unlock.
> > 
> > Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> 
> OK, as promised/threatened, I finally got a chance to take a closer look.
> 
> The light_mb() and heavy_mb() definitions aren't doing much for me,
> the code would be cleared with them expanded inline.  And while the
> approach of pairing barrier() with synchronize_sched() is interesting,
> it would be simpler to rely on RCU's properties.  The key point is that
> if RCU cannot prove that a given RCU-sched read-side critical section
> is seen by all CPUs to have started after a given synchronize_sched(),
> then that synchronize_sched() must wait for that RCU-sched read-side
> critical section to complete.

Also note that you can define both light_mb() and heavy_mb() to be 
smp_mb() and slow down the reader path a bit and speed up the writer path.

On architectures with in-order memory access (and thus smp_mb() equals 
barrier()), it doesn't hurt the reader but helps the writer, for example:
#ifdef ARCH_HAS_INORDER_MEMORY_ACCESS
#define light_mb()      smp_mb()
#define heavy_mb()      smp_mb()
#else
#define light_mb()      barrier()
#define heavy_mb()      synchronize_sched()
#endif

Mikulas

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 2/2] percpu-rw-semaphores: use rcu_read_lock_sched
  2012-10-25 14:54                         ` Mikulas Patocka
@ 2012-10-25 15:07                           ` Paul E. McKenney
  2012-10-25 16:15                             ` Mikulas Patocka
  0 siblings, 1 reply; 103+ messages in thread
From: Paul E. McKenney @ 2012-10-25 15:07 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Linus Torvalds, Oleg Nesterov, Ingo Molnar, Peter Zijlstra,
	Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov,
	linux-kernel

On Thu, Oct 25, 2012 at 10:54:11AM -0400, Mikulas Patocka wrote:
> 
> 
> On Wed, 24 Oct 2012, Paul E. McKenney wrote:
> 
> > On Mon, Oct 22, 2012 at 07:39:16PM -0400, Mikulas Patocka wrote:
> > > Use rcu_read_lock_sched / rcu_read_unlock_sched / synchronize_sched
> > > instead of rcu_read_lock / rcu_read_unlock / synchronize_rcu.
> > > 
> > > This is an optimization. The RCU-protected region is very small, so
> > > there will be no latency problems if we disable preempt in this region.
> > > 
> > > So we use rcu_read_lock_sched / rcu_read_unlock_sched that translates
> > > to preempt_disable / preempt_disable. It is smaller (and supposedly
> > > faster) than preemptible rcu_read_lock / rcu_read_unlock.
> > > 
> > > Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> > 
> > OK, as promised/threatened, I finally got a chance to take a closer look.
> > 
> > The light_mb() and heavy_mb() definitions aren't doing much for me,
> > the code would be cleared with them expanded inline.  And while the
> > approach of pairing barrier() with synchronize_sched() is interesting,
> > it would be simpler to rely on RCU's properties.  The key point is that
> > if RCU cannot prove that a given RCU-sched read-side critical section
> > is seen by all CPUs to have started after a given synchronize_sched(),
> > then that synchronize_sched() must wait for that RCU-sched read-side
> > critical section to complete.
> 
> Also note that you can define both light_mb() and heavy_mb() to be 
> smp_mb() and slow down the reader path a bit and speed up the writer path.
> 
> On architectures with in-order memory access (and thus smp_mb() equals 
> barrier()), it doesn't hurt the reader but helps the writer, for example:
> #ifdef ARCH_HAS_INORDER_MEMORY_ACCESS
> #define light_mb()      smp_mb()
> #define heavy_mb()      smp_mb()
> #else
> #define light_mb()      barrier()
> #define heavy_mb()      synchronize_sched()
> #endif

Except that there are no systems running Linux with in-order memory
access.  Even x86 and s390 require a barrier instruction for smp_mb()
on SMP=y builds.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 2/2] percpu-rw-semaphores: use rcu_read_lock_sched
  2012-10-25 15:07                           ` Paul E. McKenney
@ 2012-10-25 16:15                             ` Mikulas Patocka
  0 siblings, 0 replies; 103+ messages in thread
From: Mikulas Patocka @ 2012-10-25 16:15 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Linus Torvalds, Oleg Nesterov, Ingo Molnar, Peter Zijlstra,
	Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov,
	linux-kernel



On Thu, 25 Oct 2012, Paul E. McKenney wrote:

> On Thu, Oct 25, 2012 at 10:54:11AM -0400, Mikulas Patocka wrote:
> > 
> > 
> > On Wed, 24 Oct 2012, Paul E. McKenney wrote:
> > 
> > > On Mon, Oct 22, 2012 at 07:39:16PM -0400, Mikulas Patocka wrote:
> > > > Use rcu_read_lock_sched / rcu_read_unlock_sched / synchronize_sched
> > > > instead of rcu_read_lock / rcu_read_unlock / synchronize_rcu.
> > > > 
> > > > This is an optimization. The RCU-protected region is very small, so
> > > > there will be no latency problems if we disable preempt in this region.
> > > > 
> > > > So we use rcu_read_lock_sched / rcu_read_unlock_sched that translates
> > > > to preempt_disable / preempt_disable. It is smaller (and supposedly
> > > > faster) than preemptible rcu_read_lock / rcu_read_unlock.
> > > > 
> > > > Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> > > 
> > > OK, as promised/threatened, I finally got a chance to take a closer look.
> > > 
> > > The light_mb() and heavy_mb() definitions aren't doing much for me,
> > > the code would be cleared with them expanded inline.  And while the
> > > approach of pairing barrier() with synchronize_sched() is interesting,
> > > it would be simpler to rely on RCU's properties.  The key point is that
> > > if RCU cannot prove that a given RCU-sched read-side critical section
> > > is seen by all CPUs to have started after a given synchronize_sched(),
> > > then that synchronize_sched() must wait for that RCU-sched read-side
> > > critical section to complete.
> > 
> > Also note that you can define both light_mb() and heavy_mb() to be 
> > smp_mb() and slow down the reader path a bit and speed up the writer path.
> > 
> > On architectures with in-order memory access (and thus smp_mb() equals 
> > barrier()), it doesn't hurt the reader but helps the writer, for example:
> > #ifdef ARCH_HAS_INORDER_MEMORY_ACCESS
> > #define light_mb()      smp_mb()
> > #define heavy_mb()      smp_mb()
> > #else
> > #define light_mb()      barrier()
> > #define heavy_mb()      synchronize_sched()
> > #endif
> 
> Except that there are no systems running Linux with in-order memory
> access.  Even x86 and s390 require a barrier instruction for smp_mb()
> on SMP=y builds.
> 
> 							Thanx, Paul

PA-RISC is in-order. But it is used very rarely.

Mikulas

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/2] brw_mutex: big read-write mutex
  2012-10-25 14:09                           ` Mikulas Patocka
@ 2012-10-25 23:40                             ` Dave Chinner
  2012-10-26 12:06                               ` Oleg Nesterov
  0 siblings, 1 reply; 103+ messages in thread
From: Dave Chinner @ 2012-10-25 23:40 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Peter Zijlstra, Oleg Nesterov, Paul E. McKenney, Linus Torvalds,
	Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli,
	Anton Arapov, linux-kernel, Thomas Gleixner

On Thu, Oct 25, 2012 at 10:09:31AM -0400, Mikulas Patocka wrote:
> 
> 
> On Wed, 24 Oct 2012, Dave Chinner wrote:
> 
> > On Fri, Oct 19, 2012 at 06:54:41PM -0400, Mikulas Patocka wrote:
> > > 
> > > 
> > > On Fri, 19 Oct 2012, Peter Zijlstra wrote:
> > > 
> > > > > Yes, I tried this approach - it involves doing LOCK instruction on read 
> > > > > lock, remembering the cpu and doing another LOCK instruction on read 
> > > > > unlock (which will hopefully be on the same CPU, so no cacheline bouncing 
> > > > > happens in the common case). It was slower than the approach without any 
> > > > > LOCK instructions (43.3 seconds seconds for the implementation with 
> > > > > per-cpu LOCKed access, 42.7 seconds for this implementation without atomic 
> > > > > instruction; the benchmark involved doing 512-byte direct-io reads and 
> > > > > writes on a ramdisk with 8 processes on 8-core machine).
> > > > 
> > > > So why is that a problem? Surely that's already tons better then what
> > > > you've currently got.
> > > 
> > > Percpu rw-semaphores do not improve performance at all. I put them there 
> > > to avoid performance regression, not to improve performance.
> > > 
> > > All Linux kernels have a race condition - when you change block size of a 
> > > block device and you read or write the device at the same time, a crash 
> > > may happen. This bug is there since ever. Recently, this bug started to 
> > > cause major trouble - multiple high profile business sites report crashes 
> > > because of this race condition.
> > >
> > > You can fix this race by using a read lock around I/O paths and write lock 
> > > around block size changing, but normal rw semaphore cause cache line 
> > > bouncing when taken for read by multiple processors and I/O performance 
> > > degradation because of it is measurable.
> > 
> > This doesn't sound like a new problem.  Hasn't this global access,
> > single modifier exclusion problem been solved before in the VFS?
> > e.g. mnt_want_write()/mnt_make_readonly()
> > 
> > Cheers,
> > 
> > Dave.
> 
> Yes, mnt_want_write()/mnt_make_readonly() do the same thing as percpu rw 
> semaphores. I think you can convert mnt_want_write()/mnt_make_readonly() 
> to use percpu rw semaphores and remove the duplicated code.

I think you misunderstood my point - that rather than re-inventing
the wheel, why didn't you just copy something that is known to
work?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/2] brw_mutex: big read-write mutex
  2012-10-25 23:40                             ` Dave Chinner
@ 2012-10-26 12:06                               ` Oleg Nesterov
  2012-10-26 13:22                                 ` Mikulas Patocka
  0 siblings, 1 reply; 103+ messages in thread
From: Oleg Nesterov @ 2012-10-26 12:06 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Mikulas Patocka, Peter Zijlstra, Paul E. McKenney,
	Linus Torvalds, Ingo Molnar, Srikar Dronamraju,
	Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel,
	Thomas Gleixner

On 10/26, Dave Chinner wrote:
>
> On Thu, Oct 25, 2012 at 10:09:31AM -0400, Mikulas Patocka wrote:
> >
> > Yes, mnt_want_write()/mnt_make_readonly() do the same thing as percpu rw
> > semaphores. I think you can convert mnt_want_write()/mnt_make_readonly()
> > to use percpu rw semaphores and remove the duplicated code.
>
> I think you misunderstood my point - that rather than re-inventing
> the wheel, why didn't you just copy something that is known to
> work?

I don't understand why you both think that __mnt_want_write()
and mnt_make_readonly() provide the same functionality. I looked
at this code before I started this patch, and unless I completely
misread it, it does very different things. It is not a "lock" at all.

Oleg.


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/2] brw_mutex: big read-write mutex
  2012-10-26 12:06                               ` Oleg Nesterov
@ 2012-10-26 13:22                                 ` Mikulas Patocka
  2012-10-26 14:12                                   ` Oleg Nesterov
  0 siblings, 1 reply; 103+ messages in thread
From: Mikulas Patocka @ 2012-10-26 13:22 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Dave Chinner, Peter Zijlstra, Paul E. McKenney, Linus Torvalds,
	Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli,
	Anton Arapov, linux-kernel, Thomas Gleixner



On Fri, 26 Oct 2012, Oleg Nesterov wrote:

> On 10/26, Dave Chinner wrote:
> >
> > On Thu, Oct 25, 2012 at 10:09:31AM -0400, Mikulas Patocka wrote:
> > >
> > > Yes, mnt_want_write()/mnt_make_readonly() do the same thing as percpu rw
> > > semaphores. I think you can convert mnt_want_write()/mnt_make_readonly()
> > > to use percpu rw semaphores and remove the duplicated code.
> >
> > I think you misunderstood my point - that rather than re-inventing
> > the wheel, why didn't you just copy something that is known to
> > work?

I didn't know about it. The code is not reusable, and it doesn't really do
locking. And it has two barriers on the read path, while percpu rw
semaphores have none.

> I don't understand why you both think that __mnt_want_write()
> and mnt_make_readonly() provide the same functionality. I looked
> at this code before I started this patch, and unless I completely
> misread it, it does very different things. It is not a "lock" at all.
> 
> Oleg.

mnt_want_write uses a percpu array of counters, just like percpu rw semaphores.

The code is different, but it can be changed to use percpu rw semaphores 
(if we add percpu_down_write_trylock).

- __mnt_want_write could call percpu_down_read and check whether the mount
  is readonly (if it is, drop the lock and return -EROFS)
- __mnt_drop_write could call percpu_up_read
- mnt_make_readonly and sb_prepare_remount_readonly could call
  percpu_down_write_trylock instead of mnt_get_writers (if they get the
  write lock, set the readonly flag and drop the write lock)

... and that's it; you could then remove MNT_WRITE_HOLD, the barriers,
spinning and other complexity from fs/namespace.c.
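
A very rough sketch of the idea (illustration only: percpu_down_write_trylock
does not exist yet and "mnt_rwsem" is a made-up field name; this glosses over
the real locking and layout details of fs/namespace.c):

int __mnt_want_write(struct vfsmount *m)
{
	struct mount *mnt = real_mount(m);

	percpu_down_read(&mnt->mnt_rwsem);
	if (mnt->mnt.mnt_flags & MNT_READONLY) {
		percpu_up_read(&mnt->mnt_rwsem);
		return -EROFS;
	}
	return 0;
}

void __mnt_drop_write(struct vfsmount *m)
{
	percpu_up_read(&real_mount(m)->mnt_rwsem);
}

static int mnt_make_readonly(struct mount *mnt)
{
	/* hypothetical primitive: must fail instead of sleeping */
	if (!percpu_down_write_trylock(&mnt->mnt_rwsem))
		return -EBUSY;
	mnt->mnt.mnt_flags |= MNT_READONLY;
	percpu_up_write(&mnt->mnt_rwsem);
	return 0;
}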

Mikulas

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/2] brw_mutex: big read-write mutex
  2012-10-26 13:22                                 ` Mikulas Patocka
@ 2012-10-26 14:12                                   ` Oleg Nesterov
  2012-10-26 15:23                                     ` mark_files_ro && sb_end_write Oleg Nesterov
  2012-10-26 16:09                                     ` [PATCH 1/2] brw_mutex: big read-write mutex Mikulas Patocka
  0 siblings, 2 replies; 103+ messages in thread
From: Oleg Nesterov @ 2012-10-26 14:12 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Dave Chinner, Peter Zijlstra, Paul E. McKenney, Linus Torvalds,
	Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli,
	Anton Arapov, linux-kernel, Thomas Gleixner

On 10/26, Mikulas Patocka wrote:
>
> On Fri, 26 Oct 2012, Oleg Nesterov wrote:
>
> I didn't know about it. The code is not reusable, and it doesn't really do
> locking.

That was my main point.

As for changing fs/namespace.c to use percpu_rwsem, I am not sure
it is that simple or even worthwhile, but I won't argue; I do not
pretend I understand this code.

> > I don't understand why you both think that __mnt_want_write()
> > and mnt_make_readonly() provide the same functionality. I looked
> > at this code before I started this patch, and unless I completely
> > misread it, it does very different things. It is not a "lock" at all.
> >
> > Oleg.
>
> mnt_want_write uses a percpu array of counters, just like percpu rw semaphores.

and this is all imo ;)

> The code is different, but it can be changed to use percpu rw semaphores
> (if we add percpu_down_write_trylock).

I don't really understand how you can make percpu_down_write_trylock()
atomic so that it can be called under br_write_lock(vfsmount_lock) in
sb_prepare_remount_readonly(). So I guess you also need to replace
vfsmount_lock at least. Or _trylock needs the barriers in _down_read.
Or I missed something.

Oleg.


^ permalink raw reply	[flat|nested] 103+ messages in thread

* mark_files_ro && sb_end_write
  2012-10-26 14:12                                   ` Oleg Nesterov
@ 2012-10-26 15:23                                     ` Oleg Nesterov
  2012-10-26 16:09                                     ` [PATCH 1/2] brw_mutex: big read-write mutex Mikulas Patocka
  1 sibling, 0 replies; 103+ messages in thread
From: Oleg Nesterov @ 2012-10-26 15:23 UTC (permalink / raw)
  To: Mikulas Patocka, Al Viro
  Cc: Dave Chinner, Peter Zijlstra, Paul E. McKenney, Linus Torvalds,
	Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli,
	Anton Arapov, linux-kernel, Thomas Gleixner

On 10/26, Oleg Nesterov wrote:
>
> As for changing fs/namespace.c to use percpu_rwsem, I am not sure
> it is that simple or even worthwhile, but I won't argue; I do not
> pretend I understand this code.

BTW, speaking about these counters...

Is mark_files_ro()->mnt_drop_write_file() properly balanced?
__mnt_drop_write() looks fine, but sb_end_write?

Oleg.


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/2] brw_mutex: big read-write mutex
  2012-10-26 14:12                                   ` Oleg Nesterov
  2012-10-26 15:23                                     ` mark_files_ro && sb_end_write Oleg Nesterov
@ 2012-10-26 16:09                                     ` Mikulas Patocka
  1 sibling, 0 replies; 103+ messages in thread
From: Mikulas Patocka @ 2012-10-26 16:09 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Dave Chinner, Peter Zijlstra, Paul E. McKenney, Linus Torvalds,
	Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli,
	Anton Arapov, linux-kernel, Thomas Gleixner



On Fri, 26 Oct 2012, Oleg Nesterov wrote:

> > The code is different, but it can be changed to use percpu rw semaphores
> > (if we add percpu_down_write_trylock).
> 
> I don't really understand how you can make percpu_down_write_trylock()
> atomic so that it can be called under br_write_lock(vfsmount_lock) in
> sb_prepare_remount_readonly(). So I guess you also need to replace
> vfsmount_lock at least. Or _trylock needs the barriers in _down_read.
> Or I missed something.
> 
> Oleg.

That's true - that code runs under a spinlock, and you can't implement a
non-blocking percpu_down_write_trylock.

Mikulas

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 0/2] fix and improvements for percpu-rw-semaphores (was: brw_mutex: big read-write mutex)
  2012-10-22 23:36                 ` [PATCH 0/2] fix and improvements for percpu-rw-semaphores (was: brw_mutex: big read-write mutex) Mikulas Patocka
  2012-10-22 23:37                   ` [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers Mikulas Patocka
@ 2012-10-30 18:48                   ` Oleg Nesterov
  2012-10-31 19:41                     ` [PATCH 0/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily Oleg Nesterov
  1 sibling, 1 reply; 103+ messages in thread
From: Oleg Nesterov @ 2012-10-30 18:48 UTC (permalink / raw)
  To: Mikulas Patocka, Paul E. McKenney, Peter Zijlstra
  Cc: Linus Torvalds, Ingo Molnar, Srikar Dronamraju,
	Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel

On 10/22, Mikulas Patocka wrote:
>
> > > Ooooh. And I just noticed include/linux/percpu-rwsem.h which does
> > > something similar. Certainly it was not in my tree when I started
> > > this patch... percpu_down_write() doesn't allow multiple writers,
> > > but the main problem it uses msleep(1). It should not, I think.

But, since we already have percpu_rw_semaphore, I do not think I can
add another similar thing.

However, percpu_rw_semaphore is sub-optimal; I am not sure uprobes can
use it to block dup_mmap(). Perhaps we can improve it?
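
For context, the intended use in uprobes would be roughly the following
(a sketch only; "dup_mmap_sem" is an illustrative name, and the semaphore
would be set up once with percpu_init_rwsem()):

	static struct percpu_rw_semaphore dup_mmap_sem;

	/* fork path */
	percpu_down_read(&dup_mmap_sem);
	/* ... dup_mmap() copies the vmas of the parent ... */
	percpu_up_read(&dup_mmap_sem);

	/* uprobe register/unregister path */
	percpu_down_write(&dup_mmap_sem);
	/* ... install or remove breakpoints in every relevant mm ... */
	percpu_up_write(&dup_mmap_sem);

The point is that the fork path must stay cheap, while register/unregister
can afford the expensive write side.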

> > > But. It seems that percpu_up_write() is equally wrong? Doesn't
> > > it need synchronize_rcu() before "p->locked = false" ?
> > >
> > > (add Mikulas)
> >
> > Mikulas said something about doing an updated patch, so I figured I
> > would look at his next version.
> >
> > 							Thanx, Paul
>
> The best ideas proposed in this thread are:
>
>
> Using heavy/light barries by Lai Jiangshan.

So. down_write/up_write does msleep() and it needs to call
synchronize_sched() 3 times.

This looks like too much. It is not that I am worried about the writers,
the problem is that the new readers are blocked completely while the
writer sleeps in msleep/synchronize_sched.

Paul, Mikulas, et al. Could you please look at the new implementation
below? Completely untested/uncompiled, just for discussion.

Compared to the current implementation, down_read() is still possible
while the writer sleeps in synchronize_sched(), but the reader uses
rw_semaphore/atomic_inc when it detects the waiting writer.

Can this work? Do you think this is better than what we have now?


Note: we can probably optimize percpu_down/up_write further; we can
"factor out" synchronize_sched() so that multiple writers can do it in
parallel before they take ->writer_mutex to exclude each other.
But this won't affect the readers, and it can be done later.

Oleg.

------------------------------------------------------------------------------
struct percpu_rw_semaphore {
	long __percpu		*fast_read_ctr;
	struct mutex		writer_mutex;
	struct rw_semaphore	rw_sem;
	atomic_t		slow_read_ctr;
	wait_queue_head_t	write_waitq;
};

static bool update_fast_ctr(struct percpu_rw_semaphore *brw, long val)
{
	bool success = false;

	preempt_disable();
	if (likely(!mutex_is_locked(&brw->writer_mutex))) {
		__this_cpu_add(*brw->fast_read_ctr, val);
		success = true;
	}
	preempt_enable();

	return success;
}

static long clear_fast_read_ctr(struct percpu_rw_semaphore *brw)
{
	long sum = 0;
	int cpu;

	for_each_possible_cpu(cpu) {
		sum += per_cpu(*brw->fast_read_ctr, cpu);
		per_cpu(*brw->fast_read_ctr, cpu) = 0;
	}

	return sum;
}

void percpu_down_read(struct percpu_rw_semaphore *brw)
{
	if (likely(update_fast_ctr(brw, +1)))
		return;

	down_read(&brw->rw_sem);
	atomic_inc(&brw->slow_read_ctr);
	up_read(&brw->rw_sem);
}

void percpu_up_read(struct percpu_rw_semaphore *brw)
{
	if (likely(update_fast_ctr(brw, -1)))
		return;

	if (atomic_dec_and_test(&brw->slow_read_ctr))
		wake_up_all(&brw->write_waitq);
}

void percpu_down_write(struct percpu_rw_semaphore *brw)
{
	mutex_lock(&brw->writer_mutex);
	/* ensure mutex_is_locked() is visible to the readers */
	synchronize_sched();

	/* block the new readers */
	down_write(&brw->rw_sem);

	atomic_add(clear_fast_read_ctr(brw), &brw->slow_read_ctr);

	wait_event(brw->write_waitq, !atomic_read(&brw->slow_read_ctr));
}

void percpu_up_write(struct percpu_rw_semaphore *brw)
{
	up_write(&brw->rw_sem);
	/* insert the barrier before the next fast-path in down_read */
	synchronize_sched();

	mutex_unlock(&brw->writer_mutex);
}


^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH 0/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
  2012-10-30 18:48                   ` [PATCH 0/2] fix and improvements for percpu-rw-semaphores (was: brw_mutex: big read-write mutex) Oleg Nesterov
@ 2012-10-31 19:41                     ` Oleg Nesterov
  2012-10-31 19:41                       ` [PATCH 1/1] " Oleg Nesterov
  0 siblings, 1 reply; 103+ messages in thread
From: Oleg Nesterov @ 2012-10-31 19:41 UTC (permalink / raw)
  To: Mikulas Patocka, Paul E. McKenney, Peter Zijlstra, Linus Torvalds
  Cc: Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli,
	Anton Arapov, linux-kernel

On 10/30, Oleg Nesterov wrote:
>
> So. down_write/up_write does msleep() and it needs to call
> synchronize_sched() 3 times.
>
> This looks like too much. It is not that I am worried about the writers,
> the problem is that the new readers are blocked completely while the
> writer sleeps in msleep/synchronize_sched.
>
> Paul, Mikulas, et al. Could you please look at the new implementation
> below? Completely untested/uncompiled, just for discussion.

I tried to test it; it seems to work...

But. I guess the only valid test is: pass the review from Paul/Peter.

Todo:
	- add the lockdep annotations

	- we can speed up the down_write-right-after-up_write case

What do you all think?

Oleg.


^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
  2012-10-31 19:41                     ` [PATCH 0/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily Oleg Nesterov
@ 2012-10-31 19:41                       ` Oleg Nesterov
  2012-11-01 15:10                         ` Linus Torvalds
  2012-11-01 15:43                         ` [PATCH 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily Paul E. McKenney
  0 siblings, 2 replies; 103+ messages in thread
From: Oleg Nesterov @ 2012-10-31 19:41 UTC (permalink / raw)
  To: Mikulas Patocka, Paul E. McKenney, Peter Zijlstra, Linus Torvalds
  Cc: Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli,
	Anton Arapov, linux-kernel

Currently the writer does msleep() plus synchronize_sched() 3 times
to acquire/release the semaphore, and during this time the readers
are blocked completely. Even if the "write" section was not actually
started or if it was already finished.

With this patch down_read/up_read does synchronize_sched() twice and
down_read/up_read are still possible during this time, just they use
the slow path.

percpu_down_write() first forces the readers to use rw_semaphore and
increment the "slow" counter to take the lock for reading, then it
takes that rw_semaphore for writing and blocks the readers.

Also. With this patch the code relies on the documented behaviour of
synchronize_sched(), it doesn't try to pair synchronize_sched() with
barrier.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 include/linux/percpu-rwsem.h |   83 +++++---------------------------
 lib/Makefile                 |    2 +-
 lib/percpu-rwsem.c           |  106 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 120 insertions(+), 71 deletions(-)
 create mode 100644 lib/percpu-rwsem.c

diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
index 250a4ac..7f738ca 100644
--- a/include/linux/percpu-rwsem.h
+++ b/include/linux/percpu-rwsem.h
@@ -2,82 +2,25 @@
 #define _LINUX_PERCPU_RWSEM_H
 
 #include <linux/mutex.h>
+#include <linux/rwsem.h>
 #include <linux/percpu.h>
-#include <linux/rcupdate.h>
-#include <linux/delay.h>
+#include <linux/wait.h>
 
 struct percpu_rw_semaphore {
-	unsigned __percpu *counters;
-	bool locked;
-	struct mutex mtx;
+	int __percpu		*fast_read_ctr;
+	struct mutex		writer_mutex;
+	struct rw_semaphore	rw_sem;
+	atomic_t		slow_read_ctr;
+	wait_queue_head_t	write_waitq;
 };
 
-#define light_mb()	barrier()
-#define heavy_mb()	synchronize_sched()
+extern void percpu_down_read(struct percpu_rw_semaphore *);
+extern void percpu_up_read(struct percpu_rw_semaphore *);
 
-static inline void percpu_down_read(struct percpu_rw_semaphore *p)
-{
-	rcu_read_lock_sched();
-	if (unlikely(p->locked)) {
-		rcu_read_unlock_sched();
-		mutex_lock(&p->mtx);
-		this_cpu_inc(*p->counters);
-		mutex_unlock(&p->mtx);
-		return;
-	}
-	this_cpu_inc(*p->counters);
-	rcu_read_unlock_sched();
-	light_mb(); /* A, between read of p->locked and read of data, paired with D */
-}
+extern void percpu_down_write(struct percpu_rw_semaphore *);
+extern void percpu_up_write(struct percpu_rw_semaphore *);
 
-static inline void percpu_up_read(struct percpu_rw_semaphore *p)
-{
-	light_mb(); /* B, between read of the data and write to p->counter, paired with C */
-	this_cpu_dec(*p->counters);
-}
-
-static inline unsigned __percpu_count(unsigned __percpu *counters)
-{
-	unsigned total = 0;
-	int cpu;
-
-	for_each_possible_cpu(cpu)
-		total += ACCESS_ONCE(*per_cpu_ptr(counters, cpu));
-
-	return total;
-}
-
-static inline void percpu_down_write(struct percpu_rw_semaphore *p)
-{
-	mutex_lock(&p->mtx);
-	p->locked = true;
-	synchronize_sched(); /* make sure that all readers exit the rcu_read_lock_sched region */
-	while (__percpu_count(p->counters))
-		msleep(1);
-	heavy_mb(); /* C, between read of p->counter and write to data, paired with B */
-}
-
-static inline void percpu_up_write(struct percpu_rw_semaphore *p)
-{
-	heavy_mb(); /* D, between write to data and write to p->locked, paired with A */
-	p->locked = false;
-	mutex_unlock(&p->mtx);
-}
-
-static inline int percpu_init_rwsem(struct percpu_rw_semaphore *p)
-{
-	p->counters = alloc_percpu(unsigned);
-	if (unlikely(!p->counters))
-		return -ENOMEM;
-	p->locked = false;
-	mutex_init(&p->mtx);
-	return 0;
-}
-
-static inline void percpu_free_rwsem(struct percpu_rw_semaphore *p)
-{
-	free_percpu(p->counters);
-	p->counters = NULL; /* catch use after free bugs */
-}
+extern int percpu_init_rwsem(struct percpu_rw_semaphore *);
+extern void percpu_free_rwsem(struct percpu_rw_semaphore *);
 
 #endif
diff --git a/lib/Makefile b/lib/Makefile
index 821a162..4dad4a7 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -12,7 +12,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
 	 idr.o int_sqrt.o extable.o \
 	 sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \
 	 proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \
-	 is_single_threaded.o plist.o decompress.o
+	 is_single_threaded.o plist.o decompress.o percpu-rwsem.o
 
 lib-$(CONFIG_MMU) += ioremap.o
 lib-$(CONFIG_SMP) += cpumask.o
diff --git a/lib/percpu-rwsem.c b/lib/percpu-rwsem.c
new file mode 100644
index 0000000..40a415d
--- /dev/null
+++ b/lib/percpu-rwsem.c
@@ -0,0 +1,106 @@
+#include <linux/percpu-rwsem.h>
+#include <linux/rcupdate.h>
+#include <linux/sched.h>
+
+int percpu_init_rwsem(struct percpu_rw_semaphore *brw)
+{
+	brw->fast_read_ctr = alloc_percpu(int);
+	if (unlikely(!brw->fast_read_ctr))
+		return -ENOMEM;
+
+	mutex_init(&brw->writer_mutex);
+	init_rwsem(&brw->rw_sem);
+	atomic_set(&brw->slow_read_ctr, 0);
+	init_waitqueue_head(&brw->write_waitq);
+	return 0;
+}
+
+void percpu_free_rwsem(struct percpu_rw_semaphore *brw)
+{
+	free_percpu(brw->fast_read_ctr);
+	brw->fast_read_ctr = NULL; /* catch use after free bugs */
+}
+
+static bool update_fast_ctr(struct percpu_rw_semaphore *brw, int val)
+{
+	bool success = false;
+
+	preempt_disable();
+	if (likely(!mutex_is_locked(&brw->writer_mutex))) {
+		__this_cpu_add(*brw->fast_read_ctr, val);
+		success = true;
+	}
+	preempt_enable();
+
+	return success;
+}
+
+void percpu_down_read(struct percpu_rw_semaphore *brw)
+{
+	if (likely(update_fast_ctr(brw, +1)))
+		return;
+
+	down_read(&brw->rw_sem);
+	atomic_inc(&brw->slow_read_ctr);
+	up_read(&brw->rw_sem);
+}
+
+void percpu_up_read(struct percpu_rw_semaphore *brw)
+{
+	if (likely(update_fast_ctr(brw, -1)))
+		return;
+
+	/* false-positive is possible but harmless */
+	if (atomic_dec_and_test(&brw->slow_read_ctr))
+		wake_up_all(&brw->write_waitq);
+}
+
+static int clear_fast_read_ctr(struct percpu_rw_semaphore *brw)
+{
+	int cpu, sum = 0;
+
+	for_each_possible_cpu(cpu) {
+		sum += per_cpu(*brw->fast_read_ctr, cpu);
+		per_cpu(*brw->fast_read_ctr, cpu) = 0;
+	}
+
+	return sum;
+}
+
+void percpu_down_write(struct percpu_rw_semaphore *brw)
+{
+	/* also blocks update_fast_ctr() which checks mutex_is_locked() */
+	mutex_lock(&brw->writer_mutex);
+
+	/*
+	 * 1. Ensures mutex_is_locked() is visible to any down_read/up_read
+	 *    so that update_fast_ctr() can't succeed.
+	 *
+	 * 2. Ensures we see the result of every previous this_cpu_add() in
+	 *    update_fast_ctr().
+	 *
+	 * 3. Ensures that if any reader has exited its critical section via
+	 *    fast-path, it executes a full memory barrier before we return.
+	 */
+	synchronize_sched();
+
+	/* nobody can use fast_read_ctr, move its sum into slow_read_ctr */
+	atomic_add(clear_fast_read_ctr(brw), &brw->slow_read_ctr);
+
+	/* block the new readers completely */
+	down_write(&brw->rw_sem);
+
+	/* wait for all readers to complete their percpu_up_read() */
+	wait_event(brw->write_waitq, !atomic_read(&brw->slow_read_ctr));
+}
+
+void percpu_up_write(struct percpu_rw_semaphore *brw)
+{
+	/* allow the new readers, but only the slow-path */
+	up_write(&brw->rw_sem);
+
+	/* insert the barrier before the next fast-path in down_read */
+	synchronize_sched();
+
+	mutex_unlock(&brw->writer_mutex);
+}
-- 
1.5.5.1



^ permalink raw reply related	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
  2012-10-31 19:41                       ` [PATCH 1/1] " Oleg Nesterov
@ 2012-11-01 15:10                         ` Linus Torvalds
  2012-11-01 15:34                           ` Oleg Nesterov
  2012-11-02 18:06                           ` [PATCH v2 0/1] " Oleg Nesterov
  2012-11-01 15:43                         ` [PATCH 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily Paul E. McKenney
  1 sibling, 2 replies; 103+ messages in thread
From: Linus Torvalds @ 2012-11-01 15:10 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Mikulas Patocka, Paul E. McKenney, Peter Zijlstra, Ingo Molnar,
	Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov,
	linux-kernel

On Wed, Oct 31, 2012 at 12:41 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> Currently the writer does msleep() plus synchronize_sched() 3 times
> to acquire/release the semaphore, and during this time the readers
> are blocked completely. Even if the "write" section was not actually
> started or if it was already finished.
>
> With this patch down_read/up_read does synchronize_sched() twice and
> down_read/up_read are still possible during this time, just they use
> the slow path.

The changelog is wrong (it's the write path, not read path, that does
the synchronize_sched).

>  struct percpu_rw_semaphore {
> -       unsigned __percpu *counters;
> -       bool locked;
> -       struct mutex mtx;
> +       int __percpu            *fast_read_ctr;

This change is wrong.

You must not make the 'fast_read_ctr' thing be an int. Or at least you
need to be a hell of a lot more careful about it.

Why?

Because the readers update the counters while possibly moving around
cpu's, the increment and decrement of the counters may be on different
CPU's. But that means that when you add all the counters together,
things can overflow (only the final sum is meaningful). And THAT in
turn means that you should not use a signed count, for the simple
reason that signed integers don't have well-behaved overflow behavior
in C.

Now, I doubt you'll find an architecture or C compiler where this will
actually ever make a difference, but the fact remains that you
shouldn't use signed integers for counters like this. You should use
unsigned, and you should rely on the well-defined modulo-2**n
semantics.
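
As a toy illustration (not part of the patch): a reader that takes the
lock on one CPU and drops it on another leaves each per-cpu counter
meaningless on its own, but the unsigned sum is exact:

	unsigned int ctr0 = 0, ctr1 = 0;

	ctr0 += 1;	/* percpu_down_read() runs on CPU 0 */
	ctr1 -= 1;	/* percpu_up_read() runs on CPU 1, wraps to UINT_MAX */

	/* unsigned addition is modulo 2^32 here, so the sum is exactly 0 */
	unsigned int sum = ctr0 + ctr1;

With a signed type that same wrap-around would be undefined behavior.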

I'd also like to see a comment somewhere in the source code about the
whole algorithm and the rules.

Other than that, I guess it looks ok.

            Linus

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
  2012-11-01 15:10                         ` Linus Torvalds
@ 2012-11-01 15:34                           ` Oleg Nesterov
  2012-11-02 18:06                           ` [PATCH v2 0/1] " Oleg Nesterov
  1 sibling, 0 replies; 103+ messages in thread
From: Oleg Nesterov @ 2012-11-01 15:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mikulas Patocka, Paul E. McKenney, Peter Zijlstra, Ingo Molnar,
	Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov,
	linux-kernel

Thanks!

I'll send v2 tomorrow.

On 11/01, Linus Torvalds wrote:
> On Wed, Oct 31, 2012 at 12:41 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> > Currently the writer does msleep() plus synchronize_sched() 3 times
> > to acquire/release the semaphore, and during this time the readers
> > are blocked completely. Even if the "write" section was not actually
> > started or if it was already finished.
> >
> > With this patch down_read/up_read does synchronize_sched() twice and
> > down_read/up_read are still possible during this time, just they use
> > the slow path.
> 
> The changelog is wrong (it's the write path, not read path, that does
> the synchronize_sched).
> 
> >  struct percpu_rw_semaphore {
> > -       unsigned __percpu *counters;
> > -       bool locked;
> > -       struct mutex mtx;
> > +       int __percpu            *fast_read_ctr;
> 
> This change is wrong.
> 
> You must not make the 'fast_read_ctr' thing be an int. Or at least you
> need to be a hell of a lot more careful about it.
> 
> Why?
> 
> Because the readers update the counters while possibly moving around
> cpu's, the increment and decrement of the counters may be on different
> CPU's. But that means that when you add all the counters together,
> things can overflow (only the final sum is meaningful). And THAT in
> turn means that you should not use a signed count, for the simple
> reason that signed integers don't have well-behaved overflow behavior
> in C.
> 
> Now, I doubt you'll find an architecture or C compiler where this will
> actually ever make a difference, but the fact remains that you
> shouldn't use signed integers for counters like this. You should use
> unsigned, and you should rely on the well-defined modulo-2**n
> semantics.
> 
> I'd also like to see a comment somewhere in the source code about the
> whole algorithm and the rules.
> 
> Other than that, I guess it looks ok.
> 
>             Linus


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
  2012-10-31 19:41                       ` [PATCH 1/1] " Oleg Nesterov
  2012-11-01 15:10                         ` Linus Torvalds
@ 2012-11-01 15:43                         ` Paul E. McKenney
  2012-11-01 18:33                           ` Oleg Nesterov
  1 sibling, 1 reply; 103+ messages in thread
From: Paul E. McKenney @ 2012-11-01 15:43 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Mikulas Patocka, Peter Zijlstra, Linus Torvalds, Ingo Molnar,
	Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov,
	linux-kernel

On Wed, Oct 31, 2012 at 08:41:58PM +0100, Oleg Nesterov wrote:
> Currently the writer does msleep() plus synchronize_sched() 3 times
> to acquire/release the semaphore, and during this time the readers
> are blocked completely. Even if the "write" section was not actually
> started or if it was already finished.
> 
> With this patch down_read/up_read does synchronize_sched() twice and
> down_read/up_read are still possible during this time, just they use
> the slow path.
> 
> percpu_down_write() first forces the readers to use rw_semaphore and
> increment the "slow" counter to take the lock for reading, then it
> takes that rw_semaphore for writing and blocks the readers.
> 
> Also. With this patch the code relies on the documented behaviour of
> synchronize_sched(), it doesn't try to pair synchronize_sched() with
> barrier.

OK, so it looks to me that this code relies on synchronize_sched()
forcing a memory barrier on each CPU executing in the kernel.  I
might well be confused, so here is the sequence of events that leads
me to believe this:

1.	A task running on CPU 0 currently write-holds the lock.

2.	CPU 1 is running in the kernel, executing a longer-than-average
	loop of normal instructions (no atomic instructions or memory
	barriers).

3.	CPU 0 invokes percpu_up_write(), calling up_write(),
	synchronize_sched(), and finally mutex_unlock().

4.	CPU 1 executes percpu_down_read(), which calls update_fast_ctr(),
	which finds that ->writer_mutex is not held.  CPU 1 therefore
	increments ->fast_read_ctr and returns success.

Of course, as Mikulas pointed out, the actual implementation will
have forced CPU 1 to execute a memory barrier in the course of the
synchronize_sched() implementation.  However, if synchronize_sched() had
been modified to act as synchronize_srcu() currently does, there would
be no memory barrier, and thus no guarantee that CPU 1's subsequent
read-side critical section would see the effect of CPU 0's previous
write-side critical section.

Fortunately, this is easy to fix, with zero added overhead on the
read-side fastpath, as shown by the notes interspersed below.

Thoughts?

						Thanx, Paul

> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> ---
>  include/linux/percpu-rwsem.h |   83 +++++---------------------------
>  lib/Makefile                 |    2 +-
>  lib/percpu-rwsem.c           |  106 ++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 120 insertions(+), 71 deletions(-)
>  create mode 100644 lib/percpu-rwsem.c
> 
> diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
> index 250a4ac..7f738ca 100644
> --- a/include/linux/percpu-rwsem.h
> +++ b/include/linux/percpu-rwsem.h
> @@ -2,82 +2,25 @@
>  #define _LINUX_PERCPU_RWSEM_H
> 
>  #include <linux/mutex.h>
> +#include <linux/rwsem.h>
>  #include <linux/percpu.h>
> -#include <linux/rcupdate.h>
> -#include <linux/delay.h>
> +#include <linux/wait.h>
> 
>  struct percpu_rw_semaphore {
> -	unsigned __percpu *counters;
> -	bool locked;
> -	struct mutex mtx;
> +	int __percpu		*fast_read_ctr;
> +	struct mutex		writer_mutex;
> +	struct rw_semaphore	rw_sem;
> +	atomic_t		slow_read_ctr;
> +	wait_queue_head_t	write_waitq;
	int			wstate;
>  };
> 
> -#define light_mb()	barrier()
> -#define heavy_mb()	synchronize_sched()
> +extern void percpu_down_read(struct percpu_rw_semaphore *);
> +extern void percpu_up_read(struct percpu_rw_semaphore *);
> 
> -static inline void percpu_down_read(struct percpu_rw_semaphore *p)
> -{
> -	rcu_read_lock_sched();
> -	if (unlikely(p->locked)) {
> -		rcu_read_unlock_sched();
> -		mutex_lock(&p->mtx);
> -		this_cpu_inc(*p->counters);
> -		mutex_unlock(&p->mtx);
> -		return;
> -	}
> -	this_cpu_inc(*p->counters);
> -	rcu_read_unlock_sched();
> -	light_mb(); /* A, between read of p->locked and read of data, paired with D */
> -}
> +extern void percpu_down_write(struct percpu_rw_semaphore *);
> +extern void percpu_up_write(struct percpu_rw_semaphore *);
> 
> -static inline void percpu_up_read(struct percpu_rw_semaphore *p)
> -{
> -	light_mb(); /* B, between read of the data and write to p->counter, paired with C */
> -	this_cpu_dec(*p->counters);
> -}
> -
> -static inline unsigned __percpu_count(unsigned __percpu *counters)
> -{
> -	unsigned total = 0;
> -	int cpu;
> -
> -	for_each_possible_cpu(cpu)
> -		total += ACCESS_ONCE(*per_cpu_ptr(counters, cpu));
> -
> -	return total;
> -}
> -
> -static inline void percpu_down_write(struct percpu_rw_semaphore *p)
> -{
> -	mutex_lock(&p->mtx);
> -	p->locked = true;
> -	synchronize_sched(); /* make sure that all readers exit the rcu_read_lock_sched region */
> -	while (__percpu_count(p->counters))
> -		msleep(1);
> -	heavy_mb(); /* C, between read of p->counter and write to data, paired with B */
> -}
> -
> -static inline void percpu_up_write(struct percpu_rw_semaphore *p)
> -{
> -	heavy_mb(); /* D, between write to data and write to p->locked, paired with A */
> -	p->locked = false;
> -	mutex_unlock(&p->mtx);
> -}
> -
> -static inline int percpu_init_rwsem(struct percpu_rw_semaphore *p)
> -{
> -	p->counters = alloc_percpu(unsigned);
> -	if (unlikely(!p->counters))
> -		return -ENOMEM;
> -	p->locked = false;
> -	mutex_init(&p->mtx);
> -	return 0;
> -}
> -
> -static inline void percpu_free_rwsem(struct percpu_rw_semaphore *p)
> -{
> -	free_percpu(p->counters);
> -	p->counters = NULL; /* catch use after free bugs */
> -}
> +extern int percpu_init_rwsem(struct percpu_rw_semaphore *);
> +extern void percpu_free_rwsem(struct percpu_rw_semaphore *);
> 
>  #endif
> diff --git a/lib/Makefile b/lib/Makefile
> index 821a162..4dad4a7 100644
> --- a/lib/Makefile
> +++ b/lib/Makefile
> @@ -12,7 +12,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
>  	 idr.o int_sqrt.o extable.o \
>  	 sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \
>  	 proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \
> -	 is_single_threaded.o plist.o decompress.o
> +	 is_single_threaded.o plist.o decompress.o percpu-rwsem.o
> 
>  lib-$(CONFIG_MMU) += ioremap.o
>  lib-$(CONFIG_SMP) += cpumask.o
> diff --git a/lib/percpu-rwsem.c b/lib/percpu-rwsem.c
> new file mode 100644
> index 0000000..40a415d
> --- /dev/null
> +++ b/lib/percpu-rwsem.c
> @@ -0,0 +1,106 @@
> +#include <linux/percpu-rwsem.h>
> +#include <linux/rcupdate.h>
> +#include <linux/sched.h>

#define WSTATE_NEED_LOCK 1
#define WSTATE_NEED_MB	 2

> +int percpu_init_rwsem(struct percpu_rw_semaphore *brw)
> +{
> +	brw->fast_read_ctr = alloc_percpu(int);
> +	if (unlikely(!brw->fast_read_ctr))
> +		return -ENOMEM;
> +
> +	mutex_init(&brw->writer_mutex);
> +	init_rwsem(&brw->rw_sem);
> +	atomic_set(&brw->slow_read_ctr, 0);
> +	init_waitqueue_head(&brw->write_waitq);
> +	return 0;
> +}
> +
> +void percpu_free_rwsem(struct percpu_rw_semaphore *brw)
> +{
> +	free_percpu(brw->fast_read_ctr);
> +	brw->fast_read_ctr = NULL; /* catch use after free bugs */
> +}
> +
> +static bool update_fast_ctr(struct percpu_rw_semaphore *brw, int val)
> +{
> +	bool success = false;

	int state;

> +
> +	preempt_disable();
> +	if (likely(!mutex_is_locked(&brw->writer_mutex))) {

	state = ACCESS_ONCE(brw->wstate);
	if (likely(!state)) {

> +		__this_cpu_add(*brw->fast_read_ctr, val);
> +		success = true;

	} else if (state & WSTATE_NEED_MB) {
		__this_cpu_add(*brw->fast_read_ctr, val);
		smp_mb(); /* Order increment against critical section. */
		success = true;
	}

> +	preempt_enable();
> +
> +	return success;
> +}
> +
> +void percpu_down_read(struct percpu_rw_semaphore *brw)
> +{
> +	if (likely(update_fast_ctr(brw, +1)))
> +		return;
> +
> +	down_read(&brw->rw_sem);
> +	atomic_inc(&brw->slow_read_ctr);
> +	up_read(&brw->rw_sem);
> +}
> +
> +void percpu_up_read(struct percpu_rw_semaphore *brw)
> +{
> +	if (likely(update_fast_ctr(brw, -1)))
> +		return;
> +
> +	/* false-positive is possible but harmless */
> +	if (atomic_dec_and_test(&brw->slow_read_ctr))
> +		wake_up_all(&brw->write_waitq);
> +}
> +
> +static int clear_fast_read_ctr(struct percpu_rw_semaphore *brw)
> +{
> +	int cpu, sum = 0;
> +
> +	for_each_possible_cpu(cpu) {
> +		sum += per_cpu(*brw->fast_read_ctr, cpu);
> +		per_cpu(*brw->fast_read_ctr, cpu) = 0;
> +	}
> +
> +	return sum;
> +}
> +
> +void percpu_down_write(struct percpu_rw_semaphore *brw)
> +{
> +	/* also blocks update_fast_ctr() which checks mutex_is_locked() */
> +	mutex_lock(&brw->writer_mutex);

	ACCESS_ONCE(brw->wstate) = WSTATE_NEED_LOCK;

> +	/*
> +	 * 1. Ensures mutex_is_locked() is visible to any down_read/up_read
> +	 *    so that update_fast_ctr() can't succeed.
> +	 *
> +	 * 2. Ensures we see the result of every previous this_cpu_add() in
> +	 *    update_fast_ctr().
> +	 *
> +	 * 3. Ensures that if any reader has exited its critical section via
> +	 *    fast-path, it executes a full memory barrier before we return.
> +	 */
> +	synchronize_sched();
> +
> +	/* nobody can use fast_read_ctr, move its sum into slow_read_ctr */
> +	atomic_add(clear_fast_read_ctr(brw), &brw->slow_read_ctr);
> +
> +	/* block the new readers completely */
> +	down_write(&brw->rw_sem);
> +
> +	/* wait for all readers to complete their percpu_up_read() */
> +	wait_event(brw->write_waitq, !atomic_read(&brw->slow_read_ctr));
> +}
> +
> +void percpu_up_write(struct percpu_rw_semaphore *brw)
> +{
> +	/* allow the new readers, but only the slow-path */
> +	up_write(&brw->rw_sem);

	ACCESS_ONCE(brw->wstate) = WSTATE_NEED_MB;

> +
> +	/* insert the barrier before the next fast-path in down_read */
> +	synchronize_sched();

	ACCESS_ONCE(brw->wstate) = 0;

> +	mutex_unlock(&brw->writer_mutex);
> +}

OK, o


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
  2012-11-01 15:43                         ` [PATCH 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily Paul E. McKenney
@ 2012-11-01 18:33                           ` Oleg Nesterov
  2012-11-02 16:18                             ` Oleg Nesterov
  0 siblings, 1 reply; 103+ messages in thread
From: Oleg Nesterov @ 2012-11-01 18:33 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Mikulas Patocka, Peter Zijlstra, Linus Torvalds, Ingo Molnar,
	Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov,
	linux-kernel

Paul, thanks.

Sorry, I can't reply today, just one note...

On 11/01, Paul E. McKenney wrote:
>
> OK, so it looks to me that this code relies on synchronize_sched()
> forcing a memory barrier on each CPU executing in the kernel.

No, the patch tries to avoid this assumption, but probably I missed
something.

> 1.	A task running on CPU 0 currently write-holds the lock.
>
> 2.	CPU 1 is running in the kernel, executing a longer-than-average
> 	loop of normal instructions (no atomic instructions or memory
> 	barriers).
>
> 3.	CPU 0 invokes percpu_up_write(), calling up_write(),
> 	synchronize_sched(), and finally mutex_unlock().

And my expectation was, this should be enough because ...

> 4.	CPU 1 executes percpu_down_read(), which calls update_fast_ctr(),

since update_fast_ctr does preempt_disable/enable it should see all
modifications done by CPU 0.

IOW. Suppose that the writer (CPU 0) does

	percpu_down_write();
	STORE;
	percpu_up_write();

This means

	STORE;
	synchronize_sched();
	mutex_unlock();

Now. Do you mean that the next preempt_disable/enable can see the
result of mutex_unlock() but not STORE?
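
To make that concrete (the reader side is the fast path from the patch,
with LOAD standing in for the reader's critical section):

	/* CPU 1, some time after the synchronize_sched() above */
	preempt_disable();				/* update_fast_ctr() */
	if (!mutex_is_locked(&brw->writer_mutex))	/* sees "unlocked" */
		__this_cpu_add(*brw->fast_read_ctr, 1);
	preempt_enable();
	LOAD;						/* reader's critical section */

The question is whether this LOAD can fail to see the STORE.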

Oleg.


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
  2012-11-01 18:33                           ` Oleg Nesterov
@ 2012-11-02 16:18                             ` Oleg Nesterov
  0 siblings, 0 replies; 103+ messages in thread
From: Oleg Nesterov @ 2012-11-02 16:18 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Mikulas Patocka, Peter Zijlstra, Linus Torvalds, Ingo Molnar,
	Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov,
	linux-kernel

On 11/01, Oleg Nesterov wrote:
>
> On 11/01, Paul E. McKenney wrote:
> >
> > OK, so it looks to me that this code relies on synchronize_sched()
> > forcing a memory barrier on each CPU executing in the kernel.
>
> No, the patch tries to avoid this assumption, but probably I missed
> something.
>
> > 1.	A task running on CPU 0 currently write-holds the lock.
> >
> > 2.	CPU 1 is running in the kernel, executing a longer-than-average
> > 	loop of normal instructions (no atomic instructions or memory
> > 	barriers).
> >
> > 3.	CPU 0 invokes percpu_up_write(), calling up_write(),
> > 	synchronize_sched(), and finally mutex_unlock().
>
> And my expectation was, this should be enough because ...
>
> > 4.	CPU 1 executes percpu_down_read(), which calls update_fast_ctr(),
>
> since update_fast_ctr does preempt_disable/enable it should see all
> modifications done by CPU 0.
>
> IOW. Suppose that the writer (CPU 0) does
>
> 	percpu_down_write();
> 	STORE;
> 	percpu_up_write();
>
> This means
>
> 	STORE;
> 	synchronize_sched();
> 	mutex_unlock();
>
> Now. Do you mean that the next preempt_disable/enable can see the
> result of mutex_unlock() but not STORE?

So far I think this is not possible, so the code doesn't need the
additional wstate/barriers.

> > +static bool update_fast_ctr(struct percpu_rw_semaphore *brw, int val)
> > +{
> > +	bool success = false;
>
> 	int state;
>
> > +
> > +	preempt_disable();
> > +	if (likely(!mutex_is_locked(&brw->writer_mutex))) {
>
> 	state = ACCESS_ONCE(brw->wstate);
> 	if (likely(!state)) {
>
> > +		__this_cpu_add(*brw->fast_read_ctr, val);
> > +		success = true;
>
> 	} else if (state & WSTATE_NEED_MB) {
> 		__this_cpu_add(*brw->fast_read_ctr, val);
> 		smp_mb(); /* Order increment against critical section. */
> 		success = true;
> 	}

...

> > +void percpu_up_write(struct percpu_rw_semaphore *brw)
> > +{
> > +	/* allow the new readers, but only the slow-path */
> > +	up_write(&brw->rw_sem);
>
> 	ACCESS_ONCE(brw->wstate) = WSTATE_NEED_MB;
>
> > +
> > +	/* insert the barrier before the next fast-path in down_read */
> > +	synchronize_sched();

But update_fast_ctr() should see mutex_is_locked(); obviously down_write()
must ensure this.

So update_fast_ctr() can execute the WSTATE_NEED_MB code only if it
races with

> 	ACCESS_ONCE(brw->wstate) = 0;
>
> > +	mutex_unlock(&brw->writer_mutex);

these 2 stores and sees them in reverse order.



I guess that mutex_is_locked() in update_fast_ctr() looks a bit confusing.
It means no-fast-path for the reader; we could use ->state instead.

And even ->writer_mutex should go away if we want to optimize the
write-contended case, but I think this needs another patch on top of
this initial implementation.

Oleg.


^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH v2 0/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
  2012-11-01 15:10                         ` Linus Torvalds
  2012-11-01 15:34                           ` Oleg Nesterov
@ 2012-11-02 18:06                           ` Oleg Nesterov
  2012-11-02 18:06                             ` [PATCH v2 1/1] " Oleg Nesterov
  2012-11-08 13:48                             ` [PATCH RESEND v2 0/1] " Oleg Nesterov
  1 sibling, 2 replies; 103+ messages in thread
From: Oleg Nesterov @ 2012-11-02 18:06 UTC (permalink / raw)
  To: Linus Torvalds, Paul E. McKenney
  Cc: Mikulas Patocka, Peter Zijlstra, Ingo Molnar, Srikar Dronamraju,
	Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel

On 11/01, Linus Torvalds wrote:
>
> On Wed, Oct 31, 2012 at 12:41 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > With this patch down_read/up_read does synchronize_sched() twice and
> > down_read/up_read are still possible during this time, just they use
> > the slow path.
>
> The changelog is wrong (it's the write path, not read path, that does
> the synchronize_sched).

Fixed, thanks,

> >  struct percpu_rw_semaphore {
> > -       unsigned __percpu *counters;
> > -       bool locked;
> > -       struct mutex mtx;
> > +       int __percpu            *fast_read_ctr;
>
> This change is wrong.
>
> You must not make the 'fast_read_ctr' thing be an int. Or at least you
> need to be a hell of a lot more careful about it.
>
> Why?
>
> Because the readers update the counters while possibly moving around
> cpu's, the increment and decrement of the counters may be on different
> CPU's. But that means that when you add all the counters together,
> things can overflow (only the final sum is meaningful). And THAT in
> turn means that you should not use a signed count, for the simple
> reason that signed integers don't have well-behaved overflow behavior
> in C.

Yes, Mikulas pointed this out too, but I forgot to make it "unsigned".

> Now, I doubt you'll find an architecture or C compiler where this will
> actually ever make a difference,

Yes. And we have other examples, say, mnt->mnt_pcp->mnt_writers is "int".

> but the fact remains that you
> shouldn't use signed integers for counters like this. You should use
> unsigned, and you should rely on the well-defined modulo-2**n
> semantics.

OK, I changed this.

But please note that clear_fast_ctr() still returns "int", even if it
uses "unsigned" to calculate the result, because we use this value for
atomic_add(int i) and it can actually be negative; to me it looks a bit
better this way even if the generated code is the same.
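
In other words, something along these lines (a sketch pieced together from
the v1 function and the description above, not a verbatim quote of v2):

	static int clear_fast_ctr(struct percpu_rw_semaphore *brw)
	{
		unsigned int sum = 0;	/* well-defined modulo-2^n accumulation */
		int cpu;

		for_each_possible_cpu(cpu) {
			sum += per_cpu(*brw->fast_read_ctr, cpu);
			per_cpu(*brw->fast_read_ctr, cpu) = 0;
		}

		/* the total can be logically negative, hence the int return */
		return sum;
	}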

> I'd also like to see a comment somewhere in the source code about the
> whole algorithm and the rules.

Added the comments before down_read and down_write.

> Other than that, I guess it looks ok.

Great, please see v2.

I am not sure I addressed Paul's concerns, so I guess I need his ack.

Oleg.


^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
  2012-11-02 18:06                           ` [PATCH v2 0/1] " Oleg Nesterov
@ 2012-11-02 18:06                             ` Oleg Nesterov
  2012-11-07 17:04                               ` [PATCH v3 " Mikulas Patocka
  2012-11-08  1:16                               ` [PATCH v2 " Paul E. McKenney
  2012-11-08 13:48                             ` [PATCH RESEND v2 0/1] " Oleg Nesterov
  1 sibling, 2 replies; 103+ messages in thread
From: Oleg Nesterov @ 2012-11-02 18:06 UTC (permalink / raw)
  To: Linus Torvalds, Paul E. McKenney
  Cc: Mikulas Patocka, Peter Zijlstra, Ingo Molnar, Srikar Dronamraju,
	Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel

Currently the writer does msleep() plus synchronize_sched() 3 times
to acquire/release the semaphore, and during this time the readers
are blocked completely. Even if the "write" section was not actually
started or if it was already finished.

With this patch down_write/up_write does synchronize_sched() twice
and down_read/up_read are still possible during this time, just they
use the slow path.

percpu_down_write() first forces the readers to use rw_semaphore and
increment the "slow" counter to take the lock for reading, then it
takes that rw_semaphore for writing and blocks the readers.

Also. With this patch the code relies on the documented behaviour of
synchronize_sched(), it doesn't try to pair synchronize_sched() with
barrier.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 include/linux/percpu-rwsem.h |   83 +++++------------------------
 lib/Makefile                 |    2 +-
 lib/percpu-rwsem.c           |  123 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 137 insertions(+), 71 deletions(-)
 create mode 100644 lib/percpu-rwsem.c

diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
index 250a4ac..592f0d6 100644
--- a/include/linux/percpu-rwsem.h
+++ b/include/linux/percpu-rwsem.h
@@ -2,82 +2,25 @@
 #define _LINUX_PERCPU_RWSEM_H
 
 #include <linux/mutex.h>
+#include <linux/rwsem.h>
 #include <linux/percpu.h>
-#include <linux/rcupdate.h>
-#include <linux/delay.h>
+#include <linux/wait.h>
 
 struct percpu_rw_semaphore {
-	unsigned __percpu *counters;
-	bool locked;
-	struct mutex mtx;
+	unsigned int __percpu	*fast_read_ctr;
+	struct mutex		writer_mutex;
+	struct rw_semaphore	rw_sem;
+	atomic_t		slow_read_ctr;
+	wait_queue_head_t	write_waitq;
 };
 
-#define light_mb()	barrier()
-#define heavy_mb()	synchronize_sched()
+extern void percpu_down_read(struct percpu_rw_semaphore *);
+extern void percpu_up_read(struct percpu_rw_semaphore *);
 
-static inline void percpu_down_read(struct percpu_rw_semaphore *p)
-{
-	rcu_read_lock_sched();
-	if (unlikely(p->locked)) {
-		rcu_read_unlock_sched();
-		mutex_lock(&p->mtx);
-		this_cpu_inc(*p->counters);
-		mutex_unlock(&p->mtx);
-		return;
-	}
-	this_cpu_inc(*p->counters);
-	rcu_read_unlock_sched();
-	light_mb(); /* A, between read of p->locked and read of data, paired with D */
-}
+extern void percpu_down_write(struct percpu_rw_semaphore *);
+extern void percpu_up_write(struct percpu_rw_semaphore *);
 
-static inline void percpu_up_read(struct percpu_rw_semaphore *p)
-{
-	light_mb(); /* B, between read of the data and write to p->counter, paired with C */
-	this_cpu_dec(*p->counters);
-}
-
-static inline unsigned __percpu_count(unsigned __percpu *counters)
-{
-	unsigned total = 0;
-	int cpu;
-
-	for_each_possible_cpu(cpu)
-		total += ACCESS_ONCE(*per_cpu_ptr(counters, cpu));
-
-	return total;
-}
-
-static inline void percpu_down_write(struct percpu_rw_semaphore *p)
-{
-	mutex_lock(&p->mtx);
-	p->locked = true;
-	synchronize_sched(); /* make sure that all readers exit the rcu_read_lock_sched region */
-	while (__percpu_count(p->counters))
-		msleep(1);
-	heavy_mb(); /* C, between read of p->counter and write to data, paired with B */
-}
-
-static inline void percpu_up_write(struct percpu_rw_semaphore *p)
-{
-	heavy_mb(); /* D, between write to data and write to p->locked, paired with A */
-	p->locked = false;
-	mutex_unlock(&p->mtx);
-}
-
-static inline int percpu_init_rwsem(struct percpu_rw_semaphore *p)
-{
-	p->counters = alloc_percpu(unsigned);
-	if (unlikely(!p->counters))
-		return -ENOMEM;
-	p->locked = false;
-	mutex_init(&p->mtx);
-	return 0;
-}
-
-static inline void percpu_free_rwsem(struct percpu_rw_semaphore *p)
-{
-	free_percpu(p->counters);
-	p->counters = NULL; /* catch use after free bugs */
-}
+extern int percpu_init_rwsem(struct percpu_rw_semaphore *);
+extern void percpu_free_rwsem(struct percpu_rw_semaphore *);
 
 #endif
diff --git a/lib/Makefile b/lib/Makefile
index 821a162..4dad4a7 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -12,7 +12,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
 	 idr.o int_sqrt.o extable.o \
 	 sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \
 	 proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \
-	 is_single_threaded.o plist.o decompress.o
+	 is_single_threaded.o plist.o decompress.o percpu-rwsem.o
 
 lib-$(CONFIG_MMU) += ioremap.o
 lib-$(CONFIG_SMP) += cpumask.o
diff --git a/lib/percpu-rwsem.c b/lib/percpu-rwsem.c
new file mode 100644
index 0000000..0e3bc0f
--- /dev/null
+++ b/lib/percpu-rwsem.c
@@ -0,0 +1,123 @@
+#include <linux/percpu-rwsem.h>
+#include <linux/rcupdate.h>
+#include <linux/sched.h>
+
+int percpu_init_rwsem(struct percpu_rw_semaphore *brw)
+{
+	brw->fast_read_ctr = alloc_percpu(int);
+	if (unlikely(!brw->fast_read_ctr))
+		return -ENOMEM;
+
+	mutex_init(&brw->writer_mutex);
+	init_rwsem(&brw->rw_sem);
+	atomic_set(&brw->slow_read_ctr, 0);
+	init_waitqueue_head(&brw->write_waitq);
+	return 0;
+}
+
+void percpu_free_rwsem(struct percpu_rw_semaphore *brw)
+{
+	free_percpu(brw->fast_read_ctr);
+	brw->fast_read_ctr = NULL; /* catch use after free bugs */
+}
+
+static bool update_fast_ctr(struct percpu_rw_semaphore *brw, unsigned int val)
+{
+	bool success = false;
+
+	preempt_disable();
+	if (likely(!mutex_is_locked(&brw->writer_mutex))) {
+		__this_cpu_add(*brw->fast_read_ctr, val);
+		success = true;
+	}
+	preempt_enable();
+
+	return success;
+}
+
+/*
+ * Like the normal down_read() this is not recursive, the writer can
+ * come after the first percpu_down_read() and create the deadlock.
+ */
+void percpu_down_read(struct percpu_rw_semaphore *brw)
+{
+	if (likely(update_fast_ctr(brw, +1)))
+		return;
+
+	down_read(&brw->rw_sem);
+	atomic_inc(&brw->slow_read_ctr);
+	up_read(&brw->rw_sem);
+}
+
+void percpu_up_read(struct percpu_rw_semaphore *brw)
+{
+	if (likely(update_fast_ctr(brw, -1)))
+		return;
+
+	/* false-positive is possible but harmless */
+	if (atomic_dec_and_test(&brw->slow_read_ctr))
+		wake_up_all(&brw->write_waitq);
+}
+
+static int clear_fast_ctr(struct percpu_rw_semaphore *brw)
+{
+	unsigned int sum = 0;
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		sum += per_cpu(*brw->fast_read_ctr, cpu);
+		per_cpu(*brw->fast_read_ctr, cpu) = 0;
+	}
+
+	return sum;
+}
+
+/*
+ * A writer takes ->writer_mutex to exclude other writers and to force the
+ * readers to switch to the slow mode, note the mutex_is_locked() check in
+ * update_fast_ctr().
+ *
+ * After that the readers can only inc/dec the slow ->slow_read_ctr counter,
+ * ->fast_read_ctr is stable. Once the writer moves its sum into the slow
+ * counter it represents the number of active readers.
+ *
+ * Finally the writer takes ->rw_sem for writing and blocks the new readers,
+ * then waits until the slow counter becomes zero.
+ */
+void percpu_down_write(struct percpu_rw_semaphore *brw)
+{
+	/* also blocks update_fast_ctr() which checks mutex_is_locked() */
+	mutex_lock(&brw->writer_mutex);
+
+	/*
+	 * 1. Ensures mutex_is_locked() is visible to any down_read/up_read
+	 *    so that update_fast_ctr() can't succeed.
+	 *
+	 * 2. Ensures we see the result of every previous this_cpu_add() in
+	 *    update_fast_ctr().
+	 *
+	 * 3. Ensures that if any reader has exited its critical section via
+	 *    fast-path, it executes a full memory barrier before we return.
+	 */
+	synchronize_sched();
+
+	/* nobody can use fast_read_ctr, move its sum into slow_read_ctr */
+	atomic_add(clear_fast_ctr(brw), &brw->slow_read_ctr);
+
+	/* block the new readers completely */
+	down_write(&brw->rw_sem);
+
+	/* wait for all readers to complete their percpu_up_read() */
+	wait_event(brw->write_waitq, !atomic_read(&brw->slow_read_ctr));
+}
+
+void percpu_up_write(struct percpu_rw_semaphore *brw)
+{
+	/* allow the new readers, but only the slow-path */
+	up_write(&brw->rw_sem);
+
+	/* insert the barrier before the next fast-path in down_read */
+	synchronize_sched();
+
+	mutex_unlock(&brw->writer_mutex);
+}
-- 
1.5.5.1



^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [PATCH v3 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
  2012-11-02 18:06                             ` [PATCH v2 1/1] " Oleg Nesterov
@ 2012-11-07 17:04                               ` Mikulas Patocka
  2012-11-07 17:47                                 ` Oleg Nesterov
  2012-11-08  1:23                                 ` Paul E. McKenney
  2012-11-08  1:16                               ` [PATCH v2 " Paul E. McKenney
  1 sibling, 2 replies; 103+ messages in thread
From: Mikulas Patocka @ 2012-11-07 17:04 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Paul E. McKenney, Peter Zijlstra, Ingo Molnar,
	Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov,
	linux-kernel

It looks sensible.

Here I'm sending an improvement of the patch - I changed it so that there
are no two-level nested functions for the fast path and so that both
percpu_down_read and percpu_up_read use the same piece of code (to reduce
cache footprint).

---

Currently the writer does msleep() plus synchronize_sched() 3 times
to acquire/release the semaphore, and during this time the readers
are blocked completely, even if the "write" section was not actually
started or was already finished.

With this patch down_write/up_write does synchronize_sched() twice
and down_read/up_read are still possible during this time; they just
use the slow path.

percpu_down_write() first forces the readers to use rw_semaphore and
increment the "slow" counter to take the lock for reading, then it
takes that rw_semaphore for writing and blocks the readers.

Also, with this patch the code relies on the documented behaviour of
synchronize_sched(); it doesn't try to pair synchronize_sched() with
a barrier.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
---
 include/linux/percpu-rwsem.h |   80 ++++++-------------------------
 lib/Makefile                 |    2 
 lib/percpu-rwsem.c           |  110 +++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 127 insertions(+), 65 deletions(-)
 create mode 100644 lib/percpu-rwsem.c

Index: linux-3.6.6-fast/include/linux/percpu-rwsem.h
===================================================================
--- linux-3.6.6-fast.orig/include/linux/percpu-rwsem.h	2012-11-05 16:21:29.000000000 +0100
+++ linux-3.6.6-fast/include/linux/percpu-rwsem.h	2012-11-07 16:44:04.000000000 +0100
@@ -2,82 +2,34 @@
 #define _LINUX_PERCPU_RWSEM_H
 
 #include <linux/mutex.h>
+#include <linux/rwsem.h>
 #include <linux/percpu.h>
-#include <linux/rcupdate.h>
-#include <linux/delay.h>
+#include <linux/wait.h>
 
 struct percpu_rw_semaphore {
-	unsigned __percpu *counters;
-	bool locked;
-	struct mutex mtx;
+	unsigned int __percpu	*fast_read_ctr;
+	struct mutex		writer_mutex;
+	struct rw_semaphore	rw_sem;
+	atomic_t		slow_read_ctr;
+	wait_queue_head_t	write_waitq;
 };
 
-#define light_mb()	barrier()
-#define heavy_mb()	synchronize_sched()
+extern void __percpu_down_up_read(struct percpu_rw_semaphore *, int);
 
-static inline void percpu_down_read(struct percpu_rw_semaphore *p)
-{
-	rcu_read_lock_sched();
-	if (unlikely(p->locked)) {
-		rcu_read_unlock_sched();
-		mutex_lock(&p->mtx);
-		this_cpu_inc(*p->counters);
-		mutex_unlock(&p->mtx);
-		return;
-	}
-	this_cpu_inc(*p->counters);
-	rcu_read_unlock_sched();
-	light_mb(); /* A, between read of p->locked and read of data, paired with D */
-}
-
-static inline void percpu_up_read(struct percpu_rw_semaphore *p)
-{
-	light_mb(); /* B, between read of the data and write to p->counter, paired with C */
-	this_cpu_dec(*p->counters);
-}
-
-static inline unsigned __percpu_count(unsigned __percpu *counters)
-{
-	unsigned total = 0;
-	int cpu;
-
-	for_each_possible_cpu(cpu)
-		total += ACCESS_ONCE(*per_cpu_ptr(counters, cpu));
+extern void percpu_down_write(struct percpu_rw_semaphore *);
+extern void percpu_up_write(struct percpu_rw_semaphore *);
 
-	return total;
-}
-
-static inline void percpu_down_write(struct percpu_rw_semaphore *p)
-{
-	mutex_lock(&p->mtx);
-	p->locked = true;
-	synchronize_sched(); /* make sure that all readers exit the rcu_read_lock_sched region */
-	while (__percpu_count(p->counters))
-		msleep(1);
-	heavy_mb(); /* C, between read of p->counter and write to data, paired with B */
-}
-
-static inline void percpu_up_write(struct percpu_rw_semaphore *p)
-{
-	heavy_mb(); /* D, between write to data and write to p->locked, paired with A */
-	p->locked = false;
-	mutex_unlock(&p->mtx);
-}
+extern int percpu_init_rwsem(struct percpu_rw_semaphore *);
+extern void percpu_free_rwsem(struct percpu_rw_semaphore *);
 
-static inline int percpu_init_rwsem(struct percpu_rw_semaphore *p)
+static inline void percpu_down_read(struct percpu_rw_semaphore *s)
 {
-	p->counters = alloc_percpu(unsigned);
-	if (unlikely(!p->counters))
-		return -ENOMEM;
-	p->locked = false;
-	mutex_init(&p->mtx);
-	return 0;
+	__percpu_down_up_read(s, 1);
 }
 
-static inline void percpu_free_rwsem(struct percpu_rw_semaphore *p)
+static inline void percpu_up_read(struct percpu_rw_semaphore *s)
 {
-	free_percpu(p->counters);
-	p->counters = NULL; /* catch use after free bugs */
+	__percpu_down_up_read(s, -1);
 }
 
 #endif
Index: linux-3.6.6-fast/lib/Makefile
===================================================================
--- linux-3.6.6-fast.orig/lib/Makefile	2012-10-02 00:47:57.000000000 +0200
+++ linux-3.6.6-fast/lib/Makefile	2012-11-07 03:10:44.000000000 +0100
@@ -12,7 +12,7 @@ lib-y := ctype.o string.o vsprintf.o cmd
 	 idr.o int_sqrt.o extable.o prio_tree.o \
 	 sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \
 	 proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \
-	 is_single_threaded.o plist.o decompress.o
+	 is_single_threaded.o plist.o decompress.o percpu-rwsem.o
 
 lib-$(CONFIG_MMU) += ioremap.o
 lib-$(CONFIG_SMP) += cpumask.o
Index: linux-3.6.6-fast/lib/percpu-rwsem.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-3.6.6-fast/lib/percpu-rwsem.c	2012-11-07 16:43:27.000000000 +0100
@@ -0,0 +1,110 @@
+#include <linux/percpu-rwsem.h>
+#include <linux/rcupdate.h>
+#include <linux/sched.h>
+#include <linux/module.h>
+
+int percpu_init_rwsem(struct percpu_rw_semaphore *brw)
+{
+	brw->fast_read_ctr = alloc_percpu(int);
+	if (unlikely(!brw->fast_read_ctr))
+		return -ENOMEM;
+
+	mutex_init(&brw->writer_mutex);
+	init_rwsem(&brw->rw_sem);
+	atomic_set(&brw->slow_read_ctr, 0);
+	init_waitqueue_head(&brw->write_waitq);
+	return 0;
+}
+EXPORT_SYMBOL(percpu_init_rwsem);
+
+void percpu_free_rwsem(struct percpu_rw_semaphore *brw)
+{
+	free_percpu(brw->fast_read_ctr);
+	brw->fast_read_ctr = NULL; /* catch use after free bugs */
+}
+EXPORT_SYMBOL(percpu_free_rwsem);
+
+void __percpu_down_up_read(struct percpu_rw_semaphore *brw, int val)
+{
+	preempt_disable();
+	if (likely(!mutex_is_locked(&brw->writer_mutex))) {
+		__this_cpu_add(*brw->fast_read_ctr, val);
+		preempt_enable();
+		return;
+	}
+	preempt_enable();
+	if (val >= 0) {
+		down_read(&brw->rw_sem);
+		atomic_inc(&brw->slow_read_ctr);
+		up_read(&brw->rw_sem);
+	} else {
+		if (atomic_dec_and_test(&brw->slow_read_ctr))
+			wake_up_all(&brw->write_waitq);
+	}
+}
+EXPORT_SYMBOL(__percpu_down_up_read);
+
+static int clear_fast_ctr(struct percpu_rw_semaphore *brw)
+{
+	unsigned int sum = 0;
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		sum += per_cpu(*brw->fast_read_ctr, cpu);
+		per_cpu(*brw->fast_read_ctr, cpu) = 0;
+	}
+
+	return sum;
+}
+
+/*
+ * A writer takes ->writer_mutex to exclude other writers and to force the
+ * readers to switch to the slow mode, note the mutex_is_locked() check in
+ * update_fast_ctr().
+ *
+ * After that the readers can only inc/dec the slow ->slow_read_ctr counter,
+ * ->fast_read_ctr is stable. Once the writer moves its sum into the slow
+ * counter it represents the number of active readers.
+ *
+ * Finally the writer takes ->rw_sem for writing and blocks the new readers,
+ * then waits until the slow counter becomes zero.
+ */
+void percpu_down_write(struct percpu_rw_semaphore *brw)
+{
+	/* also blocks update_fast_ctr() which checks mutex_is_locked() */
+	mutex_lock(&brw->writer_mutex);
+
+	/*
+	 * 1. Ensures mutex_is_locked() is visible to any down_read/up_read
+	 *    so that update_fast_ctr() can't succeed.
+	 *
+	 * 2. Ensures we see the result of every previous this_cpu_add() in
+	 *    update_fast_ctr().
+	 *
+	 * 3. Ensures that if any reader has exited its critical section via
+	 *    fast-path, it executes a full memory barrier before we return.
+	 */
+	synchronize_sched();
+
+	/* nobody can use fast_read_ctr, move its sum into slow_read_ctr */
+	atomic_add(clear_fast_ctr(brw), &brw->slow_read_ctr);
+
+	/* block the new readers completely */
+	down_write(&brw->rw_sem);
+
+	/* wait for all readers to complete their percpu_up_read() */
+	wait_event(brw->write_waitq, !atomic_read(&brw->slow_read_ctr));
+}
+EXPORT_SYMBOL(percpu_down_write);
+
+void percpu_up_write(struct percpu_rw_semaphore *brw)
+{
+	/* allow the new readers, but only the slow-path */
+	up_write(&brw->rw_sem);
+
+	/* insert the barrier before the next fast-path in down_read */
+	synchronize_sched();
+
+	mutex_unlock(&brw->writer_mutex);
+}
+EXPORT_SYMBOL(percpu_up_write);

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
  2012-11-07 17:04                               ` [PATCH v3 " Mikulas Patocka
@ 2012-11-07 17:47                                 ` Oleg Nesterov
  2012-11-07 19:17                                   ` Mikulas Patocka
  2012-11-08  1:23                                 ` Paul E. McKenney
  1 sibling, 1 reply; 103+ messages in thread
From: Oleg Nesterov @ 2012-11-07 17:47 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Linus Torvalds, Paul E. McKenney, Peter Zijlstra, Ingo Molnar,
	Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov,
	linux-kernel

On 11/07, Mikulas Patocka wrote:
>
> It looks sensible.
>
> Here I'm sending an improvement of the patch - I changed it so that there
> are not two-level nested functions for the fast path and so that both
> percpu_down_read and percpu_up_read use the same piece of code (to reduce
> cache footprint).

IOW, the only change is that you eliminate "static update_fast_ctr()"
and fold it into down/up_read which takes the additional argument.

Honestly, personally I do not think this is better, but I won't argue.
I agree with everything but I guess we need the ack from Paul.

As for EXPORT_SYMBOL, I do not mind of course. But currently the only
user is block_dev.c.

> Currently the writer does msleep() plus synchronize_sched() 3 times
> to acquire/release the semaphore, and during this time the readers
> are blocked completely. Even if the "write" section was not actually
> started or if it was already finished.
> 
> With this patch down_write/up_write does synchronize_sched() twice
> and down_read/up_read are still possible during this time, just they
> use the slow path.
> 
> percpu_down_write() first forces the readers to use rw_semaphore and
> increment the "slow" counter to take the lock for reading, then it
> takes that rw_semaphore for writing and blocks the readers.
> 
> Also. With this patch the code relies on the documented behaviour of
> synchronize_sched(), it doesn't try to pair synchronize_sched() with
> barrier.
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> ---
>  include/linux/percpu-rwsem.h |   80 ++++++-------------------------
>  lib/Makefile                 |    2 
>  lib/percpu-rwsem.c           |  110 +++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 127 insertions(+), 65 deletions(-)
>  create mode 100644 lib/percpu-rwsem.c
> 
> Index: linux-3.6.6-fast/include/linux/percpu-rwsem.h
> ===================================================================
> --- linux-3.6.6-fast.orig/include/linux/percpu-rwsem.h	2012-11-05 16:21:29.000000000 +0100
> +++ linux-3.6.6-fast/include/linux/percpu-rwsem.h	2012-11-07 16:44:04.000000000 +0100
> @@ -2,82 +2,34 @@
>  #define _LINUX_PERCPU_RWSEM_H
>  
>  #include <linux/mutex.h>
> +#include <linux/rwsem.h>
>  #include <linux/percpu.h>
> -#include <linux/rcupdate.h>
> -#include <linux/delay.h>
> +#include <linux/wait.h>
>  
>  struct percpu_rw_semaphore {
> -	unsigned __percpu *counters;
> -	bool locked;
> -	struct mutex mtx;
> +	unsigned int __percpu	*fast_read_ctr;
> +	struct mutex		writer_mutex;
> +	struct rw_semaphore	rw_sem;
> +	atomic_t		slow_read_ctr;
> +	wait_queue_head_t	write_waitq;
>  };
>  
> -#define light_mb()	barrier()
> -#define heavy_mb()	synchronize_sched()
> +extern void __percpu_down_up_read(struct percpu_rw_semaphore *, int);
>  
> -static inline void percpu_down_read(struct percpu_rw_semaphore *p)
> -{
> -	rcu_read_lock_sched();
> -	if (unlikely(p->locked)) {
> -		rcu_read_unlock_sched();
> -		mutex_lock(&p->mtx);
> -		this_cpu_inc(*p->counters);
> -		mutex_unlock(&p->mtx);
> -		return;
> -	}
> -	this_cpu_inc(*p->counters);
> -	rcu_read_unlock_sched();
> -	light_mb(); /* A, between read of p->locked and read of data, paired with D */
> -}
> -
> -static inline void percpu_up_read(struct percpu_rw_semaphore *p)
> -{
> -	light_mb(); /* B, between read of the data and write to p->counter, paired with C */
> -	this_cpu_dec(*p->counters);
> -}
> -
> -static inline unsigned __percpu_count(unsigned __percpu *counters)
> -{
> -	unsigned total = 0;
> -	int cpu;
> -
> -	for_each_possible_cpu(cpu)
> -		total += ACCESS_ONCE(*per_cpu_ptr(counters, cpu));
> +extern void percpu_down_write(struct percpu_rw_semaphore *);
> +extern void percpu_up_write(struct percpu_rw_semaphore *);
>  
> -	return total;
> -}
> -
> -static inline void percpu_down_write(struct percpu_rw_semaphore *p)
> -{
> -	mutex_lock(&p->mtx);
> -	p->locked = true;
> -	synchronize_sched(); /* make sure that all readers exit the rcu_read_lock_sched region */
> -	while (__percpu_count(p->counters))
> -		msleep(1);
> -	heavy_mb(); /* C, between read of p->counter and write to data, paired with B */
> -}
> -
> -static inline void percpu_up_write(struct percpu_rw_semaphore *p)
> -{
> -	heavy_mb(); /* D, between write to data and write to p->locked, paired with A */
> -	p->locked = false;
> -	mutex_unlock(&p->mtx);
> -}
> +extern int percpu_init_rwsem(struct percpu_rw_semaphore *);
> +extern void percpu_free_rwsem(struct percpu_rw_semaphore *);
>  
> -static inline int percpu_init_rwsem(struct percpu_rw_semaphore *p)
> +static inline void percpu_down_read(struct percpu_rw_semaphore *s)
>  {
> -	p->counters = alloc_percpu(unsigned);
> -	if (unlikely(!p->counters))
> -		return -ENOMEM;
> -	p->locked = false;
> -	mutex_init(&p->mtx);
> -	return 0;
> +	__percpu_down_up_read(s, 1);
>  }
>  
> -static inline void percpu_free_rwsem(struct percpu_rw_semaphore *p)
> +static inline void percpu_up_read(struct percpu_rw_semaphore *s)
>  {
> -	free_percpu(p->counters);
> -	p->counters = NULL; /* catch use after free bugs */
> +	__percpu_down_up_read(s, -1);
>  }
>  
>  #endif
> Index: linux-3.6.6-fast/lib/Makefile
> ===================================================================
> --- linux-3.6.6-fast.orig/lib/Makefile	2012-10-02 00:47:57.000000000 +0200
> +++ linux-3.6.6-fast/lib/Makefile	2012-11-07 03:10:44.000000000 +0100
> @@ -12,7 +12,7 @@ lib-y := ctype.o string.o vsprintf.o cmd
>  	 idr.o int_sqrt.o extable.o prio_tree.o \
>  	 sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \
>  	 proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \
> -	 is_single_threaded.o plist.o decompress.o
> +	 is_single_threaded.o plist.o decompress.o percpu-rwsem.o
>  
>  lib-$(CONFIG_MMU) += ioremap.o
>  lib-$(CONFIG_SMP) += cpumask.o
> Index: linux-3.6.6-fast/lib/percpu-rwsem.c
> ===================================================================
> --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> +++ linux-3.6.6-fast/lib/percpu-rwsem.c	2012-11-07 16:43:27.000000000 +0100
> @@ -0,0 +1,110 @@
> +#include <linux/percpu-rwsem.h>
> +#include <linux/rcupdate.h>
> +#include <linux/sched.h>
> +#include <linux/module.h>
> +
> +int percpu_init_rwsem(struct percpu_rw_semaphore *brw)
> +{
> +	brw->fast_read_ctr = alloc_percpu(int);
> +	if (unlikely(!brw->fast_read_ctr))
> +		return -ENOMEM;
> +
> +	mutex_init(&brw->writer_mutex);
> +	init_rwsem(&brw->rw_sem);
> +	atomic_set(&brw->slow_read_ctr, 0);
> +	init_waitqueue_head(&brw->write_waitq);
> +	return 0;
> +}
> +EXPORT_SYMBOL(percpu_init_rwsem);
> +
> +void percpu_free_rwsem(struct percpu_rw_semaphore *brw)
> +{
> +	free_percpu(brw->fast_read_ctr);
> +	brw->fast_read_ctr = NULL; /* catch use after free bugs */
> +}
> +EXPORT_SYMBOL(percpu_free_rwsem);
> +
> +void __percpu_down_up_read(struct percpu_rw_semaphore *brw, int val)
> +{
> +	preempt_disable();
> +	if (likely(!mutex_is_locked(&brw->writer_mutex))) {
> +		__this_cpu_add(*brw->fast_read_ctr, val);
> +		preempt_enable();
> +		return;
> +	}
> +	preempt_enable();
> +	if (val >= 0) {
> +		down_read(&brw->rw_sem);
> +		atomic_inc(&brw->slow_read_ctr);
> +		up_read(&brw->rw_sem);
> +	} else {
> +		if (atomic_dec_and_test(&brw->slow_read_ctr))
> +			wake_up_all(&brw->write_waitq);
> +	}
> +}
> +EXPORT_SYMBOL(__percpu_down_up_read);
> +
> +static int clear_fast_ctr(struct percpu_rw_semaphore *brw)
> +{
> +	unsigned int sum = 0;
> +	int cpu;
> +
> +	for_each_possible_cpu(cpu) {
> +		sum += per_cpu(*brw->fast_read_ctr, cpu);
> +		per_cpu(*brw->fast_read_ctr, cpu) = 0;
> +	}
> +
> +	return sum;
> +}
> +
> +/*
> + * A writer takes ->writer_mutex to exclude other writers and to force the
> + * readers to switch to the slow mode, note the mutex_is_locked() check in
> + * update_fast_ctr().
> + *
> + * After that the readers can only inc/dec the slow ->slow_read_ctr counter,
> + * ->fast_read_ctr is stable. Once the writer moves its sum into the slow
> + * counter it represents the number of active readers.
> + *
> + * Finally the writer takes ->rw_sem for writing and blocks the new readers,
> + * then waits until the slow counter becomes zero.
> + */
> +void percpu_down_write(struct percpu_rw_semaphore *brw)
> +{
> +	/* also blocks update_fast_ctr() which checks mutex_is_locked() */
> +	mutex_lock(&brw->writer_mutex);
> +
> +	/*
> +	 * 1. Ensures mutex_is_locked() is visible to any down_read/up_read
> +	 *    so that update_fast_ctr() can't succeed.
> +	 *
> +	 * 2. Ensures we see the result of every previous this_cpu_add() in
> +	 *    update_fast_ctr().
> +	 *
> +	 * 3. Ensures that if any reader has exited its critical section via
> +	 *    fast-path, it executes a full memory barrier before we return.
> +	 */
> +	synchronize_sched();
> +
> +	/* nobody can use fast_read_ctr, move its sum into slow_read_ctr */
> +	atomic_add(clear_fast_ctr(brw), &brw->slow_read_ctr);
> +
> +	/* block the new readers completely */
> +	down_write(&brw->rw_sem);
> +
> +	/* wait for all readers to complete their percpu_up_read() */
> +	wait_event(brw->write_waitq, !atomic_read(&brw->slow_read_ctr));
> +}
> +EXPORT_SYMBOL(percpu_down_write);
> +
> +void percpu_up_write(struct percpu_rw_semaphore *brw)
> +{
> +	/* allow the new readers, but only the slow-path */
> +	up_write(&brw->rw_sem);
> +
> +	/* insert the barrier before the next fast-path in down_read */
> +	synchronize_sched();
> +
> +	mutex_unlock(&brw->writer_mutex);
> +}
> +EXPORT_SYMBOL(percpu_up_write);


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
  2012-11-07 17:47                                 ` Oleg Nesterov
@ 2012-11-07 19:17                                   ` Mikulas Patocka
  2012-11-08 13:42                                     ` Oleg Nesterov
  0 siblings, 1 reply; 103+ messages in thread
From: Mikulas Patocka @ 2012-11-07 19:17 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Paul E. McKenney, Peter Zijlstra, Ingo Molnar,
	Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov,
	linux-kernel



On Wed, 7 Nov 2012, Oleg Nesterov wrote:

> On 11/07, Mikulas Patocka wrote:
> >
> > It looks sensible.
> >
> > Here I'm sending an improvement of the patch - I changed it so that there
> > are not two-level nested functions for the fast path and so that both
> > percpu_down_read and percpu_up_read use the same piece of code (to reduce
> > cache footprint).
> 
> IOW, the only change is that you eliminate "static update_fast_ctr()"
> and fold it into down/up_read which takes the additional argument.
> 
> Honestly, personally I do not think this is better, but I won't argue.
> I agree with everything but I guess we need the ack from Paul.

If you look at the generated assembly (for x86-64), the footprint of my patch 
is 78 bytes, shared by both percpu_down_read and percpu_up_read.

The footprint of your patch is 62 bytes for update_fast_ctr, 46 bytes for 
percpu_down_read and 20 bytes for percpu_up_read.

Mikulas

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
  2012-11-02 18:06                             ` [PATCH v2 1/1] " Oleg Nesterov
  2012-11-07 17:04                               ` [PATCH v3 " Mikulas Patocka
@ 2012-11-08  1:16                               ` Paul E. McKenney
  2012-11-08 13:33                                 ` Oleg Nesterov
  1 sibling, 1 reply; 103+ messages in thread
From: Paul E. McKenney @ 2012-11-08  1:16 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Mikulas Patocka, Peter Zijlstra, Ingo Molnar,
	Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov,
	linux-kernel

On Fri, Nov 02, 2012 at 07:06:29PM +0100, Oleg Nesterov wrote:
> Currently the writer does msleep() plus synchronize_sched() 3 times
> to acquire/release the semaphore, and during this time the readers
> are blocked completely. Even if the "write" section was not actually
> started or if it was already finished.
> 
> With this patch down_write/up_write does synchronize_sched() twice
> and down_read/up_read are still possible during this time, just they
> use the slow path.
> 
> percpu_down_write() first forces the readers to use rw_semaphore and
> increment the "slow" counter to take the lock for reading, then it
> takes that rw_semaphore for writing and blocks the readers.
> 
> Also. With this patch the code relies on the documented behaviour of
> synchronize_sched(), it doesn't try to pair synchronize_sched() with
> barrier.
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> ---
>  include/linux/percpu-rwsem.h |   83 +++++------------------------
>  lib/Makefile                 |    2 +-
>  lib/percpu-rwsem.c           |  123 ++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 137 insertions(+), 71 deletions(-)
>  create mode 100644 lib/percpu-rwsem.c
> 
> diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
> index 250a4ac..592f0d6 100644
> --- a/include/linux/percpu-rwsem.h
> +++ b/include/linux/percpu-rwsem.h
> @@ -2,82 +2,25 @@
>  #define _LINUX_PERCPU_RWSEM_H
> 
>  #include <linux/mutex.h>
> +#include <linux/rwsem.h>
>  #include <linux/percpu.h>
> -#include <linux/rcupdate.h>
> -#include <linux/delay.h>
> +#include <linux/wait.h>
> 
>  struct percpu_rw_semaphore {
> -	unsigned __percpu *counters;
> -	bool locked;
> -	struct mutex mtx;
> +	unsigned int __percpu	*fast_read_ctr;
> +	struct mutex		writer_mutex;
> +	struct rw_semaphore	rw_sem;
> +	atomic_t		slow_read_ctr;
> +	wait_queue_head_t	write_waitq;
>  };
> 
> -#define light_mb()	barrier()
> -#define heavy_mb()	synchronize_sched()
> +extern void percpu_down_read(struct percpu_rw_semaphore *);
> +extern void percpu_up_read(struct percpu_rw_semaphore *);
> 
> -static inline void percpu_down_read(struct percpu_rw_semaphore *p)
> -{
> -	rcu_read_lock_sched();
> -	if (unlikely(p->locked)) {
> -		rcu_read_unlock_sched();
> -		mutex_lock(&p->mtx);
> -		this_cpu_inc(*p->counters);
> -		mutex_unlock(&p->mtx);
> -		return;
> -	}
> -	this_cpu_inc(*p->counters);
> -	rcu_read_unlock_sched();
> -	light_mb(); /* A, between read of p->locked and read of data, paired with D */
> -}
> +extern void percpu_down_write(struct percpu_rw_semaphore *);
> +extern void percpu_up_write(struct percpu_rw_semaphore *);
> 
> -static inline void percpu_up_read(struct percpu_rw_semaphore *p)
> -{
> -	light_mb(); /* B, between read of the data and write to p->counter, paired with C */
> -	this_cpu_dec(*p->counters);
> -}
> -
> -static inline unsigned __percpu_count(unsigned __percpu *counters)
> -{
> -	unsigned total = 0;
> -	int cpu;
> -
> -	for_each_possible_cpu(cpu)
> -		total += ACCESS_ONCE(*per_cpu_ptr(counters, cpu));
> -
> -	return total;
> -}
> -
> -static inline void percpu_down_write(struct percpu_rw_semaphore *p)
> -{
> -	mutex_lock(&p->mtx);
> -	p->locked = true;
> -	synchronize_sched(); /* make sure that all readers exit the rcu_read_lock_sched region */
> -	while (__percpu_count(p->counters))
> -		msleep(1);
> -	heavy_mb(); /* C, between read of p->counter and write to data, paired with B */
> -}
> -
> -static inline void percpu_up_write(struct percpu_rw_semaphore *p)
> -{
> -	heavy_mb(); /* D, between write to data and write to p->locked, paired with A */
> -	p->locked = false;
> -	mutex_unlock(&p->mtx);
> -}
> -
> -static inline int percpu_init_rwsem(struct percpu_rw_semaphore *p)
> -{
> -	p->counters = alloc_percpu(unsigned);
> -	if (unlikely(!p->counters))
> -		return -ENOMEM;
> -	p->locked = false;
> -	mutex_init(&p->mtx);
> -	return 0;
> -}
> -
> -static inline void percpu_free_rwsem(struct percpu_rw_semaphore *p)
> -{
> -	free_percpu(p->counters);
> -	p->counters = NULL; /* catch use after free bugs */
> -}
> +extern int percpu_init_rwsem(struct percpu_rw_semaphore *);
> +extern void percpu_free_rwsem(struct percpu_rw_semaphore *);
> 
>  #endif
> diff --git a/lib/Makefile b/lib/Makefile
> index 821a162..4dad4a7 100644
> --- a/lib/Makefile
> +++ b/lib/Makefile
> @@ -12,7 +12,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
>  	 idr.o int_sqrt.o extable.o \
>  	 sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \
>  	 proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \
> -	 is_single_threaded.o plist.o decompress.o
> +	 is_single_threaded.o plist.o decompress.o percpu-rwsem.o
> 
>  lib-$(CONFIG_MMU) += ioremap.o
>  lib-$(CONFIG_SMP) += cpumask.o
> diff --git a/lib/percpu-rwsem.c b/lib/percpu-rwsem.c
> new file mode 100644
> index 0000000..0e3bc0f
> --- /dev/null
> +++ b/lib/percpu-rwsem.c
> @@ -0,0 +1,123 @@
> +#include <linux/percpu-rwsem.h>
> +#include <linux/rcupdate.h>
> +#include <linux/sched.h>
> +
> +int percpu_init_rwsem(struct percpu_rw_semaphore *brw)
> +{
> +	brw->fast_read_ctr = alloc_percpu(int);
> +	if (unlikely(!brw->fast_read_ctr))
> +		return -ENOMEM;
> +
> +	mutex_init(&brw->writer_mutex);
> +	init_rwsem(&brw->rw_sem);
> +	atomic_set(&brw->slow_read_ctr, 0);
> +	init_waitqueue_head(&brw->write_waitq);
> +	return 0;
> +}
> +
> +void percpu_free_rwsem(struct percpu_rw_semaphore *brw)
> +{
> +	free_percpu(brw->fast_read_ctr);
> +	brw->fast_read_ctr = NULL; /* catch use after free bugs */
> +}
> +
> +static bool update_fast_ctr(struct percpu_rw_semaphore *brw, unsigned int val)
> +{
> +	bool success = false;
> +
> +	preempt_disable();
> +	if (likely(!mutex_is_locked(&brw->writer_mutex))) {
> +		__this_cpu_add(*brw->fast_read_ctr, val);
> +		success = true;
> +	}
> +	preempt_enable();
> +
> +	return success;
> +}
> +
> +/*
> + * Like the normal down_read() this is not recursive, the writer can
> + * come after the first percpu_down_read() and create the deadlock.
> + */
> +void percpu_down_read(struct percpu_rw_semaphore *brw)
> +{
> +	if (likely(update_fast_ctr(brw, +1)))
> +		return;
> +
> +	down_read(&brw->rw_sem);
> +	atomic_inc(&brw->slow_read_ctr);
> +	up_read(&brw->rw_sem);
> +}
> +
> +void percpu_up_read(struct percpu_rw_semaphore *brw)
> +{
> +	if (likely(update_fast_ctr(brw, -1)))
> +		return;
> +
> +	/* false-positive is possible but harmless */
> +	if (atomic_dec_and_test(&brw->slow_read_ctr))
> +		wake_up_all(&brw->write_waitq);
> +}
> +
> +static int clear_fast_ctr(struct percpu_rw_semaphore *brw)
> +{
> +	unsigned int sum = 0;
> +	int cpu;
> +
> +	for_each_possible_cpu(cpu) {
> +		sum += per_cpu(*brw->fast_read_ctr, cpu);
> +		per_cpu(*brw->fast_read_ctr, cpu) = 0;
> +	}
> +
> +	return sum;
> +}
> +
> +/*
> + * A writer takes ->writer_mutex to exclude other writers and to force the
> + * readers to switch to the slow mode, note the mutex_is_locked() check in
> + * update_fast_ctr().
> + *
> + * After that the readers can only inc/dec the slow ->slow_read_ctr counter,
> + * ->fast_read_ctr is stable. Once the writer moves its sum into the slow
> + * counter it represents the number of active readers.
> + *
> + * Finally the writer takes ->rw_sem for writing and blocks the new readers,
> + * then waits until the slow counter becomes zero.
> + */
> +void percpu_down_write(struct percpu_rw_semaphore *brw)
> +{
> +	/* also blocks update_fast_ctr() which checks mutex_is_locked() */
> +	mutex_lock(&brw->writer_mutex);
> +
> +	/*
> +	 * 1. Ensures mutex_is_locked() is visible to any down_read/up_read
> +	 *    so that update_fast_ctr() can't succeed.
> +	 *
> +	 * 2. Ensures we see the result of every previous this_cpu_add() in
> +	 *    update_fast_ctr().
> +	 *
> +	 * 3. Ensures that if any reader has exited its critical section via
> +	 *    fast-path, it executes a full memory barrier before we return.
> +	 */
> +	synchronize_sched();
> +
> +	/* nobody can use fast_read_ctr, move its sum into slow_read_ctr */
> +	atomic_add(clear_fast_ctr(brw), &brw->slow_read_ctr);
> +
> +	/* block the new readers completely */
> +	down_write(&brw->rw_sem);
> +
> +	/* wait for all readers to complete their percpu_up_read() */
> +	wait_event(brw->write_waitq, !atomic_read(&brw->slow_read_ctr));
> +}
> +
> +void percpu_up_write(struct percpu_rw_semaphore *brw)
> +{
> +	/* allow the new readers, but only the slow-path */
> +	up_write(&brw->rw_sem);
> +
> +	/* insert the barrier before the next fast-path in down_read */
> +	synchronize_sched();

Ah, my added comments describing the memory-order properties of
synchronize_sched() were incomplete.  As you say in the comment above,
a valid RCU implementation must ensure that each CPU executes a memory
barrier between the time that synchronize_sched() starts executing and
the time that this same CPU starts its first RCU read-side critical
section that ends after synchronize_sched() finishes executing.  (This
is symmetric with the requirement discussed earlier.)

This works for the user-level RCU implementations as well -- some of
them supply the memory barriers under the control of the synchronize_rcu(),
while others supply them at the beginnings and ends of the RCU read-side
critical section.  Either way works, as required.

(Why do I care about potential implementations with memory barriers in
the read-side primitives?  Well, I hope that I never have reason to.
But if memory barriers do some day become free and if energy efficiency
continues to grow in importance, some hardware might prefer the memory
barriers in rcu_read_lock() and rcu_read_unlock() to interrupting CPUs
to force them to execute memory barriers.)

This in turn means that if a given RCU read-side critical section is
totally overlapped by a synchronize_sched(), there are no guarantees
of any memory barriers.  Which is OK, you don't rely on this.

> +	mutex_unlock(&brw->writer_mutex);

And if a reader sees brw->writer_mutex as unlocked, then that reader's
RCU read-side critical section must end after the above synchronize_sched()
completes, which in turn means that there must have been a memory barrier
on that reader's CPU after the synchronize_sched() started, so that the
reader correctly sees the writer's updates.

> +}
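
To make this concrete, the percpu_up_write()/fast-path percpu_down_read()
pairing can be pictured with a small sketch (an illustration of the argument
above, not code from the patch):

	int DATA = 0, UNLOCKED = 0;

	// writer: the updates done under the lock, then percpu_up_write()
	DATA = 1;
	synchronize_sched();	/* the barrier inside percpu_up_write() */
	UNLOCKED = 1;		/* stands for mutex_unlock(&writer_mutex) */

	// reader: the fast path of percpu_down_read()
	rcu_read_lock_sched();	/* really preempt_disable() */
	if (UNLOCKED)		/* stands for !mutex_is_locked() */
		BUG_ON(DATA != 1);
	rcu_read_unlock_sched();

If the reader sees UNLOCKED != 0, its read-side critical section must end
after the synchronize_sched() above completes, so by the rule stated here it
has executed a full memory barrier after that synchronize_sched() started and
therefore also sees DATA == 1.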

Sorry to be such a pain (and a slow pain at that) on this one, but
we really needed to get this right.  But please let me know what you
think of the added memory-order constraint.  Note that a CPU that never
ever executes any RCU read-side critical sections need not execute any
synchronize_sched()-induced memory barriers.

So:

Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

							Thanx, Paul


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
  2012-11-07 17:04                               ` [PATCH v3 " Mikulas Patocka
  2012-11-07 17:47                                 ` Oleg Nesterov
@ 2012-11-08  1:23                                 ` Paul E. McKenney
  1 sibling, 0 replies; 103+ messages in thread
From: Paul E. McKenney @ 2012-11-08  1:23 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Oleg Nesterov, Linus Torvalds, Peter Zijlstra, Ingo Molnar,
	Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov,
	linux-kernel

On Wed, Nov 07, 2012 at 12:04:48PM -0500, Mikulas Patocka wrote:
> It looks sensible.
> 
> Here I'm sending an improvement of the patch - I changed it so that there 
> are not two-level nested functions for the fast path and so that both 
> percpu_down_read and percpu_up_read use the same piece of code (to reduce 
> cache footprint).
> 
> ---
> 
> Currently the writer does msleep() plus synchronize_sched() 3 times
> to acquire/release the semaphore, and during this time the readers
> are blocked completely. Even if the "write" section was not actually
> started or if it was already finished.
> 
> With this patch down_write/up_write does synchronize_sched() twice
> and down_read/up_read are still possible during this time, just they
> use the slow path.
> 
> percpu_down_write() first forces the readers to use rw_semaphore and
> increment the "slow" counter to take the lock for reading, then it
> takes that rw_semaphore for writing and blocks the readers.
> 
> Also. With this patch the code relies on the documented behaviour of
> synchronize_sched(), it doesn't try to pair synchronize_sched() with
> barrier.
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

From a memory-ordering viewpoint, this looks to me to work the same
way that Oleg's does.  Oleg's approach looks better to me, though that
might be because I have looked at it quite a few times over the past
several days.

							Thanx, Paul

> ---
>  include/linux/percpu-rwsem.h |   80 ++++++-------------------------
>  lib/Makefile                 |    2 
>  lib/percpu-rwsem.c           |  110 +++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 127 insertions(+), 65 deletions(-)
>  create mode 100644 lib/percpu-rwsem.c
> 
> Index: linux-3.6.6-fast/include/linux/percpu-rwsem.h
> ===================================================================
> --- linux-3.6.6-fast.orig/include/linux/percpu-rwsem.h	2012-11-05 16:21:29.000000000 +0100
> +++ linux-3.6.6-fast/include/linux/percpu-rwsem.h	2012-11-07 16:44:04.000000000 +0100
> @@ -2,82 +2,34 @@
>  #define _LINUX_PERCPU_RWSEM_H
> 
>  #include <linux/mutex.h>
> +#include <linux/rwsem.h>
>  #include <linux/percpu.h>
> -#include <linux/rcupdate.h>
> -#include <linux/delay.h>
> +#include <linux/wait.h>
> 
>  struct percpu_rw_semaphore {
> -	unsigned __percpu *counters;
> -	bool locked;
> -	struct mutex mtx;
> +	unsigned int __percpu	*fast_read_ctr;
> +	struct mutex		writer_mutex;
> +	struct rw_semaphore	rw_sem;
> +	atomic_t		slow_read_ctr;
> +	wait_queue_head_t	write_waitq;
>  };
> 
> -#define light_mb()	barrier()
> -#define heavy_mb()	synchronize_sched()
> +extern void __percpu_down_up_read(struct percpu_rw_semaphore *, int);
> 
> -static inline void percpu_down_read(struct percpu_rw_semaphore *p)
> -{
> -	rcu_read_lock_sched();
> -	if (unlikely(p->locked)) {
> -		rcu_read_unlock_sched();
> -		mutex_lock(&p->mtx);
> -		this_cpu_inc(*p->counters);
> -		mutex_unlock(&p->mtx);
> -		return;
> -	}
> -	this_cpu_inc(*p->counters);
> -	rcu_read_unlock_sched();
> -	light_mb(); /* A, between read of p->locked and read of data, paired with D */
> -}
> -
> -static inline void percpu_up_read(struct percpu_rw_semaphore *p)
> -{
> -	light_mb(); /* B, between read of the data and write to p->counter, paired with C */
> -	this_cpu_dec(*p->counters);
> -}
> -
> -static inline unsigned __percpu_count(unsigned __percpu *counters)
> -{
> -	unsigned total = 0;
> -	int cpu;
> -
> -	for_each_possible_cpu(cpu)
> -		total += ACCESS_ONCE(*per_cpu_ptr(counters, cpu));
> +extern void percpu_down_write(struct percpu_rw_semaphore *);
> +extern void percpu_up_write(struct percpu_rw_semaphore *);
> 
> -	return total;
> -}
> -
> -static inline void percpu_down_write(struct percpu_rw_semaphore *p)
> -{
> -	mutex_lock(&p->mtx);
> -	p->locked = true;
> -	synchronize_sched(); /* make sure that all readers exit the rcu_read_lock_sched region */
> -	while (__percpu_count(p->counters))
> -		msleep(1);
> -	heavy_mb(); /* C, between read of p->counter and write to data, paired with B */
> -}
> -
> -static inline void percpu_up_write(struct percpu_rw_semaphore *p)
> -{
> -	heavy_mb(); /* D, between write to data and write to p->locked, paired with A */
> -	p->locked = false;
> -	mutex_unlock(&p->mtx);
> -}
> +extern int percpu_init_rwsem(struct percpu_rw_semaphore *);
> +extern void percpu_free_rwsem(struct percpu_rw_semaphore *);
> 
> -static inline int percpu_init_rwsem(struct percpu_rw_semaphore *p)
> +static inline void percpu_down_read(struct percpu_rw_semaphore *s)
>  {
> -	p->counters = alloc_percpu(unsigned);
> -	if (unlikely(!p->counters))
> -		return -ENOMEM;
> -	p->locked = false;
> -	mutex_init(&p->mtx);
> -	return 0;
> +	__percpu_down_up_read(s, 1);
>  }
> 
> -static inline void percpu_free_rwsem(struct percpu_rw_semaphore *p)
> +static inline void percpu_up_read(struct percpu_rw_semaphore *s)
>  {
> -	free_percpu(p->counters);
> -	p->counters = NULL; /* catch use after free bugs */
> +	__percpu_down_up_read(s, -1);
>  }
> 
>  #endif
> Index: linux-3.6.6-fast/lib/Makefile
> ===================================================================
> --- linux-3.6.6-fast.orig/lib/Makefile	2012-10-02 00:47:57.000000000 +0200
> +++ linux-3.6.6-fast/lib/Makefile	2012-11-07 03:10:44.000000000 +0100
> @@ -12,7 +12,7 @@ lib-y := ctype.o string.o vsprintf.o cmd
>  	 idr.o int_sqrt.o extable.o prio_tree.o \
>  	 sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \
>  	 proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \
> -	 is_single_threaded.o plist.o decompress.o
> +	 is_single_threaded.o plist.o decompress.o percpu-rwsem.o
> 
>  lib-$(CONFIG_MMU) += ioremap.o
>  lib-$(CONFIG_SMP) += cpumask.o
> Index: linux-3.6.6-fast/lib/percpu-rwsem.c
> ===================================================================
> --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> +++ linux-3.6.6-fast/lib/percpu-rwsem.c	2012-11-07 16:43:27.000000000 +0100
> @@ -0,0 +1,110 @@
> +#include <linux/percpu-rwsem.h>
> +#include <linux/rcupdate.h>
> +#include <linux/sched.h>
> +#include <linux/module.h>
> +
> +int percpu_init_rwsem(struct percpu_rw_semaphore *brw)
> +{
> +	brw->fast_read_ctr = alloc_percpu(int);
> +	if (unlikely(!brw->fast_read_ctr))
> +		return -ENOMEM;
> +
> +	mutex_init(&brw->writer_mutex);
> +	init_rwsem(&brw->rw_sem);
> +	atomic_set(&brw->slow_read_ctr, 0);
> +	init_waitqueue_head(&brw->write_waitq);
> +	return 0;
> +}
> +EXPORT_SYMBOL(percpu_init_rwsem);
> +
> +void percpu_free_rwsem(struct percpu_rw_semaphore *brw)
> +{
> +	free_percpu(brw->fast_read_ctr);
> +	brw->fast_read_ctr = NULL; /* catch use after free bugs */
> +}
> +EXPORT_SYMBOL(percpu_free_rwsem);
> +
> +void __percpu_down_up_read(struct percpu_rw_semaphore *brw, int val)
> +{
> +	preempt_disable();
> +	if (likely(!mutex_is_locked(&brw->writer_mutex))) {
> +		__this_cpu_add(*brw->fast_read_ctr, val);
> +		preempt_enable();
> +		return;
> +	}
> +	preempt_enable();
> +	if (val >= 0) {
> +		down_read(&brw->rw_sem);
> +		atomic_inc(&brw->slow_read_ctr);
> +		up_read(&brw->rw_sem);
> +	} else {
> +		if (atomic_dec_and_test(&brw->slow_read_ctr))
> +			wake_up_all(&brw->write_waitq);
> +	}
> +}
> +EXPORT_SYMBOL(__percpu_down_up_read);
> +
> +static int clear_fast_ctr(struct percpu_rw_semaphore *brw)
> +{
> +	unsigned int sum = 0;
> +	int cpu;
> +
> +	for_each_possible_cpu(cpu) {
> +		sum += per_cpu(*brw->fast_read_ctr, cpu);
> +		per_cpu(*brw->fast_read_ctr, cpu) = 0;
> +	}
> +
> +	return sum;
> +}
> +
> +/*
> + * A writer takes ->writer_mutex to exclude other writers and to force the
> + * readers to switch to the slow mode, note the mutex_is_locked() check in
> + * update_fast_ctr().
> + *
> + * After that the readers can only inc/dec the slow ->slow_read_ctr counter,
> + * ->fast_read_ctr is stable. Once the writer moves its sum into the slow
> + * counter it represents the number of active readers.
> + *
> + * Finally the writer takes ->rw_sem for writing and blocks the new readers,
> + * then waits until the slow counter becomes zero.
> + */
> +void percpu_down_write(struct percpu_rw_semaphore *brw)
> +{
> +	/* also blocks update_fast_ctr() which checks mutex_is_locked() */
> +	mutex_lock(&brw->writer_mutex);
> +
> +	/*
> +	 * 1. Ensures mutex_is_locked() is visible to any down_read/up_read
> +	 *    so that update_fast_ctr() can't succeed.
> +	 *
> +	 * 2. Ensures we see the result of every previous this_cpu_add() in
> +	 *    update_fast_ctr().
> +	 *
> +	 * 3. Ensures that if any reader has exited its critical section via
> +	 *    fast-path, it executes a full memory barrier before we return.
> +	 */
> +	synchronize_sched();
> +
> +	/* nobody can use fast_read_ctr, move its sum into slow_read_ctr */
> +	atomic_add(clear_fast_ctr(brw), &brw->slow_read_ctr);
> +
> +	/* block the new readers completely */
> +	down_write(&brw->rw_sem);
> +
> +	/* wait for all readers to complete their percpu_up_read() */
> +	wait_event(brw->write_waitq, !atomic_read(&brw->slow_read_ctr));
> +}
> +EXPORT_SYMBOL(percpu_down_write);
> +
> +void percpu_up_write(struct percpu_rw_semaphore *brw)
> +{
> +	/* allow the new readers, but only the slow-path */
> +	up_write(&brw->rw_sem);
> +
> +	/* insert the barrier before the next fast-path in down_read */
> +	synchronize_sched();
> +
> +	mutex_unlock(&brw->writer_mutex);
> +}
> +EXPORT_SYMBOL(percpu_up_write);
> 


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
  2012-11-08  1:16                               ` [PATCH v2 " Paul E. McKenney
@ 2012-11-08 13:33                                 ` Oleg Nesterov
  2012-11-08 16:27                                   ` Paul E. McKenney
  0 siblings, 1 reply; 103+ messages in thread
From: Oleg Nesterov @ 2012-11-08 13:33 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Linus Torvalds, Mikulas Patocka, Peter Zijlstra, Ingo Molnar,
	Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov,
	linux-kernel

On 11/07, Paul E. McKenney wrote:
>
> On Fri, Nov 02, 2012 at 07:06:29PM +0100, Oleg Nesterov wrote:
> > +void percpu_down_write(struct percpu_rw_semaphore *brw)
> > +{
> > +	/* also blocks update_fast_ctr() which checks mutex_is_locked() */
> > +	mutex_lock(&brw->writer_mutex);
> > +
> > +	/*
> > +	 * 1. Ensures mutex_is_locked() is visible to any down_read/up_read
> > +	 *    so that update_fast_ctr() can't succeed.
> > +	 *
> > +	 * 2. Ensures we see the result of every previous this_cpu_add() in
> > +	 *    update_fast_ctr().
> > +	 *
> > +	 * 3. Ensures that if any reader has exited its critical section via
> > +	 *    fast-path, it executes a full memory barrier before we return.
> > +	 */
> > +	synchronize_sched();
> > +
> > +	/* nobody can use fast_read_ctr, move its sum into slow_read_ctr */
> > +	atomic_add(clear_fast_ctr(brw), &brw->slow_read_ctr);
> > +
> > +	/* block the new readers completely */
> > +	down_write(&brw->rw_sem);
> > +
> > +	/* wait for all readers to complete their percpu_up_read() */
> > +	wait_event(brw->write_waitq, !atomic_read(&brw->slow_read_ctr));
> > +}
> > +
> > +void percpu_up_write(struct percpu_rw_semaphore *brw)
> > +{
> > +	/* allow the new readers, but only the slow-path */
> > +	up_write(&brw->rw_sem);
> > +
> > +	/* insert the barrier before the next fast-path in down_read */
> > +	synchronize_sched();
>
> Ah, my added comments describing the memory-order properties of
> synchronize_sched() were incomplete.  As you say in the comment above,
> a valid RCU implementation must ensure that each CPU executes a memory
> barrier between the time that synchronize_sched() starts executing and
> the time that this same CPU starts its first RCU read-side critical
> section that ends after synchronize_sched() finishes executing.  (This
> is symmetric with the requirement discussed earlier.)

I think, yes. Let me repeat my example (changed a little bit). Suppose
that we have

	int A = 0, B = 0, STOP = 0;

	// can be called at any time, and many times
	void func(void)
	{
		rcu_read_lock_sched();
		if (!STOP) {
			A++;
			B++;
		}
		rcu_read_unlock_sched();
	}

Then I believe the following code should be correct:

	STOP = 1;

	synchronize_sched();

	BUG_ON(A != B);

We should see the result of the previous increments, and func() should
see STOP != 0 if it races with BUG_ON().

> And if a reader sees brw->writer_mutex as unlocked, then that reader's
> RCU read-side critical section must end after the above synchronize_sched()
> completes, which in turn means that there must have been a memory barrier
> on that reader's CPU after the synchronize_sched() started, so that the
> reader correctly sees the writer's updates.

Yes.

> But please let me know what you
> think of the added memory-order constraint.

I am going to (try to) do other changes on top of this patch, and I'll
certainly try to think more about this, thanks.

> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

Great! thanks a lot Paul.

Oleg.


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
  2012-11-07 19:17                                   ` Mikulas Patocka
@ 2012-11-08 13:42                                     ` Oleg Nesterov
  0 siblings, 0 replies; 103+ messages in thread
From: Oleg Nesterov @ 2012-11-08 13:42 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Linus Torvalds, Paul E. McKenney, Peter Zijlstra, Ingo Molnar,
	Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov,
	linux-kernel

On 11/07, Mikulas Patocka wrote:
>
> On Wed, 7 Nov 2012, Oleg Nesterov wrote:
>
> > On 11/07, Mikulas Patocka wrote:
> > >
> > > It looks sensible.
> > >
> > > Here I'm sending an improvement of the patch - I changed it so that there
> > > are not two-level nested functions for the fast path and so that both
> > > percpu_down_read and percpu_up_read use the same piece of code (to reduce
> > > cache footprint).
> >
> > IOW, the only change is that you eliminate "static update_fast_ctr()"
> > and fold it into down/up_read which takes the additional argument.
> >
> > Honestly, personally I do not think this is better, but I won't argue.
> > I agree with everything but I guess we need the ack from Paul.
>
> If you look at generated assembly (for x86-64), the footprint of my patch
> is 78 bytes shared for both percpu_down_read and percpu_up_read.
>
> The footprint of your patch is 62 bytes for update_fast_ctr, 46 bytes for
> percpu_down_read and 20 bytes for percpu_up_read.

Still I think the code looks cleaner this way, and personally I think
this is more important. Plus, this lessens the footprint for the caller,
although I agree this is minor.

Please send the incremental patch if you wish, I won't argue. But note
that with the lockdep annotations (and I'll send the patch soon) the
code will look even worse. Either you need another "if (val > 0)" check
or you need to add rwsem_acquire_read/rwsem_release into the .h.

And if you do this change, please also update the comments; they still
refer to update_fast_ctr(), which you folded into down_up ;)
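
Something like the following is what I mean by the second option, just a
rough sketch to illustrate it (assuming the embedded rw_sem's dep_map is
reused; the actual lockdep patch may end up looking different):

	static inline void percpu_down_read(struct percpu_rw_semaphore *s)
	{
		rwsem_acquire_read(&s->rw_sem.dep_map, 0, 0, _RET_IP_);
		__percpu_down_up_read(s, 1);
	}

	static inline void percpu_up_read(struct percpu_rw_semaphore *s)
	{
		rwsem_release(&s->rw_sem.dep_map, 1, _RET_IP_);
		__percpu_down_up_read(s, -1);
	}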

Oleg.


^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH RESEND v2 0/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
  2012-11-02 18:06                           ` [PATCH v2 0/1] " Oleg Nesterov
  2012-11-02 18:06                             ` [PATCH v2 1/1] " Oleg Nesterov
@ 2012-11-08 13:48                             ` Oleg Nesterov
  2012-11-08 13:48                               ` [PATCH RESEND v2 1/1] " Oleg Nesterov
  1 sibling, 1 reply; 103+ messages in thread
From: Oleg Nesterov @ 2012-11-08 13:48 UTC (permalink / raw)
  To: Andrew Morton, Linus Torvalds, Paul E. McKenney
  Cc: Mikulas Patocka, Peter Zijlstra, Ingo Molnar, Srikar Dronamraju,
	Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel

On 11/02, Oleg Nesterov wrote:
>
> On 11/01, Linus Torvalds wrote:
> >
> > Other than that, I guess it looks ok.
>
> Great, please see v2.
>
> I am not sure I addressed Paul's concerns, so I guess I need his ack.

And now I have it, so I think the patch is ready.

Please see 1/1. No changes except I added Reviewed-by from Paul.

If it is too late for 3.7, then maybe Andrew can pick it up.

Oleg.


^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
  2012-11-08 13:48                             ` [PATCH RESEND v2 0/1] " Oleg Nesterov
@ 2012-11-08 13:48                               ` Oleg Nesterov
  2012-11-08 20:07                                 ` Andrew Morton
  0 siblings, 1 reply; 103+ messages in thread
From: Oleg Nesterov @ 2012-11-08 13:48 UTC (permalink / raw)
  To: Andrew Morton, Linus Torvalds, Paul E. McKenney
  Cc: Mikulas Patocka, Peter Zijlstra, Ingo Molnar, Srikar Dronamraju,
	Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel

Currently the writer does msleep() plus synchronize_sched() 3 times
to acquire/release the semaphore, and during this time the readers
are blocked completely, even if the "write" section was not actually
started or was already finished.

With this patch down_write/up_write does synchronize_sched() twice
and down_read/up_read are still possible during this time; they just
use the slow path.

percpu_down_write() first forces the readers to use rw_semaphore and
increment the "slow" counter to take the lock for reading, then it
takes that rw_semaphore for writing and blocks the readers.

Also, with this patch the code relies on the documented behaviour of
synchronize_sched(); it doesn't try to pair synchronize_sched() with
a barrier.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 include/linux/percpu-rwsem.h |   83 +++++------------------------
 lib/Makefile                 |    2 +-
 lib/percpu-rwsem.c           |  123 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 137 insertions(+), 71 deletions(-)
 create mode 100644 lib/percpu-rwsem.c

diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
index 250a4ac..592f0d6 100644
--- a/include/linux/percpu-rwsem.h
+++ b/include/linux/percpu-rwsem.h
@@ -2,82 +2,25 @@
 #define _LINUX_PERCPU_RWSEM_H
 
 #include <linux/mutex.h>
+#include <linux/rwsem.h>
 #include <linux/percpu.h>
-#include <linux/rcupdate.h>
-#include <linux/delay.h>
+#include <linux/wait.h>
 
 struct percpu_rw_semaphore {
-	unsigned __percpu *counters;
-	bool locked;
-	struct mutex mtx;
+	unsigned int __percpu	*fast_read_ctr;
+	struct mutex		writer_mutex;
+	struct rw_semaphore	rw_sem;
+	atomic_t		slow_read_ctr;
+	wait_queue_head_t	write_waitq;
 };
 
-#define light_mb()	barrier()
-#define heavy_mb()	synchronize_sched()
+extern void percpu_down_read(struct percpu_rw_semaphore *);
+extern void percpu_up_read(struct percpu_rw_semaphore *);
 
-static inline void percpu_down_read(struct percpu_rw_semaphore *p)
-{
-	rcu_read_lock_sched();
-	if (unlikely(p->locked)) {
-		rcu_read_unlock_sched();
-		mutex_lock(&p->mtx);
-		this_cpu_inc(*p->counters);
-		mutex_unlock(&p->mtx);
-		return;
-	}
-	this_cpu_inc(*p->counters);
-	rcu_read_unlock_sched();
-	light_mb(); /* A, between read of p->locked and read of data, paired with D */
-}
+extern void percpu_down_write(struct percpu_rw_semaphore *);
+extern void percpu_up_write(struct percpu_rw_semaphore *);
 
-static inline void percpu_up_read(struct percpu_rw_semaphore *p)
-{
-	light_mb(); /* B, between read of the data and write to p->counter, paired with C */
-	this_cpu_dec(*p->counters);
-}
-
-static inline unsigned __percpu_count(unsigned __percpu *counters)
-{
-	unsigned total = 0;
-	int cpu;
-
-	for_each_possible_cpu(cpu)
-		total += ACCESS_ONCE(*per_cpu_ptr(counters, cpu));
-
-	return total;
-}
-
-static inline void percpu_down_write(struct percpu_rw_semaphore *p)
-{
-	mutex_lock(&p->mtx);
-	p->locked = true;
-	synchronize_sched(); /* make sure that all readers exit the rcu_read_lock_sched region */
-	while (__percpu_count(p->counters))
-		msleep(1);
-	heavy_mb(); /* C, between read of p->counter and write to data, paired with B */
-}
-
-static inline void percpu_up_write(struct percpu_rw_semaphore *p)
-{
-	heavy_mb(); /* D, between write to data and write to p->locked, paired with A */
-	p->locked = false;
-	mutex_unlock(&p->mtx);
-}
-
-static inline int percpu_init_rwsem(struct percpu_rw_semaphore *p)
-{
-	p->counters = alloc_percpu(unsigned);
-	if (unlikely(!p->counters))
-		return -ENOMEM;
-	p->locked = false;
-	mutex_init(&p->mtx);
-	return 0;
-}
-
-static inline void percpu_free_rwsem(struct percpu_rw_semaphore *p)
-{
-	free_percpu(p->counters);
-	p->counters = NULL; /* catch use after free bugs */
-}
+extern int percpu_init_rwsem(struct percpu_rw_semaphore *);
+extern void percpu_free_rwsem(struct percpu_rw_semaphore *);
 
 #endif
diff --git a/lib/Makefile b/lib/Makefile
index 821a162..4dad4a7 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -12,7 +12,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
 	 idr.o int_sqrt.o extable.o \
 	 sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \
 	 proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \
-	 is_single_threaded.o plist.o decompress.o
+	 is_single_threaded.o plist.o decompress.o percpu-rwsem.o
 
 lib-$(CONFIG_MMU) += ioremap.o
 lib-$(CONFIG_SMP) += cpumask.o
diff --git a/lib/percpu-rwsem.c b/lib/percpu-rwsem.c
new file mode 100644
index 0000000..0e3bc0f
--- /dev/null
+++ b/lib/percpu-rwsem.c
@@ -0,0 +1,123 @@
+#include <linux/percpu-rwsem.h>
+#include <linux/rcupdate.h>
+#include <linux/sched.h>
+
+int percpu_init_rwsem(struct percpu_rw_semaphore *brw)
+{
+	brw->fast_read_ctr = alloc_percpu(int);
+	if (unlikely(!brw->fast_read_ctr))
+		return -ENOMEM;
+
+	mutex_init(&brw->writer_mutex);
+	init_rwsem(&brw->rw_sem);
+	atomic_set(&brw->slow_read_ctr, 0);
+	init_waitqueue_head(&brw->write_waitq);
+	return 0;
+}
+
+void percpu_free_rwsem(struct percpu_rw_semaphore *brw)
+{
+	free_percpu(brw->fast_read_ctr);
+	brw->fast_read_ctr = NULL; /* catch use after free bugs */
+}
+
+static bool update_fast_ctr(struct percpu_rw_semaphore *brw, unsigned int val)
+{
+	bool success = false;
+
+	preempt_disable();
+	if (likely(!mutex_is_locked(&brw->writer_mutex))) {
+		__this_cpu_add(*brw->fast_read_ctr, val);
+		success = true;
+	}
+	preempt_enable();
+
+	return success;
+}
+
+/*
+ * Like the normal down_read() this is not recursive, the writer can
+ * come after the first percpu_down_read() and create the deadlock.
+ */
+void percpu_down_read(struct percpu_rw_semaphore *brw)
+{
+	if (likely(update_fast_ctr(brw, +1)))
+		return;
+
+	down_read(&brw->rw_sem);
+	atomic_inc(&brw->slow_read_ctr);
+	up_read(&brw->rw_sem);
+}
+
+void percpu_up_read(struct percpu_rw_semaphore *brw)
+{
+	if (likely(update_fast_ctr(brw, -1)))
+		return;
+
+	/* false-positive is possible but harmless */
+	if (atomic_dec_and_test(&brw->slow_read_ctr))
+		wake_up_all(&brw->write_waitq);
+}
+
+static int clear_fast_ctr(struct percpu_rw_semaphore *brw)
+{
+	unsigned int sum = 0;
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		sum += per_cpu(*brw->fast_read_ctr, cpu);
+		per_cpu(*brw->fast_read_ctr, cpu) = 0;
+	}
+
+	return sum;
+}
+
+/*
+ * A writer takes ->writer_mutex to exclude other writers and to force the
+ * readers to switch to the slow mode, note the mutex_is_locked() check in
+ * update_fast_ctr().
+ *
+ * After that the readers can only inc/dec the slow ->slow_read_ctr counter,
+ * ->fast_read_ctr is stable. Once the writer moves its sum into the slow
+ * counter it represents the number of active readers.
+ *
+ * Finally the writer takes ->rw_sem for writing and blocks the new readers,
+ * then waits until the slow counter becomes zero.
+ */
+void percpu_down_write(struct percpu_rw_semaphore *brw)
+{
+	/* also blocks update_fast_ctr() which checks mutex_is_locked() */
+	mutex_lock(&brw->writer_mutex);
+
+	/*
+	 * 1. Ensures mutex_is_locked() is visible to any down_read/up_read
+	 *    so that update_fast_ctr() can't succeed.
+	 *
+	 * 2. Ensures we see the result of every previous this_cpu_add() in
+	 *    update_fast_ctr().
+	 *
+	 * 3. Ensures that if any reader has exited its critical section via
+	 *    fast-path, it executes a full memory barrier before we return.
+	 */
+	synchronize_sched();
+
+	/* nobody can use fast_read_ctr, move its sum into slow_read_ctr */
+	atomic_add(clear_fast_ctr(brw), &brw->slow_read_ctr);
+
+	/* block the new readers completely */
+	down_write(&brw->rw_sem);
+
+	/* wait for all readers to complete their percpu_up_read() */
+	wait_event(brw->write_waitq, !atomic_read(&brw->slow_read_ctr));
+}
+
+void percpu_up_write(struct percpu_rw_semaphore *brw)
+{
+	/* allow the new readers, but only the slow-path */
+	up_write(&brw->rw_sem);
+
+	/* insert the barrier before the next fast-path in down_read */
+	synchronize_sched();
+
+	mutex_unlock(&brw->writer_mutex);
+}
-- 
1.5.5.1



^ permalink raw reply related	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
  2012-11-08 13:33                                 ` Oleg Nesterov
@ 2012-11-08 16:27                                   ` Paul E. McKenney
  0 siblings, 0 replies; 103+ messages in thread
From: Paul E. McKenney @ 2012-11-08 16:27 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Mikulas Patocka, Peter Zijlstra, Ingo Molnar,
	Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov,
	linux-kernel

On Thu, Nov 08, 2012 at 02:33:27PM +0100, Oleg Nesterov wrote:
> On 11/07, Paul E. McKenney wrote:
> >
> > On Fri, Nov 02, 2012 at 07:06:29PM +0100, Oleg Nesterov wrote:
> > > +void percpu_down_write(struct percpu_rw_semaphore *brw)
> > > +{
> > > +	/* also blocks update_fast_ctr() which checks mutex_is_locked() */
> > > +	mutex_lock(&brw->writer_mutex);
> > > +
> > > +	/*
> > > +	 * 1. Ensures mutex_is_locked() is visible to any down_read/up_read
> > > +	 *    so that update_fast_ctr() can't succeed.
> > > +	 *
> > > +	 * 2. Ensures we see the result of every previous this_cpu_add() in
> > > +	 *    update_fast_ctr().
> > > +	 *
> > > +	 * 3. Ensures that if any reader has exited its critical section via
> > > +	 *    fast-path, it executes a full memory barrier before we return.
> > > +	 */
> > > +	synchronize_sched();
> > > +
> > > +	/* nobody can use fast_read_ctr, move its sum into slow_read_ctr */
> > > +	atomic_add(clear_fast_ctr(brw), &brw->slow_read_ctr);
> > > +
> > > +	/* block the new readers completely */
> > > +	down_write(&brw->rw_sem);
> > > +
> > > +	/* wait for all readers to complete their percpu_up_read() */
> > > +	wait_event(brw->write_waitq, !atomic_read(&brw->slow_read_ctr));
> > > +}
> > > +
> > > +void percpu_up_write(struct percpu_rw_semaphore *brw)
> > > +{
> > > +	/* allow the new readers, but only the slow-path */
> > > +	up_write(&brw->rw_sem);
> > > +
> > > +	/* insert the barrier before the next fast-path in down_read */
> > > +	synchronize_sched();
> >
> > Ah, my added comments describing the memory-order properties of
> > synchronize_sched() were incomplete.  As you say in the comment above,
> > a valid RCU implementation must ensure that each CPU executes a memory
> > barrier between the time that synchronize_sched() starts executing and
> > the time that this same CPU starts its first RCU read-side critical
> > section that ends after synchronize_sched() finishes executing.  (This
> > is symmetric with the requirement discussed earlier.)
> 
> I think, yes. Let me repeat my example (changed a little bit). Suppose
> that we have
> 
> 	int A = 0, B = 0, STOP = 0;
> 
> 	// can be called at any time, and many times
> 	void func(void)
> 	{
> 		rcu_read_lock_sched();
> 		if (!STOP) {
> 			A++;
> 			B++;
> 		}
> 		rcu_read_unlock_sched();
> 	}
> 
> Then I believe the following code should be correct:
> 
> 	STOP = 1;
> 
> 	synchronize_sched();
> 
> 	BUG_ON(A != B);

Agreed, but covered by my earlier definition.

> We should see the result of the previous increments, and func() should
> see STOP != 0 if it races with BUG_ON().

Alternatively, if we have something like:

	if (!STOP) {
		A++;
		B++;
		if (random() & 0xffff) {
			synchronize_sched();
			STOP = 1;
		}
	}

Then if we also have elsewhere:

	rcu_read_lock_sched();
	if (STOP)
		BUG_ON(A != B);
	rcu_read_unlock_sched();

The BUG_ON() should never fire.

This one requires the other guarantee, that if a given RCU read-side
critical section ends after a given synchronize_sched(), then the CPU
executing that RCU read-side critical section is guaranteed to have
executed a memory barrier between the start of that synchronize_sched()
and the start of that RCU read-side critical section.

> > And if a reader sees brw->writer_mutex as unlocked, then that reader's
> > RCU read-side critical section must end after the above synchronize_sched()
> > completes, which in turn means that there must have been a memory barrier
> > on that reader's CPU after the synchronize_sched() started, so that the
> > reader correctly sees the writer's updates.
> 
> Yes.
> 
> > But please let me know what you
> > think of the added memory-order constraint.
> 
> I am going to (try to) do other changes on top of this patch, and I'll
> certainly try to think more about this, thanks.

Looking forward to hearing your thoughts!

							Thanx, Paul

> > Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> 
> Great! thanks a lot Paul.
> 
> Oleg.
> 


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
  2012-11-08 13:48                               ` [PATCH RESEND v2 1/1] " Oleg Nesterov
@ 2012-11-08 20:07                                 ` Andrew Morton
  2012-11-08 21:08                                   ` Paul E. McKenney
                                                     ` (3 more replies)
  0 siblings, 4 replies; 103+ messages in thread
From: Andrew Morton @ 2012-11-08 20:07 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Paul E. McKenney, Mikulas Patocka,
	Peter Zijlstra, Ingo Molnar, Srikar Dronamraju,
	Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel

On Thu, 8 Nov 2012 14:48:49 +0100
Oleg Nesterov <oleg@redhat.com> wrote:

> Currently the writer does msleep() plus synchronize_sched() 3 times
> to acquire/release the semaphore, and during this time the readers
> are blocked completely. Even if the "write" section was not actually
> started or if it was already finished.
> 
> With this patch down_write/up_write does synchronize_sched() twice
> and down_read/up_read are still possible during this time, just they
> use the slow path.
> 
> percpu_down_write() first forces the readers to use rw_semaphore and
> increment the "slow" counter to take the lock for reading, then it
> takes that rw_semaphore for writing and blocks the readers.
> 
> Also. With this patch the code relies on the documented behaviour of
> synchronize_sched(), it doesn't try to pair synchronize_sched() with
> barrier.
> 
> ...
>
>  include/linux/percpu-rwsem.h |   83 +++++------------------------
>  lib/Makefile                 |    2 +-
>  lib/percpu-rwsem.c           |  123 ++++++++++++++++++++++++++++++++++++++++++

The patch also uninlines everything.

And it didn't export the resulting symbols to modules, so it isn't an
equivalent.  We can export things later if needed, I guess.
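
(For reference only: if modular users do show up, the exports are just
one-liners next to each definition in lib/percpu-rwsem.c, e.g.

	EXPORT_SYMBOL_GPL(percpu_init_rwsem);
	EXPORT_SYMBOL_GPL(percpu_free_rwsem);
	EXPORT_SYMBOL_GPL(percpu_down_read);
	EXPORT_SYMBOL_GPL(percpu_up_read);
	EXPORT_SYMBOL_GPL(percpu_down_write);
	EXPORT_SYMBOL_GPL(percpu_up_write);

so that can indeed wait until someone needs it.)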

It adds percpu-rwsem.o to lib-y, so the CONFIG_BLOCK=n kernel will
avoid including the code altogether, methinks?

>
> ...
>
> --- /dev/null
> +++ b/lib/percpu-rwsem.c
> @@ -0,0 +1,123 @@

That was nice and terse ;)

> +#include <linux/percpu-rwsem.h>
> +#include <linux/rcupdate.h>
> +#include <linux/sched.h>

This list is nowhere near sufficient to support this file's
requirements.  atomic.h, percpu.h, rwsem.h, wait.h, errno.h and plenty
more.  IOW, if it compiles, it was sheer luck.
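
(Going by what the code actually touches, something like the following set --
an untested guess on my part rather than a checked list:

	#include <linux/percpu-rwsem.h>
	#include <linux/atomic.h>
	#include <linux/errno.h>
	#include <linux/mutex.h>
	#include <linux/percpu.h>
	#include <linux/rcupdate.h>
	#include <linux/rwsem.h>
	#include <linux/sched.h>
	#include <linux/wait.h>

ie. include what you use instead of relying on whatever the first header
happens to drag in.)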

> +int percpu_init_rwsem(struct percpu_rw_semaphore *brw)
> +{
> +	brw->fast_read_ctr = alloc_percpu(int);
> +	if (unlikely(!brw->fast_read_ctr))
> +		return -ENOMEM;
> +
> +	mutex_init(&brw->writer_mutex);
> +	init_rwsem(&brw->rw_sem);
> +	atomic_set(&brw->slow_read_ctr, 0);
> +	init_waitqueue_head(&brw->write_waitq);
> +	return 0;
> +}
> +
> +void percpu_free_rwsem(struct percpu_rw_semaphore *brw)
> +{
> +	free_percpu(brw->fast_read_ctr);
> +	brw->fast_read_ctr = NULL; /* catch use after free bugs */
> +}
> +
> +static bool update_fast_ctr(struct percpu_rw_semaphore *brw, unsigned int val)
> +{
> +	bool success = false;
> +
> +	preempt_disable();
> +	if (likely(!mutex_is_locked(&brw->writer_mutex))) {
> +		__this_cpu_add(*brw->fast_read_ctr, val);
> +		success = true;
> +	}
> +	preempt_enable();
> +
> +	return success;
> +}
> +
> +/*
> + * Like the normal down_read() this is not recursive, the writer can
> + * come after the first percpu_down_read() and create the deadlock.
> + */
> +void percpu_down_read(struct percpu_rw_semaphore *brw)
> +{
> +	if (likely(update_fast_ctr(brw, +1)))
> +		return;
> +
> +	down_read(&brw->rw_sem);
> +	atomic_inc(&brw->slow_read_ctr);
> +	up_read(&brw->rw_sem);
> +}
> +
> +void percpu_up_read(struct percpu_rw_semaphore *brw)
> +{
> +	if (likely(update_fast_ctr(brw, -1)))
> +		return;
> +
> +	/* false-positive is possible but harmless */
> +	if (atomic_dec_and_test(&brw->slow_read_ctr))
> +		wake_up_all(&brw->write_waitq);
> +}
> +
> +static int clear_fast_ctr(struct percpu_rw_semaphore *brw)
> +{
> +	unsigned int sum = 0;
> +	int cpu;
> +
> +	for_each_possible_cpu(cpu) {
> +		sum += per_cpu(*brw->fast_read_ctr, cpu);
> +		per_cpu(*brw->fast_read_ctr, cpu) = 0;
> +	}
> +
> +	return sum;
> +}
> +
> +/*
> + * A writer takes ->writer_mutex to exclude other writers and to force the
> + * readers to switch to the slow mode, note the mutex_is_locked() check in
> + * update_fast_ctr().
> + *
> + * After that the readers can only inc/dec the slow ->slow_read_ctr counter,
> + * ->fast_read_ctr is stable. Once the writer moves its sum into the slow
> + * counter it represents the number of active readers.
> + *
> + * Finally the writer takes ->rw_sem for writing and blocks the new readers,
> + * then waits until the slow counter becomes zero.
> + */

Some overview of how fast/slow_read_ctr are supposed to work would be
useful.  This comment seems to assume that the reader already knew
that.

> +void percpu_down_write(struct percpu_rw_semaphore *brw)
> +{
> +	/* also blocks update_fast_ctr() which checks mutex_is_locked() */
> +	mutex_lock(&brw->writer_mutex);
> +
> +	/*
> +	 * 1. Ensures mutex_is_locked() is visible to any down_read/up_read
> +	 *    so that update_fast_ctr() can't succeed.
> +	 *
> +	 * 2. Ensures we see the result of every previous this_cpu_add() in
> +	 *    update_fast_ctr().
> +	 *
> +	 * 3. Ensures that if any reader has exited its critical section via
> +	 *    fast-path, it executes a full memory barrier before we return.
> +	 */
> +	synchronize_sched();

Here's where I get horridly confused.  Your patch completely deRCUifies
this code, yes?  Yet here we're using an RCU primitive.  And we seem to
be using it not as an RCU primitive but as a handy thing which happens
to have desirable side-effects.  But the implementation of
synchronize_sched() differs considerably according to which rcu
flavor-of-the-minute you're using.

And part 3 talks about the reader's critical section.  The only
critical sections I can see on the reader side are already covered by
mutex_lock() and preempt_disable().

I get this feeling I don't have a clue what's going on here and I think
I'll just retire hurt now.  If this code isn't as brain damaged as it
initially appears then please, go easy on us simpletons in the next
version?

> +	/* nobody can use fast_read_ctr, move its sum into slow_read_ctr */
> +	atomic_add(clear_fast_ctr(brw), &brw->slow_read_ctr);
> +
> +	/* block the new readers completely */
> +	down_write(&brw->rw_sem);
> +
> +	/* wait for all readers to complete their percpu_up_read() */
> +	wait_event(brw->write_waitq, !atomic_read(&brw->slow_read_ctr));
> +}
> +
> +void percpu_up_write(struct percpu_rw_semaphore *brw)
> +{
> +	/* allow the new readers, but only the slow-path */
> +	up_write(&brw->rw_sem);
> +
> +	/* insert the barrier before the next fast-path in down_read */
> +	synchronize_sched();
> +
> +	mutex_unlock(&brw->writer_mutex);
> +}


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
  2012-11-08 20:07                                 ` Andrew Morton
@ 2012-11-08 21:08                                   ` Paul E. McKenney
  2012-11-08 23:41                                     ` Mikulas Patocka
  2012-11-09 12:47                                   ` Mikulas Patocka
                                                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 103+ messages in thread
From: Paul E. McKenney @ 2012-11-08 21:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Oleg Nesterov, Linus Torvalds, Mikulas Patocka, Peter Zijlstra,
	Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli,
	Anton Arapov, linux-kernel

On Thu, Nov 08, 2012 at 12:07:00PM -0800, Andrew Morton wrote:
> On Thu, 8 Nov 2012 14:48:49 +0100
> Oleg Nesterov <oleg@redhat.com> wrote:
> 
> > Currently the writer does msleep() plus synchronize_sched() 3 times
> > to acquire/release the semaphore, and during this time the readers
> > are blocked completely. Even if the "write" section was not actually
> > started or if it was already finished.
> > 
> > With this patch down_write/up_write does synchronize_sched() twice
> > and down_read/up_read are still possible during this time, just they
> > use the slow path.
> > 
> > percpu_down_write() first forces the readers to use rw_semaphore and
> > increment the "slow" counter to take the lock for reading, then it
> > takes that rw_semaphore for writing and blocks the readers.
> > 
> > Also. With this patch the code relies on the documented behaviour of
> > synchronize_sched(), it doesn't try to pair synchronize_sched() with
> > barrier.
> > 
> > ...
> >
> >  include/linux/percpu-rwsem.h |   83 +++++------------------------
> >  lib/Makefile                 |    2 +-
> >  lib/percpu-rwsem.c           |  123 ++++++++++++++++++++++++++++++++++++++++++
> 
> The patch also uninlines everything.
> 
> And it didn't export the resulting symbols to modules, so it isn't an
> equivalent.  We can export thing later if needed I guess.
> 
> It adds percpu-rwsem.o to lib-y, so the CONFIG_BLOCK=n kernel will
> avoid including the code altogether, methinks?
> 
> >
> > ...
> >
> > --- /dev/null
> > +++ b/lib/percpu-rwsem.c
> > @@ -0,0 +1,123 @@
> 
> That was nice and terse ;)
> 
> > +#include <linux/percpu-rwsem.h>
> > +#include <linux/rcupdate.h>
> > +#include <linux/sched.h>
> 
> This list is nowhere near sufficient to support this file's
> requirements.  atomic.h, percpu.h, rwsem.h, wait.h, errno.h and plenty
> more.  IOW, if it compiles, it was sheer luck.
> 
> > +int percpu_init_rwsem(struct percpu_rw_semaphore *brw)
> > +{
> > +	brw->fast_read_ctr = alloc_percpu(int);
> > +	if (unlikely(!brw->fast_read_ctr))
> > +		return -ENOMEM;
> > +
> > +	mutex_init(&brw->writer_mutex);
> > +	init_rwsem(&brw->rw_sem);
> > +	atomic_set(&brw->slow_read_ctr, 0);
> > +	init_waitqueue_head(&brw->write_waitq);
> > +	return 0;
> > +}
> > +
> > +void percpu_free_rwsem(struct percpu_rw_semaphore *brw)
> > +{
> > +	free_percpu(brw->fast_read_ctr);
> > +	brw->fast_read_ctr = NULL; /* catch use after free bugs */
> > +}
> > +
> > +static bool update_fast_ctr(struct percpu_rw_semaphore *brw, unsigned int val)
> > +{
> > +	bool success = false;
> > +
> > +	preempt_disable();
> > +	if (likely(!mutex_is_locked(&brw->writer_mutex))) {
> > +		__this_cpu_add(*brw->fast_read_ctr, val);
> > +		success = true;
> > +	}
> > +	preempt_enable();
> > +
> > +	return success;
> > +}
> > +
> > +/*
> > + * Like the normal down_read() this is not recursive, the writer can
> > + * come after the first percpu_down_read() and create the deadlock.
> > + */
> > +void percpu_down_read(struct percpu_rw_semaphore *brw)
> > +{
> > +	if (likely(update_fast_ctr(brw, +1)))
> > +		return;
> > +
> > +	down_read(&brw->rw_sem);
> > +	atomic_inc(&brw->slow_read_ctr);
> > +	up_read(&brw->rw_sem);
> > +}
> > +
> > +void percpu_up_read(struct percpu_rw_semaphore *brw)
> > +{
> > +	if (likely(update_fast_ctr(brw, -1)))
> > +		return;
> > +
> > +	/* false-positive is possible but harmless */
> > +	if (atomic_dec_and_test(&brw->slow_read_ctr))
> > +		wake_up_all(&brw->write_waitq);
> > +}
> > +
> > +static int clear_fast_ctr(struct percpu_rw_semaphore *brw)
> > +{
> > +	unsigned int sum = 0;
> > +	int cpu;
> > +
> > +	for_each_possible_cpu(cpu) {
> > +		sum += per_cpu(*brw->fast_read_ctr, cpu);
> > +		per_cpu(*brw->fast_read_ctr, cpu) = 0;
> > +	}
> > +
> > +	return sum;
> > +}
> > +
> > +/*
> > + * A writer takes ->writer_mutex to exclude other writers and to force the
> > + * readers to switch to the slow mode, note the mutex_is_locked() check in
> > + * update_fast_ctr().
> > + *
> > + * After that the readers can only inc/dec the slow ->slow_read_ctr counter,
> > + * ->fast_read_ctr is stable. Once the writer moves its sum into the slow
> > + * counter it represents the number of active readers.
> > + *
> > + * Finally the writer takes ->rw_sem for writing and blocks the new readers,
> > + * then waits until the slow counter becomes zero.
> > + */
> 
> Some overview of how fast/slow_read_ctr are supposed to work would be
> useful.  This comment seems to assume that the reader already knew
> that.
> 
> > +void percpu_down_write(struct percpu_rw_semaphore *brw)
> > +{
> > +	/* also blocks update_fast_ctr() which checks mutex_is_locked() */
> > +	mutex_lock(&brw->writer_mutex);
> > +
> > +	/*
> > +	 * 1. Ensures mutex_is_locked() is visible to any down_read/up_read
> > +	 *    so that update_fast_ctr() can't succeed.
> > +	 *
> > +	 * 2. Ensures we see the result of every previous this_cpu_add() in
> > +	 *    update_fast_ctr().
> > +	 *
> > +	 * 3. Ensures that if any reader has exited its critical section via
> > +	 *    fast-path, it executes a full memory barrier before we return.
> > +	 */
> > +	synchronize_sched();
> 
> Here's where I get horridly confused.  Your patch completely deRCUifies
> this code, yes?  Yet here we're using an RCU primitive.  And we seem to
> be using it not as an RCU primitive but as a handy thing which happens
> to have desirable side-effects.  But the implementation of
> synchronize_sched() differs considerably according to which rcu
> flavor-of-the-minute you're using.

The trick is that the preempt_disable() call in update_fast_ctr()
acts as an RCU read-side critical section WRT synchronize_sched().

The algorithm would work given rcu_read_lock()/rcu_read_unlock() and
synchronize_rcu() in place of preempt_disable()/preempt_enable() and
synchronize_sched().  The real-time guys would prefer the change
to rcu_read_lock()/rcu_read_unlock() and synchronize_rcu(), now that
you mention it.
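
Purely for illustration -- not a patch, just the shape of that alternative:
the fast path becomes an ordinary RCU read-side critical section and the
writer's synchronize_sched() calls become synchronize_rcu().  It would also
have to use this_cpu_add() rather than __this_cpu_add(), because preemption
is no longer disabled across the increment:

	static bool update_fast_ctr(struct percpu_rw_semaphore *brw, unsigned int val)
	{
		bool success = false;

		rcu_read_lock();
		if (likely(!mutex_is_locked(&brw->writer_mutex))) {
			/* preemptible here, so use the preempt-safe variant */
			this_cpu_add(*brw->fast_read_ctr, val);
			success = true;
		}
		rcu_read_unlock();

		return success;
	}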

Oleg, Mikulas, any reason not to move to rcu_read_lock()/rcu_read_unlock()
and synchronize_rcu()?

							Thanx, Paul

> And part 3 talks about the reader's critical section.  The only
> critical sections I can see on the reader side are already covered by
> mutex_lock() and preempt_diable().
> 
> I get this feeling I don't have clue what's going on here and I think
> I'll just retire hurt now.  If this code isn't as brain damaged as it
> initially appears then please, go easy on us simpletons in the next
> version?
> 
> > +	/* nobody can use fast_read_ctr, move its sum into slow_read_ctr */
> > +	atomic_add(clear_fast_ctr(brw), &brw->slow_read_ctr);
> > +
> > +	/* block the new readers completely */
> > +	down_write(&brw->rw_sem);
> > +
> > +	/* wait for all readers to complete their percpu_up_read() */
> > +	wait_event(brw->write_waitq, !atomic_read(&brw->slow_read_ctr));
> > +}
> > +
> > +void percpu_up_write(struct percpu_rw_semaphore *brw)
> > +{
> > +	/* allow the new readers, but only the slow-path */
> > +	up_write(&brw->rw_sem);
> > +
> > +	/* insert the barrier before the next fast-path in down_read */
> > +	synchronize_sched();
> > +
> > +	mutex_unlock(&brw->writer_mutex);
> > +}
> 


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
  2012-11-08 21:08                                   ` Paul E. McKenney
@ 2012-11-08 23:41                                     ` Mikulas Patocka
  2012-11-09  0:41                                       ` Paul E. McKenney
  0 siblings, 1 reply; 103+ messages in thread
From: Mikulas Patocka @ 2012-11-08 23:41 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Andrew Morton, Oleg Nesterov, Linus Torvalds, Peter Zijlstra,
	Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli,
	Anton Arapov, linux-kernel



On Thu, 8 Nov 2012, Paul E. McKenney wrote:

> On Thu, Nov 08, 2012 at 12:07:00PM -0800, Andrew Morton wrote:
> > On Thu, 8 Nov 2012 14:48:49 +0100
> > Oleg Nesterov <oleg@redhat.com> wrote:
> > 
> > > Currently the writer does msleep() plus synchronize_sched() 3 times
> > > to acquire/release the semaphore, and during this time the readers
> > > are blocked completely. Even if the "write" section was not actually
> > > started or if it was already finished.
> > > 
> > > With this patch down_write/up_write does synchronize_sched() twice
> > > and down_read/up_read are still possible during this time, just they
> > > use the slow path.
> > > 
> > > percpu_down_write() first forces the readers to use rw_semaphore and
> > > increment the "slow" counter to take the lock for reading, then it
> > > takes that rw_semaphore for writing and blocks the readers.
> > > 
> > > Also. With this patch the code relies on the documented behaviour of
> > > synchronize_sched(), it doesn't try to pair synchronize_sched() with
> > > barrier.
> > > 
> > > ...
> > >
> > >  include/linux/percpu-rwsem.h |   83 +++++------------------------
> > >  lib/Makefile                 |    2 +-
> > >  lib/percpu-rwsem.c           |  123 ++++++++++++++++++++++++++++++++++++++++++
> > 
> > The patch also uninlines everything.
> > 
> > And it didn't export the resulting symbols to modules, so it isn't an
> > equivalent.  We can export thing later if needed I guess.
> > 
> > It adds percpu-rwsem.o to lib-y, so the CONFIG_BLOCK=n kernel will
> > avoid including the code altogether, methinks?
> > 
> > >
> > > ...
> > >
> > > --- /dev/null
> > > +++ b/lib/percpu-rwsem.c
> > > @@ -0,0 +1,123 @@
> > 
> > That was nice and terse ;)
> > 
> > > +#include <linux/percpu-rwsem.h>
> > > +#include <linux/rcupdate.h>
> > > +#include <linux/sched.h>
> > 
> > This list is nowhere near sufficient to support this file's
> > requirements.  atomic.h, percpu.h, rwsem.h, wait.h, errno.h and plenty
> > more.  IOW, if it compiles, it was sheer luck.
> > 
> > > +int percpu_init_rwsem(struct percpu_rw_semaphore *brw)
> > > +{
> > > +	brw->fast_read_ctr = alloc_percpu(int);
> > > +	if (unlikely(!brw->fast_read_ctr))
> > > +		return -ENOMEM;
> > > +
> > > +	mutex_init(&brw->writer_mutex);
> > > +	init_rwsem(&brw->rw_sem);
> > > +	atomic_set(&brw->slow_read_ctr, 0);
> > > +	init_waitqueue_head(&brw->write_waitq);
> > > +	return 0;
> > > +}
> > > +
> > > +void percpu_free_rwsem(struct percpu_rw_semaphore *brw)
> > > +{
> > > +	free_percpu(brw->fast_read_ctr);
> > > +	brw->fast_read_ctr = NULL; /* catch use after free bugs */
> > > +}
> > > +
> > > +static bool update_fast_ctr(struct percpu_rw_semaphore *brw, unsigned int val)
> > > +{
> > > +	bool success = false;
> > > +
> > > +	preempt_disable();
> > > +	if (likely(!mutex_is_locked(&brw->writer_mutex))) {
> > > +		__this_cpu_add(*brw->fast_read_ctr, val);
> > > +		success = true;
> > > +	}
> > > +	preempt_enable();
> > > +
> > > +	return success;
> > > +}
> > > +
> > > +/*
> > > + * Like the normal down_read() this is not recursive, the writer can
> > > + * come after the first percpu_down_read() and create the deadlock.
> > > + */
> > > +void percpu_down_read(struct percpu_rw_semaphore *brw)
> > > +{
> > > +	if (likely(update_fast_ctr(brw, +1)))
> > > +		return;
> > > +
> > > +	down_read(&brw->rw_sem);
> > > +	atomic_inc(&brw->slow_read_ctr);
> > > +	up_read(&brw->rw_sem);
> > > +}
> > > +
> > > +void percpu_up_read(struct percpu_rw_semaphore *brw)
> > > +{
> > > +	if (likely(update_fast_ctr(brw, -1)))
> > > +		return;
> > > +
> > > +	/* false-positive is possible but harmless */
> > > +	if (atomic_dec_and_test(&brw->slow_read_ctr))
> > > +		wake_up_all(&brw->write_waitq);
> > > +}
> > > +
> > > +static int clear_fast_ctr(struct percpu_rw_semaphore *brw)
> > > +{
> > > +	unsigned int sum = 0;
> > > +	int cpu;
> > > +
> > > +	for_each_possible_cpu(cpu) {
> > > +		sum += per_cpu(*brw->fast_read_ctr, cpu);
> > > +		per_cpu(*brw->fast_read_ctr, cpu) = 0;
> > > +	}
> > > +
> > > +	return sum;
> > > +}
> > > +
> > > +/*
> > > + * A writer takes ->writer_mutex to exclude other writers and to force the
> > > + * readers to switch to the slow mode, note the mutex_is_locked() check in
> > > + * update_fast_ctr().
> > > + *
> > > + * After that the readers can only inc/dec the slow ->slow_read_ctr counter,
> > > + * ->fast_read_ctr is stable. Once the writer moves its sum into the slow
> > > + * counter it represents the number of active readers.
> > > + *
> > > + * Finally the writer takes ->rw_sem for writing and blocks the new readers,
> > > + * then waits until the slow counter becomes zero.
> > > + */
> > 
> > Some overview of how fast/slow_read_ctr are supposed to work would be
> > useful.  This comment seems to assume that the reader already knew
> > that.
> > 
> > > +void percpu_down_write(struct percpu_rw_semaphore *brw)
> > > +{
> > > +	/* also blocks update_fast_ctr() which checks mutex_is_locked() */
> > > +	mutex_lock(&brw->writer_mutex);
> > > +
> > > +	/*
> > > +	 * 1. Ensures mutex_is_locked() is visible to any down_read/up_read
> > > +	 *    so that update_fast_ctr() can't succeed.
> > > +	 *
> > > +	 * 2. Ensures we see the result of every previous this_cpu_add() in
> > > +	 *    update_fast_ctr().
> > > +	 *
> > > +	 * 3. Ensures that if any reader has exited its critical section via
> > > +	 *    fast-path, it executes a full memory barrier before we return.
> > > +	 */
> > > +	synchronize_sched();
> > 
> > Here's where I get horridly confused.  Your patch completely deRCUifies
> > this code, yes?  Yet here we're using an RCU primitive.  And we seem to
> > be using it not as an RCU primitive but as a handy thing which happens
> > to have desirable side-effects.  But the implementation of
> > synchronize_sched() differs considerably according to which rcu
> > flavor-of-the-minute you're using.
> 
> The trick is that the preempt_disable() call in update_fast_ctr()
> acts as an RCU read-side critical section WRT synchronize_sched().
> 
> The algorithm would work given rcu_read_lock()/rcu_read_unlock() and
> synchronize_rcu() in place of preempt_disable()/preempt_enable() and
> synchronize_sched().  The real-time guys would prefer the change
> to rcu_read_lock()/rcu_read_unlock() and synchronize_rcu(), now that
> you mention it.
> 
> Oleg, Mikulas, any reason not to move to rcu_read_lock()/rcu_read_unlock()
> and synchronize_rcu()?
> 
> 							Thanx, Paul

preempt_disable/preempt_enable is faster than
rcu_read_lock/rcu_read_unlock on preemptible kernels.

Regarding real-time response - the region covered by
preempt_disable/preempt_enable contains only a few instructions (one test of
mutex_is_locked and one increment of a percpu variable), so it isn't any
threat to real-time response. There are plenty of longer regions in the
kernel that are executed with interrupts or preemption disabled.

Mikulas

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
  2012-11-08 23:41                                     ` Mikulas Patocka
@ 2012-11-09  0:41                                       ` Paul E. McKenney
  2012-11-09  3:23                                         ` Paul E. McKenney
  0 siblings, 1 reply; 103+ messages in thread
From: Paul E. McKenney @ 2012-11-09  0:41 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Andrew Morton, Oleg Nesterov, Linus Torvalds, Peter Zijlstra,
	Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli,
	Anton Arapov, linux-kernel

On Thu, Nov 08, 2012 at 06:41:10PM -0500, Mikulas Patocka wrote:
> 
> 
> On Thu, 8 Nov 2012, Paul E. McKenney wrote:
> 
> > On Thu, Nov 08, 2012 at 12:07:00PM -0800, Andrew Morton wrote:
> > > On Thu, 8 Nov 2012 14:48:49 +0100
> > > Oleg Nesterov <oleg@redhat.com> wrote:
> > > 
> > > > Currently the writer does msleep() plus synchronize_sched() 3 times
> > > > to acquire/release the semaphore, and during this time the readers
> > > > are blocked completely. Even if the "write" section was not actually
> > > > started or if it was already finished.
> > > > 
> > > > With this patch down_write/up_write does synchronize_sched() twice
> > > > and down_read/up_read are still possible during this time, just they
> > > > use the slow path.
> > > > 
> > > > percpu_down_write() first forces the readers to use rw_semaphore and
> > > > increment the "slow" counter to take the lock for reading, then it
> > > > takes that rw_semaphore for writing and blocks the readers.
> > > > 
> > > > Also. With this patch the code relies on the documented behaviour of
> > > > synchronize_sched(), it doesn't try to pair synchronize_sched() with
> > > > barrier.
> > > > 
> > > > ...
> > > >
> > > >  include/linux/percpu-rwsem.h |   83 +++++------------------------
> > > >  lib/Makefile                 |    2 +-
> > > >  lib/percpu-rwsem.c           |  123 ++++++++++++++++++++++++++++++++++++++++++
> > > 
> > > The patch also uninlines everything.
> > > 
> > > And it didn't export the resulting symbols to modules, so it isn't an
> > > equivalent.  We can export thing later if needed I guess.
> > > 
> > > It adds percpu-rwsem.o to lib-y, so the CONFIG_BLOCK=n kernel will
> > > avoid including the code altogether, methinks?
> > > 
> > > >
> > > > ...
> > > >
> > > > --- /dev/null
> > > > +++ b/lib/percpu-rwsem.c
> > > > @@ -0,0 +1,123 @@
> > > 
> > > That was nice and terse ;)
> > > 
> > > > +#include <linux/percpu-rwsem.h>
> > > > +#include <linux/rcupdate.h>
> > > > +#include <linux/sched.h>
> > > 
> > > This list is nowhere near sufficient to support this file's
> > > requirements.  atomic.h, percpu.h, rwsem.h, wait.h, errno.h and plenty
> > > more.  IOW, if it compiles, it was sheer luck.
> > > 
> > > > +int percpu_init_rwsem(struct percpu_rw_semaphore *brw)
> > > > +{
> > > > +	brw->fast_read_ctr = alloc_percpu(int);
> > > > +	if (unlikely(!brw->fast_read_ctr))
> > > > +		return -ENOMEM;
> > > > +
> > > > +	mutex_init(&brw->writer_mutex);
> > > > +	init_rwsem(&brw->rw_sem);
> > > > +	atomic_set(&brw->slow_read_ctr, 0);
> > > > +	init_waitqueue_head(&brw->write_waitq);
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +void percpu_free_rwsem(struct percpu_rw_semaphore *brw)
> > > > +{
> > > > +	free_percpu(brw->fast_read_ctr);
> > > > +	brw->fast_read_ctr = NULL; /* catch use after free bugs */
> > > > +}
> > > > +
> > > > +static bool update_fast_ctr(struct percpu_rw_semaphore *brw, unsigned int val)
> > > > +{
> > > > +	bool success = false;
> > > > +
> > > > +	preempt_disable();
> > > > +	if (likely(!mutex_is_locked(&brw->writer_mutex))) {
> > > > +		__this_cpu_add(*brw->fast_read_ctr, val);
> > > > +		success = true;
> > > > +	}
> > > > +	preempt_enable();
> > > > +
> > > > +	return success;
> > > > +}
> > > > +
> > > > +/*
> > > > + * Like the normal down_read() this is not recursive, the writer can
> > > > + * come after the first percpu_down_read() and create the deadlock.
> > > > + */
> > > > +void percpu_down_read(struct percpu_rw_semaphore *brw)
> > > > +{
> > > > +	if (likely(update_fast_ctr(brw, +1)))
> > > > +		return;
> > > > +
> > > > +	down_read(&brw->rw_sem);
> > > > +	atomic_inc(&brw->slow_read_ctr);
> > > > +	up_read(&brw->rw_sem);
> > > > +}
> > > > +
> > > > +void percpu_up_read(struct percpu_rw_semaphore *brw)
> > > > +{
> > > > +	if (likely(update_fast_ctr(brw, -1)))
> > > > +		return;
> > > > +
> > > > +	/* false-positive is possible but harmless */
> > > > +	if (atomic_dec_and_test(&brw->slow_read_ctr))
> > > > +		wake_up_all(&brw->write_waitq);
> > > > +}
> > > > +
> > > > +static int clear_fast_ctr(struct percpu_rw_semaphore *brw)
> > > > +{
> > > > +	unsigned int sum = 0;
> > > > +	int cpu;
> > > > +
> > > > +	for_each_possible_cpu(cpu) {
> > > > +		sum += per_cpu(*brw->fast_read_ctr, cpu);
> > > > +		per_cpu(*brw->fast_read_ctr, cpu) = 0;
> > > > +	}
> > > > +
> > > > +	return sum;
> > > > +}
> > > > +
> > > > +/*
> > > > + * A writer takes ->writer_mutex to exclude other writers and to force the
> > > > + * readers to switch to the slow mode, note the mutex_is_locked() check in
> > > > + * update_fast_ctr().
> > > > + *
> > > > + * After that the readers can only inc/dec the slow ->slow_read_ctr counter,
> > > > + * ->fast_read_ctr is stable. Once the writer moves its sum into the slow
> > > > + * counter it represents the number of active readers.
> > > > + *
> > > > + * Finally the writer takes ->rw_sem for writing and blocks the new readers,
> > > > + * then waits until the slow counter becomes zero.
> > > > + */
> > > 
> > > Some overview of how fast/slow_read_ctr are supposed to work would be
> > > useful.  This comment seems to assume that the reader already knew
> > > that.
> > > 
> > > > +void percpu_down_write(struct percpu_rw_semaphore *brw)
> > > > +{
> > > > +	/* also blocks update_fast_ctr() which checks mutex_is_locked() */
> > > > +	mutex_lock(&brw->writer_mutex);
> > > > +
> > > > +	/*
> > > > +	 * 1. Ensures mutex_is_locked() is visible to any down_read/up_read
> > > > +	 *    so that update_fast_ctr() can't succeed.
> > > > +	 *
> > > > +	 * 2. Ensures we see the result of every previous this_cpu_add() in
> > > > +	 *    update_fast_ctr().
> > > > +	 *
> > > > +	 * 3. Ensures that if any reader has exited its critical section via
> > > > +	 *    fast-path, it executes a full memory barrier before we return.
> > > > +	 */
> > > > +	synchronize_sched();
> > > 
> > > Here's where I get horridly confused.  Your patch completely deRCUifies
> > > this code, yes?  Yet here we're using an RCU primitive.  And we seem to
> > > be using it not as an RCU primitive but as a handy thing which happens
> > > to have desirable side-effects.  But the implementation of
> > > synchronize_sched() differs considerably according to which rcu
> > > flavor-of-the-minute you're using.
> > 
> > The trick is that the preempt_disable() call in update_fast_ctr()
> > acts as an RCU read-side critical section WRT synchronize_sched().
> > 
> > The algorithm would work given rcu_read_lock()/rcu_read_unlock() and
> > synchronize_rcu() in place of preempt_disable()/preempt_enable() and
> > synchronize_sched().  The real-time guys would prefer the change
> > to rcu_read_lock()/rcu_read_unlock() and synchronize_rcu(), now that
> > you mention it.
> > 
> > Oleg, Mikulas, any reason not to move to rcu_read_lock()/rcu_read_unlock()
> > and synchronize_rcu()?
> 
> preempt_disable/preempt_enable is faster than 
> rcu_read_lock/rcu_read_unlock for preemptive kernels.

Significantly faster in this case?  Can you measure the difference
from a user-mode test?

Hmmm.  I have been avoiding moving the preemptible-RCU state from
task_struct to thread_info, but if the difference really matters,
perhaps that needs to be done.

> Regarding real-time response - the region blocked with 
> preempt_disable/preempt_enable contains a few instructions (one test for 
> mutex_is_locked and one increment of percpu variable), so it isn't any 
> threat to real time response. There are plenty of longer regions in the 
> kernel that are executed with interrupts or preemption disabled.

Careful.  The real-time guys might take the same every-little-bit approach
to latency that you seem to be taking for CPU cycles.  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
  2012-11-09  0:41                                       ` Paul E. McKenney
@ 2012-11-09  3:23                                         ` Paul E. McKenney
  2012-11-09 16:35                                           ` Oleg Nesterov
  0 siblings, 1 reply; 103+ messages in thread
From: Paul E. McKenney @ 2012-11-09  3:23 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Andrew Morton, Oleg Nesterov, Linus Torvalds, Peter Zijlstra,
	Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli,
	Anton Arapov, linux-kernel

On Thu, Nov 08, 2012 at 04:41:36PM -0800, Paul E. McKenney wrote:
> On Thu, Nov 08, 2012 at 06:41:10PM -0500, Mikulas Patocka wrote:
> > 
> > 
> > On Thu, 8 Nov 2012, Paul E. McKenney wrote:
> > 
> > > On Thu, Nov 08, 2012 at 12:07:00PM -0800, Andrew Morton wrote:
> > > > On Thu, 8 Nov 2012 14:48:49 +0100
> > > > Oleg Nesterov <oleg@redhat.com> wrote:
> > > > 
> > > > > Currently the writer does msleep() plus synchronize_sched() 3 times
> > > > > to acquire/release the semaphore, and during this time the readers
> > > > > are blocked completely. Even if the "write" section was not actually
> > > > > started or if it was already finished.
> > > > > 
> > > > > With this patch down_write/up_write does synchronize_sched() twice
> > > > > and down_read/up_read are still possible during this time, just they
> > > > > use the slow path.
> > > > > 
> > > > > percpu_down_write() first forces the readers to use rw_semaphore and
> > > > > increment the "slow" counter to take the lock for reading, then it
> > > > > takes that rw_semaphore for writing and blocks the readers.
> > > > > 
> > > > > Also. With this patch the code relies on the documented behaviour of
> > > > > synchronize_sched(), it doesn't try to pair synchronize_sched() with
> > > > > barrier.
> > > > > 
> > > > > ...
> > > > >
> > > > >  include/linux/percpu-rwsem.h |   83 +++++------------------------
> > > > >  lib/Makefile                 |    2 +-
> > > > >  lib/percpu-rwsem.c           |  123 ++++++++++++++++++++++++++++++++++++++++++
> > > > 
> > > > The patch also uninlines everything.
> > > > 
> > > > And it didn't export the resulting symbols to modules, so it isn't an
> > > > equivalent.  We can export thing later if needed I guess.
> > > > 
> > > > It adds percpu-rwsem.o to lib-y, so the CONFIG_BLOCK=n kernel will
> > > > avoid including the code altogether, methinks?
> > > > 
> > > > >
> > > > > ...
> > > > >
> > > > > --- /dev/null
> > > > > +++ b/lib/percpu-rwsem.c
> > > > > @@ -0,0 +1,123 @@
> > > > 
> > > > That was nice and terse ;)
> > > > 
> > > > > +#include <linux/percpu-rwsem.h>
> > > > > +#include <linux/rcupdate.h>
> > > > > +#include <linux/sched.h>
> > > > 
> > > > This list is nowhere near sufficient to support this file's
> > > > requirements.  atomic.h, percpu.h, rwsem.h, wait.h, errno.h and plenty
> > > > more.  IOW, if it compiles, it was sheer luck.
> > > > 
> > > > > +int percpu_init_rwsem(struct percpu_rw_semaphore *brw)
> > > > > +{
> > > > > +	brw->fast_read_ctr = alloc_percpu(int);
> > > > > +	if (unlikely(!brw->fast_read_ctr))
> > > > > +		return -ENOMEM;
> > > > > +
> > > > > +	mutex_init(&brw->writer_mutex);
> > > > > +	init_rwsem(&brw->rw_sem);
> > > > > +	atomic_set(&brw->slow_read_ctr, 0);
> > > > > +	init_waitqueue_head(&brw->write_waitq);
> > > > > +	return 0;
> > > > > +}
> > > > > +
> > > > > +void percpu_free_rwsem(struct percpu_rw_semaphore *brw)
> > > > > +{
> > > > > +	free_percpu(brw->fast_read_ctr);
> > > > > +	brw->fast_read_ctr = NULL; /* catch use after free bugs */
> > > > > +}
> > > > > +
> > > > > +static bool update_fast_ctr(struct percpu_rw_semaphore *brw, unsigned int val)
> > > > > +{
> > > > > +	bool success = false;
> > > > > +
> > > > > +	preempt_disable();
> > > > > +	if (likely(!mutex_is_locked(&brw->writer_mutex))) {
> > > > > +		__this_cpu_add(*brw->fast_read_ctr, val);
> > > > > +		success = true;
> > > > > +	}
> > > > > +	preempt_enable();
> > > > > +
> > > > > +	return success;
> > > > > +}
> > > > > +
> > > > > +/*
> > > > > + * Like the normal down_read() this is not recursive, the writer can
> > > > > + * come after the first percpu_down_read() and create the deadlock.
> > > > > + */
> > > > > +void percpu_down_read(struct percpu_rw_semaphore *brw)
> > > > > +{
> > > > > +	if (likely(update_fast_ctr(brw, +1)))
> > > > > +		return;
> > > > > +
> > > > > +	down_read(&brw->rw_sem);
> > > > > +	atomic_inc(&brw->slow_read_ctr);
> > > > > +	up_read(&brw->rw_sem);
> > > > > +}
> > > > > +
> > > > > +void percpu_up_read(struct percpu_rw_semaphore *brw)
> > > > > +{
> > > > > +	if (likely(update_fast_ctr(brw, -1)))
> > > > > +		return;
> > > > > +
> > > > > +	/* false-positive is possible but harmless */
> > > > > +	if (atomic_dec_and_test(&brw->slow_read_ctr))
> > > > > +		wake_up_all(&brw->write_waitq);
> > > > > +}
> > > > > +
> > > > > +static int clear_fast_ctr(struct percpu_rw_semaphore *brw)
> > > > > +{
> > > > > +	unsigned int sum = 0;
> > > > > +	int cpu;
> > > > > +
> > > > > +	for_each_possible_cpu(cpu) {
> > > > > +		sum += per_cpu(*brw->fast_read_ctr, cpu);
> > > > > +		per_cpu(*brw->fast_read_ctr, cpu) = 0;
> > > > > +	}
> > > > > +
> > > > > +	return sum;
> > > > > +}
> > > > > +
> > > > > +/*
> > > > > + * A writer takes ->writer_mutex to exclude other writers and to force the
> > > > > + * readers to switch to the slow mode, note the mutex_is_locked() check in
> > > > > + * update_fast_ctr().
> > > > > + *
> > > > > + * After that the readers can only inc/dec the slow ->slow_read_ctr counter,
> > > > > + * ->fast_read_ctr is stable. Once the writer moves its sum into the slow
> > > > > + * counter it represents the number of active readers.
> > > > > + *
> > > > > + * Finally the writer takes ->rw_sem for writing and blocks the new readers,
> > > > > + * then waits until the slow counter becomes zero.
> > > > > + */
> > > > 
> > > > Some overview of how fast/slow_read_ctr are supposed to work would be
> > > > useful.  This comment seems to assume that the reader already knew
> > > > that.
> > > > 
> > > > > +void percpu_down_write(struct percpu_rw_semaphore *brw)
> > > > > +{
> > > > > +	/* also blocks update_fast_ctr() which checks mutex_is_locked() */
> > > > > +	mutex_lock(&brw->writer_mutex);
> > > > > +
> > > > > +	/*
> > > > > +	 * 1. Ensures mutex_is_locked() is visible to any down_read/up_read
> > > > > +	 *    so that update_fast_ctr() can't succeed.
> > > > > +	 *
> > > > > +	 * 2. Ensures we see the result of every previous this_cpu_add() in
> > > > > +	 *    update_fast_ctr().
> > > > > +	 *
> > > > > +	 * 3. Ensures that if any reader has exited its critical section via
> > > > > +	 *    fast-path, it executes a full memory barrier before we return.
> > > > > +	 */
> > > > > +	synchronize_sched();
> > > > 
> > > > Here's where I get horridly confused.  Your patch completely deRCUifies
> > > > this code, yes?  Yet here we're using an RCU primitive.  And we seem to
> > > > be using it not as an RCU primitive but as a handy thing which happens
> > > > to have desirable side-effects.  But the implementation of
> > > > synchronize_sched() differs considerably according to which rcu
> > > > flavor-of-the-minute you're using.
> > > 
> > > The trick is that the preempt_disable() call in update_fast_ctr()
> > > acts as an RCU read-side critical section WRT synchronize_sched().
> > > 
> > > The algorithm would work given rcu_read_lock()/rcu_read_unlock() and
> > > synchronize_rcu() in place of preempt_disable()/preempt_enable() and
> > > synchronize_sched().  The real-time guys would prefer the change
> > > to rcu_read_lock()/rcu_read_unlock() and synchronize_rcu(), now that
> > > you mention it.
> > > 
> > > Oleg, Mikulas, any reason not to move to rcu_read_lock()/rcu_read_unlock()
> > > and synchronize_rcu()?
> > 
> > preempt_disable/preempt_enable is faster than 
> > rcu_read_lock/rcu_read_unlock for preemptive kernels.
> 
> Significantly faster in this case?  Can you measure the difference
> from a user-mode test?
> 
> Hmmm.  I have been avoiding moving the preemptible-RCU state from
> task_struct to thread_info, but if the difference really matters,
> perhaps that needs to be done.

Actually, the fact that __this_cpu_add() will malfunction on some
architectures if preemption is not disabled seems a more compelling
reason to keep preempt_disable()/preempt_enable() than any performance
improvement.  ;-)

							Thanx, Paul

> > Regarding real-time response - the region blocked with 
> > preempt_disable/preempt_enable contains a few instructions (one test for 
> > mutex_is_locked and one increment of percpu variable), so it isn't any 
> > threat to real time response. There are plenty of longer regions in the 
> > kernel that are executed with interrupts or preemption disabled.
> 
> Careful.  The real-time guys might take the same every-little-bit approach
> to latency that you seem to be taking for CPU cycles.  ;-)
> 
> 							Thanx, Paul
> 


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
  2012-11-08 20:07                                 ` Andrew Morton
  2012-11-08 21:08                                   ` Paul E. McKenney
@ 2012-11-09 12:47                                   ` Mikulas Patocka
  2012-11-09 15:46                                   ` Oleg Nesterov
  2012-11-11 18:27                                   ` [PATCH -mm] percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessarily.fix Oleg Nesterov
  3 siblings, 0 replies; 103+ messages in thread
From: Mikulas Patocka @ 2012-11-09 12:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Oleg Nesterov, Linus Torvalds, Paul E. McKenney, Peter Zijlstra,
	Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli,
	Anton Arapov, linux-kernel



On Thu, 8 Nov 2012, Andrew Morton wrote:

> On Thu, 8 Nov 2012 14:48:49 +0100
> Oleg Nesterov <oleg@redhat.com> wrote:
> 
> > Currently the writer does msleep() plus synchronize_sched() 3 times
> > to acquire/release the semaphore, and during this time the readers
> > are blocked completely. Even if the "write" section was not actually
> > started or if it was already finished.
> > 
> > With this patch down_write/up_write does synchronize_sched() twice
> > and down_read/up_read are still possible during this time, just they
> > use the slow path.
> > 
> > percpu_down_write() first forces the readers to use rw_semaphore and
> > increment the "slow" counter to take the lock for reading, then it
> > takes that rw_semaphore for writing and blocks the readers.
> > 
> > Also. With this patch the code relies on the documented behaviour of
> > synchronize_sched(), it doesn't try to pair synchronize_sched() with
> > barrier.
> > 
> > ...
> >
> >  include/linux/percpu-rwsem.h |   83 +++++------------------------
> >  lib/Makefile                 |    2 +-
> >  lib/percpu-rwsem.c           |  123 ++++++++++++++++++++++++++++++++++++++++++
> 
> The patch also uninlines everything.
> 
> And it didn't export the resulting symbols to modules, so it isn't an
> equivalent.  We can export thing later if needed I guess.
> 
> It adds percpu-rwsem.o to lib-y, so the CONFIG_BLOCK=n kernel will
> avoid including the code altogether, methinks?

If you want to use percpu-rwsem only for block devices then you can drop 
Oleg's patch altogether. Oleg's optimizations are useless for the block 
device use case (contention between readers and writers is very rare, and it 
doesn't matter if readers are blocked in case of contention). I suppose 
that Oleg made the optimizations because he wants to use percpu-rwsem for 
something else - if not, you can drop the patch and revert to the previous 
version, which is simpler.

Mikulas

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
  2012-11-08 20:07                                 ` Andrew Morton
  2012-11-08 21:08                                   ` Paul E. McKenney
  2012-11-09 12:47                                   ` Mikulas Patocka
@ 2012-11-09 15:46                                   ` Oleg Nesterov
  2012-11-09 17:01                                     ` Paul E. McKenney
  2012-11-11 18:27                                   ` [PATCH -mm] percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessarily.fix Oleg Nesterov
  3 siblings, 1 reply; 103+ messages in thread
From: Oleg Nesterov @ 2012-11-09 15:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Paul E. McKenney, Mikulas Patocka,
	Peter Zijlstra, Ingo Molnar, Srikar Dronamraju,
	Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel

On 11/08, Andrew Morton wrote:
>
> On Thu, 8 Nov 2012 14:48:49 +0100
> Oleg Nesterov <oleg@redhat.com> wrote:
>
> >
> >  include/linux/percpu-rwsem.h |   83 +++++------------------------
> >  lib/Makefile                 |    2 +-
> >  lib/percpu-rwsem.c           |  123 ++++++++++++++++++++++++++++++++++++++++++
>
> The patch also uninlines everything.
>
> And it didn't export the resulting symbols to modules, so it isn't an
> equivalent.  We can export thing later if needed I guess.

Yes, currently it is only used by block_dev.c

> It adds percpu-rwsem.o to lib-y, so the CONFIG_BLOCK=n kernel will
> avoid including the code altogether, methinks?

I am going to add another user (uprobes); this was my motivation for
this patch. And perhaps it will have more users.

But I agree, CONFIG_PERCPU_RWSEM makes sense, at least for now; I'll send
the patch.

> > +#include <linux/percpu-rwsem.h>
> > +#include <linux/rcupdate.h>
> > +#include <linux/sched.h>
>
> This list is nowhere near sufficient to support this file's
> requirements.  atomic.h, percpu.h, rwsem.h, wait.h, errno.h and plenty
> more.  IOW, if it compiles, it was sheer luck.

OK, thanks, I'll send
percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessarily.fix

> > +/*
> > + * A writer takes ->writer_mutex to exclude other writers and to force the
> > + * readers to switch to the slow mode, note the mutex_is_locked() check in
> > + * update_fast_ctr().
> > + *
> > + * After that the readers can only inc/dec the slow ->slow_read_ctr counter,
> > + * ->fast_read_ctr is stable. Once the writer moves its sum into the slow
> > + * counter it represents the number of active readers.
> > + *
> > + * Finally the writer takes ->rw_sem for writing and blocks the new readers,
> > + * then waits until the slow counter becomes zero.
> > + */
>
> Some overview of how fast/slow_read_ctr are supposed to work would be
> useful.  This comment seems to assume that the reader already knew
> that.

I hate to say this, but I'll try to update this comment too ;)

> > +void percpu_down_write(struct percpu_rw_semaphore *brw)
> > +{
> > +	/* also blocks update_fast_ctr() which checks mutex_is_locked() */
> > +	mutex_lock(&brw->writer_mutex);
> > +
> > +	/*
> > +	 * 1. Ensures mutex_is_locked() is visible to any down_read/up_read
> > +	 *    so that update_fast_ctr() can't succeed.
> > +	 *
> > +	 * 2. Ensures we see the result of every previous this_cpu_add() in
> > +	 *    update_fast_ctr().
> > +	 *
> > +	 * 3. Ensures that if any reader has exited its critical section via
> > +	 *    fast-path, it executes a full memory barrier before we return.
> > +	 */
> > +	synchronize_sched();
>
> Here's where I get horridly confused.  Your patch completely deRCUifies
> this code, yes?  Yet here we're using an RCU primitive.  And we seem to
> be using it not as an RCU primitive but as a handy thing which happens
> to have desirable side-effects.  But the implementation of
> synchronize_sched() differs considerably according to which rcu
> flavor-of-the-minute you're using.

It is documented that synchronize_sched() should play well with
preempt_disable/enable. From the comment:

	Note that preempt_disable(),
	local_irq_disable(), and so on may be used in place of
	rcu_read_lock_sched().

But I guess this needs more discussion, I see other emails in this
thread...

> And part 3 talks about the reader's critical section.  The only
> critical sections I can see on the reader side are already covered by
> mutex_lock() and preempt_disable().

Yes, but we need to ensure that if we take the lock for writing, we
should see all memory modifications done under down_read/up_read().

IOW. Suppose that the reader does

	percpu_down_read();
	STORE;
	percpu_up_read();	// no barriers in the fast path

The writer should see the result of that STORE under percpu_down_write().

Part 3 tries to say that at this point we should already see the result,
so we should not worry about acquire/release semantics.
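
To spell out the R_W case, a rough sketch of the pairing (the struct and
function names here are illustrative only, not the exact lib/percpu-rwsem.c
code; allocation, the mutex_is_locked() check and the slow-path fallback
are omitted):

	#include <linux/percpu.h>
	#include <linux/preempt.h>
	#include <linux/mutex.h>
	#include <linux/rcupdate.h>

	struct brw_sketch {
		unsigned int __percpu	*fast_read_ctr;	/* alloc_percpu() at init */
		struct mutex		writer_mutex;
		/* rw_sem, slow_read_ctr, write_waitq left out */
	};

	static void reader(struct brw_sketch *brw, int *obj)
	{
		preempt_disable();			/* down_read() fast path */
		__this_cpu_inc(*brw->fast_read_ctr);
		preempt_enable();

		*obj = 1;				/* the STORE above */

		preempt_disable();			/* up_read() fast path, no mb */
		__this_cpu_dec(*brw->fast_read_ctr);
		preempt_enable();
	}

	static void writer(struct brw_sketch *brw, int *obj)
	{
		mutex_lock(&brw->writer_mutex);		/* disables the fast path */
		synchronize_sched();
		/*
		 * Every preempt-disabled region started before this point has
		 * finished, so if the reader already did its fast-path up_read
		 * the "*obj = 1" store is visible here without extra barriers.
		 */
		/* ... sum the per-cpu counters, down_write(&rw_sem), wait ... */
	}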

> If this code isn't as brain damaged as it
> initially appears then please,

I hope ;)

> go easy on us simpletons in the next
> version?

Well, I'll try to update the comments... but the code is simple, I do
not think I can simplify it more. The nontrivial part is the barriers,
but this is always nontrivial.

Contrary, I am going to try to add some complications later, so that
it can have more users. In particular, I think it can replace
get_online_cpus/cpu_hotplug_begin, just we need
percpu_down_write_but_dont_deadlock_with_recursive_readers().

Oleg.


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
  2012-11-09  3:23                                         ` Paul E. McKenney
@ 2012-11-09 16:35                                           ` Oleg Nesterov
  2012-11-09 16:59                                             ` Paul E. McKenney
  0 siblings, 1 reply; 103+ messages in thread
From: Oleg Nesterov @ 2012-11-09 16:35 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Mikulas Patocka, Andrew Morton, Linus Torvalds, Peter Zijlstra,
	Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli,
	Anton Arapov, linux-kernel

On 11/08, Paul E. McKenney wrote:
>
> On Thu, Nov 08, 2012 at 04:41:36PM -0800, Paul E. McKenney wrote:
> > On Thu, Nov 08, 2012 at 06:41:10PM -0500, Mikulas Patocka wrote:
> > >
> > > On Thu, 8 Nov 2012, Paul E. McKenney wrote:
> > >
> > > > On Thu, Nov 08, 2012 at 12:07:00PM -0800, Andrew Morton wrote:
> > > > > On Thu, 8 Nov 2012 14:48:49 +0100
> > > > > Oleg Nesterov <oleg@redhat.com> wrote:
> > > > >
> > > >
> > > > The algorithm would work given rcu_read_lock()/rcu_read_unlock() and
> > > > synchronize_rcu() in place of preempt_disable()/preempt_enable() and
> > > > synchronize_sched().  The real-time guys would prefer the change
> > > > to rcu_read_lock()/rcu_read_unlock() and synchronize_rcu(), now that
> > > > you mention it.
> > > >
> > > > Oleg, Mikulas, any reason not to move to rcu_read_lock()/rcu_read_unlock()
> > > > and synchronize_rcu()?
> > >
> > > preempt_disable/preempt_enable is faster than
> > > rcu_read_lock/rcu_read_unlock for preemptive kernels.

Yes, I chose preempt_disable() because it is the fastest/simplest
primitive and the critical section is really tiny.

But:

> > Significantly faster in this case?  Can you measure the difference
> > from a user-mode test?

I do not think rcu_read_lock() or rcu_read_lock_sched() can actually
make a measurable difference.

> Actually, the fact that __this_cpu_add() will malfunction on some
> architectures if preemption is not disabled seems a more compelling
> reason to keep preempt_enable() than any performance improvement.  ;-)

Yes, but this_cpu_add() should work.

> > Careful.  The real-time guys might take the same every-little-bit approach
> > to latency that you seem to be taking for CPU cycles.  ;-)

Understand...


So I simply do not know. Please tell me if you think it would be
better to use rcu_read_lock/synchronize_rcu or rcu_read_lock_sched,
and I'll send the patch.

Oleg.


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
  2012-11-09 16:35                                           ` Oleg Nesterov
@ 2012-11-09 16:59                                             ` Paul E. McKenney
  0 siblings, 0 replies; 103+ messages in thread
From: Paul E. McKenney @ 2012-11-09 16:59 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Mikulas Patocka, Andrew Morton, Linus Torvalds, Peter Zijlstra,
	Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli,
	Anton Arapov, linux-kernel

On Fri, Nov 09, 2012 at 05:35:38PM +0100, Oleg Nesterov wrote:
> On 11/08, Paul E. McKenney wrote:
> >
> > On Thu, Nov 08, 2012 at 04:41:36PM -0800, Paul E. McKenney wrote:
> > > On Thu, Nov 08, 2012 at 06:41:10PM -0500, Mikulas Patocka wrote:
> > > >
> > > > On Thu, 8 Nov 2012, Paul E. McKenney wrote:
> > > >
> > > > > On Thu, Nov 08, 2012 at 12:07:00PM -0800, Andrew Morton wrote:
> > > > > > On Thu, 8 Nov 2012 14:48:49 +0100
> > > > > > Oleg Nesterov <oleg@redhat.com> wrote:
> > > > > >
> > > > >
> > > > > The algorithm would work given rcu_read_lock()/rcu_read_unlock() and
> > > > > synchronize_rcu() in place of preempt_disable()/preempt_enable() and
> > > > > synchronize_sched().  The real-time guys would prefer the change
> > > > > to rcu_read_lock()/rcu_read_unlock() and synchronize_rcu(), now that
> > > > > you mention it.
> > > > >
> > > > > Oleg, Mikulas, any reason not to move to rcu_read_lock()/rcu_read_unlock()
> > > > > and synchronize_rcu()?
> > > >
> > > > preempt_disable/preempt_enable is faster than
> > > > rcu_read_lock/rcu_read_unlock for preemptive kernels.
> 
> Yes, I chose preempt_disable() because it is the fastest/simplest
> primitive and the critical section is really tiny.
> 
> But:
> 
> > > Significantly faster in this case?  Can you measure the difference
> > > from a user-mode test?
> 
> I do not think rcu_read_lock() or rcu_read_lock_sched() can actually
> make a measurable difference.
> 
> > Actually, the fact that __this_cpu_add() will malfunction on some
> > architectures if preemption is not disabled seems a more compelling
> > reason to keep preempt_enable() than any performance improvement.  ;-)
> 
> Yes, but this_cpu_add() should work.

Indeed!  But this_cpu_add() just does the preempt_enable() under the
covers, so not much difference from a latency viewpoint.

> > > Careful.  The real-time guys might take the same every-little-bit approach
> > > to latency that you seem to be taking for CPU cycles.  ;-)
> 
> Understand...
> 
> So I simply do not know. Please tell me if you think it would be
> better to use rcu_read_lock/synchronize_rcu or rcu_read_lock_sched,
> and I'll send the patch.

I doubt if it makes a measurable difference for either throughput or
latency.  One could argue that rcu_read_lock() would be better for
readability, but making sure that the preempt_disable() is clearly
commented as starting an RCU-sched read-side critical section would
be just as good.

So I am OK with the current preempt_disable() approach.
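
(For reference, both spellings below mark the same RCU-sched read-side
critical section; rcu_read_lock_sched() is essentially preempt_disable()
plus the lockdep/debug annotations. This is only a sketch, with a made-up
per-cpu counter standing in for the real one:)

	#include <linux/percpu.h>
	#include <linux/preempt.h>
	#include <linux/rcupdate.h>

	static DEFINE_PER_CPU(int, fast_read_ctr);

	static void reader_enter_v1(int val)
	{
		preempt_disable();		/* starts an RCU-sched read-side section */
		__this_cpu_add(fast_read_ctr, val);
		preempt_enable();		/* ends it */
	}

	static void reader_enter_v2(int val)
	{
		rcu_read_lock_sched();		/* same section, spelled in RCU terms */
		__this_cpu_add(fast_read_ctr, val);
		rcu_read_unlock_sched();
	}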

							Thanx, Paul


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
  2012-11-09 15:46                                   ` Oleg Nesterov
@ 2012-11-09 17:01                                     ` Paul E. McKenney
  2012-11-09 18:10                                       ` Oleg Nesterov
  0 siblings, 1 reply; 103+ messages in thread
From: Paul E. McKenney @ 2012-11-09 17:01 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Linus Torvalds, Mikulas Patocka, Peter Zijlstra,
	Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli,
	Anton Arapov, linux-kernel

On Fri, Nov 09, 2012 at 04:46:56PM +0100, Oleg Nesterov wrote:
> On 11/08, Andrew Morton wrote:
> >
> > On Thu, 8 Nov 2012 14:48:49 +0100
> > Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > >
> > >  include/linux/percpu-rwsem.h |   83 +++++------------------------
> > >  lib/Makefile                 |    2 +-
> > >  lib/percpu-rwsem.c           |  123 ++++++++++++++++++++++++++++++++++++++++++
> >
> > The patch also uninlines everything.
> >
> > And it didn't export the resulting symbols to modules, so it isn't an
> > equivalent.  We can export things later if needed I guess.
> 
> Yes, currently it is only used by block_dev.c
> 
> > It adds percpu-rwsem.o to lib-y, so the CONFIG_BLOCK=n kernel will
> > avoid including the code altogether, methinks?
> 
> I am going to add another user (uprobes), this was my motivation for
> this patch. And perhaps it will have more users.
> 
> But I agree, CONFIG_PERCPU_RWSEM makes sense at least now, I'll send
> the patch.
> 
> > > +#include <linux/percpu-rwsem.h>
> > > +#include <linux/rcupdate.h>
> > > +#include <linux/sched.h>
> >
> > This list is nowhere near sufficient to support this file's
> > requirements.  atomic.h, percpu.h, rwsem.h, wait.h, errno.h and plenty
> > more.  IOW, if it compiles, it was sheer luck.
> 
> OK, thanks, I'll send
> percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessarily.fix
> 
> > > +/*
> > > + * A writer takes ->writer_mutex to exclude other writers and to force the
> > > + * readers to switch to the slow mode, note the mutex_is_locked() check in
> > > + * update_fast_ctr().
> > > + *
> > > + * After that the readers can only inc/dec the slow ->slow_read_ctr counter,
> > > + * ->fast_read_ctr is stable. Once the writer moves its sum into the slow
> > > + * counter it represents the number of active readers.
> > > + *
> > > + * Finally the writer takes ->rw_sem for writing and blocks the new readers,
> > > + * then waits until the slow counter becomes zero.
> > > + */
> >
> > Some overview of how fast/slow_read_ctr are supposed to work would be
> > useful.  This comment seems to assume that the reader already knew
> > that.
> 
> I hate to say this, but I'll try to update this comment too ;)
> 
> > > +void percpu_down_write(struct percpu_rw_semaphore *brw)
> > > +{
> > > +	/* also blocks update_fast_ctr() which checks mutex_is_locked() */
> > > +	mutex_lock(&brw->writer_mutex);
> > > +
> > > +	/*
> > > +	 * 1. Ensures mutex_is_locked() is visible to any down_read/up_read
> > > +	 *    so that update_fast_ctr() can't succeed.
> > > +	 *
> > > +	 * 2. Ensures we see the result of every previous this_cpu_add() in
> > > +	 *    update_fast_ctr().
> > > +	 *
> > > +	 * 3. Ensures that if any reader has exited its critical section via
> > > +	 *    fast-path, it executes a full memory barrier before we return.
> > > +	 */
> > > +	synchronize_sched();
> >
> > Here's where I get horridly confused.  Your patch completely deRCUifies
> > this code, yes?  Yet here we're using an RCU primitive.  And we seem to
> > be using it not as an RCU primitive but as a handy thing which happens
> > to have desirable side-effects.  But the implementation of
> > synchronize_sched() differs considerably according to which rcu
> > flavor-of-the-minute you're using.
> 
> It is documented that synchronize_sched() should play well with
> preempt_disable/enable. From the comment:
> 
> 	Note that preempt_disable(),
> 	local_irq_disable(), and so on may be used in place of
> 	rcu_read_lock_sched().
> 
> But I guess this needs more discussion, I see other emails in this
> thread...
> 
> > And part 3 talks about the reader's critical section.  The only
> > critical sections I can see on the reader side are already covered by
> > mutex_lock() and preempt_disable().
> 
> Yes, but we need to ensure that if we take the lock for writing, we
> should see all memory modifications done under down_read/up_read().
> 
> IOW. Suppose that the reader does
> 
> 	percpu_down_read();
> 	STORE;
> 	percpu_up_read();	// no barriers in the fast path
> 
> The writer should see the result of that STORE under percpu_down_write().
> 
> Part 3 tries to say that at this point we should already see the result,
> so we should not worry about acquire/release semantics.
> 
> > If this code isn't as brain damaged as it
> > initially appears then please,
> 
> I hope ;)
> 
> > go easy on us simpletons in the next
> > version?
> 
> Well, I'll try to update the comments... but the code is simple, I do
> not think I can simplify it more. The nontrivial part is the barriers,
> but this is always nontrivial.
> 
> Contrary, I am going to try to add some complications later, so that
> it can have more users. In particular, I think it can replace
> get_online_cpus/cpu_hotplug_begin, just we need
> percpu_down_write_but_dont_deadlock_with_recursive_readers().

I must confess that I am a bit concerned about possible scalability
bottlenecks in the current get_online_cpus(), so +1 from me on this one.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
  2012-11-09 17:01                                     ` Paul E. McKenney
@ 2012-11-09 18:10                                       ` Oleg Nesterov
  2012-11-09 18:19                                         ` Oleg Nesterov
  2012-11-10  0:55                                         ` Paul E. McKenney
  0 siblings, 2 replies; 103+ messages in thread
From: Oleg Nesterov @ 2012-11-09 18:10 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Andrew Morton, Linus Torvalds, Mikulas Patocka, Peter Zijlstra,
	Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli,
	Anton Arapov, linux-kernel

On 11/09, Paul E. McKenney wrote:
>
> On Fri, Nov 09, 2012 at 04:46:56PM +0100, Oleg Nesterov wrote:
> > Contrary, I am going to try to add some complications later, so that
> > it can have more users. In particular, I think it can replace
> > get_online_cpus/cpu_hotplug_begin, just we need
> > percpu_down_write_but_dont_deadlock_with_recursive_readers().
>
> I must confess that I am a bit concerned about possible scalability
> bottlenecks in the current get_online_cpus(), so +1 from me on this one.

OK, thanks...

And btw percpu_down_write_but_dont_deadlock_with_recursive_readers() is
trivial, just it needs down_write(rw_sem) "inside" wait_event(), not
before. But I'm afraid I will never manage to write the comments ;)

	static bool xxx(brw)
	{
		down_write(&brw->rw_sem);
		if (!atomic_read(&brw->slow_read_ctr))
			return true;

		up_write(&brw->rw_sem);
		return false;
	}

	static void __percpu_down_write(struct percpu_rw_semaphore *brw, bool recursive_readers)
	{
		mutex_lock(&brw->writer_mutex);

		synchronize_sched();

		atomic_add(clear_fast_ctr(brw), &brw->slow_read_ctr);

		if (recursive_readers)  {
			wait_event(brw->write_waitq, xxx(brw));
		} else {
			down_write(&brw->rw_sem);

			wait_event(brw->write_waitq, !atomic_read(&brw->slow_read_ctr));
		}
	}

Of course, cpu.c still needs .active_writer to allow get_online_cpus()
under cpu_hotplug_begin(), but this is simple.

But first we should do other changes, I think. IMHO we should not do
synchronize_sched() under mutex_lock() and this will add (a bit) more
complications. We will see.

Oleg.


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
  2012-11-09 18:10                                       ` Oleg Nesterov
@ 2012-11-09 18:19                                         ` Oleg Nesterov
  2012-11-10  0:55                                         ` Paul E. McKenney
  1 sibling, 0 replies; 103+ messages in thread
From: Oleg Nesterov @ 2012-11-09 18:19 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Andrew Morton, Linus Torvalds, Mikulas Patocka, Peter Zijlstra,
	Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli,
	Anton Arapov, linux-kernel

On 11/09, Oleg Nesterov wrote:
>
> 	static bool xxx(brw)
> 	{
> 		down_write(&brw->rw_sem);
> 		if (!atomic_read(&brw->slow_read_ctr))
>			return true;

I meant, try_to_down_write(). Otherwise this can obviously deadlock.
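
IOW, the helper used inside wait_event() would look roughly like this
(still just a sketch, keeping the "xxx" name from above):

	static bool xxx(struct percpu_rw_semaphore *brw)
	{
		if (!down_write_trylock(&brw->rw_sem))
			return false;		/* e.g. a recursive reader holds rw_sem */

		if (!atomic_read(&brw->slow_read_ctr))
			return true;		/* got rw_sem and no readers are left */

		up_write(&brw->rw_sem);		/* readers still active, retry later */
		return false;
	}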

Oleg.


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
  2012-11-09 18:10                                       ` Oleg Nesterov
  2012-11-09 18:19                                         ` Oleg Nesterov
@ 2012-11-10  0:55                                         ` Paul E. McKenney
  2012-11-11 15:45                                           ` Oleg Nesterov
  1 sibling, 1 reply; 103+ messages in thread
From: Paul E. McKenney @ 2012-11-10  0:55 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Linus Torvalds, Mikulas Patocka, Peter Zijlstra,
	Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli,
	Anton Arapov, linux-kernel

On Fri, Nov 09, 2012 at 07:10:48PM +0100, Oleg Nesterov wrote:
> On 11/09, Paul E. McKenney wrote:
> >
> > On Fri, Nov 09, 2012 at 04:46:56PM +0100, Oleg Nesterov wrote:
> > > Contrary, I am going to try to add some complications later, so that
> > > it can have more users. In particular, I think it can replace
> > > get_online_cpus/cpu_hotplug_begin, just we need
> > > percpu_down_write_but_dont_deadlock_with_recursive_readers().
> >
> > I must confess that I am a bit concerned about possible scalability
> > bottlenecks in the current get_online_cpus(), so +1 from me on this one.
> 
> OK, thanks...
> 
> And btw percpu_down_write_but_dont_deadlock_with_recursive_readers() is
> trivial, just it needs down_write(rw_sem) "inside" wait_event(), not
> before. But I'm afraid I will never manage to write the comments ;)
> 
> 	static bool xxx(brw)
> 	{
> 		down_write(&brw->rw_sem);

		down_write_trylock()

As you noted in your later email.  Presumably you return false if
the attempt to acquire it fails.

> 		if (!atomic_read(&brw->slow_read_ctr))
> 			return true;
> 
> 		up_write(&brw->rw_sem);
> 		return false;
> 	}
> 
> 	static void __percpu_down_write(struct percpu_rw_semaphore *brw, bool recursive_readers)
> 	{
> 		mutex_lock(&brw->writer_mutex);
> 
> 		synchronize_sched();
> 
> 		atomic_add(clear_fast_ctr(brw), &brw->slow_read_ctr);
> 
> 		if (recursive_readers)  {
> 			wait_event(brw->write_waitq, xxx(brw));

I see what you mean about acquiring brw->rw_sem inside of wait_event().

Cute trick!

The "recursive_readers" is a global initialization-time thing, right?

> 		} else {
> 			down_write(&brw->rw_sem);
> 
> 			wait_event(brw->write_waitq, !atomic_read(&brw->slow_read_ctr));
> 		}
> 	}

Looks like it should work, and would perform and scale nicely even
if we end up having to greatly increase the number of calls to
get_online_cpus().

> Of course, cpu.c still needs .active_writer to allow get_online_cpus()
> under cpu_hotplug_begin(), but this is simple.

Yep, same check as now.

> But first we should do other changes, I think. IMHO we should not do
> synchronize_sched() under mutex_lock() and this will add (a bit) more
> complications. We will see.

Indeed, that does put considerable delay on the writers.  There is always
synchronize_sched_expedited(), I suppose.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
  2012-11-10  0:55                                         ` Paul E. McKenney
@ 2012-11-11 15:45                                           ` Oleg Nesterov
  2012-11-12 18:38                                             ` Paul E. McKenney
  0 siblings, 1 reply; 103+ messages in thread
From: Oleg Nesterov @ 2012-11-11 15:45 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Andrew Morton, Linus Torvalds, Mikulas Patocka, Peter Zijlstra,
	Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli,
	Anton Arapov, linux-kernel

On 11/09, Paul E. McKenney wrote:
>
> On Fri, Nov 09, 2012 at 07:10:48PM +0100, Oleg Nesterov wrote:
> >
> > 	static bool xxx(brw)
> > 	{
> > 		down_write(&brw->rw_sem);
>
> 		down_write_trylock()
>
> As you noted in your later email.  Presumably you return false if
> the attempt to acquire it fails.

Yes, yes, thanks.

> > But first we should do other changes, I think. IMHO we should not do
> > synchronize_sched() under mutex_lock() and this will add (a bit) more
> > complications. We will see.
>
> Indeed, that does put considerable delay on the writers.  There is always
> synchronize_sched_expedited(), I suppose.

I am not sure about synchronize_sched_expedited() (at least unconditionally),
but: only the 1st down_write() needs synchronize_, and up_write() does not
need to sleep in synchronize_ at all.

To simplify, let's ignore the fact that the writers need to serialize with
each other. IOW, the pseudo-code below is obviously deadly wrong and racy,
just to illustrate the idea.

1. We remove brw->writer_mutex and add "atomic_t writers_ctr".

   update_fast_ctr() uses atomic_read(brw->writers_ctr) == 0 instead
   of !mutex_is_locked().

2. down_write() does

	if (atomic_add_return(brw->writers_ctr) == 1) {
		// first writer
		synchronize_sched();
		...
	} else {
		... XXX: wait for percpu_up_write() from the first writer ...
	}

3. up_write() does

	if (atomic_dec_unless_one(brw->writers_ctr)) {
		... wake up XXX writers above ...
		return;
	} else {
		// the last writer
		call_rcu_sched( func => { atomic_dec(brw->writers_ctr) } );
	}

Once again, this all is racy, but hopefully the idea is clear:

	- down_write(brw) sleeps in synchronize_sched() only if brw
	  has already switched back to fast-path-mode

	- up_write() never sleeps in synchronize_sched(), it uses
	  call_rcu_sched() or wakes up the next writer.

Of course I am not sure this is all worth the trouble; this should be discussed.
(And, cough, I'd like to add the multi-writers mode which I'm afraid nobody
will like.) But I am not going to even try to do this until the current patch
is applied; I need it to fix the bug in uprobes and I think the current code
is "good enough". These changes can't help to speed up the readers, and the
writers are slow/rare anyway.

Thanks!

Oleg.


^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH -mm] percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessarily.fix
  2012-11-08 20:07                                 ` Andrew Morton
                                                     ` (2 preceding siblings ...)
  2012-11-09 15:46                                   ` Oleg Nesterov
@ 2012-11-11 18:27                                   ` Oleg Nesterov
  2012-11-12 18:31                                     ` Paul E. McKenney
  2012-11-16 23:22                                     ` Andrew Morton
  3 siblings, 2 replies; 103+ messages in thread
From: Oleg Nesterov @ 2012-11-11 18:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Paul E. McKenney, Mikulas Patocka,
	Peter Zijlstra, Ingo Molnar, Srikar Dronamraju,
	Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel

More #includes and more comments, no changes in code.

To remind, once/if I am sure you agree with this patch I'll send 2 additional
and simple patches:

	1. lockdep annotations

	2. CONFIG_PERCPU_RWSEM

It seems that we can make many more improvements to a) speed up the writers
and b) make percpu_rw_semaphore more useful, but not right now.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 lib/percpu-rwsem.c |   35 +++++++++++++++++++++++++++++++++--
 1 files changed, 33 insertions(+), 2 deletions(-)

diff --git a/lib/percpu-rwsem.c b/lib/percpu-rwsem.c
index 0e3bc0f..02bd157 100644
--- a/lib/percpu-rwsem.c
+++ b/lib/percpu-rwsem.c
@@ -1,6 +1,11 @@
+#include <linux/mutex.h>
+#include <linux/rwsem.h>
+#include <linux/percpu.h>
+#include <linux/wait.h>
 #include <linux/percpu-rwsem.h>
 #include <linux/rcupdate.h>
 #include <linux/sched.h>
+#include <linux/errno.h>
 
 int percpu_init_rwsem(struct percpu_rw_semaphore *brw)
 {
@@ -21,6 +26,29 @@ void percpu_free_rwsem(struct percpu_rw_semaphore *brw)
 	brw->fast_read_ctr = NULL; /* catch use after free bugs */
 }
 
+/*
+ * This is the fast-path for down_read/up_read, it only needs to ensure
+ * there is no pending writer (!mutex_is_locked() check) and inc/dec the
+ * fast per-cpu counter. The writer uses synchronize_sched() to serialize
+ * with the preempt-disabled section below.
+ *
+ * The nontrivial part is that we should guarantee acquire/release semantics
+ * in case when
+ *
+ *	R_W: down_write() comes after up_read(), the writer should see all
+ *	     changes done by the reader
+ * or
+ *	W_R: down_read() comes after up_write(), the reader should see all
+ *	     changes done by the writer
+ *
+ * If this helper fails the callers rely on the normal rw_semaphore and
+ * atomic_dec_and_test(), so in this case we have the necessary barriers.
+ *
+ * But if it succeeds we do not have any barriers, mutex_is_locked() or
+ * __this_cpu_add() below can be reordered with any LOAD/STORE done by the
+ * reader inside the critical section. See the comments in down_write and
+ * up_write below.
+ */
 static bool update_fast_ctr(struct percpu_rw_semaphore *brw, unsigned int val)
 {
 	bool success = false;
@@ -98,6 +126,7 @@ void percpu_down_write(struct percpu_rw_semaphore *brw)
 	 *
 	 * 3. Ensures that if any reader has exited its critical section via
 	 *    fast-path, it executes a full memory barrier before we return.
+	 *    See R_W case in the comment above update_fast_ctr().
 	 */
 	synchronize_sched();
 
@@ -116,8 +145,10 @@ void percpu_up_write(struct percpu_rw_semaphore *brw)
 	/* allow the new readers, but only the slow-path */
 	up_write(&brw->rw_sem);
 
-	/* insert the barrier before the next fast-path in down_read */
+	/*
+	 * Insert the barrier before the next fast-path in down_read,
+	 * see W_R case in the comment above update_fast_ctr().
+	 */
 	synchronize_sched();
-
 	mutex_unlock(&brw->writer_mutex);
 }
-- 
1.5.5.1



^ permalink raw reply related	[flat|nested] 103+ messages in thread

* Re: [PATCH -mm] percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessarily.fix
  2012-11-11 18:27                                   ` [PATCH -mm] percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessarily.fix Oleg Nesterov
@ 2012-11-12 18:31                                     ` Paul E. McKenney
  2012-11-16 23:22                                     ` Andrew Morton
  1 sibling, 0 replies; 103+ messages in thread
From: Paul E. McKenney @ 2012-11-12 18:31 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Linus Torvalds, Mikulas Patocka, Peter Zijlstra,
	Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli,
	Anton Arapov, linux-kernel

On Sun, Nov 11, 2012 at 07:27:44PM +0100, Oleg Nesterov wrote:
> More #includes and more comments, no changes in code.
> 
> To remind, once/if I am sure you agree with this patch I'll send 2 additional
> and simple patches:
> 
> 	1. lockdep annotations
> 
> 	2. CONFIG_PERCPU_RWSEM
> 
> It seems that we can make many more improvements to a) speed up the writers
> and b) make percpu_rw_semaphore more useful, but not right now.
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>

Looks good to me!

							Thanx, Paul

> ---
>  lib/percpu-rwsem.c |   35 +++++++++++++++++++++++++++++++++--
>  1 files changed, 33 insertions(+), 2 deletions(-)
> 
> diff --git a/lib/percpu-rwsem.c b/lib/percpu-rwsem.c
> index 0e3bc0f..02bd157 100644
> --- a/lib/percpu-rwsem.c
> +++ b/lib/percpu-rwsem.c
> @@ -1,6 +1,11 @@
> +#include <linux/mutex.h>
> +#include <linux/rwsem.h>
> +#include <linux/percpu.h>
> +#include <linux/wait.h>
>  #include <linux/percpu-rwsem.h>
>  #include <linux/rcupdate.h>
>  #include <linux/sched.h>
> +#include <linux/errno.h>
> 
>  int percpu_init_rwsem(struct percpu_rw_semaphore *brw)
>  {
> @@ -21,6 +26,29 @@ void percpu_free_rwsem(struct percpu_rw_semaphore *brw)
>  	brw->fast_read_ctr = NULL; /* catch use after free bugs */
>  }
> 
> +/*
> + * This is the fast-path for down_read/up_read, it only needs to ensure
> + * there is no pending writer (!mutex_is_locked() check) and inc/dec the
> + * fast per-cpu counter. The writer uses synchronize_sched() to serialize
> + * with the preempt-disabled section below.
> + *
> + * The nontrivial part is that we should guarantee acquire/release semantics
> + * in case when
> + *
> + *	R_W: down_write() comes after up_read(), the writer should see all
> + *	     changes done by the reader
> + * or
> + *	W_R: down_read() comes after up_write(), the reader should see all
> + *	     changes done by the writer
> + *
> + * If this helper fails the callers rely on the normal rw_semaphore and
> + * atomic_dec_and_test(), so in this case we have the necessary barriers.
> + *
> + * But if it succeeds we do not have any barriers, mutex_is_locked() or
> + * __this_cpu_add() below can be reordered with any LOAD/STORE done by the
> + * reader inside the critical section. See the comments in down_write and
> + * up_write below.
> + */
>  static bool update_fast_ctr(struct percpu_rw_semaphore *brw, unsigned int val)
>  {
>  	bool success = false;
> @@ -98,6 +126,7 @@ void percpu_down_write(struct percpu_rw_semaphore *brw)
>  	 *
>  	 * 3. Ensures that if any reader has exited its critical section via
>  	 *    fast-path, it executes a full memory barrier before we return.
> +	 *    See R_W case in the comment above update_fast_ctr().
>  	 */
>  	synchronize_sched();
> 
> @@ -116,8 +145,10 @@ void percpu_up_write(struct percpu_rw_semaphore *brw)
>  	/* allow the new readers, but only the slow-path */
>  	up_write(&brw->rw_sem);
> 
> -	/* insert the barrier before the next fast-path in down_read */
> +	/*
> +	 * Insert the barrier before the next fast-path in down_read,
> +	 * see W_R case in the comment above update_fast_ctr().
> +	 */
>  	synchronize_sched();
> -
>  	mutex_unlock(&brw->writer_mutex);
>  }
> -- 
> 1.5.5.1
> 
> 


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
  2012-11-11 15:45                                           ` Oleg Nesterov
@ 2012-11-12 18:38                                             ` Paul E. McKenney
  0 siblings, 0 replies; 103+ messages in thread
From: Paul E. McKenney @ 2012-11-12 18:38 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Linus Torvalds, Mikulas Patocka, Peter Zijlstra,
	Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli,
	Anton Arapov, linux-kernel

On Sun, Nov 11, 2012 at 04:45:09PM +0100, Oleg Nesterov wrote:
> On 11/09, Paul E. McKenney wrote:
> >
> > On Fri, Nov 09, 2012 at 07:10:48PM +0100, Oleg Nesterov wrote:
> > >
> > > 	static bool xxx(brw)
> > > 	{
> > > 		down_write(&brw->rw_sem);
> >
> > 		down_write_trylock()
> >
> > As you noted in your later email.  Presumably you return false if
> > the attempt to acquire it fails.
> 
> Yes, yes, thanks.
> 
> > > But first we should do other changes, I think. IMHO we should not do
> > > synchronize_sched() under mutex_lock() and this will add (a bit) more
> > > complications. We will see.
> >
> > Indeed, that does put considerable delay on the writers.  There is always
> > synchronize_sched_expedited(), I suppose.
> 
> I am not sure about synchronize_sched_expedited() (at least unconditionally),
> but: only the 1st down_write() needs synchronize_, and up_write() does not
> need to sleep in synchronize_ at all.
> 
> To simplify, let's ignore the fact that the writers need to serialize with
> each other. IOW, the pseudo-code below is obviously deadly wrong and racy,
> just to illustrate the idea.
> 
> 1. We remove brw->writer_mutex and add "atomic_t writers_ctr".
> 
>    update_fast_ctr() uses atomic_read(brw->writers_ctr) == 0 instead
>    of !mutex_is_locked().
> 
> 2. down_write() does
> 
> 	if (atomic_add_return(brw->writers_ctr) == 1) {
> 		// first writer
> 		synchronize_sched();
> 		...
> 	} else {
> 		... XXX: wait for percpu_up_write() from the first writer ...
> 	}
> 
> 3. up_write() does
> 
> 	if (atomic_dec_unless_one(brw->writers_ctr)) {
> 		... wake up XXX writers above ...
> 		return;
> 	} else {
> 		// the last writer
> 		call_rcu_sched( func => { atomic_dec(brw->writers_ctr) } );
> 	}

Agreed, an asynchronous callback can be used to switch the readers
back onto the fastpath.  Of course, as you say, getting it all working
will require some care.  ;-)

> Once again, this all is racy, but hopefully the idea is clear:
> 
> 	- down_write(brw) sleeps in synchronize_sched() only if brw
> 	  has already switched back to fast-path-mode
> 
> 	- up_write() never sleeps in synchronize_sched(), it uses
> 	  call_rcu_sched() or wakes up the next writer.
> 
> Of course I am not sure this is all worth the trouble; this should be discussed.
> (And, cough, I'd like to add the multi-writers mode which I'm afraid nobody
> will like.) But I am not going to even try to do this until the current patch
> is applied; I need it to fix the bug in uprobes and I think the current code
> is "good enough". These changes can't help to speed up the readers, and the
> writers are slow/rare anyway.

Probably best to wait for multi-writers until there is a measurable need,
to be sure!  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH -mm] percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessarily.fix
  2012-11-11 18:27                                   ` [PATCH -mm] percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessarily.fix Oleg Nesterov
  2012-11-12 18:31                                     ` Paul E. McKenney
@ 2012-11-16 23:22                                     ` Andrew Morton
  2012-11-18 19:32                                       ` Oleg Nesterov
  1 sibling, 1 reply; 103+ messages in thread
From: Andrew Morton @ 2012-11-16 23:22 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Paul E. McKenney, Mikulas Patocka,
	Peter Zijlstra, Ingo Molnar, Srikar Dronamraju,
	Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel

On Sun, 11 Nov 2012 19:27:44 +0100
Oleg Nesterov <oleg@redhat.com> wrote:

>  lib/percpu-rwsem.c |   35 +++++++++++++++++++++++++++++++++--

y'know, this looks like a great pile of useless bloat for single-CPU
machines.  Maybe add a CONFIG_SMP=n variant which simply calls the
regular rwsem operations?



^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH -mm] percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessarily.fix
  2012-11-16 23:22                                     ` Andrew Morton
@ 2012-11-18 19:32                                       ` Oleg Nesterov
  0 siblings, 0 replies; 103+ messages in thread
From: Oleg Nesterov @ 2012-11-18 19:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Paul E. McKenney, Mikulas Patocka,
	Peter Zijlstra, Ingo Molnar, Srikar Dronamraju,
	Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel

On 11/16, Andrew Morton wrote:
>
> On Sun, 11 Nov 2012 19:27:44 +0100
> Oleg Nesterov <oleg@redhat.com> wrote:
>
> >  lib/percpu-rwsem.c |   35 +++++++++++++++++++++++++++++++++--
>
> y'know, this looks like a great pile of useless bloat for single-CPU
> machines.  Maybe add a CONFIG_SMP=n variant which simply calls the
> regular rwsem operations?

Yes, I thought about this and probably I'll send the patch...

But note that the regular down_read() won't actually be faster if
there is no writer, and it doesn't allow adding other features.

I'll try to think, perhaps it would be enough to add a couple of
"ifdef CONFIG_SMP" into this code, say, to avoid __percpu.

Oleg.


^ permalink raw reply	[flat|nested] 103+ messages in thread

end of thread, other threads:[~2012-11-18 19:32 UTC | newest]

Thread overview: 103+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-10-15 19:09 [RFC PATCH 0/2] uprobes: register/unregister can race with fork Oleg Nesterov
2012-10-15 19:10 ` [PATCH 1/2] brw_mutex: big read-write mutex Oleg Nesterov
2012-10-15 23:28   ` Paul E. McKenney
2012-10-16 15:56     ` Oleg Nesterov
2012-10-16 18:58       ` Paul E. McKenney
2012-10-17 16:37         ` Oleg Nesterov
2012-10-17 22:28           ` Paul E. McKenney
2012-10-16 19:56   ` Linus Torvalds
2012-10-17 16:59     ` Oleg Nesterov
2012-10-17 22:44       ` Paul E. McKenney
2012-10-18 16:24         ` Oleg Nesterov
2012-10-18 16:38           ` Paul E. McKenney
2012-10-18 17:57             ` Oleg Nesterov
2012-10-18 19:28               ` Mikulas Patocka
2012-10-19 12:38                 ` Peter Zijlstra
2012-10-19 15:32                   ` Mikulas Patocka
2012-10-19 17:40                     ` Peter Zijlstra
2012-10-19 17:57                       ` Oleg Nesterov
2012-10-19 22:54                       ` Mikulas Patocka
2012-10-24  3:08                         ` Dave Chinner
2012-10-25 14:09                           ` Mikulas Patocka
2012-10-25 23:40                             ` Dave Chinner
2012-10-26 12:06                               ` Oleg Nesterov
2012-10-26 13:22                                 ` Mikulas Patocka
2012-10-26 14:12                                   ` Oleg Nesterov
2012-10-26 15:23                                     ` mark_files_ro && sb_end_write Oleg Nesterov
2012-10-26 16:09                                     ` [PATCH 1/2] brw_mutex: big read-write mutex Mikulas Patocka
2012-10-19 17:49                     ` Oleg Nesterov
2012-10-22 23:09                       ` Mikulas Patocka
2012-10-23 15:12                         ` Oleg Nesterov
2012-10-19 19:28               ` Paul E. McKenney
2012-10-22 23:36                 ` [PATCH 0/2] fix and improvements for percpu-rw-semaphores (was: brw_mutex: big read-write mutex) Mikulas Patocka
2012-10-22 23:37                   ` [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers Mikulas Patocka
2012-10-22 23:39                     ` [PATCH 2/2] percpu-rw-semaphores: use rcu_read_lock_sched Mikulas Patocka
2012-10-24 16:16                       ` Paul E. McKenney
2012-10-24 17:18                         ` Oleg Nesterov
2012-10-24 18:20                           ` Paul E. McKenney
2012-10-24 18:43                             ` Oleg Nesterov
2012-10-24 19:43                               ` Paul E. McKenney
2012-10-25 14:54                         ` Mikulas Patocka
2012-10-25 15:07                           ` Paul E. McKenney
2012-10-25 16:15                             ` Mikulas Patocka
2012-10-23 16:59                     ` [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers Oleg Nesterov
2012-10-23 18:05                       ` Paul E. McKenney
2012-10-23 18:27                         ` Oleg Nesterov
2012-10-23 18:41                         ` Oleg Nesterov
2012-10-23 20:29                           ` Paul E. McKenney
2012-10-23 20:32                             ` Paul E. McKenney
2012-10-23 21:39                               ` Mikulas Patocka
2012-10-24 16:23                                 ` Paul E. McKenney
2012-10-24 20:22                                   ` Mikulas Patocka
2012-10-24 20:36                                     ` Paul E. McKenney
2012-10-24 20:44                                       ` Mikulas Patocka
2012-10-24 23:57                                         ` Paul E. McKenney
2012-10-25 12:39                                           ` Paul E. McKenney
2012-10-25 13:48                                           ` Mikulas Patocka
2012-10-23 19:23                       ` Oleg Nesterov
2012-10-23 20:45                         ` Peter Zijlstra
2012-10-23 20:57                         ` Peter Zijlstra
2012-10-24 15:11                           ` Oleg Nesterov
2012-10-23 21:26                         ` Mikulas Patocka
2012-10-23 20:32                     ` Peter Zijlstra
2012-10-30 18:48                   ` [PATCH 0/2] fix and improvements for percpu-rw-semaphores (was: brw_mutex: big read-write mutex) Oleg Nesterov
2012-10-31 19:41                     ` [PATCH 0/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily Oleg Nesterov
2012-10-31 19:41                       ` [PATCH 1/1] " Oleg Nesterov
2012-11-01 15:10                         ` Linus Torvalds
2012-11-01 15:34                           ` Oleg Nesterov
2012-11-02 18:06                           ` [PATCH v2 0/1] " Oleg Nesterov
2012-11-02 18:06                             ` [PATCH v2 1/1] " Oleg Nesterov
2012-11-07 17:04                               ` [PATCH v3 " Mikulas Patocka
2012-11-07 17:47                                 ` Oleg Nesterov
2012-11-07 19:17                                   ` Mikulas Patocka
2012-11-08 13:42                                     ` Oleg Nesterov
2012-11-08  1:23                                 ` Paul E. McKenney
2012-11-08  1:16                               ` [PATCH v2 " Paul E. McKenney
2012-11-08 13:33                                 ` Oleg Nesterov
2012-11-08 16:27                                   ` Paul E. McKenney
2012-11-08 13:48                             ` [PATCH RESEND v2 0/1] " Oleg Nesterov
2012-11-08 13:48                               ` [PATCH RESEND v2 1/1] " Oleg Nesterov
2012-11-08 20:07                                 ` Andrew Morton
2012-11-08 21:08                                   ` Paul E. McKenney
2012-11-08 23:41                                     ` Mikulas Patocka
2012-11-09  0:41                                       ` Paul E. McKenney
2012-11-09  3:23                                         ` Paul E. McKenney
2012-11-09 16:35                                           ` Oleg Nesterov
2012-11-09 16:59                                             ` Paul E. McKenney
2012-11-09 12:47                                   ` Mikulas Patocka
2012-11-09 15:46                                   ` Oleg Nesterov
2012-11-09 17:01                                     ` Paul E. McKenney
2012-11-09 18:10                                       ` Oleg Nesterov
2012-11-09 18:19                                         ` Oleg Nesterov
2012-11-10  0:55                                         ` Paul E. McKenney
2012-11-11 15:45                                           ` Oleg Nesterov
2012-11-12 18:38                                             ` Paul E. McKenney
2012-11-11 18:27                                   ` [PATCH -mm] percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessarily.fix Oleg Nesterov
2012-11-12 18:31                                     ` Paul E. McKenney
2012-11-16 23:22                                     ` Andrew Morton
2012-11-18 19:32                                       ` Oleg Nesterov
2012-11-01 15:43                         ` [PATCH 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily Paul E. McKenney
2012-11-01 18:33                           ` Oleg Nesterov
2012-11-02 16:18                             ` Oleg Nesterov
2012-10-15 19:10 ` [PATCH 2/2] uprobes: Use brw_mutex to fix register/unregister vs dup_mmap() race Oleg Nesterov
2012-10-18  7:03   ` Srikar Dronamraju
