* Re: [PATCH 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
@ 2012-11-04  8:47 George Spelvin
  2012-11-04 15:52 ` Oleg Nesterov
  0 siblings, 1 reply; 8+ messages in thread
From: George Spelvin @ 2012-11-04  8:47 UTC (permalink / raw)
  To: oleg, torvalds; +Cc: linux, linux-kernel

Grand poo-bah Linus wrote:
> Now, I doubt you'll find an architecture or C compiler where this will
> actually ever make a difference, but the fact remains that you
> shouldn't use signed integers for counters like this. You should use
> unsigned, and you should rely on the well-defined modulo-2**n
> semantics.

Actually, this is another C standard undefined case that recent versions of
GCC exploit for optimization.

When using signed integers, GCC's optimizer assumes that, if b > 0,
then a + b > a.

For example, the loop:
	for (i = 1; i; i++)
		/* Code */
will never terminate!  Feed the following to gcc -O2 and see for yourself:

extern void foo(int x);

void bar(void)
{
	int i;
	for (i = 0; i >= 0; i++)
		foo(i);
}

Here's what I get:

        .file   "test.c"
        .text
        .p2align 4,,15
        .globl  bar
        .type   bar, @function
bar:
.LFB0:
        .cfi_startproc
        pushl   %ebx
        .cfi_def_cfa_offset 8
        .cfi_offset 3, -8
        movl    $1, %ebx
        subl    $24, %esp
        .cfi_def_cfa_offset 32
        .p2align 4,,7
        .p2align 3
.L2:
        movl    %ebx, (%esp)
        addl    $1, %ebx
        call    foo
        jmp     .L2
        .cfi_endproc
.LFE0:
        .size   bar, .-bar
        .ident  "GCC: (Debian 4.7.2-4) 4.7.2"
        .section        .note.GNU-stack,"",@progbits

Notice the lack of a test in the "jmp .L2" loop.

It can even handle more complicated cases like:

void bar(int j)
{
        int i = 0;
        do {
                foo(i);
                if (j >= 0)
                        i += j;
                else
                        i -= j;
        } while (i >= 0);
}

... which gcc -O3 neatly splits into two infinite loops, as if I had written:

void bar(int j)
{
        int i = 0;
	if (j >= 0)
		for (; ; i += j)
			foo(i);
	else
		for (; ; i -= j)
			foo(i);
}
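
For contrast, the unsigned version of the first loop is well defined and
does terminate (after UINT_MAX calls), because wraparound to 0 is
guaranteed.  A sketch (ufoo() and ubar() are made-up names):

	extern void ufoo(unsigned int x);

	void ubar(void)
	{
		unsigned int i;

		for (i = 1; i; i++)	/* exits when i wraps from UINT_MAX back to 0 */
			ufoo(i);
	}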


* Re: [PATCH 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
  2012-11-04  8:47 [PATCH 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily George Spelvin
@ 2012-11-04 15:52 ` Oleg Nesterov
  0 siblings, 0 replies; 8+ messages in thread
From: Oleg Nesterov @ 2012-11-04 15:52 UTC (permalink / raw)
  To: George Spelvin; +Cc: torvalds, linux-kernel

On 11/04, George Spelvin wrote:
>
> Grand poo-bah Linus wrote:
> > Now, I doubt you'll find an architecture or C compiler where this will
> > actually ever make a difference, but the fact remains that you
> > shouldn't use signed integers for counters like this. You should use
> > unsigned, and you should rely on the well-defined modulo-2**n
> > semantics.
>
> Actually, this is another C standard undefined case that recent versions of
> GCC exploit for optimization.
                  ^^^^^^^^^^^^

This is a different thing, though.

> For example, the loop:
> 	for (i = 1; i; i++)
> 		/* Code */
> will never terminate!  Feed the following to gcc -O2 and see for yourself:

Yes, because ...

> Notice the lack of test in the "jmp .L2" loop.

Exactly.

But if we have something like

	int A, B;

	int sum(void)
	{
		return A + B;
	}

then I doubt there is any architecture (at least any supported by Linux)
which generates different code if you do s/int/unsigned/.
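
For illustration, here is the unsigned twin (UA, UB and usum() are
made-up names).  On a two's-complement machine it compiles to the same
add instruction; only the interpretation of the result bits differs:

	unsigned int UA, UB;

	unsigned int usum(void)
	{
		return UA + UB;		/* same machine add as sum() above */
	}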


Anyway, I agree that unsigned makes more sense, and I have changed the
patch accordingly.

Oleg.



* Re: [PATCH 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
  2012-11-01 18:33                       ` Oleg Nesterov
@ 2012-11-02 16:18                         ` Oleg Nesterov
  0 siblings, 0 replies; 8+ messages in thread
From: Oleg Nesterov @ 2012-11-02 16:18 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Mikulas Patocka, Peter Zijlstra, Linus Torvalds, Ingo Molnar,
	Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov,
	linux-kernel

On 11/01, Oleg Nesterov wrote:
>
> On 11/01, Paul E. McKenney wrote:
> >
> > OK, so it looks to me that this code relies on synchronize_sched()
> > forcing a memory barrier on each CPU executing in the kernel.
>
> No, the patch tries to avoid this assumption, but probably I missed
> something.
>
> > 1.	A task running on CPU 0 currently write-holds the lock.
> >
> > 2.	CPU 1 is running in the kernel, executing a longer-than-average
> > 	loop of normal instructions (no atomic instructions or memory
> > 	barriers).
> >
> > 3.	CPU 0 invokes percpu_up_write(), calling up_write(),
> > 	synchronize_sched(), and finally mutex_unlock().
>
> And my expectation was, this should be enough because ...
>
> > 4.	CPU 1 executes percpu_down_read(), which calls update_fast_ctr(),
>
> since update_fast_ctr does preempt_disable/enable it should see all
> modifications done by CPU 0.
>
> IOW. Suppose that the writer (CPU 0) does
>
> 	percpu_down_write();
> 	STORE;
> 	percpu_up_write();
>
> This means
>
> 	STORE;
> 	synchronize_sched();
> 	mutex_unlock();
>
> Now. Do you mean that the next preempt_disable/enable can see the
> result of mutex_unlock() but not STORE?

So far I think this is not possible, so the code doesn't need the
additional wstate/barriers.

> > +static bool update_fast_ctr(struct percpu_rw_semaphore *brw, int val)
> > +{
> > +	bool success = false;
>
> 	int state;
>
> > +
> > +	preempt_disable();
> > +	if (likely(!mutex_is_locked(&brw->writer_mutex))) {
>
> 	state = ACCESS_ONCE(brw->wstate);
> 	if (likely(!state)) {
>
> > +		__this_cpu_add(*brw->fast_read_ctr, val);
> > +		success = true;
>
> 	} else if (state & WSTATE_NEED_MB) {
> 		__this_cpu_add(*brw->fast_read_ctr, val);
> 		smp_mb(); /* Order increment against critical section. */
> 		success = true;
> 	}

...

> > +void percpu_up_write(struct percpu_rw_semaphore *brw)
> > +{
> > +	/* allow the new readers, but only the slow-path */
> > +	up_write(&brw->rw_sem);
>
> 	ACCESS_ONCE(brw->wstate) = WSTATE_NEED_MB;
>
> > +
> > +	/* insert the barrier before the next fast-path in down_read */
> > +	synchronize_sched();

But update_fast_ctr() should see mutex_is_locked(); obviously down_write()
must ensure this.

So update_fast_ctr() can execute the WSTATE_NEED_MB code only if it
races with

> 	ACCESS_ONCE(brw->wstate) = 0;
>
> > +	mutex_unlock(&brw->writer_mutex);

these 2 stores and sees them in reverse order.



I guess that mutex_is_locked() in update_fast_ctr() looks a bit confusing.
It means no-fast-path for the reader; we could use ->state instead.

And even ->writer_mutex should go away if we want to optimize the
write-contended case, but I think this needs another patch on top of
this initial implementation.

Oleg.



* Re: [PATCH 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
  2012-11-01 15:43                     ` Paul E. McKenney
@ 2012-11-01 18:33                       ` Oleg Nesterov
  2012-11-02 16:18                         ` Oleg Nesterov
  0 siblings, 1 reply; 8+ messages in thread
From: Oleg Nesterov @ 2012-11-01 18:33 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Mikulas Patocka, Peter Zijlstra, Linus Torvalds, Ingo Molnar,
	Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov,
	linux-kernel

Paul, thanks.

Sorry, I can't reply today, just one note...

On 11/01, Paul E. McKenney wrote:
>
> OK, so it looks to me that this code relies on synchronize_sched()
> forcing a memory barrier on each CPU executing in the kernel.

No, the patch tries to avoid this assumption, but probably I missed
something.

> 1.	A task running on CPU 0 currently write-holds the lock.
>
> 2.	CPU 1 is running in the kernel, executing a longer-than-average
> 	loop of normal instructions (no atomic instructions or memory
> 	barriers).
>
> 3.	CPU 0 invokes percpu_up_write(), calling up_write(),
> 	synchronize_sched(), and finally mutex_unlock().

And my expectation was, this should be enough because ...

> 4.	CPU 1 executes percpu_down_read(), which calls update_fast_ctr(),

since update_fast_ctr does preempt_disable/enable it should see all
modifications done by CPU 0.

IOW. Suppose that the writer (CPU 0) does

	percpu_down_write();
	STORE;
	percpu_up_write();

This means

	STORE;
	synchronize_sched();
	mutex_unlock();

Now. Do you mean that the next preempt_disable/enable can see the
result of mutex_unlock() but not STORE?
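
As a sketch of the scenario in question (the "shared" variable and the
two helpers are hypothetical, for illustration only):

	static int shared;

	/* writer, CPU 0 */
	static void writer(struct percpu_rw_semaphore *brw)
	{
		percpu_down_write(brw);
		shared = 1;			/* STORE */
		percpu_up_write(brw);		/* up_write(); synchronize_sched(); mutex_unlock() */
	}

	/* reader, CPU 1, taking the lock after the writer's percpu_up_write() */
	static int reader(struct percpu_rw_semaphore *brw)
	{
		int val;

		percpu_down_read(brw);		/* update_fast_ctr() does preempt_disable/enable */
		val = shared;			/* the question: must this observe the STORE? */
		percpu_up_read(brw);
		return val;
	}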

Oleg.



* Re: [PATCH 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
  2012-10-31 19:41                   ` [PATCH 1/1] " Oleg Nesterov
  2012-11-01 15:10                     ` Linus Torvalds
@ 2012-11-01 15:43                     ` Paul E. McKenney
  2012-11-01 18:33                       ` Oleg Nesterov
  1 sibling, 1 reply; 8+ messages in thread
From: Paul E. McKenney @ 2012-11-01 15:43 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Mikulas Patocka, Peter Zijlstra, Linus Torvalds, Ingo Molnar,
	Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov,
	linux-kernel

On Wed, Oct 31, 2012 at 08:41:58PM +0100, Oleg Nesterov wrote:
> Currently the writer does msleep() plus synchronize_sched() 3 times
> to acquire/release the semaphore, and during this time the readers
> are blocked completely. Even if the "write" section was not actually
> started or if it was already finished.
> 
> With this patch down_read/up_read does synchronize_sched() twice and
> down_read/up_read are still possible during this time, just they use
> the slow path.
> 
> percpu_down_write() first forces the readers to use rw_semaphore and
> increment the "slow" counter to take the lock for reading, then it
> takes that rw_semaphore for writing and blocks the readers.
> 
> Also, with this patch the code relies on the documented behaviour of
> synchronize_sched(); it doesn't try to pair synchronize_sched() with
> barrier.

OK, so it looks to me that this code relies on synchronize_sched()
forcing a memory barrier on each CPU executing in the kernel.  I
might well be confused, so here is the sequence of events that leads
me to believe this:

1.	A task running on CPU 0 currently write-holds the lock.

2.	CPU 1 is running in the kernel, executing a longer-than-average
	loop of normal instructions (no atomic instructions or memory
	barriers).

3.	CPU 0 invokes percpu_up_write(), calling up_write(),
	synchronize_sched(), and finally mutex_unlock().

4.	CPU 1 executes percpu_down_read(), which calls update_fast_ctr(),
	which finds that ->writer_mutex is not held.  CPU 1 therefore
	increments ->fast_read_ctr and returns success.

Of course, as Mikulas pointed out, the actual implementation will
have forced CPU 1 to execute a memory barrier in the course of the
synchronize_sched() implementation.  However, if synchronize_sched() had
been modified to act as synchronize_srcu() currently does, there would
be no memory barrier, and thus no guarantee that CPU 1's subsequent
read-side critical section would see the effect of CPU 0's previous
write-side critical section.

Fortunately, this is easy to fix, with zero added overhead on the
read-side fastpath, as shown by the notes interspersed below.

Thoughts?

						Thanx, Paul

> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> ---
>  include/linux/percpu-rwsem.h |   83 +++++---------------------------
>  lib/Makefile                 |    2 +-
>  lib/percpu-rwsem.c           |  106 ++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 120 insertions(+), 71 deletions(-)
>  create mode 100644 lib/percpu-rwsem.c
> 
> diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
> index 250a4ac..7f738ca 100644
> --- a/include/linux/percpu-rwsem.h
> +++ b/include/linux/percpu-rwsem.h
> @@ -2,82 +2,25 @@
>  #define _LINUX_PERCPU_RWSEM_H
> 
>  #include <linux/mutex.h>
> +#include <linux/rwsem.h>
>  #include <linux/percpu.h>
> -#include <linux/rcupdate.h>
> -#include <linux/delay.h>
> +#include <linux/wait.h>
> 
>  struct percpu_rw_semaphore {
> -	unsigned __percpu *counters;
> -	bool locked;
> -	struct mutex mtx;
> +	int __percpu		*fast_read_ctr;
> +	struct mutex		writer_mutex;
> +	struct rw_semaphore	rw_sem;
> +	atomic_t		slow_read_ctr;
> +	wait_queue_head_t	write_waitq;
	int			wstate;
>  };
> 
> -#define light_mb()	barrier()
> -#define heavy_mb()	synchronize_sched()
> +extern void percpu_down_read(struct percpu_rw_semaphore *);
> +extern void percpu_up_read(struct percpu_rw_semaphore *);
> 
> -static inline void percpu_down_read(struct percpu_rw_semaphore *p)
> -{
> -	rcu_read_lock_sched();
> -	if (unlikely(p->locked)) {
> -		rcu_read_unlock_sched();
> -		mutex_lock(&p->mtx);
> -		this_cpu_inc(*p->counters);
> -		mutex_unlock(&p->mtx);
> -		return;
> -	}
> -	this_cpu_inc(*p->counters);
> -	rcu_read_unlock_sched();
> -	light_mb(); /* A, between read of p->locked and read of data, paired with D */
> -}
> +extern void percpu_down_write(struct percpu_rw_semaphore *);
> +extern void percpu_up_write(struct percpu_rw_semaphore *);
> 
> -static inline void percpu_up_read(struct percpu_rw_semaphore *p)
> -{
> -	light_mb(); /* B, between read of the data and write to p->counter, paired with C */
> -	this_cpu_dec(*p->counters);
> -}
> -
> -static inline unsigned __percpu_count(unsigned __percpu *counters)
> -{
> -	unsigned total = 0;
> -	int cpu;
> -
> -	for_each_possible_cpu(cpu)
> -		total += ACCESS_ONCE(*per_cpu_ptr(counters, cpu));
> -
> -	return total;
> -}
> -
> -static inline void percpu_down_write(struct percpu_rw_semaphore *p)
> -{
> -	mutex_lock(&p->mtx);
> -	p->locked = true;
> -	synchronize_sched(); /* make sure that all readers exit the rcu_read_lock_sched region */
> -	while (__percpu_count(p->counters))
> -		msleep(1);
> -	heavy_mb(); /* C, between read of p->counter and write to data, paired with B */
> -}
> -
> -static inline void percpu_up_write(struct percpu_rw_semaphore *p)
> -{
> -	heavy_mb(); /* D, between write to data and write to p->locked, paired with A */
> -	p->locked = false;
> -	mutex_unlock(&p->mtx);
> -}
> -
> -static inline int percpu_init_rwsem(struct percpu_rw_semaphore *p)
> -{
> -	p->counters = alloc_percpu(unsigned);
> -	if (unlikely(!p->counters))
> -		return -ENOMEM;
> -	p->locked = false;
> -	mutex_init(&p->mtx);
> -	return 0;
> -}
> -
> -static inline void percpu_free_rwsem(struct percpu_rw_semaphore *p)
> -{
> -	free_percpu(p->counters);
> -	p->counters = NULL; /* catch use after free bugs */
> -}
> +extern int percpu_init_rwsem(struct percpu_rw_semaphore *);
> +extern void percpu_free_rwsem(struct percpu_rw_semaphore *);
> 
>  #endif
> diff --git a/lib/Makefile b/lib/Makefile
> index 821a162..4dad4a7 100644
> --- a/lib/Makefile
> +++ b/lib/Makefile
> @@ -12,7 +12,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
>  	 idr.o int_sqrt.o extable.o \
>  	 sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \
>  	 proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \
> -	 is_single_threaded.o plist.o decompress.o
> +	 is_single_threaded.o plist.o decompress.o percpu-rwsem.o
> 
>  lib-$(CONFIG_MMU) += ioremap.o
>  lib-$(CONFIG_SMP) += cpumask.o
> diff --git a/lib/percpu-rwsem.c b/lib/percpu-rwsem.c
> new file mode 100644
> index 0000000..40a415d
> --- /dev/null
> +++ b/lib/percpu-rwsem.c
> @@ -0,0 +1,106 @@
> +#include <linux/percpu-rwsem.h>
> +#include <linux/rcupdate.h>
> +#include <linux/sched.h>

#define WSTATE_NEED_LOCK 1
#define WSTATE_NEED_MB	 2

> +int percpu_init_rwsem(struct percpu_rw_semaphore *brw)
> +{
> +	brw->fast_read_ctr = alloc_percpu(int);
> +	if (unlikely(!brw->fast_read_ctr))
> +		return -ENOMEM;
> +
> +	mutex_init(&brw->writer_mutex);
> +	init_rwsem(&brw->rw_sem);
> +	atomic_set(&brw->slow_read_ctr, 0);
> +	init_waitqueue_head(&brw->write_waitq);
> +	return 0;
> +}
> +
> +void percpu_free_rwsem(struct percpu_rw_semaphore *brw)
> +{
> +	free_percpu(brw->fast_read_ctr);
> +	brw->fast_read_ctr = NULL; /* catch use after free bugs */
> +}
> +
> +static bool update_fast_ctr(struct percpu_rw_semaphore *brw, int val)
> +{
> +	bool success = false;

	int state;

> +
> +	preempt_disable();
> +	if (likely(!mutex_is_locked(&brw->writer_mutex))) {

	state = ACCESS_ONCE(brw->wstate);
	if (likely(!state)) {

> +		__this_cpu_add(*brw->fast_read_ctr, val);
> +		success = true;

	} else if (state & WSTATE_NEED_MB) {
		__this_cpu_add(*brw->fast_read_ctr, val);
		smp_mb(); /* Order increment against critical section. */
		success = true;
	}

> +	preempt_enable();
> +
> +	return success;
> +}
> +
> +void percpu_down_read(struct percpu_rw_semaphore *brw)
> +{
> +	if (likely(update_fast_ctr(brw, +1)))
> +		return;
> +
> +	down_read(&brw->rw_sem);
> +	atomic_inc(&brw->slow_read_ctr);
> +	up_read(&brw->rw_sem);
> +}
> +
> +void percpu_up_read(struct percpu_rw_semaphore *brw)
> +{
> +	if (likely(update_fast_ctr(brw, -1)))
> +		return;
> +
> +	/* false-positive is possible but harmless */
> +	if (atomic_dec_and_test(&brw->slow_read_ctr))
> +		wake_up_all(&brw->write_waitq);
> +}
> +
> +static int clear_fast_read_ctr(struct percpu_rw_semaphore *brw)
> +{
> +	int cpu, sum = 0;
> +
> +	for_each_possible_cpu(cpu) {
> +		sum += per_cpu(*brw->fast_read_ctr, cpu);
> +		per_cpu(*brw->fast_read_ctr, cpu) = 0;
> +	}
> +
> +	return sum;
> +}
> +
> +void percpu_down_write(struct percpu_rw_semaphore *brw)
> +{
> +	/* also blocks update_fast_ctr() which checks mutex_is_locked() */
> +	mutex_lock(&brw->writer_mutex);

	ACCESS_ONCE(brw->wstate) = WSTATE_NEED_LOCK;

> +	/*
> +	 * 1. Ensures mutex_is_locked() is visible to any down_read/up_read
> +	 *    so that update_fast_ctr() can't succeed.
> +	 *
> +	 * 2. Ensures we see the result of every previous this_cpu_add() in
> +	 *    update_fast_ctr().
> +	 *
> +	 * 3. Ensures that if any reader has exited its critical section via
> +	 *    fast-path, it executes a full memory barrier before we return.
> +	 */
> +	synchronize_sched();
> +
> +	/* nobody can use fast_read_ctr, move its sum into slow_read_ctr */
> +	atomic_add(clear_fast_read_ctr(brw), &brw->slow_read_ctr);
> +
> +	/* block the new readers completely */
> +	down_write(&brw->rw_sem);
> +
> +	/* wait for all readers to complete their percpu_up_read() */
> +	wait_event(brw->write_waitq, !atomic_read(&brw->slow_read_ctr));
> +}
> +
> +void percpu_up_write(struct percpu_rw_semaphore *brw)
> +{
> +	/* allow the new readers, but only the slow-path */
> +	up_write(&brw->rw_sem);

	ACCESS_ONCE(brw->wstate) = WSTATE_NEED_MB;

> +
> +	/* insert the barrier before the next fast-path in down_read */
> +	synchronize_sched();

	ACCESS_ONCE(brw->wstate) = 0;

> +	mutex_unlock(&brw->writer_mutex);
> +}

OK.
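
Assembled from the interspersed notes above, the suggested
update_fast_ctr() fast path would look roughly like this (a sketch, not
a tested patch):

	static bool update_fast_ctr(struct percpu_rw_semaphore *brw, int val)
	{
		bool success = false;
		int state;

		preempt_disable();
		state = ACCESS_ONCE(brw->wstate);
		if (likely(!state)) {
			/* no writer around: plain per-cpu fast path */
			__this_cpu_add(*brw->fast_read_ctr, val);
			success = true;
		} else if (state & WSTATE_NEED_MB) {
			/* writer is leaving: still fast, but order the
			   counter update against the critical section */
			__this_cpu_add(*brw->fast_read_ctr, val);
			smp_mb();
			success = true;
		}
		preempt_enable();

		return success;
	}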



* Re: [PATCH 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
  2012-11-01 15:10                     ` Linus Torvalds
@ 2012-11-01 15:34                       ` Oleg Nesterov
  0 siblings, 0 replies; 8+ messages in thread
From: Oleg Nesterov @ 2012-11-01 15:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mikulas Patocka, Paul E. McKenney, Peter Zijlstra, Ingo Molnar,
	Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov,
	linux-kernel

Thanks!

I'll send v2 tomorrow.

On 11/01, Linus Torvalds wrote:
> On Wed, Oct 31, 2012 at 12:41 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> > Currently the writer does msleep() plus synchronize_sched() 3 times
> > to acquire/release the semaphore, and during this time the readers
> > are blocked completely. Even if the "write" section was not actually
> > started or if it was already finished.
> >
> > With this patch down_read/up_read does synchronize_sched() twice and
> > down_read/up_read are still possible during this time, just they use
> > the slow path.
> 
> The changelog is wrong (it's the write path, not read path, that does
> the synchronize_sched).
> 
> >  struct percpu_rw_semaphore {
> > -       unsigned __percpu *counters;
> > -       bool locked;
> > -       struct mutex mtx;
> > +       int __percpu            *fast_read_ctr;
> 
> This change is wrong.
> 
> You must not make the 'fast_read_ctr' thing be an int. Or at least you
> need to be a hell of a lot more careful about it.
> 
> Why?
> 
> Because the readers update the counters while possibly moving around
> cpu's, the increment and decrement of the counters may be on different
> CPU's. But that means that when you add all the counters together,
> things can overflow (only the final sum is meaningful). And THAT in
> turn means that you should not use a signed count, for the simple
> reason that signed integers don't have well-behaved overflow behavior
> in C.
> 
> Now, I doubt you'll find an architecture or C compiler where this will
> actually ever make a difference, but the fact remains that you
> shouldn't use signed integers for counters like this. You should use
> unsigned, and you should rely on the well-defined modulo-2**n
> semantics.
> 
> I'd also like to see a comment somewhere in the source code about the
> whole algorithm and the rules.
> 
> Other than that, I guess it looks ok.
> 
>             Linus



* Re: [PATCH 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
  2012-10-31 19:41                   ` [PATCH 1/1] " Oleg Nesterov
@ 2012-11-01 15:10                     ` Linus Torvalds
  2012-11-01 15:34                       ` Oleg Nesterov
  2012-11-01 15:43                     ` Paul E. McKenney
  1 sibling, 1 reply; 8+ messages in thread
From: Linus Torvalds @ 2012-11-01 15:10 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Mikulas Patocka, Paul E. McKenney, Peter Zijlstra, Ingo Molnar,
	Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov,
	linux-kernel

On Wed, Oct 31, 2012 at 12:41 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> Currently the writer does msleep() plus synchronize_sched() 3 times
> to acquire/release the semaphore, and during this time the readers
> are blocked completely. Even if the "write" section was not actually
> started or if it was already finished.
>
> With this patch down_read/up_read does synchronize_sched() twice and
> down_read/up_read are still possible during this time, just they use
> the slow path.

The changelog is wrong (it's the write path, not read path, that does
the synchronize_sched).

>  struct percpu_rw_semaphore {
> -       unsigned __percpu *counters;
> -       bool locked;
> -       struct mutex mtx;
> +       int __percpu            *fast_read_ctr;

This change is wrong.

You must not make the 'fast_read_ctr' thing be an int. Or at least you
need to be a hell of a lot more careful about it.

Why?

Because the readers update the counters while possibly moving around
cpu's, the increment and decrement of the counters may be on different
CPU's. But that means that when you add all the counters together,
things can overflow (only the final sum is meaningful). And THAT in
turn means that you should not use a signed count, for the simple
reason that signed integers don't have well-behaved overflow behavior
in C.

Now, I doubt you'll find an architecture or C compiler where this will
actually ever make a difference, but the fact remains that you
shouldn't use signed integers for counters like this. You should use
unsigned, and you should rely on the well-defined modulo-2**n
semantics.
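
For illustration, a minimal sketch of that wraparound (hypothetical
two-CPU trace, made-up numbers):

	static unsigned int counter_sum_sketch(void)
	{
		/* stand-ins for two per-cpu fast_read_ctr slots */
		unsigned int ctr0 = 3000000000u;	/* 3e9 down_read()s landed on CPU 0 */
		unsigned int ctr1 = -3000000000u;	/* the matching up_read()s landed on CPU 1 */

		/*
		 * Each counter alone is meaningless (ctr0 would not even fit
		 * in an int), but the modulo-2**32 sum is exactly the number
		 * of readers still holding the lock:
		 */
		return ctr0 + ctr1;			/* == 0 */
	}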

I'd also like to see a comment somewhere in the source code about the
whole algorithm and the rules.

Other than that, I guess it looks ok.

            Linus


* [PATCH 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
  2012-10-31 19:41                 ` [PATCH 0/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily Oleg Nesterov
@ 2012-10-31 19:41                   ` Oleg Nesterov
  2012-11-01 15:10                     ` Linus Torvalds
  2012-11-01 15:43                     ` Paul E. McKenney
  0 siblings, 2 replies; 8+ messages in thread
From: Oleg Nesterov @ 2012-10-31 19:41 UTC (permalink / raw)
  To: Mikulas Patocka, Paul E. McKenney, Peter Zijlstra, Linus Torvalds
  Cc: Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli,
	Anton Arapov, linux-kernel

Currently the writer does msleep() plus synchronize_sched() 3 times
to acquire/release the semaphore, and during this time the readers
are blocked completely. Even if the "write" section was not actually
started or if it was already finished.

With this patch down_read/up_read does synchronize_sched() twice and
down_read/up_read are still possible during this time, just they use
the slow path.

percpu_down_write() first forces the readers to use rw_semaphore and
increment the "slow" counter to take the lock for reading, then it
takes that rw_semaphore for writing and blocks the readers.

Also, with this patch the code relies on the documented behaviour of
synchronize_sched(); it doesn't try to pair synchronize_sched() with
barrier.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 include/linux/percpu-rwsem.h |   83 +++++---------------------------
 lib/Makefile                 |    2 +-
 lib/percpu-rwsem.c           |  106 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 120 insertions(+), 71 deletions(-)
 create mode 100644 lib/percpu-rwsem.c

diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
index 250a4ac..7f738ca 100644
--- a/include/linux/percpu-rwsem.h
+++ b/include/linux/percpu-rwsem.h
@@ -2,82 +2,25 @@
 #define _LINUX_PERCPU_RWSEM_H
 
 #include <linux/mutex.h>
+#include <linux/rwsem.h>
 #include <linux/percpu.h>
-#include <linux/rcupdate.h>
-#include <linux/delay.h>
+#include <linux/wait.h>
 
 struct percpu_rw_semaphore {
-	unsigned __percpu *counters;
-	bool locked;
-	struct mutex mtx;
+	int __percpu		*fast_read_ctr;
+	struct mutex		writer_mutex;
+	struct rw_semaphore	rw_sem;
+	atomic_t		slow_read_ctr;
+	wait_queue_head_t	write_waitq;
 };
 
-#define light_mb()	barrier()
-#define heavy_mb()	synchronize_sched()
+extern void percpu_down_read(struct percpu_rw_semaphore *);
+extern void percpu_up_read(struct percpu_rw_semaphore *);
 
-static inline void percpu_down_read(struct percpu_rw_semaphore *p)
-{
-	rcu_read_lock_sched();
-	if (unlikely(p->locked)) {
-		rcu_read_unlock_sched();
-		mutex_lock(&p->mtx);
-		this_cpu_inc(*p->counters);
-		mutex_unlock(&p->mtx);
-		return;
-	}
-	this_cpu_inc(*p->counters);
-	rcu_read_unlock_sched();
-	light_mb(); /* A, between read of p->locked and read of data, paired with D */
-}
+extern void percpu_down_write(struct percpu_rw_semaphore *);
+extern void percpu_up_write(struct percpu_rw_semaphore *);
 
-static inline void percpu_up_read(struct percpu_rw_semaphore *p)
-{
-	light_mb(); /* B, between read of the data and write to p->counter, paired with C */
-	this_cpu_dec(*p->counters);
-}
-
-static inline unsigned __percpu_count(unsigned __percpu *counters)
-{
-	unsigned total = 0;
-	int cpu;
-
-	for_each_possible_cpu(cpu)
-		total += ACCESS_ONCE(*per_cpu_ptr(counters, cpu));
-
-	return total;
-}
-
-static inline void percpu_down_write(struct percpu_rw_semaphore *p)
-{
-	mutex_lock(&p->mtx);
-	p->locked = true;
-	synchronize_sched(); /* make sure that all readers exit the rcu_read_lock_sched region */
-	while (__percpu_count(p->counters))
-		msleep(1);
-	heavy_mb(); /* C, between read of p->counter and write to data, paired with B */
-}
-
-static inline void percpu_up_write(struct percpu_rw_semaphore *p)
-{
-	heavy_mb(); /* D, between write to data and write to p->locked, paired with A */
-	p->locked = false;
-	mutex_unlock(&p->mtx);
-}
-
-static inline int percpu_init_rwsem(struct percpu_rw_semaphore *p)
-{
-	p->counters = alloc_percpu(unsigned);
-	if (unlikely(!p->counters))
-		return -ENOMEM;
-	p->locked = false;
-	mutex_init(&p->mtx);
-	return 0;
-}
-
-static inline void percpu_free_rwsem(struct percpu_rw_semaphore *p)
-{
-	free_percpu(p->counters);
-	p->counters = NULL; /* catch use after free bugs */
-}
+extern int percpu_init_rwsem(struct percpu_rw_semaphore *);
+extern void percpu_free_rwsem(struct percpu_rw_semaphore *);
 
 #endif
diff --git a/lib/Makefile b/lib/Makefile
index 821a162..4dad4a7 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -12,7 +12,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
 	 idr.o int_sqrt.o extable.o \
 	 sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \
 	 proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \
-	 is_single_threaded.o plist.o decompress.o
+	 is_single_threaded.o plist.o decompress.o percpu-rwsem.o
 
 lib-$(CONFIG_MMU) += ioremap.o
 lib-$(CONFIG_SMP) += cpumask.o
diff --git a/lib/percpu-rwsem.c b/lib/percpu-rwsem.c
new file mode 100644
index 0000000..40a415d
--- /dev/null
+++ b/lib/percpu-rwsem.c
@@ -0,0 +1,106 @@
+#include <linux/percpu-rwsem.h>
+#include <linux/rcupdate.h>
+#include <linux/sched.h>
+
+int percpu_init_rwsem(struct percpu_rw_semaphore *brw)
+{
+	brw->fast_read_ctr = alloc_percpu(int);
+	if (unlikely(!brw->fast_read_ctr))
+		return -ENOMEM;
+
+	mutex_init(&brw->writer_mutex);
+	init_rwsem(&brw->rw_sem);
+	atomic_set(&brw->slow_read_ctr, 0);
+	init_waitqueue_head(&brw->write_waitq);
+	return 0;
+}
+
+void percpu_free_rwsem(struct percpu_rw_semaphore *brw)
+{
+	free_percpu(brw->fast_read_ctr);
+	brw->fast_read_ctr = NULL; /* catch use after free bugs */
+}
+
+static bool update_fast_ctr(struct percpu_rw_semaphore *brw, int val)
+{
+	bool success = false;
+
+	preempt_disable();
+	if (likely(!mutex_is_locked(&brw->writer_mutex))) {
+		__this_cpu_add(*brw->fast_read_ctr, val);
+		success = true;
+	}
+	preempt_enable();
+
+	return success;
+}
+
+void percpu_down_read(struct percpu_rw_semaphore *brw)
+{
+	if (likely(update_fast_ctr(brw, +1)))
+		return;
+
+	down_read(&brw->rw_sem);
+	atomic_inc(&brw->slow_read_ctr);
+	up_read(&brw->rw_sem);
+}
+
+void percpu_up_read(struct percpu_rw_semaphore *brw)
+{
+	if (likely(update_fast_ctr(brw, -1)))
+		return;
+
+	/* false-positive is possible but harmless */
+	if (atomic_dec_and_test(&brw->slow_read_ctr))
+		wake_up_all(&brw->write_waitq);
+}
+
+static int clear_fast_read_ctr(struct percpu_rw_semaphore *brw)
+{
+	int cpu, sum = 0;
+
+	for_each_possible_cpu(cpu) {
+		sum += per_cpu(*brw->fast_read_ctr, cpu);
+		per_cpu(*brw->fast_read_ctr, cpu) = 0;
+	}
+
+	return sum;
+}
+
+void percpu_down_write(struct percpu_rw_semaphore *brw)
+{
+	/* also blocks update_fast_ctr() which checks mutex_is_locked() */
+	mutex_lock(&brw->writer_mutex);
+
+	/*
+	 * 1. Ensures mutex_is_locked() is visible to any down_read/up_read
+	 *    so that update_fast_ctr() can't succeed.
+	 *
+	 * 2. Ensures we see the result of every previous this_cpu_add() in
+	 *    update_fast_ctr().
+	 *
+	 * 3. Ensures that if any reader has exited its critical section via
+	 *    fast-path, it executes a full memory barrier before we return.
+	 */
+	synchronize_sched();
+
+	/* nobody can use fast_read_ctr, move its sum into slow_read_ctr */
+	atomic_add(clear_fast_read_ctr(brw), &brw->slow_read_ctr);
+
+	/* block the new readers completely */
+	down_write(&brw->rw_sem);
+
+	/* wait for all readers to complete their percpu_up_read() */
+	wait_event(brw->write_waitq, !atomic_read(&brw->slow_read_ctr));
+}
+
+void percpu_up_write(struct percpu_rw_semaphore *brw)
+{
+	/* allow the new readers, but only the slow-path */
+	up_write(&brw->rw_sem);
+
+	/* insert the barrier before the next fast-path in down_read */
+	synchronize_sched();
+
+	mutex_unlock(&brw->writer_mutex);
+}
-- 
1.5.5.1
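
A minimal usage sketch of the new interface (hypothetical caller, not
part of the patch):

	static struct percpu_rw_semaphore my_brw;	/* percpu_init_rwsem(&my_brw) at init time */
	static int shared_state;

	static int read_state(void)
	{
		int val;

		percpu_down_read(&my_brw);	/* usually just a per-cpu increment */
		val = shared_state;
		percpu_up_read(&my_brw);
		return val;
	}

	static void write_state(int val)
	{
		percpu_down_write(&my_brw);	/* forces readers onto the rw_sem slow path */
		shared_state = val;
		percpu_up_write(&my_brw);
	}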





Thread overview: 8+ messages
2012-11-04  8:47 [PATCH 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily George Spelvin
2012-11-04 15:52 ` Oleg Nesterov
  -- strict thread matches above, loose matches on Subject: below --
2012-10-16 19:56 [PATCH 1/2] brw_mutex: big read-write mutex Linus Torvalds
2012-10-17 16:59 ` Oleg Nesterov
2012-10-17 22:44   ` Paul E. McKenney
2012-10-18 16:24     ` Oleg Nesterov
2012-10-18 16:38       ` Paul E. McKenney
2012-10-18 17:57         ` Oleg Nesterov
2012-10-19 19:28           ` Paul E. McKenney
2012-10-22 23:36             ` [PATCH 0/2] fix and improvements for percpu-rw-semaphores (was: brw_mutex: big read-write mutex) Mikulas Patocka
2012-10-30 18:48               ` Oleg Nesterov
2012-10-31 19:41                 ` [PATCH 0/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily Oleg Nesterov
2012-10-31 19:41                   ` [PATCH 1/1] " Oleg Nesterov
2012-11-01 15:10                     ` Linus Torvalds
2012-11-01 15:34                       ` Oleg Nesterov
2012-11-01 15:43                     ` Paul E. McKenney
2012-11-01 18:33                       ` Oleg Nesterov
2012-11-02 16:18                         ` Oleg Nesterov
