* Re: [PATCH 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
@ 2012-11-04 8:47 George Spelvin
2012-11-04 15:52 ` Oleg Nesterov
0 siblings, 1 reply; 8+ messages in thread
From: George Spelvin @ 2012-11-04 8:47 UTC (permalink / raw)
To: oleg, torvalds; +Cc: linux, linux-kernel
Grand poo-bah Linus wrote:
> Now, I doubt you'll find an architecture or C compiler where this will
> actually ever make a difference, but the fact remains that you
> shouldn't use signed integers for counters like this. You should use
> unsigned, and you should rely on the well-defined modulo-2**n
> semantics.
Actually, this is another C standard undefined case that recent versions of
GCC exploit for optimization.
When using signed integers, GCC's optimizer assumes that, if b > 0,
then a + b > a.
For example, the loop:
	for (i = 1; i; i++)
		/* Code */
will never terminate! Feed the following to gcc -O2 and see for yourself:
	extern void foo(int x);

	void bar(void)
	{
		int i;

		for (i = 0; i >= 0; i++)
			foo(i);
	}
Here's what I get:
		.file	"test.c"
		.text
		.p2align 4,,15
		.globl	bar
		.type	bar, @function
	bar:
	.LFB0:
		.cfi_startproc
		pushl	%ebx
		.cfi_def_cfa_offset 8
		.cfi_offset 3, -8
		movl	$1, %ebx
		subl	$24, %esp
		.cfi_def_cfa_offset 32
		.p2align 4,,7
		.p2align 3
	.L2:
		movl	%ebx, (%esp)
		addl	$1, %ebx
		call	foo
		jmp	.L2
		.cfi_endproc
	.LFE0:
		.size	bar, .-bar
		.ident	"GCC: (Debian 4.7.2-4) 4.7.2"
		.section	.note.GNU-stack,"",@progbits
Notice the lack of a test in the "jmp .L2" loop.
It can even handle more complicated cases like:
	void bar(int j)
	{
		int i = 0;

		do {
			foo(i);
			if (j >= 0)
				i += j;
			else
				i -= j;
		} while (i >= 0);
	}
... which gcc -O3 neatly splits into two infinite loops, as if I had written:
	void bar(int j)
	{
		int i = 0;

		if (j >= 0)
			for (; ; i += j)
				foo(i);
		else
			for (; ; i -= j)
				foo(i);
	}
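For contrast, here is the unsigned counterpart of the first loop above
(an illustrative sketch, not part of the original mail): unsigned
overflow is defined to wrap modulo 2**n, so i eventually becomes 0 and
gcc must keep the loop test:

	extern void foo(unsigned int x);

	void bar_unsigned(void)
	{
		unsigned int i;

		/* i wraps to 0 after UINT_MAX iterations; the exit test stays */
		for (i = 1; i; i++)
			foo(i);
	}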
* Re: [PATCH 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
2012-11-04 8:47 [PATCH 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily George Spelvin
@ 2012-11-04 15:52 ` Oleg Nesterov
0 siblings, 0 replies; 8+ messages in thread
From: Oleg Nesterov @ 2012-11-04 15:52 UTC (permalink / raw)
To: George Spelvin; +Cc: torvalds, linux-kernel

On 11/04, George Spelvin wrote:
>
> Grand poo-bah Linus wrote:
> > Now, I doubt you'll find an architecture or C compiler where this will
> > actually ever make a difference, but the fact remains that you
> > shouldn't use signed integers for counters like this. You should use
> > unsigned, and you should rely on the well-defined modulo-2**n
> > semantics.
>
> Actually, this is another C standard undefined case that recent versions of
> GCC exploit for optimization.
                  ^^^^^^^^^^^^
This is another thing, though.

> For example, the loop:
> 	for (i = 1; i; i++)
> 		/* Code */
> will never terminate! Feed the following to gcc -O2 and see for yourself:

Yes, because ...

> Notice the lack of a test in the "jmp .L2" loop.

Exactly. But if we have, say,

	int A, B;

	int sum(void)
	{
		return A + B;
	}

then I doubt there is any architecture (at least any supported by Linux)
which can generate different code if you do s/int/unsigned/.

Anyway, I agree: unsigned makes more sense, and I changed this patch
accordingly.

Oleg.
* Re: [PATCH 1/2] brw_mutex: big read-write mutex
@ 2012-10-16 19:56 Linus Torvalds
2012-10-17 16:59 ` Oleg Nesterov
0 siblings, 1 reply; 8+ messages in thread
From: Linus Torvalds @ 2012-10-16 19:56 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Ingo Molnar, Paul E. McKenney, Peter Zijlstra, Srikar Dronamraju,
Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel
On Mon, Oct 15, 2012 at 12:10 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> This patch adds the new sleeping lock, brw_mutex. Unlike rw_semaphore
> it allows multiple writers too, just "read" and "write" are mutually
> exclusive.
So those semantics just don't sound sane. It's also not what any kind
of normal "rw" lock ever does.
So can you explain why these particular insane semantics are useful,
and what for?
Linus
* Re: [PATCH 1/2] brw_mutex: big read-write mutex
2012-10-16 19:56 [PATCH 1/2] brw_mutex: big read-write mutex Linus Torvalds
@ 2012-10-17 16:59 ` Oleg Nesterov
2012-10-17 22:44 ` Paul E. McKenney
0 siblings, 1 reply; 8+ messages in thread
From: Oleg Nesterov @ 2012-10-17 16:59 UTC (permalink / raw)
To: Linus Torvalds
Cc: Ingo Molnar, Paul E. McKenney, Peter Zijlstra, Srikar Dronamraju,
	Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel

On 10/16, Linus Torvalds wrote:
>
> On Mon, Oct 15, 2012 at 12:10 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> > This patch adds the new sleeping lock, brw_mutex. Unlike rw_semaphore
> > it allows multiple writers too, just "read" and "write" are mutually
> > exclusive.
>
> So those semantics just don't sound sane. It's also not what any kind
> of normal "rw" lock ever does.

Yes, this is not usual. And initially I made brw_sem, which allows only
one writer, but then I changed this patch.

> So can you explain why these particular insane semantics are useful,
> and what for?

To allow multiple uprobe_register/unregister calls at the same time,
mostly to avoid adding a "regression": currently this is possible.

It is not that I think this is terribly important, but still. And
personally I think that "multiple writers" is not necessarily insane
in general. Suppose you have a complex object/subsystem; the readers
can use a single brw_mutex to access it "lockless", and start_read()
is very cheap.

But start_write() is slow. Multiple writers can use fine-grained locking
inside the start_write/end_write section and do not block each other.

Oleg.
* Re: [PATCH 1/2] brw_mutex: big read-write mutex
2012-10-17 16:59 ` Oleg Nesterov
@ 2012-10-17 22:44 ` Paul E. McKenney
2012-10-18 16:24 ` Oleg Nesterov
0 siblings, 1 reply; 8+ messages in thread
From: Paul E. McKenney @ 2012-10-17 22:44 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Linus Torvalds, Ingo Molnar, Peter Zijlstra, Srikar Dronamraju,
	Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel

On Wed, Oct 17, 2012 at 06:59:02PM +0200, Oleg Nesterov wrote:
> On 10/16, Linus Torvalds wrote:
> >
> > On Mon, Oct 15, 2012 at 12:10 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> > > This patch adds the new sleeping lock, brw_mutex. Unlike rw_semaphore
> > > it allows multiple writers too, just "read" and "write" are mutually
> > > exclusive.
> >
> > So those semantics just don't sound sane. It's also not what any kind
> > of normal "rw" lock ever does.
>
> Yes, this is not usual. And initially I made brw_sem, which allows only
> one writer, but then I changed this patch.
>
> > So can you explain why these particular insane semantics are useful,
> > and what for?
>
> To allow multiple uprobe_register/unregister calls at the same time,
> mostly to avoid adding a "regression": currently this is possible.
>
> It is not that I think this is terribly important, but still. And
> personally I think that "multiple writers" is not necessarily insane
> in general. Suppose you have a complex object/subsystem; the readers
> can use a single brw_mutex to access it "lockless", and start_read()
> is very cheap.
>
> But start_write() is slow. Multiple writers can use fine-grained locking
> inside the start_write/end_write section and do not block each other.

Strangely enough, the old VAXCluster locking primitives allowed this
sort of thing. The brw_start_read() would be a "protected read", and
brw_start_write() would be a "concurrent write". Even more interesting,
they gave the same advice you give -- concurrent writers should use
fine-grained locking to protect the actual accesses.

It seems like it should be possible to come up with better names, but
I cannot think of any at the moment.

							Thanx, Paul

PS. For the sufficiently masochistic, here is the exclusion table for
the six VAXCluster locking modes:

	     NL   CR   CW   PR   PW   EX
	NL
	CR                             X
	CW                   X    X    X
	PR              X         X    X
	PW              X    X    X    X
	EX         X    X    X    X    X

"X" means that the pair of modes exclude each other; otherwise the lock
may be held in both of the modes simultaneously.

Modes:

	NL: Null, or "not held".
	CR: Concurrent read.
	CW: Concurrent write.
	PR: Protected read.
	PW: Protected write.
	EX: Exclusive.

A reader-writer lock could use protected read for readers and either of
protected write or exclusive for writers, the difference between
protected write and exclusive being irrelevant in the absence of
concurrent readers.
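(An illustrative aside, not VMS code — the names below are made up: the
exclusion table above can be encoded as a small C compatibility check.)

	enum lock_mode { NL, CR, CW, PR, PW, EX };

	/* excludes[a] has bit b set iff mode a excludes mode b (per the table) */
	static const unsigned char excludes[6] = {
		[NL] = 0,
		[CR] = 1 << EX,
		[CW] = (1 << PR) | (1 << PW) | (1 << EX),
		[PR] = (1 << CW) | (1 << PW) | (1 << EX),
		[PW] = (1 << CW) | (1 << PR) | (1 << PW) | (1 << EX),
		[EX] = (1 << CR) | (1 << CW) | (1 << PR) | (1 << PW) | (1 << EX),
	};

	static inline int modes_compatible(enum lock_mode a, enum lock_mode b)
	{
		return !(excludes[a] & (1 << b));
	}

For example, modes_compatible(PR, PW) is 0 (a protected write excludes
protected readers), while modes_compatible(CR, PW) is 1.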
* Re: [PATCH 1/2] brw_mutex: big read-write mutex
2012-10-17 22:44 ` Paul E. McKenney
@ 2012-10-18 16:24 ` Oleg Nesterov
2012-10-18 16:38 ` Paul E. McKenney
0 siblings, 1 reply; 8+ messages in thread
From: Oleg Nesterov @ 2012-10-18 16:24 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Linus Torvalds, Ingo Molnar, Peter Zijlstra, Srikar Dronamraju,
	Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel

On 10/17, Paul E. McKenney wrote:
>
> On Wed, Oct 17, 2012 at 06:37:02PM +0200, Oleg Nesterov wrote:
> > On 10/16, Paul E. McKenney wrote:
> > >
> > > Suppose that the writer arrives and sees that the value of the counter
> > > is zero,
> >
> > after synchronize_sched(). So there are no readers (but perhaps there
> > are brw_end_read's in flight which already decremented read_ctr)
>
> But the preempt_disable() region only covers read acquisition. So
> synchronize_sched() waits only for all the brw_start_read() calls to
> reach the preempt_enable()

Yes.

> -- it cannot wait for all the resulting
> readers to reach the corresponding brw_end_read().

Indeed.

> > > and thus never sleeps, and so is also not awakened?
> >
> > and why do we need wakeup in this case?
>
> To get the memory barriers required to keep the critical sections
> ordered -- to ensure that everyone sees the reader's critical section
> as ending before the writer's critical section starts.

And now I am starting to think I misunderstood your concern from
the very beginning.

I thought that you meant that without mb() brw_start_write() can
race with brw_end_read() and hang forever.

But probably you meant that we need the barriers to ensure that,
say, if the reader does

	brw_start_read();
	CONDITION = 1;
	brw_end_read();

then the writer must see CONDITION != 0 after brw_start_write()?
(or vice versa)

In this case we need the barrier, yes. Obviously brw_start_write()
can return right after this_cpu_dec() and before wake_up_all().

2/2 doesn't need this guarantee, but I agree, this doesn't look
sane in general...

Or I misunderstood you again?

Oleg.
* Re: [PATCH 1/2] brw_mutex: big read-write mutex
2012-10-18 16:24 ` Oleg Nesterov
@ 2012-10-18 16:38 ` Paul E. McKenney
2012-10-18 17:57 ` Oleg Nesterov
0 siblings, 1 reply; 8+ messages in thread
From: Paul E. McKenney @ 2012-10-18 16:38 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Linus Torvalds, Ingo Molnar, Peter Zijlstra, Srikar Dronamraju,
	Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel

On Thu, Oct 18, 2012 at 06:24:09PM +0200, Oleg Nesterov wrote:
> On 10/17, Paul E. McKenney wrote:
> >
> > On Wed, Oct 17, 2012 at 06:37:02PM +0200, Oleg Nesterov wrote:
> > > On 10/16, Paul E. McKenney wrote:
> > > >
> > > > Suppose that the writer arrives and sees that the value of the counter
> > > > is zero,
> > >
> > > after synchronize_sched(). So there are no readers (but perhaps there
> > > are brw_end_read's in flight which already decremented read_ctr)
> >
> > But the preempt_disable() region only covers read acquisition. So
> > synchronize_sched() waits only for all the brw_start_read() calls to
> > reach the preempt_enable()
>
> Yes.
>
> > -- it cannot wait for all the resulting
> > readers to reach the corresponding brw_end_read().
>
> Indeed.
>
> > > > and thus never sleeps, and so is also not awakened?
> > >
> > > and why do we need wakeup in this case?
> >
> > To get the memory barriers required to keep the critical sections
> > ordered -- to ensure that everyone sees the reader's critical section
> > as ending before the writer's critical section starts.
>
> And now I am starting to think I misunderstood your concern from
> the very beginning.
>
> I thought that you meant that without mb() brw_start_write() can
> race with brw_end_read() and hang forever.
>
> But probably you meant that we need the barriers to ensure that,
> say, if the reader does
>
> 	brw_start_read();
> 	CONDITION = 1;
> 	brw_end_read();
>
> then the writer must see CONDITION != 0 after brw_start_write()?
> (or vice versa)

Yes, this is exactly my concern.

> In this case we need the barrier, yes. Obviously brw_start_write()
> can return right after this_cpu_dec() and before wake_up_all().
>
> 2/2 doesn't need this guarantee, but I agree, this doesn't look
> sane in general...

Or name it something not containing "lock". And clearly document the
behavior and how it is to be used. ;-) Otherwise, someone will get
confused and introduce bugs.

> Or I misunderstood you again?

No, this was indeed my concern.

							Thanx, Paul
* Re: [PATCH 1/2] brw_mutex: big read-write mutex
2012-10-18 16:38 ` Paul E. McKenney
@ 2012-10-18 17:57 ` Oleg Nesterov
2012-10-19 19:28 ` Paul E. McKenney
0 siblings, 1 reply; 8+ messages in thread
From: Oleg Nesterov @ 2012-10-18 17:57 UTC (permalink / raw)
To: Paul E. McKenney, Mikulas Patocka
Cc: Linus Torvalds, Ingo Molnar, Peter Zijlstra, Srikar Dronamraju,
	Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel

On 10/18, Paul E. McKenney wrote:
>
> On Thu, Oct 18, 2012 at 06:24:09PM +0200, Oleg Nesterov wrote:
> >
> > I thought that you meant that without mb() brw_start_write() can
> > race with brw_end_read() and hang forever.
> >
> > But probably you meant that we need the barriers to ensure that,
> > say, if the reader does
> >
> > 	brw_start_read();
> > 	CONDITION = 1;
> > 	brw_end_read();
> >
> > then the writer must see CONDITION != 0 after brw_start_write()?
> > (or vice versa)
>
> Yes, this is exactly my concern.

Oh, thanks a lot Paul (as always).

> > In this case we need the barrier, yes. Obviously brw_start_write()
> > can return right after this_cpu_dec() and before wake_up_all().
> >
> > 2/2 doesn't need this guarantee, but I agree, this doesn't look
> > sane in general...
>
> Or name it something not containing "lock". And clearly document the
> behavior and how it is to be used. ;-)

this would be insane, I guess ;)

So. Ignoring the possible optimization you mentioned before,
brw_end_read() should do:

	smp_mb();
	this_cpu_dec();

	wake_up_all();

And yes, we need the full mb(). wmb() is enough to ensure that the
writer will see the memory modifications done by the reader. But we
also need to ensure that any LOAD inside start_read/end_read can not
be moved outside of the critical section.

But we should also ensure that "read" will see all modifications
which were done under start_write/end_write. This means that
brw_end_write() needs another synchronize_sched() before
atomic_dec_and_test(), or brw_start_read() needs mb() in the
fast-path.

Correct?

Ooooh. And I just noticed include/linux/percpu-rwsem.h which does
something similar. Certainly it was not in my tree when I started
this patch... percpu_down_write() doesn't allow multiple writers,
but the main problem is that it uses msleep(1). It should not, I think.

But. It seems that percpu_up_write() is equally wrong? Doesn't
it need synchronize_rcu() before "p->locked = false"?

(add Mikulas)

Oleg.
* Re: [PATCH 1/2] brw_mutex: big read-write mutex
2012-10-18 17:57 ` Oleg Nesterov
@ 2012-10-19 19:28 ` Paul E. McKenney
2012-10-22 23:36 ` [PATCH 0/2] fix and improvements for percpu-rw-semaphores (was: brw_mutex: big read-write mutex) Mikulas Patocka
0 siblings, 1 reply; 8+ messages in thread
From: Paul E. McKenney @ 2012-10-19 19:28 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Mikulas Patocka, Linus Torvalds, Ingo Molnar, Peter Zijlstra,
	Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov,
	linux-kernel

On Thu, Oct 18, 2012 at 07:57:47PM +0200, Oleg Nesterov wrote:
> On 10/18, Paul E. McKenney wrote:
> >
> > On Thu, Oct 18, 2012 at 06:24:09PM +0200, Oleg Nesterov wrote:
> > >
> > > I thought that you meant that without mb() brw_start_write() can
> > > race with brw_end_read() and hang forever.
> > >
> > > But probably you meant that we need the barriers to ensure that,
> > > say, if the reader does
> > >
> > > 	brw_start_read();
> > > 	CONDITION = 1;
> > > 	brw_end_read();
> > >
> > > then the writer must see CONDITION != 0 after brw_start_write()?
> > > (or vice versa)
> >
> > Yes, this is exactly my concern.
>
> Oh, thanks a lot Paul (as always).

Glad it helped. ;-)

> > > In this case we need the barrier, yes. Obviously brw_start_write()
> > > can return right after this_cpu_dec() and before wake_up_all().
> > >
> > > 2/2 doesn't need this guarantee, but I agree, this doesn't look
> > > sane in general...
> >
> > Or name it something not containing "lock". And clearly document the
> > behavior and how it is to be used. ;-)
>
> this would be insane, I guess ;)

Well, I suppose you could call it a "phase": brw_start_phase_1() and
so on.

> So. Ignoring the possible optimization you mentioned before,
> brw_end_read() should do:
>
> 	smp_mb();
> 	this_cpu_dec();
>
> 	wake_up_all();
>
> And yes, we need the full mb(). wmb() is enough to ensure that the
> writer will see the memory modifications done by the reader. But we
> also need to ensure that any LOAD inside start_read/end_read can not
> be moved outside of the critical section.
>
> But we should also ensure that "read" will see all modifications
> which were done under start_write/end_write. This means that
> brw_end_write() needs another synchronize_sched() before
> atomic_dec_and_test(), or brw_start_read() needs mb() in the
> fast-path.
>
> Correct?

Good point, I missed the need for synchronize_sched() to avoid readers
sleeping through the next write cycle due to racing with an exiting
writer. But yes, this sounds correct.

> Ooooh. And I just noticed include/linux/percpu-rwsem.h which does
> something similar. Certainly it was not in my tree when I started
> this patch... percpu_down_write() doesn't allow multiple writers,
> but the main problem is that it uses msleep(1). It should not, I think.
>
> But. It seems that percpu_up_write() is equally wrong? Doesn't
> it need synchronize_rcu() before "p->locked = false"?
>
> (add Mikulas)

Mikulas said something about doing an updated patch, so I figured I
would look at his next version.

							Thanx, Paul
* [PATCH 0/2] fix and improvements for percpu-rw-semaphores (was: brw_mutex: big read-write mutex)
2012-10-19 19:28 ` Paul E. McKenney
@ 2012-10-22 23:36 ` Mikulas Patocka
2012-10-30 18:48 ` Oleg Nesterov
0 siblings, 1 reply; 8+ messages in thread
From: Mikulas Patocka @ 2012-10-22 23:36 UTC (permalink / raw)
To: Linus Torvalds
Cc: Oleg Nesterov, Ingo Molnar, Peter Zijlstra, Srikar Dronamraju,
	Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel,
	Paul E. McKenney

> > Ooooh. And I just noticed include/linux/percpu-rwsem.h which does
> > something similar. Certainly it was not in my tree when I started
> > this patch... percpu_down_write() doesn't allow multiple writers,
> > but the main problem is that it uses msleep(1). It should not, I think.
> >
> > But. It seems that percpu_up_write() is equally wrong? Doesn't
> > it need synchronize_rcu() before "p->locked = false"?
> >
> > (add Mikulas)
>
> Mikulas said something about doing an updated patch, so I figured I
> would look at his next version.
>
> 							Thanx, Paul

The best ideas proposed in this thread are:

Using heavy/light barriers, by Lai Jiangshan. This fixes the missing
barrier bug, removes the ugly test "#if defined(X86) ..." and makes the
read path use no barrier instruction on all architectures.

Instead of rcu_read_lock, we can use rcu_read_lock_sched (or
preempt_disable) - the resulting code is smaller. The critical section
is so small that there is no problem disabling preemption.

I am sending these two patches. Linus, please apply them if there are
no objections.

Mikulas
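(To illustrate the rcu_read_lock_sched point — a sketch of the resulting
read fast path, reconstructed from the percpu-rwsem code under discussion
rather than taken from Mikulas's actual patches:)

	static inline void percpu_down_read(struct percpu_rw_semaphore *p)
	{
		/*
		 * rcu_read_lock_sched() is essentially preempt_disable()
		 * plus lockdep bookkeeping; the critical section is only
		 * a few instructions, so disabling preemption is fine.
		 */
		rcu_read_lock_sched();
		if (likely(!p->locked)) {
			this_cpu_inc(*p->counters);
			rcu_read_unlock_sched();
			return;
		}
		rcu_read_unlock_sched();
		/* slow path: a writer is active, fall back to the mutex */
		mutex_lock(&p->mtx);
		this_cpu_inc(*p->counters);
		mutex_unlock(&p->mtx);
	}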
* Re: [PATCH 0/2] fix and improvements for percpu-rw-semaphores (was: brw_mutex: big read-write mutex)
2012-10-22 23:36 ` [PATCH 0/2] fix and improvements for percpu-rw-semaphores (was: brw_mutex: big read-write mutex) Mikulas Patocka
@ 2012-10-30 18:48 ` Oleg Nesterov
2012-10-31 19:41 ` [PATCH 0/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily Oleg Nesterov
0 siblings, 1 reply; 8+ messages in thread
From: Oleg Nesterov @ 2012-10-30 18:48 UTC (permalink / raw)
To: Mikulas Patocka, Paul E. McKenney, Peter Zijlstra
Cc: Linus Torvalds, Ingo Molnar, Srikar Dronamraju,
	Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel

On 10/22, Mikulas Patocka wrote:
>
> > > Ooooh. And I just noticed include/linux/percpu-rwsem.h which does
> > > something similar. Certainly it was not in my tree when I started
> > > this patch... percpu_down_write() doesn't allow multiple writers,
> > > but the main problem is that it uses msleep(1). It should not, I think.

But, since we already have percpu_rw_semaphore, I do not think I can
add another similar thing. However, percpu_rw_semaphore is sub-optimal,
and I am not sure uprobes can use it to block dup_mmap(). Perhaps we
can improve it?

> > > But. It seems that percpu_up_write() is equally wrong? Doesn't
> > > it need synchronize_rcu() before "p->locked = false"?
> > >
> > > (add Mikulas)
> >
> > Mikulas said something about doing an updated patch, so I figured I
> > would look at his next version.
> >
> > 							Thanx, Paul
>
> The best ideas proposed in this thread are:
>
> Using heavy/light barriers, by Lai Jiangshan.

So. down_write/up_write does msleep() and it needs to call
synchronize_sched() 3 times. This looks too much. It is not that I am
worried about the writers; the problem is that the new readers are
blocked completely while the writer sleeps in msleep/synchronize_sched.

Paul, Mikulas, et al. Could you please look at the new implementation
below? Completely untested/uncompiled, just for discussion.

Compared to the current implementation, down_read() is still possible
while the writer sleeps in synchronize_sched(), but the reader uses
rw_semaphore/atomic_inc when it detects the waiting writer.

Can this work? Do you think this is better than what we have now?

Note: probably we can optimize percpu_down/up_write more, we can
"factor out" synchronize_sched(); multiple writers can do this in
parallel before they take ->writer_mutex to exclude each other. But
this won't affect the readers, and this can be done later.

Oleg.
------------------------------------------------------------------------------

struct percpu_rw_semaphore {
	long __percpu		*fast_read_ctr;
	struct mutex		writer_mutex;
	struct rw_semaphore	rw_sem;
	atomic_t		slow_read_ctr;
	wait_queue_head_t	write_waitq;
};

static bool update_fast_ctr(struct percpu_rw_semaphore *brw, long val)
{
	bool success = false;

	preempt_disable();
	if (likely(!mutex_is_locked(&brw->writer_mutex))) {
		__this_cpu_add(*brw->fast_read_ctr, val);
		success = true;
	}
	preempt_enable();

	return success;
}

static long clear_fast_read_ctr(struct percpu_rw_semaphore *brw)
{
	long sum = 0;
	int cpu;

	for_each_possible_cpu(cpu) {
		sum += per_cpu(*brw->fast_read_ctr, cpu);
		per_cpu(*brw->fast_read_ctr, cpu) = 0;
	}

	return sum;
}

void percpu_down_read(struct percpu_rw_semaphore *brw)
{
	if (likely(update_fast_ctr(brw, +1)))
		return;

	down_read(&brw->rw_sem);
	atomic_inc(&brw->slow_read_ctr);
	up_read(&brw->rw_sem);
}

void percpu_up_read(struct percpu_rw_semaphore *brw)
{
	if (likely(update_fast_ctr(brw, -1)))
		return;

	if (atomic_dec_and_test(&brw->slow_read_ctr))
		wake_up_all(&brw->write_waitq);
}

void percpu_down_write(struct percpu_rw_semaphore *brw)
{
	mutex_lock(&brw->writer_mutex);

	/* ensure mutex_is_locked() is visible to the readers */
	synchronize_sched();

	/* block the new readers */
	down_write(&brw->rw_sem);

	atomic_add(clear_fast_read_ctr(brw), &brw->slow_read_ctr);

	wait_event(brw->write_waitq, !atomic_read(&brw->slow_read_ctr));
}

void percpu_up_write(struct percpu_rw_semaphore *brw)
{
	up_write(&brw->rw_sem);

	/* insert the barrier before the next fast-path in down_read */
	synchronize_sched();

	mutex_unlock(&brw->writer_mutex);
}
* [PATCH 0/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
2012-10-30 18:48 ` Oleg Nesterov
@ 2012-10-31 19:41 ` Oleg Nesterov
2012-10-31 19:41 ` [PATCH 1/1] " Oleg Nesterov
0 siblings, 1 reply; 8+ messages in thread
From: Oleg Nesterov @ 2012-10-31 19:41 UTC (permalink / raw)
To: Mikulas Patocka, Paul E. McKenney, Peter Zijlstra, Linus Torvalds
Cc: Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli,
	Anton Arapov, linux-kernel

On 10/30, Oleg Nesterov wrote:
>
> So. down_write/up_write does msleep() and it needs to call
> synchronize_sched() 3 times.
>
> This looks too much. It is not that I am worried about the writers;
> the problem is that the new readers are blocked completely while the
> writer sleeps in msleep/synchronize_sched.
>
> Paul, Mikulas, et al. Could you please look at the new implementation
> below? Completely untested/uncompiled, just for discussion.

I tried to test it, and it seems to work... But I guess the only valid
test is: pass the review from Paul/Peter.

Todo:

	- add the lockdep annotations

	- we can speed up the down_write-right-after-up_write case

What do you all think?

Oleg.
* [PATCH 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
2012-10-31 19:41 ` [PATCH 0/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily Oleg Nesterov
@ 2012-10-31 19:41 ` Oleg Nesterov
2012-11-01 15:10 ` Linus Torvalds
2012-11-01 15:43 ` Paul E. McKenney
0 siblings, 2 replies; 8+ messages in thread
From: Oleg Nesterov @ 2012-10-31 19:41 UTC (permalink / raw)
To: Mikulas Patocka, Paul E. McKenney, Peter Zijlstra, Linus Torvalds
Cc: Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli,
	Anton Arapov, linux-kernel

Currently the writer does msleep() plus synchronize_sched() 3 times
to acquire/release the semaphore, and during this time the readers
are blocked completely. Even if the "write" section was not actually
started or if it was already finished.

With this patch down_read/up_read does synchronize_sched() twice and
down_read/up_read are still possible during this time, just they use
the slow path.

percpu_down_write() first forces the readers to use rw_semaphore and
increment the "slow" counter to take the lock for reading, then it
takes that rw_semaphore for writing and blocks the readers.

Also. With this patch the code relies on the documented behaviour of
synchronize_sched(), it doesn't try to pair synchronize_sched() with
barrier.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 include/linux/percpu-rwsem.h |   83 +++++---------------------------
 lib/Makefile                 |    2 +-
 lib/percpu-rwsem.c           |  106 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 120 insertions(+), 71 deletions(-)
 create mode 100644 lib/percpu-rwsem.c

diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
index 250a4ac..7f738ca 100644
--- a/include/linux/percpu-rwsem.h
+++ b/include/linux/percpu-rwsem.h
@@ -2,82 +2,25 @@
 #define _LINUX_PERCPU_RWSEM_H
 
 #include <linux/mutex.h>
+#include <linux/rwsem.h>
 #include <linux/percpu.h>
-#include <linux/rcupdate.h>
-#include <linux/delay.h>
+#include <linux/wait.h>
 
 struct percpu_rw_semaphore {
-	unsigned __percpu *counters;
-	bool locked;
-	struct mutex mtx;
+	int __percpu		*fast_read_ctr;
+	struct mutex		writer_mutex;
+	struct rw_semaphore	rw_sem;
+	atomic_t		slow_read_ctr;
+	wait_queue_head_t	write_waitq;
 };
 
-#define light_mb()	barrier()
-#define heavy_mb()	synchronize_sched()
+extern void percpu_down_read(struct percpu_rw_semaphore *);
+extern void percpu_up_read(struct percpu_rw_semaphore *);
 
-static inline void percpu_down_read(struct percpu_rw_semaphore *p)
-{
-	rcu_read_lock_sched();
-	if (unlikely(p->locked)) {
-		rcu_read_unlock_sched();
-		mutex_lock(&p->mtx);
-		this_cpu_inc(*p->counters);
-		mutex_unlock(&p->mtx);
-		return;
-	}
-	this_cpu_inc(*p->counters);
-	rcu_read_unlock_sched();
-	light_mb(); /* A, between read of p->locked and read of data, paired with D */
-}
+extern void percpu_down_write(struct percpu_rw_semaphore *);
+extern void percpu_up_write(struct percpu_rw_semaphore *);
 
-static inline void percpu_up_read(struct percpu_rw_semaphore *p)
-{
-	light_mb(); /* B, between read of the data and write to p->counter, paired with C */
-	this_cpu_dec(*p->counters);
-}
-
-static inline unsigned __percpu_count(unsigned __percpu *counters)
-{
-	unsigned total = 0;
-	int cpu;
-
-	for_each_possible_cpu(cpu)
-		total += ACCESS_ONCE(*per_cpu_ptr(counters, cpu));
-
-	return total;
-}
-
-static inline void percpu_down_write(struct percpu_rw_semaphore *p)
-{
-	mutex_lock(&p->mtx);
-	p->locked = true;
-	synchronize_sched(); /* make sure that all readers exit the rcu_read_lock_sched region */
-	while (__percpu_count(p->counters))
-		msleep(1);
-	heavy_mb(); /* C, between read of p->counter and write to data, paired with B */
-}
-
-static inline void percpu_up_write(struct percpu_rw_semaphore *p)
-{
-	heavy_mb(); /* D, between write to data and write to p->locked, paired with A */
-	p->locked = false;
-	mutex_unlock(&p->mtx);
-}
-
-static inline int percpu_init_rwsem(struct percpu_rw_semaphore *p)
-{
-	p->counters = alloc_percpu(unsigned);
-	if (unlikely(!p->counters))
-		return -ENOMEM;
-	p->locked = false;
-	mutex_init(&p->mtx);
-	return 0;
-}
-
-static inline void percpu_free_rwsem(struct percpu_rw_semaphore *p)
-{
-	free_percpu(p->counters);
-	p->counters = NULL; /* catch use after free bugs */
-}
+extern int percpu_init_rwsem(struct percpu_rw_semaphore *);
+extern void percpu_free_rwsem(struct percpu_rw_semaphore *);
 
 #endif
diff --git a/lib/Makefile b/lib/Makefile
index 821a162..4dad4a7 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -12,7 +12,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
	 idr.o int_sqrt.o extable.o \
	 sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \
	 proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \
-	 is_single_threaded.o plist.o decompress.o
+	 is_single_threaded.o plist.o decompress.o percpu-rwsem.o
 
 lib-$(CONFIG_MMU) += ioremap.o
 lib-$(CONFIG_SMP) += cpumask.o
diff --git a/lib/percpu-rwsem.c b/lib/percpu-rwsem.c
new file mode 100644
index 0000000..40a415d
--- /dev/null
+++ b/lib/percpu-rwsem.c
@@ -0,0 +1,106 @@
+#include <linux/percpu-rwsem.h>
+#include <linux/rcupdate.h>
+#include <linux/sched.h>
+
+int percpu_init_rwsem(struct percpu_rw_semaphore *brw)
+{
+	brw->fast_read_ctr = alloc_percpu(int);
+	if (unlikely(!brw->fast_read_ctr))
+		return -ENOMEM;
+
+	mutex_init(&brw->writer_mutex);
+	init_rwsem(&brw->rw_sem);
+	atomic_set(&brw->slow_read_ctr, 0);
+	init_waitqueue_head(&brw->write_waitq);
+	return 0;
+}
+
+void percpu_free_rwsem(struct percpu_rw_semaphore *brw)
+{
+	free_percpu(brw->fast_read_ctr);
+	brw->fast_read_ctr = NULL; /* catch use after free bugs */
+}
+
+static bool update_fast_ctr(struct percpu_rw_semaphore *brw, int val)
+{
+	bool success = false;
+
+	preempt_disable();
+	if (likely(!mutex_is_locked(&brw->writer_mutex))) {
+		__this_cpu_add(*brw->fast_read_ctr, val);
+		success = true;
+	}
+	preempt_enable();
+
+	return success;
+}
+
+void percpu_down_read(struct percpu_rw_semaphore *brw)
+{
+	if (likely(update_fast_ctr(brw, +1)))
+		return;
+
+	down_read(&brw->rw_sem);
+	atomic_inc(&brw->slow_read_ctr);
+	up_read(&brw->rw_sem);
+}
+
+void percpu_up_read(struct percpu_rw_semaphore *brw)
+{
+	if (likely(update_fast_ctr(brw, -1)))
+		return;
+
+	/* false-positive is possible but harmless */
+	if (atomic_dec_and_test(&brw->slow_read_ctr))
+		wake_up_all(&brw->write_waitq);
+}
+
+static int clear_fast_read_ctr(struct percpu_rw_semaphore *brw)
+{
+	int cpu, sum = 0;
+
+	for_each_possible_cpu(cpu) {
+		sum += per_cpu(*brw->fast_read_ctr, cpu);
+		per_cpu(*brw->fast_read_ctr, cpu) = 0;
+	}
+
+	return sum;
+}
+
+void percpu_down_write(struct percpu_rw_semaphore *brw)
+{
+	/* also blocks update_fast_ctr() which checks mutex_is_locked() */
+	mutex_lock(&brw->writer_mutex);
+
+	/*
+	 * 1. Ensures mutex_is_locked() is visible to any down_read/up_read
+	 *    so that update_fast_ctr() can't succeed.
+	 *
+	 * 2. Ensures we see the result of every previous this_cpu_add() in
+	 *    update_fast_ctr().
+	 *
+	 * 3. Ensures that if any reader has exited its critical section via
+	 *    fast-path, it executes a full memory barrier before we return.
+	 */
+	synchronize_sched();
+
+	/* nobody can use fast_read_ctr, move its sum into slow_read_ctr */
+	atomic_add(clear_fast_read_ctr(brw), &brw->slow_read_ctr);
+
+	/* block the new readers completely */
+	down_write(&brw->rw_sem);
+
+	/* wait for all readers to complete their percpu_up_read() */
+	wait_event(brw->write_waitq, !atomic_read(&brw->slow_read_ctr));
+}
+
+void percpu_up_write(struct percpu_rw_semaphore *brw)
+{
+	/* allow the new readers, but only the slow-path */
+	up_write(&brw->rw_sem);
+
+	/* insert the barrier before the next fast-path in down_read */
+	synchronize_sched();
+
+	mutex_unlock(&brw->writer_mutex);
+}
-- 
1.5.5.1
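(An illustrative usage sketch, not part of the patch — my_brw and
my_data are made-up names. Readers normally pay only a per-cpu
increment; the writer pays the two synchronize_sched() calls:)

	static struct percpu_rw_semaphore my_brw; /* percpu_init_rwsem(&my_brw) at init time */
	static int my_data;

	void reader(void)
	{
		percpu_down_read(&my_brw);	/* fast path: per-cpu increment */
		/* ... read my_data ... */
		percpu_up_read(&my_brw);
	}

	void writer(void)
	{
		percpu_down_write(&my_brw);	/* blocks new readers, waits for old ones */
		my_data++;
		percpu_up_write(&my_brw);
	}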
* Re: [PATCH 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
2012-10-31 19:41 ` [PATCH 1/1] " Oleg Nesterov
@ 2012-11-01 15:10 ` Linus Torvalds
2012-11-01 15:34 ` Oleg Nesterov
0 siblings, 1 reply; 8+ messages in thread
From: Linus Torvalds @ 2012-11-01 15:10 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Mikulas Patocka, Paul E. McKenney, Peter Zijlstra, Ingo Molnar,
	Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov,
	linux-kernel

On Wed, Oct 31, 2012 at 12:41 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> Currently the writer does msleep() plus synchronize_sched() 3 times
> to acquire/release the semaphore, and during this time the readers
> are blocked completely. Even if the "write" section was not actually
> started or if it was already finished.
>
> With this patch down_read/up_read does synchronize_sched() twice and
> down_read/up_read are still possible during this time, just they use
> the slow path.

The changelog is wrong (it's the write path, not the read path, that
does the synchronize_sched).

>  struct percpu_rw_semaphore {
> -	unsigned __percpu *counters;
> -	bool locked;
> -	struct mutex mtx;
> +	int __percpu		*fast_read_ctr;

This change is wrong.

You must not make the 'fast_read_ctr' thing be an int. Or at least you
need to be a hell of a lot more careful about it.

Why?

Because the readers update the counters while possibly moving around
cpu's, the increment and decrement of the counters may be on different
CPU's. But that means that when you add all the counters together,
things can overflow (only the final sum is meaningful). And THAT in
turn means that you should not use a signed count, for the simple
reason that signed integers don't have well-behaved overflow behavior
in C.

Now, I doubt you'll find an architecture or C compiler where this will
actually ever make a difference, but the fact remains that you
shouldn't use signed integers for counters like this. You should use
unsigned, and you should rely on the well-defined modulo-2**n
semantics.

I'd also like to see a comment somewhere in the source code about the
whole algorithm and the rules.

Other than that, I guess it looks ok.

	Linus
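(A concrete illustration of the overflow point, with made-up values: a
reader can increment the counter on one CPU and decrement it on
another, so an individual counter may wrap and only the modulo-2**n sum
of all counters is meaningful:)

	void illustrate(void)
	{
		unsigned int ctr0 = 0, ctr1 = 0; /* two "per-cpu" counters */

		ctr0 += 1;	/* percpu_down_read() on CPU 0: ctr0 == 1 */
		ctr1 -= 1;	/* reader migrated; percpu_up_read() on CPU 1:
				 * ctr1 == 0xffffffffu (defined wraparound) */

		/* only the sum matters: 1 + 0xffffffffu == 0 (mod 2**32) */
	}

With signed counters, adding them up can overflow, and signed overflow
is undefined in C — hence the insistence on unsigned.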
* Re: [PATCH 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
2012-11-01 15:10 ` Linus Torvalds
@ 2012-11-01 15:34 ` Oleg Nesterov
0 siblings, 0 replies; 8+ messages in thread
From: Oleg Nesterov @ 2012-11-01 15:34 UTC (permalink / raw)
To: Linus Torvalds
Cc: Mikulas Patocka, Paul E. McKenney, Peter Zijlstra, Ingo Molnar,
	Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov,
	linux-kernel

Thanks! I'll send v2 tomorrow.

On 11/01, Linus Torvalds wrote:
> On Wed, Oct 31, 2012 at 12:41 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> > Currently the writer does msleep() plus synchronize_sched() 3 times
> > to acquire/release the semaphore, and during this time the readers
> > are blocked completely. Even if the "write" section was not actually
> > started or if it was already finished.
> >
> > With this patch down_read/up_read does synchronize_sched() twice and
> > down_read/up_read are still possible during this time, just they use
> > the slow path.
>
> The changelog is wrong (it's the write path, not the read path, that
> does the synchronize_sched).
>
> >  struct percpu_rw_semaphore {
> > -	unsigned __percpu *counters;
> > -	bool locked;
> > -	struct mutex mtx;
> > +	int __percpu		*fast_read_ctr;
>
> This change is wrong.
>
> You must not make the 'fast_read_ctr' thing be an int. Or at least you
> need to be a hell of a lot more careful about it.
>
> Why?
>
> Because the readers update the counters while possibly moving around
> cpu's, the increment and decrement of the counters may be on different
> CPU's. But that means that when you add all the counters together,
> things can overflow (only the final sum is meaningful). And THAT in
> turn means that you should not use a signed count, for the simple
> reason that signed integers don't have well-behaved overflow behavior
> in C.
>
> Now, I doubt you'll find an architecture or C compiler where this will
> actually ever make a difference, but the fact remains that you
> shouldn't use signed integers for counters like this. You should use
> unsigned, and you should rely on the well-defined modulo-2**n
> semantics.
>
> I'd also like to see a comment somewhere in the source code about the
> whole algorithm and the rules.
>
> Other than that, I guess it looks ok.
>
> 	Linus
* Re: [PATCH 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
2012-10-31 19:41 ` [PATCH 1/1] " Oleg Nesterov
2012-11-01 15:10 ` Linus Torvalds
@ 2012-11-01 15:43 ` Paul E. McKenney
2012-11-01 18:33 ` Oleg Nesterov
1 sibling, 1 reply; 8+ messages in thread
From: Paul E. McKenney @ 2012-11-01 15:43 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Mikulas Patocka, Peter Zijlstra, Linus Torvalds, Ingo Molnar,
	Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov,
	linux-kernel

On Wed, Oct 31, 2012 at 08:41:58PM +0100, Oleg Nesterov wrote:
> Currently the writer does msleep() plus synchronize_sched() 3 times
> to acquire/release the semaphore, and during this time the readers
> are blocked completely. Even if the "write" section was not actually
> started or if it was already finished.
>
> With this patch down_read/up_read does synchronize_sched() twice and
> down_read/up_read are still possible during this time, just they use
> the slow path.
>
> percpu_down_write() first forces the readers to use rw_semaphore and
> increment the "slow" counter to take the lock for reading, then it
> takes that rw_semaphore for writing and blocks the readers.
>
> Also. With this patch the code relies on the documented behaviour of
> synchronize_sched(), it doesn't try to pair synchronize_sched() with
> barrier.

OK, so it looks to me that this code relies on synchronize_sched()
forcing a memory barrier on each CPU executing in the kernel. I might
well be confused, so here is the sequence of events that leads me to
believe this:

1. A task running on CPU 0 currently write-holds the lock.

2. CPU 1 is running in the kernel, executing a longer-than-average
   loop of normal instructions (no atomic instructions or memory
   barriers).

3. CPU 0 invokes percpu_up_write(), calling up_write(),
   synchronize_sched(), and finally mutex_unlock().

4. CPU 1 executes percpu_down_read(), which calls update_fast_ctr(),
   which finds that ->writer_mutex is not held. CPU 1 therefore
   increments ->fast_read_ctr and returns success.

Of course, as Mikulas pointed out, the actual implementation will have
forced CPU 1 to execute a memory barrier in the course of the
synchronize_sched() implementation. However, if synchronize_sched()
had been modified to act as synchronize_srcu() currently does, there
would be no memory barrier, and thus no guarantee that CPU 1's
subsequent read-side critical section would see the effect of CPU 0's
previous write-side critical section.

Fortunately, this is easy to fix, with zero added overhead on the
read-side fastpath, as shown by the notes interspersed below.

Thoughts?

							Thanx, Paul

> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> ---
>  include/linux/percpu-rwsem.h |   83 +++++---------------------------
>  lib/Makefile                 |    2 +-
>  lib/percpu-rwsem.c           |  106 ++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 120 insertions(+), 71 deletions(-)
>  create mode 100644 lib/percpu-rwsem.c
>
> diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
> index 250a4ac..7f738ca 100644
> --- a/include/linux/percpu-rwsem.h
> +++ b/include/linux/percpu-rwsem.h
> @@ -2,82 +2,25 @@
>  #define _LINUX_PERCPU_RWSEM_H
>
>  #include <linux/mutex.h>
> +#include <linux/rwsem.h>
>  #include <linux/percpu.h>
> -#include <linux/rcupdate.h>
> -#include <linux/delay.h>
> +#include <linux/wait.h>
>
>  struct percpu_rw_semaphore {
> -	unsigned __percpu *counters;
> -	bool locked;
> -	struct mutex mtx;
> +	int __percpu		*fast_read_ctr;
> +	struct mutex		writer_mutex;
> +	struct rw_semaphore	rw_sem;
> +	atomic_t		slow_read_ctr;
> +	wait_queue_head_t	write_waitq;

	int wstate;

>  };
>
> -#define light_mb()	barrier()
> -#define heavy_mb()	synchronize_sched()
> +extern void percpu_down_read(struct percpu_rw_semaphore *);
> +extern void percpu_up_read(struct percpu_rw_semaphore *);
>
> -static inline void percpu_down_read(struct percpu_rw_semaphore *p)
> -{
> -	rcu_read_lock_sched();
> -	if (unlikely(p->locked)) {
> -		rcu_read_unlock_sched();
> -		mutex_lock(&p->mtx);
> -		this_cpu_inc(*p->counters);
> -		mutex_unlock(&p->mtx);
> -		return;
> -	}
> -	this_cpu_inc(*p->counters);
> -	rcu_read_unlock_sched();
> -	light_mb(); /* A, between read of p->locked and read of data, paired with D */
> -}
> +extern void percpu_down_write(struct percpu_rw_semaphore *);
> +extern void percpu_up_write(struct percpu_rw_semaphore *);
>
> -static inline void percpu_up_read(struct percpu_rw_semaphore *p)
> -{
> -	light_mb(); /* B, between read of the data and write to p->counter, paired with C */
> -	this_cpu_dec(*p->counters);
> -}
> -
> -static inline unsigned __percpu_count(unsigned __percpu *counters)
> -{
> -	unsigned total = 0;
> -	int cpu;
> -
> -	for_each_possible_cpu(cpu)
> -		total += ACCESS_ONCE(*per_cpu_ptr(counters, cpu));
> -
> -	return total;
> -}
> -
> -static inline void percpu_down_write(struct percpu_rw_semaphore *p)
> -{
> -	mutex_lock(&p->mtx);
> -	p->locked = true;
> -	synchronize_sched(); /* make sure that all readers exit the rcu_read_lock_sched region */
> -	while (__percpu_count(p->counters))
> -		msleep(1);
> -	heavy_mb(); /* C, between read of p->counter and write to data, paired with B */
> -}
> -
> -static inline void percpu_up_write(struct percpu_rw_semaphore *p)
> -{
> -	heavy_mb(); /* D, between write to data and write to p->locked, paired with A */
> -	p->locked = false;
> -	mutex_unlock(&p->mtx);
> -}
> -
> -static inline int percpu_init_rwsem(struct percpu_rw_semaphore *p)
> -{
> -	p->counters = alloc_percpu(unsigned);
> -	if (unlikely(!p->counters))
> -		return -ENOMEM;
> -	p->locked = false;
> -	mutex_init(&p->mtx);
> -	return 0;
> -}
> -
> -static inline void percpu_free_rwsem(struct percpu_rw_semaphore *p)
> -{
> -	free_percpu(p->counters);
> -	p->counters = NULL; /* catch use after free bugs */
> -}
> +extern int percpu_init_rwsem(struct percpu_rw_semaphore *);
> +extern void percpu_free_rwsem(struct percpu_rw_semaphore *);
>
>  #endif
> diff --git a/lib/Makefile b/lib/Makefile
> index 821a162..4dad4a7 100644
> --- a/lib/Makefile
> +++ b/lib/Makefile
> @@ -12,7 +12,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
> 	 idr.o int_sqrt.o extable.o \
> 	 sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \
> 	 proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \
> -	 is_single_threaded.o plist.o decompress.o
> +	 is_single_threaded.o plist.o decompress.o percpu-rwsem.o
>
>  lib-$(CONFIG_MMU) += ioremap.o
>  lib-$(CONFIG_SMP) += cpumask.o
> diff --git a/lib/percpu-rwsem.c b/lib/percpu-rwsem.c
> new file mode 100644
> index 0000000..40a415d
> --- /dev/null
> +++ b/lib/percpu-rwsem.c
> @@ -0,0 +1,106 @@
> +#include <linux/percpu-rwsem.h>
> +#include <linux/rcupdate.h>
> +#include <linux/sched.h>

#define WSTATE_NEED_LOCK 1
#define WSTATE_NEED_MB   2

> +int percpu_init_rwsem(struct percpu_rw_semaphore *brw)
> +{
> +	brw->fast_read_ctr = alloc_percpu(int);
> +	if (unlikely(!brw->fast_read_ctr))
> +		return -ENOMEM;
> +
> +	mutex_init(&brw->writer_mutex);
> +	init_rwsem(&brw->rw_sem);
> +	atomic_set(&brw->slow_read_ctr, 0);
> +	init_waitqueue_head(&brw->write_waitq);
> +	return 0;
> +}
> +
> +void percpu_free_rwsem(struct percpu_rw_semaphore *brw)
> +{
> +	free_percpu(brw->fast_read_ctr);
> +	brw->fast_read_ctr = NULL; /* catch use after free bugs */
> +}
> +
> +static bool update_fast_ctr(struct percpu_rw_semaphore *brw, int val)
> +{
> +	bool success = false;

	int state;

> +
> +	preempt_disable();
> +	if (likely(!mutex_is_locked(&brw->writer_mutex))) {

	state = ACCESS_ONCE(brw->wstate);
	if (likely(!state)) {

> +		__this_cpu_add(*brw->fast_read_ctr, val);
> +		success = true;

	} else if (state & WSTATE_NEED_MB) {
		__this_cpu_add(*brw->fast_read_ctr, val);
		smp_mb(); /* Order increment against critical section. */
		success = true;
	}

> +	preempt_enable();
> +
> +	return success;
> +}
> +
> +void percpu_down_read(struct percpu_rw_semaphore *brw)
> +{
> +	if (likely(update_fast_ctr(brw, +1)))
> +		return;
> +
> +	down_read(&brw->rw_sem);
> +	atomic_inc(&brw->slow_read_ctr);
> +	up_read(&brw->rw_sem);
> +}
> +
> +void percpu_up_read(struct percpu_rw_semaphore *brw)
> +{
> +	if (likely(update_fast_ctr(brw, -1)))
> +		return;
> +
> +	/* false-positive is possible but harmless */
> +	if (atomic_dec_and_test(&brw->slow_read_ctr))
> +		wake_up_all(&brw->write_waitq);
> +}
> +
> +static int clear_fast_read_ctr(struct percpu_rw_semaphore *brw)
> +{
> +	int cpu, sum = 0;
> +
> +	for_each_possible_cpu(cpu) {
> +		sum += per_cpu(*brw->fast_read_ctr, cpu);
> +		per_cpu(*brw->fast_read_ctr, cpu) = 0;
> +	}
> +
> +	return sum;
> +}
> +
> +void percpu_down_write(struct percpu_rw_semaphore *brw)
> +{
> +	/* also blocks update_fast_ctr() which checks mutex_is_locked() */
> +	mutex_lock(&brw->writer_mutex);

	ACCESS_ONCE(brw->wstate) = WSTATE_NEED_LOCK;

> +	/*
> +	 * 1. Ensures mutex_is_locked() is visible to any down_read/up_read
> +	 *    so that update_fast_ctr() can't succeed.
> +	 *
> +	 * 2. Ensures we see the result of every previous this_cpu_add() in
> +	 *    update_fast_ctr().
> +	 *
> +	 * 3. Ensures that if any reader has exited its critical section via
> +	 *    fast-path, it executes a full memory barrier before we return.
> +	 */
> +	synchronize_sched();
> +
> +	/* nobody can use fast_read_ctr, move its sum into slow_read_ctr */
> +	atomic_add(clear_fast_read_ctr(brw), &brw->slow_read_ctr);
> +
> +	/* block the new readers completely */
> +	down_write(&brw->rw_sem);
> +
> +	/* wait for all readers to complete their percpu_up_read() */
> +	wait_event(brw->write_waitq, !atomic_read(&brw->slow_read_ctr));
> +}
> +
> +void percpu_up_write(struct percpu_rw_semaphore *brw)
> +{
> +	/* allow the new readers, but only the slow-path */
> +	up_write(&brw->rw_sem);

	ACCESS_ONCE(brw->wstate) = WSTATE_NEED_MB;

> +
> +	/* insert the barrier before the next fast-path in down_read */
> +	synchronize_sched();

	ACCESS_ONCE(brw->wstate) = 0;

> +	mutex_unlock(&brw->writer_mutex);
> +}
* Re: [PATCH 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
2012-11-01 15:43 ` Paul E. McKenney
@ 2012-11-01 18:33 ` Oleg Nesterov
2012-11-02 16:18 ` Oleg Nesterov
0 siblings, 1 reply; 8+ messages in thread
From: Oleg Nesterov @ 2012-11-01 18:33 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Mikulas Patocka, Peter Zijlstra, Linus Torvalds, Ingo Molnar,
	Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov,
	linux-kernel

Paul, thanks. Sorry, I can't reply today, just one note...

On 11/01, Paul E. McKenney wrote:
>
> OK, so it looks to me that this code relies on synchronize_sched()
> forcing a memory barrier on each CPU executing in the kernel.

No, the patch tries to avoid this assumption, but probably I missed
something.

> 1. A task running on CPU 0 currently write-holds the lock.
>
> 2. CPU 1 is running in the kernel, executing a longer-than-average
>    loop of normal instructions (no atomic instructions or memory
>    barriers).
>
> 3. CPU 0 invokes percpu_up_write(), calling up_write(),
>    synchronize_sched(), and finally mutex_unlock().

And my expectation was, this should be enough because ...

> 4. CPU 1 executes percpu_down_read(), which calls update_fast_ctr(),

since update_fast_ctr() does preempt_disable/enable it should see all
modifications done by CPU 0.

IOW. Suppose that the writer (CPU 0) does

	percpu_down_write();
	STORE;
	percpu_up_write();

This means

	STORE;
	synchronize_sched();
	mutex_unlock();

Now. Do you mean that the next preempt_disable/enable can see the
result of mutex_unlock() but not STORE?

Oleg.
* Re: [PATCH 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
2012-11-01 18:33 ` Oleg Nesterov
@ 2012-11-02 16:18 ` Oleg Nesterov
0 siblings, 0 replies; 8+ messages in thread
From: Oleg Nesterov @ 2012-11-02 16:18 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Mikulas Patocka, Peter Zijlstra, Linus Torvalds, Ingo Molnar,
	Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov,
	linux-kernel

On 11/01, Oleg Nesterov wrote:
>
> On 11/01, Paul E. McKenney wrote:
> >
> > OK, so it looks to me that this code relies on synchronize_sched()
> > forcing a memory barrier on each CPU executing in the kernel.
>
> No, the patch tries to avoid this assumption, but probably I missed
> something.
>
> > 1. A task running on CPU 0 currently write-holds the lock.
> >
> > 2. CPU 1 is running in the kernel, executing a longer-than-average
> >    loop of normal instructions (no atomic instructions or memory
> >    barriers).
> >
> > 3. CPU 0 invokes percpu_up_write(), calling up_write(),
> >    synchronize_sched(), and finally mutex_unlock().
>
> And my expectation was, this should be enough because ...
>
> > 4. CPU 1 executes percpu_down_read(), which calls update_fast_ctr(),
>
> since update_fast_ctr() does preempt_disable/enable it should see all
> modifications done by CPU 0.
>
> IOW. Suppose that the writer (CPU 0) does
>
> 	percpu_down_write();
> 	STORE;
> 	percpu_up_write();
>
> This means
>
> 	STORE;
> 	synchronize_sched();
> 	mutex_unlock();
>
> Now. Do you mean that the next preempt_disable/enable can see the
> result of mutex_unlock() but not STORE?

So far I think this is not possible, so the code doesn't need the
additional wstate/barriers.

> > +static bool update_fast_ctr(struct percpu_rw_semaphore *brw, int val)
> > +{
> > +	bool success = false;
>
> 	int state;
>
> > +
> > +	preempt_disable();
> > +	if (likely(!mutex_is_locked(&brw->writer_mutex))) {
>
> 	state = ACCESS_ONCE(brw->wstate);
> 	if (likely(!state)) {
>
> > +		__this_cpu_add(*brw->fast_read_ctr, val);
> > +		success = true;
>
> 	} else if (state & WSTATE_NEED_MB) {
> 		__this_cpu_add(*brw->fast_read_ctr, val);
> 		smp_mb(); /* Order increment against critical section. */
> 		success = true;
> 	}

...

> > +void percpu_up_write(struct percpu_rw_semaphore *brw)
> > +{
> > +	/* allow the new readers, but only the slow-path */
> > +	up_write(&brw->rw_sem);
>
> 	ACCESS_ONCE(brw->wstate) = WSTATE_NEED_MB;
>
> > +
> > +	/* insert the barrier before the next fast-path in down_read */
> > +	synchronize_sched();

But update_fast_ctr() should see mutex_is_locked(); obviously
down_write() must ensure this. So update_fast_ctr() can execute the
WSTATE_NEED_MB code only if it races with

> 	ACCESS_ONCE(brw->wstate) = 0;
>
> > +	mutex_unlock(&brw->writer_mutex);

these 2 stores and sees them in reverse order.

I guess that mutex_is_locked() in update_fast_ctr() looks a bit
confusing. It means no-fast-path for the reader; we could use ->state
instead. And even ->writer_mutex should go away if we want to optimize
the write-contended case, but I think this needs another patch on top
of this initial implementation.

Oleg.