* Re: [PATCH 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
@ 2012-11-04 8:47 George Spelvin
2012-11-04 15:52 ` Oleg Nesterov
0 siblings, 1 reply; 8+ messages in thread
From: George Spelvin @ 2012-11-04 8:47 UTC (permalink / raw)
To: oleg, torvalds; +Cc: linux, linux-kernel
Grand poo-bah Linus wrote:
> Now, I doubt you'll find an architecture or C compiler where this will
> actually ever make a difference, but the fact remains that you
> shouldn't use signed integers for counters like this. You should use
> unsigned, and you should rely on the well-defined modulo-2**n
> semantics.
Actually, this is another C standard undefined case that recent versions of
GCC exploit for optimization.
When using signed integers, GCC's optimizer assumes that, if b > 0,
then a + b > a.
For example, the loop:
	for (i = 1; i; i++)
		/* Code */
will never terminate! Feed the following to gcc -O2 and see for yourself:
	extern void foo(int x);

	void bar(void)
	{
		int i;

		for (i = 0; i >= 0; i++)
			foo(i);
	}
Here's what I get:
		.file	"test.c"
		.text
		.p2align 4,,15
		.globl	bar
		.type	bar, @function
	bar:
	.LFB0:
		.cfi_startproc
		pushl	%ebx
		.cfi_def_cfa_offset 8
		.cfi_offset 3, -8
		movl	$1, %ebx
		subl	$24, %esp
		.cfi_def_cfa_offset 32
		.p2align 4,,7
		.p2align 3
	.L2:
		movl	%ebx, (%esp)
		addl	$1, %ebx
		call	foo
		jmp	.L2
		.cfi_endproc
	.LFE0:
		.size	bar, .-bar
		.ident	"GCC: (Debian 4.7.2-4) 4.7.2"
		.section	.note.GNU-stack,"",@progbits
Notice the lack of a test in the "jmp .L2" loop.
It can even handle more complicated cases like:
	void bar(int j)
	{
		int i = 0;

		do {
			foo(i);
			if (j >= 0)
				i += j;
			else
				i -= j;
		} while (i >= 0);
	}
... which gcc -O3 neatly splits into two infinite loops, as if I had written:
	void bar(int j)
	{
		int i = 0;

		if (j >= 0)
			for (; ; i += j)
				foo(i);
		else
			for (; ; i -= j)
				foo(i);
	}
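For contrast, here is the unsigned counterpart of the first loop above
(an illustrative sketch, not part of the original mail): unsigned
overflow is defined to wrap modulo 2**n, so i eventually becomes 0 and
gcc must keep the loop test:

	extern void foo(unsigned int x);

	void bar_unsigned(void)
	{
		unsigned int i;

		/* i wraps to 0 after UINT_MAX iterations; the exit test stays */
		for (i = 1; i; i++)
			foo(i);
	}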
* Re: [PATCH 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
2012-11-04 8:47 [PATCH 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily George Spelvin
@ 2012-11-04 15:52 ` Oleg Nesterov
0 siblings, 0 replies; 8+ messages in thread
From: Oleg Nesterov @ 2012-11-04 15:52 UTC (permalink / raw)
To: George Spelvin; +Cc: torvalds, linux-kernel

On 11/04, George Spelvin wrote:
>
> Grand poo-bah Linus wrote:
> > Now, I doubt you'll find an architecture or C compiler where this will
> > actually ever make a difference, but the fact remains that you
> > shouldn't use signed integers for counters like this. You should use
> > unsigned, and you should rely on the well-defined modulo-2**n
> > semantics.
>
> Actually, this is another C standard undefined case that recent versions of
> GCC exploit for optimization.
                  ^^^^^^^^^^^^
This is another thing, though.

> For example, the loop:
> 	for (i = 1; i; i++)
> 		/* Code */
> will never terminate! Feed the following to gcc -O2 and see for yourself:

Yes, because ...

> Notice the lack of a test in the "jmp .L2" loop.

Exactly. But if we have, say,

	int A, B;

	int sum(void)
	{
		return A + B;
	}

then I doubt there is any architecture (at least any supported by Linux)
which can generate different code if you do s/int/unsigned/.

Anyway, I agree: unsigned makes more sense, and I changed this patch
accordingly.

Oleg.
* Re: [PATCH 1/2] brw_mutex: big read-write mutex
@ 2012-10-16 19:56 Linus Torvalds
2012-10-17 16:59 ` Oleg Nesterov
0 siblings, 1 reply; 8+ messages in thread
From: Linus Torvalds @ 2012-10-16 19:56 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Ingo Molnar, Paul E. McKenney, Peter Zijlstra, Srikar Dronamraju,
Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel
On Mon, Oct 15, 2012 at 12:10 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> This patch adds the new sleeping lock, brw_mutex. Unlike rw_semaphore
> it allows multiple writers too, just "read" and "write" are mutually
> exclusive.
So those semantics just don't sound sane. It's also not what any kind
of normal "rw" lock ever does.
So can you explain why these particular insane semantics are useful,
and what for?
Linus
* Re: [PATCH 1/2] brw_mutex: big read-write mutex
2012-10-16 19:56 [PATCH 1/2] brw_mutex: big read-write mutex Linus Torvalds
@ 2012-10-17 16:59 ` Oleg Nesterov
2012-10-17 22:44 ` Paul E. McKenney
0 siblings, 1 reply; 8+ messages in thread
From: Oleg Nesterov @ 2012-10-17 16:59 UTC (permalink / raw)
To: Linus Torvalds
Cc: Ingo Molnar, Paul E. McKenney, Peter Zijlstra, Srikar Dronamraju,
	Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel

On 10/16, Linus Torvalds wrote:
>
> On Mon, Oct 15, 2012 at 12:10 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> > This patch adds the new sleeping lock, brw_mutex. Unlike rw_semaphore
> > it allows multiple writers too, just "read" and "write" are mutually
> > exclusive.
>
> So those semantics just don't sound sane. It's also not what any kind
> of normal "rw" lock ever does.

Yes, this is not usual. And initially I made brw_sem, which allows only
one writer, but then I changed this patch.

> So can you explain why these particular insane semantics are useful,
> and what for?

To allow multiple uprobe_register/unregister calls at the same time,
mostly to avoid adding a "regression": currently this is possible.

It is not that I think this is terribly important, but still. And
personally I think that "multiple writers" is not necessarily insane
in general. Suppose you have a complex object/subsystem; the readers
can use a single brw_mutex to access it "lockless", and start_read()
is very cheap.

But start_write() is slow. Multiple writers can use fine-grained locking
inside the start_write/end_write section and do not block each other.

Oleg.
* Re: [PATCH 1/2] brw_mutex: big read-write mutex
2012-10-17 16:59 ` Oleg Nesterov
@ 2012-10-17 22:44 ` Paul E. McKenney
2012-10-18 16:24 ` Oleg Nesterov
0 siblings, 1 reply; 8+ messages in thread
From: Paul E. McKenney @ 2012-10-17 22:44 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Linus Torvalds, Ingo Molnar, Peter Zijlstra, Srikar Dronamraju,
	Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel

On Wed, Oct 17, 2012 at 06:59:02PM +0200, Oleg Nesterov wrote:
> On 10/16, Linus Torvalds wrote:
> >
> > On Mon, Oct 15, 2012 at 12:10 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> > > This patch adds the new sleeping lock, brw_mutex. Unlike rw_semaphore
> > > it allows multiple writers too, just "read" and "write" are mutually
> > > exclusive.
> >
> > So those semantics just don't sound sane. It's also not what any kind
> > of normal "rw" lock ever does.
>
> Yes, this is not usual. And initially I made brw_sem, which allows only
> one writer, but then I changed this patch.
>
> > So can you explain why these particular insane semantics are useful,
> > and what for?
>
> To allow multiple uprobe_register/unregister calls at the same time,
> mostly to avoid adding a "regression": currently this is possible.
>
> It is not that I think this is terribly important, but still. And
> personally I think that "multiple writers" is not necessarily insane
> in general. Suppose you have a complex object/subsystem; the readers
> can use a single brw_mutex to access it "lockless", and start_read()
> is very cheap.
>
> But start_write() is slow. Multiple writers can use fine-grained locking
> inside the start_write/end_write section and do not block each other.

Strangely enough, the old VAXCluster locking primitives allowed this
sort of thing. The brw_start_read() would be a "protected read", and
brw_start_write() would be a "concurrent write". Even more interesting,
they gave the same advice you give -- concurrent writers should use
fine-grained locking to protect the actual accesses.

It seems like it should be possible to come up with better names, but
I cannot think of any at the moment.

							Thanx, Paul

PS. For the sufficiently masochistic, here is the exclusion table for
the six VAXCluster locking modes:

	     NL   CR   CW   PR   PW   EX
	NL
	CR                             X
	CW                   X    X    X
	PR              X         X    X
	PW              X    X    X    X
	EX         X    X    X    X    X

"X" means that the pair of modes exclude each other; otherwise the lock
may be held in both of the modes simultaneously.

Modes:

	NL: Null, or "not held".
	CR: Concurrent read.
	CW: Concurrent write.
	PR: Protected read.
	PW: Protected write.
	EX: Exclusive.

A reader-writer lock could use protected read for readers and either of
protected write or exclusive for writers, the difference between
protected write and exclusive being irrelevant in the absence of
concurrent readers.
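(An illustrative aside, not VMS code — the names below are made up: the
exclusion table above can be encoded as a small C compatibility check.)

	enum lock_mode { NL, CR, CW, PR, PW, EX };

	/* excludes[a] has bit b set iff mode a excludes mode b (per the table) */
	static const unsigned char excludes[6] = {
		[NL] = 0,
		[CR] = 1 << EX,
		[CW] = (1 << PR) | (1 << PW) | (1 << EX),
		[PR] = (1 << CW) | (1 << PW) | (1 << EX),
		[PW] = (1 << CW) | (1 << PR) | (1 << PW) | (1 << EX),
		[EX] = (1 << CR) | (1 << CW) | (1 << PR) | (1 << PW) | (1 << EX),
	};

	static inline int modes_compatible(enum lock_mode a, enum lock_mode b)
	{
		return !(excludes[a] & (1 << b));
	}

For example, modes_compatible(PR, PW) is 0 (a protected write excludes
protected readers), while modes_compatible(CR, PW) is 1.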
* Re: [PATCH 1/2] brw_mutex: big read-write mutex
2012-10-17 22:44 ` Paul E. McKenney
@ 2012-10-18 16:24 ` Oleg Nesterov
2012-10-18 16:38 ` Paul E. McKenney
0 siblings, 1 reply; 8+ messages in thread
From: Oleg Nesterov @ 2012-10-18 16:24 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Linus Torvalds, Ingo Molnar, Peter Zijlstra, Srikar Dronamraju,
	Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel

On 10/17, Paul E. McKenney wrote:
>
> On Wed, Oct 17, 2012 at 06:37:02PM +0200, Oleg Nesterov wrote:
> > On 10/16, Paul E. McKenney wrote:
> > >
> > > Suppose that the writer arrives and sees that the value of the counter
> > > is zero,
> >
> > after synchronize_sched(). So there are no readers (but perhaps there
> > are brw_end_read's in flight which already decremented read_ctr)
>
> But the preempt_disable() region only covers read acquisition. So
> synchronize_sched() waits only for all the brw_start_read() calls to
> reach the preempt_enable()

Yes.

> -- it cannot wait for all the resulting
> readers to reach the corresponding brw_end_read().

Indeed.

> > > and thus never sleeps, and so is also not awakened?
> >
> > and why do we need wakeup in this case?
>
> To get the memory barriers required to keep the critical sections
> ordered -- to ensure that everyone sees the reader's critical section
> as ending before the writer's critical section starts.

And now I am starting to think I misunderstood your concern from
the very beginning.

I thought that you meant that without mb() brw_start_write() can
race with brw_end_read() and hang forever.

But probably you meant that we need the barriers to ensure that,
say, if the reader does

	brw_start_read();
	CONDITION = 1;
	brw_end_read();

then the writer must see CONDITION != 0 after brw_start_write()?
(or vice versa)

In this case we need the barrier, yes. Obviously brw_start_write()
can return right after this_cpu_dec() and before wake_up_all().

2/2 doesn't need this guarantee, but I agree, this doesn't look
sane in general...

Or I misunderstood you again?

Oleg.
* Re: [PATCH 1/2] brw_mutex: big read-write mutex
2012-10-18 16:24 ` Oleg Nesterov
@ 2012-10-18 16:38 ` Paul E. McKenney
2012-10-18 17:57 ` Oleg Nesterov
0 siblings, 1 reply; 8+ messages in thread
From: Paul E. McKenney @ 2012-10-18 16:38 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Linus Torvalds, Ingo Molnar, Peter Zijlstra, Srikar Dronamraju,
	Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel

On Thu, Oct 18, 2012 at 06:24:09PM +0200, Oleg Nesterov wrote:
> On 10/17, Paul E. McKenney wrote:
> >
> > On Wed, Oct 17, 2012 at 06:37:02PM +0200, Oleg Nesterov wrote:
> > > On 10/16, Paul E. McKenney wrote:
> > > >
> > > > Suppose that the writer arrives and sees that the value of the counter
> > > > is zero,
> > >
> > > after synchronize_sched(). So there are no readers (but perhaps there
> > > are brw_end_read's in flight which already decremented read_ctr)
> >
> > But the preempt_disable() region only covers read acquisition. So
> > synchronize_sched() waits only for all the brw_start_read() calls to
> > reach the preempt_enable()
>
> Yes.
>
> > -- it cannot wait for all the resulting
> > readers to reach the corresponding brw_end_read().
>
> Indeed.
>
> > > > and thus never sleeps, and so is also not awakened?
> > >
> > > and why do we need wakeup in this case?
> >
> > To get the memory barriers required to keep the critical sections
> > ordered -- to ensure that everyone sees the reader's critical section
> > as ending before the writer's critical section starts.
>
> And now I am starting to think I misunderstood your concern from
> the very beginning.
>
> I thought that you meant that without mb() brw_start_write() can
> race with brw_end_read() and hang forever.
>
> But probably you meant that we need the barriers to ensure that,
> say, if the reader does
>
> 	brw_start_read();
> 	CONDITION = 1;
> 	brw_end_read();
>
> then the writer must see CONDITION != 0 after brw_start_write()?
> (or vice versa)

Yes, this is exactly my concern.

> In this case we need the barrier, yes. Obviously brw_start_write()
> can return right after this_cpu_dec() and before wake_up_all().
>
> 2/2 doesn't need this guarantee, but I agree, this doesn't look
> sane in general...

Or name it something not containing "lock". And clearly document the
behavior and how it is to be used. ;-) Otherwise, someone will get
confused and introduce bugs.

> Or I misunderstood you again?

No, this was indeed my concern.

							Thanx, Paul
* Re: [PATCH 1/2] brw_mutex: big read-write mutex
2012-10-18 16:38 ` Paul E. McKenney
@ 2012-10-18 17:57 ` Oleg Nesterov
2012-10-19 19:28 ` Paul E. McKenney
0 siblings, 1 reply; 8+ messages in thread
From: Oleg Nesterov @ 2012-10-18 17:57 UTC (permalink / raw)
To: Paul E. McKenney, Mikulas Patocka
Cc: Linus Torvalds, Ingo Molnar, Peter Zijlstra, Srikar Dronamraju,
	Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel

On 10/18, Paul E. McKenney wrote:
>
> On Thu, Oct 18, 2012 at 06:24:09PM +0200, Oleg Nesterov wrote:
> >
> > I thought that you meant that without mb() brw_start_write() can
> > race with brw_end_read() and hang forever.
> >
> > But probably you meant that we need the barriers to ensure that,
> > say, if the reader does
> >
> > 	brw_start_read();
> > 	CONDITION = 1;
> > 	brw_end_read();
> >
> > then the writer must see CONDITION != 0 after brw_start_write()?
> > (or vice versa)
>
> Yes, this is exactly my concern.

Oh, thanks a lot Paul (as always).

> > In this case we need the barrier, yes. Obviously brw_start_write()
> > can return right after this_cpu_dec() and before wake_up_all().
> >
> > 2/2 doesn't need this guarantee, but I agree, this doesn't look
> > sane in general...
>
> Or name it something not containing "lock". And clearly document the
> behavior and how it is to be used. ;-)

this would be insane, I guess ;)

So. Ignoring the possible optimization you mentioned before,
brw_end_read() should do:

	smp_mb();
	this_cpu_dec();

	wake_up_all();

And yes, we need the full mb(). wmb() is enough to ensure that the
writer will see the memory modifications done by the reader. But we
also need to ensure that any LOAD inside start_read/end_read can not
be moved outside of the critical section.

But we should also ensure that "read" will see all modifications
which were done under start_write/end_write. This means that
brw_end_write() needs another synchronize_sched() before
atomic_dec_and_test(), or brw_start_read() needs mb() in the
fast-path.

Correct?

Ooooh. And I just noticed include/linux/percpu-rwsem.h which does
something similar. Certainly it was not in my tree when I started
this patch... percpu_down_write() doesn't allow multiple writers,
but the main problem is that it uses msleep(1). It should not, I think.

But. It seems that percpu_up_write() is equally wrong? Doesn't
it need synchronize_rcu() before "p->locked = false"?

(add Mikulas)

Oleg.
* Re: [PATCH 1/2] brw_mutex: big read-write mutex
2012-10-18 17:57 ` Oleg Nesterov
@ 2012-10-19 19:28 ` Paul E. McKenney
2012-10-22 23:36 ` [PATCH 0/2] fix and improvements for percpu-rw-semaphores (was: brw_mutex: big read-write mutex) Mikulas Patocka
0 siblings, 1 reply; 8+ messages in thread
From: Paul E. McKenney @ 2012-10-19 19:28 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Mikulas Patocka, Linus Torvalds, Ingo Molnar, Peter Zijlstra,
	Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov,
	linux-kernel

On Thu, Oct 18, 2012 at 07:57:47PM +0200, Oleg Nesterov wrote:
> On 10/18, Paul E. McKenney wrote:
> >
> > On Thu, Oct 18, 2012 at 06:24:09PM +0200, Oleg Nesterov wrote:
> > >
> > > I thought that you meant that without mb() brw_start_write() can
> > > race with brw_end_read() and hang forever.
> > >
> > > But probably you meant that we need the barriers to ensure that,
> > > say, if the reader does
> > >
> > > 	brw_start_read();
> > > 	CONDITION = 1;
> > > 	brw_end_read();
> > >
> > > then the writer must see CONDITION != 0 after brw_start_write()?
> > > (or vice versa)
> >
> > Yes, this is exactly my concern.
>
> Oh, thanks a lot Paul (as always).

Glad it helped. ;-)

> > > In this case we need the barrier, yes. Obviously brw_start_write()
> > > can return right after this_cpu_dec() and before wake_up_all().
> > >
> > > 2/2 doesn't need this guarantee, but I agree, this doesn't look
> > > sane in general...
> >
> > Or name it something not containing "lock". And clearly document the
> > behavior and how it is to be used. ;-)
>
> this would be insane, I guess ;)

Well, I suppose you could call it a "phase": brw_start_phase_1() and
so on.

> So. Ignoring the possible optimization you mentioned before,
> brw_end_read() should do:
>
> 	smp_mb();
> 	this_cpu_dec();
>
> 	wake_up_all();
>
> And yes, we need the full mb(). wmb() is enough to ensure that the
> writer will see the memory modifications done by the reader. But we
> also need to ensure that any LOAD inside start_read/end_read can not
> be moved outside of the critical section.
>
> But we should also ensure that "read" will see all modifications
> which were done under start_write/end_write. This means that
> brw_end_write() needs another synchronize_sched() before
> atomic_dec_and_test(), or brw_start_read() needs mb() in the
> fast-path.
>
> Correct?

Good point, I missed the need for synchronize_sched() to avoid readers
sleeping through the next write cycle due to racing with an exiting
writer. But yes, this sounds correct.

> Ooooh. And I just noticed include/linux/percpu-rwsem.h which does
> something similar. Certainly it was not in my tree when I started
> this patch... percpu_down_write() doesn't allow multiple writers,
> but the main problem is that it uses msleep(1). It should not, I think.
>
> But. It seems that percpu_up_write() is equally wrong? Doesn't
> it need synchronize_rcu() before "p->locked = false"?
>
> (add Mikulas)

Mikulas said something about doing an updated patch, so I figured I
would look at his next version.

							Thanx, Paul
* [PATCH 0/2] fix and improvements for percpu-rw-semaphores (was: brw_mutex: big read-write mutex)
2012-10-19 19:28 ` Paul E. McKenney
@ 2012-10-22 23:36 ` Mikulas Patocka
2012-10-30 18:48 ` Oleg Nesterov
0 siblings, 1 reply; 8+ messages in thread
From: Mikulas Patocka @ 2012-10-22 23:36 UTC (permalink / raw)
To: Linus Torvalds
Cc: Oleg Nesterov, Ingo Molnar, Peter Zijlstra, Srikar Dronamraju,
	Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel,
	Paul E. McKenney

> > Ooooh. And I just noticed include/linux/percpu-rwsem.h which does
> > something similar. Certainly it was not in my tree when I started
> > this patch... percpu_down_write() doesn't allow multiple writers,
> > but the main problem is that it uses msleep(1). It should not, I think.
> >
> > But. It seems that percpu_up_write() is equally wrong? Doesn't
> > it need synchronize_rcu() before "p->locked = false"?
> >
> > (add Mikulas)
>
> Mikulas said something about doing an updated patch, so I figured I
> would look at his next version.
>
> 							Thanx, Paul

The best ideas proposed in this thread are:

Using heavy/light barriers, by Lai Jiangshan. This fixes the missing
barrier bug, removes the ugly test "#if defined(X86) ..." and makes the
read path use no barrier instruction on all architectures.

Instead of rcu_read_lock, we can use rcu_read_lock_sched (or
preempt_disable) - the resulting code is smaller. The critical section
is so small that there is no problem disabling preemption.

I am sending these two patches. Linus, please apply them if there are
no objections.

Mikulas
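(To illustrate the rcu_read_lock_sched point — a sketch of the resulting
read fast path, reconstructed from the percpu-rwsem code under discussion
rather than taken from Mikulas's actual patches:)

	static inline void percpu_down_read(struct percpu_rw_semaphore *p)
	{
		/*
		 * rcu_read_lock_sched() is essentially preempt_disable()
		 * plus lockdep bookkeeping; the critical section is only
		 * a few instructions, so disabling preemption is fine.
		 */
		rcu_read_lock_sched();
		if (likely(!p->locked)) {
			this_cpu_inc(*p->counters);
			rcu_read_unlock_sched();
			return;
		}
		rcu_read_unlock_sched();
		/* slow path: a writer is active, fall back to the mutex */
		mutex_lock(&p->mtx);
		this_cpu_inc(*p->counters);
		mutex_unlock(&p->mtx);
	}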
* Re: [PATCH 0/2] fix and improvements for percpu-rw-semaphores (was: brw_mutex: big read-write mutex)
2012-10-22 23:36 ` [PATCH 0/2] fix and improvements for percpu-rw-semaphores (was: brw_mutex: big read-write mutex) Mikulas Patocka
@ 2012-10-30 18:48 ` Oleg Nesterov
2012-10-31 19:41 ` [PATCH 0/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily Oleg Nesterov
0 siblings, 1 reply; 8+ messages in thread
From: Oleg Nesterov @ 2012-10-30 18:48 UTC (permalink / raw)
To: Mikulas Patocka, Paul E. McKenney, Peter Zijlstra
Cc: Linus Torvalds, Ingo Molnar, Srikar Dronamraju,
	Ananth N Mavinakayanahalli, Anton Arapov, linux-kernel

On 10/22, Mikulas Patocka wrote:
>
> > > Ooooh. And I just noticed include/linux/percpu-rwsem.h which does
> > > something similar. Certainly it was not in my tree when I started
> > > this patch... percpu_down_write() doesn't allow multiple writers,
> > > but the main problem is that it uses msleep(1). It should not, I think.

But, since we already have percpu_rw_semaphore, I do not think I can
add another similar thing. However, percpu_rw_semaphore is sub-optimal,
and I am not sure uprobes can use it to block dup_mmap(). Perhaps we
can improve it?

> > > But. It seems that percpu_up_write() is equally wrong? Doesn't
> > > it need synchronize_rcu() before "p->locked = false"?
> > >
> > > (add Mikulas)
> >
> > Mikulas said something about doing an updated patch, so I figured I
> > would look at his next version.
> >
> > 							Thanx, Paul
>
> The best ideas proposed in this thread are:
>
> Using heavy/light barriers, by Lai Jiangshan.

So. down_write/up_write does msleep() and it needs to call
synchronize_sched() 3 times. This looks too much. It is not that I am
worried about the writers; the problem is that the new readers are
blocked completely while the writer sleeps in msleep/synchronize_sched.

Paul, Mikulas, et al. Could you please look at the new implementation
below? Completely untested/uncompiled, just for discussion.

Compared to the current implementation, down_read() is still possible
while the writer sleeps in synchronize_sched(), but the reader uses
rw_semaphore/atomic_inc when it detects the waiting writer.

Can this work? Do you think this is better than what we have now?

Note: probably we can optimize percpu_down/up_write more, we can
"factor out" synchronize_sched(); multiple writers can do this in
parallel before they take ->writer_mutex to exclude each other. But
this won't affect the readers, and this can be done later.

Oleg.
------------------------------------------------------------------------------

struct percpu_rw_semaphore {
	long __percpu		*fast_read_ctr;
	struct mutex		writer_mutex;
	struct rw_semaphore	rw_sem;
	atomic_t		slow_read_ctr;
	wait_queue_head_t	write_waitq;
};

static bool update_fast_ctr(struct percpu_rw_semaphore *brw, long val)
{
	bool success = false;

	preempt_disable();
	if (likely(!mutex_is_locked(&brw->writer_mutex))) {
		__this_cpu_add(*brw->fast_read_ctr, val);
		success = true;
	}
	preempt_enable();

	return success;
}

static long clear_fast_read_ctr(struct percpu_rw_semaphore *brw)
{
	long sum = 0;
	int cpu;

	for_each_possible_cpu(cpu) {
		sum += per_cpu(*brw->fast_read_ctr, cpu);
		per_cpu(*brw->fast_read_ctr, cpu) = 0;
	}

	return sum;
}

void percpu_down_read(struct percpu_rw_semaphore *brw)
{
	if (likely(update_fast_ctr(brw, +1)))
		return;

	down_read(&brw->rw_sem);
	atomic_inc(&brw->slow_read_ctr);
	up_read(&brw->rw_sem);
}

void percpu_up_read(struct percpu_rw_semaphore *brw)
{
	if (likely(update_fast_ctr(brw, -1)))
		return;

	if (atomic_dec_and_test(&brw->slow_read_ctr))
		wake_up_all(&brw->write_waitq);
}

void percpu_down_write(struct percpu_rw_semaphore *brw)
{
	mutex_lock(&brw->writer_mutex);

	/* ensure mutex_is_locked() is visible to the readers */
	synchronize_sched();

	/* block the new readers */
	down_write(&brw->rw_sem);

	atomic_add(clear_fast_read_ctr(brw), &brw->slow_read_ctr);

	wait_event(brw->write_waitq, !atomic_read(&brw->slow_read_ctr));
}

void percpu_up_write(struct percpu_rw_semaphore *brw)
{
	up_write(&brw->rw_sem);

	/* insert the barrier before the next fast-path in down_read */
	synchronize_sched();

	mutex_unlock(&brw->writer_mutex);
}
* [PATCH 0/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
2012-10-30 18:48 ` Oleg Nesterov
@ 2012-10-31 19:41 ` Oleg Nesterov
2012-10-31 19:41 ` [PATCH 1/1] " Oleg Nesterov
0 siblings, 1 reply; 8+ messages in thread
From: Oleg Nesterov @ 2012-10-31 19:41 UTC (permalink / raw)
To: Mikulas Patocka, Paul E. McKenney, Peter Zijlstra, Linus Torvalds
Cc: Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli,
	Anton Arapov, linux-kernel

On 10/30, Oleg Nesterov wrote:
>
> So. down_write/up_write does msleep() and it needs to call
> synchronize_sched() 3 times.
>
> This looks too much. It is not that I am worried about the writers;
> the problem is that the new readers are blocked completely while the
> writer sleeps in msleep/synchronize_sched.
>
> Paul, Mikulas, et al. Could you please look at the new implementation
> below? Completely untested/uncompiled, just for discussion.

I tried to test it, and it seems to work... But I guess the only valid
test is: pass the review from Paul/Peter.

Todo:

	- add the lockdep annotations

	- we can speed up the down_write-right-after-up_write case

What do you all think?

Oleg.
* [PATCH 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
2012-10-31 19:41 ` [PATCH 0/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily Oleg Nesterov
@ 2012-10-31 19:41 ` Oleg Nesterov
2012-11-01 15:10 ` Linus Torvalds
2012-11-01 15:43 ` Paul E. McKenney
0 siblings, 2 replies; 8+ messages in thread
From: Oleg Nesterov @ 2012-10-31 19:41 UTC (permalink / raw)
To: Mikulas Patocka, Paul E. McKenney, Peter Zijlstra, Linus Torvalds
Cc: Ingo Molnar, Srikar Dronamraju, Ananth N Mavinakayanahalli,
	Anton Arapov, linux-kernel

Currently the writer does msleep() plus synchronize_sched() 3 times
to acquire/release the semaphore, and during this time the readers
are blocked completely. Even if the "write" section was not actually
started or if it was already finished.

With this patch down_read/up_read does synchronize_sched() twice and
down_read/up_read are still possible during this time, just they use
the slow path.

percpu_down_write() first forces the readers to use rw_semaphore and
increment the "slow" counter to take the lock for reading, then it
takes that rw_semaphore for writing and blocks the readers.

Also. With this patch the code relies on the documented behaviour of
synchronize_sched(), it doesn't try to pair synchronize_sched() with
barrier.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 include/linux/percpu-rwsem.h |   83 +++++---------------------------
 lib/Makefile                 |    2 +-
 lib/percpu-rwsem.c           |  106 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 120 insertions(+), 71 deletions(-)
 create mode 100644 lib/percpu-rwsem.c

diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
index 250a4ac..7f738ca 100644
--- a/include/linux/percpu-rwsem.h
+++ b/include/linux/percpu-rwsem.h
@@ -2,82 +2,25 @@
 #define _LINUX_PERCPU_RWSEM_H
 
 #include <linux/mutex.h>
+#include <linux/rwsem.h>
 #include <linux/percpu.h>
-#include <linux/rcupdate.h>
-#include <linux/delay.h>
+#include <linux/wait.h>
 
 struct percpu_rw_semaphore {
-	unsigned __percpu *counters;
-	bool locked;
-	struct mutex mtx;
+	int __percpu		*fast_read_ctr;
+	struct mutex		writer_mutex;
+	struct rw_semaphore	rw_sem;
+	atomic_t		slow_read_ctr;
+	wait_queue_head_t	write_waitq;
 };
 
-#define light_mb()	barrier()
-#define heavy_mb()	synchronize_sched()
+extern void percpu_down_read(struct percpu_rw_semaphore *);
+extern void percpu_up_read(struct percpu_rw_semaphore *);
 
-static inline void percpu_down_read(struct percpu_rw_semaphore *p)
-{
-	rcu_read_lock_sched();
-	if (unlikely(p->locked)) {
-		rcu_read_unlock_sched();
-		mutex_lock(&p->mtx);
-		this_cpu_inc(*p->counters);
-		mutex_unlock(&p->mtx);
-		return;
-	}
-	this_cpu_inc(*p->counters);
-	rcu_read_unlock_sched();
-	light_mb(); /* A, between read of p->locked and read of data, paired with D */
-}
+extern void percpu_down_write(struct percpu_rw_semaphore *);
+extern void percpu_up_write(struct percpu_rw_semaphore *);
 
-static inline void percpu_up_read(struct percpu_rw_semaphore *p)
-{
-	light_mb(); /* B, between read of the data and write to p->counter, paired with C */
-	this_cpu_dec(*p->counters);
-}
-
-static inline unsigned __percpu_count(unsigned __percpu *counters)
-{
-	unsigned total = 0;
-	int cpu;
-
-	for_each_possible_cpu(cpu)
-		total += ACCESS_ONCE(*per_cpu_ptr(counters, cpu));
-
-	return total;
-}
-
-static inline void percpu_down_write(struct percpu_rw_semaphore *p)
-{
-	mutex_lock(&p->mtx);
-	p->locked = true;
-	synchronize_sched(); /* make sure that all readers exit the rcu_read_lock_sched region */
-	while (__percpu_count(p->counters))
-		msleep(1);
-	heavy_mb(); /* C, between read of p->counter and write to data, paired with B */
-}
-
-static inline void percpu_up_write(struct percpu_rw_semaphore *p)
-{
-	heavy_mb(); /* D, between write to data and write to p->locked, paired with A */
-	p->locked = false;
-	mutex_unlock(&p->mtx);
-}
-
-static inline int percpu_init_rwsem(struct percpu_rw_semaphore *p)
-{
-	p->counters = alloc_percpu(unsigned);
-	if (unlikely(!p->counters))
-		return -ENOMEM;
-	p->locked = false;
-	mutex_init(&p->mtx);
-	return 0;
-}
-
-static inline void percpu_free_rwsem(struct percpu_rw_semaphore *p)
-{
-	free_percpu(p->counters);
-	p->counters = NULL; /* catch use after free bugs */
-}
+extern int percpu_init_rwsem(struct percpu_rw_semaphore *);
+extern void percpu_free_rwsem(struct percpu_rw_semaphore *);
 
 #endif
diff --git a/lib/Makefile b/lib/Makefile
index 821a162..4dad4a7 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -12,7 +12,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
	 idr.o int_sqrt.o extable.o \
	 sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \
	 proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \
-	 is_single_threaded.o plist.o decompress.o
+	 is_single_threaded.o plist.o decompress.o percpu-rwsem.o
 
 lib-$(CONFIG_MMU) += ioremap.o
 lib-$(CONFIG_SMP) += cpumask.o
diff --git a/lib/percpu-rwsem.c b/lib/percpu-rwsem.c
new file mode 100644
index 0000000..40a415d
--- /dev/null
+++ b/lib/percpu-rwsem.c
@@ -0,0 +1,106 @@
+#include <linux/percpu-rwsem.h>
+#include <linux/rcupdate.h>
+#include <linux/sched.h>
+
+int percpu_init_rwsem(struct percpu_rw_semaphore *brw)
+{
+	brw->fast_read_ctr = alloc_percpu(int);
+	if (unlikely(!brw->fast_read_ctr))
+		return -ENOMEM;
+
+	mutex_init(&brw->writer_mutex);
+	init_rwsem(&brw->rw_sem);
+	atomic_set(&brw->slow_read_ctr, 0);
+	init_waitqueue_head(&brw->write_waitq);
+	return 0;
+}
+
+void percpu_free_rwsem(struct percpu_rw_semaphore *brw)
+{
+	free_percpu(brw->fast_read_ctr);
+	brw->fast_read_ctr = NULL; /* catch use after free bugs */
+}
+
+static bool update_fast_ctr(struct percpu_rw_semaphore *brw, int val)
+{
+	bool success = false;
+
+	preempt_disable();
+	if (likely(!mutex_is_locked(&brw->writer_mutex))) {
+		__this_cpu_add(*brw->fast_read_ctr, val);
+		success = true;
+	}
+	preempt_enable();
+
+	return success;
+}
+
+void percpu_down_read(struct percpu_rw_semaphore *brw)
+{
+	if (likely(update_fast_ctr(brw, +1)))
+		return;
+
+	down_read(&brw->rw_sem);
+	atomic_inc(&brw->slow_read_ctr);
+	up_read(&brw->rw_sem);
+}
+
+void percpu_up_read(struct percpu_rw_semaphore *brw)
+{
+	if (likely(update_fast_ctr(brw, -1)))
+		return;
+
+	/* false-positive is possible but harmless */
+	if (atomic_dec_and_test(&brw->slow_read_ctr))
+		wake_up_all(&brw->write_waitq);
+}
+
+static int clear_fast_read_ctr(struct percpu_rw_semaphore *brw)
+{
+	int cpu, sum = 0;
+
+	for_each_possible_cpu(cpu) {
+		sum += per_cpu(*brw->fast_read_ctr, cpu);
+		per_cpu(*brw->fast_read_ctr, cpu) = 0;
+	}
+
+	return sum;
+}
+
+void percpu_down_write(struct percpu_rw_semaphore *brw)
+{
+	/* also blocks update_fast_ctr() which checks mutex_is_locked() */
+	mutex_lock(&brw->writer_mutex);
+
+	/*
+	 * 1. Ensures mutex_is_locked() is visible to any down_read/up_read
+	 *    so that update_fast_ctr() can't succeed.
+	 *
+	 * 2. Ensures we see the result of every previous this_cpu_add() in
+	 *    update_fast_ctr().
+	 *
+	 * 3. Ensures that if any reader has exited its critical section via
+	 *    fast-path, it executes a full memory barrier before we return.
+	 */
+	synchronize_sched();
+
+	/* nobody can use fast_read_ctr, move its sum into slow_read_ctr */
+	atomic_add(clear_fast_read_ctr(brw), &brw->slow_read_ctr);
+
+	/* block the new readers completely */
+	down_write(&brw->rw_sem);
+
+	/* wait for all readers to complete their percpu_up_read() */
+	wait_event(brw->write_waitq, !atomic_read(&brw->slow_read_ctr));
+}
+
+void percpu_up_write(struct percpu_rw_semaphore *brw)
+{
+	/* allow the new readers, but only the slow-path */
+	up_write(&brw->rw_sem);
+
+	/* insert the barrier before the next fast-path in down_read */
+	synchronize_sched();
+
+	mutex_unlock(&brw->writer_mutex);
+}
-- 
1.5.5.1
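(An illustrative usage sketch, not part of the patch — my_brw and
my_data are made-up names. Readers normally pay only a per-cpu
increment; the writer pays the two synchronize_sched() calls:)

	static struct percpu_rw_semaphore my_brw; /* percpu_init_rwsem(&my_brw) at init time */
	static int my_data;

	void reader(void)
	{
		percpu_down_read(&my_brw);	/* fast path: per-cpu increment */
		/* ... read my_data ... */
		percpu_up_read(&my_brw);
	}

	void writer(void)
	{
		percpu_down_write(&my_brw);	/* blocks new readers, waits for old ones */
		my_data++;
		percpu_up_write(&my_brw);
	}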
* Re: [PATCH 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
2012-10-31 19:41 ` [PATCH 1/1] " Oleg Nesterov
@ 2012-11-01 15:10 ` Linus Torvalds
2012-11-01 15:34 ` Oleg Nesterov
0 siblings, 1 reply; 8+ messages in thread
From: Linus Torvalds @ 2012-11-01 15:10 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Mikulas Patocka, Paul E. McKenney, Peter Zijlstra, Ingo Molnar,
	Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov,
	linux-kernel

On Wed, Oct 31, 2012 at 12:41 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> Currently the writer does msleep() plus synchronize_sched() 3 times
> to acquire/release the semaphore, and during this time the readers
> are blocked completely. Even if the "write" section was not actually
> started or if it was already finished.
>
> With this patch down_read/up_read does synchronize_sched() twice and
> down_read/up_read are still possible during this time, just they use
> the slow path.

The changelog is wrong (it's the write path, not the read path, that
does the synchronize_sched).

>  struct percpu_rw_semaphore {
> -	unsigned __percpu *counters;
> -	bool locked;
> -	struct mutex mtx;
> +	int __percpu		*fast_read_ctr;

This change is wrong.

You must not make the 'fast_read_ctr' thing be an int. Or at least you
need to be a hell of a lot more careful about it.

Why?

Because the readers update the counters while possibly moving around
cpu's, the increment and decrement of the counters may be on different
CPU's. But that means that when you add all the counters together,
things can overflow (only the final sum is meaningful). And THAT in
turn means that you should not use a signed count, for the simple
reason that signed integers don't have well-behaved overflow behavior
in C.

Now, I doubt you'll find an architecture or C compiler where this will
actually ever make a difference, but the fact remains that you
shouldn't use signed integers for counters like this. You should use
unsigned, and you should rely on the well-defined modulo-2**n
semantics.

I'd also like to see a comment somewhere in the source code about the
whole algorithm and the rules.

Other than that, I guess it looks ok.

	Linus
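(A concrete illustration of the overflow point, with made-up values: a
reader can increment the counter on one CPU and decrement it on
another, so an individual counter may wrap and only the modulo-2**n sum
of all counters is meaningful:)

	void illustrate(void)
	{
		unsigned int ctr0 = 0, ctr1 = 0; /* two "per-cpu" counters */

		ctr0 += 1;	/* percpu_down_read() on CPU 0: ctr0 == 1 */
		ctr1 -= 1;	/* reader migrated; percpu_up_read() on CPU 1:
				 * ctr1 == 0xffffffffu (defined wraparound) */

		/* only the sum matters: 1 + 0xffffffffu == 0 (mod 2**32) */
	}

With signed counters, adding them up can overflow, and signed overflow
is undefined in C — hence the insistence on unsigned.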
* Re: [PATCH 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
2012-11-01 15:10 ` Linus Torvalds
@ 2012-11-01 15:34 ` Oleg Nesterov
0 siblings, 0 replies; 8+ messages in thread
From: Oleg Nesterov @ 2012-11-01 15:34 UTC (permalink / raw)
To: Linus Torvalds
Cc: Mikulas Patocka, Paul E. McKenney, Peter Zijlstra, Ingo Molnar,
	Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov,
	linux-kernel

Thanks! I'll send v2 tomorrow.

On 11/01, Linus Torvalds wrote:
> On Wed, Oct 31, 2012 at 12:41 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> > Currently the writer does msleep() plus synchronize_sched() 3 times
> > to acquire/release the semaphore, and during this time the readers
> > are blocked completely. Even if the "write" section was not actually
> > started or if it was already finished.
> >
> > With this patch down_read/up_read does synchronize_sched() twice and
> > down_read/up_read are still possible during this time, just they use
> > the slow path.
>
> The changelog is wrong (it's the write path, not the read path, that
> does the synchronize_sched).
>
> >  struct percpu_rw_semaphore {
> > -	unsigned __percpu *counters;
> > -	bool locked;
> > -	struct mutex mtx;
> > +	int __percpu		*fast_read_ctr;
>
> This change is wrong.
>
> You must not make the 'fast_read_ctr' thing be an int. Or at least you
> need to be a hell of a lot more careful about it.
>
> Why?
>
> Because the readers update the counters while possibly moving around
> cpu's, the increment and decrement of the counters may be on different
> CPU's. But that means that when you add all the counters together,
> things can overflow (only the final sum is meaningful). And THAT in
> turn means that you should not use a signed count, for the simple
> reason that signed integers don't have well-behaved overflow behavior
> in C.
>
> Now, I doubt you'll find an architecture or C compiler where this will
> actually ever make a difference, but the fact remains that you
> shouldn't use signed integers for counters like this. You should use
> unsigned, and you should rely on the well-defined modulo-2**n
> semantics.
>
> I'd also like to see a comment somewhere in the source code about the
> whole algorithm and the rules.
>
> Other than that, I guess it looks ok.
>
> 	Linus
* Re: [PATCH 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
2012-10-31 19:41 ` [PATCH 1/1] " Oleg Nesterov
2012-11-01 15:10 ` Linus Torvalds
@ 2012-11-01 15:43 ` Paul E. McKenney
2012-11-01 18:33 ` Oleg Nesterov
1 sibling, 1 reply; 8+ messages in thread
From: Paul E. McKenney @ 2012-11-01 15:43 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Mikulas Patocka, Peter Zijlstra, Linus Torvalds, Ingo Molnar,
	Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov,
	linux-kernel

On Wed, Oct 31, 2012 at 08:41:58PM +0100, Oleg Nesterov wrote:
> Currently the writer does msleep() plus synchronize_sched() 3 times
> to acquire/release the semaphore, and during this time the readers
> are blocked completely. Even if the "write" section was not actually
> started or if it was already finished.
>
> With this patch down_read/up_read does synchronize_sched() twice and
> down_read/up_read are still possible during this time, just they use
> the slow path.
>
> percpu_down_write() first forces the readers to use rw_semaphore and
> increment the "slow" counter to take the lock for reading, then it
> takes that rw_semaphore for writing and blocks the readers.
>
> Also. With this patch the code relies on the documented behaviour of
> synchronize_sched(), it doesn't try to pair synchronize_sched() with
> barrier.

OK, so it looks to me that this code relies on synchronize_sched()
forcing a memory barrier on each CPU executing in the kernel. I might
well be confused, so here is the sequence of events that leads me to
believe this:

1. A task running on CPU 0 currently write-holds the lock.

2. CPU 1 is running in the kernel, executing a longer-than-average
   loop of normal instructions (no atomic instructions or memory
   barriers).

3. CPU 0 invokes percpu_up_write(), calling up_write(),
   synchronize_sched(), and finally mutex_unlock().

4. CPU 1 executes percpu_down_read(), which calls update_fast_ctr(),
   which finds that ->writer_mutex is not held. CPU 1 therefore
   increments ->fast_read_ctr and returns success.

Of course, as Mikulas pointed out, the actual implementation will have
forced CPU 1 to execute a memory barrier in the course of the
synchronize_sched() implementation. However, if synchronize_sched()
had been modified to act as synchronize_srcu() currently does, there
would be no memory barrier, and thus no guarantee that CPU 1's
subsequent read-side critical section would see the effect of CPU 0's
previous write-side critical section.

Fortunately, this is easy to fix, with zero added overhead on the
read-side fastpath, as shown by the notes interspersed below.

Thoughts?

							Thanx, Paul

> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> ---
>  include/linux/percpu-rwsem.h |   83 +++++---------------------------
>  lib/Makefile                 |    2 +-
>  lib/percpu-rwsem.c           |  106 ++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 120 insertions(+), 71 deletions(-)
>  create mode 100644 lib/percpu-rwsem.c
>
> diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
> index 250a4ac..7f738ca 100644
> --- a/include/linux/percpu-rwsem.h
> +++ b/include/linux/percpu-rwsem.h
> @@ -2,82 +2,25 @@
>  #define _LINUX_PERCPU_RWSEM_H
>
>  #include <linux/mutex.h>
> +#include <linux/rwsem.h>
>  #include <linux/percpu.h>
> -#include <linux/rcupdate.h>
> -#include <linux/delay.h>
> +#include <linux/wait.h>
>
>  struct percpu_rw_semaphore {
> -	unsigned __percpu *counters;
> -	bool locked;
> -	struct mutex mtx;
> +	int __percpu		*fast_read_ctr;
> +	struct mutex		writer_mutex;
> +	struct rw_semaphore	rw_sem;
> +	atomic_t		slow_read_ctr;
> +	wait_queue_head_t	write_waitq;

	int wstate;

>  };
>
> -#define light_mb()	barrier()
> -#define heavy_mb()	synchronize_sched()
> +extern void percpu_down_read(struct percpu_rw_semaphore *);
> +extern void percpu_up_read(struct percpu_rw_semaphore *);
>
> -static inline void percpu_down_read(struct percpu_rw_semaphore *p)
> -{
> -	rcu_read_lock_sched();
> -	if (unlikely(p->locked)) {
> -		rcu_read_unlock_sched();
> -		mutex_lock(&p->mtx);
> -		this_cpu_inc(*p->counters);
> -		mutex_unlock(&p->mtx);
> -		return;
> -	}
> -	this_cpu_inc(*p->counters);
> -	rcu_read_unlock_sched();
> -	light_mb(); /* A, between read of p->locked and read of data, paired with D */
> -}
> +extern void percpu_down_write(struct percpu_rw_semaphore *);
> +extern void percpu_up_write(struct percpu_rw_semaphore *);
>
> -static inline void percpu_up_read(struct percpu_rw_semaphore *p)
> -{
> -	light_mb(); /* B, between read of the data and write to p->counter, paired with C */
> -	this_cpu_dec(*p->counters);
> -}
> -
> -static inline unsigned __percpu_count(unsigned __percpu *counters)
> -{
> -	unsigned total = 0;
> -	int cpu;
> -
> -	for_each_possible_cpu(cpu)
> -		total += ACCESS_ONCE(*per_cpu_ptr(counters, cpu));
> -
> -	return total;
> -}
> -
> -static inline void percpu_down_write(struct percpu_rw_semaphore *p)
> -{
> -	mutex_lock(&p->mtx);
> -	p->locked = true;
> -	synchronize_sched(); /* make sure that all readers exit the rcu_read_lock_sched region */
> -	while (__percpu_count(p->counters))
> -		msleep(1);
> -	heavy_mb(); /* C, between read of p->counter and write to data, paired with B */
> -}
> -
> -static inline void percpu_up_write(struct percpu_rw_semaphore *p)
> -{
> -	heavy_mb(); /* D, between write to data and write to p->locked, paired with A */
> -	p->locked = false;
> -	mutex_unlock(&p->mtx);
> -}
> -
> -static inline int percpu_init_rwsem(struct percpu_rw_semaphore *p)
> -{
> -	p->counters = alloc_percpu(unsigned);
> -	if (unlikely(!p->counters))
> -		return -ENOMEM;
> -	p->locked = false;
> -	mutex_init(&p->mtx);
> -	return 0;
> -}
> -
> -static inline void percpu_free_rwsem(struct percpu_rw_semaphore *p)
> -{
> -	free_percpu(p->counters);
> -	p->counters = NULL; /* catch use after free bugs */
> -}
> +extern int percpu_init_rwsem(struct percpu_rw_semaphore *);
> +extern void percpu_free_rwsem(struct percpu_rw_semaphore *);
>
>  #endif
> diff --git a/lib/Makefile b/lib/Makefile
> index 821a162..4dad4a7 100644
> --- a/lib/Makefile
> +++ b/lib/Makefile
> @@ -12,7 +12,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
> 	 idr.o int_sqrt.o extable.o \
> 	 sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \
> 	 proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \
> -	 is_single_threaded.o plist.o decompress.o
> +	 is_single_threaded.o plist.o decompress.o percpu-rwsem.o
>
>  lib-$(CONFIG_MMU) += ioremap.o
>  lib-$(CONFIG_SMP) += cpumask.o
> diff --git a/lib/percpu-rwsem.c b/lib/percpu-rwsem.c
> new file mode 100644
> index 0000000..40a415d
> --- /dev/null
> +++ b/lib/percpu-rwsem.c
> @@ -0,0 +1,106 @@
> +#include <linux/percpu-rwsem.h>
> +#include <linux/rcupdate.h>
> +#include <linux/sched.h>

#define WSTATE_NEED_LOCK 1
#define WSTATE_NEED_MB   2

> +int percpu_init_rwsem(struct percpu_rw_semaphore *brw)
> +{
> +	brw->fast_read_ctr = alloc_percpu(int);
> +	if (unlikely(!brw->fast_read_ctr))
> +		return -ENOMEM;
> +
> +	mutex_init(&brw->writer_mutex);
> +	init_rwsem(&brw->rw_sem);
> +	atomic_set(&brw->slow_read_ctr, 0);
> +	init_waitqueue_head(&brw->write_waitq);
> +	return 0;
> +}
> +
> +void percpu_free_rwsem(struct percpu_rw_semaphore *brw)
> +{
> +	free_percpu(brw->fast_read_ctr);
> +	brw->fast_read_ctr = NULL; /* catch use after free bugs */
> +}
> +
> +static bool update_fast_ctr(struct percpu_rw_semaphore *brw, int val)
> +{
> +	bool success = false;

	int state;

> +
> +	preempt_disable();
> +	if (likely(!mutex_is_locked(&brw->writer_mutex))) {

	state = ACCESS_ONCE(brw->wstate);
	if (likely(!state)) {

> +		__this_cpu_add(*brw->fast_read_ctr, val);
> +		success = true;

	} else if (state & WSTATE_NEED_MB) {
		__this_cpu_add(*brw->fast_read_ctr, val);
		smp_mb(); /* Order increment against critical section. */
		success = true;
	}

> +	preempt_enable();
> +
> +	return success;
> +}
> +
> +void percpu_down_read(struct percpu_rw_semaphore *brw)
> +{
> +	if (likely(update_fast_ctr(brw, +1)))
> +		return;
> +
> +	down_read(&brw->rw_sem);
> +	atomic_inc(&brw->slow_read_ctr);
> +	up_read(&brw->rw_sem);
> +}
> +
> +void percpu_up_read(struct percpu_rw_semaphore *brw)
> +{
> +	if (likely(update_fast_ctr(brw, -1)))
> +		return;
> +
> +	/* false-positive is possible but harmless */
> +	if (atomic_dec_and_test(&brw->slow_read_ctr))
> +		wake_up_all(&brw->write_waitq);
> +}
> +
> +static int clear_fast_read_ctr(struct percpu_rw_semaphore *brw)
> +{
> +	int cpu, sum = 0;
> +
> +	for_each_possible_cpu(cpu) {
> +		sum += per_cpu(*brw->fast_read_ctr, cpu);
> +		per_cpu(*brw->fast_read_ctr, cpu) = 0;
> +	}
> +
> +	return sum;
> +}
> +
> +void percpu_down_write(struct percpu_rw_semaphore *brw)
> +{
> +	/* also blocks update_fast_ctr() which checks mutex_is_locked() */
> +	mutex_lock(&brw->writer_mutex);

	ACCESS_ONCE(brw->wstate) = WSTATE_NEED_LOCK;

> +	/*
> +	 * 1. Ensures mutex_is_locked() is visible to any down_read/up_read
> +	 *    so that update_fast_ctr() can't succeed.
> +	 *
> +	 * 2. Ensures we see the result of every previous this_cpu_add() in
> +	 *    update_fast_ctr().
> +	 *
> +	 * 3. Ensures that if any reader has exited its critical section via
> +	 *    fast-path, it executes a full memory barrier before we return.
> +	 */
> +	synchronize_sched();
> +
> +	/* nobody can use fast_read_ctr, move its sum into slow_read_ctr */
> +	atomic_add(clear_fast_read_ctr(brw), &brw->slow_read_ctr);
> +
> +	/* block the new readers completely */
> +	down_write(&brw->rw_sem);
> +
> +	/* wait for all readers to complete their percpu_up_read() */
> +	wait_event(brw->write_waitq, !atomic_read(&brw->slow_read_ctr));
> +}
> +
> +void percpu_up_write(struct percpu_rw_semaphore *brw)
> +{
> +	/* allow the new readers, but only the slow-path */
> +	up_write(&brw->rw_sem);

	ACCESS_ONCE(brw->wstate) = WSTATE_NEED_MB;

> +
> +	/* insert the barrier before the next fast-path in down_read */
> +	synchronize_sched();

	ACCESS_ONCE(brw->wstate) = 0;

> +	mutex_unlock(&brw->writer_mutex);
> +}
* Re: [PATCH 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
2012-11-01 15:43 ` Paul E. McKenney
@ 2012-11-01 18:33 ` Oleg Nesterov
2012-11-02 16:18 ` Oleg Nesterov
0 siblings, 1 reply; 8+ messages in thread
From: Oleg Nesterov @ 2012-11-01 18:33 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Mikulas Patocka, Peter Zijlstra, Linus Torvalds, Ingo Molnar,
	Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov,
	linux-kernel

Paul, thanks. Sorry, I can't reply today, just one note...

On 11/01, Paul E. McKenney wrote:
>
> OK, so it looks to me that this code relies on synchronize_sched()
> forcing a memory barrier on each CPU executing in the kernel.

No, the patch tries to avoid this assumption, but probably I missed
something.

> 1. A task running on CPU 0 currently write-holds the lock.
>
> 2. CPU 1 is running in the kernel, executing a longer-than-average
>    loop of normal instructions (no atomic instructions or memory
>    barriers).
>
> 3. CPU 0 invokes percpu_up_write(), calling up_write(),
>    synchronize_sched(), and finally mutex_unlock().

And my expectation was, this should be enough because ...

> 4. CPU 1 executes percpu_down_read(), which calls update_fast_ctr(),

since update_fast_ctr() does preempt_disable/enable it should see all
modifications done by CPU 0.

IOW. Suppose that the writer (CPU 0) does

	percpu_down_write();
	STORE;
	percpu_up_write();

This means

	STORE;
	synchronize_sched();
	mutex_unlock();

Now. Do you mean that the next preempt_disable/enable can see the
result of mutex_unlock() but not STORE?

Oleg.
* Re: [PATCH 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
2012-11-01 18:33 ` Oleg Nesterov
@ 2012-11-02 16:18 ` Oleg Nesterov
0 siblings, 0 replies; 8+ messages in thread
From: Oleg Nesterov @ 2012-11-02 16:18 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Mikulas Patocka, Peter Zijlstra, Linus Torvalds, Ingo Molnar,
	Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov,
	linux-kernel

On 11/01, Oleg Nesterov wrote:
>
> On 11/01, Paul E. McKenney wrote:
> >
> > OK, so it looks to me that this code relies on synchronize_sched()
> > forcing a memory barrier on each CPU executing in the kernel.
>
> No, the patch tries to avoid this assumption, but probably I missed
> something.
>
> > 1. A task running on CPU 0 currently write-holds the lock.
> >
> > 2. CPU 1 is running in the kernel, executing a longer-than-average
> >    loop of normal instructions (no atomic instructions or memory
> >    barriers).
> >
> > 3. CPU 0 invokes percpu_up_write(), calling up_write(),
> >    synchronize_sched(), and finally mutex_unlock().
>
> And my expectation was, this should be enough because ...
>
> > 4. CPU 1 executes percpu_down_read(), which calls update_fast_ctr(),
>
> since update_fast_ctr() does preempt_disable/enable it should see all
> modifications done by CPU 0.
>
> IOW. Suppose that the writer (CPU 0) does
>
> 	percpu_down_write();
> 	STORE;
> 	percpu_up_write();
>
> This means
>
> 	STORE;
> 	synchronize_sched();
> 	mutex_unlock();
>
> Now. Do you mean that the next preempt_disable/enable can see the
> result of mutex_unlock() but not STORE?

So far I think this is not possible, so the code doesn't need the
additional wstate/barriers.

> > +static bool update_fast_ctr(struct percpu_rw_semaphore *brw, int val)
> > +{
> > +	bool success = false;
>
> 	int state;
>
> > +
> > +	preempt_disable();
> > +	if (likely(!mutex_is_locked(&brw->writer_mutex))) {
>
> 	state = ACCESS_ONCE(brw->wstate);
> 	if (likely(!state)) {
>
> > +		__this_cpu_add(*brw->fast_read_ctr, val);
> > +		success = true;
>
> 	} else if (state & WSTATE_NEED_MB) {
> 		__this_cpu_add(*brw->fast_read_ctr, val);
> 		smp_mb(); /* Order increment against critical section. */
> 		success = true;
> 	}

...

> > +void percpu_up_write(struct percpu_rw_semaphore *brw)
> > +{
> > +	/* allow the new readers, but only the slow-path */
> > +	up_write(&brw->rw_sem);
>
> 	ACCESS_ONCE(brw->wstate) = WSTATE_NEED_MB;
>
> > +
> > +	/* insert the barrier before the next fast-path in down_read */
> > +	synchronize_sched();

But update_fast_ctr() should see mutex_is_locked(); obviously
down_write() must ensure this. So update_fast_ctr() can execute the
WSTATE_NEED_MB code only if it races with

> 	ACCESS_ONCE(brw->wstate) = 0;
>
> > +	mutex_unlock(&brw->writer_mutex);

these 2 stores and sees them in reverse order.

I guess that mutex_is_locked() in update_fast_ctr() looks a bit
confusing. It means no-fast-path for the reader; we could use ->state
instead. And even ->writer_mutex should go away if we want to optimize
the write-contended case, but I think this needs another patch on top
of this initial implementation.

Oleg.