Date: Thu, 1 Nov 2012 08:43:40 -0700
From: "Paul E. McKenney"
Reply-To: paulmck@linux.vnet.ibm.com
To: Oleg Nesterov
Cc: Mikulas Patocka, Peter Zijlstra, Linus Torvalds, Ingo Molnar,
        Srikar Dronamraju, Ananth N Mavinakayanahalli, Anton Arapov,
        linux-kernel@vger.kernel.org
Subject: Re: [PATCH 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
Message-ID: <20121101154340.GH3027@linux.vnet.ibm.com>
References: <20121017165902.GB9872@redhat.com>
 <20121017224430.GC2518@linux.vnet.ibm.com>
 <20121018162409.GA28504@redhat.com>
 <20121018163833.GK2518@linux.vnet.ibm.com>
 <20121018175747.GA30691@redhat.com>
 <20121019192838.GM2518@linux.vnet.ibm.com>
 <20121030184800.GA16129@redhat.com>
 <20121031194135.GA504@redhat.com>
 <20121031194158.GB504@redhat.com>
In-Reply-To: <20121031194158.GB504@redhat.com>

On Wed, Oct 31, 2012 at 08:41:58PM +0100, Oleg Nesterov wrote:
> Currently the writer does msleep() plus synchronize_sched() 3 times
> to acquire/release the semaphore, and during this time the readers
> are blocked completely.  Even if the "write" section was not actually
> started or if it was already finished.
>
> With this patch down_write/up_write does synchronize_sched() twice
> and down_read/up_read are still possible during this time, just they
> use the slow path.
>
> percpu_down_write() first forces the readers to use rw_semaphore and
> increment the "slow" counter to take the lock for reading, then it
> takes that rw_semaphore for writing and blocks the readers.
>
> Also. With this patch the code relies on the documented behaviour of
> synchronize_sched(), it doesn't try to pair synchronize_sched() with
> barrier.

OK, so it looks to me that this code relies on synchronize_sched()
forcing a memory barrier on each CPU executing in the kernel.  I might
well be confused, so here is the sequence of events that leads me to
believe this:

1.      A task running on CPU 0 currently write-holds the lock.

2.      CPU 1 is running in the kernel, executing a longer-than-average
        loop of normal instructions (no atomic instructions or memory
        barriers).

3.      CPU 0 invokes percpu_up_write(), calling up_write(),
        synchronize_sched(), and finally mutex_unlock().

4.      CPU 1 executes percpu_down_read(), which calls update_fast_ctr(),
        which finds that ->writer_mutex is not held.  CPU 1 therefore
        increments ->fast_read_ctr and returns success.

Of course, as Mikulas pointed out, the actual implementation will have
forced CPU 1 to execute a memory barrier in the course of the
synchronize_sched() implementation.  However, if synchronize_sched()
had been modified to act as synchronize_srcu() currently does, there
would be no memory barrier, and thus no guarantee that CPU 1's
subsequent read-side critical section would see the effect of CPU 0's
previous write-side critical section.
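For concreteness, here is a condensed sketch of the two paths in play,
abbreviated from update_fast_ctr() and percpu_up_write() in the patch
quoted below (slow path and error handling omitted).  The reader's
increment contains no barrier of its own, so the only thing ordering it
against the writer's preceding critical section is whatever barrier
synchronize_sched() is assumed to imply on CPU 1:

        /* CPU 1: reader fast path, condensed from update_fast_ctr() below. */
        preempt_disable();
        if (likely(!mutex_is_locked(&brw->writer_mutex))) {
                __this_cpu_add(*brw->fast_read_ctr, +1); /* no barrier here */
                success = true;
        }
        preempt_enable();

        /* CPU 0: writer release, condensed from percpu_up_write() below. */
        up_write(&brw->rw_sem);
        synchronize_sched();  /* assumed to force a full barrier on CPU 1 */
        mutex_unlock(&brw->writer_mutex);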
Fortunately, this is easy to fix, with zero added overhead on the
read-side fastpath, as shown by the notes interspersed below.

Thoughts?

                                                        Thanx, Paul

> Signed-off-by: Oleg Nesterov
> ---
>  include/linux/percpu-rwsem.h |   83 +++++---------------------
>  lib/Makefile                 |    2 +-
>  lib/percpu-rwsem.c           |  106 ++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 120 insertions(+), 71 deletions(-)
>  create mode 100644 lib/percpu-rwsem.c
>
> diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
> index 250a4ac..7f738ca 100644
> --- a/include/linux/percpu-rwsem.h
> +++ b/include/linux/percpu-rwsem.h
> @@ -2,82 +2,25 @@
>  #define _LINUX_PERCPU_RWSEM_H
>
>  #include
> +#include
>  #include
> -#include
> -#include
> +#include
>
>  struct percpu_rw_semaphore {
> -        unsigned __percpu *counters;
> -        bool locked;
> -        struct mutex mtx;
> +        int __percpu *fast_read_ctr;
> +        struct mutex writer_mutex;
> +        struct rw_semaphore rw_sem;
> +        atomic_t slow_read_ctr;
> +        wait_queue_head_t write_waitq;

        int wstate;

> };
>
> -#define light_mb() barrier()
> -#define heavy_mb() synchronize_sched()
> +extern void percpu_down_read(struct percpu_rw_semaphore *);
> +extern void percpu_up_read(struct percpu_rw_semaphore *);
>
> -static inline void percpu_down_read(struct percpu_rw_semaphore *p)
> -{
> -        rcu_read_lock_sched();
> -        if (unlikely(p->locked)) {
> -                rcu_read_unlock_sched();
> -                mutex_lock(&p->mtx);
> -                this_cpu_inc(*p->counters);
> -                mutex_unlock(&p->mtx);
> -                return;
> -        }
> -        this_cpu_inc(*p->counters);
> -        rcu_read_unlock_sched();
> -        light_mb(); /* A, between read of p->locked and read of data, paired with D */
> -}
> +extern void percpu_down_write(struct percpu_rw_semaphore *);
> +extern void percpu_up_write(struct percpu_rw_semaphore *);
>
> -static inline void percpu_up_read(struct percpu_rw_semaphore *p)
> -{
> -        light_mb(); /* B, between read of the data and write to p->counter, paired with C */
> -        this_cpu_dec(*p->counters);
> -}
> -
> -static inline unsigned __percpu_count(unsigned __percpu *counters)
> -{
> -        unsigned total = 0;
> -        int cpu;
> -
> -        for_each_possible_cpu(cpu)
> -                total += ACCESS_ONCE(*per_cpu_ptr(counters, cpu));
> -
> -        return total;
> -}
> -
> -static inline void percpu_down_write(struct percpu_rw_semaphore *p)
> -{
> -        mutex_lock(&p->mtx);
> -        p->locked = true;
> -        synchronize_sched(); /* make sure that all readers exit the rcu_read_lock_sched region */
> -        while (__percpu_count(p->counters))
> -                msleep(1);
> -        heavy_mb(); /* C, between read of p->counter and write to data, paired with B */
> -}
> -
> -static inline void percpu_up_write(struct percpu_rw_semaphore *p)
> -{
> -        heavy_mb(); /* D, between write to data and write to p->locked, paired with A */
> -        p->locked = false;
> -        mutex_unlock(&p->mtx);
> -}
> -
> -static inline int percpu_init_rwsem(struct percpu_rw_semaphore *p)
> -{
> -        p->counters = alloc_percpu(unsigned);
> -        if (unlikely(!p->counters))
> -                return -ENOMEM;
> -        p->locked = false;
> -        mutex_init(&p->mtx);
> -        return 0;
> -}
> -
> -static inline void percpu_free_rwsem(struct percpu_rw_semaphore *p)
> -{
> -        free_percpu(p->counters);
> -        p->counters = NULL; /* catch use after free bugs */
> -}
> +extern int percpu_init_rwsem(struct percpu_rw_semaphore *);
> +extern void percpu_free_rwsem(struct percpu_rw_semaphore *);
>
>  #endif
> diff --git a/lib/Makefile b/lib/Makefile
> index 821a162..4dad4a7 100644
> --- a/lib/Makefile
> +++ b/lib/Makefile
> @@ -12,7 +12,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
>          idr.o int_sqrt.o extable.o \
>          sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \
>          proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \
> -        is_single_threaded.o plist.o decompress.o
> +        is_single_threaded.o plist.o decompress.o percpu-rwsem.o
>
>  lib-$(CONFIG_MMU) += ioremap.o
>  lib-$(CONFIG_SMP) += cpumask.o
>
> diff --git a/lib/percpu-rwsem.c b/lib/percpu-rwsem.c
> new file mode 100644
> index 0000000..40a415d
> --- /dev/null
> +++ b/lib/percpu-rwsem.c
> @@ -0,0 +1,106 @@
> +#include
> +#include
> +#include

#define WSTATE_NEED_LOCK 1
#define WSTATE_NEED_MB 2

> +int percpu_init_rwsem(struct percpu_rw_semaphore *brw)
> +{
> +        brw->fast_read_ctr = alloc_percpu(int);
> +        if (unlikely(!brw->fast_read_ctr))
> +                return -ENOMEM;
> +
> +        mutex_init(&brw->writer_mutex);
> +        init_rwsem(&brw->rw_sem);
> +        atomic_set(&brw->slow_read_ctr, 0);
> +        init_waitqueue_head(&brw->write_waitq);
> +        return 0;
> +}
> +
> +void percpu_free_rwsem(struct percpu_rw_semaphore *brw)
> +{
> +        free_percpu(brw->fast_read_ctr);
> +        brw->fast_read_ctr = NULL; /* catch use after free bugs */
> +}
> +
> +static bool update_fast_ctr(struct percpu_rw_semaphore *brw, int val)
> +{
> +        bool success = false;

        int state;

> +
> +        preempt_disable();
> +        if (likely(!mutex_is_locked(&brw->writer_mutex))) {

        state = ACCESS_ONCE(brw->wstate);
        if (likely(!state)) {

> +                __this_cpu_add(*brw->fast_read_ctr, val);
> +                success = true;

        } else if (state & WSTATE_NEED_MB) {
                __this_cpu_add(*brw->fast_read_ctr, val);
                smp_mb(); /* Order increment against critical section. */
                success = true;
        }

> +        preempt_enable();
> +
> +        return success;
> +}
> +
> +void percpu_down_read(struct percpu_rw_semaphore *brw)
> +{
> +        if (likely(update_fast_ctr(brw, +1)))
> +                return;
> +
> +        down_read(&brw->rw_sem);
> +        atomic_inc(&brw->slow_read_ctr);
> +        up_read(&brw->rw_sem);
> +}
> +
> +void percpu_up_read(struct percpu_rw_semaphore *brw)
> +{
> +        if (likely(update_fast_ctr(brw, -1)))
> +                return;
> +
> +        /* false-positive is possible but harmless */
> +        if (atomic_dec_and_test(&brw->slow_read_ctr))
> +                wake_up_all(&brw->write_waitq);
> +}
> +
> +static int clear_fast_read_ctr(struct percpu_rw_semaphore *brw)
> +{
> +        int cpu, sum = 0;
> +
> +        for_each_possible_cpu(cpu) {
> +                sum += per_cpu(*brw->fast_read_ctr, cpu);
> +                per_cpu(*brw->fast_read_ctr, cpu) = 0;
> +        }
> +
> +        return sum;
> +}
> +
> +void percpu_down_write(struct percpu_rw_semaphore *brw)
> +{
> +        /* also blocks update_fast_ctr() which checks mutex_is_locked() */
> +        mutex_lock(&brw->writer_mutex);

        ACCESS_ONCE(brw->wstate) = WSTATE_NEED_LOCK;

> +        /*
> +         * 1. Ensures mutex_is_locked() is visible to any down_read/up_read
> +         *    so that update_fast_ctr() can't succeed.
> +         *
> +         * 2. Ensures we see the result of every previous this_cpu_add() in
> +         *    update_fast_ctr().
> +         *
> +         * 3. Ensures that if any reader has exited its critical section via
> +         *    fast-path, it executes a full memory barrier before we return.
> +         */
> +        synchronize_sched();
> +
> +        /* nobody can use fast_read_ctr, move its sum into slow_read_ctr */
> +        atomic_add(clear_fast_read_ctr(brw), &brw->slow_read_ctr);
> +
> +        /* block the new readers completely */
> +        down_write(&brw->rw_sem);
> +
> +        /* wait for all readers to complete their percpu_up_read() */
> +        wait_event(brw->write_waitq, !atomic_read(&brw->slow_read_ctr));
> +}
> +
> +void percpu_up_write(struct percpu_rw_semaphore *brw)
> +{
> +        /* allow the new readers, but only the slow-path */
> +        up_write(&brw->rw_sem);

        ACCESS_ONCE(brw->wstate) = WSTATE_NEED_MB;

> +
> +        /* insert the barrier before the next fast-path in down_read */
> +        synchronize_sched();

        ACCESS_ONCE(brw->wstate) = 0;

> +        mutex_unlock(&brw->writer_mutex);
> +}

OK, o