Date: Tue, 29 Mar 2016 22:20:50 +0200
From: Peter Zijlstra
To: Waiman Long
Cc: Ingo Molnar, linux-kernel@vger.kernel.org, Scott J Norton, Douglas Hatch, Will Deacon
Subject: Re: [PATCH] locking/qrwlock: Allow multiple spinning readers
Message-ID: <20160329202050.GN3408@twins.programming.kicks-ass.net>
References: <1458444079-59601-1-git-send-email-Waiman.Long@hpe.com>
In-Reply-To: <1458444079-59601-1-git-send-email-Waiman.Long@hpe.com>

On Sat, Mar 19, 2016 at 11:21:19PM -0400, Waiman Long wrote:
> In qrwlock, the reader that is spinning on the lock will need to notify
> the next reader in the queue when the lock is free. That introduces a
> reader-to-reader latency that is not present in the original rwlock.

How did you find this 'problem'?

> That is the price for reducing lock cacheline contention. It also
> reduces the performance benefit of qrwlock on reader-heavy workloads.
>
> However, if we allow a limited number of readers to spin on the
> lock simultaneously, we can eliminate some of the reader-to-reader
> latencies at the expense of a bit more cacheline contention and
> probably more power consumption.

So the embedded people might not like that much.

> This patch changes the reader slowpath to allow multiple readers to
> spin on the lock. The maximum number of concurrent readers allowed
> is currently set to 4 to limit the amount of additional cacheline
> contention while improving reader performance on most workloads. If
> a writer comes to the queue head, however, it will stop additional
> readers from coming forward.
>
> Using a multi-threaded locking microbenchmark on a 4-socket 40-core
> Haswell-EX system, the locking throughput of the 4.5-rc6 kernel with
> and without the patch was as follows:

Do you have an actual real-world benchmark where this makes a difference?

> /**
>  * queued_read_lock_slowpath - acquire read lock of a queue rwlock
>  * @lock: Pointer to queue rwlock structure
>  * @cnts: Current qrwlock lock value
>  */
> void queued_read_lock_slowpath(struct qrwlock *lock, u32 cnts)
> {
> +	bool locked = true;
> +
> 	/*
> 	 * Readers come here when they cannot get the lock without waiting
> 	 */
> @@ -78,7 +71,10 @@ void queued_read_lock_slowpath(struct qrwlock *lock, u32 cnts)
> 		 * semantics) until the lock is available without waiting in
> 		 * the queue.
> 		 */
> +		while ((cnts & _QW_WMASK) == _QW_LOCKED) {
> +			cpu_relax_lowlatency();
> +			cnts = atomic_read_acquire(&lock->cnts);
> +		}
> 		return;
> 	}
> 	atomic_sub(_QR_BIAS, &lock->cnts);
> @@ -92,14 +88,31 @@ void queued_read_lock_slowpath(struct qrwlock *lock, u32 cnts)
> 	 * The ACQUIRE semantics of the following spinning code ensure
> 	 * that accesses can't leak upwards out of our subsequent critical
> 	 * section in the case that the lock is currently held for write.
> +	 *
> +	 * The reader increments the reader count & waits until the writer
> +	 * releases the lock.
> 	 */
> 	cnts = atomic_add_return_acquire(_QR_BIAS, &lock->cnts) - _QR_BIAS;
> +	while ((cnts & _QW_WMASK) == _QW_LOCKED) {
> +		if (locked && ((cnts >> _QR_SHIFT) < MAX_SPINNING_READERS)) {
> +			/*
> +			 * Unlock the wait queue so that more readers can
> +			 * come forward and wait for the writer to exit,
> +			 * as long as no more than MAX_SPINNING_READERS
> +			 * readers are present.
> +			 */
> +			arch_spin_unlock(&lock->wait_lock);
> +			locked = false;

Only 1 more can come forward with this logic. How can you ever get to 4?

Also, what says the next in queue is a reader?

> +		}
> +		cpu_relax_lowlatency();
> +		cnts = atomic_read_acquire(&lock->cnts);
> +	}
>
> 	/*
> 	 * Signal the next one in queue to become queue head
> 	 */
> +	if (locked)
> +		arch_spin_unlock(&lock->wait_lock);
> }
> EXPORT_SYMBOL(queued_read_lock_slowpath);
>
> --
> 1.7.1
>
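
For anyone reading along without the header in front of them, below is a
quick standalone sketch of how the cnts word tested in the hunks above is
decoded. The constant values are my recollection of
include/asm-generic/qrwlock.h around v4.5, so treat them as illustrative
rather than authoritative; MAX_SPINNING_READERS is the limit the patch
proposes. It builds on its own, outside the kernel:

/*
 * Standalone sketch, not kernel code: how the qrwlock count word is
 * decoded by the checks quoted above.  Constant values are as I
 * recall them from include/asm-generic/qrwlock.h around v4.5;
 * MAX_SPINNING_READERS is the limit proposed by the patch.
 *
 *   bits 8..31: reader count, one _QR_BIAS per reader
 *   bits 0..7 : writer state, 0xff when a writer holds the lock
 */
#include <stdio.h>

#define _QW_LOCKED		0xffU		/* a writer holds the lock */
#define _QW_WMASK		0xffU		/* writer byte mask        */
#define _QR_SHIFT		8		/* reader count shift      */
#define _QR_BIAS		(1U << _QR_SHIFT)
#define MAX_SPINNING_READERS	4		/* from the patch          */

static void decode(unsigned int cnts)
{
	unsigned int readers = cnts >> _QR_SHIFT;
	int wlocked = (cnts & _QW_WMASK) == _QW_LOCKED;

	printf("cnts=%#010x readers=%u writer=%s below MAX_SPINNING_READERS=%s\n",
	       cnts, readers, wlocked ? "yes" : "no",
	       readers < MAX_SPINNING_READERS ? "yes" : "no");
}

int main(void)
{
	decode(_QW_LOCKED);			/* writer only, no readers yet    */
	decode(_QW_LOCKED + 1 * _QR_BIAS);	/* writer + 1 spinning reader     */
	decode(_QW_LOCKED + 4 * _QR_BIAS);	/* writer + 4 readers: at the cap */
	return 0;
}

With that layout, (cnts & _QW_WMASK) == _QW_LOCKED reads as "a writer
currently holds the lock", and cnts >> _QR_SHIFT is the number of readers
already accounted for, including the one doing the test, which is what the
MAX_SPINNING_READERS comparison in the patch limits.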