Re: [PATCH] locking/qrwlock: Allow multiple spinning readers

From: Waiman Long <waiman.long@hpe.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>, <linux-kernel@vger.kernel.org>,
	Scott J Norton <scott.norton@hpe.com>,
	Douglas Hatch <doug.hatch@hpe.com>,
	Will Deacon <will.deacon@arm.com>
Subject: Re: [PATCH] locking/qrwlock: Allow multiple spinning readers
Date: Thu, 31 Mar 2016 18:12:38 -0400
Message-ID: <56FDA0D6.4090904@hpe.com>
In-Reply-To: <20160329202050.GN3408@twins.programming.kicks-ass.net>

On 03/29/2016 04:20 PM, Peter Zijlstra wrote:
> On Sat, Mar 19, 2016 at 11:21:19PM -0400, Waiman Long wrote:
>> In qrwlock, the reader that is spinning on the lock will need to notify
>> the next reader in the queue when the lock is free. That introduces a
>> reader-to-reader latency that is not present in the original rwlock.
> How did you find this 'problem'?

I am constantly on the lookout for twists that can make the code run
faster. That change turned out to be good for reader performance, so I
sent it out to solicit feedback.

>> That is the price for reducing lock cacheline contention. It also
>> reduces the performance benefit of qrwlock on reader heavy workloads.
>>
>> However, if we allow a limited number of readers to spin on the
>> lock simultaneously, we can eliminate some of the reader-to-reader
>> latencies at the expense of a bit more cacheline contention and
>> probably more power consumption.
> So the embedded people might not like that much.

It could be. It is always a compromise.

>> This patch changes the reader slowpath to allow multiple readers to
>> spin on the lock. The maximum number of concurrent readers allowed
>> is currently set to 4 to limit the amount of additional cacheline
>> contention while improving reader performance on most workloads. If
>> a writer comes to the queue head, however, it will stop additional
>> readers from coming out.
>>
>> Using a multi-threaded locking microbenchmark on a 4-socket 40-core
>> Haswell-EX system, the locking throughputs of a 4.5-rc6 kernel with
>> and without the patch were as follows:
> Do you have an actual real world benchmark where this makes a
> difference?

Not yet. I will look for a real-world workload where it matters.

>>   /**
>>    * queued_read_lock_slowpath - acquire read lock of a queue rwlock
>>    * @lock: Pointer to queue rwlock structure
>>    * @cnts: Current qrwlock lock value
>>    */
>>   void queued_read_lock_slowpath(struct qrwlock *lock, u32 cnts)
>>   {
>> +	bool locked = true;
>> +
>>   	/*
>>   	 * Readers come here when they cannot get the lock without waiting
>>   	 */
>> @@ -78,7 +71,10 @@ void queued_read_lock_slowpath(struct qrwlock *lock, u32 cnts)
>>   		 * semantics) until the lock is available without waiting in
>>   		 * the queue.
>>   		 */
>> +		while ((cnts & _QW_WMASK) == _QW_LOCKED) {
>> +			cpu_relax_lowlatency();
>> +			cnts = atomic_read_acquire(&lock->cnts);
>> +		}
>>   		return;
>>   	}
>>   	atomic_sub(_QR_BIAS, &lock->cnts);
>> @@ -92,14 +88,31 @@ void queued_read_lock_slowpath(struct qrwlock *lock, u32 cnts)
>>   	 * The ACQUIRE semantics of the following spinning code ensure
>>   	 * that accesses can't leak upwards out of our subsequent critical
>>   	 * section in the case that the lock is currently held for write.
>> +	 *
>> +	 * The reader increments the reader count & waits until the writer
>> +	 * releases the lock.
>>   	 */
>>   	cnts = atomic_add_return_acquire(_QR_BIAS, &lock->cnts) - _QR_BIAS;
>> +	while ((cnts & _QW_WMASK) == _QW_LOCKED) {
>> +		if (locked && ((cnts >> _QR_SHIFT) < MAX_SPINNING_READERS)) {
>> +			/*
>> +			 * Unlock the wait queue so that more readers can
>> +			 * come forward and wait for the writer to exit
>> +			 * as long as no more than MAX_SPINNING_READERS
>> +			 * readers are present.
>> +			 */
>> +			arch_spin_unlock(&lock->wait_lock);
>> +			locked = false;
> Only 1 more can come forward with this logic. How can you ever get to 4?

Yes, each reader in the unlock path will release one in the queue. If
the next one is also a reader, it will release one more, and so on until
the reader count reaches 4, at which point the process stops.
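
To make the chain concrete, here is a rough user-space sketch of the
cascade. It is not the kernel code, only a toy model. The constants
(writer byte 0xff, reader count starting at bit 8, a limit of 4) are my
assumptions mirroring the qrwlock header and the changelog, and
cascade_handoffs() is a made-up name. As in the hunk above, each reader
compares the count it saw before its own increment against the limit.

```c
#include <stdint.h>

/* Assumed values, modelled on the qrwlock header of that era. */
#define QW_LOCKED            0xffu             /* a writer holds the lock   */
#define QR_SHIFT             8                 /* reader count starts here  */
#define QR_BIAS              (1u << QR_SHIFT)  /* one reader in the count   */
#define MAX_SPINNING_READERS 4u                /* limit from the changelog  */

/*
 * Hypothetical model: a writer holds the lock for the whole run and
 * `queued_readers` readers wait in the queue. Returns how many of them
 * release the queue lock so that the next waiter can come forward.
 */
static unsigned int cascade_handoffs(unsigned int queued_readers)
{
	uint32_t lock_cnts = QW_LOCKED;  /* writer in, no readers yet */
	unsigned int handoffs = 0;

	for (unsigned int i = 0; i < queued_readers; i++) {
		/* Value seen before this reader's own increment. */
		uint32_t cnts = lock_cnts;

		lock_cnts += QR_BIAS;

		if ((cnts >> QR_SHIFT) < MAX_SPINNING_READERS)
			handoffs++;   /* unlock the queue, next one comes out */
		else
			break;        /* limit hit: keep holding the queue lock */
	}
	return handoffs;
}
```

With eight readers queued behind the writer, this model hands the queue
lock on four times and then stalls. With only two readers queued, both
hand it on and the queue simply runs empty.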

>
> Also, what says the next in queue is a reader?

I did say in the changelog that the queue head could be a writer. In 
that case, the process will stop and the writer will wait until all the 
readers are gone.
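
The same idea at the queue level, again only as a hedged sketch: the
queue is written as a string of 'R' and 'W' waiters (my notation, not
anything in the patch), readers_released() is a hypothetical helper, and
the limit of 4 is taken from the changelog. A writer reaching the head
keeps the queue lock, so nothing behind it comes out.

```c
#include <stddef.h>

#define MAX_SPINNING_READERS 4  /* limit from the changelog */

/*
 * Hypothetical model: `queue` lists the waiters from head to tail,
 * 'R' for a reader and 'W' for a writer. Returns how many readers
 * leave the queue to spin on the lock before the cascade stops.
 */
static size_t readers_released(const char *queue)
{
	size_t released = 0;

	for (size_t i = 0; queue[i] != '\0'; i++) {
		if (queue[i] != 'R')
			break;  /* writer at the head holds the queue lock */
		if (released == MAX_SPINNING_READERS)
			break;  /* enough readers are spinning already */
		released++;
	}
	return released;
}
```

For a queue of "RRWRR" the model releases two readers and then stops at
the writer. For "RRRRRR" it releases four and leaves the rest queued.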

Cheers,
Longman