Re: [PATCH] locking/rwsem: Optimize down_read_trylock() under highly contended case

From: Waiman Long <longman@redhat.com>
To: Peter Zijlstra <peterz@infradead.org>,
	Muchun Song <songmuchun@bytedance.com>
Cc: mingo@redhat.com, will@kernel.org, boqun.feng@gmail.com,
	linux-kernel@vger.kernel.org, duanxiongchun@bytedance.com,
	zhengqi.arch@bytedance.com
Subject: Re: [PATCH] locking/rwsem: Optimize down_read_trylock() under highly contended case
Date: Thu, 18 Nov 2021 13:12:03 -0500
Message-ID: <54f5bdcf-8b6e-0e53-8f28-5d5c3eb5f7ad@redhat.com>
In-Reply-To: <YZZNv3JflBYwRjdd@hirez.programming.kicks-ass.net>

On 11/18/21 07:57, Peter Zijlstra wrote:
> On Thu, Nov 18, 2021 at 05:44:55PM +0800, Muchun Song wrote:
>
>> By using the above benchmark, the real executing time on a x86-64 system
>> before and after the patch were:
> What kind of x86_64 ?
>
>>                    Before Patch  After Patch
>>     # of Threads      real          real     reduced by
>>     ------------     ------        ------    ----------
>>           1          65,373        65,206       ~0.0%
>>           4          15,467        15,378       ~0.5%
>>          40           6,214         5,528      ~11.0%
>>
>> For the uncontended case, the new down_read_trylock() is the same as
>> before. For the contended cases, the new down_read_trylock() is faster
> >> than before. The more contended, the faster.
>>
>> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
>> ---
>>   kernel/locking/rwsem.c | 11 ++++-------
>>   1 file changed, 4 insertions(+), 7 deletions(-)
>>
>> diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
>> index c51387a43265..ef2b2a3f508c 100644
>> --- a/kernel/locking/rwsem.c
>> +++ b/kernel/locking/rwsem.c
>> @@ -1249,17 +1249,14 @@ static inline int __down_read_trylock(struct rw_semaphore *sem)
>>   
>>   	DEBUG_RWSEMS_WARN_ON(sem->magic != sem, sem);
>>   
>> -	/*
>> -	 * Optimize for the case when the rwsem is not locked at all.
>> -	 */
>> -	tmp = RWSEM_UNLOCKED_VALUE;
>> -	do {
>> +	tmp = atomic_long_read(&sem->count);
>> +	while (!(tmp & RWSEM_READ_FAILED_MASK)) {
>>   		if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
>> -					tmp + RWSEM_READER_BIAS)) {
>> +						    tmp + RWSEM_READER_BIAS)) {
>>   			rwsem_set_reader_owned(sem);
>>   			return 1;
>>   		}
>> -	} while (!(tmp & RWSEM_READ_FAILED_MASK));
>> +	}
>>   	return 0;
>>   }
> This is weird... so the only difference is that leading load, but given
> contention you'd expect that load to stall, also, given it's a
> non-exclusive load, to get stolen by a competing CPU. Whereas the old
> code would start with a cmpxchg, which obviously will also stall, but
> does an exclusive load.
>
> And the thinking is that the exclusive load and the presence of the
> cmpxchg loop would keep the line on that CPU for a little while and
> progress is made.
>
> Clearly this isn't working as expected. Also I suppose it would need
> wider testing...

For a contended case, doing a shared read first before doing an
exclusive cmpxchg can certainly help to reduce cacheline thrashing. I
have no objection to making this change.

I believe most of the other trylock functions do a read first before
doing an atomic operation. In essence, we assume the use of trylock
means the callers are expecting a contended lock whereas callers of
the regular *lock() functions are expecting an uncontended lock.
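
As a rough userspace illustration of the read-first pattern described
above, here is a minimal sketch using C11 atomics rather than the
kernel's atomic_long_*() API. The demo_* names and the bit values are
made up for this sketch and are not the kernel's definitions; only the
shape of the loop mirrors the patch.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit layout for illustration; NOT the kernel's values. */
#define DEMO_WRITER_LOCKED	(1L << 0)
#define DEMO_READ_FAILED_MASK	DEMO_WRITER_LOCKED
#define DEMO_READER_BIAS	(1L << 8)

struct demo_rwsem {
	atomic_long count;
};

/*
 * Read-first trylock: a plain (shared) load checks whether a read lock
 * can be taken at all, and only then is the exclusive cmpxchg issued.
 * If the lock is already unavailable, no atomic RMW is attempted.
 */
static bool demo_read_trylock(struct demo_rwsem *sem)
{
	long tmp = atomic_load_explicit(&sem->count, memory_order_relaxed);

	while (!(tmp & DEMO_READ_FAILED_MASK)) {
		/* On failure, tmp is refreshed with the current count. */
		if (atomic_compare_exchange_strong_explicit(&sem->count, &tmp,
				tmp + DEMO_READER_BIAS,
				memory_order_acquire,
				memory_order_relaxed))
			return true;
	}
	return false;
}

/* Drop one reader reference taken by demo_read_trylock(). */
static void demo_read_unlock(struct demo_rwsem *sem)
{
	atomic_fetch_sub_explicit(&sem->count, DEMO_READER_BIAS,
				  memory_order_release);
}
```

With the count initialized to 0, demo_read_trylock() succeeds and adds
one reader bias; once DEMO_WRITER_LOCKED is set, it returns false after
the initial load without touching the count.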

Acked-by: Waiman Long <longman@redhat.com>

-Longman