* [PATCH] locking/rwsem: Optimize down_read_trylock() under highly contended case
@ 2021-11-18  9:44 Muchun Song
  2021-11-18 12:57 ` Peter Zijlstra
  2021-11-23  8:53 ` [tip: locking/urgent] " tip-bot2 for Muchun Song
  0 siblings, 2 replies; 5+ messages in thread
From: Muchun Song @ 2021-11-18  9:44 UTC (permalink / raw)
  To: peterz, mingo, will, longman, boqun.feng
  Cc: linux-kernel, duanxiongchun, zhengqi.arch, Muchun Song

We found that a process with 10 thousand threads hit a performance
regression when migrating from Linux-v4.14 to Linux-v5.4. The workload
sometimes allocates lots of memory concurrently from many threads, and
in that case down_read_trylock() shows up as a hotspot in the profile.
Therefore, we suspect that rwsem has had a regression since at least
Linux-v5.4. In order to make the problem easy to debug, we wrote a
simple benchmark that recreates a similar situation, shown below.

  ```c++
  #include <sys/mman.h>
  #include <sys/time.h>
  #include <sys/resource.h>
  #include <sched.h>
  #include <pthread.h>

  #include <cstdio>
  #include <cstdlib>
  #include <cassert>
  #include <thread>
  #include <vector>
  #include <chrono>

  volatile int mutex;

  void trigger(int cpu, char* ptr, std::size_t sz)
  {
  	cpu_set_t set;
  	CPU_ZERO(&set);
  	CPU_SET(cpu, &set);
  	assert(pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0);

  	while (mutex);

  	for (std::size_t i = 0; i < sz; i += 4096) {
  		*ptr = '\0';
  		ptr += 4096;
  	}
  }

  int main(int argc, char* argv[])
  {
  	std::size_t sz = 100;

  	if (argc > 1)
  		sz = atoi(argv[1]);

  	auto nproc = std::thread::hardware_concurrency();
  	std::vector<std::thread> thr;
  	sz <<= 30;
  	auto* ptr = mmap(nullptr, sz, PROT_READ | PROT_WRITE, MAP_ANON |
			 MAP_PRIVATE, -1, 0);
  	assert(ptr != MAP_FAILED);
  	char* cptr = static_cast<char*>(ptr);
  	auto run = sz / nproc;
  	run = (run >> 12) << 12;

  	mutex = 1;

  	for (auto i = 0U; i < nproc; ++i) {
  		thr.emplace_back(std::thread([i, cptr, run]() { trigger(i, cptr, run); }));
  		cptr += run;
  	}

  	rusage usage_start;
  	getrusage(RUSAGE_SELF, &usage_start);
  	auto start = std::chrono::system_clock::now();

  	mutex = 0;

  	for (auto& t : thr)
  		t.join();

  	rusage usage_end;
  	getrusage(RUSAGE_SELF, &usage_end);
  	auto end = std::chrono::system_clock::now();
  	timeval utime;
  	timeval stime;
  	timersub(&usage_end.ru_utime, &usage_start.ru_utime, &utime);
  	timersub(&usage_end.ru_stime, &usage_start.ru_stime, &stime);
  	printf("usr: %ld.%06ld\n", utime.tv_sec, utime.tv_usec);
  	printf("sys: %ld.%06ld\n", stime.tv_sec, stime.tv_usec);
  	printf("real: %lu\n",
  	       std::chrono::duration_cast<std::chrono::milliseconds>(end -
  	       start).count());

  	return 0;
  }
  ```

The program above simply creates `nproc` threads, each of which
touches memory (triggering page faults) on a different CPU. While it
runs, we see a profile like the following in `perf top`.

  25.55%  [kernel]                  [k] down_read_trylock
  14.78%  [kernel]                  [k] handle_mm_fault
  13.45%  [kernel]                  [k] up_read
   8.61%  [kernel]                  [k] clear_page_erms
   3.89%  [kernel]                  [k] __do_page_fault

The hottest instruction in down_read_trylock(), accounting for about
92% of its samples, is the cmpxchg shown below.

  91.89 │      lock   cmpxchg %rdx,(%rdi)

Since the problem appeared when migrating from Linux-v4.14 to
Linux-v5.4, it was easy to track it down to commit ddb20d1d3aed
("locking/rwsem: Optimize down_read_trylock()"). The reason is that the
commit assumes the rwsem is not contended at all, which is not always
true for the mmap lock, which can be contended by thousands of threads.
As a result, most threads need to execute the cmpxchg at least twice to
acquire the lock, and atomic operations are more expensive than
non-atomic instructions, which caused the regression.
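
To illustrate the failure mode, here is a minimal user-space sketch
(not kernel code; the bias value and the 0xff failure mask only
roughly approximate the real rwsem layout) of why the old trylock
needs at least two cmpxchg's once other readers already hold the lock:

  ```c
  #include <stdatomic.h>
  #include <stdio.h>

  #define READER_BIAS	(1L << 8)	/* stand-in for RWSEM_READER_BIAS */
  #define UNLOCKED	0L		/* stand-in for RWSEM_UNLOCKED_VALUE */

  /* Model of sem->count with three readers already holding the lock. */
  static _Atomic long count = 3 * READER_BIAS;

  static int old_trylock(void)
  {
  	long tmp = UNLOCKED;		/* assume "not locked at all" */
  	int attempts = 0;

  	do {
  		attempts++;
  		/* The 1st try expects 0 but sees 3 * READER_BIAS and fails;
  		 * tmp is refreshed to the observed value, so the 2nd try
  		 * usually succeeds. */
  		if (atomic_compare_exchange_strong(&count, &tmp,
  						   tmp + READER_BIAS)) {
  			printf("acquired after %d cmpxchg(s)\n", attempts);
  			return 1;
  		}
  	} while (!(tmp & 0xff));	/* rough RWSEM_READ_FAILED_MASK check */

  	return 0;
  }

  int main(void)
  {
  	return !old_trylock();		/* prints: acquired after 2 cmpxchg(s) */
  }
  ```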

By using the above benchmark, the real execution time (in
milliseconds) on an x86-64 system before and after the patch was:

                  Before Patch  After Patch
   # of Threads      real          real     reduced by
   ------------     ------        ------    ----------
         1          65,373        65,206       ~0.0%
         4          15,467        15,378       ~0.5%
        40           6,214         5,528      ~11.0%

For the uncontended case, the new down_read_trylock() is the same as
before. For the contended cases, the new down_read_trylock() is faster
than before. The more contended the rwsem is, the larger the win.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 kernel/locking/rwsem.c | 11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index c51387a43265..ef2b2a3f508c 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -1249,17 +1249,14 @@ static inline int __down_read_trylock(struct rw_semaphore *sem)
 
 	DEBUG_RWSEMS_WARN_ON(sem->magic != sem, sem);
 
-	/*
-	 * Optimize for the case when the rwsem is not locked at all.
-	 */
-	tmp = RWSEM_UNLOCKED_VALUE;
-	do {
+	tmp = atomic_long_read(&sem->count);
+	while (!(tmp & RWSEM_READ_FAILED_MASK)) {
 		if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
-					tmp + RWSEM_READER_BIAS)) {
+						    tmp + RWSEM_READER_BIAS)) {
 			rwsem_set_reader_owned(sem);
 			return 1;
 		}
-	} while (!(tmp & RWSEM_READ_FAILED_MASK));
+	}
 	return 0;
 }
 
-- 
2.11.0



* Re: [PATCH] locking/rwsem: Optimize down_read_trylock() under highly contended case
  2021-11-18  9:44 [PATCH] locking/rwsem: Optimize down_read_trylock() under highly contended case Muchun Song
@ 2021-11-18 12:57 ` Peter Zijlstra
  2021-11-18 18:12   ` Waiman Long
  2021-11-19  2:52   ` Muchun Song
  2021-11-23  8:53 ` [tip: locking/urgent] " tip-bot2 for Muchun Song
  1 sibling, 2 replies; 5+ messages in thread
From: Peter Zijlstra @ 2021-11-18 12:57 UTC (permalink / raw)
  To: Muchun Song
  Cc: mingo, will, longman, boqun.feng, linux-kernel, duanxiongchun,
	zhengqi.arch

On Thu, Nov 18, 2021 at 05:44:55PM +0800, Muchun Song wrote:

> By using the above benchmark, the real execution time (in
> milliseconds) on an x86-64 system before and after the patch was:

What kind of x86_64 ?

> 
>                   Before Patch  After Patch
>    # of Threads      real          real     reduced by
>    ------------     ------        ------    ----------
>          1          65,373        65,206       ~0.0%
>          4          15,467        15,378       ~0.5%
>         40           6,214         5,528      ~11.0%
> 
> For the uncontended case, the new down_read_trylock() is the same as
> before. For the contended cases, the new down_read_trylock() is faster
> than before. The more contended the rwsem is, the larger the win.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> ---
>  kernel/locking/rwsem.c | 11 ++++-------
>  1 file changed, 4 insertions(+), 7 deletions(-)
> 
> diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
> index c51387a43265..ef2b2a3f508c 100644
> --- a/kernel/locking/rwsem.c
> +++ b/kernel/locking/rwsem.c
> @@ -1249,17 +1249,14 @@ static inline int __down_read_trylock(struct rw_semaphore *sem)
>  
>  	DEBUG_RWSEMS_WARN_ON(sem->magic != sem, sem);
>  
> -	/*
> -	 * Optimize for the case when the rwsem is not locked at all.
> -	 */
> -	tmp = RWSEM_UNLOCKED_VALUE;
> -	do {
> +	tmp = atomic_long_read(&sem->count);
> +	while (!(tmp & RWSEM_READ_FAILED_MASK)) {
>  		if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
> -					tmp + RWSEM_READER_BIAS)) {
> +						    tmp + RWSEM_READER_BIAS)) {
>  			rwsem_set_reader_owned(sem);
>  			return 1;
>  		}
> -	} while (!(tmp & RWSEM_READ_FAILED_MASK));
> +	}
>  	return 0;
>  }

This is weird... so the only difference is that leading load, but given
contention you'd expect that load to stall, also, given it's a
non-exclusive load, to get stolen by a competing CPU. Whereas the old
code would start with a cmpxchg, which obviously will also stall, but
does an exclusive load.

And the thinking is that the exclusive load and the presence of the
cmpxchg loop would keep the line on that CPU for a little while and
progress is made.

Clearly this isn't working as expected. Also I suppose it would need
wider testing...



* Re: [PATCH] locking/rwsem: Optimize down_read_trylock() under highly contended case
  2021-11-18 12:57 ` Peter Zijlstra
@ 2021-11-18 18:12   ` Waiman Long
  2021-11-19  2:52   ` Muchun Song
  1 sibling, 0 replies; 5+ messages in thread
From: Waiman Long @ 2021-11-18 18:12 UTC (permalink / raw)
  To: Peter Zijlstra, Muchun Song
  Cc: mingo, will, boqun.feng, linux-kernel, duanxiongchun, zhengqi.arch

On 11/18/21 07:57, Peter Zijlstra wrote:
> On Thu, Nov 18, 2021 at 05:44:55PM +0800, Muchun Song wrote:
>
>> By using the above benchmark, the real execution time (in
>> milliseconds) on an x86-64 system before and after the patch was:
> What kind of x86_64 ?
>
>>                    Before Patch  After Patch
>>     # of Threads      real          real     reduced by
>>     ------------     ------        ------    ----------
>>           1          65,373        65,206       ~0.0%
>>           4          15,467        15,378       ~0.5%
>>          40           6,214         5,528      ~11.0%
>>
>> For the uncontended case, the new down_read_trylock() is the same as
>> before. For the contended cases, the new down_read_trylock() is faster
>> than before. The more contended the rwsem is, the larger the win.
>>
>> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
>> ---
>>   kernel/locking/rwsem.c | 11 ++++-------
>>   1 file changed, 4 insertions(+), 7 deletions(-)
>>
>> diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
>> index c51387a43265..ef2b2a3f508c 100644
>> --- a/kernel/locking/rwsem.c
>> +++ b/kernel/locking/rwsem.c
>> @@ -1249,17 +1249,14 @@ static inline int __down_read_trylock(struct rw_semaphore *sem)
>>   
>>   	DEBUG_RWSEMS_WARN_ON(sem->magic != sem, sem);
>>   
>> -	/*
>> -	 * Optimize for the case when the rwsem is not locked at all.
>> -	 */
>> -	tmp = RWSEM_UNLOCKED_VALUE;
>> -	do {
>> +	tmp = atomic_long_read(&sem->count);
>> +	while (!(tmp & RWSEM_READ_FAILED_MASK)) {
>>   		if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
>> -					tmp + RWSEM_READER_BIAS)) {
>> +						    tmp + RWSEM_READER_BIAS)) {
>>   			rwsem_set_reader_owned(sem);
>>   			return 1;
>>   		}
>> -	} while (!(tmp & RWSEM_READ_FAILED_MASK));
>> +	}
>>   	return 0;
>>   }
> This is weird... so the only difference is that leading load, but given
> contention you'd expect that load to stall, also, given it's a
> non-exclusive load, to get stolen by a competing CPU. Whereas the old
> code would start with a cmpxchg, which obviously will also stall, but
> does an exclusive load.
>
> And the thinking is that the exclusive load and the presence of the
> cmpxchg loop would keep the line on that CPU for a little while and
> progress is made.
>
> Clearly this isn't working as expected. Also I suppose it would need
> wider testing...

For a contended case, doing a shared read first before doing an
exclusive cmpxchg can certainly help to reduce cacheline thrashing. I
have no objection to making this change.

I believe most of the other trylock functions do a read first before
doing an atomic operation. In essence, we assume the use of trylock
means the callers are expecting a contended lock, whereas callers of
the regular *lock() functions are expecting an uncontended lock.
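
In sketch form, that read-before-cmpxchg convention looks roughly like
the following (illustrative only, not code taken from any particular
lock in the tree):

  #include <linux/atomic.h>

  /*
   * Illustrative trylock: start with a cheap shared read and only do
   * the exclusive RMW when the lock looks acquirable.
   */
  static inline int sketch_trylock(atomic_long_t *lock, long locked_mask,
  				  long bias)
  {
  	long old = atomic_long_read(lock);

  	while (!(old & locked_mask)) {
  		/* On failure, @old is updated to the observed value. */
  		if (atomic_long_try_cmpxchg_acquire(lock, &old, old + bias))
  			return 1;
  	}
  	return 0;
  }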

Acked-by: Waiman Long <longman@redhat.com>

-Longman



* Re: [PATCH] locking/rwsem: Optimize down_read_trylock() under highly contended case
  2021-11-18 12:57 ` Peter Zijlstra
  2021-11-18 18:12   ` Waiman Long
@ 2021-11-19  2:52   ` Muchun Song
  1 sibling, 0 replies; 5+ messages in thread
From: Muchun Song @ 2021-11-19  2:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Will Deacon, Waiman Long, boqun.feng, LKML,
	Xiongchun duan, Qi Zheng

On Thu, Nov 18, 2021 at 8:57 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Nov 18, 2021 at 05:44:55PM +0800, Muchun Song wrote:
>
> > By using the above benchmark, the real execution time (in
> > milliseconds) on an x86-64 system before and after the patch was:
>
> What kind of x86_64 ?

Intel(R) Xeon(R) CPU E5-2650

>
> >
> >                   Before Patch  After Patch
> >    # of Threads      real          real     reduced by
> >    ------------     ------        ------    ----------
> >          1          65,373        65,206       ~0.0%
> >          4          15,467        15,378       ~0.5%
> >         40           6,214         5,528      ~11.0%
> >
> > For the uncontended case, the new down_read_trylock() is the same as
> > before. For the contended cases, the new down_read_trylock() is faster
> > than before. The more contended the rwsem is, the larger the win.
> >
> > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> > ---
> >  kernel/locking/rwsem.c | 11 ++++-------
> >  1 file changed, 4 insertions(+), 7 deletions(-)
> >
> > diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
> > index c51387a43265..ef2b2a3f508c 100644
> > --- a/kernel/locking/rwsem.c
> > +++ b/kernel/locking/rwsem.c
> > @@ -1249,17 +1249,14 @@ static inline int __down_read_trylock(struct rw_semaphore *sem)
> >
> >       DEBUG_RWSEMS_WARN_ON(sem->magic != sem, sem);
> >
> > -     /*
> > -      * Optimize for the case when the rwsem is not locked at all.
> > -      */
> > -     tmp = RWSEM_UNLOCKED_VALUE;
> > -     do {
> > +     tmp = atomic_long_read(&sem->count);
> > +     while (!(tmp & RWSEM_READ_FAILED_MASK)) {
> >               if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
> > -                                     tmp + RWSEM_READER_BIAS)) {
> > +                                                 tmp + RWSEM_READER_BIAS)) {
> >                       rwsem_set_reader_owned(sem);
> >                       return 1;
> >               }
> > -     } while (!(tmp & RWSEM_READ_FAILED_MASK));
> > +     }
> >       return 0;
> >  }
>
> This is weird... so the only difference is that leading load, but given
> contention you'd expect that load to stall, also, given it's a
> non-exclusive load, to get stolen by a competing CPU. Whereas the old
> code would start with a cmpxchg, which obviously will also stall, but
> does an exclusive load.

The old code always assumes the rwsem is not locked at all, so the
local variable @tmp is initialized to 0 (RWSEM_UNLOCKED_VALUE).
Ideally, the first cmpxchg succeeds. However, if the rwsem is
contended (e.g. the mmap read lock is almost always held by other
threads), sem->count is rarely RWSEM_UNLOCKED_VALUE, so the first
cmpxchg is bound to fail and the second cmpxchg usually succeeds
instead. This patch loads sem->count into @tmp first, so the first
cmpxchg usually succeeds. Therefore this patch reduces the number of
cmpxchg executions when running the above benchmark.

I added the debug patch below to verify this and collected the
following statistics with bpftrace (bpftrace -e
'tracepoint:rwsem:rwsem_read_loop /comm == "a.out"/ {
@cnt[args->count] = count(); }').

Before patch:
@cnt[17]: 1
@cnt[11]: 6
@cnt[10]: 14
@cnt[9]: 62
@cnt[8]: 339
@cnt[1]: 1211
@cnt[7]: 1585
@cnt[6]: 7796
@cnt[5]: 34742
@cnt[4]: 158245
@cnt[3]: 921631
@cnt[2]: 25089070

After patch:
@cnt[9]: 1
@cnt[8]: 3
@cnt[7]: 28
@cnt[6]: 151
@cnt[0]: 197
@cnt[5]: 823
@cnt[4]: 5521
@cnt[3]: 39126
@cnt[2]: 328012
@cnt[1]: 25840840

As you can see, the total number of cmpxchg executions is roughly
halved (about 53.8 million before vs. 26.6 million after).

Thanks.

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index ef2b2a3f508c..cdeb1ad1921f 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -1246,14 +1246,18 @@ static inline int __down_read_killable(struct rw_semaphore *sem)
 static inline int __down_read_trylock(struct rw_semaphore *sem)
 {
        long tmp;
+       int count = 0;

        DEBUG_RWSEMS_WARN_ON(sem->magic != sem, sem);

        tmp = atomic_long_read(&sem->count);
        while (!(tmp & RWSEM_READ_FAILED_MASK)) {
+               count++;
                if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
                                                    tmp + RWSEM_READER_BIAS)) {
                        rwsem_set_reader_owned(sem);
+                       if (&current->mm->mmap_lock == sem)
+                               trace_rwsem_read_loop(count);
                        return 1;
                }
        }
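
The rwsem_read_loop tracepoint itself is not part of the diff above.
A minimal definition along the following lines (e.g. in a local
include/trace/events/rwsem.h pulled into rwsem.c via
CREATE_TRACE_POINTS) would be enough for the bpftrace probe to exist;
this is only a sketch, the exact definition was not posted:

  /* Sketch of the tracepoint behind tracepoint:rwsem:rwsem_read_loop. */
  #undef TRACE_SYSTEM
  #define TRACE_SYSTEM rwsem

  #if !defined(_TRACE_RWSEM_H) || defined(TRACE_HEADER_MULTI_READ)
  #define _TRACE_RWSEM_H

  #include <linux/tracepoint.h>

  TRACE_EVENT(rwsem_read_loop,

  	TP_PROTO(int count),

  	TP_ARGS(count),

  	TP_STRUCT__entry(
  		__field(int, count)
  	),

  	TP_fast_assign(
  		__entry->count = count;
  	),

  	TP_printk("count=%d", __entry->count)
  );

  #endif /* _TRACE_RWSEM_H */

  /* This part must be outside protection */
  #include <trace/define_trace.h>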

>
> And the thinking is that the exclusive load and the presence of the
> cmpxchg loop would keep the line on that CPU for a little while and
> progress is made.
>
> Clearly this isn't working as expected. Also I suppose it would need
> wider testing...
>


* [tip: locking/urgent] locking/rwsem: Optimize down_read_trylock() under highly contended case
  2021-11-18  9:44 [PATCH] locking/rwsem: Optimize down_read_trylock() under highly contended case Muchun Song
  2021-11-18 12:57 ` Peter Zijlstra
@ 2021-11-23  8:53 ` tip-bot2 for Muchun Song
  1 sibling, 0 replies; 5+ messages in thread
From: tip-bot2 for Muchun Song @ 2021-11-23  8:53 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Muchun Song, Peter Zijlstra (Intel), Waiman Long, x86, linux-kernel

The following commit has been merged into the locking/urgent branch of tip:

Commit-ID:     14c24048841151548a3f4d9e218510c844c1b737
Gitweb:        https://git.kernel.org/tip/14c24048841151548a3f4d9e218510c844c1b737
Author:        Muchun Song <songmuchun@bytedance.com>
AuthorDate:    Thu, 18 Nov 2021 17:44:55 +08:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 23 Nov 2021 09:45:36 +01:00

locking/rwsem: Optimize down_read_trylock() under highly contended case

We found that a process with 10 thousand threads hit a performance
regression when migrating from Linux-v4.14 to Linux-v5.4. The workload
sometimes allocates lots of memory concurrently from many threads, and
in that case down_read_trylock() shows up as a hotspot in the profile.
Therefore, we suspect that rwsem has had a regression since at least
Linux-v5.4. In order to make the problem easy to debug, we wrote a
simple benchmark that recreates a similar situation, shown below.

  ```c++
  #include <sys/mman.h>
  #include <sys/time.h>
  #include <sys/resource.h>
  #include <sched.h>
  #include <pthread.h>

  #include <cstdio>
  #include <cstdlib>
  #include <cassert>
  #include <thread>
  #include <vector>
  #include <chrono>

  volatile int mutex;

  void trigger(int cpu, char* ptr, std::size_t sz)
  {
  	cpu_set_t set;
  	CPU_ZERO(&set);
  	CPU_SET(cpu, &set);
  	assert(pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0);

  	while (mutex);

  	for (std::size_t i = 0; i < sz; i += 4096) {
  		*ptr = '\0';
  		ptr += 4096;
  	}
  }

  int main(int argc, char* argv[])
  {
  	std::size_t sz = 100;

  	if (argc > 1)
  		sz = atoi(argv[1]);

  	auto nproc = std::thread::hardware_concurrency();
  	std::vector<std::thread> thr;
  	sz <<= 30;
  	auto* ptr = mmap(nullptr, sz, PROT_READ | PROT_WRITE, MAP_ANON |
			 MAP_PRIVATE, -1, 0);
  	assert(ptr != MAP_FAILED);
  	char* cptr = static_cast<char*>(ptr);
  	auto run = sz / nproc;
  	run = (run >> 12) << 12;

  	mutex = 1;

  	for (auto i = 0U; i < nproc; ++i) {
  		thr.emplace_back(std::thread([i, cptr, run]() { trigger(i, cptr, run); }));
  		cptr += run;
  	}

  	rusage usage_start;
  	getrusage(RUSAGE_SELF, &usage_start);
  	auto start = std::chrono::system_clock::now();

  	mutex = 0;

  	for (auto& t : thr)
  		t.join();

  	rusage usage_end;
  	getrusage(RUSAGE_SELF, &usage_end);
  	auto end = std::chrono::system_clock::now();
  	timeval utime;
  	timeval stime;
  	timersub(&usage_end.ru_utime, &usage_start.ru_utime, &utime);
  	timersub(&usage_end.ru_stime, &usage_start.ru_stime, &stime);
  	printf("usr: %ld.%06ld\n", utime.tv_sec, utime.tv_usec);
  	printf("sys: %ld.%06ld\n", stime.tv_sec, stime.tv_usec);
  	printf("real: %lu\n",
  	       std::chrono::duration_cast<std::chrono::milliseconds>(end -
  	       start).count());

  	return 0;
  }
  ```

The program above simply creates `nproc` threads, each of which
touches memory (triggering page faults) on a different CPU. While it
runs, we see a profile like the following in `perf top`.

  25.55%  [kernel]                  [k] down_read_trylock
  14.78%  [kernel]                  [k] handle_mm_fault
  13.45%  [kernel]                  [k] up_read
   8.61%  [kernel]                  [k] clear_page_erms
   3.89%  [kernel]                  [k] __do_page_fault

The hottest instruction in down_read_trylock(), accounting for about
92% of its samples, is the cmpxchg shown below.

  91.89 │      lock   cmpxchg %rdx,(%rdi)

Since the problem appeared when migrating from Linux-v4.14 to
Linux-v5.4, it was easy to track it down to commit ddb20d1d3aed
("locking/rwsem: Optimize down_read_trylock()"). The reason is that the
commit assumes the rwsem is not contended at all, which is not always
true for the mmap lock, which can be contended by thousands of threads.
As a result, most threads need to execute the cmpxchg at least twice to
acquire the lock, and atomic operations are more expensive than
non-atomic instructions, which caused the regression.

By using the above benchmark, the real execution time (in
milliseconds) on an x86-64 system before and after the patch was:

                  Before Patch  After Patch
   # of Threads      real          real     reduced by
   ------------     ------        ------    ----------
         1          65,373        65,206       ~0.0%
         4          15,467        15,378       ~0.5%
        40           6,214         5,528      ~11.0%

For the uncontended case, the new down_read_trylock() is the same as
before. For the contended cases, the new down_read_trylock() is faster
than before. The more contended the rwsem is, the larger the win.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Link: https://lore.kernel.org/r/20211118094455.9068-1-songmuchun@bytedance.com
---
 kernel/locking/rwsem.c | 11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index e039cf1..04a74d0 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -1248,17 +1248,14 @@ static inline int __down_read_trylock(struct rw_semaphore *sem)
 
 	DEBUG_RWSEMS_WARN_ON(sem->magic != sem, sem);
 
-	/*
-	 * Optimize for the case when the rwsem is not locked at all.
-	 */
-	tmp = RWSEM_UNLOCKED_VALUE;
-	do {
+	tmp = atomic_long_read(&sem->count);
+	while (!(tmp & RWSEM_READ_FAILED_MASK)) {
 		if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
-					tmp + RWSEM_READER_BIAS)) {
+						    tmp + RWSEM_READER_BIAS)) {
 			rwsem_set_reader_owned(sem);
 			return 1;
 		}
-	} while (!(tmp & RWSEM_READ_FAILED_MASK));
+	}
 	return 0;
 }
 


end of thread, other threads:[~2021-11-23  8:53 UTC | newest]

Thread overview: 5+ messages
2021-11-18  9:44 [PATCH] locking/rwsem: Optimize down_read_trylock() under highly contended case Muchun Song
2021-11-18 12:57 ` Peter Zijlstra
2021-11-18 18:12   ` Waiman Long
2021-11-19  2:52   ` Muchun Song
2021-11-23  8:53 ` [tip: locking/urgent] " tip-bot2 for Muchun Song
