* [PATCH-tip v2 0/3] locking/rwsem: Miscellaneous rwsem enhancements
@ 2023-02-16 21:09 Waiman Long
  2023-02-16 21:09 ` [PATCH v2 1/3] locking/rwsem: Minor code refactoring in rwsem_mark_wake() Waiman Long
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Waiman Long @ 2023-02-16 21:09 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng
  Cc: linux-kernel, Hillf Danton, Waiman Long

 v2:
  - Break out some code refactoring into a separate patch.
  - Move the early handoff code from rwsem_wake() to rwsem_mark_wake() to
    cover all wakeup conditions as suggested by PeterZ.
  - Simplify the all-reader wakeup patch.

The second patch in this series is similar to patch 4 of the v7 patch
series [1], so it should provide a similar performance benefit. Patch 3
is another minor enhancement to reader wakeup for some specific use
cases.

[1] https://lore.kernel.org/lkml/20230126003628.365092-1-longman@redhat.com/

Waiman Long (3):
  locking/rwsem: Minor code refactoring in rwsem_mark_wake()
  locking/rwsem: Enable early rwsem writer lock handoff
  locking/rwsem: Wake up all readers for wait queue waker

 kernel/locking/lock_events_list.h |   1 +
 kernel/locking/rwsem.c            | 133 ++++++++++++++++++++++--------
 2 files changed, 99 insertions(+), 35 deletions(-)

-- 
2.31.1



* [PATCH v2 1/3] locking/rwsem: Minor code refactoring in rwsem_mark_wake()
  2023-02-16 21:09 [PATCH-tip v2 0/3] locking/rwsem: Miscellaneous rwsem enhancements Waiman Long
@ 2023-02-16 21:09 ` Waiman Long
  2023-02-16 21:09 ` [PATCH v2 2/3] locking/rwsem: Enable early rwsem writer lock handoff Waiman Long
  2023-02-16 21:09 ` [PATCH v2 3/3] locking/rwsem: Wake up all readers for wait queue waker Waiman Long
  2 siblings, 0 replies; 8+ messages in thread
From: Waiman Long @ 2023-02-16 21:09 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng
  Cc: linux-kernel, Hillf Danton, Waiman Long

Rename "oldcount" to "count" as it is not always old count value.
Also make some minor code refactoring to reduce indentation. There
is no functional change.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/rwsem.c | 44 +++++++++++++++++++++---------------------
 1 file changed, 22 insertions(+), 22 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index acb5a50309a1..e589f69793df 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -40,7 +40,7 @@
  *
  * When the rwsem is reader-owned and a spinning writer has timed out,
  * the nonspinnable bit will be set to disable optimistic spinning.
-
+ *
  * When a writer acquires a rwsem, it puts its task_struct pointer
  * into the owner field. It is cleared after an unlock.
  *
@@ -413,7 +413,7 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
 			    struct wake_q_head *wake_q)
 {
 	struct rwsem_waiter *waiter, *tmp;
-	long oldcount, woken = 0, adjustment = 0;
+	long count, woken = 0, adjustment = 0;
 	struct list_head wlist;
 
 	lockdep_assert_held(&sem->wait_lock);
@@ -424,22 +424,23 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
 	 */
 	waiter = rwsem_first_waiter(sem);
 
-	if (waiter->type == RWSEM_WAITING_FOR_WRITE) {
-		if (wake_type == RWSEM_WAKE_ANY) {
-			/*
-			 * Mark writer at the front of the queue for wakeup.
-			 * Until the task is actually later awoken later by
-			 * the caller, other writers are able to steal it.
-			 * Readers, on the other hand, will block as they
-			 * will notice the queued writer.
-			 */
-			wake_q_add(wake_q, waiter->task);
-			lockevent_inc(rwsem_wake_writer);
-		}
+	if (waiter->type != RWSEM_WAITING_FOR_WRITE)
+		goto wake_readers;
 
-		return;
+	if (wake_type == RWSEM_WAKE_ANY) {
+		/*
+		 * Mark writer at the front of the queue for wakeup.
+		 * Until the task is actually later awoken later by
+		 * the caller, other writers are able to steal it.
+		 * Readers, on the other hand, will block as they
+		 * will notice the queued writer.
+		 */
+		wake_q_add(wake_q, waiter->task);
+		lockevent_inc(rwsem_wake_writer);
 	}
+	return;
 
+wake_readers:
 	/*
 	 * No reader wakeup if there are too many of them already.
 	 */
@@ -455,15 +456,15 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
 		struct task_struct *owner;
 
 		adjustment = RWSEM_READER_BIAS;
-		oldcount = atomic_long_fetch_add(adjustment, &sem->count);
-		if (unlikely(oldcount & RWSEM_WRITER_MASK)) {
+		count = atomic_long_fetch_add(adjustment, &sem->count);
+		if (unlikely(count & RWSEM_WRITER_MASK)) {
 			/*
 			 * When we've been waiting "too" long (for writers
 			 * to give up the lock), request a HANDOFF to
 			 * force the issue.
 			 */
 			if (time_after(jiffies, waiter->timeout)) {
-				if (!(oldcount & RWSEM_FLAG_HANDOFF)) {
+				if (!(count & RWSEM_FLAG_HANDOFF)) {
 					adjustment -= RWSEM_FLAG_HANDOFF;
 					lockevent_inc(rwsem_rlock_handoff);
 				}
@@ -524,21 +525,21 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
 	adjustment = woken * RWSEM_READER_BIAS - adjustment;
 	lockevent_cond_inc(rwsem_wake_reader, woken);
 
-	oldcount = atomic_long_read(&sem->count);
+	count = atomic_long_read(&sem->count);
 	if (list_empty(&sem->wait_list)) {
 		/*
 		 * Combined with list_move_tail() above, this implies
 		 * rwsem_del_waiter().
 		 */
 		adjustment -= RWSEM_FLAG_WAITERS;
-		if (oldcount & RWSEM_FLAG_HANDOFF)
+		if (count & RWSEM_FLAG_HANDOFF)
 			adjustment -= RWSEM_FLAG_HANDOFF;
 	} else if (woken) {
 		/*
 		 * When we've woken a reader, we no longer need to force
 		 * writers to give up the lock and we can clear HANDOFF.
 		 */
-		if (oldcount & RWSEM_FLAG_HANDOFF)
+		if (count & RWSEM_FLAG_HANDOFF)
 			adjustment -= RWSEM_FLAG_HANDOFF;
 	}
 
@@ -844,7 +845,6 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 		 * Try to acquire the lock
 		 */
 		taken = rwsem_try_write_lock_unqueued(sem);
-
 		if (taken)
 			break;
 
-- 
2.31.1



* [PATCH v2 2/3] locking/rwsem: Enable early rwsem writer lock handoff
  2023-02-16 21:09 [PATCH-tip v2 0/3] locking/rwsem: Miscellaneous rwsem enhancements Waiman Long
  2023-02-16 21:09 ` [PATCH v2 1/3] locking/rwsem: Minor code refactoring in rwsem_mark_wake() Waiman Long
@ 2023-02-16 21:09 ` Waiman Long
  2023-02-17  5:24   ` Boqun Feng
  2023-02-23 12:32   ` Peter Zijlstra
  2023-02-16 21:09 ` [PATCH v2 3/3] locking/rwsem: Wake up all readers for wait queue waker Waiman Long
  2 siblings, 2 replies; 8+ messages in thread
From: Waiman Long @ 2023-02-16 21:09 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng
  Cc: linux-kernel, Hillf Danton, Waiman Long

The lock handoff provided in rwsem isn't a true handoff like that in
the mutex. Instead, it is more like a quiescent state where optimistic
spinning and lock stealing are disabled to make it easier for the first
waiter to acquire the lock.

For readers, setting the HANDOFF bit will prevent writers from stealing
the lock. The actual handoff is done at rwsem_wake() time after taking
the wait_lock. There isn't much we need to improve here other than
setting the RWSEM_NONSPINNABLE bit in the owner field.

For writers, setting the HANDOFF bit does not guarantee that the writer
can acquire the rwsem successfully in a subsequent rwsem_try_write_lock()
after setting the bit there. A reader can come in and add a
RWSEM_READER_BIAS temporarily, which can spoil the takeover of the rwsem
in rwsem_try_write_lock() and lead to additional delay.
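
For reference, the relevant ->count layout is roughly the following
(abridged sketch of the defines in kernel/locking/rwsem.c):

	/* low bits: writer-locked + flag bits; bits 8 and up: reader count */
	#define RWSEM_WRITER_LOCKED	(1UL << 0)
	#define RWSEM_FLAG_WAITERS	(1UL << 1)
	#define RWSEM_FLAG_HANDOFF	(1UL << 2)
	#define RWSEM_READER_SHIFT	8
	#define RWSEM_READER_BIAS	(1UL << RWSEM_READER_SHIFT)
	#define RWSEM_WRITER_MASK	RWSEM_WRITER_LOCKED
	#define RWSEM_READER_MASK	(~(RWSEM_READER_BIAS - 1))
	#define RWSEM_LOCK_MASK		(RWSEM_WRITER_MASK|RWSEM_READER_MASK)

A transient RWSEM_READER_BIAS added by an incoming reader is enough to
make the (count & RWSEM_LOCK_MASK) test fail for the handoff-setting
writer.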

For a mutex, lock handoff is done at unlock time as the owner value and
the handoff bit are in the same lock word and can be updated atomically.
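
A rough illustration of the mutex case (heavily simplified, not the
actual code; see __mutex_handoff() in kernel/locking/mutex.c, and note
that first_waiter below is just a placeholder):

	/*
	 * Sketch only: the owner task pointer and MUTEX_FLAG_HANDOFF share
	 * one atomic word, so the unlocker can install the first waiter
	 * as the new owner in a single atomic operation.  The real code
	 * uses a try_cmpxchg loop and preserves the other flag bits.
	 */
	unsigned long owner = atomic_long_read(&lock->owner);

	if (owner & MUTEX_FLAG_HANDOFF)
		atomic_long_cmpxchg(&lock->owner, owner,
				    (unsigned long)first_waiter);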

That is not the case for rwsem, which has a count value for locking and
a separate owner value for storing the lock owner. In addition, the
handoff processing differs depending on whether the first waiter is a
writer or a reader, and we can only make that waiter type determination
after acquiring the wait_lock. Together with the fact that the
RWSEM_FLAG_HANDOFF bit is stable while holding the wait_lock, the most
convenient place to do the early handoff is in rwsem_mark_wake() where
the wait_lock has to be acquired anyway.

Doing this check and early handoff in rwsem_mark_wake() adds little
cost while increasing the chance that a lock handoff will succeed,
letting the HANDOFF-setting writer wake up without needing to take the
wait_lock at all. Note that if the early handoff fails to happen in
rwsem_mark_wake(), a late handoff can still happen when the awoken
writer calls rwsem_try_write_lock().

Kernel test robot noticed a 19.3% improvement of
will-it-scale.per_thread_ops in an earlier version of this commit [1].

[1] https://lore.kernel.org/lkml/202302122155.87699b56-oliver.sang@intel.com/

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/lock_events_list.h |  1 +
 kernel/locking/rwsem.c            | 71 +++++++++++++++++++++++++++----
 2 files changed, 64 insertions(+), 8 deletions(-)

diff --git a/kernel/locking/lock_events_list.h b/kernel/locking/lock_events_list.h
index 97fb6f3f840a..fd80f5828f24 100644
--- a/kernel/locking/lock_events_list.h
+++ b/kernel/locking/lock_events_list.h
@@ -67,3 +67,4 @@ LOCK_EVENT(rwsem_rlock_handoff)	/* # of read lock handoffs		*/
 LOCK_EVENT(rwsem_wlock)		/* # of write locks acquired		*/
 LOCK_EVENT(rwsem_wlock_fail)	/* # of failed write lock acquisitions	*/
 LOCK_EVENT(rwsem_wlock_handoff)	/* # of write lock handoffs		*/
+LOCK_EVENT(rwsem_wlock_ehandoff) /* # of write lock early handoffs	*/
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index e589f69793df..fc3961ceabe8 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -412,8 +412,9 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
 			    enum rwsem_wake_type wake_type,
 			    struct wake_q_head *wake_q)
 {
+	long count = atomic_long_read(&sem->count);
 	struct rwsem_waiter *waiter, *tmp;
-	long count, woken = 0, adjustment = 0;
+	long woken = 0, adjustment = 0;
 	struct list_head wlist;
 
 	lockdep_assert_held(&sem->wait_lock);
@@ -432,19 +433,39 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
 		 * Mark writer at the front of the queue for wakeup.
 		 * Until the task is actually later awoken later by
 		 * the caller, other writers are able to steal it.
+		 *
+		 * *Unless* HANDOFF is set, in which case only the
+		 * first waiter is allowed to take it.
+		 *
 		 * Readers, on the other hand, will block as they
 		 * will notice the queued writer.
 		 */
 		wake_q_add(wake_q, waiter->task);
 		lockevent_inc(rwsem_wake_writer);
+
+		if ((count & RWSEM_LOCK_MASK) || !(count & RWSEM_FLAG_HANDOFF))
+			return;
+
+		/*
+		 * If the rwsem is free and handoff flag is set with wait_lock
+		 * held, no other CPUs can take an active lock. We can do an
+		 * early handoff.
+		 */
+		adjustment = RWSEM_WRITER_LOCKED - RWSEM_FLAG_HANDOFF;
+		atomic_long_set(&sem->owner, (long)waiter->task);
+		waiter->task = NULL;
+		atomic_long_add(adjustment, &sem->count);
+		rwsem_del_waiter(sem, waiter);
+		lockevent_inc(rwsem_wlock_ehandoff);
 	}
 	return;
 
 wake_readers:
 	/*
-	 * No reader wakeup if there are too many of them already.
+	 * No reader wakeup if there are too many of them already or
+	 * something wrong happens.
 	 */
-	if (unlikely(atomic_long_read(&sem->count) < 0))
+	if (unlikely(count < 0))
 		return;
 
 	/*
@@ -468,7 +489,12 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
 					adjustment -= RWSEM_FLAG_HANDOFF;
 					lockevent_inc(rwsem_rlock_handoff);
 				}
+				/*
+				 * With HANDOFF set for reader, we must
+				 * terminate all spinning.
+				 */
 				waiter->handoff_set = true;
+				rwsem_set_nonspinnable(sem);
 			}
 
 			atomic_long_add(-adjustment, &sem->count);
@@ -610,6 +636,12 @@ static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
 
 	lockdep_assert_held(&sem->wait_lock);
 
+	if (!waiter->task) {
+		/* Write lock handed off */
+		smp_acquire__after_ctrl_dep();
+		return true;
+	}
+
 	count = atomic_long_read(&sem->count);
 	do {
 		bool has_handoff = !!(count & RWSEM_FLAG_HANDOFF);
@@ -755,6 +787,10 @@ rwsem_spin_on_owner(struct rw_semaphore *sem)
 
 	owner = rwsem_owner_flags(sem, &flags);
 	state = rwsem_owner_state(owner, flags);
+
+	if (owner == current)
+		return OWNER_NONSPINNABLE;	/* Handoff granted */
+
 	if (state != OWNER_WRITER)
 		return state;
 
@@ -1164,32 +1200,51 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 		 * the lock, attempt to spin on owner to accelerate lock
 		 * transfer. If the previous owner is a on-cpu writer and it
 		 * has just released the lock, OWNER_NULL will be returned.
-		 * In this case, we attempt to acquire the lock again
-		 * without sleeping.
+		 * In this case, the waker may be in the process of early
+		 * lock handoff. Use the wait_lock to synchronize with that
+		 * before checking for handoff.
 		 */
 		if (waiter.handoff_set) {
 			enum owner_state owner_state;
 
 			owner_state = rwsem_spin_on_owner(sem);
-			if (owner_state == OWNER_NULL)
-				goto trylock_again;
+			if ((owner_state == OWNER_NULL) &&
+			    READ_ONCE(waiter.task)) {
+				raw_spin_lock_irq(&sem->wait_lock);
+				raw_spin_unlock_irq(&sem->wait_lock);
+			}
+			if (!READ_ONCE(waiter.task))
+				goto handed_off;
 		}
 
 		schedule_preempt_disabled();
 		lockevent_inc(rwsem_sleep_writer);
+		if (!READ_ONCE(waiter.task))
+			goto handed_off;
+
 		set_current_state(state);
-trylock_again:
 		raw_spin_lock_irq(&sem->wait_lock);
 	}
 	__set_current_state(TASK_RUNNING);
 	raw_spin_unlock_irq(&sem->wait_lock);
+out:
 	lockevent_inc(rwsem_wlock);
 	trace_contention_end(sem, 0);
 	return sem;
 
+handed_off:
+	/* Write lock handed off */
+	set_current_state(TASK_RUNNING);	/* smp_mb() */
+	goto out;
+
 out_nolock:
 	__set_current_state(TASK_RUNNING);
 	raw_spin_lock_irq(&sem->wait_lock);
+	if (!waiter.task) {
+		smp_acquire__after_ctrl_dep();
+		raw_spin_unlock_irq(&sem->wait_lock);
+		goto out;
+	}
 	rwsem_del_wake_waiter(sem, &waiter, &wake_q);
 	lockevent_inc(rwsem_wlock_fail);
 	trace_contention_end(sem, -EINTR);
-- 
2.31.1



* [PATCH v2 3/3] locking/rwsem: Wake up all readers for wait queue waker
  2023-02-16 21:09 [PATCH-tip v2 0/3] locking/rwsem: Miscellaneous rwsem enhancements Waiman Long
  2023-02-16 21:09 ` [PATCH v2 1/3] locking/rwsem: Minor code refactoring in rwsem_mark_wake() Waiman Long
  2023-02-16 21:09 ` [PATCH v2 2/3] locking/rwsem: Enable early rwsem writer lock handoff Waiman Long
@ 2023-02-16 21:09 ` Waiman Long
  2 siblings, 0 replies; 8+ messages in thread
From: Waiman Long @ 2023-02-16 21:09 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng
  Cc: linux-kernel, Hillf Danton, Waiman Long

As noted in commit 54c1ee4d614d ("locking/rwsem: Conditionally wake
waiters in reader/writer slowpaths"), it was possible for a rwsem to get
into a state where a reader-owned rwsem could have many readers waiting
in the wait queue but no writer.

Recently, it was found that one way to cause this condition is to have a
highly contended rwsem with many readers, like an mmap_sem. There can be
hundreds of readers waiting in the wait queue of a writer-owned mmap_sem.
The rwsem_wake() call from the up_write() call of the rwsem-owning writer
can hit the 256-reader wakeup limit and leave the rest of the readers in
the wait queue. The reason for the limit is to avoid excessive delay in
doing other useful work.

With commit 54c1ee4d614d ("locking/rwsem: Conditionally wake waiters
in reader/writer slowpaths"), a new incoming reader should wake up
another batch of up to 256 readers. However, these incoming readers
or writers will have to wait in the wait queue, and there is nothing
else they can do until it is their turn to be woken up. This patch
renames rwsem_mark_wake() to __rwsem_mark_wake() and adds an in_waitq
argument to indicate that the waker is in the wait queue and can
ignore the limit. A rwsem_mark_wake() helper is added that keeps the
original semantics.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/rwsem.c | 20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index fc3961ceabe8..35b4adf8ea55 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -408,9 +408,9 @@ rwsem_del_waiter(struct rw_semaphore *sem, struct rwsem_waiter *waiter)
  *
  * Implies rwsem_del_waiter() for all woken readers.
  */
-static void rwsem_mark_wake(struct rw_semaphore *sem,
-			    enum rwsem_wake_type wake_type,
-			    struct wake_q_head *wake_q)
+static void __rwsem_mark_wake(struct rw_semaphore *sem,
+			      enum rwsem_wake_type wake_type,
+			      struct wake_q_head *wake_q, bool in_waitq)
 {
 	long count = atomic_long_read(&sem->count);
 	struct rwsem_waiter *waiter, *tmp;
@@ -542,9 +542,10 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
 		list_move_tail(&waiter->list, &wlist);
 
 		/*
-		 * Limit # of readers that can be woken up per wakeup call.
+		 * Limit # of readers that can be woken up per wakeup call
+		 * unless the waker is waiting in the wait queue.
 		 */
-		if (unlikely(woken >= MAX_READERS_WAKEUP))
+		if (unlikely(!in_waitq && (woken >= MAX_READERS_WAKEUP)))
 			break;
 	}
 
@@ -594,6 +595,13 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
 	}
 }
 
+static inline void rwsem_mark_wake(struct rw_semaphore *sem,
+				   enum rwsem_wake_type wake_type,
+				   struct wake_q_head *wake_q)
+{
+	__rwsem_mark_wake(sem, wake_type, wake_q, false);
+}
+
 /*
  * Remove a waiter and try to wake up other waiters in the wait queue
  * This function is called from the out_nolock path of both the reader and
@@ -1022,7 +1030,7 @@ static inline void rwsem_cond_wake_waiter(struct rw_semaphore *sem, long count,
 		wake_type = RWSEM_WAKE_ANY;
 		clear_nonspinnable(sem);
 	}
-	rwsem_mark_wake(sem, wake_type, wake_q);
+	__rwsem_mark_wake(sem, wake_type, wake_q, true);
 }
 
 /*
-- 
2.31.1



* Re: [PATCH v2 2/3] locking/rwsem: Enable early rwsem writer lock handoff
  2023-02-16 21:09 ` [PATCH v2 2/3] locking/rwsem: Enable early rwsem writer lock handoff Waiman Long
@ 2023-02-17  5:24   ` Boqun Feng
  2023-02-17 14:25     ` Waiman Long
  2023-02-23 12:32   ` Peter Zijlstra
  1 sibling, 1 reply; 8+ messages in thread
From: Boqun Feng @ 2023-02-17  5:24 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, linux-kernel, Hillf Danton

On Thu, Feb 16, 2023 at 04:09:32PM -0500, Waiman Long wrote:
> The lock handoff provided in rwsem isn't a true handoff like that in
> the mutex. Instead, it is more like a quiescent state where optimistic
> spinning and lock stealing are disabled to make it easier for the first
> waiter to acquire the lock.
> 
> For readers, setting the HANDOFF bit will disable writers from stealing
> the lock. The actual handoff is done at rwsem_wake() time after taking
> the wait_lock. There isn't much we need to improve here other than
> setting the RWSEM_NONSPINNABLE bit in owner.
> 
> For writers, setting the HANDOFF bit does not guarantee that it can
> acquire the rwsem successfully in a subsequent rwsem_try_write_lock()
> after setting the bit there. A reader can come in and add a
> RWSEM_READER_BIAS temporarily which can spoil the takeover of the rwsem
> in rwsem_try_write_lock() leading to additional delay.
> 
> For mutex, lock handoff is done at unlock time as the owner value and
> the handoff bit is in the same lock word and can be updated atomically.
> 
> That is the not case for rwsem which has a count value for locking and
> a different owner value for storing lock owner. In addition, the handoff
> processing differs depending on whether the first waiter is a writer or a
> reader. We can only make that waiter type determination after acquiring
> the wait lock. Together with the fact that the RWSEM_FLAG_HANDOFF bit
> is stable while holding the wait_lock, the most convenient place to
> do the early handoff is at rwsem_mark_wake() where wait_lock has to be
> acquired anyway.
> 
> There isn't much additional cost in doing this check and early handoff
> in rwsem_mark_wake() while increasing the chance that a lock handoff
> will be successful when the HANDOFF setting writer wakes up without
> even the need to take the wait_lock at all. Note that if early handoff
> fails to happen in rwsem_mark_wake(), a late handoff can still happen
> when the awoken writer calls rwsem_try_write_lock().
> 
> Kernel test robot noticed a 19.3% improvement of
> will-it-scale.per_thread_ops in an earlier version of this commit [1].
> 
> [1] https://lore.kernel.org/lkml/202302122155.87699b56-oliver.sang@intel.com/
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  kernel/locking/lock_events_list.h |  1 +
>  kernel/locking/rwsem.c            | 71 +++++++++++++++++++++++++++----
>  2 files changed, 64 insertions(+), 8 deletions(-)
> 
> diff --git a/kernel/locking/lock_events_list.h b/kernel/locking/lock_events_list.h
> index 97fb6f3f840a..fd80f5828f24 100644
> --- a/kernel/locking/lock_events_list.h
> +++ b/kernel/locking/lock_events_list.h
> @@ -67,3 +67,4 @@ LOCK_EVENT(rwsem_rlock_handoff)	/* # of read lock handoffs		*/
>  LOCK_EVENT(rwsem_wlock)		/* # of write locks acquired		*/
>  LOCK_EVENT(rwsem_wlock_fail)	/* # of failed write lock acquisitions	*/
>  LOCK_EVENT(rwsem_wlock_handoff)	/* # of write lock handoffs		*/
> +LOCK_EVENT(rwsem_wlock_ehandoff) /* # of write lock early handoffs	*/
> diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
> index e589f69793df..fc3961ceabe8 100644
> --- a/kernel/locking/rwsem.c
> +++ b/kernel/locking/rwsem.c
> @@ -412,8 +412,9 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
>  			    enum rwsem_wake_type wake_type,
>  			    struct wake_q_head *wake_q)
>  {
> +	long count = atomic_long_read(&sem->count);
>  	struct rwsem_waiter *waiter, *tmp;
> -	long count, woken = 0, adjustment = 0;
> +	long woken = 0, adjustment = 0;
>  	struct list_head wlist;
>  
>  	lockdep_assert_held(&sem->wait_lock);
> @@ -432,19 +433,39 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
>  		 * Mark writer at the front of the queue for wakeup.
>  		 * Until the task is actually later awoken later by
>  		 * the caller, other writers are able to steal it.
> +		 *
> +		 * *Unless* HANDOFF is set, in which case only the
> +		 * first waiter is allowed to take it.
> +		 *
>  		 * Readers, on the other hand, will block as they
>  		 * will notice the queued writer.
>  		 */
>  		wake_q_add(wake_q, waiter->task);
>  		lockevent_inc(rwsem_wake_writer);
> +
> +		if ((count & RWSEM_LOCK_MASK) || !(count & RWSEM_FLAG_HANDOFF))
> +			return;
> +
> +		/*
> +		 * If the rwsem is free and handoff flag is set with wait_lock
> +		 * held, no other CPUs can take an active lock. We can do an
> +		 * early handoff.
> +		 */
> +		adjustment = RWSEM_WRITER_LOCKED - RWSEM_FLAG_HANDOFF;
> +		atomic_long_set(&sem->owner, (long)waiter->task);
> +		waiter->task = NULL;
> +		atomic_long_add(adjustment, &sem->count);
> +		rwsem_del_waiter(sem, waiter);
> +		lockevent_inc(rwsem_wlock_ehandoff);
>  	}
>  	return;
>  
>  wake_readers:
>  	/*
> -	 * No reader wakeup if there are too many of them already.
> +	 * No reader wakeup if there are too many of them already or
> +	 * something wrong happens.
>  	 */
> -	if (unlikely(atomic_long_read(&sem->count) < 0))
> +	if (unlikely(count < 0))
>  		return;
>  
>  	/*
> @@ -468,7 +489,12 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
>  					adjustment -= RWSEM_FLAG_HANDOFF;
>  					lockevent_inc(rwsem_rlock_handoff);
>  				}
> +				/*
> +				 * With HANDOFF set for reader, we must
> +				 * terminate all spinning.
> +				 */
>  				waiter->handoff_set = true;
> +				rwsem_set_nonspinnable(sem);
>  			}
>  
>  			atomic_long_add(-adjustment, &sem->count);
> @@ -610,6 +636,12 @@ static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
>  
>  	lockdep_assert_held(&sem->wait_lock);
>  
> +	if (!waiter->task) {
> +		/* Write lock handed off */
> +		smp_acquire__after_ctrl_dep();

I don't think you need smp_acquire__after_ctrl_dep() here, since:

*	The other side is just a normal write "waiter->task = NULL"
*	The "&sem->wait_lock" already provides the necessary ACQUIRE

But I came to this series late, so I may have missed something subtle.

> +		return true;
> +	}
> +
>  	count = atomic_long_read(&sem->count);
>  	do {
>  		bool has_handoff = !!(count & RWSEM_FLAG_HANDOFF);
> @@ -755,6 +787,10 @@ rwsem_spin_on_owner(struct rw_semaphore *sem)
>  
>  	owner = rwsem_owner_flags(sem, &flags);
>  	state = rwsem_owner_state(owner, flags);
> +
> +	if (owner == current)
> +		return OWNER_NONSPINNABLE;	/* Handoff granted */
> +
>  	if (state != OWNER_WRITER)
>  		return state;
>  
> @@ -1164,32 +1200,51 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
>  		 * the lock, attempt to spin on owner to accelerate lock
>  		 * transfer. If the previous owner is a on-cpu writer and it
>  		 * has just released the lock, OWNER_NULL will be returned.
> -		 * In this case, we attempt to acquire the lock again
> -		 * without sleeping.
> +		 * In this case, the waker may be in the process of early
> +		 * lock handoff. Use the wait_lock to synchronize with that
> +		 * before checking for handoff.
>  		 */
>  		if (waiter.handoff_set) {
>  			enum owner_state owner_state;
>  
>  			owner_state = rwsem_spin_on_owner(sem);
> -			if (owner_state == OWNER_NULL)
> -				goto trylock_again;
> +			if ((owner_state == OWNER_NULL) &&
> +			    READ_ONCE(waiter.task)) {

In theory, if there is a read outside some synchronization (say, locks),
not only READ_ONCE() but also WRITE_ONCE() is needed (even for writes
inside locks); otherwise, KCSAN will yell (if
KCSAN_ASSUME_PLAIN_WRITES_ATOMIC=n). This is because a plain write may
get torn.

Regards,
Boqun

> +				raw_spin_lock_irq(&sem->wait_lock);
> +				raw_spin_unlock_irq(&sem->wait_lock);
> +			}
> +			if (!READ_ONCE(waiter.task))
> +				goto handed_off;
>  		}
>  
>  		schedule_preempt_disabled();
>  		lockevent_inc(rwsem_sleep_writer);
> +		if (!READ_ONCE(waiter.task))
> +			goto handed_off;
> +
>  		set_current_state(state);
> -trylock_again:
>  		raw_spin_lock_irq(&sem->wait_lock);
>  	}
>  	__set_current_state(TASK_RUNNING);
>  	raw_spin_unlock_irq(&sem->wait_lock);
> +out:
>  	lockevent_inc(rwsem_wlock);
>  	trace_contention_end(sem, 0);
>  	return sem;
>  
> +handed_off:
> +	/* Write lock handed off */
> +	set_current_state(TASK_RUNNING);	/* smp_mb() */
> +	goto out;
> +
>  out_nolock:
>  	__set_current_state(TASK_RUNNING);
>  	raw_spin_lock_irq(&sem->wait_lock);
> +	if (!waiter.task) {
> +		smp_acquire__after_ctrl_dep();
> +		raw_spin_unlock_irq(&sem->wait_lock);
> +		goto out;
> +	}
>  	rwsem_del_wake_waiter(sem, &waiter, &wake_q);
>  	lockevent_inc(rwsem_wlock_fail);
>  	trace_contention_end(sem, -EINTR);
> -- 
> 2.31.1
> 


* Re: [PATCH v2 2/3] locking/rwsem: Enable early rwsem writer lock handoff
  2023-02-17  5:24   ` Boqun Feng
@ 2023-02-17 14:25     ` Waiman Long
  0 siblings, 0 replies; 8+ messages in thread
From: Waiman Long @ 2023-02-17 14:25 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, linux-kernel, Hillf Danton


On 2/17/23 00:24, Boqun Feng wrote:
> On Thu, Feb 16, 2023 at 04:09:32PM -0500, Waiman Long wrote:
>> The lock handoff provided in rwsem isn't a true handoff like that in
>> the mutex. Instead, it is more like a quiescent state where optimistic
>> spinning and lock stealing are disabled to make it easier for the first
>> waiter to acquire the lock.
>>
>> For readers, setting the HANDOFF bit will disable writers from stealing
>> the lock. The actual handoff is done at rwsem_wake() time after taking
>> the wait_lock. There isn't much we need to improve here other than
>> setting the RWSEM_NONSPINNABLE bit in owner.
>>
>> For writers, setting the HANDOFF bit does not guarantee that it can
>> acquire the rwsem successfully in a subsequent rwsem_try_write_lock()
>> after setting the bit there. A reader can come in and add a
>> RWSEM_READER_BIAS temporarily which can spoil the takeover of the rwsem
>> in rwsem_try_write_lock() leading to additional delay.
>>
>> For mutex, lock handoff is done at unlock time as the owner value and
>> the handoff bit is in the same lock word and can be updated atomically.
>>
>> That is the not case for rwsem which has a count value for locking and
>> a different owner value for storing lock owner. In addition, the handoff
>> processing differs depending on whether the first waiter is a writer or a
>> reader. We can only make that waiter type determination after acquiring
>> the wait lock. Together with the fact that the RWSEM_FLAG_HANDOFF bit
>> is stable while holding the wait_lock, the most convenient place to
>> do the early handoff is at rwsem_mark_wake() where wait_lock has to be
>> acquired anyway.
>>
>> There isn't much additional cost in doing this check and early handoff
>> in rwsem_mark_wake() while increasing the chance that a lock handoff
>> will be successful when the HANDOFF setting writer wakes up without
>> even the need to take the wait_lock at all. Note that if early handoff
>> fails to happen in rwsem_mark_wake(), a late handoff can still happen
>> when the awoken writer calls rwsem_try_write_lock().
>>
>> Kernel test robot noticed a 19.3% improvement of
>> will-it-scale.per_thread_ops in an earlier version of this commit [1].
>>
>> [1] https://lore.kernel.org/lkml/202302122155.87699b56-oliver.sang@intel.com/
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>>   kernel/locking/lock_events_list.h |  1 +
>>   kernel/locking/rwsem.c            | 71 +++++++++++++++++++++++++++----
>>   2 files changed, 64 insertions(+), 8 deletions(-)
>>
>> diff --git a/kernel/locking/lock_events_list.h b/kernel/locking/lock_events_list.h
>> index 97fb6f3f840a..fd80f5828f24 100644
>> --- a/kernel/locking/lock_events_list.h
>> +++ b/kernel/locking/lock_events_list.h
>> @@ -67,3 +67,4 @@ LOCK_EVENT(rwsem_rlock_handoff)	/* # of read lock handoffs		*/
>>   LOCK_EVENT(rwsem_wlock)		/* # of write locks acquired		*/
>>   LOCK_EVENT(rwsem_wlock_fail)	/* # of failed write lock acquisitions	*/
>>   LOCK_EVENT(rwsem_wlock_handoff)	/* # of write lock handoffs		*/
>> +LOCK_EVENT(rwsem_wlock_ehandoff) /* # of write lock early handoffs	*/
>> diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
>> index e589f69793df..fc3961ceabe8 100644
>> --- a/kernel/locking/rwsem.c
>> +++ b/kernel/locking/rwsem.c
>> @@ -412,8 +412,9 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
>>   			    enum rwsem_wake_type wake_type,
>>   			    struct wake_q_head *wake_q)
>>   {
>> +	long count = atomic_long_read(&sem->count);
>>   	struct rwsem_waiter *waiter, *tmp;
>> -	long count, woken = 0, adjustment = 0;
>> +	long woken = 0, adjustment = 0;
>>   	struct list_head wlist;
>>   
>>   	lockdep_assert_held(&sem->wait_lock);
>> @@ -432,19 +433,39 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
>>   		 * Mark writer at the front of the queue for wakeup.
>>   		 * Until the task is actually later awoken later by
>>   		 * the caller, other writers are able to steal it.
>> +		 *
>> +		 * *Unless* HANDOFF is set, in which case only the
>> +		 * first waiter is allowed to take it.
>> +		 *
>>   		 * Readers, on the other hand, will block as they
>>   		 * will notice the queued writer.
>>   		 */
>>   		wake_q_add(wake_q, waiter->task);
>>   		lockevent_inc(rwsem_wake_writer);
>> +
>> +		if ((count & RWSEM_LOCK_MASK) || !(count & RWSEM_FLAG_HANDOFF))
>> +			return;
>> +
>> +		/*
>> +		 * If the rwsem is free and handoff flag is set with wait_lock
>> +		 * held, no other CPUs can take an active lock. We can do an
>> +		 * early handoff.
>> +		 */
>> +		adjustment = RWSEM_WRITER_LOCKED - RWSEM_FLAG_HANDOFF;
>> +		atomic_long_set(&sem->owner, (long)waiter->task);
>> +		waiter->task = NULL;
>> +		atomic_long_add(adjustment, &sem->count);
>> +		rwsem_del_waiter(sem, waiter);
>> +		lockevent_inc(rwsem_wlock_ehandoff);
>>   	}
>>   	return;
>>   
>>   wake_readers:
>>   	/*
>> -	 * No reader wakeup if there are too many of them already.
>> +	 * No reader wakeup if there are too many of them already or
>> +	 * something wrong happens.
>>   	 */
>> -	if (unlikely(atomic_long_read(&sem->count) < 0))
>> +	if (unlikely(count < 0))
>>   		return;
>>   
>>   	/*
>> @@ -468,7 +489,12 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
>>   					adjustment -= RWSEM_FLAG_HANDOFF;
>>   					lockevent_inc(rwsem_rlock_handoff);
>>   				}
>> +				/*
>> +				 * With HANDOFF set for reader, we must
>> +				 * terminate all spinning.
>> +				 */
>>   				waiter->handoff_set = true;
>> +				rwsem_set_nonspinnable(sem);
>>   			}
>>   
>>   			atomic_long_add(-adjustment, &sem->count);
>> @@ -610,6 +636,12 @@ static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
>>   
>>   	lockdep_assert_held(&sem->wait_lock);
>>   
>> +	if (!waiter->task) {
>> +		/* Write lock handed off */
>> +		smp_acquire__after_ctrl_dep();
> I don't think you need smp_acquire__after_ctrl_dep() here, since:
>
> *	The other side is just a normal write "waiter->task = NULL"
> *	The "&sem->wait_lock" already provides the necessary ACQUIRE
>
> , but I come to this series so late, I may miss something subtle.

You are probably right about that. I was just being on the cautious side
to make sure that there is a proper acquire barrier. Thinking about it,
the lock acquire is actually done in the previous wait_lock critical
section where the early lock handoff is done, so the wait_lock acquire
barrier should be good enough.
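
In other words, roughly (sketch only):

	/* waker, in rwsem_mark_wake() with wait_lock held */
	atomic_long_set(&sem->owner, (long)waiter->task);
	waiter->task = NULL;
	atomic_long_add(adjustment, &sem->count);
	...
	raw_spin_unlock(&sem->wait_lock);	/* RELEASE (by the caller) */

	/* HANDOFF-setting writer, later */
	raw_spin_lock(&sem->wait_lock);		/* ACQUIRE */
	if (!waiter->task)
		/* sees NULL plus every store before the unlock above */
		return true;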


>
>> +		return true;
>> +	}
>> +
>>   	count = atomic_long_read(&sem->count);
>>   	do {
>>   		bool has_handoff = !!(count & RWSEM_FLAG_HANDOFF);
>> @@ -755,6 +787,10 @@ rwsem_spin_on_owner(struct rw_semaphore *sem)
>>   
>>   	owner = rwsem_owner_flags(sem, &flags);
>>   	state = rwsem_owner_state(owner, flags);
>> +
>> +	if (owner == current)
>> +		return OWNER_NONSPINNABLE;	/* Handoff granted */
>> +
>>   	if (state != OWNER_WRITER)
>>   		return state;
>>   
>> @@ -1164,32 +1200,51 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
>>   		 * the lock, attempt to spin on owner to accelerate lock
>>   		 * transfer. If the previous owner is a on-cpu writer and it
>>   		 * has just released the lock, OWNER_NULL will be returned.
>> -		 * In this case, we attempt to acquire the lock again
>> -		 * without sleeping.
>> +		 * In this case, the waker may be in the process of early
>> +		 * lock handoff. Use the wait_lock to synchronize with that
>> +		 * before checking for handoff.
>>   		 */
>>   		if (waiter.handoff_set) {
>>   			enum owner_state owner_state;
>>   
>>   			owner_state = rwsem_spin_on_owner(sem);
>> -			if (owner_state == OWNER_NULL)
>> -				goto trylock_again;
>> +			if ((owner_state == OWNER_NULL) &&
>> +			    READ_ONCE(waiter.task)) {
> In theory, if there is a read outside some synchronization (say locks),
> not only READ_ONCE(), but also WRITE_ONCE() is needed (even for write
> inside locks), otherwise, KCSAN will yell (if
> KCSAN_ASSUME_PLAIN_WRITES_ATOMIC=n). This is because plain write may get
> teared.

Yes, I should have used a WRITE_ONCE() when writing NULL to waiter->task.
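
i.e., roughly (a sketch against the patch above, not the final code):

	/* rwsem_mark_wake(), early handoff path */
	WRITE_ONCE(waiter->task, NULL);	/* marked write, read locklessly below */

	/* rwsem_down_write_slowpath() */
	if (!READ_ONCE(waiter.task))	/* lockless read of the handoff signal */
		goto handed_off;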

Thanks,
Longman



* Re: [PATCH v2 2/3] locking/rwsem: Enable early rwsem writer lock handoff
  2023-02-16 21:09 ` [PATCH v2 2/3] locking/rwsem: Enable early rwsem writer lock handoff Waiman Long
  2023-02-17  5:24   ` Boqun Feng
@ 2023-02-23 12:32   ` Peter Zijlstra
  2023-02-23 14:16     ` Waiman Long
  1 sibling, 1 reply; 8+ messages in thread
From: Peter Zijlstra @ 2023-02-23 12:32 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Will Deacon, Boqun Feng, linux-kernel, Hillf Danton

On Thu, Feb 16, 2023 at 04:09:32PM -0500, Waiman Long wrote:
> @@ -432,19 +433,39 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
>  		 * Mark writer at the front of the queue for wakeup.
>  		 * Until the task is actually later awoken later by
>  		 * the caller, other writers are able to steal it.
> +		 *
> +		 * *Unless* HANDOFF is set, in which case only the
> +		 * first waiter is allowed to take it.
> +		 *
>  		 * Readers, on the other hand, will block as they
>  		 * will notice the queued writer.
>  		 */
>  		wake_q_add(wake_q, waiter->task);
>  		lockevent_inc(rwsem_wake_writer);
> +
> +		if ((count & RWSEM_LOCK_MASK) || !(count & RWSEM_FLAG_HANDOFF))
> +			return;
> +
> +		/*
> +		 * If the rwsem is free and handoff flag is set with wait_lock
> +		 * held, no other CPUs can take an active lock. We can do an
> +		 * early handoff.
> +		 */
> +		adjustment = RWSEM_WRITER_LOCKED - RWSEM_FLAG_HANDOFF;
> +		atomic_long_set(&sem->owner, (long)waiter->task);
> +		waiter->task = NULL;
> +		atomic_long_add(adjustment, &sem->count);
> +		rwsem_del_waiter(sem, waiter);
> +		lockevent_inc(rwsem_wlock_ehandoff);
>  	}

*sigh*... you can't even properly copy/paste from the reader side :/
This is broken: the moment you do waiter->task = NULL, the waiter can
disappear from under you and that rwsem_del_waiter() is a use-after-free.

Nor did you unify the reader and writer side and still have a giant
trainwreck on your hands..

Surely all that isn't too hard... let me post it.


* Re: [PATCH v2 2/3] locking/rwsem: Enable early rwsem writer lock handoff
  2023-02-23 12:32   ` Peter Zijlstra
@ 2023-02-23 14:16     ` Waiman Long
  0 siblings, 0 replies; 8+ messages in thread
From: Waiman Long @ 2023-02-23 14:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Will Deacon, Boqun Feng, linux-kernel, Hillf Danton


On 2/23/23 07:32, Peter Zijlstra wrote:
> On Thu, Feb 16, 2023 at 04:09:32PM -0500, Waiman Long wrote:
>> @@ -432,19 +433,39 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
>>   		 * Mark writer at the front of the queue for wakeup.
>>   		 * Until the task is actually later awoken later by
>>   		 * the caller, other writers are able to steal it.
>> +		 *
>> +		 * *Unless* HANDOFF is set, in which case only the
>> +		 * first waiter is allowed to take it.
>> +		 *
>>   		 * Readers, on the other hand, will block as they
>>   		 * will notice the queued writer.
>>   		 */
>>   		wake_q_add(wake_q, waiter->task);
>>   		lockevent_inc(rwsem_wake_writer);
>> +
>> +		if ((count & RWSEM_LOCK_MASK) || !(count & RWSEM_FLAG_HANDOFF))
>> +			return;
>> +
>> +		/*
>> +		 * If the rwsem is free and handoff flag is set with wait_lock
>> +		 * held, no other CPUs can take an active lock. We can do an
>> +		 * early handoff.
>> +		 */
>> +		adjustment = RWSEM_WRITER_LOCKED - RWSEM_FLAG_HANDOFF;
>> +		atomic_long_set(&sem->owner, (long)waiter->task);
>> +		waiter->task = NULL;
>> +		atomic_long_add(adjustment, &sem->count);
>> +		rwsem_del_waiter(sem, waiter);
>> +		lockevent_inc(rwsem_wlock_ehandoff);
>>   	}
> *sigh*... you can't even properly copy/paste from the reader side :/
> This is broken, the moment you do waiter->task = NULL, waiter can
> dissapear from under you and that rwsem_del_waiter() is a UaF.
>
> Nor did you unify the reader and writer side and still have a giant
> trainwreck on your hands..
>
> Surely all that isn't too hard... let me post it.
>
Right, I should move the clearing of waiter->task to after the waiter
is deleted. Thanks for spotting that. I will review your patches soon.
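
Something like this, roughly (a sketch of the corrected ordering, not an
actual follow-up patch):

	/* early write-lock handoff in rwsem_mark_wake(), wait_lock held */
	adjustment = RWSEM_WRITER_LOCKED - RWSEM_FLAG_HANDOFF;
	atomic_long_set(&sem->owner, (long)waiter->task);
	atomic_long_add(adjustment, &sem->count);
	rwsem_del_waiter(sem, waiter);
	lockevent_inc(rwsem_wlock_ehandoff);
	/*
	 * Clear ->task last: once the woken writer sees NULL, it may
	 * return and its on-stack rwsem_waiter can go away, so no
	 * further accesses to *waiter are allowed after this store.
	 */
	WRITE_ONCE(waiter->task, NULL);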

Cheers,
Longman


