linux-kernel.vger.kernel.org archive mirror
* [PATCH-tip 0/2] locking/rwsem: Miscellaneous rwsem enhancements
@ 2023-02-13 19:48 Waiman Long
  2023-02-13 19:48 ` [PATCH 1/2] locking/rwsem: Enable early rwsem writer lock handoff Waiman Long
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Waiman Long @ 2023-02-13 19:48 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng
  Cc: linux-kernel, Waiman Long

The first patch in this series is the same as patch 4 of the v7 patch
series [1], with some updates to the commit log and comments. The version
number is reset here as the first 3 patches of that series have been
merged. Patch 2 is another minor enhancement for some specific use cases.

[1] https://lore.kernel.org/lkml/20230126003628.365092-1-longman@redhat.com/

Waiman Long (2):
  locking/rwsem: Enable early rwsem writer lock handoff
  locking/rwsem: Wake up all readers for wait queue waker

 kernel/locking/rwsem.c | 89 +++++++++++++++++++++++++++++++++---------
 1 file changed, 71 insertions(+), 18 deletions(-)

-- 
2.31.1



* [PATCH 1/2] locking/rwsem: Enable early rwsem writer lock handoff
  2023-02-13 19:48 [PATCH-tip 0/2] locking/rwsem: Miscellaneous rwsem enhancements Waiman Long
@ 2023-02-13 19:48 ` Waiman Long
  2023-02-13 19:48 ` [PATCH 2/2] locking/rwsem: Wake up all readers for wait queue waker Waiman Long
       [not found] ` <20230214030901.3250-1-hdanton@sina.com>
  2 siblings, 0 replies; 4+ messages in thread
From: Waiman Long @ 2023-02-13 19:48 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng
  Cc: linux-kernel, Waiman Long

The lock handoff provided in rwsem isn't a true handoff like that in
the mutex. Instead, it is more like a quiescent state where optimistic
spinning and lock stealing are disabled to make it easier for the first
waiter to acquire the lock.

For readers, setting the HANDOFF bit will prevent writers from stealing
the lock. The actual handoff is done at rwsem_wake() time after taking
the wait_lock. There isn't much we need to improve here other than
setting the RWSEM_NONSPINNABLE bit in owner.

For writers, setting the HANDOFF bit does not guarantee that the writer
can acquire the rwsem in a subsequent rwsem_try_write_lock() call. A
reader can come in and temporarily add a RWSEM_READER_BIAS, which can
spoil the takeover of the rwsem in rwsem_try_write_lock() and lead to
additional delay.
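
The failure mode for writers can be illustrated with a small userspace
model. This is only a sketch built on C11 atomics, not the kernel code:
the model_* name, the shortened macro names and the bit values are
assumptions made for illustration, and the WAITERS bit handling is
simplified.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative bit layout, chosen for this sketch only. */
#define WRITER_LOCKED	(1L << 0)
#define FLAG_WAITERS	(1L << 1)
#define FLAG_HANDOFF	(1L << 2)
#define READER_BIAS	(1L << 8)
#define READER_MASK	(~(READER_BIAS - 1))
#define LOCK_MASK	(WRITER_LOCKED | READER_MASK)

/*
 * Hypothetical model of the first waiter's write trylock: it fails
 * whenever any lock bit is set, including a reader bias that a reader
 * added only temporarily and is about to back out.
 */
static bool model_try_write_lock(atomic_long *count)
{
	long old = atomic_load(count);

	do {
		if (old & LOCK_MASK)
			return false;	/* transient reader bias spoils it */
	} while (!atomic_compare_exchange_weak(count, &old,
			(old | WRITER_LOCKED) & ~FLAG_HANDOFF));

	return true;
}
```

In this model, a trylock that races with a reader's temporary bias fails
even though HANDOFF is set, and only succeeds once the reader has backed
out.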

For mutex, lock handoff is done at unlock time as the owner value and
the handoff bit are in the same lock word and can be updated atomically.

That is not the case for rwsem which has a count value for locking and
a different owner value for storing lock owner. In addition, the handoff
processing differs depending on whether the first waiter is a writer or a
reader. We can only make that waiter type determination after acquiring
the wait lock. Together with the fact that the RWSEM_FLAG_HANDOFF bit
is stable while holding the wait_lock, the most convenient place to do
the early handoff is at rwsem_wake() where wait_lock has to be acquired
anyway. There isn't much additional cost in doing this check there while
increasing the chance that a lock handoff will be successful when the
writer wakes up.

Since a lot can happen between unlock time and the acquisition of the
wait_lock in rwsem_wake(), we have to reconfirm that the handoff bit is
still set and that the lock is free before doing the handoff.
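
The recheck described above can be sketched as a userspace model. It is
an illustration under stated assumptions rather than the patch itself:
model_sem, model_try_early_handoff() and the bit values are hypothetical,
and the caller is assumed to hold the equivalent of wait_lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative bit layout, chosen for this sketch only. */
#define WRITER_LOCKED	(1L << 0)
#define FLAG_WAITERS	(1L << 1)
#define FLAG_HANDOFF	(1L << 2)
#define READER_BIAS	(1L << 8)
#define READER_MASK	(~(READER_BIAS - 1))
#define LOCK_MASK	(WRITER_LOCKED | READER_MASK)

struct model_sem {
	atomic_long count;
};

/*
 * Hypothetical model of the early handoff decision. Returns true if
 * the lock was handed to the first waiter, which must be a writer.
 */
static bool model_try_early_handoff(struct model_sem *sem,
				    bool first_is_writer)
{
	long count = atomic_load(&sem->count);

	/* Reconfirm: the lock must be free and HANDOFF still set. */
	if ((count & LOCK_MASK) || !(count & FLAG_HANDOFF))
		return false;

	/* Reader handoff is left to the regular wakeup path. */
	if (!first_is_writer)
		return false;

	/* Set the writer bit and clear HANDOFF in a single add. */
	atomic_fetch_add(&sem->count, WRITER_LOCKED - FLAG_HANDOFF);
	return true;
}
```

The single add of (WRITER_LOCKED - FLAG_HANDOFF) works because HANDOFF
is known to be set and the writer bit is known to be clear at that point.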

Running a 96-thread rwsem locking test on a 96-thread x86-64 system,
the locking throughput increases slightly from 588 kops/s to 592 kops/s
with this change.

Kernel test robot also noticed a 19.3% improvement in
will-it-scale.per_thread_ops due to this commit [1].

[1] https://lore.kernel.org/lkml/202302122155.87699b56-oliver.sang@intel.com/

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/rwsem.c | 74 +++++++++++++++++++++++++++++++++++-------
 1 file changed, 63 insertions(+), 11 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index acb5a50309a1..3936a5fe1229 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -40,7 +40,7 @@
  *
  * When the rwsem is reader-owned and a spinning writer has timed out,
  * the nonspinnable bit will be set to disable optimistic spinning.
-
+ *
  * When a writer acquires a rwsem, it puts its task_struct pointer
  * into the owner field. It is cleared after an unlock.
  *
@@ -430,6 +430,10 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
 			 * Mark writer at the front of the queue for wakeup.
 			 * Until the task is actually later awoken later by
 			 * the caller, other writers are able to steal it.
+			 *
+			 * *Unless* HANDOFF is set, in which case only the
+			 * first waiter is allowed to take it.
+			 *
 			 * Readers, on the other hand, will block as they
 			 * will notice the queued writer.
 			 */
@@ -467,7 +471,12 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
 					adjustment -= RWSEM_FLAG_HANDOFF;
 					lockevent_inc(rwsem_rlock_handoff);
 				}
+				/*
+				 * With HANDOFF set for reader, we must
+				 * terminate all spinning.
+				 */
 				waiter->handoff_set = true;
+				rwsem_set_nonspinnable(sem);
 			}
 
 			atomic_long_add(-adjustment, &sem->count);
@@ -609,6 +618,12 @@ static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
 
 	lockdep_assert_held(&sem->wait_lock);
 
+	if (!waiter->task) {
+		/* Write lock handed off */
+		smp_acquire__after_ctrl_dep();
+		return true;
+	}
+
 	count = atomic_long_read(&sem->count);
 	do {
 		bool has_handoff = !!(count & RWSEM_FLAG_HANDOFF);
@@ -754,6 +769,10 @@ rwsem_spin_on_owner(struct rw_semaphore *sem)
 
 	owner = rwsem_owner_flags(sem, &flags);
 	state = rwsem_owner_state(owner, flags);
+
+	if (owner == current)
+		return OWNER_NONSPINNABLE;	/* Handoff granted */
+
 	if (state != OWNER_WRITER)
 		return state;
 
@@ -844,7 +863,6 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 		 * Try to acquire the lock
 		 */
 		taken = rwsem_try_write_lock_unqueued(sem);
-
 		if (taken)
 			break;
 
@@ -1168,21 +1186,23 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 		 * without sleeping.
 		 */
 		if (waiter.handoff_set) {
-			enum owner_state owner_state;
-
-			owner_state = rwsem_spin_on_owner(sem);
-			if (owner_state == OWNER_NULL)
-				goto trylock_again;
+			rwsem_spin_on_owner(sem);
+			if (!READ_ONCE(waiter.task)) {
+				/* Write lock handed off */
+				smp_acquire__after_ctrl_dep();
+				set_current_state(TASK_RUNNING);
+				goto out;
+			}
 		}
 
 		schedule_preempt_disabled();
 		lockevent_inc(rwsem_sleep_writer);
 		set_current_state(state);
-trylock_again:
 		raw_spin_lock_irq(&sem->wait_lock);
 	}
 	__set_current_state(TASK_RUNNING);
 	raw_spin_unlock_irq(&sem->wait_lock);
+out:
 	lockevent_inc(rwsem_wlock);
 	trace_contention_end(sem, 0);
 	return sem;
@@ -1190,6 +1210,11 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 out_nolock:
 	__set_current_state(TASK_RUNNING);
 	raw_spin_lock_irq(&sem->wait_lock);
+	if (!waiter.task) {
+		smp_acquire__after_ctrl_dep();
+		raw_spin_unlock_irq(&sem->wait_lock);
+		goto out;
+	}
 	rwsem_del_wake_waiter(sem, &waiter, &wake_q);
 	lockevent_inc(rwsem_wlock_fail);
 	trace_contention_end(sem, -EINTR);
@@ -1202,14 +1227,41 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
  */
 static struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
 {
-	unsigned long flags;
 	DEFINE_WAKE_Q(wake_q);
+	unsigned long flags;
+	unsigned long count;
 
 	raw_spin_lock_irqsave(&sem->wait_lock, flags);
 
-	if (!list_empty(&sem->wait_list))
-		rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
+	if (list_empty(&sem->wait_list))
+		goto unlock_out;
+
+	/*
+	 * If the rwsem is free and handoff flag is set with wait_lock held,
+	 * no other CPUs can take an active lock.
+	 */
+	count = atomic_long_read(&sem->count);
+	if (!(count & RWSEM_LOCK_MASK) && (count & RWSEM_FLAG_HANDOFF)) {
+		/*
+		 * Since rwsem_mark_wake() will handle the handoff to readers
+		 * properly, we don't need to do anything extra for readers.
+		 * Early handoff processing will only be needed for writers.
+		 */
+		struct rwsem_waiter *waiter = rwsem_first_waiter(sem);
+		long adj = RWSEM_WRITER_LOCKED - RWSEM_FLAG_HANDOFF;
+
+		if (waiter->type == RWSEM_WAITING_FOR_WRITE) {
+			atomic_long_set(&sem->owner, (long)waiter->task);
+			atomic_long_add(adj, &sem->count);
+			wake_q_add(&wake_q, waiter->task);
+			rwsem_del_waiter(sem, waiter);
+			waiter->task = NULL;	/* Signal the handoff */
+			goto unlock_out;
+		}
+	}
+	rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
 
+unlock_out:
 	raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
 	wake_up_q(&wake_q);
 
-- 
2.31.1



* [PATCH 2/2] locking/rwsem: Wake up all readers for wait queue waker
  2023-02-13 19:48 [PATCH-tip 0/2] locking/rwsem: Miscellaneous rwsem enhancements Waiman Long
  2023-02-13 19:48 ` [PATCH 1/2] locking/rwsem: Enable early rwsem writer lock handoff Waiman Long
@ 2023-02-13 19:48 ` Waiman Long
       [not found] ` <20230214030901.3250-1-hdanton@sina.com>
  2 siblings, 0 replies; 4+ messages in thread
From: Waiman Long @ 2023-02-13 19:48 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng
  Cc: linux-kernel, Waiman Long

As noted in commit 54c1ee4d614d ("locking/rwsem: Conditionally wake
waiters in reader/writer slowpaths"), it was possible for a rwsem to get
into a state where a reader-owned rwsem could have many readers waiting
in the wait queue but no writer.

Recently, it was found that one way to cause this condition is to have a
highly contended rwsem with many readers, like a mmap_sem. There can be
hundreds of readers waiting in the wait queue of a writer-owned mmap_sem.
The rwsem_wake() call by the up_write() call of the rwsem owning writer
can hit the 256 reader wakeup limit and leave the rest of the readers
remaining in the wait queue. The reason for the limit is to avoid
excessive delay in doing other useful work.

With commit 54c1ee4d614d ("locking/rwsem: Conditionally wake waiters in
reader/writer slowpaths"), a new incoming reader should wake up another
batch of up to 256 readers. However, these incoming readers or writers
will have to wait in the wait queue and there is nothing else they can
do until it is their turn to be woken up. This patch adds an additional
in_waitq argument to rwsem_mark_wake() to indicate that the waker is
in the wait queue and can ignore the limit.
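
The effect of the new argument on the wakeup batch size can be modeled
as below. This is a simplified sketch, not the rwsem_mark_wake() loop:
model_wake_readers() and MODEL_MAX_READERS_WAKEUP are hypothetical names,
and the wait queue is assumed to hold only readers.

```c
#include <assert.h>
#include <stdbool.h>

/* Mirrors the 256 reader wakeup limit mentioned above. */
#define MODEL_MAX_READERS_WAKEUP	0x100

/*
 * Hypothetical model: returns how many of the 'waiting' readers are
 * woken by one wakeup call. A waker that is itself in the wait queue
 * has nothing better to do, so it ignores the limit.
 */
static int model_wake_readers(int waiting, bool in_waitq)
{
	int woken = 0;

	assert(waiting >= 0);

	while (woken < waiting) {
		woken++;
		if (!in_waitq && woken >= MODEL_MAX_READERS_WAKEUP)
			break;
	}
	return woken;
}
```

With 600 waiting readers, a waker outside the wait queue stops at 256
in this model, while a waker inside the wait queue wakes all 600.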

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/rwsem.c | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 3936a5fe1229..723a8824b967 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -410,7 +410,7 @@ rwsem_del_waiter(struct rw_semaphore *sem, struct rwsem_waiter *waiter)
  */
 static void rwsem_mark_wake(struct rw_semaphore *sem,
 			    enum rwsem_wake_type wake_type,
-			    struct wake_q_head *wake_q)
+			    struct wake_q_head *wake_q, bool in_waitq)
 {
 	struct rwsem_waiter *waiter, *tmp;
 	long oldcount, woken = 0, adjustment = 0;
@@ -524,9 +524,10 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
 		list_move_tail(&waiter->list, &wlist);
 
 		/*
-		 * Limit # of readers that can be woken up per wakeup call.
+		 * Limit # of readers that can be woken up per wakeup call
+		 * unless the waker is waiting in the wait queue.
 		 */
-		if (unlikely(woken >= MAX_READERS_WAKEUP))
+		if (unlikely(!in_waitq && (woken >= MAX_READERS_WAKEUP)))
 			break;
 	}
 
@@ -597,7 +598,7 @@ rwsem_del_wake_waiter(struct rw_semaphore *sem, struct rwsem_waiter *waiter,
 	 * be eligible to acquire or spin on the lock.
 	 */
 	if (rwsem_del_waiter(sem, waiter) && first)
-		rwsem_mark_wake(sem, RWSEM_WAKE_ANY, wake_q);
+		rwsem_mark_wake(sem, RWSEM_WAKE_ANY, wake_q, false);
 	raw_spin_unlock_irq(&sem->wait_lock);
 	if (!wake_q_empty(wake_q))
 		wake_up_q(wake_q);
@@ -1004,7 +1005,7 @@ static inline void rwsem_cond_wake_waiter(struct rw_semaphore *sem, long count,
 		wake_type = RWSEM_WAKE_ANY;
 		clear_nonspinnable(sem);
 	}
-	rwsem_mark_wake(sem, wake_type, wake_q);
+	rwsem_mark_wake(sem, wake_type, wake_q, true);
 }
 
 /*
@@ -1042,7 +1043,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, long count, unsigned int stat
 			raw_spin_lock_irq(&sem->wait_lock);
 			if (!list_empty(&sem->wait_list))
 				rwsem_mark_wake(sem, RWSEM_WAKE_READ_OWNED,
-						&wake_q);
+						&wake_q, false);
 			raw_spin_unlock_irq(&sem->wait_lock);
 			wake_up_q(&wake_q);
 		}
@@ -1259,7 +1260,7 @@ static struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
 			goto unlock_out;
 		}
 	}
-	rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
+	rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q, false);
 
 unlock_out:
 	raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
@@ -1281,7 +1282,7 @@ static struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem)
 	raw_spin_lock_irqsave(&sem->wait_lock, flags);
 
 	if (!list_empty(&sem->wait_list))
-		rwsem_mark_wake(sem, RWSEM_WAKE_READ_OWNED, &wake_q);
+		rwsem_mark_wake(sem, RWSEM_WAKE_READ_OWNED, &wake_q, false);
 
 	raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
 	wake_up_q(&wake_q);
-- 
2.31.1



* Re: [PATCH 2/2] locking/rwsem: Wake up all readers for wait queue waker
       [not found] ` <20230214030901.3250-1-hdanton@sina.com>
@ 2023-02-16 21:07   ` Waiman Long
  0 siblings, 0 replies; 4+ messages in thread
From: Waiman Long @ 2023-02-16 21:07 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng, linux-kernel

On 2/13/23 22:09, Hillf Danton wrote:
> On Mon, 13 Feb 2023 14:48:32 -0500 Waiman Long <longman@redhat.com>
>>   
>> @@ -1281,7 +1282,7 @@ static struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem)
>>   	raw_spin_lock_irqsave(&sem->wait_lock, flags);
>>   
>>   	if (!list_empty(&sem->wait_list))
>> -		rwsem_mark_wake(sem, RWSEM_WAKE_READ_OWNED, &wake_q);
>> +		rwsem_mark_wake(sem, RWSEM_WAKE_READ_OWNED, &wake_q, false);
>>   
>>   	raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
>>   	wake_up_q(&wake_q);
>> -- 
>> 2.31.1
> Downgrade is conceptually the right time to let all read waiters go
> regardless of write waiters.

A downgraded task is still in the read critical section, though, and we
shouldn't introduce arbitrary latency to that. Let's focus on the easy
one and we can discuss other possibilities later.

Thanks,
Longman


