* [PATCH v6 0/6] locking/rwsem: Fix rwsem bugs & enable true lock handoff
@ 2022-11-18  2:20 Waiman Long
  2022-11-18  2:20 ` [PATCH v6 1/6] locking/rwsem: Prevent non-first waiter from spinning in down_write() slowpath Waiman Long
                   ` (7 more replies)
  0 siblings, 8 replies; 21+ messages in thread
From: Waiman Long @ 2022-11-18  2:20 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng
  Cc: linux-kernel, john.p.donnelly, Hillf Danton, Mukesh Ojha,
	Ting11 Wang 王婷,
	Waiman Long

v6:
 - Fix an error in patch 2 reported by kernel test robot.

v5:
 - Drop patch 2 and replace it with 2 new patches disabling preemption on
   all reader functions and writer functions respectively. The other
   patches are adjusted accordingly.

v4:
 - Update patch descriptions in patches 1 & 2 to make clear the live
   lock conditions that are being fixed by these patches. There is no code
   change from v3.

It turns out the current waiter optimistic spinning code does not work
that well if we have RT tasks in the mix. This patch series includes two
different fixes to resolve those issues. The last 3 patches modify the
handoff code to implement a true lock handoff similar to that of a mutex.

Waiman Long (6):
  locking/rwsem: Prevent non-first waiter from spinning in down_write()
    slowpath
  locking/rwsem: Disable preemption at all down_read*() and up_read()
    code paths
  locking/rwsem: Disable preemption at all down_write*() and up_write()
    code paths
  locking/rwsem: Change waiter->handoff_set to a handoff_state enum
  locking/rwsem: Enable direct rwsem lock handoff
  locking/rwsem: Update handoff lock events tracking

 kernel/locking/lock_events_list.h |   6 +-
 kernel/locking/rwsem.c            | 240 ++++++++++++++++++++++--------
 2 files changed, 182 insertions(+), 64 deletions(-)

-- 
2.31.1



* [PATCH v6 1/6] locking/rwsem: Prevent non-first waiter from spinning in down_write() slowpath
  2022-11-18  2:20 [PATCH v6 0/6] locking/rwsem: Fix rwsem bugs & enable true lock handoff Waiman Long
@ 2022-11-18  2:20 ` Waiman Long
  2022-12-16 15:02   ` Jiri Wiesner
  2023-01-20 22:58   ` Waiman Long
  2022-11-18  2:20 ` [PATCH v6 2/6] locking/rwsem: Disable preemption at all down_read*() and up_read() code paths Waiman Long
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 21+ messages in thread
From: Waiman Long @ 2022-11-18  2:20 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng
  Cc: linux-kernel, john.p.donnelly, Hillf Danton, Mukesh Ojha,
	Ting11 Wang 王婷,
	Waiman Long, stable

A non-first waiter can potentially spin in the for loop of
rwsem_down_write_slowpath() without sleeping but fail to acquire the
lock even if the rwsem is free if the following sequence happens:

  Non-first RT waiter    First waiter      Lock holder
  -------------------    ------------      -----------
  Acquire wait_lock
  rwsem_try_write_lock():
    Set handoff bit if RT or
      wait too long
    Set waiter->handoff_set
  Release wait_lock
                         Acquire wait_lock
                         Inherit waiter->handoff_set
                         Release wait_lock
					   Clear owner
                                           Release lock
  if (waiter.handoff_set) {
    rwsem_spin_on_owner();
    if (OWNER_NULL)
      goto trylock_again;
  }
  trylock_again:
  Acquire wait_lock
  rwsem_try_write_lock():
     if (first->handoff_set && (waiter != first))
     	return false;
  Release wait_lock

A non-first waiter cannot really acquire the rwsem even if it mistakenly
believes that it can spin on the OWNER_NULL value. If that waiter happens
to be an RT task running on the same CPU as the first waiter, it can
block the first waiter from acquiring the rwsem, leading to a live lock.
Fix this problem by making sure that a non-first waiter cannot spin in
the slowpath loop without sleeping.
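
Concretely, with the change below handoff_set can only ever be set on the
first waiter, so a non-first waiter no longer sees handoff_set and always
sleeps in the slowpath:

	if (new & RWSEM_FLAG_HANDOFF) {
		first->handoff_set = true;	/* only the first waiter may spin */
		lockevent_inc(rwsem_wlock_handoff);
		return false;
	}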

Fixes: d257cc8cb8d5 ("locking/rwsem: Make handoff bit handling more consistent")
Reviewed-and-tested-by: Mukesh Ojha <quic_mojha@quicinc.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Cc: stable@vger.kernel.org
---
 kernel/locking/rwsem.c | 19 +++++++++----------
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 44873594de03..be2df9ea7c30 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -624,18 +624,16 @@ static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
 			 */
 			if (first->handoff_set && (waiter != first))
 				return false;
-
-			/*
-			 * First waiter can inherit a previously set handoff
-			 * bit and spin on rwsem if lock acquisition fails.
-			 */
-			if (waiter == first)
-				waiter->handoff_set = true;
 		}
 
 		new = count;
 
 		if (count & RWSEM_LOCK_MASK) {
+			/*
+			 * A waiter (first or not) can set the handoff bit
+			 * if it is an RT task or wait in the wait queue
+			 * for too long.
+			 */
 			if (has_handoff || (!rt_task(waiter->task) &&
 					    !time_after(jiffies, waiter->timeout)))
 				return false;
@@ -651,11 +649,12 @@ static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
 	} while (!atomic_long_try_cmpxchg_acquire(&sem->count, &count, new));
 
 	/*
-	 * We have either acquired the lock with handoff bit cleared or
-	 * set the handoff bit.
+	 * We have either acquired the lock with handoff bit cleared or set
+	 * the handoff bit. Only the first waiter can have its handoff_set
+	 * set here to enable optimistic spinning in slowpath loop.
 	 */
 	if (new & RWSEM_FLAG_HANDOFF) {
-		waiter->handoff_set = true;
+		first->handoff_set = true;
 		lockevent_inc(rwsem_wlock_handoff);
 		return false;
 	}
-- 
2.31.1



* [PATCH v6 2/6] locking/rwsem: Disable preemption at all down_read*() and up_read() code paths
  2022-11-18  2:20 [PATCH v6 0/6] locking/rwsem: Fix rwsem bugs & enable true lock handoff Waiman Long
  2022-11-18  2:20 ` [PATCH v6 1/6] locking/rwsem: Prevent non-first waiter from spinning in down_write() slowpath Waiman Long
@ 2022-11-18  2:20 ` Waiman Long
  2022-12-16 15:03   ` Jiri Wiesner
  2022-11-18  2:20 ` [PATCH v6 3/6] locking/rwsem: Disable preemption at all down_write*() and up_write() " Waiman Long
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 21+ messages in thread
From: Waiman Long @ 2022-11-18  2:20 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng
  Cc: linux-kernel, john.p.donnelly, Hillf Danton, Mukesh Ojha,
	Ting11 Wang 王婷,
	Waiman Long

Commit 91d2a812dfb9 ("locking/rwsem: Make handoff writer optimistically
spin on owner") assumes that when the owner field is changed to NULL,
the lock will become free soon. Commit 48dfb5d2560d ("locking/rwsem:
Disable preemption while trying for rwsem lock") disables preemption
when acquiring a rwsem for write. However, preemption has not yet been
disabled when acquiring a read lock on a rwsem. So a reader can add a
RWSEM_READER_BIAS to the count without setting the owner field to signal
a reader, then get preempted by an RT task which spins in the writer
slowpath as the owner remains NULL, leading to a live lock.
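
For illustration, the window looks roughly like this in the reader fast
path (a simplified sketch, not the actual code):

	/* Sketch of __down_read(): add the bias, then set the owner. */
	if (!(atomic_long_add_return_acquire(RWSEM_READER_BIAS, &sem->count) &
	      RWSEM_READ_FAILED_MASK)) {
		/*
		 * Preemption window: sem->owner is still NULL here.  An RT
		 * writer preempting the reader at this point keeps spinning
		 * in rwsem_spin_on_owner(), because a NULL owner suggests
		 * that the lock is about to be freed.
		 */
		rwsem_set_reader_owned(sem);
	}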

One easy way to fix this problem is to disable preemption at all the
down_read*() and up_read() code paths as implemented in this patch.

Fixes: 91d2a812dfb9 ("locking/rwsem: Make handoff writer optimistically spin on owner")
Reported-by: Mukesh Ojha <quic_mojha@quicinc.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/rwsem.c | 24 ++++++++++++++++++------
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index be2df9ea7c30..ebaff8a87e1d 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -1091,7 +1091,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, long count, unsigned int stat
 			/* Ordered by sem->wait_lock against rwsem_mark_wake(). */
 			break;
 		}
-		schedule();
+		schedule_preempt_disabled();
 		lockevent_inc(rwsem_sleep_reader);
 	}
 
@@ -1253,14 +1253,20 @@ static struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem)
  */
 static inline int __down_read_common(struct rw_semaphore *sem, int state)
 {
+	int ret = 0;
 	long count;
 
+	preempt_disable();
 	if (!rwsem_read_trylock(sem, &count)) {
-		if (IS_ERR(rwsem_down_read_slowpath(sem, count, state)))
-			return -EINTR;
+		if (IS_ERR(rwsem_down_read_slowpath(sem, count, state))) {
+			ret = -EINTR;
+			goto out;
+		}
 		DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
 	}
-	return 0;
+out:
+	preempt_enable();
+	return ret;
 }
 
 static inline void __down_read(struct rw_semaphore *sem)
@@ -1280,19 +1286,23 @@ static inline int __down_read_killable(struct rw_semaphore *sem)
 
 static inline int __down_read_trylock(struct rw_semaphore *sem)
 {
+	int ret = 0;
 	long tmp;
 
 	DEBUG_RWSEMS_WARN_ON(sem->magic != sem, sem);
 
+	preempt_disable();
 	tmp = atomic_long_read(&sem->count);
 	while (!(tmp & RWSEM_READ_FAILED_MASK)) {
 		if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
 						    tmp + RWSEM_READER_BIAS)) {
 			rwsem_set_reader_owned(sem);
-			return 1;
+			ret = 1;
+			break;
 		}
 	}
-	return 0;
+	preempt_enable();
+	return ret;
 }
 
 /*
@@ -1334,6 +1344,7 @@ static inline void __up_read(struct rw_semaphore *sem)
 	DEBUG_RWSEMS_WARN_ON(sem->magic != sem, sem);
 	DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
 
+	preempt_disable();
 	rwsem_clear_reader_owned(sem);
 	tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
 	DEBUG_RWSEMS_WARN_ON(tmp < 0, sem);
@@ -1342,6 +1353,7 @@ static inline void __up_read(struct rw_semaphore *sem)
 		clear_nonspinnable(sem);
 		rwsem_wake(sem);
 	}
+	preempt_enable();
 }
 
 /*
-- 
2.31.1



* [PATCH v6 3/6] locking/rwsem: Disable preemption at all down_write*() and up_write() code paths
  2022-11-18  2:20 [PATCH v6 0/6] locking/rwsem: Fix rwsem bugs & enable true lock handoff Waiman Long
  2022-11-18  2:20 ` [PATCH v6 1/6] locking/rwsem: Prevent non-first waiter from spinning in down_write() slowpath Waiman Long
  2022-11-18  2:20 ` [PATCH v6 2/6] locking/rwsem: Disable preemption at all down_read*() and up_read() code paths Waiman Long
@ 2022-11-18  2:20 ` Waiman Long
  2022-12-16 15:03   ` Jiri Wiesner
  2022-11-18  2:20 ` [PATCH v6 4/6] locking/rwsem: Change waiter->handoff_set to a handoff_state enum Waiman Long
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 21+ messages in thread
From: Waiman Long @ 2022-11-18  2:20 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng
  Cc: linux-kernel, john.p.donnelly, Hillf Danton, Mukesh Ojha,
	Ting11 Wang 王婷,
	Waiman Long

The previous patch disabled preemption at all the down_read() and
up_read() code paths. For symmetry, this patch extends commit
48dfb5d2560d ("locking/rwsem: Disable preemption while trying for rwsem
lock") to have preemption disabled at all the down_write() and up_write()
code paths, including downgrade_write().
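
The resulting pattern in the write-lock entry points is (taken from the
hunks below):

	preempt_disable();
	if (unlikely(!rwsem_write_trylock(sem))) {
		if (IS_ERR(rwsem_down_write_slowpath(sem, state)))
			ret = -EINTR;
	}
	preempt_enable();
	return ret;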

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/rwsem.c | 38 +++++++++++++++++++-------------------
 1 file changed, 19 insertions(+), 19 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index ebaff8a87e1d..8159a69b5de8 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -256,16 +256,13 @@ static inline bool rwsem_read_trylock(struct rw_semaphore *sem, long *cntp)
 static inline bool rwsem_write_trylock(struct rw_semaphore *sem)
 {
 	long tmp = RWSEM_UNLOCKED_VALUE;
-	bool ret = false;
 
-	preempt_disable();
 	if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp, RWSEM_WRITER_LOCKED)) {
 		rwsem_set_owner(sem);
-		ret = true;
+		return true;
 	}
 
-	preempt_enable();
-	return ret;
+	return false;
 }
 
 /*
@@ -716,7 +713,6 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
 		return false;
 	}
 
-	preempt_disable();
 	/*
 	 * Disable preemption is equal to the RCU read-side crital section,
 	 * thus the task_strcut structure won't go away.
@@ -728,7 +724,6 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
 	if ((flags & RWSEM_NONSPINNABLE) ||
 	    (owner && !(flags & RWSEM_READER_OWNED) && !owner_on_cpu(owner)))
 		ret = false;
-	preempt_enable();
 
 	lockevent_cond_inc(rwsem_opt_fail, !ret);
 	return ret;
@@ -828,8 +823,6 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 	int loop = 0;
 	u64 rspin_threshold = 0;
 
-	preempt_disable();
-
 	/* sem->wait_lock should not be held when doing optimistic spinning */
 	if (!osq_lock(&sem->osq))
 		goto done;
@@ -937,7 +930,6 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 	}
 	osq_unlock(&sem->osq);
 done:
-	preempt_enable();
 	lockevent_cond_inc(rwsem_opt_fail, !taken);
 	return taken;
 }
@@ -1178,15 +1170,12 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 		if (waiter.handoff_set) {
 			enum owner_state owner_state;
 
-			preempt_disable();
 			owner_state = rwsem_spin_on_owner(sem);
-			preempt_enable();
-
 			if (owner_state == OWNER_NULL)
 				goto trylock_again;
 		}
 
-		schedule();
+		schedule_preempt_disabled();
 		lockevent_inc(rwsem_sleep_writer);
 		set_current_state(state);
 trylock_again:
@@ -1310,12 +1299,15 @@ static inline int __down_read_trylock(struct rw_semaphore *sem)
  */
 static inline int __down_write_common(struct rw_semaphore *sem, int state)
 {
+	int ret = 0;
+
+	preempt_disable();
 	if (unlikely(!rwsem_write_trylock(sem))) {
 		if (IS_ERR(rwsem_down_write_slowpath(sem, state)))
-			return -EINTR;
+			ret = -EINTR;
 	}
-
-	return 0;
+	preempt_enable();
+	return ret;
 }
 
 static inline void __down_write(struct rw_semaphore *sem)
@@ -1330,8 +1322,14 @@ static inline int __down_write_killable(struct rw_semaphore *sem)
 
 static inline int __down_write_trylock(struct rw_semaphore *sem)
 {
+	int ret;
+
+	preempt_disable();
 	DEBUG_RWSEMS_WARN_ON(sem->magic != sem, sem);
-	return rwsem_write_trylock(sem);
+	ret = rwsem_write_trylock(sem);
+	preempt_enable();
+
+	return ret;
 }
 
 /*
@@ -1374,9 +1372,9 @@ static inline void __up_write(struct rw_semaphore *sem)
 	preempt_disable();
 	rwsem_clear_owner(sem);
 	tmp = atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED, &sem->count);
-	preempt_enable();
 	if (unlikely(tmp & RWSEM_FLAG_WAITERS))
 		rwsem_wake(sem);
+	preempt_enable();
 }
 
 /*
@@ -1394,11 +1392,13 @@ static inline void __downgrade_write(struct rw_semaphore *sem)
 	 * write side. As such, rely on RELEASE semantics.
 	 */
 	DEBUG_RWSEMS_WARN_ON(rwsem_owner(sem) != current, sem);
+	preempt_disable();
 	tmp = atomic_long_fetch_add_release(
 		-RWSEM_WRITER_LOCKED+RWSEM_READER_BIAS, &sem->count);
 	rwsem_set_reader_owned(sem);
 	if (tmp & RWSEM_FLAG_WAITERS)
 		rwsem_downgrade_wake(sem);
+	preempt_enable();
 }
 
 #else /* !CONFIG_PREEMPT_RT */
-- 
2.31.1



* [PATCH v6 4/6] locking/rwsem: Change waiter->handoff_set to a handoff_state enum
  2022-11-18  2:20 [PATCH v6 0/6] locking/rwsem: Fix rwsem bugs & enable true lock handoff Waiman Long
                   ` (2 preceding siblings ...)
  2022-11-18  2:20 ` [PATCH v6 3/6] locking/rwsem: Disable preemption at all down_write*() and up_write() " Waiman Long
@ 2022-11-18  2:20 ` Waiman Long
  2022-11-18  2:20 ` [PATCH v6 5/6] locking/rwsem: Enable direct rwsem lock handoff Waiman Long
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 21+ messages in thread
From: Waiman Long @ 2022-11-18  2:20 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng
  Cc: linux-kernel, john.p.donnelly, Hillf Danton, Mukesh Ojha,
	Ting11 Wang 王婷,
	Waiman Long

Change the boolean waiter->handoff_set to an enum type so that more
states can be added in later patches. Also use READ_ONCE() for reads
outside wait_lock critical sections and WRITE_ONCE() for writes inside
wait_lock critical sections for proper synchronization. There is no
functional change.
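
The resulting access pattern is (taken from the hunks below):

	/* Under the wait_lock, writes use WRITE_ONCE(): */
	WRITE_ONCE(first->handoff_state, HANDOFF_REQUESTED);

	/* Outside the wait_lock, e.g. in the writer slowpath loop,
	 * reads use READ_ONCE(): */
	if (READ_ONCE(waiter.handoff_state)) {
		/* attempt to spin / trylock again */
	}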

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/rwsem.c | 21 +++++++++++++--------
 1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 8159a69b5de8..aa771fa1a1fe 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -335,12 +335,17 @@ enum rwsem_waiter_type {
 	RWSEM_WAITING_FOR_READ
 };
 
+enum rwsem_handoff_state {
+	HANDOFF_NONE = 0,
+	HANDOFF_REQUESTED,
+};
+
 struct rwsem_waiter {
 	struct list_head list;
 	struct task_struct *task;
 	enum rwsem_waiter_type type;
+	enum rwsem_handoff_state handoff_state;
 	unsigned long timeout;
-	bool handoff_set;
 };
 #define rwsem_first_waiter(sem) \
 	list_first_entry(&sem->wait_list, struct rwsem_waiter, list)
@@ -467,7 +472,7 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
 					adjustment -= RWSEM_FLAG_HANDOFF;
 					lockevent_inc(rwsem_rlock_handoff);
 				}
-				waiter->handoff_set = true;
+				WRITE_ONCE(waiter->handoff_state, HANDOFF_REQUESTED);
 			}
 
 			atomic_long_add(-adjustment, &sem->count);
@@ -619,7 +624,7 @@ static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
 			 * waiter is the one that set it. Otherwisee, we
 			 * still try to acquire the rwsem.
 			 */
-			if (first->handoff_set && (waiter != first))
+			if (first->handoff_state && (waiter != first))
 				return false;
 		}
 
@@ -647,11 +652,11 @@ static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
 
 	/*
 	 * We have either acquired the lock with handoff bit cleared or set
-	 * the handoff bit. Only the first waiter can have its handoff_set
+	 * the handoff bit. Only the first waiter can have its handoff_state
 	 * set here to enable optimistic spinning in slowpath loop.
 	 */
 	if (new & RWSEM_FLAG_HANDOFF) {
-		first->handoff_set = true;
+		WRITE_ONCE(first->handoff_state, HANDOFF_REQUESTED);
 		lockevent_inc(rwsem_wlock_handoff);
 		return false;
 	}
@@ -1035,7 +1040,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, long count, unsigned int stat
 	waiter.task = current;
 	waiter.type = RWSEM_WAITING_FOR_READ;
 	waiter.timeout = jiffies + RWSEM_WAIT_TIMEOUT;
-	waiter.handoff_set = false;
+	waiter.handoff_state = HANDOFF_NONE;
 
 	raw_spin_lock_irq(&sem->wait_lock);
 	if (list_empty(&sem->wait_list)) {
@@ -1122,7 +1127,7 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 	waiter.task = current;
 	waiter.type = RWSEM_WAITING_FOR_WRITE;
 	waiter.timeout = jiffies + RWSEM_WAIT_TIMEOUT;
-	waiter.handoff_set = false;
+	waiter.handoff_state = HANDOFF_NONE;
 
 	raw_spin_lock_irq(&sem->wait_lock);
 	rwsem_add_waiter(sem, &waiter);
@@ -1167,7 +1172,7 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 		 * In this case, we attempt to acquire the lock again
 		 * without sleeping.
 		 */
-		if (waiter.handoff_set) {
+		if (READ_ONCE(waiter.handoff_state)) {
 			enum owner_state owner_state;
 
 			owner_state = rwsem_spin_on_owner(sem);
-- 
2.31.1



* [PATCH v6 5/6] locking/rwsem: Enable direct rwsem lock handoff
  2022-11-18  2:20 [PATCH v6 0/6] locking/rwsem: Fix rwsem bugs & enable true lock handoff Waiman Long
                   ` (3 preceding siblings ...)
  2022-11-18  2:20 ` [PATCH v6 4/6] locking/rwsem: Change waiter->handoff_set to a handoff_state enum Waiman Long
@ 2022-11-18  2:20 ` Waiman Long
  2023-01-23 14:59   ` Peter Zijlstra
  2022-11-18  2:20 ` [PATCH v6 6/6] locking/rwsem: Update handoff lock events tracking Waiman Long
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 21+ messages in thread
From: Waiman Long @ 2022-11-18  2:20 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng
  Cc: linux-kernel, john.p.donnelly, Hillf Danton, Mukesh Ojha,
	Ting11 Wang 王婷,
	Waiman Long

The lock handoff provided in rwsem isn't a true handoff like that in
the mutex. Instead, it is more like a quiescent state where optimistic
spinning and lock stealing are disabled to make it easier for the first
waiter to acquire the lock.

Reworking the code to enable a true lock handoff is more complex due to
the following facts:
 1) The RWSEM_FLAG_HANDOFF bit is protected by the wait_lock and it
    is too expensive to always take the wait_lock in the unlock path
    to prevent racing.
 2) The reader lock fast path may add a RWSEM_READER_BIAS at the wrong
    time, preventing a proper lock handoff from a reader-owned rwsem.

A lock handoff can only be initiated when the following conditions are
true:
 1) The RWSEM_FLAG_HANDOFF bit is set.
 2) The task doing the handoff doesn't see any other active lock
    excluding the lock that it may itself hold.

The new handoff mechanism performs the handoff in rwsem_wake() to
minimize overhead. The rwsem count will be known at that point to
determine if a handoff should be done. However, there is a small time
gap between when the rwsem becomes free and when the wait_lock is taken,
during which a reader can come in and add a RWSEM_READER_BIAS to the
count, or the current first waiter can take the rwsem and clear
RWSEM_FLAG_HANDOFF. Either case will fail the handoff operation. To
handle the former case, a secondary handoff is also done in
rwsem_down_read_slowpath() to catch it.
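
Concretely, the primary handoff check in rwsem_wake(), done with the
wait_lock held, boils down to (see the hunk below):

	count = atomic_long_read(&sem->count);
	if (!(count & RWSEM_LOCK_MASK) && (count & RWSEM_FLAG_HANDOFF))
		rwsem_handoff(sem, 0, &wake_q);
	else
		rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);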

With true lock handoff, there is no need to do NULL owner spinning
anymore, as a wakeup will be performed if a handoff is possible. So it
is likely that the first waiter won't actually go to sleep even when
schedule() is called in this case.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/rwsem.c | 135 +++++++++++++++++++++++++++++++++++------
 1 file changed, 117 insertions(+), 18 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index aa771fa1a1fe..047e5fcb2457 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -338,6 +338,7 @@ enum rwsem_waiter_type {
 enum rwsem_handoff_state {
 	HANDOFF_NONE = 0,
 	HANDOFF_REQUESTED,
+	HANDOFF_GRANTED,
 };
 
 struct rwsem_waiter {
@@ -486,6 +487,12 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
 		 */
 		owner = waiter->task;
 		__rwsem_set_reader_owned(sem, owner);
+	} else if (waiter->handoff_state == HANDOFF_GRANTED) {
+		/*
+		 * rwsem_handoff() has added to count RWSEM_READER_BIAS of
+		 * the first waiter.
+		 */
+		adjustment = RWSEM_READER_BIAS;
 	}
 
 	/*
@@ -583,7 +590,7 @@ rwsem_del_wake_waiter(struct rw_semaphore *sem, struct rwsem_waiter *waiter,
 		      struct wake_q_head *wake_q)
 		      __releases(&sem->wait_lock)
 {
-	bool first = rwsem_first_waiter(sem) == waiter;
+	struct rwsem_waiter *first = rwsem_first_waiter(sem);
 
 	wake_q_init(wake_q);
 
@@ -592,8 +599,21 @@ rwsem_del_wake_waiter(struct rw_semaphore *sem, struct rwsem_waiter *waiter,
 	 * the first waiter, we wake up the remaining waiters as they may
 	 * be eligible to acquire or spin on the lock.
 	 */
-	if (rwsem_del_waiter(sem, waiter) && first)
+	if (rwsem_del_waiter(sem, waiter) && (waiter == first)) {
+		switch (waiter->handoff_state) {
+		case HANDOFF_GRANTED:
+			raw_spin_unlock_irq(&sem->wait_lock);
+			return;
+		case HANDOFF_REQUESTED:
+			/* Pass handoff state to the new first waiter */
+			first = rwsem_first_waiter(sem);
+			WRITE_ONCE(first->handoff_state, HANDOFF_REQUESTED);
+			fallthrough;
+		default:
+			break;
+		}
 		rwsem_mark_wake(sem, RWSEM_WAKE_ANY, wake_q);
+	}
 	raw_spin_unlock_irq(&sem->wait_lock);
 	if (!wake_q_empty(wake_q))
 		wake_up_q(wake_q);
@@ -759,6 +779,11 @@ rwsem_spin_on_owner(struct rw_semaphore *sem)
 
 	owner = rwsem_owner_flags(sem, &flags);
 	state = rwsem_owner_state(owner, flags);
+
+	/* A handoff may have been granted */
+	if (!flags && (owner == current))
+		return OWNER_NONSPINNABLE;
+
 	if (state != OWNER_WRITER)
 		return state;
 
@@ -969,6 +994,32 @@ rwsem_spin_on_owner(struct rw_semaphore *sem)
 }
 #endif
 
+/*
+ * Hand off the lock to the first waiter
+ */
+static void rwsem_handoff(struct rw_semaphore *sem, long adj,
+			  struct wake_q_head *wake_q)
+{
+	struct rwsem_waiter *waiter;
+	enum rwsem_wake_type wake_type;
+
+	lockdep_assert_held(&sem->wait_lock);
+	adj -= RWSEM_FLAG_HANDOFF;
+	waiter = rwsem_first_waiter(sem);
+	WRITE_ONCE(waiter->handoff_state, HANDOFF_GRANTED);
+	if (waiter->type == RWSEM_WAITING_FOR_WRITE) {
+		wake_type = RWSEM_WAKE_ANY;
+		adj += RWSEM_WRITER_LOCKED;
+		atomic_long_set(&sem->owner, (long)waiter->task);
+	} else {
+		wake_type = RWSEM_WAKE_READ_OWNED;
+		adj += RWSEM_READER_BIAS;
+		__rwsem_set_reader_owned(sem, waiter->task);
+	}
+	atomic_long_add(adj, &sem->count);
+	rwsem_mark_wake(sem, wake_type, wake_q);
+}
+
 /*
  * Prepare to wake up waiter(s) in the wait queue by putting them into the
  * given wake_q if the rwsem lock owner isn't a writer. If rwsem is likely
@@ -1043,6 +1094,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, long count, unsigned int stat
 	waiter.handoff_state = HANDOFF_NONE;
 
 	raw_spin_lock_irq(&sem->wait_lock);
+	count = atomic_long_read(&sem->count);
 	if (list_empty(&sem->wait_list)) {
 		/*
 		 * In case the wait queue is empty and the lock isn't owned
@@ -1050,7 +1102,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, long count, unsigned int stat
 		 * immediately as its RWSEM_READER_BIAS has already been set
 		 * in the count.
 		 */
-		if (!(atomic_long_read(&sem->count) & RWSEM_WRITER_MASK)) {
+		if (!(count & RWSEM_WRITER_MASK)) {
 			/* Provide lock ACQUIRE */
 			smp_acquire__after_ctrl_dep();
 			raw_spin_unlock_irq(&sem->wait_lock);
@@ -1059,13 +1111,36 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, long count, unsigned int stat
 			return sem;
 		}
 		adjustment += RWSEM_FLAG_WAITERS;
+	} else if ((count & RWSEM_FLAG_HANDOFF) &&
+		  ((count & RWSEM_LOCK_MASK) == RWSEM_READER_BIAS)) {
+		/*
+		 * If the waiter to be handed off is a reader, all the
+		 * readers in the wait queue will be waken up. As this reader
+		 * hasn't been queued in the wait queue yet, it may as well
+		 * keep its RWSEM_READER_BIAS and return after waking up
+		 * other readers in the queue.
+		 */
+		if (rwsem_first_waiter(sem)->type == RWSEM_WAITING_FOR_READ)
+			adjustment = 0;
+		rwsem_handoff(sem, adjustment, &wake_q);
+
+		if (!adjustment) {
+			raw_spin_unlock_irq(&sem->wait_lock);
+			wake_up_q(&wake_q);
+			return sem;
+		}
+		adjustment = 0;
 	}
 	rwsem_add_waiter(sem, &waiter);
 
-	/* we're now waiting on the lock, but no longer actively locking */
-	count = atomic_long_add_return(adjustment, &sem->count);
-
-	rwsem_cond_wake_waiter(sem, count, &wake_q);
+	if (adjustment) {
+		/*
+		 * We are now waiting on the lock with no handoff, but no
+		 * longer actively locking.
+		 */
+		count = atomic_long_add_return(adjustment, &sem->count);
+		rwsem_cond_wake_waiter(sem, count, &wake_q);
+	}
 	raw_spin_unlock_irq(&sem->wait_lock);
 
 	if (!wake_q_empty(&wake_q))
@@ -1154,6 +1229,8 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 	trace_contention_begin(sem, LCB_F_WRITE);
 
 	for (;;) {
+		enum rwsem_handoff_state handoff;
+
 		if (rwsem_try_write_lock(sem, &waiter)) {
 			/* rwsem_try_write_lock() implies ACQUIRE on success */
 			break;
@@ -1168,26 +1245,33 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 		 * After setting the handoff bit and failing to acquire
 		 * the lock, attempt to spin on owner to accelerate lock
 		 * transfer. If the previous owner is a on-cpu writer and it
-		 * has just released the lock, OWNER_NULL will be returned.
-		 * In this case, we attempt to acquire the lock again
-		 * without sleeping.
+		 * has just released the lock, handoff_state is likely to be
+		 * set to HANDOFF_GRANTED or is to be set soon.
 		 */
-		if (READ_ONCE(waiter.handoff_state)) {
-			enum owner_state owner_state;
+		handoff = READ_ONCE(waiter.handoff_state);
+		if (handoff) {
+			if (handoff == HANDOFF_REQUESTED) {
+				rwsem_spin_on_owner(sem);
+				handoff = READ_ONCE(waiter.handoff_state);
+			}
 
-			owner_state = rwsem_spin_on_owner(sem);
-			if (owner_state == OWNER_NULL)
-				goto trylock_again;
+			if (handoff == HANDOFF_GRANTED)
+				goto skip_sleep;
 		}
 
 		schedule_preempt_disabled();
 		lockevent_inc(rwsem_sleep_writer);
 		set_current_state(state);
-trylock_again:
+skip_sleep:
 		raw_spin_lock_irq(&sem->wait_lock);
+		if (waiter.handoff_state == HANDOFF_GRANTED) {
+			rwsem_del_waiter(sem, &waiter);
+			break;
+		}
 	}
 	__set_current_state(TASK_RUNNING);
 	raw_spin_unlock_irq(&sem->wait_lock);
+out_lock:
 	lockevent_inc(rwsem_wlock);
 	trace_contention_end(sem, 0);
 	return sem;
@@ -1196,6 +1280,9 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 	__set_current_state(TASK_RUNNING);
 	raw_spin_lock_irq(&sem->wait_lock);
 	rwsem_del_wake_waiter(sem, &waiter, &wake_q);
+	if (unlikely(READ_ONCE(waiter.handoff_state) == HANDOFF_GRANTED))
+		goto out_lock;
+
 	lockevent_inc(rwsem_wlock_fail);
 	trace_contention_end(sem, -EINTR);
 	return ERR_PTR(-EINTR);
@@ -1207,12 +1294,24 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
  */
 static struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
 {
-	unsigned long flags;
 	DEFINE_WAKE_Q(wake_q);
+	unsigned long flags;
+	long count;
 
 	raw_spin_lock_irqsave(&sem->wait_lock, flags);
 
-	if (!list_empty(&sem->wait_list))
+	if (list_empty(&sem->wait_list)) {
+		raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
+		return sem;
+	}
+	/*
+	 * If the rwsem is free and handoff flag is set with wait_lock held,
+	 * no other CPUs can take an active lock.
+	 */
+	count = atomic_long_read(&sem->count);
+	if (!(count & RWSEM_LOCK_MASK) && (count & RWSEM_FLAG_HANDOFF))
+		rwsem_handoff(sem, 0, &wake_q);
+	else
 		rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
 
 	raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
-- 
2.31.1



* [PATCH v6 6/6] locking/rwsem: Update handoff lock events tracking
  2022-11-18  2:20 [PATCH v6 0/6] locking/rwsem: Fix rwsem bugs & enable true lock handoff Waiman Long
                   ` (4 preceding siblings ...)
  2022-11-18  2:20 ` [PATCH v6 5/6] locking/rwsem: Enable direct rwsem lock handoff Waiman Long
@ 2022-11-18  2:20 ` Waiman Long
  2023-01-17 20:53 ` [PATCH v6 0/6] locking/rwsem: Fix rwsem bugs & enable true lock handoff Waiman Long
  2023-01-22 13:46 ` Peter Zijlstra
  7 siblings, 0 replies; 21+ messages in thread
From: Waiman Long @ 2022-11-18  2:20 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng
  Cc: linux-kernel, john.p.donnelly, Hillf Danton, Mukesh Ojha,
	Ting11 Wang 王婷,
	Waiman Long

With the new direct rwsem lock handoff, the corresponding handoff lock
events are updated to also track the number of secondary lock handoffs
in rwsem_down_read_slowpath() to see how prevalent those handoff
events are. The number of primary lock handoffs in the unlock paths is
(rwsem_handoff_read + rwsem_handoff_write - rwsem_handoff_rslow).

After running a 96-thread rwsem microbenchmark with equal number
of readers and writers on a 2-socket 96-thread system for 40s, the
following handoff stats were obtained:

  rwsem_handoff_read=189
  rwsem_handoff_rslow=1
  rwsem_handoff_write=6678
  rwsem_handoff_wspin=6681

The number of primary handoffs was 6866, whereas there was only one
secondary handoff for this test run.
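
That is, rwsem_handoff_read + rwsem_handoff_write - rwsem_handoff_rslow =
189 + 6678 - 1 = 6866 primary handoffs.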

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/lock_events_list.h | 6 ++++--
 kernel/locking/rwsem.c            | 9 +++++----
 2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/kernel/locking/lock_events_list.h b/kernel/locking/lock_events_list.h
index 97fb6f3f840a..04d101767c2c 100644
--- a/kernel/locking/lock_events_list.h
+++ b/kernel/locking/lock_events_list.h
@@ -63,7 +63,9 @@ LOCK_EVENT(rwsem_rlock)		/* # of read locks acquired		*/
 LOCK_EVENT(rwsem_rlock_steal)	/* # of read locks by lock stealing	*/
 LOCK_EVENT(rwsem_rlock_fast)	/* # of fast read locks acquired	*/
 LOCK_EVENT(rwsem_rlock_fail)	/* # of failed read lock acquisitions	*/
-LOCK_EVENT(rwsem_rlock_handoff)	/* # of read lock handoffs		*/
 LOCK_EVENT(rwsem_wlock)		/* # of write locks acquired		*/
 LOCK_EVENT(rwsem_wlock_fail)	/* # of failed write lock acquisitions	*/
-LOCK_EVENT(rwsem_wlock_handoff)	/* # of write lock handoffs		*/
+LOCK_EVENT(rwsem_handoff_read)	/* # of read lock handoffs		*/
+LOCK_EVENT(rwsem_handoff_write)	/* # of write lock handoffs		*/
+LOCK_EVENT(rwsem_handoff_rslow)	/* # of handoffs in read slowpath	*/
+LOCK_EVENT(rwsem_handoff_wspin)	/* # of handoff spins in write slowpath	*/
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 047e5fcb2457..b3135e12ca22 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -469,10 +469,8 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
 			 * force the issue.
 			 */
 			if (time_after(jiffies, waiter->timeout)) {
-				if (!(oldcount & RWSEM_FLAG_HANDOFF)) {
+				if (!(oldcount & RWSEM_FLAG_HANDOFF))
 					adjustment -= RWSEM_FLAG_HANDOFF;
-					lockevent_inc(rwsem_rlock_handoff);
-				}
 				WRITE_ONCE(waiter->handoff_state, HANDOFF_REQUESTED);
 			}
 
@@ -677,7 +675,6 @@ static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
 	 */
 	if (new & RWSEM_FLAG_HANDOFF) {
 		WRITE_ONCE(first->handoff_state, HANDOFF_REQUESTED);
-		lockevent_inc(rwsem_wlock_handoff);
 		return false;
 	}
 
@@ -1011,10 +1008,12 @@ static void rwsem_handoff(struct rw_semaphore *sem, long adj,
 		wake_type = RWSEM_WAKE_ANY;
 		adj += RWSEM_WRITER_LOCKED;
 		atomic_long_set(&sem->owner, (long)waiter->task);
+		lockevent_inc(rwsem_handoff_write);
 	} else {
 		wake_type = RWSEM_WAKE_READ_OWNED;
 		adj += RWSEM_READER_BIAS;
 		__rwsem_set_reader_owned(sem, waiter->task);
+		lockevent_inc(rwsem_handoff_read);
 	}
 	atomic_long_add(adj, &sem->count);
 	rwsem_mark_wake(sem, wake_type, wake_q);
@@ -1123,6 +1122,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, long count, unsigned int stat
 		if (rwsem_first_waiter(sem)->type == RWSEM_WAITING_FOR_READ)
 			adjustment = 0;
 		rwsem_handoff(sem, adjustment, &wake_q);
+		lockevent_inc(rwsem_handoff_rslow);
 
 		if (!adjustment) {
 			raw_spin_unlock_irq(&sem->wait_lock);
@@ -1253,6 +1253,7 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 			if (handoff == HANDOFF_REQUESTED) {
 				rwsem_spin_on_owner(sem);
 				handoff = READ_ONCE(waiter.handoff_state);
+				lockevent_inc(rwsem_handoff_wspin);
 			}
 
 			if (handoff == HANDOFF_GRANTED)
-- 
2.31.1



* Re: [PATCH v6 1/6] locking/rwsem: Prevent non-first waiter from spinning in down_write() slowpath
  2022-11-18  2:20 ` [PATCH v6 1/6] locking/rwsem: Prevent non-first waiter from spinning in down_write() slowpath Waiman Long
@ 2022-12-16 15:02   ` Jiri Wiesner
  2023-01-20 22:58   ` Waiman Long
  1 sibling, 0 replies; 21+ messages in thread
From: Jiri Wiesner @ 2022-12-16 15:02 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng,
	linux-kernel, john.p.donnelly, Hillf Danton, Mukesh Ojha,
	Ting11 Wang 王婷,
	stable

On Thu, Nov 17, 2022 at 09:20:11PM -0500, Waiman Long wrote:
> A non-first waiter can potentially spin in the for loop of
> rwsem_down_write_slowpath() without sleeping but fail to acquire the
> lock even if the rwsem is free if the following sequence happens:
> 
>   Non-first RT waiter    First waiter      Lock holder
>   -------------------    ------------      -----------
>   Acquire wait_lock
>   rwsem_try_write_lock():
>     Set handoff bit if RT or
>       wait too long
>     Set waiter->handoff_set
>   Release wait_lock
>                          Acquire wait_lock
>                          Inherit waiter->handoff_set
>                          Release wait_lock
> 					   Clear owner
>                                            Release lock
>   if (waiter.handoff_set) {
>     rwsem_spin_on_owner(();
>     if (OWNER_NULL)
>       goto trylock_again;
>   }
>   trylock_again:
>   Acquire wait_lock
>   rwsem_try_write_lock():
>      if (first->handoff_set && (waiter != first))
>      	return false;
>   Release wait_lock
> 
> A non-first waiter cannot really acquire the rwsem even if it mistakenly
> believes that it can spin on OWNER_NULL value. If that waiter happens
> to be an RT task running on the same CPU as the first waiter, it can
> block the first waiter from acquiring the rwsem leading to live lock.
> Fix this problem by making sure that a non-first waiter cannot spin in
> the slowpath loop without sleeping.
> 
> Fixes: d257cc8cb8d5 ("locking/rwsem: Make handoff bit handling more consistent")
> Reviewed-and-tested-by: Mukesh Ojha <quic_mojha@quicinc.com>
> Signed-off-by: Waiman Long <longman@redhat.com>
> Cc: stable@vger.kernel.org
> ---

I was checking if commit 6eebd5fb2083 ("locking/rwsem: Allow slowpath writer to ignore handoff bit if not set by first waiter") resolves the issue that was discussed in [1]. I modified the program and script slightly:
fsim.c:
#include <unistd.h>
#include <stdlib.h>
#include <signal.h>
void sig_handle(int sig) { exit(0); }
int main(void)
{
        unsigned long c = 0;
        signal(SIGALRM, sig_handle);
        alarm(3);
        while (1)
                c++;
}

run-fsim.sh:
#!/bin/bash
if [ ! -e fsim ]; then
        gcc -o fsim fsim.c
        if [ $? -ne 0 ]; then
                echo Failed to compile fsim
                exit -1
        fi
fi
MAX_ITERATIONS=20000
#The fsim processes are meant to run on both logical CPUs belonging to a CPU core, e.g. 1 and 129.
CPU_RANGE1="${1:-1 11}"
CPU_RANGE2="${1:-129 139}"
for i in `seq 1 $MAX_ITERATIONS`; do
        echo "Start $i/$MAX_ITERATIONS: `date`"
        for CPU in `seq $CPU_RANGE1` `seq $CPU_RANGE2`; do
                taskset -c $CPU chrt -r 10 ./fsim &>/dev/null &
                taskset -c $CPU chrt -r 20 ./fsim &>/dev/null &
                taskset -c $CPU chrt -r 30 ./fsim &>/dev/null &
                taskset -c $CPU chrt -r 40 ./fsim &>/dev/null &
        done
        echo "Wait  $i/$MAX_ITERATIONS: `date`"
        wait
done

No soft lockups were triggered but after 1.5 hours of testing, the fsim processes got stuck and only one of them was visible in the output of top:
> top - 18:45:01 up 44 min,  3 users,  load average: 72.00, 71.04, 54.81
> Tasks: 2226 total,   4 running, 2222 sleeping,   0 stopped,   0 zombie
> %Cpu(s):  0.0 us,  0.4 sy,  0.0 ni, 99.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> MiB Mem : 239777.1+total, 235671.7+free, 4332.156 used, 1435.641 buff/cache
> MiB Swap: 1023.996 total, 1023.996 free,    0.000 used. 235444.9+avail Mem
>    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
> 100666 root     -31   0       0      0      0 D 94.84 0.000  14:59.40 fsim
>  98193 root      20   0   42224   6844   3484 R 0.794 0.003   0:07.05 top
>      1 root      20   0   79220  12544   9124 S 0.000 0.005   0:11.95 systemd

All of the fsim processes got stuck at the same code path - while exiting:
> [ 2462.649033] INFO: task fsim:100600 blocked for more than 491 seconds.
> [ 2462.649036]       Tainted: G            E     N 5.14.21-sle15-sp5-221214-hoff3-7 #8
> [ 2462.649038] task:fsim            state:D stack:    0 pid:100600 ppid: 95456 flags:0x00000000
> [ 2462.649042] Call Trace:
> [ 2462.649045]  <TASK>
> [ 2462.649046]  __schedule+0x2cd/0x1140
> [ 2462.649059]  schedule+0x5c/0xc0
> [ 2462.649061]  rwsem_down_write_slowpath+0x349/0x5d0
> [ 2462.649070]  unlink_file_vma+0x2d/0x60
> [ 2462.649074]  free_pgtables+0x67/0x110
> [ 2462.649083]  exit_mmap+0xaf/0x1f0
> [ 2462.649088]  mmput+0x56/0x120
> [ 2462.649090]  do_exit+0x306/0xb50
> [ 2462.649095]  do_group_exit+0x3a/0xa0
> [ 2462.649098]  __x64_sys_exit_group+0x14/0x20
> [ 2462.649102]  do_syscall_64+0x5b/0x80
> [ 2462.649116]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
> [ 2462.649120] RIP: 0033:0x7f90abae6c46
> [ 2462.649122] RSP: 002b:00007ffc0ca21638 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
> [ 2462.649124] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f90abae6c46
> [ 2462.649125] RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000000
> [ 2462.649127] RBP: 00007f90abdf5970 R08: 00000000000000e7 R09: ffffffffffffff80
> [ 2462.649128] R10: 0000000000000002 R11: 0000000000000246 R12: 00007f90abdf5970
> [ 2462.649129] R13: 0000000000000001 R14: 00007f90abdf9328 R15: 0000000000000000
> [ 2462.649132]  </TASK>
> [ 2462.649133] INFO: task fsim:100601 blocked for more than 491 seconds.
> [ 2462.649216] INFO: task fsim:100603 blocked for more than 491 seconds.
> [ 2462.649295] INFO: task fsim:100604 blocked for more than 491 seconds.
> [ 2462.649371] INFO: task fsim:100605 blocked for more than 491 seconds.
> [ 2462.649449] INFO: task fsim:100606 blocked for more than 491 seconds.
> [ 2462.649526] INFO: task fsim:100607 blocked for more than 491 seconds.
> [ 2462.649606] INFO: task fsim:100608 blocked for more than 491 seconds.
> [ 2462.649676] INFO: task fsim:100609 blocked for more than 491 seconds.

So, I tested these fixes applied together on top of 6eebd5fb2083:
* locking/rwsem: Prevent non-first waiter from spinning in down_write() slowpath
* locking/rwsem: Disable preemption at all down_read*() and up_read() code paths
* locking/rwsem: Disable preemption at all down_write*() and up_write() code paths

After 20 hours of runtime, none of the fsim processes got stuck and no soft lockups occurred. AFAICT, it works.
Tested-by: Jiri Wiesner <jwiesner@suse.de>

[1] https://lore.kernel.org/lkml/20220617134325.GC30825@techsingularity.net/

-- 
Jiri Wiesner
SUSE Labs


* Re: [PATCH v6 2/6] locking/rwsem: Disable preemption at all down_read*() and up_read() code paths
  2022-11-18  2:20 ` [PATCH v6 2/6] locking/rwsem: Disable preemption at all down_read*() and up_read() code paths Waiman Long
@ 2022-12-16 15:03   ` Jiri Wiesner
  0 siblings, 0 replies; 21+ messages in thread
From: Jiri Wiesner @ 2022-12-16 15:03 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng,
	linux-kernel, john.p.donnelly, Hillf Danton, Mukesh Ojha,
	Ting11 Wang 王婷

On Thu, Nov 17, 2022 at 09:20:12PM -0500, Waiman Long wrote:
> Commit 91d2a812dfb9 ("locking/rwsem: Make handoff writer optimistically
> spin on owner") assumes that when the owner field is changed to NULL,
> the lock will become free soon. Commit 48dfb5d2560d ("locking/rwsem:
> Disable preemption while trying for rwsem lock") disables preemption
> when acquiring rwsem for write. However, preemption has not yet been
> disabled when acquiring a read lock on a rwsem. So a reader can add a
> RWSEM_READER_BIAS to count without setting owner to signal a reader,
> got preempted out by a RT task which then spins in the writer slowpath
> as owner remains NULL leading to live lock.
> 
> One easy way to fix this problem is to disable preemption at all the
> down_read*() and up_read() code paths as implemented in this patch.
> 
> Fixes: 91d2a812dfb9 ("locking/rwsem: Make handoff writer optimistically spin on owner")
> Reported-by: Mukesh Ojha <quic_mojha@quicinc.com>
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---

Tested-by: Jiri Wiesner <jwiesner@suse.de>

-- 
Jiri Wiesner
SUSE Labs


* Re: [PATCH v6 3/6] locking/rwsem: Disable preemption at all down_write*() and up_write() code paths
  2022-11-18  2:20 ` [PATCH v6 3/6] locking/rwsem: Disable preemption at all down_write*() and up_write() " Waiman Long
@ 2022-12-16 15:03   ` Jiri Wiesner
  0 siblings, 0 replies; 21+ messages in thread
From: Jiri Wiesner @ 2022-12-16 15:03 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng,
	linux-kernel, john.p.donnelly, Hillf Danton, Mukesh Ojha,
	Ting11 Wang 王婷

On Thu, Nov 17, 2022 at 09:20:13PM -0500, Waiman Long wrote:
> The previous patch has disabled preemption at all the down_read()
> and up_read() code paths. For symmetry, this patch extends commit
> 48dfb5d2560d ("locking/rwsem: Disable preemption while trying for rwsem
> lock") to have preemption disabled at all the down_write() and up_write()
> code path including downgrade_write().
> 
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---

Tested-by: Jiri Wiesner <jwiesner@suse.de>

-- 
Jiri Wiesner
SUSE Labs


* Re: [PATCH v6 0/6] locking/rwsem: Fix rwsem bugs & enable true lock handoff
  2022-11-18  2:20 [PATCH v6 0/6] locking/rwsem: Fix rwsem bugs & enable true lock handoff Waiman Long
                   ` (5 preceding siblings ...)
  2022-11-18  2:20 ` [PATCH v6 6/6] locking/rwsem: Update handoff lock events tracking Waiman Long
@ 2023-01-17 20:53 ` Waiman Long
  2023-01-22 13:46 ` Peter Zijlstra
  7 siblings, 0 replies; 21+ messages in thread
From: Waiman Long @ 2023-01-17 20:53 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng
  Cc: linux-kernel, john.p.donnelly, Hillf Danton, Mukesh Ojha,
	Ting11 Wang 王婷

On 11/17/22 21:20, Waiman Long wrote:
> v6:
>   - Fix an error in patch 2 reported by kernel test robot.
>
> v5:
>   - Drop patch 2 and replace it with 2 new patches disabling preemption on
>     all reader functions and writer functions respectively. The other
>     patches are adjusted accordingly.
>
> v4:
>   - Update patch descriptions in patches 1 & 2 to make clear the live
>     lock conditions that are being fixed by these patches. There is no code
>     change from v3.
>
> It turns out the current waiter optimistic spinning code does not work
> that well if we have RT tasks in the mix. This patch series include two
> different fixes to resolve those issues. The last 3 patches modify the
> handoff code to implement true lock handoff similar to that of mutex.
>
> Waiman Long (6):
>    locking/rwsem: Prevent non-first waiter from spinning in down_write()
>      slowpath
>    locking/rwsem: Disable preemption at all down_read*() and up_read()
>      code paths
>    locking/rwsem: Disable preemption at all down_write*() and up_write()
>      code paths
>    locking/rwsem: Change waiter->handoff_set to a handoff_state enum
>    locking/rwsem: Enable direct rwsem lock handoff
>    locking/rwsem: Update handoff lock events tracking
>
>   kernel/locking/lock_events_list.h |   6 +-
>   kernel/locking/rwsem.c            | 240 ++++++++++++++++++++++--------
>   2 files changed, 182 insertions(+), 64 deletions(-)
>
Peter,

I know that you are busy these days. Do you have any further comments or
suggestions on this patch series?

Looking forward to your reply.

Thanks,
Longman



* Re: [PATCH v6 1/6] locking/rwsem: Prevent non-first waiter from spinning in down_write() slowpath
  2022-11-18  2:20 ` [PATCH v6 1/6] locking/rwsem: Prevent non-first waiter from spinning in down_write() slowpath Waiman Long
  2022-12-16 15:02   ` Jiri Wiesner
@ 2023-01-20 22:58   ` Waiman Long
  1 sibling, 0 replies; 21+ messages in thread
From: Waiman Long @ 2023-01-20 22:58 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng
  Cc: linux-kernel, john.p.donnelly, Hillf Danton, Mukesh Ojha,
	Ting11 Wang 王婷,
	stable

On 11/17/22 21:20, Waiman Long wrote:
> A non-first waiter can potentially spin in the for loop of
> rwsem_down_write_slowpath() without sleeping but fail to acquire the
> lock even if the rwsem is free if the following sequence happens:
>
>    Non-first RT waiter    First waiter      Lock holder
>    -------------------    ------------      -----------
>    Acquire wait_lock
>    rwsem_try_write_lock():
>      Set handoff bit if RT or
>        wait too long
>      Set waiter->handoff_set
>    Release wait_lock
>                           Acquire wait_lock
>                           Inherit waiter->handoff_set
>                           Release wait_lock
> 					   Clear owner
>                                             Release lock
>    if (waiter.handoff_set) {
>      rwsem_spin_on_owner(();
>      if (OWNER_NULL)
>        goto trylock_again;
>    }
>    trylock_again:
>    Acquire wait_lock
>    rwsem_try_write_lock():
>       if (first->handoff_set && (waiter != first))
>       	return false;
>    Release wait_lock
>
> A non-first waiter cannot really acquire the rwsem even if it mistakenly
> believes that it can spin on OWNER_NULL value. If that waiter happens
> to be an RT task running on the same CPU as the first waiter, it can
> block the first waiter from acquiring the rwsem leading to live lock.
> Fix this problem by making sure that a non-first waiter cannot spin in
> the slowpath loop without sleeping.
>
> Fixes: d257cc8cb8d5 ("locking/rwsem: Make handoff bit handling more consistent")
> Reviewed-and-tested-by: Mukesh Ojha <quic_mojha@quicinc.com>
> Signed-off-by: Waiman Long <longman@redhat.com>
> Cc: stable@vger.kernel.org
> ---
>   kernel/locking/rwsem.c | 19 +++++++++----------
>   1 file changed, 9 insertions(+), 10 deletions(-)
>
> diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
> index 44873594de03..be2df9ea7c30 100644
> --- a/kernel/locking/rwsem.c
> +++ b/kernel/locking/rwsem.c
> @@ -624,18 +624,16 @@ static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
>   			 */
>   			if (first->handoff_set && (waiter != first))
>   				return false;
> -
> -			/*
> -			 * First waiter can inherit a previously set handoff
> -			 * bit and spin on rwsem if lock acquisition fails.
> -			 */
> -			if (waiter == first)
> -				waiter->handoff_set = true;
>   		}
>   
>   		new = count;
>   
>   		if (count & RWSEM_LOCK_MASK) {
> +			/*
> +			 * A waiter (first or not) can set the handoff bit
> +			 * if it is an RT task or wait in the wait queue
> +			 * for too long.
> +			 */
>   			if (has_handoff || (!rt_task(waiter->task) &&
>   					    !time_after(jiffies, waiter->timeout)))
>   				return false;
> @@ -651,11 +649,12 @@ static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
>   	} while (!atomic_long_try_cmpxchg_acquire(&sem->count, &count, new));
>   
>   	/*
> -	 * We have either acquired the lock with handoff bit cleared or
> -	 * set the handoff bit.
> +	 * We have either acquired the lock with handoff bit cleared or set
> +	 * the handoff bit. Only the first waiter can have its handoff_set
> +	 * set here to enable optimistic spinning in slowpath loop.
>   	 */
>   	if (new & RWSEM_FLAG_HANDOFF) {
> -		waiter->handoff_set = true;
> +		first->handoff_set = true;
>   		lockevent_inc(rwsem_wlock_handoff);
>   		return false;
>   	}

Peter,

I would really like to get this fix patch merged as soon as possible if
you don't see any problem with it. For the rest of the series, you can
take your time.

Cheers,
Longman



* Re: [PATCH v6 0/6] locking/rwsem: Fix rwsem bugs & enable true lock handoff
  2022-11-18  2:20 [PATCH v6 0/6] locking/rwsem: Fix rwsem bugs & enable true lock handoff Waiman Long
                   ` (6 preceding siblings ...)
  2023-01-17 20:53 ` [PATCH v6 0/6] locking/rwsem: Fix rwsem bugs & enable true lock handoff Waiman Long
@ 2023-01-22 13:46 ` Peter Zijlstra
  2023-01-23  3:40   ` Waiman Long
  7 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2023-01-22 13:46 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Will Deacon, Boqun Feng, linux-kernel,
	john.p.donnelly, Hillf Danton, Mukesh Ojha,
	Ting11 Wang 王婷

On Thu, Nov 17, 2022 at 09:20:10PM -0500, Waiman Long wrote:

> Waiman Long (6):
>   locking/rwsem: Prevent non-first waiter from spinning in down_write() slowpath
>   locking/rwsem: Disable preemption at all down_read*() and up_read() code paths
>   locking/rwsem: Disable preemption at all down_write*() and up_write() code paths
>   locking/rwsem: Change waiter->handoff_set to a handoff_state enum

After all these we still have down_read_non_owner() using
__rwsem_set_reader_owner() outside of the preempt_disable() region.

Now, let me go stare at this one:

>   locking/rwsem: Enable direct rwsem lock handoff




* Re: [PATCH v6 0/6] locking/rwsem: Fix rwsem bugs & enable true lock handoff
  2023-01-22 13:46 ` Peter Zijlstra
@ 2023-01-23  3:40   ` Waiman Long
  2023-01-23 21:10     ` Waiman Long
  0 siblings, 1 reply; 21+ messages in thread
From: Waiman Long @ 2023-01-23  3:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Will Deacon, Boqun Feng, linux-kernel,
	john.p.donnelly, Hillf Danton, Mukesh Ojha,
	Ting11 Wang 王婷


On 1/22/23 08:46, Peter Zijlstra wrote:
> On Thu, Nov 17, 2022 at 09:20:10PM -0500, Waiman Long wrote:
>
>> Waiman Long (6):
>>    locking/rwsem: Prevent non-first waiter from spinning in down_write() slowpath
>>    locking/rwsem: Disable preemption at all down_read*() and up_read() code paths
>>    locking/rwsem: Disable preemption at all down_write*() and up_write() code paths
>>    locking/rwsem: Change waiter->handoff_set to a handoff_state enum
> After all these we still have down_read_non_owner() using
> __rwsem_set_reader_owner() outside of the preempt_disable() region.
>
> Now, let me go stare at this one:

Thanks for spotting that. I kind of overlooked the fact that we have a
down_read_non_owner(). Will update that as well.

Cheers,
Longman

>>    locking/rwsem: Enable direct rwsem lock handoff
>



* Re: [PATCH v6 5/6] locking/rwsem: Enable direct rwsem lock handoff
  2022-11-18  2:20 ` [PATCH v6 5/6] locking/rwsem: Enable direct rwsem lock handoff Waiman Long
@ 2023-01-23 14:59   ` Peter Zijlstra
  2023-01-23 17:30     ` Waiman Long
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2023-01-23 14:59 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Will Deacon, Boqun Feng, linux-kernel,
	john.p.donnelly, Hillf Danton, Mukesh Ojha,
	Ting11 Wang 王婷

On Thu, Nov 17, 2022 at 09:20:15PM -0500, Waiman Long wrote:
> The lock handoff provided in rwsem isn't a true handoff like that in
> the mutex. Instead, it is more like a quiescent state where optimistic
> spinning and lock stealing are disabled to make it easier for the first
> waiter to acquire the lock.
> 
> Reworking the code to enable a true lock handoff is more complex due to
> the following facts:
>  1) The RWSEM_FLAG_HANDOFF bit is protected by the wait_lock and it
>     is too expensive to always take the wait_lock in the unlock path
>     to prevent racing.

Specifically, the issue is that we'd need to turn the
atomic_long_add_return_release() into an atomic_try_cmpxchg_release()
loop, like:

	tmp = atomic_long_read(&sem->count);
	do {
		if (tmp & (WAITERS|HANDOFF))
			return slow_unlock();
	} while (atomic_long_try_cmpxchg_release(&sem->count, &tmp,
						 tmp - RWSEM_{READER_BIAS,WRITE_LOCKED}));

in order to not race with a concurrent setting of the HANDOFF bit,
right? And we don't like to turn unlock into a cmpxchg loop.

(OTOH we sorta do this for mutex, uncontended mutex has cmpxchg lock and
unlock, any fail and we go to the slow path -- I suppose the distinct
difference is that we sorta expect some contention on the read side)
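
Roughly, from memory, the mutex fast paths are a single try_cmpxchg each,
something like:

	/* lock fast path: owner 0 -> current */
	atomic_long_try_cmpxchg_acquire(&lock->owner, &zero, (unsigned long)current);

	/* unlock fast path: owner current -> 0; any flag bits stashed in the
	 * owner word make the cmpxchg fail and force the slow path */
	atomic_long_try_cmpxchg_release(&lock->owner, &curr, 0UL);

so any contention already punts both of them into the slow path.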

>  2) The reader lock fast path may add a RWSEM_READER_BIAS at the wrong
>     time to prevent a proper lock handoff from a reader owned rwsem.

This would be much the same, right? We'd have to turn
rwsem_read_trylock() into a cmpxchg-loop and we don't like that.
Therefore we do that speculative add and fix up later.

Now, I'm not entirely sure what issue you allude to here; is the problem
that you can't quite tell when the last reader is gone?

> A lock handoff can only be initiated when the following conditions are
> true:
>  1) The RWSEM_FLAG_HANDOFF bit is set.

d'uh ;-)

>  2) The task to do the handoff don't see any other active lock
>     excluding the lock that it might have held.

2) here is the 2) above, right?

> The new handoff mechanism performs handoff in rwsem_wakeup() to minimize
> overhead. The rwsem count will be known at that point to determine if
> handoff should be done. However, there is a small time gap between the
> rwsem becomes free and the wait_lock is taken

Right, that's between atomic_long_fetch_add_release() and calling the
slow path because WAITERS bit is set.
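
That is, after patch 3 __up_write() does:

	tmp = atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED, &sem->count);
	if (unlikely(tmp & RWSEM_FLAG_WAITERS))
		rwsem_wake(sem);

and anything that sneaks in between the release and rwsem_wake() taking the
wait_lock falls into that gap.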

> where a reader can come in and add a RWSEM_READER_BIAS to the count or

Both 2s above.

> the current first waiter can take the rwsem and clear
> RWSEM_FLAG_HANDOFF in the interim.

The actual intended action.

> That will fail the handoff operation.

I would not list that latter as a failure, it's exactly what we want to
happen, no?

> To handle the former case, a secondary handoff will also be done in
> the rwsem_down_read_slowpath() to catch it.

Right. In short:

Having HANDOVER set:
 - implies WAITERS set
 - disables all fastpaths (spinning or otherwise)
 - dis-allows anybody except first waiter to obtain lock

Therefore, having the window between clearing owner and prodding first
waiter is 'harmless'.
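
For writers the "disables all fastpaths" part comes for free: the write
fastpath only succeeds on a completely unlocked count, so any set bit
(readers, WAITERS and hence HANDOFF) pushes writers into the slowpath --
roughly the current rwsem_write_trylock():

	static inline bool rwsem_write_trylock(struct rw_semaphore *sem)
	{
		long tmp = RWSEM_UNLOCKED_VALUE;

		/* any set bit in count fails the cmpxchg from 0 */
		if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
						    RWSEM_WRITER_LOCKED)) {
			rwsem_set_owner(sem);
			return true;
		}
		return false;
	}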

> With true lock handoff, there is no need to do a NULL owner spinning
> anymore as wakeup will be performed if handoff is possible. So it
> is likely that the first waiter won't actually go to sleep even when
> schedule() is called in this case.

Right, removing that NULL spinning was the whole purpose -- except I
seem to have forgotten why it was a problem :-)

OK, lemme go read the actual patch.

Hmmm... you made it a wee bit more complicated, instead of my 3rd clause
above, you added a whole intermediate GRANTED state. Why?

Since we fundamentally must deal with the release->wait_lock hole, why
do we need to do the whole rwsem_wake()->GRANTED->*_slowpath() dance?
Why can't we skip the whole rwsem_wake()->GRANTED part and only do
handoff in the slowpath?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v6 5/6] locking/rwsem: Enable direct rwsem lock handoff
  2023-01-23 14:59   ` Peter Zijlstra
@ 2023-01-23 17:30     ` Waiman Long
  2023-01-23 22:07       ` Waiman Long
  2023-01-24 12:29       ` Peter Zijlstra
  0 siblings, 2 replies; 21+ messages in thread
From: Waiman Long @ 2023-01-23 17:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Will Deacon, Boqun Feng, linux-kernel,
	john.p.donnelly, Hillf Danton, Mukesh Ojha,
	Ting11 Wang 王婷

On 1/23/23 09:59, Peter Zijlstra wrote:
> On Thu, Nov 17, 2022 at 09:20:15PM -0500, Waiman Long wrote:
>> The lock handoff provided in rwsem isn't a true handoff like that in
>> the mutex. Instead, it is more like a quiescent state where optimistic
>> spinning and lock stealing are disabled to make it easier for the first
>> waiter to acquire the lock.
>>
>> Reworking the code to enable a true lock handoff is more complex due to
>> the following facts:
>>   1) The RWSEM_FLAG_HANDOFF bit is protected by the wait_lock and it
>>      is too expensive to always take the wait_lock in the unlock path
>>      to prevent racing.
> Specifically, the issue is that we'd need to turn the
> atomic_long_add_return_release() into an atomic_try_cmpxchg_release()
> loop, like:
>
> 	tmp = atomic_long_read(&sem->count);
> 	do {
> 		if (tmp & (WAITERS|HANDOFF))
> 			return slow_unlock();
> 	} while (atomic_long_try_cmpxchg_release(&sem->count, &tmp,
> 						 tmp - RWSEM_{READER_BIAS,WRITE_LOCKED}));
>
> in order to not race with a concurrent setting of the HANDOFF bit,
> right? And we don't like to turn unlock into a cmpxchg loop.
>
> (OTOH we sorta do this for mutex, uncontended mutex has cmpxchg lock and
> unlock, any fail and we go to the slow path -- I suppose the distinct
> difference is that we sorta expect some contention on the read side)

I see that your inclination is to do the handoff right at unlock time.
It is certainly possible to do that, but it will be more complex in the
case of rwsem than mutex.

First of all, for mutex the owner is the lock word. You do one
successful cmpxchg and it is all done. rwsem isn't like that. Its owner
field is separate from the actual lock word (count), so setting the
right lock value and owner cannot be done atomically. That is why I am
using the wait_lock to synchronize between the unlocker and the first
waiter that set handoff. Since rwsem_wake() has to take the wait_lock
anyway, I decided to do the handoff at that time.

Another consideration is to minimize the unlock overhead. Doing handoff
at unlock time requires an extra read of the rwsem count.
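
Roughly, the shape I have in mind is the following -- just a sketch, not
the actual patch, with rwsem_handoff() as a placeholder name for the
helper that transfers ownership to the first waiter:

	static struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
	{
		unsigned long flags;
		long count;
		DEFINE_WAKE_Q(wake_q);

		raw_spin_lock_irqsave(&sem->wait_lock, flags);

		/*
		 * Re-read the count under wait_lock. A handoff is only
		 * possible if the HANDOFF bit is still set and nobody
		 * (reader or writer) currently holds the lock.
		 */
		count = atomic_long_read(&sem->count);
		if ((count & RWSEM_FLAG_HANDOFF) && !(count & RWSEM_LOCK_MASK))
			rwsem_handoff(sem, rwsem_first_waiter(sem), &wake_q);
		else if (!list_empty(&sem->wait_list))
			rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);

		raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
		wake_up_q(&wake_q);

		return sem;
	}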


>   2) The reader lock fast path may add a RWSEM_READER_BIAS at the wrong
>      time to prevent a proper lock handoff from a reader owned rwsem.
> This would be much the same, right? We'd have to turn
> rwsem_read_trylock() into a cmpxchg-loop and we don't like that.
> Therefore we do that speculative add and fix up later.
>
> Now, I'm not entirely sure what issue you allude to here; is the problem
> that you can't quite tell when the last reader is gone?
>> A lock handoff can only be initiated when the following conditions are
>> true:
>>   1) The RWSEM_FLAG_HANDOFF bit is set.
> d'uh ;-)
>
>>   2) The task to do the handoff doesn't see any other active lock
>>      excluding the lock that it might have held.
> 2) here is the 2) above, right?
>
>> The new handoff mechanism performs handoff in rwsem_wake() to minimize
>> overhead. The rwsem count will be known at that point to determine if
>> handoff should be done. However, there is a small time gap between the
>> rwsem becoming free and the wait_lock being taken
> Right, that's between atomic_long_fetch_add_release() and calling the
> slow path because WAITERS bit is set.
Yes, there is no extra overhead unless the waiter bit is set.
>
>> where a reader can come in and add a RWSEM_READER_BIAS to the count or
> Both 2s above.
>
>> the current first waiter can take the rwsem and clear
>> RWSEM_FLAG_HANDOFF in the interim.
> The actual intended action.
>
>> That will fail the handoff operation.
> I would not list that latter as a failure, it's exactly what we want to
> happen, no?

Yes, that is the intended action.

I added this because even if the HANDOFF bit is observed to be set at
unlock time, that does not guarantee a handoff can be successfully done,
since the first waiter may have been interrupted or killed in the
interim. The HANDOFF bit has to be confirmed to still be set while
holding the wait_lock to be sure that we can do a handoff.

>> To handle the former case, a secondary handoff will also be done in
>> the rwsem_down_read_slowpath() to catch it.
> Right. In short:
>
> Having HANDOVER set:
>   - implies WAITERS set
>   - disables all fastpaths (spinning or otherwise)
>   - dis-allows anybody except first waiter to obtain lock
>
> Therefore, having the window between clearing owner and prodding first
> waiter is 'harmless'.

As said above, we need to confirm that the HANDOFF bit is set with the
wait_lock held. The HANDOFF bit may or may not be set at unlock time.

We could pass the count value fetched at unlock time down to rwsem_wake()
to confirm that the HANDOFF bit was set at unlock time. However, it is
also possible that the original waiter that set HANDOFF has bailed out,
a reader then acquires the lock, and another waiter sets HANDOFF before
the unlocker acquires the wait_lock. The rwsem is really reader-owned in
that case, so we can't perform the handoff. That is why I also check
whether there is an active lock (mostly a read lock) at rwsem_wake().
However, that check can give a false negative because an incoming reader
may have just added a READER_BIAS that is about to be removed. That is
the reason why I have a secondary handoff check in the reader slowpath.
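
The reader-side check is roughly the following, sitting right after the
slowpath takes the wait_lock -- again only a sketch, with rwsem_handoff()
the same placeholder as before:

	long count;
	DEFINE_WAKE_Q(wake_q);

	raw_spin_lock_irq(&sem->wait_lock);
	count = atomic_long_read(&sem->count);
	if (count & RWSEM_FLAG_HANDOFF) {
		/*
		 * Back out the speculative RWSEM_READER_BIAS added on entry.
		 * If that leaves no lock holder while HANDOFF is still set,
		 * the rwsem_wake() handoff raced with this reader, so finish
		 * it here on behalf of the first waiter.
		 */
		count = atomic_long_add_return(-RWSEM_READER_BIAS, &sem->count);
		if (!(count & RWSEM_LOCK_MASK))
			rwsem_handoff(sem, rwsem_first_waiter(sem), &wake_q);
		/* then fall through and queue as an ordinary reader waiter */
	}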

>
>> With true lock handoff, there is no need to do a NULL owner spinning
>> anymore as wakeup will be performed if handoff is possible. So it
>> is likely that the first waiter won't actually go to sleep even when
>> schedule() is called in this case.
> Right, removing that NULL spinning was the whole purpose -- except I
> seem to have forgotten why it was a problem :-)
>
> OK, lemme go read the actual patch.
>
> Hmmm... you made it a wee bit more complicated, instead of my 3rd clause
> above, you added a whole intermediate GRANTED state. Why?
>
> Since we fundamentally must deal with the release->wait_lock hole, why
> do we need to do the whole rwsem_wake()->GRANTED->*_slowpath() dance?
> Why can't we skip the whole rwsem_wake()->GRANTED part and only do
> handoff in the slowpath?

First of all, the owner value of a reader-owned rwsem is mostly advisory,
as it holds only one of the possible owners. So it may be a bit risky to
use it as an indication that a handoff has happened, since it may be
wrong in some rare cases. That is why I use the repurposed handoff_state
value in the waiter structure. Reading this value is also less costly
than reading the rwsem cacheline, which can be heavily contended.
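
So the wait loop in the slowpath ends up polling the waiter's own
handoff_state instead of sem->count. A sketch of the writer side, with
HANDOFF_GRANTED being the new enum value added by this series:

	for (;;) {
		if (READ_ONCE(waiter.handoff_state) == HANDOFF_GRANTED) {
			/* ownership was transferred by the waker */
			break;
		}
		if (signal_pending_state(state, current))
			goto out_nolock;
		schedule_preempt_disabled();
		lockevent_inc(rwsem_sleep_writer);
		set_current_state(state);
	}
	__set_current_state(TASK_RUNNING);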

I will update the patch description to highlight the points that I 
discussed in this email.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v6 0/6] lockinig/rwsem: Fix rwsem bugs & enable true lock handoff
  2023-01-23  3:40   ` Waiman Long
@ 2023-01-23 21:10     ` Waiman Long
  0 siblings, 0 replies; 21+ messages in thread
From: Waiman Long @ 2023-01-23 21:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Will Deacon, Boqun Feng, linux-kernel,
	john.p.donnelly, Hillf Danton, Mukesh Ojha,
	Ting11 Wang 王婷

On 1/22/23 22:40, Waiman Long wrote:
>
> On 1/22/23 08:46, Peter Zijlstra wrote:
>> On Thu, Nov 17, 2022 at 09:20:10PM -0500, Waiman Long wrote:
>>
>>> Waiman Long (6):
>>>    locking/rwsem: Prevent non-first waiter from spinning in 
>>> down_write() slowpath
>>>    locking/rwsem: Disable preemption at all down_read*() and 
>>> up_read() code paths
>>>    locking/rwsem: Disable preemption at all down_write*() and 
>>> up_write() code paths
>>>    locking/rwsem: Change waiter->hanodff_set to a handoff_state enum
>> After all these we still have down_read_non_owner() using
>> __rwsem_set_reader_owner() outside of the preempt_disable() region.
>>
>> Now, let me go stare at this one:
>
> Thanks for spotting that. I kind of overlooked the fact that we have a
> down_read_non_owner(). Will update that as well.

After looking further into this, we would need to either pass an extra
parameter to __down_read() for this special case or do
preempt_disable()/preempt_enable() in each of the higher-level
down_read*() functions. As down_read_non_owner() is rarely called, I
doubt it is worth the extra effort, since the owner value (other than
the RWSEM_READER_OWNED bit) is for debugging purposes only and is not
critical to the correct functioning of an rwsem. I will add a comment to
note that.
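
Something like this, with the comment carrying the explanation (a sketch
of the comment-only option, not the final patch):

	void down_read_non_owner(struct rw_semaphore *sem)
	{
		might_sleep();
		__down_read(sem);
		/*
		 * The owner value (other than the RWSEM_READER_OWNED bit) is
		 * for debugging purposes only and is not critical to rwsem
		 * correctness, so it is acceptable to set it outside of the
		 * preempt-disabled region inside __down_read() for this
		 * rarely used API.
		 */
		__rwsem_set_reader_owned(sem, NULL);
	}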

Cheers,
Longman




^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v6 5/6] locking/rwsem: Enable direct rwsem lock handoff
  2023-01-23 17:30     ` Waiman Long
@ 2023-01-23 22:07       ` Waiman Long
  2023-01-24 12:58         ` Peter Zijlstra
  2023-01-24 12:29       ` Peter Zijlstra
  1 sibling, 1 reply; 21+ messages in thread
From: Waiman Long @ 2023-01-23 22:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Will Deacon, Boqun Feng, linux-kernel,
	john.p.donnelly, Hillf Danton, Mukesh Ojha,
	Ting11 Wang 王婷

On 1/23/23 12:30, Waiman Long wrote:
> I will update the patch description to highlight the points that I 
> discussed in this email. 

I am planning to update the patch description as follows:

     The lock handoff provided in rwsem isn't a true handoff like that in
     the mutex. Instead, it is more like a quiescent state where optimistic
     spinning and lock stealing are disabled to make it easier for the first
     waiter to acquire the lock.

     For mutex, lock handoff is done at unlock time as the owner value and
     the handoff bit are in the same lock word and can be updated atomically.

     That is not the case for rwsem, which has a separate count value for
     locking and an owner value. The only way to update them in a
     quasi-atomic way is to use the wait_lock for synchronization, as the
     handoff bit can only be updated while holding the wait_lock. So for
     rwsem, the new lock handoff mechanism is done mostly at rwsem_wake()
     time, when the wait_lock has to be acquired anyway, to minimize
     additional overhead.

     Passing the count value at unlock time down to rwsem_wake() to
     determine if handoff should be done is not safe, as the waiter that
     set the RWSEM_FLAG_HANDOFF bit may have been interrupted or killed
     in the interim. So we need to recheck the count value after taking
     the wait_lock. If there is an active lock, we can't perform the
     handoff even if the handoff bit is set at both the unlock and
     rwsem_wake() times. This is because there is a slight possibility
     that the original waiter that set the handoff bit has bailed out,
     followed by a reader acquiring the lock and another waiter setting
     the handoff bit.

     It is also likely that the active lock in this case is a transient
     RWSEM_READER_BIAS that will be removed soon. So a secondary handoff
     is done in the reader slowpath to handle this particular case.

     For a reader-owned rwsem, the owner value other than the
     RWSEM_READER_OWNED bit is mostly for debugging purposes only. So it
     is not safe to use the owner value to confirm that a handoff to a
     reader has happened. On the other hand, we can do that when handing
     off to a writer. However, it is simpler to use the same mechanism to
     signal that a handoff has happened for both readers and writers. So
     a new HANDOFF_GRANTED state is added to enum rwsem_handoff_state to
     signify that. This new value will be written to the handoff_state
     field of the first waiter.

     With true lock handoff, there is no need to spin on a NULL owner
     anymore, as a wakeup will be performed if the handoff is successful.
     So it is likely that the first waiter won't actually go to sleep even
     when schedule() is called in this case.

Please let me know what you think.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v6 5/6] locking/rwsem: Enable direct rwsem lock handoff
  2023-01-23 17:30     ` Waiman Long
  2023-01-23 22:07       ` Waiman Long
@ 2023-01-24 12:29       ` Peter Zijlstra
  2023-01-25  1:53         ` Waiman Long
  1 sibling, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2023-01-24 12:29 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Will Deacon, Boqun Feng, linux-kernel,
	john.p.donnelly, Hillf Danton, Mukesh Ojha,
	Ting11 Wang 王婷

On Mon, Jan 23, 2023 at 12:30:59PM -0500, Waiman Long wrote:
> On 1/23/23 09:59, Peter Zijlstra wrote:
> > On Thu, Nov 17, 2022 at 09:20:15PM -0500, Waiman Long wrote:
> > > The lock handoff provided in rwsem isn't a true handoff like that in
> > > the mutex. Instead, it is more like a quiescent state where optimistic
> > > spinning and lock stealing are disabled to make it easier for the first
> > > waiter to acquire the lock.
> > > 
> > > Reworking the code to enable a true lock handoff is more complex due to
> > > the following facts:
> > >   1) The RWSEM_FLAG_HANDOFF bit is protected by the wait_lock and it
> > >      is too expensive to always take the wait_lock in the unlock path
> > >      to prevent racing.
> > Specifically, the issue is that we'd need to turn the
> > atomic_long_add_return_release() into an atomic_try_cmpxchg_release()
> > loop, like:
> > 
> > 	tmp = atomic_long_read(&sem->count);
> > 	do {
> > 		if (tmp & (WAITERS|HANDOFF))
> > 			return slow_unlock();
> > 	} while (atomic_long_try_cmpxchg_release(&sem->count, &tmp,
> > 						 tmp - RWSEM_{READER_BIAS,WRITE_LOCKED}));
> > 
> > in order to not race with a concurrent setting of the HANDOFF bit,
> > right? And we don't like to turn unlock into a cmpxchg loop.
> > 
> > (OTOH we sorta do this for mutex, uncontended mutex has cmpxchg lock and
> > unlock, any fail and we go to the slow path -- I suppose the distinct
> > difference is that we sorta expect some contention on the read side)
> 
> I see that your inclination is to do the handoff right at the unlock time.
> It is certainly possible to do that, but it will be more complex in the case
> of rwsem than mutex.

Still, it would make things ever so much simpler -- but I agree we'll
probably not get away with it on the performance side of things.

> > Right. In short:
> > 
> > Having HANDOVER set:
> >   - implies WAITERS set
> >   - disables all fastpaths (spinning or otherwise)
> >   - dis-allows anybody except first waiter to obtain lock
> > 
> > Therefore, having the window between clearing owner and prodding first
> > waiter is 'harmless'.
> 
> As said above, we need to confirm that the HANDOFF bit is set with the
> wait_lock held. The HANDOFF bit may or may not be set at unlock time.
>
> We could pass the count value fetched at unlock time down to rwsem_wake()
> to confirm that the HANDOFF bit was set at unlock time. However, it is
> also possible that the original waiter that set HANDOFF has bailed out,
> a reader then acquires the lock, and another waiter sets HANDOFF before
> the unlocker acquires the wait_lock. The rwsem is really reader-owned in
> that case, so we can't perform the handoff. That is why I also check
> whether there is an active lock (mostly a read lock) at rwsem_wake().
> However, that check can give a false negative because an incoming reader
> may have just added a READER_BIAS that is about to be removed. That is
> the reason why I have a secondary handoff check in the reader slowpath.
> 
> > 
> > > With true lock handoff, there is no need to do a NULL owner spinning
> > > anymore as wakeup will be performed if handoff is possible. So it
> > > is likely that the first waiter won't actually go to sleep even when
> > > schedule() is called in this case.
> > Right, removing that NULL spinning was the whole purpose -- except I
> > seem to have forgotten why it was a problem :-)
> > 
> > OK, lemme go read the actual patch.
> > 
> > Hmmm... you made it a wee bit more complicated, instead of my 3rd clause
> > above, you added a whole intermediate GRANTED state. Why?
> > 
> > Since we fundamentally must deal with the release->wait_lock hole, why
> > do we need to do the whole rwsem_wake()->GRANTED->*_slowpath() dance?
> > Why can't we skip the whole rwsem_wake()->GRANTED part and only do
> > handoff in the slowpath?
> 
> First of all, the owner value of a reader-owned rwsem is mostly advisory,
> as it holds only one of the possible owners. So it may be a bit risky to
> use it as an indication that a handoff has happened, since it may be
> wrong in some rare cases. That is why I use the repurposed handoff_state
> value in the waiter structure. Reading this value is also less costly
> than reading the rwsem cacheline, which can be heavily contended.
> 
> I will update the patch description to highlight the points that I discussed
> in this email.

Maybe I'm being dense, but I'm not seeing it. If we make HANDOFF block
all the fastpaths, all the spinning, all the stealing, everything; then
all that is left is the slowpath that is holding wait_lock.

Then in both slowpaths, ensure only the first waiter can go on and we're
done.

What am I missing? Why does it need to be so complicated?

That is, afaict something like the below would actually work, no? Yes,
simply deleting that spinning in write_slowpath isn't ideal, but I
suspect we can do something to rwsem_try_write_lock() to make up for
that if we think about it.

Again, please explain, explicitly and in small steps, why you think you
need all that complexity.

--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -40,7 +40,7 @@
  *
  * When the rwsem is reader-owned and a spinning writer has timed out,
  * the nonspinnable bit will be set to disable optimistic spinning.
-
+ *
  * When a writer acquires a rwsem, it puts its task_struct pointer
  * into the owner field. It is cleared after an unlock.
  *
@@ -430,6 +430,10 @@ static void rwsem_mark_wake(struct rw_se
 			 * Mark writer at the front of the queue for wakeup.
 			 * Until the task is actually later awoken later by
 			 * the caller, other writers are able to steal it.
+			 *
+			 * *Unless* HANDOFF is set, in which case only the
+			 * first waiter is allowed to take it.
+			 *
 			 * Readers, on the other hand, will block as they
 			 * will notice the queued writer.
 			 */
@@ -463,6 +467,9 @@ static void rwsem_mark_wake(struct rw_se
 			 * force the issue.
 			 */
 			if (time_after(jiffies, waiter->timeout)) {
+				/*
+				 * Setting HANDOFF blocks fastpaths and stealing.
+				 */
 				if (!(oldcount & RWSEM_FLAG_HANDOFF)) {
 					adjustment -= RWSEM_FLAG_HANDOFF;
 					lockevent_inc(rwsem_rlock_handoff);
@@ -471,6 +478,13 @@ static void rwsem_mark_wake(struct rw_se
 			}
 
 			atomic_long_add(-adjustment, &sem->count);
+
+			if (waiter->handoff_set) {
+				/*
+				 * With HANDOFF set we must terminate all spinning.
+				 */
+				rwsem_set_nonspinnable(sem);
+			}
 			return;
 		}
 		/*
@@ -844,7 +858,6 @@ static bool rwsem_optimistic_spin(struct
 		 * Try to acquire the lock
 		 */
 		taken = rwsem_try_write_lock_unqueued(sem);
-
 		if (taken)
 			break;
 
@@ -1159,22 +1172,6 @@ rwsem_down_write_slowpath(struct rw_sema
 		if (signal_pending_state(state, current))
 			goto out_nolock;
 
-		/*
-		 * After setting the handoff bit and failing to acquire
-		 * the lock, attempt to spin on owner to accelerate lock
-		 * transfer. If the previous owner is a on-cpu writer and it
-		 * has just released the lock, OWNER_NULL will be returned.
-		 * In this case, we attempt to acquire the lock again
-		 * without sleeping.
-		 */
-		if (waiter.handoff_set) {
-			enum owner_state owner_state;
-
-			owner_state = rwsem_spin_on_owner(sem);
-			if (owner_state == OWNER_NULL)
-				goto trylock_again;
-		}
-
 		schedule_preempt_disabled();
 		lockevent_inc(rwsem_sleep_writer);
 		set_current_state(state);

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v6 5/6] locking/rwsem: Enable direct rwsem lock handoff
  2023-01-23 22:07       ` Waiman Long
@ 2023-01-24 12:58         ` Peter Zijlstra
  0 siblings, 0 replies; 21+ messages in thread
From: Peter Zijlstra @ 2023-01-24 12:58 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Will Deacon, Boqun Feng, linux-kernel,
	john.p.donnelly, Hillf Danton, Mukesh Ojha,
	Ting11 Wang 王婷

On Mon, Jan 23, 2023 at 05:07:08PM -0500, Waiman Long wrote:
> On 1/23/23 12:30, Waiman Long wrote:
> > I will update the patch description to highlight the points that I
> > discussed in this email.
> 
> I am planning to update the patch description to as follows:
> 
>     The lock handoff provided in rwsem isn't a true handoff like that in
>     the mutex. Instead, it is more like a quiescent state where optimistic
>     spinning and lock stealing are disabled to make it easier for the first
>     waiter to acquire the lock.
> 
>     For mutex, lock handoff is done at unlock time as the owner value and
>     the handoff bit are in the same lock word and can be updated atomically.
> 
>     That is not the case for rwsem which has a separate count value for
>     locking and an owner value. The only way to update them in a quasi-atomic
>     way is to use the wait_lock for synchronization as the handoff bit can
>     only be updated while holding the wait_lock. So for rwsem, the new
>     lock handoff mechanism is done mostly at rwsem_wake() time when the
>     wait_lock has to be acquired anyway to minimize additional overhead.

So for first==reader, sure, and you don't need anything special, since
rwsem_mark_wake() already does the right thing.

But for first==writer, I don't follow; *WHY* do you have to complicate
this path so. The write_slowpath already takes wait_lock for
rwsem_try_write_lock() and that already knows about handoff.
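
For reference, the handoff-aware part of the current rwsem_try_write_lock(),
condensed (lockevent accounting and most comments trimmed), which is why that
wait_lock section can already carry the whole decision:

	static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
						struct rwsem_waiter *waiter)
	{
		struct rwsem_waiter *first = rwsem_first_waiter(sem);
		long count, new;

		lockdep_assert_held(&sem->wait_lock);

		count = atomic_long_read(&sem->count);
		do {
			bool has_handoff = !!(count & RWSEM_FLAG_HANDOFF);

			/* only the waiter that set HANDOFF may take the lock */
			if (has_handoff && first->handoff_set && waiter != first)
				return false;

			new = count;
			if (count & RWSEM_LOCK_MASK) {
				if (has_handoff || (!rt_task(waiter->task) &&
				    !time_after(jiffies, waiter->timeout)))
					return false;
				new |= RWSEM_FLAG_HANDOFF;
			} else {
				new |= RWSEM_WRITER_LOCKED;
				new &= ~RWSEM_FLAG_HANDOFF;
				if (list_is_singular(&sem->wait_list))
					new &= ~RWSEM_FLAG_WAITERS;
			}
		} while (!atomic_long_try_cmpxchg_acquire(&sem->count, &count, new));

		if (new & RWSEM_FLAG_HANDOFF) {
			/* we set the handoff bit, we did not get the lock */
			waiter->handoff_set = true;
			return false;
		}

		rwsem_del_waiter(sem, waiter);
		rwsem_set_owner(sem);
		return true;
	}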

>     It is also likely that the active lock in this case may be a transient
>     RWSEM_READER_BIAS that will be removed soon. So we have a secondary
>     handoff done at reader slow path to handle this particular case.

Only because you made it so damn complicated. If instead you rely on the
wait_lock in write_slowpath you can keep it all in one place AFAICT.

>     For reader-owned rwsem, the owner value other than the
> RWSEM_READER_OWNED
>     bit is mostly for debugging purpose only. So it is not safe to use
>     the owner value to confirm a handoff to a reader has happened. On the

What ?!? Where do we care about the owner value? There's
RWSEM_FLAG_HANDOFF which lives in sem->count and there's
waiter->handoff_set. Nowhere do we care about sem->owner in this.

>     other hand, we can do that when handing off to a writer. However, it
>     is simpler to use the same mechanism to notify a handoff has happened
>     for both readers and writers. So a new HANDOFF_GRANTED state is added

I really can't follow the logic jump here.

>     to enum rwsem_handoff_state to signify that. This new value will be
>     written to the handoff_state value of the first waiter.
> 
>     With true lock handoff, there is no need to do a NULL owner spinning
>     anymore as wakeup will be performed if handoff is successful. So it
>     is likely that the first waiter won't actually go to sleep even when
>     schedule() is called in this case.

So this spinning, this is purely for writer->writer handoff (which is
exceedingly rare since it is readers that set handoff), right?

Why is that so important?

Also, why can't we add something like

	owner = rwsem_owner_flags(sem, &flags);
	if (owner && !(flags & RWSEM_READER_OWNED))
		atomic_long_cond_read_relaxed(&sem->count, !(VAL & RWSEM_WRITER_LOCKED))

to the start of that? If it's really needed.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v6 5/6] locking/rwsem: Enable direct rwsem lock handoff
  2023-01-24 12:29       ` Peter Zijlstra
@ 2023-01-25  1:53         ` Waiman Long
  0 siblings, 0 replies; 21+ messages in thread
From: Waiman Long @ 2023-01-25  1:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Will Deacon, Boqun Feng, linux-kernel,
	john.p.donnelly, Hillf Danton, Mukesh Ojha,
	Ting11 Wang 王婷

On 1/24/23 07:29, Peter Zijlstra wrote:
> On Mon, Jan 23, 2023 at 12:30:59PM -0500, Waiman Long wrote:
>> On 1/23/23 09:59, Peter Zijlstra wrote:
>>> On Thu, Nov 17, 2022 at 09:20:15PM -0500, Waiman Long wrote:
>>>> The lock handoff provided in rwsem isn't a true handoff like that in
>>>> the mutex. Instead, it is more like a quiescent state where optimistic
>>>> spinning and lock stealing are disabled to make it easier for the first
>>>> waiter to acquire the lock.
>>>>
>>>> Reworking the code to enable a true lock handoff is more complex due to
>>>> the following facts:
>>>>    1) The RWSEM_FLAG_HANDOFF bit is protected by the wait_lock and it
>>>>       is too expensive to always take the wait_lock in the unlock path
>>>>       to prevent racing.
>>> Specifically, the issue is that we'd need to turn the
>>> atomic_long_add_return_release() into an atomic_try_cmpxchg_release()
>>> loop, like:
>>>
>>> 	tmp = atomic_long_read(&sem->count);
>>> 	do {
>>> 		if (tmp & (WAITERS|HANDOFF))
>>> 			return slow_unlock();
>>> 	} while (atomic_long_try_cmpxchg_release(&sem->count, &tmp,
>>> 						 tmp - RWSEM_{READER_BIAS,WRITE_LOCKED}));
>>>
>>> in order to not race with a concurrent setting of the HANDOFF bit,
>>> right? And we don't like to turn unlock into a cmpxchg loop.
>>>
>>> (OTOH we sorta do this for mutex, uncontended mutex has cmpxchg lock and
>>> unlock, any fail and we go to the slow path -- I suppose the distinct
>>> difference is that we sorta expect some contention on the read side)
>> I see that your inclination is to do the handoff right at the unlock time.
>> It is certainly possible to do that, but it will be more complex in the case
>> of rwsem than mutex.
> Still, it would make things ever so much simpler -- but I agree we'll
> probably not get away with it on the performance side of things.
>
>>> Right. In short:
>>>
>>> Having HANDOVER set:
>>>    - implies WAITERS set
>>>    - disables all fastpaths (spinning or otherwise)
>>>    - dis-allows anybody except first waiter to obtain lock
>>>
>>> Therefore, having the window between clearing owner and prodding first
>>> waiter is 'harmless'.
>> As said above, we need to confirm that the HANDOFF bit is set with the
>> wait_lock held. The HANDOFF bit may or may not be set at unlock time.
>>
>> We could pass the count value fetched at unlock time down to rwsem_wake()
>> to confirm that the HANDOFF bit was set at unlock time. However, it is
>> also possible that the original waiter that set HANDOFF has bailed out,
>> a reader then acquires the lock, and another waiter sets HANDOFF before
>> the unlocker acquires the wait_lock. The rwsem is really reader-owned in
>> that case, so we can't perform the handoff. That is why I also check
>> whether there is an active lock (mostly a read lock) at rwsem_wake().
>> However, that check can give a false negative because an incoming reader
>> may have just added a READER_BIAS that is about to be removed. That is
>> the reason why I have a secondary handoff check in the reader slowpath.
>>
>>>> With true lock handoff, there is no need to do a NULL owner spinning
>>>> anymore as wakeup will be performed if handoff is possible. So it
>>>> is likely that the first waiter won't actually go to sleep even when
>>>> schedule() is called in this case.
>>> Right, removing that NULL spinning was the whole purpose -- except I
>>> seem to have forgotten why it was a problem :-)
>>>
>>> OK, lemme go read the actual patch.
>>>
>>> Hmmm... you made it a wee bit more complicated, instead of my 3rd clause
>>> above, you added a whole intermediate GRANTED state. Why?
>>>
>>> Since we fundamentally must deal with the release->wait_lock hole, why
>>> do we need to do the whole rwsem_wake()->GRANTED->*_slowpath() dance?
>>> Why can't we skip the whole rwsem_wake()->GRANTED part and only do
>>> handoff in the slowpath?
>> First of all, the owner value of a reader-owned rwsem is mostly advisory,
>> as it holds only one of the possible owners. So it may be a bit risky to
>> use it as an indication that a handoff has happened, since it may be
>> wrong in some rare cases. That is why I use the repurposed handoff_state
>> value in the waiter structure. Reading this value is also less costly
>> than reading the rwsem cacheline, which can be heavily contended.
>>
>> I will update the patch description to highlight the points that I discussed
>> in this email.
> Maybe I'm being dense, but I'm not seeing it. If we make HANDOFF block
> all the fastpaths, all the spinning, all the stealing, everything; then
> all that is left is the slowpath that is holding wait_lock.
>
> Then in both slowpaths, ensure only the first waiter can go on and we're
> done.
>
> What am I missing? Why does it need to be so complicated?
>
> That is, afaict something like the below would actually work, no? Yes,
> simply deleting that spinning in write_slowpath isn't ideal, but I
> suspect we can do something to rwsem_try_write_lock() to make up for
> that if we think about it.
>
> Again, please explain, explicitly and in small steps, why you think you
> need all that complexity.
>
> --- a/kernel/locking/rwsem.c
> +++ b/kernel/locking/rwsem.c
> @@ -40,7 +40,7 @@
>    *
>    * When the rwsem is reader-owned and a spinning writer has timed out,
>    * the nonspinnable bit will be set to disable optimistic spinning.
> -
> + *
>    * When a writer acquires a rwsem, it puts its task_struct pointer
>    * into the owner field. It is cleared after an unlock.
>    *
> @@ -430,6 +430,10 @@ static void rwsem_mark_wake(struct rw_se
>   			 * Mark writer at the front of the queue for wakeup.
>   			 * Until the task is actually later awoken later by
>   			 * the caller, other writers are able to steal it.
> +			 *
> +			 * *Unless* HANDOFF is set, in which case only the
> +			 * first waiter is allowed to take it.
> +			 *
>   			 * Readers, on the other hand, will block as they
>   			 * will notice the queued writer.
>   			 */
> @@ -463,6 +467,9 @@ static void rwsem_mark_wake(struct rw_se
>   			 * force the issue.
>   			 */
>   			if (time_after(jiffies, waiter->timeout)) {
> +				/*
> +				 * Setting HANDOFF blocks fastpaths and stealing.
> +				 */
>   				if (!(oldcount & RWSEM_FLAG_HANDOFF)) {
>   					adjustment -= RWSEM_FLAG_HANDOFF;
>   					lockevent_inc(rwsem_rlock_handoff);
> @@ -471,6 +478,13 @@ static void rwsem_mark_wake(struct rw_se
>   			}
>   
>   			atomic_long_add(-adjustment, &sem->count);
> +
> +			if (waiter->handoff_set) {
> +				/*
> +				 * With HANDOFF set we must terminate all spinning.
> +				 */
> +				rwsem_set_nonspinnable(sem);
> +			}
>   			return;
>   		}
>   		/*
> @@ -844,7 +858,6 @@ static bool rwsem_optimistic_spin(struct
>   		 * Try to acquire the lock
>   		 */
>   		taken = rwsem_try_write_lock_unqueued(sem);
> -
>   		if (taken)
>   			break;
>   
> @@ -1159,22 +1172,6 @@ rwsem_down_write_slowpath(struct rw_sema
>   		if (signal_pending_state(state, current))
>   			goto out_nolock;
>   
> -		/*
> -		 * After setting the handoff bit and failing to acquire
> -		 * the lock, attempt to spin on owner to accelerate lock
> -		 * transfer. If the previous owner is a on-cpu writer and it
> -		 * has just released the lock, OWNER_NULL will be returned.
> -		 * In this case, we attempt to acquire the lock again
> -		 * without sleeping.
> -		 */
> -		if (waiter.handoff_set) {
> -			enum owner_state owner_state;
> -
> -			owner_state = rwsem_spin_on_owner(sem);
> -			if (owner_state == OWNER_NULL)
> -				goto trylock_again;
> -		}
> -
>   		schedule_preempt_disabled();
>   		lockevent_inc(rwsem_sleep_writer);
>   		set_current_state(state);
>
Thanks for the comments and suggested changes. I will adopt some of your 
suggestions to simplify the patchset. Will post a new version once I 
finish my testing.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2023-01-25  1:54 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-11-18  2:20 [PATCH v6 0/6] lockinig/rwsem: Fix rwsem bugs & enable true lock handoff Waiman Long
2022-11-18  2:20 ` [PATCH v6 1/6] locking/rwsem: Prevent non-first waiter from spinning in down_write() slowpath Waiman Long
2022-12-16 15:02   ` Jiri Wiesner
2023-01-20 22:58   ` Waiman Long
2022-11-18  2:20 ` [PATCH v6 2/6] locking/rwsem: Disable preemption at all down_read*() and up_read() code paths Waiman Long
2022-12-16 15:03   ` Jiri Wiesner
2022-11-18  2:20 ` [PATCH v6 3/6] locking/rwsem: Disable preemption at all down_write*() and up_write() " Waiman Long
2022-12-16 15:03   ` Jiri Wiesner
2022-11-18  2:20 ` [PATCH v6 4/6] locking/rwsem: Change waiter->hanodff_set to a handoff_state enum Waiman Long
2022-11-18  2:20 ` [PATCH v6 5/6] locking/rwsem: Enable direct rwsem lock handoff Waiman Long
2023-01-23 14:59   ` Peter Zijlstra
2023-01-23 17:30     ` Waiman Long
2023-01-23 22:07       ` Waiman Long
2023-01-24 12:58         ` Peter Zijlstra
2023-01-24 12:29       ` Peter Zijlstra
2023-01-25  1:53         ` Waiman Long
2022-11-18  2:20 ` [PATCH v6 6/6] locking/rwsem: Update handoff lock events tracking Waiman Long
2023-01-17 20:53 ` [PATCH v6 0/6] lockinig/rwsem: Fix rwsem bugs & enable true lock handoff Waiman Long
2023-01-22 13:46 ` Peter Zijlstra
2023-01-23  3:40   ` Waiman Long
2023-01-23 21:10     ` Waiman Long
