* [PATCH v8 00/19] locking/rwsem: Rwsem rearchitecture part 2
@ 2019-05-20 20:58 Waiman Long
  2019-05-20 20:59 ` [PATCH v8 01/19] locking/rwsem: Make owner available even if !CONFIG_RWSEM_SPIN_ON_OWNER Waiman Long
                   ` (18 more replies)
  0 siblings, 19 replies; 83+ messages in thread
From: Waiman Long @ 2019-05-20 20:58 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin
  Cc: linux-kernel, x86, Davidlohr Bueso, Linus Torvalds, Tim Chen,
	huang ying, Waiman Long

 v8:
  - Remove v7 patch 1 as it has been merged.
  - Add patch 14 to make rwsem->owner an atomic_long_t as suggested
    by PeterZ.
  - Incorporate other changes suggested by PeterZ, such as removing
    the *_is_spinnable() and owner_without_flags() helpers.

 v7:
  - Fix a bug in patch 1 and include changes suggested by Linus.
  - Refresh the other patches accordingly.

 v6:
  - Add a new patch 1 to fix an existing rwsem bug that allows both
    readers and a writer to become rwsem owners simultaneously.
  - Fix a missed wakeup bug (in patch 9) of the v5 series.
  - Fix a boot-up hang on some kernel configurations caused by the
    patch that merges owner into count.
  - Make the adaptive disabling of reader optimistic spinning more
    aggressive to further improve workloads that don't benefit from
    reader spinning.
  - Add another patch to disable preemption during down_read() when
    owner is merged into count to prevent the possibility of reader
    count overflow.
  - Don't allow merging of owner into count if NR_CPUS is greater
    than the max supported reader count.
  - Allow readers that drop out of the optimistic spinning loop to
    attempt a trylock if the rwsem is still in the right read phase.
  - Fix incorrect DEBUG_RWSEMS_WARN_ON() warning.

This is part 2 of a 3-part (0/1/2) series to rearchitect the internal
operation of rwsem. Both part 0 and part 1 have already been merged into tip.

This patchset revamps the current rwsem-xadd implementation to make
it saner and easier to work with. It also implements the following 3
new features:

 1) Waiter lock handoff
 2) Reader optimistic spinning with adaptive disabling
 3) Store write-lock owner in the atomic count (x86-64 only for now)

Waiter lock handoff is similar to the mechanism currently in the mutex
code. This ensures that lock starvation won't happen.

Reader optimistic spinning enables readers to acquire the lock more
quickly, so workloads that use a mix of readers and writers should
see an increase in performance as long as the reader critical sections
are short. For workloads with long reader critical sections, reader
optimistic spinning may hurt performance, so an adaptive disabling
mechanism is also implemented to turn it off when reader-owned lock
spinning timeouts happen.

Finally, storing the write-lock owner in the count allows optimistic
spinners to reach the lock holder's task structure more quickly and
eliminates the timing gap where the write lock has been acquired but
the owner isn't known yet. This is important for RT tasks, where
spinning on a lock with an unknown owner is not allowed.
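
As a rough sketch of the idea (simplified userspace C with made-up names
and layout, not the kernel's actual count format, which also has to carry
the reader count and flag bits): because task structures are word-aligned,
the owner pointer itself can sit in the atomic word, so a single atomic
read gives a spinner both the lock state and the task to spin on.

#include <stdatomic.h>
#include <stdint.h>

struct task;				/* stand-in for the kernel's task_struct */

#define WRITER_LOCKED	0x1UL
#define FLAGS_MASK	0x7UL		/* low bits free due to pointer alignment */

/* Store the owner pointer and the locked bit with one atomic write ... */
static void write_lock_set_owner(atomic_ulong *count, struct task *owner)
{
	atomic_store(count, (uintptr_t)owner | WRITER_LOCKED);
}

/* ... so a spinner recovers the owner with a single atomic read. */
static struct task *spin_target(atomic_ulong *count)
{
	unsigned long c = atomic_load(count);

	return (c & WRITER_LOCKED) ? (struct task *)(c & ~FLAGS_MASK) : NULL;
}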

Because multiple readers can share the same lock, there is a natural
preference for readers when measuring locking throughput, as more
readers are likely to get into the locking fast path than writers.
With waiter lock handoff, however, the writers will not be starved.

On a 2-socket 40-core 80-thread Skylake system with 40 reader and 40
writer locking threads, the min/mean/max locking operations done in a
5-second testing window before the patchset were:

  40 readers, Iterations Min/Mean/Max = 1,807/1,808/1,810
  40 writers, Iterations Min/Mean/Max = 1,807/50,344/151,255

After the patchset, they became:

  40 readers, Iterations Min/Mean/Max = 30,057/31,359/32,741
  40 writers, Iterations Min/Mean/Max = 94,466/95,845/97,098

It can be seen that the performance improves for both readers and writers.
It also makes rwsem fairer to readers as well as among different threads
within the reader and writer groups.

Patch 1 makes owner a permanent member of the rw_semaphore structure and
sets it irrespective of CONFIG_RWSEM_SPIN_ON_OWNER.

Patch 2 removes the rwsem_wake() wakeup optimization as it doesn't work
with lock handoff.

Patch 3 implements a new rwsem locking scheme similar to what qrwlock
is currently doing. Write locking is done by atomic_cmpxchg() while
read locking is still done by atomic_add().
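
As a rough illustration of that scheme (a simplified userspace sketch
using C11 atomics with assumed bit values, not the actual kernel code):

#include <stdatomic.h>
#include <stdbool.h>

#define WRITER_LOCKED	0x001UL		/* bit 0 */
#define READER_BIAS	0x100UL		/* reader count starts at bit 8 */

/* Write lock: one cmpxchg from "unlocked, no readers" to "writer locked". */
static bool write_trylock(atomic_ulong *count)
{
	unsigned long old = 0;

	return atomic_compare_exchange_strong(count, &old, WRITER_LOCKED);
}

/* Read lock: unconditional xadd; back out if a writer holds the lock. */
static bool read_trylock(atomic_ulong *count)
{
	unsigned long old = atomic_fetch_add(count, READER_BIAS);

	if (old & WRITER_LOCKED) {
		atomic_fetch_sub(count, READER_BIAS);
		return false;
	}
	return true;
}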

Patch 4 merges the content of rwsem.h and rwsem-xadd.c into rwsem.c,
just like the mutex code. The rwsem-xadd.c file is removed and a
bare-bones rwsem.h is left for the internal function declarations
needed by percpu-rwsem.c.

Patch 5 optimizes the merged rwsem.c file to generate a smaller object
file and performs other miscellaneous code cleanups.

Patch 6 makes rwsem_spin_on_owner() return the owner state.

Patch 7 implements lock handoff to prevent lock starvation. Throughput
is expected to be lower on workloads with highly contended rwsems in
exchange for better fairness.
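
A minimal sketch of the handoff idea (simplified, with invented bit
names; the real logic lives in rwsem_try_write_lock() in the attached
diff): once the first waiter has waited long enough, it sets a handoff
bit that makes other lockers back off, so only that waiter can take the
lock on the next release.

#include <stdatomic.h>
#include <stdbool.h>

#define LOCKED	0x1UL
#define HANDOFF	0x4UL

/* Optimistic lockers call this with first_waiter == false. */
static bool handoff_trylock(atomic_ulong *lock, bool first_waiter)
{
	unsigned long old = atomic_load(lock);

	do {
		if (old & LOCKED)
			return false;
		if ((old & HANDOFF) && !first_waiter)
			return false;	/* reserved for the first waiter */
	} while (!atomic_compare_exchange_weak(lock, &old,
					       (old | LOCKED) & ~HANDOFF));
	return true;
}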

Patch 8 makes sure that all wake_up_q() calls happen after dropping
the wait_lock.

Patch 9 makes an RT task's handling of a NULL owner more optimal.

Patch 10 makes reader wakeup wake up almost all the readers in the
wait queue instead of just those at the front.

Patch 11 renames the RWSEM_ANONYMOUSLY_OWNED bit to RWSEM_NONSPINNABLE.

Patch 12 enables readers to spin on a writer-owned rwsem.

Patch 13 makes rwsem->owner an atomic_long_t as it is now holding some
flags that may need to be atomically updated.

Patch 14 enables a writer to spin on a reader-owned rwsem for at most
25us and splits the RWSEM_NONSPINNABLE bit into 2 separate ones - one
for readers and one for writers.

Patch 15 implements the adaptive disabling of reader optimistic spinning
when the reader-owned rwsem spinning timeouts happen.
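
Conceptually the adaptive part works roughly as sketched below (a
simplified userspace version with invented names; the kernel tracks this
with per-rwsem nonspinnable bits and lock-event counts as shown in the
attached diff): when spinning on a reader-owned rwsem exceeds its time
budget, a nonspinnable flag is set so that subsequent lock attempts skip
reader optimistic spinning on that rwsem.

#include <stdatomic.h>
#include <stdbool.h>
#include <time.h>

#define READER_OWNED	0x1UL
#define RD_NONSPINNABLE	0x2UL

static unsigned long long now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

/* Return true if the caller should keep spinning on this rwsem. */
static bool keep_spinning(atomic_ulong *owner, unsigned long long deadline)
{
	unsigned long o = atomic_load(owner);

	if (o & RD_NONSPINNABLE)
		return false;
	if ((o & READER_OWNED) && now_ns() > deadline) {
		/* Timed out: tell future lockers not to bother spinning. */
		atomic_fetch_or(owner, RD_NONSPINNABLE);
		return false;
	}
	return true;
}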

Patch 16 handles the case of too many readers by reserving the sign
bit to designate that a reader lock attempt will fail and the locking
reader will be put to sleep. This ensures that the reader count will
not overflow.
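
The guard can be pictured like this (a simplified sketch, not the kernel
code; the sign bit of the count is treated as an overflow indicator):

#include <stdatomic.h>
#include <stdbool.h>

#define WRITER_LOCKED	0x001L
#define READER_BIAS	0x100L		/* reader count starts at bit 8 */

/*
 * Fast-path read lock: add the reader bias unconditionally. If the new
 * count is negative, the reader count has spilled into the sign bit and
 * the attempt fails; the reader then goes to the slow path and sleeps.
 */
static bool read_trylock_fast(atomic_long *count)
{
	long newcnt = atomic_fetch_add(count, READER_BIAS) + READER_BIAS;

	return newcnt > 0 && !(newcnt & WRITER_LOCKED);
}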

Patch 17 merges the write-lock owner task pointer into the count.
Only a 64-bit count has enough space to provide a reasonable number of
bits for the reader count. This is x86-64 only for the time being.

Patch 18 eliminates redundant computation of the merged owner-count.

Patch 19 disables preemption during the down_read() call to eliminate
the remote possibility that the reader count may overflow.

With a locking microbenchmark running on a 5.1-based kernel, the total
locking rates (in kops/s) on a 2-socket Skylake system with equal numbers
of readers and writers (mixed) before and after this patchset were:

   # of Threads   Before Patch      After Patch
   ------------   ------------      -----------
        2            2,618             4,193
        4            1,202             3,726
        8              802             3,622
       16              729             3,359
       32              319             2,826
       64              102             2,744

On workloads where the rwsem reader critical section is relatively long
(longer than the spinning period), optimistic spinning of a writer on a
reader-owned rwsem may not be that helpful. In fact, the performance may
regress in some cases, like the will-it-scale page_fault1 microbenchmark.
This is likely because larger reader groups, where the readers acquire
the lock together, are broken into smaller ones. So more work will be
needed to better tune the rwsem code for that kind of workload.

The v7-to-v8 diff is attached at the end.

Waiman Long (19):
  locking/rwsem: Make owner available even if
    !CONFIG_RWSEM_SPIN_ON_OWNER
  locking/rwsem: Remove rwsem_wake() wakeup optimization
  locking/rwsem: Implement a new locking scheme
  locking/rwsem: Merge rwsem.h and rwsem-xadd.c into rwsem.c
  locking/rwsem: Code cleanup after files merging
  locking/rwsem: Make rwsem_spin_on_owner() return owner state
  locking/rwsem: Implement lock handoff to prevent lock starvation
  locking/rwsem: Always release wait_lock before waking up tasks
  locking/rwsem: More optimal RT task handling of null owner
  locking/rwsem: Wake up almost all readers in wait queue
  locking/rwsem: Clarify usage of owner's nonspinnable bit
  locking/rwsem: Enable readers spinning on writer
  locking/rwsem: Make rwsem->owner an atomic_long_t
  locking/rwsem: Enable time-based spinning on reader-owned rwsem
  locking/rwsem: Adaptive disabling of reader optimistic spinning
  locking/rwsem: Guard against making count negative
  locking/rwsem: Merge owner into count on x86-64
  locking/rwsem: Remove redundant computation of writer lock word
  locking/rwsem: Disable preemption in down_read*() if owner in count

 arch/x86/Kconfig                  |    6 +
 include/linux/percpu-rwsem.h      |    4 +-
 include/linux/rwsem.h             |   16 +-
 include/linux/sched/wake_q.h      |    5 +
 kernel/Kconfig.locks              |   12 +
 kernel/locking/Makefile           |    2 +-
 kernel/locking/lock_events_list.h |   12 +-
 kernel/locking/rwsem-xadd.c       |  745 -------------
 kernel/locking/rwsem.c            | 1668 ++++++++++++++++++++++++++++-
 kernel/locking/rwsem.h            |  306 +-----
 lib/Kconfig.debug                 |    8 +-
 11 files changed, 1694 insertions(+), 1090 deletions(-)
 delete mode 100644 kernel/locking/rwsem-xadd.c

-- 

diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
index 03cb4b6f842e..0a43830f1932 100644
--- a/include/linux/percpu-rwsem.h
+++ b/include/linux/percpu-rwsem.h
@@ -117,7 +117,7 @@ static inline void percpu_rwsem_release(struct percpu_rw_semaphore *sem,
 	lock_release(&sem->rw_sem.dep_map, 1, ip);
 #ifdef CONFIG_RWSEM_SPIN_ON_OWNER
 	if (!read)
-		sem->rw_sem.owner = RWSEM_OWNER_UNKNOWN;
+		atomic_long_set(&sem->rw_sem.owner, RWSEM_OWNER_UNKNOWN);
 #endif
 }
 
@@ -127,7 +127,7 @@ static inline void percpu_rwsem_acquire(struct percpu_rw_semaphore *sem,
 	lock_acquire(&sem->rw_sem.dep_map, 0, 1, read, 1, NULL, ip);
 #ifdef CONFIG_RWSEM_SPIN_ON_OWNER
 	if (!read)
-		sem->rw_sem.owner = current;
+		atomic_long_set(&sem->rw_sem.owner, (long)current);
 #endif
 }
 
diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
index bb76e82398b2..e401358c4e7e 100644
--- a/include/linux/rwsem.h
+++ b/include/linux/rwsem.h
@@ -35,10 +35,11 @@
 struct rw_semaphore {
 	atomic_long_t count;
 	/*
-	 * Write owner or one of the read owners. Can be used as a
-	 * speculative check to see if the owner is running on the cpu.
+	 * Write owner or one of the read owners as well flags regarding
+	 * the current state of the rwsem. Can be used as a speculative
+	 * check to see if the write owner is running on the cpu.
 	 */
-	struct task_struct *owner;
+	atomic_long_t owner;
 #ifdef CONFIG_RWSEM_SPIN_ON_OWNER
 	struct optimistic_spin_queue osq; /* spinner MCS lock */
 #endif
@@ -53,7 +54,7 @@ struct rw_semaphore {
  * Setting all bits of the owner field except bit 0 will indicate
  * that the rwsem is writer-owned with an unknown owner.
  */
-#define RWSEM_OWNER_UNKNOWN	((struct task_struct *)-2L)
+#define RWSEM_OWNER_UNKNOWN	(-2L)
 
 /* In all implementations count != 0 means locked */
 static inline int rwsem_is_locked(struct rw_semaphore *sem)
@@ -80,7 +81,7 @@ static inline int rwsem_is_locked(struct rw_semaphore *sem)
 
 #define __RWSEM_INITIALIZER(name)				\
 	{ __RWSEM_INIT_COUNT(name),				\
-	  .owner = NULL,					\
+	  .owner = ATOMIC_LONG_INIT(0),				\
 	  .wait_list = LIST_HEAD_INIT((name).wait_list),	\
 	  .wait_lock = __RAW_SPIN_LOCK_UNLOCKED(name.wait_lock)	\
 	  __RWSEM_OPT_INIT(name)				\
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 60783267b50d..cede2f99220b 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -107,7 +107,7 @@
 	if (!debug_locks_silent &&				\
 	    WARN_ONCE(c, "DEBUG_RWSEMS_WARN_ON(%s): count = 0x%lx, owner = 0x%lx, curr 0x%lx, list %sempty\n",\
 		#c, atomic_long_read(&(sem)->count),		\
-		(long)((sem)->owner), (long)current,		\
+		atomic_long_read(&(sem)->owner), (long)current,	\
 		list_empty(&(sem)->wait_list) ? "" : "not "))	\
 			debug_locks_off();			\
 	} while (0)
@@ -174,9 +174,9 @@
  * atomic_long_cmpxchg() will be used to obtain writer lock.
  *
  * There are three places where the lock handoff bit may be set or cleared.
- * 1) __rwsem_mark_wake() for readers.
+ * 1) rwsem_mark_wake() for readers.
  * 2) rwsem_try_write_lock() for writers.
- * 3) Error path of __rwsem_down_write_failed_common().
+ * 3) Error path of rwsem_down_write_slowpath().
  *
  * For all the above cases, wait_lock will be held. A writer must also
  * be the first one in the wait_list to be eligible for setting the handoff
@@ -259,12 +259,20 @@ static inline unsigned long rwsem_count_owner(long count)
 
 static inline void rwsem_set_owner(struct rw_semaphore *sem)
 {
-	WRITE_ONCE(sem->owner, current);
+	atomic_long_set(&sem->owner, (long)current);
 }
 
 static inline void rwsem_clear_owner(struct rw_semaphore *sem)
 {
-	WRITE_ONCE(sem->owner, NULL);
+	atomic_long_set(&sem->owner, 0);
+}
+
+/*
+ * Test the flags in the owner field.
+ */
+static inline bool rwsem_test_oflags(struct rw_semaphore *sem, long flags)
+{
+	return atomic_long_read(&sem->owner) & flags;
 }
 
 /*
@@ -280,11 +288,10 @@ static inline void rwsem_clear_owner(struct rw_semaphore *sem)
 static inline void __rwsem_set_reader_owned(struct rw_semaphore *sem,
 					    struct task_struct *owner)
 {
-	unsigned long val = (unsigned long)owner | RWSEM_READER_OWNED |
-			   ((unsigned long)READ_ONCE(sem->owner) &
-			     RWSEM_RD_NONSPINNABLE);
+	long val = (long)owner | RWSEM_READER_OWNED |
+		   (atomic_long_read(&sem->owner) & RWSEM_RD_NONSPINNABLE);
 
-	WRITE_ONCE(sem->owner, (struct task_struct *)val);
+	atomic_long_set(&sem->owner, val);
 }
 
 static inline void rwsem_set_reader_owned(struct rw_semaphore *sem)
@@ -292,34 +299,6 @@ static inline void rwsem_set_reader_owned(struct rw_semaphore *sem)
 	__rwsem_set_reader_owned(sem, current);
 }
 
-/*
- * Return true if the a rwsem waiter can spin on the rwsem's owner
- * and steal the lock.
- * N.B. !owner is considered spinnable.
- */
-static inline bool is_rwsem_owner_spinnable(struct task_struct *owner, bool wr)
-{
-	unsigned long bit = wr ? RWSEM_WR_NONSPINNABLE : RWSEM_RD_NONSPINNABLE;
-
-	return !((unsigned long)owner & bit);
-}
-
-/*
- * Return true if the rwsem is spinnable.
- */
-static inline bool is_rwsem_spinnable(struct rw_semaphore *sem, bool wr)
-{
-	return is_rwsem_owner_spinnable(READ_ONCE(sem->owner), wr);
-}
-
-/*
- * Remove all the flag bits from owner.
- */
-static inline struct task_struct *owner_without_flags(struct task_struct *owner)
-{
-	return (struct task_struct *)((long)owner & ~RWSEM_OWNER_FLAGS_MASK);
-}
-
 /*
  * Return true if the rwsem is owned by a reader.
  */
@@ -334,7 +313,7 @@ static inline bool is_rwsem_reader_owned(struct rw_semaphore *sem)
 	if (count & RWSEM_WRITER_MASK)
 		return false;
 #endif
-	return (unsigned long)sem->owner & RWSEM_READER_OWNED;
+	return rwsem_test_oflags(sem, RWSEM_READER_OWNED);
 }
 
 #ifdef CONFIG_DEBUG_RWSEMS
@@ -346,11 +325,13 @@ static inline bool is_rwsem_reader_owned(struct rw_semaphore *sem)
  */
 static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
 {
-	unsigned long owner = (unsigned long)READ_ONCE(sem->owner);
+	long val = atomic_long_read(&sem->owner);
 
-	if ((owner & ~RWSEM_OWNER_FLAGS_MASK) == (unsigned long)current)
-		cmpxchg_relaxed((unsigned long *)&sem->owner, owner,
-				owner & RWSEM_OWNER_FLAGS_MASK);
+	while ((val & ~RWSEM_OWNER_FLAGS_MASK) == (long)current) {
+		if (atomic_long_try_cmpxchg(&sem->owner, &val,
+					    val & RWSEM_OWNER_FLAGS_MASK))
+			return;
+	}
 }
 #else
 static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
@@ -364,13 +345,13 @@ static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
  */
 static inline void rwsem_set_nonspinnable(struct rw_semaphore *sem)
 {
-	long owner = (long)READ_ONCE(sem->owner);
+	long owner = atomic_long_read(&sem->owner);
 
 	while (owner & RWSEM_READER_OWNED) {
-		if (!is_rwsem_owner_spinnable((void *)owner, false))
+		if (owner & RWSEM_NONSPINNABLE)
 			break;
-		owner = cmpxchg((long *)&sem->owner, owner,
-				owner | RWSEM_NONSPINNABLE);
+		owner = atomic_long_cmpxchg(&sem->owner, owner,
+					    owner | RWSEM_NONSPINNABLE);
 	}
 }
 
@@ -395,16 +376,22 @@ static inline void rwsem_set_nonspinnable(struct rw_semaphore *sem)
 
 /*
  * Get the owner value from count to have early access to the task structure.
- * Owner from sem->count should includes the RWSEM_NONSPINNABLE bits
- * from sem->owner.
  */
-static inline struct task_struct *rwsem_get_owner(struct rw_semaphore *sem)
+static inline struct task_struct *rwsem_read_owner(struct rw_semaphore *sem)
 {
-	unsigned long cowner = rwsem_count_owner(atomic_long_read(&sem->count));
-	unsigned long sowner = (unsigned long)READ_ONCE(sem->owner);
+	return (struct task_struct *)
+		rwsem_count_owner(atomic_long_read(&sem->count));
+}
 
-	return (struct task_struct *) (cowner
-		? cowner | (sowner & RWSEM_NONSPINNABLE) : sowner);
+/*
+ * Return the real task structure pointer of the owner and the embedded
+ * flags in the owner.
+ */
+static inline struct task_struct *
+rwsem_read_owner_flags(struct rw_semaphore *sem, long *pflags)
+{
+	*pflags = atomic_long_read(&sem->owner) & RWSEM_OWNER_FLAGS_MASK;
+	return rwsem_read_owner(sem);
 }
 
 /*
@@ -448,15 +435,33 @@ static int __init rwsem_show_count_status(void)
 	return 0;
 }
 late_initcall(rwsem_show_count_status);
+
 #else /* !MERGE_OWNER_INTO_COUNT */
 
 #define rwsem_preempt_disable()
 #define rwsem_preempt_enable()
 #define rwsem_schedule_preempt_disabled()	schedule()
 
-static inline struct task_struct *rwsem_get_owner(struct rw_semaphore *sem)
+/*
+ * Return just the real task structure pointer of the owner
+ */
+static inline struct task_struct *rwsem_read_owner(struct rw_semaphore *sem)
 {
-	return READ_ONCE(sem->owner);
+	return (struct task_struct *)(atomic_long_read(&sem->owner) &
+					~RWSEM_OWNER_FLAGS_MASK);
+}
+
+/*
+ * Return the real task structure pointer of the owner and the embedded
+ * flags in the owner. pflags must be non-NULL.
+ */
+static inline struct task_struct *
+rwsem_read_owner_flags(struct rw_semaphore *sem, long *pflags)
+{
+	long owner = atomic_long_read(&sem->owner);
+
+	*pflags = owner & RWSEM_OWNER_FLAGS_MASK;
+	return (struct task_struct *)(owner & ~RWSEM_OWNER_FLAGS_MASK);
 }
 
 static inline long rwsem_read_trylock(struct rw_semaphore *sem, long *cnt)
@@ -499,7 +504,7 @@ void __init_rwsem(struct rw_semaphore *sem, const char *name,
 	atomic_long_set(&sem->count, RWSEM_UNLOCKED_VALUE);
 	raw_spin_lock_init(&sem->wait_lock);
 	INIT_LIST_HEAD(&sem->wait_list);
-	sem->owner = NULL;
+	atomic_long_set(&sem->owner, 0L);
 #ifdef CONFIG_RWSEM_SPIN_ON_OWNER
 	osq_lock_init(&sem->osq);
 #endif
@@ -516,7 +521,7 @@ struct rwsem_waiter {
 	struct task_struct *task;
 	enum rwsem_waiter_type type;
 	unsigned long timeout;
-	unsigned long last_rowner;
+	long last_rowner;
 };
 #define rwsem_first_waiter(sem) \
 	list_first_entry(&sem->wait_list, struct rwsem_waiter, list)
@@ -560,9 +565,9 @@ enum writer_wait_state {
  * - woken process blocks are discarded from the list after having task zeroed
  * - writers are only marked woken if downgrading is false
  */
-static void __rwsem_mark_wake(struct rw_semaphore *sem,
-			      enum rwsem_wake_type wake_type,
-			      struct wake_q_head *wake_q)
+static void rwsem_mark_wake(struct rw_semaphore *sem,
+			    enum rwsem_wake_type wake_type,
+			    struct wake_q_head *wake_q)
 {
 	struct rwsem_waiter *waiter, *tmp;
 	long oldcount, woken = 0, adjustment = 0;
@@ -722,25 +727,31 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
  * If wstate is WRITER_HANDOFF, it will make sure that either the handoff
  * bit is set or the lock is acquired with handoff bit cleared.
  */
-static inline bool rwsem_try_write_lock(long count, struct rw_semaphore *sem,
+static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
 					const long wlock,
 					enum writer_wait_state wstate)
 {
-	long new;
+	long count, new;
 
 	lockdep_assert_held(&sem->wait_lock);
+
+	count = atomic_long_read(&sem->count);
 	do {
 		bool has_handoff = !!(count & RWSEM_FLAG_HANDOFF);
 
 		if (has_handoff && wstate == WRITER_NOT_FIRST)
 			return false;
 
+		new = count;
+
 		if (count & RWSEM_LOCK_MASK) {
 			if (has_handoff || (wstate != WRITER_HANDOFF))
 				return false;
-			new = count | RWSEM_FLAG_HANDOFF;
+
+			new |= RWSEM_FLAG_HANDOFF;
 		} else {
-			new = (count | wlock) & ~RWSEM_FLAG_HANDOFF;
+			new |= wlock;
+			new &= ~RWSEM_FLAG_HANDOFF;
 
 			if (list_is_singular(&sem->wait_list))
 				new &= ~RWSEM_FLAG_WAITERS;
@@ -804,11 +815,6 @@ static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem,
 
 static inline bool owner_on_cpu(struct task_struct *owner)
 {
-	/*
-	 * Clear all the flag bits in owner
-	 */
-	owner = owner_without_flags(owner);
-
 	/*
 	 * As lock holder preemption issue, we both skip spinning if
 	 * task is not on cpu or its cpu is preempted
@@ -816,12 +822,14 @@ static inline bool owner_on_cpu(struct task_struct *owner)
 	return owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
 }
 
-static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem, bool wr)
+static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem,
+					   long nonspinnable)
 {
 	struct task_struct *owner;
+	long flags;
 	bool ret = true;
 
-	BUILD_BUG_ON(is_rwsem_owner_spinnable(RWSEM_OWNER_UNKNOWN, true));
+	BUILD_BUG_ON(!(RWSEM_OWNER_UNKNOWN & RWSEM_NONSPINNABLE));
 
 	if (need_resched()) {
 		lockevent_inc(rwsem_opt_fail);
@@ -830,17 +838,12 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem, bool wr)
 
 	preempt_disable();
 	rcu_read_lock();
-
-	owner = rwsem_get_owner(sem);
-	if (!is_rwsem_owner_spinnable(owner, wr))
+	owner = rwsem_read_owner_flags(sem, &flags);
+	if ((flags & nonspinnable) || (owner && !owner_on_cpu(owner)))
 		ret = false;
-	else if ((unsigned long)owner & RWSEM_READER_OWNED)
-		ret = true;
-	else if ((owner = owner_without_flags(owner)))
-		ret = owner_on_cpu(owner);
-
 	rcu_read_unlock();
 	preempt_enable();
+
 	lockevent_cond_inc(rwsem_opt_fail, !ret);
 	return ret;
 }
@@ -864,26 +867,27 @@ enum owner_state {
 };
 #define OWNER_SPINNABLE		(OWNER_NULL | OWNER_WRITER | OWNER_READER)
 
-static inline enum owner_state rwsem_owner_state(unsigned long owner, bool wr)
+static inline enum owner_state rwsem_owner_state(struct task_struct *owner,
+						 long flags, long nonspinnable)
 {
-	if (!is_rwsem_owner_spinnable((void *)owner, wr))
+	if (flags & nonspinnable)
 		return OWNER_NONSPINNABLE;
 
-	if (owner & RWSEM_READER_OWNED)
+	if (flags & RWSEM_READER_OWNED)
 		return OWNER_READER;
 
-	if (!(owner & ~RWSEM_OWNER_FLAGS_MASK))
-		return OWNER_NULL;
-
-	return OWNER_WRITER;
+	return owner ? OWNER_WRITER : OWNER_NULL;
 }
 
-static noinline enum owner_state
-rwsem_spin_on_owner(struct rw_semaphore *sem, bool wr)
+static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem,
+						     long nonspinnable)
 {
-	struct task_struct *tmp, *owner = rwsem_get_owner(sem);
-	enum owner_state state = rwsem_owner_state((unsigned long)owner, wr);
+	struct task_struct *new, *owner;
+	long flags, new_flags;
+	enum owner_state state;
 
+	owner = rwsem_read_owner_flags(sem, &flags);
+	state = rwsem_owner_state(owner, flags, nonspinnable);
 	if (state != OWNER_WRITER)
 		return state;
 
@@ -894,9 +898,9 @@ rwsem_spin_on_owner(struct rw_semaphore *sem, bool wr)
 			break;
 		}
 
-		tmp = rwsem_get_owner(sem);
-		if (tmp != owner) {
-			state = rwsem_owner_state((unsigned long)tmp, wr);
+		new = rwsem_read_owner_flags(sem, &new_flags);
+		if ((new != owner) || (new_flags != flags)) {
+			state = rwsem_owner_state(new, new_flags, nonspinnable);
 			break;
 		}
 
@@ -923,40 +927,24 @@ rwsem_spin_on_owner(struct rw_semaphore *sem, bool wr)
 /*
  * Calculate reader-owned rwsem spinning threshold for writer
  *
- * It is assumed that the more readers own the rwsem, the longer it will
- * take for them to wind down and free the rwsem. So the formula to
- * determine the actual spinning time limit is:
- *
- * 1) RWSEM_FLAG_WAITERS set
- *    Spinning threshold = (10 + nr_readers/2)us
+ * The more readers own the rwsem, the longer it will take for them to
+ * wind down and free the rwsem. So the empirical formula used to
+ * determine the actual spinning time limit here is:
  *
- * 2) RWSEM_FLAG_WAITERS not set
- *    Spinning threshold = 25us
+ *   Spinning threshold = (10 + nr_readers/2)us
  *
- * In the first case when RWSEM_FLAG_WAITERS is set, no new reader can
- * become rwsem owner. It is assumed that the more readers own the rwsem,
- * the longer it will take for them to wind down and free the rwsem. In
- * addition, if it happens that a previous task that releases the lock
- * is in the process of waking up readers one-by-one, the process will
- * take longer when more readers needed to be woken up. This is subjected
- * to a maximum value of 25us.
- *
- * In the second case with RWSEM_FLAG_WAITERS off, new readers can join
- * and become one of the owners. So assuming for the worst case and spin
- * for at most 25us.
+ * The limit is capped to a maximum of 25us (30 readers). This is just
+ * a heuristic and is subjected to change in the future.
  */
 static inline u64 rwsem_rspin_threshold(struct rw_semaphore *sem)
 {
 	long count = atomic_long_read(&sem->count);
-	u64 delta = 25 * NSEC_PER_USEC;
+	int readers = count >> RWSEM_READER_SHIFT;
+	u64 delta;
 
-	if (count & RWSEM_FLAG_WAITERS) {
-		int readers = count >> RWSEM_READER_SHIFT;
-
-		if (readers > 30)
-			readers = 30;
-		delta = (20 + readers) * NSEC_PER_USEC / 2;
-	}
+	if (readers > 30)
+		readers = 30;
+	delta = (20 + readers) * NSEC_PER_USEC / 2;
 
 	return sched_clock() + delta;
 }
@@ -967,6 +955,8 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem, const long wlock)
 	int prev_owner_state = OWNER_NULL;
 	int loop = 0;
 	u64 rspin_threshold = 0;
+	long nonspinnable = wlock ? RWSEM_WR_NONSPINNABLE
+				  : RWSEM_RD_NONSPINNABLE;
 
 	preempt_disable();
 
@@ -981,8 +971,9 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem, const long wlock)
 	 *  2) readers own the lock and spinning time has exceeded limit.
 	 */
 	for (;;) {
-		enum owner_state owner_state = rwsem_spin_on_owner(sem, wlock);
+		enum owner_state owner_state;
 
+		owner_state = rwsem_spin_on_owner(sem, nonspinnable);
 		if (!(owner_state & OWNER_SPINNABLE))
 			break;
 
@@ -1007,7 +998,7 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem, const long wlock)
 			 * the beginning of the 2nd reader phase.
 			 */
 			if (prev_owner_state != OWNER_READER) {
-				if (!is_rwsem_spinnable(sem, wlock))
+				if (rwsem_test_oflags(sem, nonspinnable))
 					break;
 				rspin_threshold = rwsem_rspin_threshold(sem);
 				loop = 0;
@@ -1095,9 +1086,8 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem, const long wlock)
  */
 static inline void clear_wr_nonspinnable(struct rw_semaphore *sem)
 {
-	if (!is_rwsem_spinnable(sem, true))
-		atomic_long_andnot(RWSEM_WR_NONSPINNABLE,
-				  (atomic_long_t *)&sem->owner);
+	if (rwsem_test_oflags(sem, RWSEM_WR_NONSPINNABLE))
+		atomic_long_andnot(RWSEM_WR_NONSPINNABLE, &sem->owner);
 }
 
 /*
@@ -1120,9 +1110,9 @@ static inline void clear_wr_nonspinnable(struct rw_semaphore *sem)
  * not be here at all.
  */
 static inline bool rwsem_reader_phase_trylock(struct rw_semaphore *sem,
-					      unsigned long last_rowner)
+					      long last_rowner)
 {
-	unsigned long owner = (unsigned long)READ_ONCE(sem->owner);
+	long owner = atomic_long_read(&sem->owner);
 
 	if (!(owner & RWSEM_READER_OWNED))
 		return false;
@@ -1137,7 +1127,8 @@ static inline bool rwsem_reader_phase_trylock(struct rw_semaphore *sem,
 	return false;
 }
 #else
-static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem, bool wr)
+static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem,
+					   long nonspinnable)
 {
 	return false;
 }
@@ -1180,7 +1171,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state, long adjustment)
 		 * holding or trying to acquire the lock. So disable
 		 * optimistic spinning and go directly into the wait list.
 		 */
-		if (is_rwsem_spinnable(sem, false))
+		if (rwsem_test_oflags(sem, RWSEM_RD_NONSPINNABLE))
 			rwsem_set_nonspinnable(sem);
 		goto queue;
 	}
@@ -1189,11 +1180,11 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state, long adjustment)
 	 * Save the current read-owner of rwsem, if available, and the
 	 * reader nonspinnable bit.
 	 */
-	waiter.last_rowner = (long)READ_ONCE(sem->owner);
+	waiter.last_rowner = atomic_long_read(&sem->owner);
 	if (!(waiter.last_rowner & RWSEM_READER_OWNED))
 		waiter.last_rowner &= RWSEM_RD_NONSPINNABLE;
 
-	if (!rwsem_can_spin_on_owner(sem, false))
+	if (!rwsem_can_spin_on_owner(sem, RWSEM_RD_NONSPINNABLE))
 		goto queue;
 
 	/*
@@ -1209,8 +1200,8 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state, long adjustment)
 		if ((atomic_long_read(&sem->count) & RWSEM_FLAG_WAITERS)) {
 			raw_spin_lock_irq(&sem->wait_lock);
 			if (!list_empty(&sem->wait_list))
-				__rwsem_mark_wake(sem, RWSEM_WAKE_READ_OWNED,
-						  &wake_q);
+				rwsem_mark_wake(sem, RWSEM_WAKE_READ_OWNED,
+						&wake_q);
 			raw_spin_unlock_irq(&sem->wait_lock);
 			wake_up_q(&wake_q);
 		}
@@ -1261,7 +1252,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state, long adjustment)
 	}
 	if (wake || (!(count & RWSEM_WRITER_MASK) &&
 		    (adjustment & RWSEM_FLAG_WAITERS)))
-		__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
+		rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
 
 	raw_spin_unlock_irq(&sem->wait_lock);
 	wake_up_q(&wake_q);
@@ -1297,11 +1288,15 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state, long adjustment)
 	return ERR_PTR(-EINTR);
 }
 
+/*
+ * This function is called by the a write lock owner. So the owner value
+ * won't get changed by others.
+ */
 static inline void rwsem_disable_reader_optspin(struct rw_semaphore *sem,
 						bool disable)
 {
 	if (unlikely(disable)) {
-		*((unsigned long *)&sem->owner) |= RWSEM_RD_NONSPINNABLE;
+		atomic_long_or(RWSEM_RD_NONSPINNABLE, &sem->owner);
 		lockevent_inc(rwsem_opt_norspin);
 	}
 }
@@ -1321,7 +1316,7 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 	const long wlock = RWSEM_WRITER_LOCKED;
 
 	/* do optimistic spinning and steal lock if possible */
-	if (rwsem_can_spin_on_owner(sem, true) &&
+	if (rwsem_can_spin_on_owner(sem, RWSEM_WR_NONSPINNABLE) &&
 	    rwsem_optimistic_spin(sem, wlock))
 		return sem;
 
@@ -1330,7 +1325,7 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 	 * acquiring the write lock when the setting of the nonspinnable
 	 * bits are observed.
 	 */
-	disable_rspin = (long)READ_ONCE(sem->owner) & RWSEM_NONSPINNABLE;
+	disable_rspin = atomic_long_read(&sem->owner) & RWSEM_NONSPINNABLE;
 
 	/*
 	 * Optimistic spinning failed, proceed to the slowpath
@@ -1362,7 +1357,7 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 		if (count & RWSEM_WRITER_MASK)
 			goto wait;
 
-		__rwsem_mark_wake(sem, (count & RWSEM_READER_MASK)
+		rwsem_mark_wake(sem, (count & RWSEM_READER_MASK)
 					? RWSEM_WAKE_READERS
 					: RWSEM_WAKE_ANY, &wake_q);
 
@@ -1375,25 +1370,16 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 			wake_up_q(&wake_q);
 			wake_q_init(&wake_q);	/* Used again, reinit */
 			raw_spin_lock_irq(&sem->wait_lock);
-			/*
-			 * This waiter may have become first in the wait
-			 * list after re-acquring the wait_lock. The
-			 * rwsem_first_waiter() test in the main while
-			 * loop below will correctly detect that. We do
-			 * need to reload count to perform proper trylock
-			 * and avoid missed wakeup.
-			 */
-			count = atomic_long_read(&sem->count);
 		}
 	} else {
-		count = atomic_long_add_return(RWSEM_FLAG_WAITERS, &sem->count);
+		atomic_long_or(RWSEM_FLAG_WAITERS, &sem->count);
 	}
 
 wait:
 	/* wait until we successfully acquire the lock */
 	set_current_state(state);
 	while (true) {
-		if (rwsem_try_write_lock(count, sem, wlock, wstate))
+		if (rwsem_try_write_lock(sem, wlock, wstate))
 			break;
 
 		raw_spin_unlock_irq(&sem->wait_lock);
@@ -1434,7 +1420,6 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 		}
 
 		raw_spin_lock_irq(&sem->wait_lock);
-		count = atomic_long_read(&sem->count);
 	}
 	__set_current_state(TASK_RUNNING);
 	list_del(&waiter.list);
@@ -1455,7 +1440,7 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 	if (list_empty(&sem->wait_list))
 		atomic_long_andnot(RWSEM_FLAG_WAITERS, &sem->count);
 	else
-		__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
+		rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
 	raw_spin_unlock_irq(&sem->wait_lock);
 	wake_up_q(&wake_q);
 	lockevent_inc(rwsem_wlock_fail);
@@ -1475,7 +1460,7 @@ static struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem, long count)
 	raw_spin_lock_irqsave(&sem->wait_lock, flags);
 
 	if (!list_empty(&sem->wait_list))
-		__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
+		rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
 
 	raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
 	wake_up_q(&wake_q);
@@ -1496,7 +1481,7 @@ static struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem)
 	raw_spin_lock_irqsave(&sem->wait_lock, flags);
 
 	if (!list_empty(&sem->wait_list))
-		__rwsem_mark_wake(sem, RWSEM_WAKE_READ_OWNED, &wake_q);
+		rwsem_mark_wake(sem, RWSEM_WAKE_READ_OWNED, &wake_q);
 
 	raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
 	wake_up_q(&wake_q);
@@ -1572,7 +1557,12 @@ static inline void __down_write(struct rw_semaphore *sem)
 	else
 		rwsem_set_owner(sem);
 #ifdef MERGE_OWNER_INTO_COUNT
-	DEBUG_RWSEMS_WARN_ON(sem->owner != rwsem_get_owner(sem), sem);
+	/*
+	 * Make sure that count<=>owner translation is correct.
+	 */
+	DEBUG_RWSEMS_WARN_ON(
+		(atomic_long_read(&sem->owner) & ~RWSEM_OWNER_FLAGS_MASK) !=
+		(long)rwsem_read_owner(sem), sem);
 #endif
 }
 
@@ -1631,8 +1621,8 @@ static inline void __up_write(struct rw_semaphore *sem)
 	 * sem->owner may differ from current if the ownership is transferred
 	 * to an anonymous writer by setting the RWSEM_NONSPINNABLE bits.
 	 */
-	DEBUG_RWSEMS_WARN_ON((sem->owner != current) &&
-			    !((long)sem->owner & RWSEM_NONSPINNABLE), sem);
+	DEBUG_RWSEMS_WARN_ON((rwsem_read_owner(sem) != current) &&
+			    !rwsem_test_oflags(sem, RWSEM_NONSPINNABLE), sem);
 	rwsem_clear_owner(sem);
 	tmp = atomic_long_fetch_and_release(~RWSEM_WRITER_MASK, &sem->count);
 	if (unlikely(tmp & RWSEM_FLAG_WAITERS))
@@ -1653,7 +1643,7 @@ static inline void __downgrade_write(struct rw_semaphore *sem)
 	 * read-locked region is ok to be re-ordered into the
 	 * write side. As such, rely on RELEASE semantics.
 	 */
-	DEBUG_RWSEMS_WARN_ON(owner_without_flags(sem->owner) != current, sem);
+	DEBUG_RWSEMS_WARN_ON(rwsem_read_owner(sem) != current, sem);
 	tmp = atomic_long_fetch_add_release(
 		-RWSEM_WRITER_LOCKED+RWSEM_READER_BIAS, &sem->count);
 	rwsem_set_reader_owned(sem);


* [PATCH v8 01/19] locking/rwsem: Make owner available even if !CONFIG_RWSEM_SPIN_ON_OWNER
  2019-05-20 20:58 [PATCH v8 00/19] locking/rwsem: Rwsem rearchitecture part 2 Waiman Long
@ 2019-05-20 20:59 ` Waiman Long
  2019-06-17 14:23   ` [tip:locking/core] " tip-bot for Waiman Long
  2019-05-20 20:59 ` [PATCH v8 02/19] locking/rwsem: Remove rwsem_wake() wakeup optimization Waiman Long
                   ` (17 subsequent siblings)
  18 siblings, 1 reply; 83+ messages in thread
From: Waiman Long @ 2019-05-20 20:59 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin
  Cc: linux-kernel, x86, Davidlohr Bueso, Linus Torvalds, Tim Chen,
	huang ying, Waiman Long

The owner field in the rw_semaphore structure is used primarily for
optimistic spinning. However, identifying the rwsem owner can also be
helpful for debugging as well as tracing locking-related issues when
analyzing crash dumps. The owner field may also store state information
that can be important to the operation of the rwsem.

So the owner field is now made a permanent member of the rw_semaphore
structure irrespective of CONFIG_RWSEM_SPIN_ON_OWNER.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/rwsem.h       |  9 +++++----
 kernel/locking/rwsem-xadd.c |  2 +-
 kernel/locking/rwsem.h      | 23 -----------------------
 lib/Kconfig.debug           |  8 ++++----
 4 files changed, 10 insertions(+), 32 deletions(-)

diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
index 2ea18a3def04..148983e21d47 100644
--- a/include/linux/rwsem.h
+++ b/include/linux/rwsem.h
@@ -34,12 +34,12 @@
  */
 struct rw_semaphore {
 	atomic_long_t count;
-#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
 	/*
-	 * Write owner. Used as a speculative check to see
-	 * if the owner is running on the cpu.
+	 * Write owner or one of the read owners. Can be used as a
+	 * speculative check to see if the owner is running on the cpu.
 	 */
 	struct task_struct *owner;
+#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
 	struct optimistic_spin_queue osq; /* spinner MCS lock */
 #endif
 	raw_spinlock_t wait_lock;
@@ -73,13 +73,14 @@ static inline int rwsem_is_locked(struct rw_semaphore *sem)
 #endif
 
 #ifdef CONFIG_RWSEM_SPIN_ON_OWNER
-#define __RWSEM_OPT_INIT(lockname) , .osq = OSQ_LOCK_UNLOCKED, .owner = NULL
+#define __RWSEM_OPT_INIT(lockname) , .osq = OSQ_LOCK_UNLOCKED
 #else
 #define __RWSEM_OPT_INIT(lockname)
 #endif
 
 #define __RWSEM_INITIALIZER(name)				\
 	{ __RWSEM_INIT_COUNT(name),				\
+	  .owner = NULL,					\
 	  .wait_list = LIST_HEAD_INIT((name).wait_list),	\
 	  .wait_lock = __RAW_SPIN_LOCK_UNLOCKED(name.wait_lock)	\
 	  __RWSEM_OPT_INIT(name)				\
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 0b1f77957240..c0500679fd2f 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -86,8 +86,8 @@ void __init_rwsem(struct rw_semaphore *sem, const char *name,
 	atomic_long_set(&sem->count, RWSEM_UNLOCKED_VALUE);
 	raw_spin_lock_init(&sem->wait_lock);
 	INIT_LIST_HEAD(&sem->wait_list);
-#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
 	sem->owner = NULL;
+#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
 	osq_lock_init(&sem->osq);
 #endif
 }
diff --git a/kernel/locking/rwsem.h b/kernel/locking/rwsem.h
index 64877f5294e3..eb9c8534299b 100644
--- a/kernel/locking/rwsem.h
+++ b/kernel/locking/rwsem.h
@@ -61,7 +61,6 @@
 #define RWSEM_ACTIVE_READ_BIAS		RWSEM_ACTIVE_BIAS
 #define RWSEM_ACTIVE_WRITE_BIAS		(RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
 
-#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
 /*
  * All writes to owner are protected by WRITE_ONCE() to make sure that
  * store tearing can't happen as optimistic spinners may read and use
@@ -126,7 +125,6 @@ static inline bool rwsem_has_anonymous_owner(struct task_struct *owner)
  * real owner or one of the real owners. The only exception is when the
  * unlock is done by up_read_non_owner().
  */
-#define rwsem_clear_reader_owned rwsem_clear_reader_owned
 static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
 {
 	unsigned long val = (unsigned long)current | RWSEM_READER_OWNED
@@ -135,28 +133,7 @@ static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
 		cmpxchg_relaxed((unsigned long *)&sem->owner, val,
 				RWSEM_READER_OWNED | RWSEM_ANONYMOUSLY_OWNED);
 }
-#endif
-
 #else
-static inline void rwsem_set_owner(struct rw_semaphore *sem)
-{
-}
-
-static inline void rwsem_clear_owner(struct rw_semaphore *sem)
-{
-}
-
-static inline void __rwsem_set_reader_owned(struct rw_semaphore *sem,
-					   struct task_struct *owner)
-{
-}
-
-static inline void rwsem_set_reader_owned(struct rw_semaphore *sem)
-{
-}
-#endif
-
-#ifndef rwsem_clear_reader_owned
 static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
 {
 }
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index eae43952902e..60efaef4b6b2 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1094,7 +1094,7 @@ config PROVE_LOCKING
 	select DEBUG_SPINLOCK
 	select DEBUG_MUTEXES
 	select DEBUG_RT_MUTEXES if RT_MUTEXES
-	select DEBUG_RWSEMS if RWSEM_SPIN_ON_OWNER
+	select DEBUG_RWSEMS
 	select DEBUG_WW_MUTEX_SLOWPATH
 	select DEBUG_LOCK_ALLOC
 	select TRACE_IRQFLAGS
@@ -1198,10 +1198,10 @@ config DEBUG_WW_MUTEX_SLOWPATH
 
 config DEBUG_RWSEMS
 	bool "RW Semaphore debugging: basic checks"
-	depends on DEBUG_KERNEL && RWSEM_SPIN_ON_OWNER
+	depends on DEBUG_KERNEL
 	help
-	  This debugging feature allows mismatched rw semaphore locks and unlocks
-	  to be detected and reported.
+	  This debugging feature allows mismatched rw semaphore locks
+	  and unlocks to be detected and reported.
 
 config DEBUG_LOCK_ALLOC
 	bool "Lock debugging: detect incorrect freeing of live locks"
-- 
2.18.1



* [PATCH v8 02/19] locking/rwsem: Remove rwsem_wake() wakeup optimization
  2019-05-20 20:58 [PATCH v8 00/19] locking/rwsem: Rwsem rearchitecture part 2 Waiman Long
  2019-05-20 20:59 ` [PATCH v8 01/19] locking/rwsem: Make owner available even if !CONFIG_RWSEM_SPIN_ON_OWNER Waiman Long
@ 2019-05-20 20:59 ` Waiman Long
  2019-06-17 14:24   ` [tip:locking/core] " tip-bot for Waiman Long
  2019-05-20 20:59 ` [PATCH v8 03/19] locking/rwsem: Implement a new locking scheme Waiman Long
                   ` (16 subsequent siblings)
  18 siblings, 1 reply; 83+ messages in thread
From: Waiman Long @ 2019-05-20 20:59 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin
  Cc: linux-kernel, x86, Davidlohr Bueso, Linus Torvalds, Tim Chen,
	huang ying, Waiman Long

With the commit 59aabfc7e959 ("locking/rwsem: Reduce spinlock contention
in wakeup after up_read()/up_write()"), rwsem_wake() forgoes doing
a wakeup if the wait_lock cannot be directly acquired and an optimistic
spinning locker is present.  This can help performance by avoiding
spinning on the wait_lock when it is contended.

With the later commit 133e89ef5ef3 ("locking/rwsem: Enable lockless
waiter wakeup(s)"), the performance advantage of the above optimization
diminishes as the average wait_lock hold time becomes much shorter.

With a later patch that supports rwsem lock handoff, we can no
longer rely on the fact that the presence of an optimistic spinning
locker will ensure that the lock will be acquired by a task soon and
that rwsem_wake() will be called later on to wake up waiters. This can
lead to missed wakeups and application hangs. So the commit 59aabfc7e959
("locking/rwsem: Reduce spinlock contention in wakeup after
up_read()/up_write()") will have to be reverted.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/rwsem-xadd.c | 72 -------------------------------------
 1 file changed, 72 deletions(-)

diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index c0500679fd2f..3083fdf50447 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -411,25 +411,11 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 	lockevent_cond_inc(rwsem_opt_fail, !taken);
 	return taken;
 }
-
-/*
- * Return true if the rwsem has active spinner
- */
-static inline bool rwsem_has_spinner(struct rw_semaphore *sem)
-{
-	return osq_is_locked(&sem->osq);
-}
-
 #else
 static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 {
 	return false;
 }
-
-static inline bool rwsem_has_spinner(struct rw_semaphore *sem)
-{
-	return false;
-}
 #endif
 
 /*
@@ -651,65 +637,7 @@ struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
 	unsigned long flags;
 	DEFINE_WAKE_Q(wake_q);
 
-	/*
-	* __rwsem_down_write_failed_common(sem)
-	*   rwsem_optimistic_spin(sem)
-	*     osq_unlock(sem->osq)
-	*   ...
-	*   atomic_long_add_return(&sem->count)
-	*
-	*      - VS -
-	*
-	*              __up_write()
-	*                if (atomic_long_sub_return_release(&sem->count) < 0)
-	*                  rwsem_wake(sem)
-	*                    osq_is_locked(&sem->osq)
-	*
-	* And __up_write() must observe !osq_is_locked() when it observes the
-	* atomic_long_add_return() in order to not miss a wakeup.
-	*
-	* This boils down to:
-	*
-	* [S.rel] X = 1                [RmW] r0 = (Y += 0)
-	*         MB                         RMB
-	* [RmW]   Y += 1               [L]   r1 = X
-	*
-	* exists (r0=1 /\ r1=0)
-	*/
-	smp_rmb();
-
-	/*
-	 * If a spinner is present, it is not necessary to do the wakeup.
-	 * Try to do wakeup only if the trylock succeeds to minimize
-	 * spinlock contention which may introduce too much delay in the
-	 * unlock operation.
-	 *
-	 *    spinning writer		up_write/up_read caller
-	 *    ---------------		-----------------------
-	 * [S]   osq_unlock()		[L]   osq
-	 *	 MB			      RMB
-	 * [RmW] rwsem_try_write_lock() [RmW] spin_trylock(wait_lock)
-	 *
-	 * Here, it is important to make sure that there won't be a missed
-	 * wakeup while the rwsem is free and the only spinning writer goes
-	 * to sleep without taking the rwsem. Even when the spinning writer
-	 * is just going to break out of the waiting loop, it will still do
-	 * a trylock in rwsem_down_write_failed() before sleeping. IOW, if
-	 * rwsem_has_spinner() is true, it will guarantee at least one
-	 * trylock attempt on the rwsem later on.
-	 */
-	if (rwsem_has_spinner(sem)) {
-		/*
-		 * The smp_rmb() here is to make sure that the spinner
-		 * state is consulted before reading the wait_lock.
-		 */
-		smp_rmb();
-		if (!raw_spin_trylock_irqsave(&sem->wait_lock, flags))
-			return sem;
-		goto locked;
-	}
 	raw_spin_lock_irqsave(&sem->wait_lock, flags);
-locked:
 
 	if (!list_empty(&sem->wait_list))
 		__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
-- 
2.18.1



* [PATCH v8 03/19] locking/rwsem: Implement a new locking scheme
  2019-05-20 20:58 [PATCH v8 00/19] locking/rwsem: Rwsem rearchitecture part 2 Waiman Long
  2019-05-20 20:59 ` [PATCH v8 01/19] locking/rwsem: Make owner available even if !CONFIG_RWSEM_SPIN_ON_OWNER Waiman Long
  2019-05-20 20:59 ` [PATCH v8 02/19] locking/rwsem: Remove rwsem_wake() wakeup optimization Waiman Long
@ 2019-05-20 20:59 ` Waiman Long
  2019-06-17 14:24   ` [tip:locking/core] " tip-bot for Waiman Long
  2019-05-20 20:59 ` [PATCH v8 04/19] locking/rwsem: Merge rwsem.h and rwsem-xadd.c into rwsem.c Waiman Long
                   ` (15 subsequent siblings)
  18 siblings, 1 reply; 83+ messages in thread
From: Waiman Long @ 2019-05-20 20:59 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin
  Cc: linux-kernel, x86, Davidlohr Bueso, Linus Torvalds, Tim Chen,
	huang ying, Waiman Long

The current way of using various reader, writer and waiting biases
in the rwsem code is confusing and hard to understand. I have to
reread the rwsem count guide in the rwsem-xadd.c file from time to
time to remind myself how this whole thing works. It also makes the
rwsem code harder to optimize.

To make rwsem more sane, a new locking scheme similar to the one in
qrwlock is now being used.  The atomic long count has the following
bit definitions:

  Bit  0   - writer locked bit
  Bit  1   - waiters present bit
  Bits 2-7 - reserved for future extension
  Bits 8-X - reader count (24/56 bits)

The cmpxchg instruction is now used to acquire the write lock. The read
lock is still acquired with the xadd instruction, so there is no change
here. This scheme will allow up to 16M/64P active readers, which should
be more than enough. We can always use some more reserved bits if
necessary.

With that change, we can deterministically know if a rwsem has been
write-locked. Looking at the count alone, however, one cannot determine
for certain if a rwsem is owned by readers or not as the readers that
set the reader count bits may be in the process of backing out. So we
still need the reader-owned bit in the owner field to be sure.
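
In mask form, the layout above translates roughly to the following (a
sketch inferred from the description; see the RWSEM_* definitions in
kernel/locking/rwsem.h in this patch for the authoritative values):

#define RWSEM_WRITER_LOCKED	(1UL << 0)
#define RWSEM_FLAG_WAITERS	(1UL << 1)
#define RWSEM_READER_SHIFT	8
#define RWSEM_READER_BIAS	(1UL << RWSEM_READER_SHIFT)
#define RWSEM_READER_MASK	(~(RWSEM_READER_BIAS - 1))
#define RWSEM_WRITER_MASK	RWSEM_WRITER_LOCKED
#define RWSEM_LOCK_MASK		(RWSEM_WRITER_MASK | RWSEM_READER_MASK)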

With a locking microbenchmark running on 5.1 based kernel, the total
locking rates (in kops/s) of the benchmark on a 8-socket 120-core
IvyBridge-EX system before and after the patch were as follows:

                  Before Patch      After Patch
   # of Threads  wlock    rlock    wlock    rlock
   ------------  -----    -----    -----    -----
        1        30,659   31,341   31,055   31,283
        2         8,909   16,457    9,884   17,659
        4         9,028   15,823    8,933   20,233
        8         8,410   14,212    7,230   17,140
       16         8,217   25,240    7,479   24,607

The locking rates of the benchmark on a Power8 system were as follows:

                  Before Patch      After Patch
   # of Threads  wlock    rlock    wlock    rlock
   ------------  -----    -----    -----    -----
        1        12,963   13,647   13,275   13,601
        2         7,570   11,569    7,902   10,829
        4         5,232    5,516    5,466    5,435
        8         5,233    3,386    5,467    3,168

The locking rates of the benchmark on a 2-socket ARM64 system were
as follows:

                  Before Patch      After Patch
   # of Threads  wlock    rlock    wlock    rlock
   ------------  -----    -----    -----    -----
        1        21,495   21,046   21,524   21,074
        2         5,293   10,502    5,333   10,504
        4         5,325   11,463    5,358   11,631
        8         5,391   11,712    5,470   11,680

The performance is roughly the same before and after the patch. There
are run-to-run variations in performance; runs with higher variances
usually have higher throughput.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/rwsem-xadd.c | 147 ++++++++++++------------------------
 kernel/locking/rwsem.h      |  74 +++++++++---------
 2 files changed, 85 insertions(+), 136 deletions(-)

diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 3083fdf50447..7d537b50a849 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -9,6 +9,8 @@
  *
  * Optimistic spinning by Tim Chen <tim.c.chen@intel.com>
  * and Davidlohr Bueso <davidlohr@hp.com>. Based on mutexes.
+ *
+ * Rwsem count bit fields re-definition by Waiman Long <longman@redhat.com>.
  */
 #include <linux/rwsem.h>
 #include <linux/init.h>
@@ -22,52 +24,20 @@
 #include "rwsem.h"
 
 /*
- * Guide to the rw_semaphore's count field for common values.
- * (32-bit case illustrated, similar for 64-bit)
- *
- * 0x0000000X	(1) X readers active or attempting lock, no writer waiting
- *		    X = #active_readers + #readers attempting to lock
- *		    (X*ACTIVE_BIAS)
- *
- * 0x00000000	rwsem is unlocked, and no one is waiting for the lock or
- *		attempting to read lock or write lock.
- *
- * 0xffff000X	(1) X readers active or attempting lock, with waiters for lock
- *		    X = #active readers + # readers attempting lock
- *		    (X*ACTIVE_BIAS + WAITING_BIAS)
- *		(2) 1 writer attempting lock, no waiters for lock
- *		    X-1 = #active readers + #readers attempting lock
- *		    ((X-1)*ACTIVE_BIAS + ACTIVE_WRITE_BIAS)
- *		(3) 1 writer active, no waiters for lock
- *		    X-1 = #active readers + #readers attempting lock
- *		    ((X-1)*ACTIVE_BIAS + ACTIVE_WRITE_BIAS)
- *
- * 0xffff0001	(1) 1 reader active or attempting lock, waiters for lock
- *		    (WAITING_BIAS + ACTIVE_BIAS)
- *		(2) 1 writer active or attempting lock, no waiters for lock
- *		    (ACTIVE_WRITE_BIAS)
+ * Guide to the rw_semaphore's count field.
  *
- * 0xffff0000	(1) There are writers or readers queued but none active
- *		    or in the process of attempting lock.
- *		    (WAITING_BIAS)
- *		Note: writer can attempt to steal lock for this count by adding
- *		ACTIVE_WRITE_BIAS in cmpxchg and checking the old count
+ * When the RWSEM_WRITER_LOCKED bit in count is set, the lock is owned
+ * by a writer.
  *
- * 0xfffe0001	(1) 1 writer active, or attempting lock. Waiters on queue.
- *		    (ACTIVE_WRITE_BIAS + WAITING_BIAS)
- *
- * Note: Readers attempt to lock by adding ACTIVE_BIAS in down_read and checking
- *	 the count becomes more than 0 for successful lock acquisition,
- *	 i.e. the case where there are only readers or nobody has lock.
- *	 (1st and 2nd case above).
- *
- *	 Writers attempt to lock by adding ACTIVE_WRITE_BIAS in down_write and
- *	 checking the count becomes ACTIVE_WRITE_BIAS for successful lock
- *	 acquisition (i.e. nobody else has lock or attempts lock).  If
- *	 unsuccessful, in rwsem_down_write_failed, we'll check to see if there
- *	 are only waiters but none active (5th case above), and attempt to
- *	 steal the lock.
+ * The lock is owned by readers when
+ * (1) the RWSEM_WRITER_LOCKED isn't set in count,
+ * (2) some of the reader bits are set in count, and
+ * (3) the owner field has RWSEM_READ_OWNED bit set.
  *
+ * Having some reader bits set is not enough to guarantee a readers owned
+ * lock as the readers may be in the process of backing out from the count
+ * and a writer has just released the lock. So another writer may steal
+ * the lock immediately after that.
  */
 
 /*
@@ -113,9 +83,8 @@ enum rwsem_wake_type {
 
 /*
  * handle the lock release when processes blocked on it that can now run
- * - if we come here from up_xxxx(), then:
- *   - the 'active part' of count (&0x0000ffff) reached 0 (but may have changed)
- *   - the 'waiting part' of count (&0xffff0000) is -ve (and will still be so)
+ * - if we come here from up_xxxx(), then the RWSEM_FLAG_WAITERS bit must
+ *   have been set.
  * - there must be someone on the queue
  * - the wait_lock must be held by the caller
  * - tasks are marked for wakeup, the caller must later invoke wake_up_q()
@@ -160,22 +129,11 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
 	 * so we can bail out early if a writer stole the lock.
 	 */
 	if (wake_type != RWSEM_WAKE_READ_OWNED) {
-		adjustment = RWSEM_ACTIVE_READ_BIAS;
- try_reader_grant:
+		adjustment = RWSEM_READER_BIAS;
 		oldcount = atomic_long_fetch_add(adjustment, &sem->count);
-		if (unlikely(oldcount < RWSEM_WAITING_BIAS)) {
-			/*
-			 * If the count is still less than RWSEM_WAITING_BIAS
-			 * after removing the adjustment, it is assumed that
-			 * a writer has stolen the lock. We have to undo our
-			 * reader grant.
-			 */
-			if (atomic_long_add_return(-adjustment, &sem->count) <
-			    RWSEM_WAITING_BIAS)
-				return;
-
-			/* Last active locker left. Retry waking readers. */
-			goto try_reader_grant;
+		if (unlikely(oldcount & RWSEM_WRITER_MASK)) {
+			atomic_long_sub(adjustment, &sem->count);
+			return;
 		}
 		/*
 		 * Set it to reader-owned to give spinners an early
@@ -209,11 +167,11 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
 	}
 	list_cut_before(&wlist, &sem->wait_list, &waiter->list);
 
-	adjustment = woken * RWSEM_ACTIVE_READ_BIAS - adjustment;
+	adjustment = woken * RWSEM_READER_BIAS - adjustment;
 	lockevent_cond_inc(rwsem_wake_reader, woken);
 	if (list_empty(&sem->wait_list)) {
 		/* hit end of list above */
-		adjustment -= RWSEM_WAITING_BIAS;
+		adjustment -= RWSEM_FLAG_WAITERS;
 	}
 
 	if (adjustment)
@@ -248,22 +206,15 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
  */
 static inline bool rwsem_try_write_lock(long count, struct rw_semaphore *sem)
 {
-	/*
-	 * Avoid trying to acquire write lock if count isn't RWSEM_WAITING_BIAS.
-	 */
-	if (count != RWSEM_WAITING_BIAS)
+	long new;
+
+	if (count & RWSEM_LOCK_MASK)
 		return false;
 
-	/*
-	 * Acquire the lock by trying to set it to ACTIVE_WRITE_BIAS. If there
-	 * are other tasks on the wait list, we need to add on WAITING_BIAS.
-	 */
-	count = list_is_singular(&sem->wait_list) ?
-			RWSEM_ACTIVE_WRITE_BIAS :
-			RWSEM_ACTIVE_WRITE_BIAS + RWSEM_WAITING_BIAS;
+	new = count + RWSEM_WRITER_LOCKED -
+	     (list_is_singular(&sem->wait_list) ? RWSEM_FLAG_WAITERS : 0);
 
-	if (atomic_long_cmpxchg_acquire(&sem->count, RWSEM_WAITING_BIAS, count)
-							== RWSEM_WAITING_BIAS) {
+	if (atomic_long_try_cmpxchg_acquire(&sem->count, &count, new)) {
 		rwsem_set_owner(sem);
 		return true;
 	}
@@ -279,9 +230,9 @@ static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem)
 {
 	long count = atomic_long_read(&sem->count);
 
-	while (!count || count == RWSEM_WAITING_BIAS) {
+	while (!(count & RWSEM_LOCK_MASK)) {
 		if (atomic_long_try_cmpxchg_acquire(&sem->count, &count,
-					count + RWSEM_ACTIVE_WRITE_BIAS)) {
+					count + RWSEM_WRITER_LOCKED)) {
 			rwsem_set_owner(sem);
 			lockevent_inc(rwsem_opt_wlock);
 			return true;
@@ -424,7 +375,7 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 static inline struct rw_semaphore __sched *
 __rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
 {
-	long count, adjustment = -RWSEM_ACTIVE_READ_BIAS;
+	long count, adjustment = -RWSEM_READER_BIAS;
 	struct rwsem_waiter waiter;
 	DEFINE_WAKE_Q(wake_q);
 
@@ -436,16 +387,16 @@ __rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
 		/*
 		 * In case the wait queue is empty and the lock isn't owned
 		 * by a writer, this reader can exit the slowpath and return
-		 * immediately as its RWSEM_ACTIVE_READ_BIAS has already
-		 * been set in the count.
+		 * immediately as its RWSEM_READER_BIAS has already been
+		 * set in the count.
 		 */
-		if (atomic_long_read(&sem->count) >= 0) {
+		if (!(atomic_long_read(&sem->count) & RWSEM_WRITER_MASK)) {
 			raw_spin_unlock_irq(&sem->wait_lock);
 			rwsem_set_reader_owned(sem);
 			lockevent_inc(rwsem_rlock_fast);
 			return sem;
 		}
-		adjustment += RWSEM_WAITING_BIAS;
+		adjustment += RWSEM_FLAG_WAITERS;
 	}
 	list_add_tail(&waiter.list, &sem->wait_list);
 
@@ -458,9 +409,8 @@ __rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
 	 * If there are no writers and we are first in the queue,
 	 * wake our own waiter to join the existing active readers !
 	 */
-	if (count == RWSEM_WAITING_BIAS ||
-	    (count > RWSEM_WAITING_BIAS &&
-	     adjustment != -RWSEM_ACTIVE_READ_BIAS))
+	if (!(count & RWSEM_LOCK_MASK) ||
+	   (!(count & RWSEM_WRITER_MASK) && (adjustment & RWSEM_FLAG_WAITERS)))
 		__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
 
 	raw_spin_unlock_irq(&sem->wait_lock);
@@ -488,7 +438,7 @@ __rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
 out_nolock:
 	list_del(&waiter.list);
 	if (list_empty(&sem->wait_list))
-		atomic_long_add(-RWSEM_WAITING_BIAS, &sem->count);
+		atomic_long_andnot(RWSEM_FLAG_WAITERS, &sem->count);
 	raw_spin_unlock_irq(&sem->wait_lock);
 	__set_current_state(TASK_RUNNING);
 	lockevent_inc(rwsem_rlock_fail);
@@ -521,9 +471,6 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
 	struct rw_semaphore *ret = sem;
 	DEFINE_WAKE_Q(wake_q);
 
-	/* undo write bias from down_write operation, stop active locking */
-	count = atomic_long_sub_return(RWSEM_ACTIVE_WRITE_BIAS, &sem->count);
-
 	/* do optimistic spinning and steal lock if possible */
 	if (rwsem_optimistic_spin(sem))
 		return sem;
@@ -543,16 +490,18 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
 
 	list_add_tail(&waiter.list, &sem->wait_list);
 
-	/* we're now waiting on the lock, but no longer actively locking */
+	/* we're now waiting on the lock */
 	if (waiting) {
 		count = atomic_long_read(&sem->count);
 
 		/*
 		 * If there were already threads queued before us and there are
-		 * no active writers, the lock must be read owned; so we try to
-		 * wake any read locks that were queued ahead of us.
+		 * no active writers and some readers, the lock must be read
+		 * owned; so we try to wake any read locks that were queued
+		 * ahead of us.
 		 */
-		if (count > RWSEM_WAITING_BIAS) {
+		if (!(count & RWSEM_WRITER_MASK) &&
+		     (count & RWSEM_READER_MASK)) {
 			__rwsem_mark_wake(sem, RWSEM_WAKE_READERS, &wake_q);
 			/*
 			 * The wakeup is normally called _after_ the wait_lock
@@ -569,8 +518,9 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
 			wake_q_init(&wake_q);
 		}
 
-	} else
-		count = atomic_long_add_return(RWSEM_WAITING_BIAS, &sem->count);
+	} else {
+		count = atomic_long_add_return(RWSEM_FLAG_WAITERS, &sem->count);
+	}
 
 	/* wait until we successfully acquire the lock */
 	set_current_state(state);
@@ -587,7 +537,8 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
 			schedule();
 			lockevent_inc(rwsem_sleep_writer);
 			set_current_state(state);
-		} while ((count = atomic_long_read(&sem->count)) & RWSEM_ACTIVE_MASK);
+			count = atomic_long_read(&sem->count);
+		} while (count & RWSEM_LOCK_MASK);
 
 		raw_spin_lock_irq(&sem->wait_lock);
 	}
@@ -603,7 +554,7 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
 	raw_spin_lock_irq(&sem->wait_lock);
 	list_del(&waiter.list);
 	if (list_empty(&sem->wait_list))
-		atomic_long_add(-RWSEM_WAITING_BIAS, &sem->count);
+		atomic_long_andnot(RWSEM_FLAG_WAITERS, &sem->count);
 	else
 		__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
 	raw_spin_unlock_irq(&sem->wait_lock);
diff --git a/kernel/locking/rwsem.h b/kernel/locking/rwsem.h
index eb9c8534299b..499a9b2bda82 100644
--- a/kernel/locking/rwsem.h
+++ b/kernel/locking/rwsem.h
@@ -42,24 +42,24 @@
 #endif
 
 /*
- * R/W semaphores originally for PPC using the stuff in lib/rwsem.c.
- * Adapted largely from include/asm-i386/rwsem.h
- * by Paul Mackerras <paulus@samba.org>.
- */
-
-/*
- * the semaphore definition
+ * The definition of the atomic counter in the semaphore:
+ *
+ * Bit  0   - writer locked bit
+ * Bit  1   - waiters present bit
+ * Bits 2-7 - reserved
+ * Bits 8-X - reader count (24 bits on 32-bit kernels, 56 bits on 64-bit)
+ *
+ * atomic_long_fetch_add() is used to obtain reader lock, whereas
+ * atomic_long_cmpxchg() will be used to obtain writer lock.
  */
-#ifdef CONFIG_64BIT
-# define RWSEM_ACTIVE_MASK		0xffffffffL
-#else
-# define RWSEM_ACTIVE_MASK		0x0000ffffL
-#endif
-
-#define RWSEM_ACTIVE_BIAS		0x00000001L
-#define RWSEM_WAITING_BIAS		(-RWSEM_ACTIVE_MASK-1)
-#define RWSEM_ACTIVE_READ_BIAS		RWSEM_ACTIVE_BIAS
-#define RWSEM_ACTIVE_WRITE_BIAS		(RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
+#define RWSEM_WRITER_LOCKED	(1UL << 0)
+#define RWSEM_FLAG_WAITERS	(1UL << 1)
+#define RWSEM_READER_SHIFT	8
+#define RWSEM_READER_BIAS	(1UL << RWSEM_READER_SHIFT)
+#define RWSEM_READER_MASK	(~(RWSEM_READER_BIAS - 1))
+#define RWSEM_WRITER_MASK	RWSEM_WRITER_LOCKED
+#define RWSEM_LOCK_MASK		(RWSEM_WRITER_MASK|RWSEM_READER_MASK)
+#define RWSEM_READ_FAILED_MASK	(RWSEM_WRITER_MASK|RWSEM_FLAG_WAITERS)
 
 /*
  * All writes to owner are protected by WRITE_ONCE() to make sure that
@@ -151,7 +151,8 @@ extern struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem);
  */
 static inline void __down_read(struct rw_semaphore *sem)
 {
-	if (unlikely(atomic_long_inc_return_acquire(&sem->count) <= 0)) {
+	if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
+			&sem->count) & RWSEM_READ_FAILED_MASK)) {
 		rwsem_down_read_failed(sem);
 		DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
 					RWSEM_READER_OWNED), sem);
@@ -162,7 +163,8 @@ static inline void __down_read(struct rw_semaphore *sem)
 
 static inline int __down_read_killable(struct rw_semaphore *sem)
 {
-	if (unlikely(atomic_long_inc_return_acquire(&sem->count) <= 0)) {
+	if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
+			&sem->count) & RWSEM_READ_FAILED_MASK)) {
 		if (IS_ERR(rwsem_down_read_failed_killable(sem)))
 			return -EINTR;
 		DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
@@ -183,11 +185,11 @@ static inline int __down_read_trylock(struct rw_semaphore *sem)
 	lockevent_inc(rwsem_rtrylock);
 	do {
 		if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
-					tmp + RWSEM_ACTIVE_READ_BIAS)) {
+					tmp + RWSEM_READER_BIAS)) {
 			rwsem_set_reader_owned(sem);
 			return 1;
 		}
-	} while (tmp >= 0);
+	} while (!(tmp & RWSEM_READ_FAILED_MASK));
 	return 0;
 }
 
@@ -196,22 +198,16 @@ static inline int __down_read_trylock(struct rw_semaphore *sem)
  */
 static inline void __down_write(struct rw_semaphore *sem)
 {
-	long tmp;
-
-	tmp = atomic_long_add_return_acquire(RWSEM_ACTIVE_WRITE_BIAS,
-					     &sem->count);
-	if (unlikely(tmp != RWSEM_ACTIVE_WRITE_BIAS))
+	if (unlikely(atomic_long_cmpxchg_acquire(&sem->count, 0,
+						 RWSEM_WRITER_LOCKED)))
 		rwsem_down_write_failed(sem);
 	rwsem_set_owner(sem);
 }
 
 static inline int __down_write_killable(struct rw_semaphore *sem)
 {
-	long tmp;
-
-	tmp = atomic_long_add_return_acquire(RWSEM_ACTIVE_WRITE_BIAS,
-					     &sem->count);
-	if (unlikely(tmp != RWSEM_ACTIVE_WRITE_BIAS))
+	if (unlikely(atomic_long_cmpxchg_acquire(&sem->count, 0,
+						 RWSEM_WRITER_LOCKED)))
 		if (IS_ERR(rwsem_down_write_failed_killable(sem)))
 			return -EINTR;
 	rwsem_set_owner(sem);
@@ -224,7 +220,7 @@ static inline int __down_write_trylock(struct rw_semaphore *sem)
 
 	lockevent_inc(rwsem_wtrylock);
 	tmp = atomic_long_cmpxchg_acquire(&sem->count, RWSEM_UNLOCKED_VALUE,
-		      RWSEM_ACTIVE_WRITE_BIAS);
+					  RWSEM_WRITER_LOCKED);
 	if (tmp == RWSEM_UNLOCKED_VALUE) {
 		rwsem_set_owner(sem);
 		return true;
@@ -242,8 +238,9 @@ static inline void __up_read(struct rw_semaphore *sem)
 	DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner & RWSEM_READER_OWNED),
 				sem);
 	rwsem_clear_reader_owned(sem);
-	tmp = atomic_long_dec_return_release(&sem->count);
-	if (unlikely(tmp < -1 && (tmp & RWSEM_ACTIVE_MASK) == 0))
+	tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
+	if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS))
+			== RWSEM_FLAG_WAITERS))
 		rwsem_wake(sem);
 }
 
@@ -254,8 +251,8 @@ static inline void __up_write(struct rw_semaphore *sem)
 {
 	DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
 	rwsem_clear_owner(sem);
-	if (unlikely(atomic_long_sub_return_release(RWSEM_ACTIVE_WRITE_BIAS,
-						    &sem->count) < 0))
+	if (unlikely(atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED,
+			&sem->count) & RWSEM_FLAG_WAITERS))
 		rwsem_wake(sem);
 }
 
@@ -274,8 +271,9 @@ static inline void __downgrade_write(struct rw_semaphore *sem)
 	 * write side. As such, rely on RELEASE semantics.
 	 */
 	DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
-	tmp = atomic_long_add_return_release(-RWSEM_WAITING_BIAS, &sem->count);
+	tmp = atomic_long_fetch_add_release(
+		-RWSEM_WRITER_LOCKED+RWSEM_READER_BIAS, &sem->count);
 	rwsem_set_reader_owned(sem);
-	if (tmp < 0)
+	if (tmp & RWSEM_FLAG_WAITERS)
 		rwsem_downgrade_wake(sem);
 }
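
A quick worked example of the new count encoding (for illustration only,
not part of the patch): with RWSEM_READER_BIAS = 0x100, RWSEM_FLAG_WAITERS
= 0x2 and RWSEM_WRITER_LOCKED = 0x1 as defined above, a count value of

  0x302 = 3 * RWSEM_READER_BIAS + RWSEM_FLAG_WAITERS

describes a rwsem held by three readers with at least one waiter queued
and no writer (the RWSEM_WRITER_LOCKED bit is clear).
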
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v8 04/19] locking/rwsem: Merge rwsem.h and rwsem-xadd.c into rwsem.c
  2019-05-20 20:58 [PATCH v8 00/19] locking/rwsem: Rwsem rearchitecture part 2 Waiman Long
                   ` (2 preceding siblings ...)
  2019-05-20 20:59 ` [PATCH v8 03/19] locking/rwsem: Implement a new locking scheme Waiman Long
@ 2019-05-20 20:59 ` Waiman Long
  2019-06-17 14:25   ` [tip:locking/core] " tip-bot for Waiman Long
  2019-05-20 20:59 ` [PATCH v8 05/19] locking/rwsem: Code cleanup after files merging Waiman Long
                   ` (14 subsequent siblings)
  18 siblings, 1 reply; 83+ messages in thread
From: Waiman Long @ 2019-05-20 20:59 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin
  Cc: linux-kernel, x86, Davidlohr Bueso, Linus Torvalds, Tim Chen,
	huang ying, Waiman Long

Now we only have one implementation of rwsem. Even though we still use
xadd to handle reader locking, we now use cmpxchg for the writer side,
so the filename rwsem-xadd.c is no longer strictly accurate. Also, no
one outside of the rwsem code needs to know the internal implementation
beyond the prototypes of the two internal functions that are called
directly from percpu-rwsem.c.
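
As a rough sketch, the two fast paths after this series look like the
snippet below (simplified from __down_read()/__down_write(); the owner
tracking, debug checks and unlikely() annotations are left out, and the
*_fastpath names are used here only for illustration):

  /* reader fast path: xadd (fetch_add) on the reader-count field */
  static inline void down_read_fastpath(struct rw_semaphore *sem)
  {
  	if (atomic_long_fetch_add_acquire(RWSEM_READER_BIAS, &sem->count) &
  	    RWSEM_READ_FAILED_MASK)
  		rwsem_down_read_failed(sem);	/* slowpath */
  }

  /* writer fast path: a single cmpxchg of the whole count word */
  static inline void down_write_fastpath(struct rw_semaphore *sem)
  {
  	if (atomic_long_cmpxchg_acquire(&sem->count, 0, RWSEM_WRITER_LOCKED))
  		rwsem_down_write_failed(sem);	/* slowpath */
  }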

So the rwsem-xadd.c and rwsem.h files are now merged into rwsem.c in
the following order:

  <upper part of rwsem.h>
  <rwsem-xadd.c>
  <lower part of rwsem.h>
  <rwsem.c>

The rwsem.h file now contains only 2 function declarations for
__up_read() and __down_read().

This is a code relocation patch with no code change at all except
making __up_read() and __down_read() non-static functions so they
can be used by percpu-rwsem.c.
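
For reference, the internal kernel/locking/rwsem.h that remains after
this patch boils down to the sketch below (assembled from the rwsem.h
hunk later in this patch; the trailing #endif is implied by the new
include guard):

  /* SPDX-License-Identifier: GPL-2.0 */
  #ifndef __INTERNAL_RWSEM_H
  #define __INTERNAL_RWSEM_H
  #include <linux/rwsem.h>

  extern void __down_read(struct rw_semaphore *sem);
  extern void __up_read(struct rw_semaphore *sem);

  #endif /* __INTERNAL_RWSEM_H */

percpu-rwsem.c can then call __down_read()/__up_read() directly on its
embedded rw_semaphore instead of going through the top-level
down_read()/up_read() APIs.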

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/Makefile     |   2 +-
 kernel/locking/rwsem-xadd.c | 624 -------------------------
 kernel/locking/rwsem.c      | 884 ++++++++++++++++++++++++++++++++++++
 kernel/locking/rwsem.h      | 281 +-----------
 4 files changed, 891 insertions(+), 900 deletions(-)
 delete mode 100644 kernel/locking/rwsem-xadd.c

diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile
index 6fe2f333aecb..45452facff3b 100644
--- a/kernel/locking/Makefile
+++ b/kernel/locking/Makefile
@@ -3,7 +3,7 @@
 # and is generally not a function of system call inputs.
 KCOV_INSTRUMENT		:= n
 
-obj-y += mutex.o semaphore.o rwsem.o percpu-rwsem.o rwsem-xadd.o
+obj-y += mutex.o semaphore.o rwsem.o percpu-rwsem.o
 
 ifdef CONFIG_FUNCTION_TRACER
 CFLAGS_REMOVE_lockdep.o = $(CC_FLAGS_FTRACE)
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
deleted file mode 100644
index 7d537b50a849..000000000000
--- a/kernel/locking/rwsem-xadd.c
+++ /dev/null
@@ -1,624 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0
-/* rwsem.c: R/W semaphores: contention handling functions
- *
- * Written by David Howells (dhowells@redhat.com).
- * Derived from arch/i386/kernel/semaphore.c
- *
- * Writer lock-stealing by Alex Shi <alex.shi@intel.com>
- * and Michel Lespinasse <walken@google.com>
- *
- * Optimistic spinning by Tim Chen <tim.c.chen@intel.com>
- * and Davidlohr Bueso <davidlohr@hp.com>. Based on mutexes.
- *
- * Rwsem count bit fields re-definition by Waiman Long <longman@redhat.com>.
- */
-#include <linux/rwsem.h>
-#include <linux/init.h>
-#include <linux/export.h>
-#include <linux/sched/signal.h>
-#include <linux/sched/rt.h>
-#include <linux/sched/wake_q.h>
-#include <linux/sched/debug.h>
-#include <linux/osq_lock.h>
-
-#include "rwsem.h"
-
-/*
- * Guide to the rw_semaphore's count field.
- *
- * When the RWSEM_WRITER_LOCKED bit in count is set, the lock is owned
- * by a writer.
- *
- * The lock is owned by readers when
- * (1) the RWSEM_WRITER_LOCKED isn't set in count,
- * (2) some of the reader bits are set in count, and
- * (3) the owner field has RWSEM_READ_OWNED bit set.
- *
- * Having some reader bits set is not enough to guarantee a readers owned
- * lock as the readers may be in the process of backing out from the count
- * and a writer has just released the lock. So another writer may steal
- * the lock immediately after that.
- */
-
-/*
- * Initialize an rwsem:
- */
-void __init_rwsem(struct rw_semaphore *sem, const char *name,
-		  struct lock_class_key *key)
-{
-#ifdef CONFIG_DEBUG_LOCK_ALLOC
-	/*
-	 * Make sure we are not reinitializing a held semaphore:
-	 */
-	debug_check_no_locks_freed((void *)sem, sizeof(*sem));
-	lockdep_init_map(&sem->dep_map, name, key, 0);
-#endif
-	atomic_long_set(&sem->count, RWSEM_UNLOCKED_VALUE);
-	raw_spin_lock_init(&sem->wait_lock);
-	INIT_LIST_HEAD(&sem->wait_list);
-	sem->owner = NULL;
-#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
-	osq_lock_init(&sem->osq);
-#endif
-}
-
-EXPORT_SYMBOL(__init_rwsem);
-
-enum rwsem_waiter_type {
-	RWSEM_WAITING_FOR_WRITE,
-	RWSEM_WAITING_FOR_READ
-};
-
-struct rwsem_waiter {
-	struct list_head list;
-	struct task_struct *task;
-	enum rwsem_waiter_type type;
-};
-
-enum rwsem_wake_type {
-	RWSEM_WAKE_ANY,		/* Wake whatever's at head of wait list */
-	RWSEM_WAKE_READERS,	/* Wake readers only */
-	RWSEM_WAKE_READ_OWNED	/* Waker thread holds the read lock */
-};
-
-/*
- * handle the lock release when processes blocked on it can now run
- * - if we come here from up_xxxx(), then the RWSEM_FLAG_WAITERS bit must
- *   have been set.
- * - there must be someone on the queue
- * - the wait_lock must be held by the caller
- * - tasks are marked for wakeup, the caller must later invoke wake_up_q()
- *   to actually wakeup the blocked task(s) and drop the reference count,
- *   preferably when the wait_lock is released
- * - woken process blocks are discarded from the list after having task zeroed
- * - writers are only marked woken if downgrading is false
- */
-static void __rwsem_mark_wake(struct rw_semaphore *sem,
-			      enum rwsem_wake_type wake_type,
-			      struct wake_q_head *wake_q)
-{
-	struct rwsem_waiter *waiter, *tmp;
-	long oldcount, woken = 0, adjustment = 0;
-	struct list_head wlist;
-
-	/*
-	 * Take a peek at the queue head waiter such that we can determine
-	 * the wakeup(s) to perform.
-	 */
-	waiter = list_first_entry(&sem->wait_list, struct rwsem_waiter, list);
-
-	if (waiter->type == RWSEM_WAITING_FOR_WRITE) {
-		if (wake_type == RWSEM_WAKE_ANY) {
-			/*
-			 * Mark writer at the front of the queue for wakeup.
-			 * Until the task is actually awoken later by
-			 * the caller, other writers are able to steal it.
-			 * Readers, on the other hand, will block as they
-			 * will notice the queued writer.
-			 */
-			wake_q_add(wake_q, waiter->task);
-			lockevent_inc(rwsem_wake_writer);
-		}
-
-		return;
-	}
-
-	/*
-	 * Writers might steal the lock before we grant it to the next reader.
-	 * We prefer to do the first reader grant before counting readers
-	 * so we can bail out early if a writer stole the lock.
-	 */
-	if (wake_type != RWSEM_WAKE_READ_OWNED) {
-		adjustment = RWSEM_READER_BIAS;
-		oldcount = atomic_long_fetch_add(adjustment, &sem->count);
-		if (unlikely(oldcount & RWSEM_WRITER_MASK)) {
-			atomic_long_sub(adjustment, &sem->count);
-			return;
-		}
-		/*
-		 * Set it to reader-owned to give spinners an early
-		 * indication that readers now have the lock.
-		 */
-		__rwsem_set_reader_owned(sem, waiter->task);
-	}
-
-	/*
-	 * Grant an infinite number of read locks to the readers at the front
-	 * of the queue. We know that woken will be at least 1 as we accounted
-	 * for above. Note we increment the 'active part' of the count by the
-	 * number of readers before waking any processes up.
-	 *
-	 * We have to do wakeup in 2 passes to prevent the possibility that
-	 * the reader count may be decremented before it is incremented. It
-	 * is because the to-be-woken waiter may not have slept yet. So it
-	 * may see waiter->task got cleared, finish its critical section and
-	 * do an unlock before the reader count increment.
-	 *
-	 * 1) Collect the read-waiters in a separate list, count them and
-	 *    fully increment the reader count in rwsem.
-	 * 2) For each waiter in the new list, clear waiter->task and
-	 *    put them into wake_q to be woken up later.
-	 */
-	list_for_each_entry(waiter, &sem->wait_list, list) {
-		if (waiter->type == RWSEM_WAITING_FOR_WRITE)
-			break;
-
-		woken++;
-	}
-	list_cut_before(&wlist, &sem->wait_list, &waiter->list);
-
-	adjustment = woken * RWSEM_READER_BIAS - adjustment;
-	lockevent_cond_inc(rwsem_wake_reader, woken);
-	if (list_empty(&sem->wait_list)) {
-		/* hit end of list above */
-		adjustment -= RWSEM_FLAG_WAITERS;
-	}
-
-	if (adjustment)
-		atomic_long_add(adjustment, &sem->count);
-
-	/* 2nd pass */
-	list_for_each_entry_safe(waiter, tmp, &wlist, list) {
-		struct task_struct *tsk;
-
-		tsk = waiter->task;
-		get_task_struct(tsk);
-
-		/*
-		 * Ensure calling get_task_struct() before setting the reader
-		 * waiter to nil such that rwsem_down_read_failed() cannot
-		 * race with do_exit() by always holding a reference count
-		 * to the task to wakeup.
-		 */
-		smp_store_release(&waiter->task, NULL);
-		/*
-		 * Ensure issuing the wakeup (either by us or someone else)
-		 * after setting the reader waiter to nil.
-		 */
-		wake_q_add_safe(wake_q, tsk);
-	}
-}
-
-/*
- * This function must be called with the sem->wait_lock held to prevent
- * race conditions between checking the rwsem wait list and setting the
- * sem->count accordingly.
- */
-static inline bool rwsem_try_write_lock(long count, struct rw_semaphore *sem)
-{
-	long new;
-
-	if (count & RWSEM_LOCK_MASK)
-		return false;
-
-	new = count + RWSEM_WRITER_LOCKED -
-	     (list_is_singular(&sem->wait_list) ? RWSEM_FLAG_WAITERS : 0);
-
-	if (atomic_long_try_cmpxchg_acquire(&sem->count, &count, new)) {
-		rwsem_set_owner(sem);
-		return true;
-	}
-
-	return false;
-}
-
-#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
-/*
- * Try to acquire write lock before the writer has been put on wait queue.
- */
-static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem)
-{
-	long count = atomic_long_read(&sem->count);
-
-	while (!(count & RWSEM_LOCK_MASK)) {
-		if (atomic_long_try_cmpxchg_acquire(&sem->count, &count,
-					count + RWSEM_WRITER_LOCKED)) {
-			rwsem_set_owner(sem);
-			lockevent_inc(rwsem_opt_wlock);
-			return true;
-		}
-	}
-	return false;
-}
-
-static inline bool owner_on_cpu(struct task_struct *owner)
-{
-	/*
-	 * Due to lock holder preemption, we skip spinning if the task is
-	 * not running on a CPU or its CPU is preempted.
-	 */
-	return owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
-}
-
-static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
-{
-	struct task_struct *owner;
-	bool ret = true;
-
-	BUILD_BUG_ON(!rwsem_has_anonymous_owner(RWSEM_OWNER_UNKNOWN));
-
-	if (need_resched())
-		return false;
-
-	rcu_read_lock();
-	owner = READ_ONCE(sem->owner);
-	if (owner) {
-		ret = is_rwsem_owner_spinnable(owner) &&
-		      owner_on_cpu(owner);
-	}
-	rcu_read_unlock();
-	return ret;
-}
-
-/*
- * Return true only if we can still spin on the owner field of the rwsem.
- */
-static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem)
-{
-	struct task_struct *owner = READ_ONCE(sem->owner);
-
-	if (!is_rwsem_owner_spinnable(owner))
-		return false;
-
-	rcu_read_lock();
-	while (owner && (READ_ONCE(sem->owner) == owner)) {
-		/*
-		 * Ensure we emit the owner->on_cpu, dereference _after_
-		 * checking sem->owner still matches owner, if that fails,
-		 * owner might point to free()d memory, if it still matches,
-		 * the rcu_read_lock() ensures the memory stays valid.
-		 */
-		barrier();
-
-		/*
-		 * abort spinning when need_resched or owner is not running or
-		 * owner's cpu is preempted.
-		 */
-		if (need_resched() || !owner_on_cpu(owner)) {
-			rcu_read_unlock();
-			return false;
-		}
-
-		cpu_relax();
-	}
-	rcu_read_unlock();
-
-	/*
-	 * If there is a new owner or the owner is not set, we continue
-	 * spinning.
-	 */
-	return is_rwsem_owner_spinnable(READ_ONCE(sem->owner));
-}
-
-static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
-{
-	bool taken = false;
-
-	preempt_disable();
-
-	/* sem->wait_lock should not be held when doing optimistic spinning */
-	if (!rwsem_can_spin_on_owner(sem))
-		goto done;
-
-	if (!osq_lock(&sem->osq))
-		goto done;
-
-	/*
-	 * Optimistically spin on the owner field and attempt to acquire the
-	 * lock whenever the owner changes. Spinning will be stopped when:
-	 *  1) the owning writer isn't running; or
-	 *  2) readers own the lock as we can't determine if they are
-	 *     actively running or not.
-	 */
-	while (rwsem_spin_on_owner(sem)) {
-		/*
-		 * Try to acquire the lock
-		 */
-		if (rwsem_try_write_lock_unqueued(sem)) {
-			taken = true;
-			break;
-		}
-
-		/*
-		 * When there's no owner, we might have preempted between the
-		 * owner acquiring the lock and setting the owner field. If
-		 * we're an RT task that will live-lock because we won't let
-		 * the owner complete.
-		 */
-		if (!sem->owner && (need_resched() || rt_task(current)))
-			break;
-
-		/*
-		 * The cpu_relax() call is a compiler barrier which forces
-		 * everything in this loop to be re-loaded. We don't need
-		 * memory barriers as we'll eventually observe the right
-		 * values at the cost of a few extra spins.
-		 */
-		cpu_relax();
-	}
-	osq_unlock(&sem->osq);
-done:
-	preempt_enable();
-	lockevent_cond_inc(rwsem_opt_fail, !taken);
-	return taken;
-}
-#else
-static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
-{
-	return false;
-}
-#endif
-
-/*
- * Wait for the read lock to be granted
- */
-static inline struct rw_semaphore __sched *
-__rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
-{
-	long count, adjustment = -RWSEM_READER_BIAS;
-	struct rwsem_waiter waiter;
-	DEFINE_WAKE_Q(wake_q);
-
-	waiter.task = current;
-	waiter.type = RWSEM_WAITING_FOR_READ;
-
-	raw_spin_lock_irq(&sem->wait_lock);
-	if (list_empty(&sem->wait_list)) {
-		/*
-		 * In case the wait queue is empty and the lock isn't owned
-		 * by a writer, this reader can exit the slowpath and return
-		 * immediately as its RWSEM_READER_BIAS has already been
-		 * set in the count.
-		 */
-		if (!(atomic_long_read(&sem->count) & RWSEM_WRITER_MASK)) {
-			raw_spin_unlock_irq(&sem->wait_lock);
-			rwsem_set_reader_owned(sem);
-			lockevent_inc(rwsem_rlock_fast);
-			return sem;
-		}
-		adjustment += RWSEM_FLAG_WAITERS;
-	}
-	list_add_tail(&waiter.list, &sem->wait_list);
-
-	/* we're now waiting on the lock, but no longer actively locking */
-	count = atomic_long_add_return(adjustment, &sem->count);
-
-	/*
-	 * If there are no active locks, wake the front queued process(es).
-	 *
-	 * If there are no writers and we are first in the queue,
-	 * wake our own waiter to join the existing active readers !
-	 */
-	if (!(count & RWSEM_LOCK_MASK) ||
-	   (!(count & RWSEM_WRITER_MASK) && (adjustment & RWSEM_FLAG_WAITERS)))
-		__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
-
-	raw_spin_unlock_irq(&sem->wait_lock);
-	wake_up_q(&wake_q);
-
-	/* wait to be given the lock */
-	while (true) {
-		set_current_state(state);
-		if (!waiter.task)
-			break;
-		if (signal_pending_state(state, current)) {
-			raw_spin_lock_irq(&sem->wait_lock);
-			if (waiter.task)
-				goto out_nolock;
-			raw_spin_unlock_irq(&sem->wait_lock);
-			break;
-		}
-		schedule();
-		lockevent_inc(rwsem_sleep_reader);
-	}
-
-	__set_current_state(TASK_RUNNING);
-	lockevent_inc(rwsem_rlock);
-	return sem;
-out_nolock:
-	list_del(&waiter.list);
-	if (list_empty(&sem->wait_list))
-		atomic_long_andnot(RWSEM_FLAG_WAITERS, &sem->count);
-	raw_spin_unlock_irq(&sem->wait_lock);
-	__set_current_state(TASK_RUNNING);
-	lockevent_inc(rwsem_rlock_fail);
-	return ERR_PTR(-EINTR);
-}
-
-__visible struct rw_semaphore * __sched
-rwsem_down_read_failed(struct rw_semaphore *sem)
-{
-	return __rwsem_down_read_failed_common(sem, TASK_UNINTERRUPTIBLE);
-}
-EXPORT_SYMBOL(rwsem_down_read_failed);
-
-__visible struct rw_semaphore * __sched
-rwsem_down_read_failed_killable(struct rw_semaphore *sem)
-{
-	return __rwsem_down_read_failed_common(sem, TASK_KILLABLE);
-}
-EXPORT_SYMBOL(rwsem_down_read_failed_killable);
-
-/*
- * Wait until we successfully acquire the write lock
- */
-static inline struct rw_semaphore *
-__rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
-{
-	long count;
-	bool waiting = true; /* any queued threads before us */
-	struct rwsem_waiter waiter;
-	struct rw_semaphore *ret = sem;
-	DEFINE_WAKE_Q(wake_q);
-
-	/* do optimistic spinning and steal lock if possible */
-	if (rwsem_optimistic_spin(sem))
-		return sem;
-
-	/*
-	 * Optimistic spinning failed, proceed to the slowpath
-	 * and block until we can acquire the sem.
-	 */
-	waiter.task = current;
-	waiter.type = RWSEM_WAITING_FOR_WRITE;
-
-	raw_spin_lock_irq(&sem->wait_lock);
-
-	/* account for this before adding a new element to the list */
-	if (list_empty(&sem->wait_list))
-		waiting = false;
-
-	list_add_tail(&waiter.list, &sem->wait_list);
-
-	/* we're now waiting on the lock */
-	if (waiting) {
-		count = atomic_long_read(&sem->count);
-
-		/*
-		 * If there were already threads queued before us and there are
-		 * no active writers and some readers, the lock must be read
-		 * owned; so we try to wake any read locks that were queued
-		 * ahead of us.
-		 */
-		if (!(count & RWSEM_WRITER_MASK) &&
-		     (count & RWSEM_READER_MASK)) {
-			__rwsem_mark_wake(sem, RWSEM_WAKE_READERS, &wake_q);
-			/*
-			 * The wakeup is normally called _after_ the wait_lock
-			 * is released, but given that we are proactively waking
-			 * readers we can deal with the wake_q overhead as it is
-			 * similar to releasing and taking the wait_lock again
-			 * for attempting rwsem_try_write_lock().
-			 */
-			wake_up_q(&wake_q);
-
-			/*
-			 * Reinitialize wake_q after use.
-			 */
-			wake_q_init(&wake_q);
-		}
-
-	} else {
-		count = atomic_long_add_return(RWSEM_FLAG_WAITERS, &sem->count);
-	}
-
-	/* wait until we successfully acquire the lock */
-	set_current_state(state);
-	while (true) {
-		if (rwsem_try_write_lock(count, sem))
-			break;
-		raw_spin_unlock_irq(&sem->wait_lock);
-
-		/* Block until there are no active lockers. */
-		do {
-			if (signal_pending_state(state, current))
-				goto out_nolock;
-
-			schedule();
-			lockevent_inc(rwsem_sleep_writer);
-			set_current_state(state);
-			count = atomic_long_read(&sem->count);
-		} while (count & RWSEM_LOCK_MASK);
-
-		raw_spin_lock_irq(&sem->wait_lock);
-	}
-	__set_current_state(TASK_RUNNING);
-	list_del(&waiter.list);
-	raw_spin_unlock_irq(&sem->wait_lock);
-	lockevent_inc(rwsem_wlock);
-
-	return ret;
-
-out_nolock:
-	__set_current_state(TASK_RUNNING);
-	raw_spin_lock_irq(&sem->wait_lock);
-	list_del(&waiter.list);
-	if (list_empty(&sem->wait_list))
-		atomic_long_andnot(RWSEM_FLAG_WAITERS, &sem->count);
-	else
-		__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
-	raw_spin_unlock_irq(&sem->wait_lock);
-	wake_up_q(&wake_q);
-	lockevent_inc(rwsem_wlock_fail);
-
-	return ERR_PTR(-EINTR);
-}
-
-__visible struct rw_semaphore * __sched
-rwsem_down_write_failed(struct rw_semaphore *sem)
-{
-	return __rwsem_down_write_failed_common(sem, TASK_UNINTERRUPTIBLE);
-}
-EXPORT_SYMBOL(rwsem_down_write_failed);
-
-__visible struct rw_semaphore * __sched
-rwsem_down_write_failed_killable(struct rw_semaphore *sem)
-{
-	return __rwsem_down_write_failed_common(sem, TASK_KILLABLE);
-}
-EXPORT_SYMBOL(rwsem_down_write_failed_killable);
-
-/*
- * handle waking up a waiter on the semaphore
- * - up_read/up_write has decremented the active part of count if we come here
- */
-__visible
-struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
-{
-	unsigned long flags;
-	DEFINE_WAKE_Q(wake_q);
-
-	raw_spin_lock_irqsave(&sem->wait_lock, flags);
-
-	if (!list_empty(&sem->wait_list))
-		__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
-
-	raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
-	wake_up_q(&wake_q);
-
-	return sem;
-}
-EXPORT_SYMBOL(rwsem_wake);
-
-/*
- * downgrade a write lock into a read lock
- * - caller incremented waiting part of count and discovered it still negative
- * - just wake up any readers at the front of the queue
- */
-__visible
-struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem)
-{
-	unsigned long flags;
-	DEFINE_WAKE_Q(wake_q);
-
-	raw_spin_lock_irqsave(&sem->wait_lock, flags);
-
-	if (!list_empty(&sem->wait_list))
-		__rwsem_mark_wake(sem, RWSEM_WAKE_READ_OWNED, &wake_q);
-
-	raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
-	wake_up_q(&wake_q);
-
-	return sem;
-}
-EXPORT_SYMBOL(rwsem_downgrade_wake);
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index ccbf18f560ff..8317bcdf063b 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -3,17 +3,901 @@
  *
  * Written by David Howells (dhowells@redhat.com).
  * Derived from asm-i386/semaphore.h
+ *
+ * Writer lock-stealing by Alex Shi <alex.shi@intel.com>
+ * and Michel Lespinasse <walken@google.com>
+ *
+ * Optimistic spinning by Tim Chen <tim.c.chen@intel.com>
+ * and Davidlohr Bueso <davidlohr@hp.com>. Based on mutexes.
+ *
+ * Rwsem count bit fields re-definition and rwsem rearchitecture
+ * by Waiman Long <longman@redhat.com>.
  */
 
 #include <linux/types.h>
 #include <linux/kernel.h>
 #include <linux/sched.h>
+#include <linux/sched/rt.h>
+#include <linux/sched/task.h>
 #include <linux/sched/debug.h>
+#include <linux/sched/wake_q.h>
+#include <linux/sched/signal.h>
 #include <linux/export.h>
 #include <linux/rwsem.h>
 #include <linux/atomic.h>
 
 #include "rwsem.h"
+#include "lock_events.h"
+
+/*
+ * The least significant 2 bits of the owner value has the following
+ * meanings when set.
+ *  - RWSEM_READER_OWNED (bit 0): The rwsem is owned by readers
+ *  - RWSEM_ANONYMOUSLY_OWNED (bit 1): The rwsem is anonymously owned,
+ *    i.e. the owner(s) cannot be readily determined. It can be reader
+ *    owned or the owning writer is indeterminate.
+ *
+ * When a writer acquires a rwsem, it puts its task_struct pointer
+ * into the owner field. It is cleared after an unlock.
+ *
+ * When a reader acquires a rwsem, it will also put its task_struct
+ * pointer into the owner field with both the RWSEM_READER_OWNED and
+ * RWSEM_ANONYMOUSLY_OWNED bits set. On unlock, the owner field will
+ * largely be left untouched. So for a free or reader-owned rwsem,
+ * the owner value may contain information about the last reader that
+ * acquires the rwsem. The anonymous bit is set because that particular
+ * reader may or may not still own the lock.
+ *
+ * That information may be helpful in debugging cases where the system
+ * seems to hang on a reader owned rwsem especially if only one reader
+ * is involved. Ideally we would like to track all the readers that own
+ * a rwsem, but the overhead is simply too big.
+ */
+#define RWSEM_READER_OWNED	(1UL << 0)
+#define RWSEM_ANONYMOUSLY_OWNED	(1UL << 1)
+
+#ifdef CONFIG_DEBUG_RWSEMS
+# define DEBUG_RWSEMS_WARN_ON(c, sem)	do {			\
+	if (!debug_locks_silent &&				\
+	    WARN_ONCE(c, "DEBUG_RWSEMS_WARN_ON(%s): count = 0x%lx, owner = 0x%lx, curr 0x%lx, list %sempty\n",\
+		#c, atomic_long_read(&(sem)->count),		\
+		(long)((sem)->owner), (long)current,		\
+		list_empty(&(sem)->wait_list) ? "" : "not "))	\
+			debug_locks_off();			\
+	} while (0)
+#else
+# define DEBUG_RWSEMS_WARN_ON(c, sem)
+#endif
+
+/*
+ * The definition of the atomic counter in the semaphore:
+ *
+ * Bit  0   - writer locked bit
+ * Bit  1   - waiters present bit
+ * Bits 2-7 - reserved
+ * Bits 8-X - reader count (24 bits on 32-bit kernels, 56 bits on 64-bit)
+ *
+ * atomic_long_fetch_add() is used to obtain reader lock, whereas
+ * atomic_long_cmpxchg() will be used to obtain writer lock.
+ */
+#define RWSEM_WRITER_LOCKED	(1UL << 0)
+#define RWSEM_FLAG_WAITERS	(1UL << 1)
+#define RWSEM_READER_SHIFT	8
+#define RWSEM_READER_BIAS	(1UL << RWSEM_READER_SHIFT)
+#define RWSEM_READER_MASK	(~(RWSEM_READER_BIAS - 1))
+#define RWSEM_WRITER_MASK	RWSEM_WRITER_LOCKED
+#define RWSEM_LOCK_MASK		(RWSEM_WRITER_MASK|RWSEM_READER_MASK)
+#define RWSEM_READ_FAILED_MASK	(RWSEM_WRITER_MASK|RWSEM_FLAG_WAITERS)
+
+/*
+ * All writes to owner are protected by WRITE_ONCE() to make sure that
+ * store tearing can't happen as optimistic spinners may read and use
+ * the owner value concurrently without lock. Read from owner, however,
+ * may not need READ_ONCE() as long as the pointer value is only used
+ * for comparison and isn't being dereferenced.
+ */
+static inline void rwsem_set_owner(struct rw_semaphore *sem)
+{
+	WRITE_ONCE(sem->owner, current);
+}
+
+static inline void rwsem_clear_owner(struct rw_semaphore *sem)
+{
+	WRITE_ONCE(sem->owner, NULL);
+}
+
+/*
+ * The task_struct pointer of the last owning reader will be left in
+ * the owner field.
+ *
+ * Note that the owner value just indicates the task has owned the rwsem
+ * previously, it may not be the real owner or one of the real owners
+ * anymore when that field is examined, so take it with a grain of salt.
+ */
+static inline void __rwsem_set_reader_owned(struct rw_semaphore *sem,
+					    struct task_struct *owner)
+{
+	unsigned long val = (unsigned long)owner | RWSEM_READER_OWNED
+						 | RWSEM_ANONYMOUSLY_OWNED;
+
+	WRITE_ONCE(sem->owner, (struct task_struct *)val);
+}
+
+static inline void rwsem_set_reader_owned(struct rw_semaphore *sem)
+{
+	__rwsem_set_reader_owned(sem, current);
+}
+
+/*
+ * Return true if a rwsem waiter can spin on the rwsem's owner
+ * and steal the lock, i.e. the lock is not anonymously owned.
+ * N.B. !owner is considered spinnable.
+ */
+static inline bool is_rwsem_owner_spinnable(struct task_struct *owner)
+{
+	return !((unsigned long)owner & RWSEM_ANONYMOUSLY_OWNED);
+}
+
+/*
+ * Return true if rwsem is owned by an anonymous writer or readers.
+ */
+static inline bool rwsem_has_anonymous_owner(struct task_struct *owner)
+{
+	return (unsigned long)owner & RWSEM_ANONYMOUSLY_OWNED;
+}
+
+#ifdef CONFIG_DEBUG_RWSEMS
+/*
+ * With CONFIG_DEBUG_RWSEMS configured, it will make sure that if there
+ * is a task pointer in owner of a reader-owned rwsem, it will be the
+ * real owner or one of the real owners. The only exception is when the
+ * unlock is done by up_read_non_owner().
+ */
+static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
+{
+	unsigned long val = (unsigned long)current | RWSEM_READER_OWNED
+						   | RWSEM_ANONYMOUSLY_OWNED;
+	if (READ_ONCE(sem->owner) == (struct task_struct *)val)
+		cmpxchg_relaxed((unsigned long *)&sem->owner, val,
+				RWSEM_READER_OWNED | RWSEM_ANONYMOUSLY_OWNED);
+}
+#else
+static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
+{
+}
+#endif
+
+/*
+ * Guide to the rw_semaphore's count field.
+ *
+ * When the RWSEM_WRITER_LOCKED bit in count is set, the lock is owned
+ * by a writer.
+ *
+ * The lock is owned by readers when
+ * (1) the RWSEM_WRITER_LOCKED isn't set in count,
+ * (2) some of the reader bits are set in count, and
+ * (3) the owner field has RWSEM_READ_OWNED bit set.
+ *
+ * Having some reader bits set is not enough to guarantee a readers owned
+ * lock as the readers may be in the process of backing out from the count
+ * and a writer has just released the lock. So another writer may steal
+ * the lock immediately after that.
+ */
+
+/*
+ * Initialize an rwsem:
+ */
+void __init_rwsem(struct rw_semaphore *sem, const char *name,
+		  struct lock_class_key *key)
+{
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	/*
+	 * Make sure we are not reinitializing a held semaphore:
+	 */
+	debug_check_no_locks_freed((void *)sem, sizeof(*sem));
+	lockdep_init_map(&sem->dep_map, name, key, 0);
+#endif
+	atomic_long_set(&sem->count, RWSEM_UNLOCKED_VALUE);
+	raw_spin_lock_init(&sem->wait_lock);
+	INIT_LIST_HEAD(&sem->wait_list);
+	sem->owner = NULL;
+#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
+	osq_lock_init(&sem->osq);
+#endif
+}
+
+EXPORT_SYMBOL(__init_rwsem);
+
+enum rwsem_waiter_type {
+	RWSEM_WAITING_FOR_WRITE,
+	RWSEM_WAITING_FOR_READ
+};
+
+struct rwsem_waiter {
+	struct list_head list;
+	struct task_struct *task;
+	enum rwsem_waiter_type type;
+};
+
+enum rwsem_wake_type {
+	RWSEM_WAKE_ANY,		/* Wake whatever's at head of wait list */
+	RWSEM_WAKE_READERS,	/* Wake readers only */
+	RWSEM_WAKE_READ_OWNED	/* Waker thread holds the read lock */
+};
+
+/*
+ * handle the lock release when processes blocked on it can now run
+ * - if we come here from up_xxxx(), then the RWSEM_FLAG_WAITERS bit must
+ *   have been set.
+ * - there must be someone on the queue
+ * - the wait_lock must be held by the caller
+ * - tasks are marked for wakeup, the caller must later invoke wake_up_q()
+ *   to actually wakeup the blocked task(s) and drop the reference count,
+ *   preferably when the wait_lock is released
+ * - woken process blocks are discarded from the list after having task zeroed
+ * - writers are only marked woken if downgrading is false
+ */
+static void __rwsem_mark_wake(struct rw_semaphore *sem,
+			      enum rwsem_wake_type wake_type,
+			      struct wake_q_head *wake_q)
+{
+	struct rwsem_waiter *waiter, *tmp;
+	long oldcount, woken = 0, adjustment = 0;
+	struct list_head wlist;
+
+	/*
+	 * Take a peek at the queue head waiter such that we can determine
+	 * the wakeup(s) to perform.
+	 */
+	waiter = list_first_entry(&sem->wait_list, struct rwsem_waiter, list);
+
+	if (waiter->type == RWSEM_WAITING_FOR_WRITE) {
+		if (wake_type == RWSEM_WAKE_ANY) {
+			/*
+			 * Mark writer at the front of the queue for wakeup.
+			 * Until the task is actually awoken later by
+			 * the caller, other writers are able to steal it.
+			 * Readers, on the other hand, will block as they
+			 * will notice the queued writer.
+			 */
+			wake_q_add(wake_q, waiter->task);
+			lockevent_inc(rwsem_wake_writer);
+		}
+
+		return;
+	}
+
+	/*
+	 * Writers might steal the lock before we grant it to the next reader.
+	 * We prefer to do the first reader grant before counting readers
+	 * so we can bail out early if a writer stole the lock.
+	 */
+	if (wake_type != RWSEM_WAKE_READ_OWNED) {
+		adjustment = RWSEM_READER_BIAS;
+		oldcount = atomic_long_fetch_add(adjustment, &sem->count);
+		if (unlikely(oldcount & RWSEM_WRITER_MASK)) {
+			atomic_long_sub(adjustment, &sem->count);
+			return;
+		}
+		/*
+		 * Set it to reader-owned to give spinners an early
+		 * indication that readers now have the lock.
+		 */
+		__rwsem_set_reader_owned(sem, waiter->task);
+	}
+
+	/*
+	 * Grant an infinite number of read locks to the readers at the front
+	 * of the queue. We know that woken will be at least 1 as we accounted
+	 * for above. Note we increment the 'active part' of the count by the
+	 * number of readers before waking any processes up.
+	 *
+	 * We have to do wakeup in 2 passes to prevent the possibility that
+	 * the reader count may be decremented before it is incremented. It
+	 * is because the to-be-woken waiter may not have slept yet. So it
+	 * may see waiter->task got cleared, finish its critical section and
+	 * do an unlock before the reader count increment.
+	 *
+	 * 1) Collect the read-waiters in a separate list, count them and
+	 *    fully increment the reader count in rwsem.
+	 * 2) For each waiter in the new list, clear waiter->task and
+	 *    put them into wake_q to be woken up later.
+	 */
+	list_for_each_entry(waiter, &sem->wait_list, list) {
+		if (waiter->type == RWSEM_WAITING_FOR_WRITE)
+			break;
+
+		woken++;
+	}
+	list_cut_before(&wlist, &sem->wait_list, &waiter->list);
+
+	adjustment = woken * RWSEM_READER_BIAS - adjustment;
+	lockevent_cond_inc(rwsem_wake_reader, woken);
+	if (list_empty(&sem->wait_list)) {
+		/* hit end of list above */
+		adjustment -= RWSEM_FLAG_WAITERS;
+	}
+
+	if (adjustment)
+		atomic_long_add(adjustment, &sem->count);
+
+	/* 2nd pass */
+	list_for_each_entry_safe(waiter, tmp, &wlist, list) {
+		struct task_struct *tsk;
+
+		tsk = waiter->task;
+		get_task_struct(tsk);
+
+		/*
+		 * Ensure calling get_task_struct() before setting the reader
+		 * waiter to nil such that rwsem_down_read_failed() cannot
+		 * race with do_exit() by always holding a reference count
+		 * to the task to wakeup.
+		 */
+		smp_store_release(&waiter->task, NULL);
+		/*
+		 * Ensure issuing the wakeup (either by us or someone else)
+		 * after setting the reader waiter to nil.
+		 */
+		wake_q_add_safe(wake_q, tsk);
+	}
+}
+
+/*
+ * This function must be called with the sem->wait_lock held to prevent
+ * race conditions between checking the rwsem wait list and setting the
+ * sem->count accordingly.
+ */
+static inline bool rwsem_try_write_lock(long count, struct rw_semaphore *sem)
+{
+	long new;
+
+	if (count & RWSEM_LOCK_MASK)
+		return false;
+
+	new = count + RWSEM_WRITER_LOCKED -
+	     (list_is_singular(&sem->wait_list) ? RWSEM_FLAG_WAITERS : 0);
+
+	if (atomic_long_try_cmpxchg_acquire(&sem->count, &count, new)) {
+		rwsem_set_owner(sem);
+		return true;
+	}
+
+	return false;
+}
+
+#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
+/*
+ * Try to acquire write lock before the writer has been put on wait queue.
+ */
+static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem)
+{
+	long count = atomic_long_read(&sem->count);
+
+	while (!(count & RWSEM_LOCK_MASK)) {
+		if (atomic_long_try_cmpxchg_acquire(&sem->count, &count,
+					count + RWSEM_WRITER_LOCKED)) {
+			rwsem_set_owner(sem);
+			lockevent_inc(rwsem_opt_wlock);
+			return true;
+		}
+	}
+	return false;
+}
+
+static inline bool owner_on_cpu(struct task_struct *owner)
+{
+	/*
+	 * Due to lock holder preemption, we skip spinning if the task is
+	 * not running on a CPU or its CPU is preempted.
+	 */
+	return owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
+}
+
+static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
+{
+	struct task_struct *owner;
+	bool ret = true;
+
+	BUILD_BUG_ON(!rwsem_has_anonymous_owner(RWSEM_OWNER_UNKNOWN));
+
+	if (need_resched())
+		return false;
+
+	rcu_read_lock();
+	owner = READ_ONCE(sem->owner);
+	if (owner) {
+		ret = is_rwsem_owner_spinnable(owner) &&
+		      owner_on_cpu(owner);
+	}
+	rcu_read_unlock();
+	return ret;
+}
+
+/*
+ * Return true only if we can still spin on the owner field of the rwsem.
+ */
+static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem)
+{
+	struct task_struct *owner = READ_ONCE(sem->owner);
+
+	if (!is_rwsem_owner_spinnable(owner))
+		return false;
+
+	rcu_read_lock();
+	while (owner && (READ_ONCE(sem->owner) == owner)) {
+		/*
+		 * Ensure we emit the owner->on_cpu, dereference _after_
+		 * checking sem->owner still matches owner, if that fails,
+		 * owner might point to free()d memory, if it still matches,
+		 * the rcu_read_lock() ensures the memory stays valid.
+		 */
+		barrier();
+
+		/*
+		 * abort spinning when need_resched or owner is not running or
+		 * owner's cpu is preempted.
+		 */
+		if (need_resched() || !owner_on_cpu(owner)) {
+			rcu_read_unlock();
+			return false;
+		}
+
+		cpu_relax();
+	}
+	rcu_read_unlock();
+
+	/*
+	 * If there is a new owner or the owner is not set, we continue
+	 * spinning.
+	 */
+	return is_rwsem_owner_spinnable(READ_ONCE(sem->owner));
+}
+
+static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
+{
+	bool taken = false;
+
+	preempt_disable();
+
+	/* sem->wait_lock should not be held when doing optimistic spinning */
+	if (!rwsem_can_spin_on_owner(sem))
+		goto done;
+
+	if (!osq_lock(&sem->osq))
+		goto done;
+
+	/*
+	 * Optimistically spin on the owner field and attempt to acquire the
+	 * lock whenever the owner changes. Spinning will be stopped when:
+	 *  1) the owning writer isn't running; or
+	 *  2) readers own the lock as we can't determine if they are
+	 *     actively running or not.
+	 */
+	while (rwsem_spin_on_owner(sem)) {
+		/*
+		 * Try to acquire the lock
+		 */
+		if (rwsem_try_write_lock_unqueued(sem)) {
+			taken = true;
+			break;
+		}
+
+		/*
+		 * When there's no owner, we might have preempted between the
+		 * owner acquiring the lock and setting the owner field. If
+		 * we're an RT task that will live-lock because we won't let
+		 * the owner complete.
+		 */
+		if (!sem->owner && (need_resched() || rt_task(current)))
+			break;
+
+		/*
+		 * The cpu_relax() call is a compiler barrier which forces
+		 * everything in this loop to be re-loaded. We don't need
+		 * memory barriers as we'll eventually observe the right
+		 * values at the cost of a few extra spins.
+		 */
+		cpu_relax();
+	}
+	osq_unlock(&sem->osq);
+done:
+	preempt_enable();
+	lockevent_cond_inc(rwsem_opt_fail, !taken);
+	return taken;
+}
+#else
+static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
+{
+	return false;
+}
+#endif
+
+/*
+ * Wait for the read lock to be granted
+ */
+static inline struct rw_semaphore __sched *
+__rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
+{
+	long count, adjustment = -RWSEM_READER_BIAS;
+	struct rwsem_waiter waiter;
+	DEFINE_WAKE_Q(wake_q);
+
+	waiter.task = current;
+	waiter.type = RWSEM_WAITING_FOR_READ;
+
+	raw_spin_lock_irq(&sem->wait_lock);
+	if (list_empty(&sem->wait_list)) {
+		/*
+		 * In case the wait queue is empty and the lock isn't owned
+		 * by a writer, this reader can exit the slowpath and return
+		 * immediately as its RWSEM_READER_BIAS has already been
+		 * set in the count.
+		 */
+		if (!(atomic_long_read(&sem->count) & RWSEM_WRITER_MASK)) {
+			raw_spin_unlock_irq(&sem->wait_lock);
+			rwsem_set_reader_owned(sem);
+			lockevent_inc(rwsem_rlock_fast);
+			return sem;
+		}
+		adjustment += RWSEM_FLAG_WAITERS;
+	}
+	list_add_tail(&waiter.list, &sem->wait_list);
+
+	/* we're now waiting on the lock, but no longer actively locking */
+	count = atomic_long_add_return(adjustment, &sem->count);
+
+	/*
+	 * If there are no active locks, wake the front queued process(es).
+	 *
+	 * If there are no writers and we are first in the queue,
+	 * wake our own waiter to join the existing active readers !
+	 */
+	if (!(count & RWSEM_LOCK_MASK) ||
+	   (!(count & RWSEM_WRITER_MASK) && (adjustment & RWSEM_FLAG_WAITERS)))
+		__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
+
+	raw_spin_unlock_irq(&sem->wait_lock);
+	wake_up_q(&wake_q);
+
+	/* wait to be given the lock */
+	while (true) {
+		set_current_state(state);
+		if (!waiter.task)
+			break;
+		if (signal_pending_state(state, current)) {
+			raw_spin_lock_irq(&sem->wait_lock);
+			if (waiter.task)
+				goto out_nolock;
+			raw_spin_unlock_irq(&sem->wait_lock);
+			break;
+		}
+		schedule();
+		lockevent_inc(rwsem_sleep_reader);
+	}
+
+	__set_current_state(TASK_RUNNING);
+	lockevent_inc(rwsem_rlock);
+	return sem;
+out_nolock:
+	list_del(&waiter.list);
+	if (list_empty(&sem->wait_list))
+		atomic_long_andnot(RWSEM_FLAG_WAITERS, &sem->count);
+	raw_spin_unlock_irq(&sem->wait_lock);
+	__set_current_state(TASK_RUNNING);
+	lockevent_inc(rwsem_rlock_fail);
+	return ERR_PTR(-EINTR);
+}
+
+__visible struct rw_semaphore * __sched
+rwsem_down_read_failed(struct rw_semaphore *sem)
+{
+	return __rwsem_down_read_failed_common(sem, TASK_UNINTERRUPTIBLE);
+}
+EXPORT_SYMBOL(rwsem_down_read_failed);
+
+__visible struct rw_semaphore * __sched
+rwsem_down_read_failed_killable(struct rw_semaphore *sem)
+{
+	return __rwsem_down_read_failed_common(sem, TASK_KILLABLE);
+}
+EXPORT_SYMBOL(rwsem_down_read_failed_killable);
+
+/*
+ * Wait until we successfully acquire the write lock
+ */
+static inline struct rw_semaphore *
+__rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
+{
+	long count;
+	bool waiting = true; /* any queued threads before us */
+	struct rwsem_waiter waiter;
+	struct rw_semaphore *ret = sem;
+	DEFINE_WAKE_Q(wake_q);
+
+	/* do optimistic spinning and steal lock if possible */
+	if (rwsem_optimistic_spin(sem))
+		return sem;
+
+	/*
+	 * Optimistic spinning failed, proceed to the slowpath
+	 * and block until we can acquire the sem.
+	 */
+	waiter.task = current;
+	waiter.type = RWSEM_WAITING_FOR_WRITE;
+
+	raw_spin_lock_irq(&sem->wait_lock);
+
+	/* account for this before adding a new element to the list */
+	if (list_empty(&sem->wait_list))
+		waiting = false;
+
+	list_add_tail(&waiter.list, &sem->wait_list);
+
+	/* we're now waiting on the lock */
+	if (waiting) {
+		count = atomic_long_read(&sem->count);
+
+		/*
+		 * If there were already threads queued before us and there are
+		 * no active writers and some readers, the lock must be read
+		 * owned; so we try to wake any read locks that were queued
+		 * ahead of us.
+		 */
+		if (!(count & RWSEM_WRITER_MASK) &&
+		     (count & RWSEM_READER_MASK)) {
+			__rwsem_mark_wake(sem, RWSEM_WAKE_READERS, &wake_q);
+			/*
+			 * The wakeup is normally called _after_ the wait_lock
+			 * is released, but given that we are proactively waking
+			 * readers we can deal with the wake_q overhead as it is
+			 * similar to releasing and taking the wait_lock again
+			 * for attempting rwsem_try_write_lock().
+			 */
+			wake_up_q(&wake_q);
+
+			/*
+			 * Reinitialize wake_q after use.
+			 */
+			wake_q_init(&wake_q);
+		}
+
+	} else {
+		count = atomic_long_add_return(RWSEM_FLAG_WAITERS, &sem->count);
+	}
+
+	/* wait until we successfully acquire the lock */
+	set_current_state(state);
+	while (true) {
+		if (rwsem_try_write_lock(count, sem))
+			break;
+		raw_spin_unlock_irq(&sem->wait_lock);
+
+		/* Block until there are no active lockers. */
+		do {
+			if (signal_pending_state(state, current))
+				goto out_nolock;
+
+			schedule();
+			lockevent_inc(rwsem_sleep_writer);
+			set_current_state(state);
+			count = atomic_long_read(&sem->count);
+		} while (count & RWSEM_LOCK_MASK);
+
+		raw_spin_lock_irq(&sem->wait_lock);
+	}
+	__set_current_state(TASK_RUNNING);
+	list_del(&waiter.list);
+	raw_spin_unlock_irq(&sem->wait_lock);
+	lockevent_inc(rwsem_wlock);
+
+	return ret;
+
+out_nolock:
+	__set_current_state(TASK_RUNNING);
+	raw_spin_lock_irq(&sem->wait_lock);
+	list_del(&waiter.list);
+	if (list_empty(&sem->wait_list))
+		atomic_long_andnot(RWSEM_FLAG_WAITERS, &sem->count);
+	else
+		__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
+	raw_spin_unlock_irq(&sem->wait_lock);
+	wake_up_q(&wake_q);
+	lockevent_inc(rwsem_wlock_fail);
+
+	return ERR_PTR(-EINTR);
+}
+
+__visible struct rw_semaphore * __sched
+rwsem_down_write_failed(struct rw_semaphore *sem)
+{
+	return __rwsem_down_write_failed_common(sem, TASK_UNINTERRUPTIBLE);
+}
+EXPORT_SYMBOL(rwsem_down_write_failed);
+
+__visible struct rw_semaphore * __sched
+rwsem_down_write_failed_killable(struct rw_semaphore *sem)
+{
+	return __rwsem_down_write_failed_common(sem, TASK_KILLABLE);
+}
+EXPORT_SYMBOL(rwsem_down_write_failed_killable);
+
+/*
+ * handle waking up a waiter on the semaphore
+ * - up_read/up_write has decremented the active part of count if we come here
+ */
+__visible
+struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
+{
+	unsigned long flags;
+	DEFINE_WAKE_Q(wake_q);
+
+	raw_spin_lock_irqsave(&sem->wait_lock, flags);
+
+	if (!list_empty(&sem->wait_list))
+		__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
+
+	raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
+	wake_up_q(&wake_q);
+
+	return sem;
+}
+EXPORT_SYMBOL(rwsem_wake);
+
+/*
+ * downgrade a write lock into a read lock
+ * - caller incremented waiting part of count and discovered it still negative
+ * - just wake up any readers at the front of the queue
+ */
+__visible
+struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem)
+{
+	unsigned long flags;
+	DEFINE_WAKE_Q(wake_q);
+
+	raw_spin_lock_irqsave(&sem->wait_lock, flags);
+
+	if (!list_empty(&sem->wait_list))
+		__rwsem_mark_wake(sem, RWSEM_WAKE_READ_OWNED, &wake_q);
+
+	raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
+	wake_up_q(&wake_q);
+
+	return sem;
+}
+EXPORT_SYMBOL(rwsem_downgrade_wake);
+
+/*
+ * lock for reading
+ */
+inline void __down_read(struct rw_semaphore *sem)
+{
+	if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
+			&sem->count) & RWSEM_READ_FAILED_MASK)) {
+		rwsem_down_read_failed(sem);
+		DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
+					RWSEM_READER_OWNED), sem);
+	} else {
+		rwsem_set_reader_owned(sem);
+	}
+}
+
+static inline int __down_read_killable(struct rw_semaphore *sem)
+{
+	if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
+			&sem->count) & RWSEM_READ_FAILED_MASK)) {
+		if (IS_ERR(rwsem_down_read_failed_killable(sem)))
+			return -EINTR;
+		DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
+					RWSEM_READER_OWNED), sem);
+	} else {
+		rwsem_set_reader_owned(sem);
+	}
+	return 0;
+}
+
+static inline int __down_read_trylock(struct rw_semaphore *sem)
+{
+	/*
+	 * Optimize for the case when the rwsem is not locked at all.
+	 */
+	long tmp = RWSEM_UNLOCKED_VALUE;
+
+	lockevent_inc(rwsem_rtrylock);
+	do {
+		if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
+					tmp + RWSEM_READER_BIAS)) {
+			rwsem_set_reader_owned(sem);
+			return 1;
+		}
+	} while (!(tmp & RWSEM_READ_FAILED_MASK));
+	return 0;
+}
+
+/*
+ * lock for writing
+ */
+static inline void __down_write(struct rw_semaphore *sem)
+{
+	if (unlikely(atomic_long_cmpxchg_acquire(&sem->count, 0,
+						 RWSEM_WRITER_LOCKED)))
+		rwsem_down_write_failed(sem);
+	rwsem_set_owner(sem);
+}
+
+static inline int __down_write_killable(struct rw_semaphore *sem)
+{
+	if (unlikely(atomic_long_cmpxchg_acquire(&sem->count, 0,
+						 RWSEM_WRITER_LOCKED)))
+		if (IS_ERR(rwsem_down_write_failed_killable(sem)))
+			return -EINTR;
+	rwsem_set_owner(sem);
+	return 0;
+}
+
+static inline int __down_write_trylock(struct rw_semaphore *sem)
+{
+	long tmp;
+
+	lockevent_inc(rwsem_wtrylock);
+	tmp = atomic_long_cmpxchg_acquire(&sem->count, RWSEM_UNLOCKED_VALUE,
+					  RWSEM_WRITER_LOCKED);
+	if (tmp == RWSEM_UNLOCKED_VALUE) {
+		rwsem_set_owner(sem);
+		return true;
+	}
+	return false;
+}
+
+/*
+ * unlock after reading
+ */
+inline void __up_read(struct rw_semaphore *sem)
+{
+	long tmp;
+
+	DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner & RWSEM_READER_OWNED),
+				sem);
+	rwsem_clear_reader_owned(sem);
+	tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
+	if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS))
+			== RWSEM_FLAG_WAITERS))
+		rwsem_wake(sem);
+}
+
+/*
+ * unlock after writing
+ */
+static inline void __up_write(struct rw_semaphore *sem)
+{
+	DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
+	rwsem_clear_owner(sem);
+	if (unlikely(atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED,
+			&sem->count) & RWSEM_FLAG_WAITERS))
+		rwsem_wake(sem);
+}
+
+/*
+ * downgrade write lock to read lock
+ */
+static inline void __downgrade_write(struct rw_semaphore *sem)
+{
+	long tmp;
+
+	/*
+	 * When downgrading from exclusive to shared ownership,
+	 * anything inside the write-locked region cannot leak
+	 * into the read side. In contrast, anything in the
+	 * read-locked region is ok to be re-ordered into the
+	 * write side. As such, rely on RELEASE semantics.
+	 */
+	DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
+	tmp = atomic_long_fetch_add_release(
+		-RWSEM_WRITER_LOCKED+RWSEM_READER_BIAS, &sem->count);
+	rwsem_set_reader_owned(sem);
+	if (tmp & RWSEM_FLAG_WAITERS)
+		rwsem_downgrade_wake(sem);
+}
 
 /*
  * lock for reading
diff --git a/kernel/locking/rwsem.h b/kernel/locking/rwsem.h
index 499a9b2bda82..2534ce49f648 100644
--- a/kernel/locking/rwsem.h
+++ b/kernel/locking/rwsem.h
@@ -1,279 +1,10 @@
 /* SPDX-License-Identifier: GPL-2.0 */
-/*
- * The least significant 2 bits of the owner value has the following
- * meanings when set.
- *  - RWSEM_READER_OWNED (bit 0): The rwsem is owned by readers
- *  - RWSEM_ANONYMOUSLY_OWNED (bit 1): The rwsem is anonymously owned,
- *    i.e. the owner(s) cannot be readily determined. It can be reader
- *    owned or the owning writer is indeterminate.
- *
- * When a writer acquires a rwsem, it puts its task_struct pointer
- * into the owner field. It is cleared after an unlock.
- *
- * When a reader acquires a rwsem, it will also puts its task_struct
- * pointer into the owner field with both the RWSEM_READER_OWNED and
- * RWSEM_ANONYMOUSLY_OWNED bits set. On unlock, the owner field will
- * largely be left untouched. So for a free or reader-owned rwsem,
- * the owner value may contain information about the last reader that
- * acquires the rwsem. The anonymous bit is set because that particular
- * reader may or may not still own the lock.
- *
- * That information may be helpful in debugging cases where the system
- * seems to hang on a reader owned rwsem especially if only one reader
- * is involved. Ideally we would like to track all the readers that own
- * a rwsem, but the overhead is simply too big.
- */
-#include "lock_events.h"
 
-#define RWSEM_READER_OWNED	(1UL << 0)
-#define RWSEM_ANONYMOUSLY_OWNED	(1UL << 1)
+#ifndef __INTERNAL_RWSEM_H
+#define __INTERNAL_RWSEM_H
+#include <linux/rwsem.h>
 
-#ifdef CONFIG_DEBUG_RWSEMS
-# define DEBUG_RWSEMS_WARN_ON(c, sem)	do {			\
-	if (!debug_locks_silent &&				\
-	    WARN_ONCE(c, "DEBUG_RWSEMS_WARN_ON(%s): count = 0x%lx, owner = 0x%lx, curr 0x%lx, list %sempty\n",\
-		#c, atomic_long_read(&(sem)->count),		\
-		(long)((sem)->owner), (long)current,		\
-		list_empty(&(sem)->wait_list) ? "" : "not "))	\
-			debug_locks_off();			\
-	} while (0)
-#else
-# define DEBUG_RWSEMS_WARN_ON(c, sem)
-#endif
+extern void __down_read(struct rw_semaphore *sem);
+extern void __up_read(struct rw_semaphore *sem);
 
-/*
- * The definition of the atomic counter in the semaphore:
- *
- * Bit  0   - writer locked bit
- * Bit  1   - waiters present bit
- * Bits 2-7 - reserved
- * Bits 8-X - 24-bit (32-bit) or 56-bit reader count
- *
- * atomic_long_fetch_add() is used to obtain reader lock, whereas
- * atomic_long_cmpxchg() will be used to obtain writer lock.
- */
-#define RWSEM_WRITER_LOCKED	(1UL << 0)
-#define RWSEM_FLAG_WAITERS	(1UL << 1)
-#define RWSEM_READER_SHIFT	8
-#define RWSEM_READER_BIAS	(1UL << RWSEM_READER_SHIFT)
-#define RWSEM_READER_MASK	(~(RWSEM_READER_BIAS - 1))
-#define RWSEM_WRITER_MASK	RWSEM_WRITER_LOCKED
-#define RWSEM_LOCK_MASK		(RWSEM_WRITER_MASK|RWSEM_READER_MASK)
-#define RWSEM_READ_FAILED_MASK	(RWSEM_WRITER_MASK|RWSEM_FLAG_WAITERS)
-
-/*
- * All writes to owner are protected by WRITE_ONCE() to make sure that
- * store tearing can't happen as optimistic spinners may read and use
- * the owner value concurrently without lock. Read from owner, however,
- * may not need READ_ONCE() as long as the pointer value is only used
- * for comparison and isn't being dereferenced.
- */
-static inline void rwsem_set_owner(struct rw_semaphore *sem)
-{
-	WRITE_ONCE(sem->owner, current);
-}
-
-static inline void rwsem_clear_owner(struct rw_semaphore *sem)
-{
-	WRITE_ONCE(sem->owner, NULL);
-}
-
-/*
- * The task_struct pointer of the last owning reader will be left in
- * the owner field.
- *
- * Note that the owner value just indicates the task has owned the rwsem
- * previously, it may not be the real owner or one of the real owners
- * anymore when that field is examined, so take it with a grain of salt.
- */
-static inline void __rwsem_set_reader_owned(struct rw_semaphore *sem,
-					    struct task_struct *owner)
-{
-	unsigned long val = (unsigned long)owner | RWSEM_READER_OWNED
-						 | RWSEM_ANONYMOUSLY_OWNED;
-
-	WRITE_ONCE(sem->owner, (struct task_struct *)val);
-}
-
-static inline void rwsem_set_reader_owned(struct rw_semaphore *sem)
-{
-	__rwsem_set_reader_owned(sem, current);
-}
-
-/*
- * Return true if the a rwsem waiter can spin on the rwsem's owner
- * and steal the lock, i.e. the lock is not anonymously owned.
- * N.B. !owner is considered spinnable.
- */
-static inline bool is_rwsem_owner_spinnable(struct task_struct *owner)
-{
-	return !((unsigned long)owner & RWSEM_ANONYMOUSLY_OWNED);
-}
-
-/*
- * Return true if rwsem is owned by an anonymous writer or readers.
- */
-static inline bool rwsem_has_anonymous_owner(struct task_struct *owner)
-{
-	return (unsigned long)owner & RWSEM_ANONYMOUSLY_OWNED;
-}
-
-#ifdef CONFIG_DEBUG_RWSEMS
-/*
- * With CONFIG_DEBUG_RWSEMS configured, it will make sure that if there
- * is a task pointer in owner of a reader-owned rwsem, it will be the
- * real owner or one of the real owners. The only exception is when the
- * unlock is done by up_read_non_owner().
- */
-static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
-{
-	unsigned long val = (unsigned long)current | RWSEM_READER_OWNED
-						   | RWSEM_ANONYMOUSLY_OWNED;
-	if (READ_ONCE(sem->owner) == (struct task_struct *)val)
-		cmpxchg_relaxed((unsigned long *)&sem->owner, val,
-				RWSEM_READER_OWNED | RWSEM_ANONYMOUSLY_OWNED);
-}
-#else
-static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
-{
-}
-#endif
-
-extern struct rw_semaphore *rwsem_down_read_failed(struct rw_semaphore *sem);
-extern struct rw_semaphore *rwsem_down_read_failed_killable(struct rw_semaphore *sem);
-extern struct rw_semaphore *rwsem_down_write_failed(struct rw_semaphore *sem);
-extern struct rw_semaphore *rwsem_down_write_failed_killable(struct rw_semaphore *sem);
-extern struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem);
-extern struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem);
-
-/*
- * lock for reading
- */
-static inline void __down_read(struct rw_semaphore *sem)
-{
-	if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
-			&sem->count) & RWSEM_READ_FAILED_MASK)) {
-		rwsem_down_read_failed(sem);
-		DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
-					RWSEM_READER_OWNED), sem);
-	} else {
-		rwsem_set_reader_owned(sem);
-	}
-}
-
-static inline int __down_read_killable(struct rw_semaphore *sem)
-{
-	if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
-			&sem->count) & RWSEM_READ_FAILED_MASK)) {
-		if (IS_ERR(rwsem_down_read_failed_killable(sem)))
-			return -EINTR;
-		DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
-					RWSEM_READER_OWNED), sem);
-	} else {
-		rwsem_set_reader_owned(sem);
-	}
-	return 0;
-}
-
-static inline int __down_read_trylock(struct rw_semaphore *sem)
-{
-	/*
-	 * Optimize for the case when the rwsem is not locked at all.
-	 */
-	long tmp = RWSEM_UNLOCKED_VALUE;
-
-	lockevent_inc(rwsem_rtrylock);
-	do {
-		if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
-					tmp + RWSEM_READER_BIAS)) {
-			rwsem_set_reader_owned(sem);
-			return 1;
-		}
-	} while (!(tmp & RWSEM_READ_FAILED_MASK));
-	return 0;
-}
-
-/*
- * lock for writing
- */
-static inline void __down_write(struct rw_semaphore *sem)
-{
-	if (unlikely(atomic_long_cmpxchg_acquire(&sem->count, 0,
-						 RWSEM_WRITER_LOCKED)))
-		rwsem_down_write_failed(sem);
-	rwsem_set_owner(sem);
-}
-
-static inline int __down_write_killable(struct rw_semaphore *sem)
-{
-	if (unlikely(atomic_long_cmpxchg_acquire(&sem->count, 0,
-						 RWSEM_WRITER_LOCKED)))
-		if (IS_ERR(rwsem_down_write_failed_killable(sem)))
-			return -EINTR;
-	rwsem_set_owner(sem);
-	return 0;
-}
-
-static inline int __down_write_trylock(struct rw_semaphore *sem)
-{
-	long tmp;
-
-	lockevent_inc(rwsem_wtrylock);
-	tmp = atomic_long_cmpxchg_acquire(&sem->count, RWSEM_UNLOCKED_VALUE,
-					  RWSEM_WRITER_LOCKED);
-	if (tmp == RWSEM_UNLOCKED_VALUE) {
-		rwsem_set_owner(sem);
-		return true;
-	}
-	return false;
-}
-
-/*
- * unlock after reading
- */
-static inline void __up_read(struct rw_semaphore *sem)
-{
-	long tmp;
-
-	DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner & RWSEM_READER_OWNED),
-				sem);
-	rwsem_clear_reader_owned(sem);
-	tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
-	if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS))
-			== RWSEM_FLAG_WAITERS))
-		rwsem_wake(sem);
-}
-
-/*
- * unlock after writing
- */
-static inline void __up_write(struct rw_semaphore *sem)
-{
-	DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
-	rwsem_clear_owner(sem);
-	if (unlikely(atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED,
-			&sem->count) & RWSEM_FLAG_WAITERS))
-		rwsem_wake(sem);
-}
-
-/*
- * downgrade write lock to read lock
- */
-static inline void __downgrade_write(struct rw_semaphore *sem)
-{
-	long tmp;
-
-	/*
-	 * When downgrading from exclusive to shared ownership,
-	 * anything inside the write-locked region cannot leak
-	 * into the read side. In contrast, anything in the
-	 * read-locked region is ok to be re-ordered into the
-	 * write side. As such, rely on RELEASE semantics.
-	 */
-	DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
-	tmp = atomic_long_fetch_add_release(
-		-RWSEM_WRITER_LOCKED+RWSEM_READER_BIAS, &sem->count);
-	rwsem_set_reader_owned(sem);
-	if (tmp & RWSEM_FLAG_WAITERS)
-		rwsem_downgrade_wake(sem);
-}
+#endif /* __INTERNAL_RWSEM_H */
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v8 05/19] locking/rwsem: Code cleanup after files merging
  2019-05-20 20:58 [PATCH v8 00/19] locking/rwsem: Rwsem rearchitecture part 2 Waiman Long
                   ` (3 preceding siblings ...)
  2019-05-20 20:59 ` [PATCH v8 04/19] locking/rwsem: Merge rwsem.h and rwsem-xadd.c into rwsem.c Waiman Long
@ 2019-05-20 20:59 ` Waiman Long
  2019-06-17 14:26   ` [tip:locking/core] " tip-bot for Waiman Long
  2019-05-20 20:59 ` [PATCH v8 06/19] locking/rwsem: Make rwsem_spin_on_owner() return owner state Waiman Long
                   ` (13 subsequent siblings)
  18 siblings, 1 reply; 83+ messages in thread
From: Waiman Long @ 2019-05-20 20:59 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin
  Cc: linux-kernel, x86, Davidlohr Bueso, Linus Torvalds, Tim Chen,
	huang ying, Waiman Long

After merging all the relevant rwsem code into one single file, there
are a number of optimizations and cleanups that can be done:

 1) Remove all the EXPORT_SYMBOL() calls for functions that are not
    accessed elsewhere.
 2) Remove all the __visible tags as none of the functions will be
    called from assembly code anymore.
 3) Make all the internal functions static.
 4) Remove some unneeded blank lines.
 5) Remove the intermediate rwsem_down_{read|write}_failed*() functions
    and rename __rwsem_down_{read|write}_failed_common() to
    rwsem_down_{read|write}_slowpath().
 6) Remove "__" prefix of __rwsem_mark_wake().
 7) Use atomic_long_try_cmpxchg_acquire() as much as possible.
 8) Remove the rwsem_rtrylock and rwsem_wtrylock lock events as they
    are not that useful.

That enables the compiler to do better optimization and reduce code
size. The text+data size of rwsem.o on an x86-64 machine with gcc8 was
reduced from 10237 bytes to 5030 bytes with this change.
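
As a side note on item 7 above, the main benefit of the try_cmpxchg()
form is that a failed compare-exchange writes the current value back
into the "expected" variable, so the retry loop needs no separate
atomic reload. Below is a minimal user-space sketch of the same idiom
using C11 atomics; the names and the single-bit count word are
illustrative only, not the kernel's atomic_long_* API:

#include <stdatomic.h>
#include <stdbool.h>

#define WRITER_LOCKED	(1UL << 0)

static _Atomic unsigned long count;	/* stand-in for sem->count */

static bool write_trylock_sketch(void)
{
	unsigned long old = 0;		/* expect an unlocked count */

	do {
		/* On failure, 'old' is refreshed by the CAS itself. */
		if (atomic_compare_exchange_weak_explicit(&count, &old,
				old | WRITER_LOCKED,
				memory_order_acquire, memory_order_relaxed))
			return true;
	} while (!(old & WRITER_LOCKED));	/* give up if already locked */

	return false;
}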

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/lock_events_list.h |   2 -
 kernel/locking/rwsem.c            | 135 ++++++++++--------------------
 2 files changed, 42 insertions(+), 95 deletions(-)

diff --git a/kernel/locking/lock_events_list.h b/kernel/locking/lock_events_list.h
index ad7668cfc9da..11187a1d40b8 100644
--- a/kernel/locking/lock_events_list.h
+++ b/kernel/locking/lock_events_list.h
@@ -61,7 +61,5 @@ LOCK_EVENT(rwsem_opt_fail)	/* # of failed opt-spinnings		*/
 LOCK_EVENT(rwsem_rlock)		/* # of read locks acquired		*/
 LOCK_EVENT(rwsem_rlock_fast)	/* # of fast read locks acquired	*/
 LOCK_EVENT(rwsem_rlock_fail)	/* # of failed read lock acquisitions	*/
-LOCK_EVENT(rwsem_rtrylock)	/* # of read trylock calls		*/
 LOCK_EVENT(rwsem_wlock)		/* # of write locks acquired		*/
 LOCK_EVENT(rwsem_wlock_fail)	/* # of failed write lock acquisitions	*/
-LOCK_EVENT(rwsem_wtrylock)	/* # of write trylock calls		*/
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 8317bcdf063b..f56329240ef1 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -205,7 +205,6 @@ void __init_rwsem(struct rw_semaphore *sem, const char *name,
 	osq_lock_init(&sem->osq);
 #endif
 }
-
 EXPORT_SYMBOL(__init_rwsem);
 
 enum rwsem_waiter_type {
@@ -237,9 +236,9 @@ enum rwsem_wake_type {
  * - woken process blocks are discarded from the list after having task zeroed
  * - writers are only marked woken if downgrading is false
  */
-static void __rwsem_mark_wake(struct rw_semaphore *sem,
-			      enum rwsem_wake_type wake_type,
-			      struct wake_q_head *wake_q)
+static void rwsem_mark_wake(struct rw_semaphore *sem,
+			    enum rwsem_wake_type wake_type,
+			    struct wake_q_head *wake_q)
 {
 	struct rwsem_waiter *waiter, *tmp;
 	long oldcount, woken = 0, adjustment = 0;
@@ -330,7 +329,7 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
 
 		/*
 		 * Ensure calling get_task_struct() before setting the reader
-		 * waiter to nil such that rwsem_down_read_failed() cannot
+		 * waiter to nil such that rwsem_down_read_slowpath() cannot
 		 * race with do_exit() by always holding a reference count
 		 * to the task to wakeup.
 		 */
@@ -516,8 +515,8 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 /*
  * Wait for the read lock to be granted
  */
-static inline struct rw_semaphore __sched *
-__rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
+static struct rw_semaphore __sched *
+rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
 {
 	long count, adjustment = -RWSEM_READER_BIAS;
 	struct rwsem_waiter waiter;
@@ -555,7 +554,7 @@ __rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
 	 */
 	if (!(count & RWSEM_LOCK_MASK) ||
 	   (!(count & RWSEM_WRITER_MASK) && (adjustment & RWSEM_FLAG_WAITERS)))
-		__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
+		rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
 
 	raw_spin_unlock_irq(&sem->wait_lock);
 	wake_up_q(&wake_q);
@@ -589,25 +588,11 @@ __rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
 	return ERR_PTR(-EINTR);
 }
 
-__visible struct rw_semaphore * __sched
-rwsem_down_read_failed(struct rw_semaphore *sem)
-{
-	return __rwsem_down_read_failed_common(sem, TASK_UNINTERRUPTIBLE);
-}
-EXPORT_SYMBOL(rwsem_down_read_failed);
-
-__visible struct rw_semaphore * __sched
-rwsem_down_read_failed_killable(struct rw_semaphore *sem)
-{
-	return __rwsem_down_read_failed_common(sem, TASK_KILLABLE);
-}
-EXPORT_SYMBOL(rwsem_down_read_failed_killable);
-
 /*
  * Wait until we successfully acquire the write lock
  */
-static inline struct rw_semaphore *
-__rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
+static struct rw_semaphore *
+rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 {
 	long count;
 	bool waiting = true; /* any queued threads before us */
@@ -646,7 +631,7 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
 		 */
 		if (!(count & RWSEM_WRITER_MASK) &&
 		     (count & RWSEM_READER_MASK)) {
-			__rwsem_mark_wake(sem, RWSEM_WAKE_READERS, &wake_q);
+			rwsem_mark_wake(sem, RWSEM_WAKE_READERS, &wake_q);
 			/*
 			 * The wakeup is normally called _after_ the wait_lock
 			 * is released, but given that we are proactively waking
@@ -700,7 +685,7 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
 	if (list_empty(&sem->wait_list))
 		atomic_long_andnot(RWSEM_FLAG_WAITERS, &sem->count);
 	else
-		__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
+		rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
 	raw_spin_unlock_irq(&sem->wait_lock);
 	wake_up_q(&wake_q);
 	lockevent_inc(rwsem_wlock_fail);
@@ -708,26 +693,11 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
 	return ERR_PTR(-EINTR);
 }
 
-__visible struct rw_semaphore * __sched
-rwsem_down_write_failed(struct rw_semaphore *sem)
-{
-	return __rwsem_down_write_failed_common(sem, TASK_UNINTERRUPTIBLE);
-}
-EXPORT_SYMBOL(rwsem_down_write_failed);
-
-__visible struct rw_semaphore * __sched
-rwsem_down_write_failed_killable(struct rw_semaphore *sem)
-{
-	return __rwsem_down_write_failed_common(sem, TASK_KILLABLE);
-}
-EXPORT_SYMBOL(rwsem_down_write_failed_killable);
-
 /*
  * handle waking up a waiter on the semaphore
  * - up_read/up_write has decremented the active part of count if we come here
  */
-__visible
-struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
+static struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
 {
 	unsigned long flags;
 	DEFINE_WAKE_Q(wake_q);
@@ -735,22 +705,20 @@ struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
 	raw_spin_lock_irqsave(&sem->wait_lock, flags);
 
 	if (!list_empty(&sem->wait_list))
-		__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
+		rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
 
 	raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
 	wake_up_q(&wake_q);
 
 	return sem;
 }
-EXPORT_SYMBOL(rwsem_wake);
 
 /*
  * downgrade a write lock into a read lock
  * - caller incremented waiting part of count and discovered it still negative
  * - just wake up any readers at the front of the queue
  */
-__visible
-struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem)
+static struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem)
 {
 	unsigned long flags;
 	DEFINE_WAKE_Q(wake_q);
@@ -758,14 +726,13 @@ struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem)
 	raw_spin_lock_irqsave(&sem->wait_lock, flags);
 
 	if (!list_empty(&sem->wait_list))
-		__rwsem_mark_wake(sem, RWSEM_WAKE_READ_OWNED, &wake_q);
+		rwsem_mark_wake(sem, RWSEM_WAKE_READ_OWNED, &wake_q);
 
 	raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
 	wake_up_q(&wake_q);
 
 	return sem;
 }
-EXPORT_SYMBOL(rwsem_downgrade_wake);
 
 /*
  * lock for reading
@@ -774,7 +741,7 @@ inline void __down_read(struct rw_semaphore *sem)
 {
 	if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
 			&sem->count) & RWSEM_READ_FAILED_MASK)) {
-		rwsem_down_read_failed(sem);
+		rwsem_down_read_slowpath(sem, TASK_UNINTERRUPTIBLE);
 		DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
 					RWSEM_READER_OWNED), sem);
 	} else {
@@ -786,7 +753,7 @@ static inline int __down_read_killable(struct rw_semaphore *sem)
 {
 	if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
 			&sem->count) & RWSEM_READ_FAILED_MASK)) {
-		if (IS_ERR(rwsem_down_read_failed_killable(sem)))
+		if (IS_ERR(rwsem_down_read_slowpath(sem, TASK_KILLABLE)))
 			return -EINTR;
 		DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
 					RWSEM_READER_OWNED), sem);
@@ -803,7 +770,6 @@ static inline int __down_read_trylock(struct rw_semaphore *sem)
 	 */
 	long tmp = RWSEM_UNLOCKED_VALUE;
 
-	lockevent_inc(rwsem_rtrylock);
 	do {
 		if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
 					tmp + RWSEM_READER_BIAS)) {
@@ -819,30 +785,33 @@ static inline int __down_read_trylock(struct rw_semaphore *sem)
  */
 static inline void __down_write(struct rw_semaphore *sem)
 {
-	if (unlikely(atomic_long_cmpxchg_acquire(&sem->count, 0,
-						 RWSEM_WRITER_LOCKED)))
-		rwsem_down_write_failed(sem);
+	long tmp = RWSEM_UNLOCKED_VALUE;
+
+	if (unlikely(!atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
+						      RWSEM_WRITER_LOCKED)))
+		rwsem_down_write_slowpath(sem, TASK_UNINTERRUPTIBLE);
 	rwsem_set_owner(sem);
 }
 
 static inline int __down_write_killable(struct rw_semaphore *sem)
 {
-	if (unlikely(atomic_long_cmpxchg_acquire(&sem->count, 0,
-						 RWSEM_WRITER_LOCKED)))
-		if (IS_ERR(rwsem_down_write_failed_killable(sem)))
+	long tmp = RWSEM_UNLOCKED_VALUE;
+
+	if (unlikely(!atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
+						      RWSEM_WRITER_LOCKED))) {
+		if (IS_ERR(rwsem_down_write_slowpath(sem, TASK_KILLABLE)))
 			return -EINTR;
+	}
 	rwsem_set_owner(sem);
 	return 0;
 }
 
 static inline int __down_write_trylock(struct rw_semaphore *sem)
 {
-	long tmp;
+	long tmp = RWSEM_UNLOCKED_VALUE;
 
-	lockevent_inc(rwsem_wtrylock);
-	tmp = atomic_long_cmpxchg_acquire(&sem->count, RWSEM_UNLOCKED_VALUE,
-					  RWSEM_WRITER_LOCKED);
-	if (tmp == RWSEM_UNLOCKED_VALUE) {
+	if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
+					    RWSEM_WRITER_LOCKED)) {
 		rwsem_set_owner(sem);
 		return true;
 	}
@@ -856,12 +825,11 @@ inline void __up_read(struct rw_semaphore *sem)
 {
 	long tmp;
 
-	DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner & RWSEM_READER_OWNED),
-				sem);
+	DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner & RWSEM_READER_OWNED), sem);
 	rwsem_clear_reader_owned(sem);
 	tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
-	if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS))
-			== RWSEM_FLAG_WAITERS))
+	if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS)) ==
+		      RWSEM_FLAG_WAITERS))
 		rwsem_wake(sem);
 }
 
@@ -870,10 +838,12 @@ inline void __up_read(struct rw_semaphore *sem)
  */
 static inline void __up_write(struct rw_semaphore *sem)
 {
+	long tmp;
+
 	DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
 	rwsem_clear_owner(sem);
-	if (unlikely(atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED,
-			&sem->count) & RWSEM_FLAG_WAITERS))
+	tmp = atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED, &sem->count);
+	if (unlikely(tmp & RWSEM_FLAG_WAITERS))
 		rwsem_wake(sem);
 }
 
@@ -909,7 +879,6 @@ void __sched down_read(struct rw_semaphore *sem)
 
 	LOCK_CONTENDED(sem, __down_read_trylock, __down_read);
 }
-
 EXPORT_SYMBOL(down_read);
 
 int __sched down_read_killable(struct rw_semaphore *sem)
@@ -924,7 +893,6 @@ int __sched down_read_killable(struct rw_semaphore *sem)
 
 	return 0;
 }
-
 EXPORT_SYMBOL(down_read_killable);
 
 /*
@@ -938,7 +906,6 @@ int down_read_trylock(struct rw_semaphore *sem)
 		rwsem_acquire_read(&sem->dep_map, 0, 1, _RET_IP_);
 	return ret;
 }
-
 EXPORT_SYMBOL(down_read_trylock);
 
 /*
@@ -948,10 +915,8 @@ void __sched down_write(struct rw_semaphore *sem)
 {
 	might_sleep();
 	rwsem_acquire(&sem->dep_map, 0, 0, _RET_IP_);
-
 	LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
 }
-
 EXPORT_SYMBOL(down_write);
 
 /*
@@ -962,14 +927,14 @@ int __sched down_write_killable(struct rw_semaphore *sem)
 	might_sleep();
 	rwsem_acquire(&sem->dep_map, 0, 0, _RET_IP_);
 
-	if (LOCK_CONTENDED_RETURN(sem, __down_write_trylock, __down_write_killable)) {
+	if (LOCK_CONTENDED_RETURN(sem, __down_write_trylock,
+				  __down_write_killable)) {
 		rwsem_release(&sem->dep_map, 1, _RET_IP_);
 		return -EINTR;
 	}
 
 	return 0;
 }
-
 EXPORT_SYMBOL(down_write_killable);
 
 /*
@@ -984,7 +949,6 @@ int down_write_trylock(struct rw_semaphore *sem)
 
 	return ret;
 }
-
 EXPORT_SYMBOL(down_write_trylock);
 
 /*
@@ -993,10 +957,8 @@ EXPORT_SYMBOL(down_write_trylock);
 void up_read(struct rw_semaphore *sem)
 {
 	rwsem_release(&sem->dep_map, 1, _RET_IP_);
-
 	__up_read(sem);
 }
-
 EXPORT_SYMBOL(up_read);
 
 /*
@@ -1005,10 +967,8 @@ EXPORT_SYMBOL(up_read);
 void up_write(struct rw_semaphore *sem)
 {
 	rwsem_release(&sem->dep_map, 1, _RET_IP_);
-
 	__up_write(sem);
 }
-
 EXPORT_SYMBOL(up_write);
 
 /*
@@ -1017,10 +977,8 @@ EXPORT_SYMBOL(up_write);
 void downgrade_write(struct rw_semaphore *sem)
 {
 	lock_downgrade(&sem->dep_map, _RET_IP_);
-
 	__downgrade_write(sem);
 }
-
 EXPORT_SYMBOL(downgrade_write);
 
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
@@ -1029,40 +987,32 @@ void down_read_nested(struct rw_semaphore *sem, int subclass)
 {
 	might_sleep();
 	rwsem_acquire_read(&sem->dep_map, subclass, 0, _RET_IP_);
-
 	LOCK_CONTENDED(sem, __down_read_trylock, __down_read);
 }
-
 EXPORT_SYMBOL(down_read_nested);
 
 void _down_write_nest_lock(struct rw_semaphore *sem, struct lockdep_map *nest)
 {
 	might_sleep();
 	rwsem_acquire_nest(&sem->dep_map, 0, 0, nest, _RET_IP_);
-
 	LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
 }
-
 EXPORT_SYMBOL(_down_write_nest_lock);
 
 void down_read_non_owner(struct rw_semaphore *sem)
 {
 	might_sleep();
-
 	__down_read(sem);
 	__rwsem_set_reader_owned(sem, NULL);
 }
-
 EXPORT_SYMBOL(down_read_non_owner);
 
 void down_write_nested(struct rw_semaphore *sem, int subclass)
 {
 	might_sleep();
 	rwsem_acquire(&sem->dep_map, subclass, 0, _RET_IP_);
-
 	LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
 }
-
 EXPORT_SYMBOL(down_write_nested);
 
 int __sched down_write_killable_nested(struct rw_semaphore *sem, int subclass)
@@ -1070,14 +1020,14 @@ int __sched down_write_killable_nested(struct rw_semaphore *sem, int subclass)
 	might_sleep();
 	rwsem_acquire(&sem->dep_map, subclass, 0, _RET_IP_);
 
-	if (LOCK_CONTENDED_RETURN(sem, __down_write_trylock, __down_write_killable)) {
+	if (LOCK_CONTENDED_RETURN(sem, __down_write_trylock,
+				  __down_write_killable)) {
 		rwsem_release(&sem->dep_map, 1, _RET_IP_);
 		return -EINTR;
 	}
 
 	return 0;
 }
-
 EXPORT_SYMBOL(down_write_killable_nested);
 
 void up_read_non_owner(struct rw_semaphore *sem)
@@ -1086,7 +1036,6 @@ void up_read_non_owner(struct rw_semaphore *sem)
 				sem);
 	__up_read(sem);
 }
-
 EXPORT_SYMBOL(up_read_non_owner);
 
 #endif
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v8 06/19] locking/rwsem: Make rwsem_spin_on_owner() return owner state
  2019-05-20 20:58 [PATCH v8 00/19] locking/rwsem: Rwsem rearchitecture part 2 Waiman Long
                   ` (4 preceding siblings ...)
  2019-05-20 20:59 ` [PATCH v8 05/19] locking/rwsem: Code cleanup after files merging Waiman Long
@ 2019-05-20 20:59 ` Waiman Long
  2019-06-17 14:27   ` [tip:locking/core] " tip-bot for Waiman Long
  2019-05-20 20:59 ` [PATCH v8 07/19] locking/rwsem: Implement lock handoff to prevent lock starvation Waiman Long
                   ` (12 subsequent siblings)
  18 siblings, 1 reply; 83+ messages in thread
From: Waiman Long @ 2019-05-20 20:59 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin
  Cc: linux-kernel, x86, Davidlohr Bueso, Linus Torvalds, Tim Chen,
	huang ying, Waiman Long

This patch modifies rwsem_spin_on_owner() to return four possible
values to better reflect the state of the lock holder, which enables us
to make a better decision about what to do next.
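
For illustration, encoding each state as a separate bit (as done in the
enum added below) lets the caller accept more than one owner state with
a single mask test. A trivial stand-alone sketch of that idiom, not the
kernel code itself:

#include <stdio.h>

enum owner_state {
	OWNER_NULL		= 1 << 0,
	OWNER_WRITER		= 1 << 1,
	OWNER_READER		= 1 << 2,
	OWNER_NONSPINNABLE	= 1 << 3,
};
#define OWNER_SPINNABLE		(OWNER_NULL | OWNER_WRITER)

int main(void)
{
	enum owner_state states[] = {
		OWNER_NULL, OWNER_WRITER, OWNER_READER, OWNER_NONSPINNABLE
	};

	/* One mask test covers both states that allow further spinning. */
	for (int i = 0; i < 4; i++)
		printf("state %d: %s\n", states[i],
		       (states[i] & OWNER_SPINNABLE) ? "keep spinning"
						     : "stop spinning");
	return 0;
}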

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/rwsem.c | 65 ++++++++++++++++++++++++++++++------------
 1 file changed, 47 insertions(+), 18 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index f56329240ef1..8d0f2acfe13d 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -414,17 +414,54 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
 }
 
 /*
- * Return true only if we can still spin on the owner field of the rwsem.
+ * The rwsem_spin_on_owner() function returns the following 4 values
+ * depending on the lock owner state.
+ *   OWNER_NULL  : owner is currently NULL
+ *   OWNER_WRITER: when owner changes and is a writer
+ *   OWNER_READER: when owner changes and the new owner may be a reader.
+ *   OWNER_NONSPINNABLE:
+ *		   when optimistic spinning has to stop because either the
+ *		   owner stops running, is unknown, or its timeslice has
+ *		   been used up.
  */
-static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem)
+enum owner_state {
+	OWNER_NULL		= 1 << 0,
+	OWNER_WRITER		= 1 << 1,
+	OWNER_READER		= 1 << 2,
+	OWNER_NONSPINNABLE	= 1 << 3,
+};
+#define OWNER_SPINNABLE		(OWNER_NULL | OWNER_WRITER)
+
+static inline enum owner_state rwsem_owner_state(unsigned long owner)
 {
-	struct task_struct *owner = READ_ONCE(sem->owner);
+	if (!owner)
+		return OWNER_NULL;
 
-	if (!is_rwsem_owner_spinnable(owner))
-		return false;
+	if (owner & RWSEM_ANONYMOUSLY_OWNED)
+		return OWNER_NONSPINNABLE;
+
+	if (owner & RWSEM_READER_OWNED)
+		return OWNER_READER;
+
+	return OWNER_WRITER;
+}
+
+static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
+{
+	struct task_struct *tmp, *owner = READ_ONCE(sem->owner);
+	enum owner_state state = rwsem_owner_state((unsigned long)owner);
+
+	if (state != OWNER_WRITER)
+		return state;
 
 	rcu_read_lock();
-	while (owner && (READ_ONCE(sem->owner) == owner)) {
+	for (;;) {
+		tmp = READ_ONCE(sem->owner);
+		if (tmp != owner) {
+			state = rwsem_owner_state((unsigned long)tmp);
+			break;
+		}
+
 		/*
 		 * Ensure we emit the owner->on_cpu, dereference _after_
 		 * checking sem->owner still matches owner, if that fails,
@@ -433,24 +470,16 @@ static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem)
 		 */
 		barrier();
 
-		/*
-		 * abort spinning when need_resched or owner is not running or
-		 * owner's cpu is preempted.
-		 */
 		if (need_resched() || !owner_on_cpu(owner)) {
-			rcu_read_unlock();
-			return false;
+			state = OWNER_NONSPINNABLE;
+			break;
 		}
 
 		cpu_relax();
 	}
 	rcu_read_unlock();
 
-	/*
-	 * If there is a new owner or the owner is not set, we continue
-	 * spinning.
-	 */
-	return is_rwsem_owner_spinnable(READ_ONCE(sem->owner));
+	return state;
 }
 
 static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
@@ -473,7 +502,7 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 	 *  2) readers own the lock as we can't determine if they are
 	 *     actively running or not.
 	 */
-	while (rwsem_spin_on_owner(sem)) {
+	while (rwsem_spin_on_owner(sem) & OWNER_SPINNABLE) {
 		/*
 		 * Try to acquire the lock
 		 */
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v8 07/19] locking/rwsem: Implement lock handoff to prevent lock starvation
  2019-05-20 20:58 [PATCH v8 00/19] locking/rwsem: Rwsem rearchitecture part 2 Waiman Long
                   ` (5 preceding siblings ...)
  2019-05-20 20:59 ` [PATCH v8 06/19] locking/rwsem: Make rwsem_spin_on_owner() return owner state Waiman Long
@ 2019-05-20 20:59 ` Waiman Long
  2019-06-04  3:03   ` Yuyang Du
  2019-06-17 14:27   ` [tip:locking/core] " tip-bot for Waiman Long
  2019-05-20 20:59 ` [PATCH v8 08/19] locking/rwsem: Always release wait_lock before waking up tasks Waiman Long
                   ` (11 subsequent siblings)
  18 siblings, 2 replies; 83+ messages in thread
From: Waiman Long @ 2019-05-20 20:59 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin
  Cc: linux-kernel, x86, Davidlohr Bueso, Linus Torvalds, Tim Chen,
	huang ying, Waiman Long

Because of writer lock stealing, it is possible that a constant
stream of incoming writers will cause a waiting writer or reader to
wait indefinitely leading to lock starvation.

This patch implements a lock handoff mechanism to disable lock stealing
and force lock handoff to the first waiter or waiters (for readers)
in the queue after at least a 4ms waiting period, unless it is an RT
writer task which doesn't need to wait. The waiting period prevents the
handoff from discouraging lock stealing so much that performance suffers.

The setting and clearing of the handoff bit is serialized by the
wait_lock. So racing is not possible.
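
The following is a rough user-space sketch of that protocol, reduced to
a single atomic count word with C11 atomics. The constants, names and
timing code are assumptions for illustration only; the real patch uses
atomic_long_*, jiffies and, as noted above, only sets or clears the bit
with wait_lock held:

#include <stdatomic.h>
#include <stdbool.h>
#include <time.h>

#define LOCKED		(1UL << 0)
#define HANDOFF		(1UL << 2)

static _Atomic unsigned long count;

/* Lock stealing is only allowed while no handoff has been requested. */
static bool steal_lock(void)
{
	unsigned long old = atomic_load_explicit(&count, memory_order_relaxed);

	while (!(old & (LOCKED | HANDOFF))) {
		if (atomic_compare_exchange_weak_explicit(&count, &old,
				old | LOCKED,
				memory_order_acquire, memory_order_relaxed))
			return true;
	}
	return false;
}

/* First waiter: after ~4ms of waiting, request a handoff so that new
 * arrivals can no longer steal the lock ahead of it. */
static void maybe_request_handoff(const struct timespec *enqueued)
{
	struct timespec now;

	clock_gettime(CLOCK_MONOTONIC, &now);
	if ((now.tv_sec - enqueued->tv_sec) * 1000000000L +
	    (now.tv_nsec - enqueued->tv_nsec) >= 4000000L)	/* 4ms */
		atomic_fetch_or_explicit(&count, HANDOFF,
					 memory_order_relaxed);
}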

A rwsem microbenchmark was run for 5 seconds on a 2-socket 40-core
80-thread Skylake system with a v5.1 based kernel and 240 write_lock
threads with 5us sleep critical section.

Before the patch, the min/mean/max numbers of locking operations for
the locking threads were 1/7,792/173,696. After the patch, the figures
became 5,842/6,542/7,458.  It can be seen that the rwsem became much
fairer, though there was a drop of about 16% in the mean number of
locking operations done, a tradeoff for the better fairness.

Making the waiter set the handoff bit right after the first wakeup can
impact performance especially with a mixed reader/writer workload. With
the same microbenchmark with short critical section and equal number of
reader and writer threads (40/40), the reader/writer locking operation
counts with the current patch were:

  40 readers, Iterations Min/Mean/Max = 1,793/1,794/1,796
  40 writers, Iterations Min/Mean/Max = 1,793/34,956/86,081

By making waiter set handoff bit immediately after wakeup:

  40 readers, Iterations Min/Mean/Max = 43/44/46
  40 writers, Iterations Min/Mean/Max = 43/1,263/3,191

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/lock_events_list.h |   2 +
 kernel/locking/rwsem.c            | 225 +++++++++++++++++++++++-------
 2 files changed, 173 insertions(+), 54 deletions(-)

diff --git a/kernel/locking/lock_events_list.h b/kernel/locking/lock_events_list.h
index 11187a1d40b8..634b47fd8b5e 100644
--- a/kernel/locking/lock_events_list.h
+++ b/kernel/locking/lock_events_list.h
@@ -61,5 +61,7 @@ LOCK_EVENT(rwsem_opt_fail)	/* # of failed opt-spinnings		*/
 LOCK_EVENT(rwsem_rlock)		/* # of read locks acquired		*/
 LOCK_EVENT(rwsem_rlock_fast)	/* # of fast read locks acquired	*/
 LOCK_EVENT(rwsem_rlock_fail)	/* # of failed read lock acquisitions	*/
+LOCK_EVENT(rwsem_rlock_handoff)	/* # of read lock handoffs		*/
 LOCK_EVENT(rwsem_wlock)		/* # of write locks acquired		*/
 LOCK_EVENT(rwsem_wlock_fail)	/* # of failed write lock acquisitions	*/
+LOCK_EVENT(rwsem_wlock_handoff)	/* # of write lock handoffs		*/
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 8d0f2acfe13d..0c8aef065acb 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -10,8 +10,9 @@
  * Optimistic spinning by Tim Chen <tim.c.chen@intel.com>
  * and Davidlohr Bueso <davidlohr@hp.com>. Based on mutexes.
  *
- * Rwsem count bit fields re-definition and rwsem rearchitecture
- * by Waiman Long <longman@redhat.com>.
+ * Rwsem count bit fields re-definition and rwsem rearchitecture by
+ * Waiman Long <longman@redhat.com> and
+ * Peter Zijlstra <peterz@infradead.org>.
  */
 
 #include <linux/types.h>
@@ -74,20 +75,33 @@
  *
  * Bit  0   - writer locked bit
  * Bit  1   - waiters present bit
- * Bits 2-7 - reserved
+ * Bit  2   - lock handoff bit
+ * Bits 3-7 - reserved
  * Bits 8-X - 24-bit (32-bit) or 56-bit reader count
  *
  * atomic_long_fetch_add() is used to obtain reader lock, whereas
  * atomic_long_cmpxchg() will be used to obtain writer lock.
+ *
+ * There are three places where the lock handoff bit may be set or cleared.
+ * 1) rwsem_mark_wake() for readers.
+ * 2) rwsem_try_write_lock() for writers.
+ * 3) Error path of rwsem_down_write_slowpath().
+ *
+ * For all the above cases, wait_lock will be held. A writer must also
+ * be the first one in the wait_list to be eligible for setting the handoff
+ * bit. So concurrent setting/clearing of handoff bit is not possible.
  */
 #define RWSEM_WRITER_LOCKED	(1UL << 0)
 #define RWSEM_FLAG_WAITERS	(1UL << 1)
+#define RWSEM_FLAG_HANDOFF	(1UL << 2)
+
 #define RWSEM_READER_SHIFT	8
 #define RWSEM_READER_BIAS	(1UL << RWSEM_READER_SHIFT)
 #define RWSEM_READER_MASK	(~(RWSEM_READER_BIAS - 1))
 #define RWSEM_WRITER_MASK	RWSEM_WRITER_LOCKED
 #define RWSEM_LOCK_MASK		(RWSEM_WRITER_MASK|RWSEM_READER_MASK)
-#define RWSEM_READ_FAILED_MASK	(RWSEM_WRITER_MASK|RWSEM_FLAG_WAITERS)
+#define RWSEM_READ_FAILED_MASK	(RWSEM_WRITER_MASK|RWSEM_FLAG_WAITERS|\
+				 RWSEM_FLAG_HANDOFF)
 
 /*
  * All writes to owner are protected by WRITE_ONCE() to make sure that
@@ -216,7 +230,10 @@ struct rwsem_waiter {
 	struct list_head list;
 	struct task_struct *task;
 	enum rwsem_waiter_type type;
+	unsigned long timeout;
 };
+#define rwsem_first_waiter(sem) \
+	list_first_entry(&sem->wait_list, struct rwsem_waiter, list)
 
 enum rwsem_wake_type {
 	RWSEM_WAKE_ANY,		/* Wake whatever's at head of wait list */
@@ -224,6 +241,19 @@ enum rwsem_wake_type {
 	RWSEM_WAKE_READ_OWNED	/* Waker thread holds the read lock */
 };
 
+enum writer_wait_state {
+	WRITER_NOT_FIRST,	/* Writer is not first in wait list */
+	WRITER_FIRST,		/* Writer is first in wait list     */
+	WRITER_HANDOFF		/* Writer is first & handoff needed */
+};
+
+/*
+ * The typical HZ value is either 250 or 1000. So set the minimum waiting
+ * time to at least 4ms or 1 jiffy (if it is higher than 4ms) in the wait
+ * queue before initiating the handoff protocol.
+ */
+#define RWSEM_WAIT_TIMEOUT	DIV_ROUND_UP(HZ, 250)
+
 /*
  * handle the lock release when processes blocked on it that can now run
  * - if we come here from up_xxxx(), then the RWSEM_FLAG_WAITERS bit must
@@ -244,11 +274,13 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
 	long oldcount, woken = 0, adjustment = 0;
 	struct list_head wlist;
 
+	lockdep_assert_held(&sem->wait_lock);
+
 	/*
 	 * Take a peek at the queue head waiter such that we can determine
 	 * the wakeup(s) to perform.
 	 */
-	waiter = list_first_entry(&sem->wait_list, struct rwsem_waiter, list);
+	waiter = rwsem_first_waiter(sem);
 
 	if (waiter->type == RWSEM_WAITING_FOR_WRITE) {
 		if (wake_type == RWSEM_WAKE_ANY) {
@@ -275,7 +307,18 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
 		adjustment = RWSEM_READER_BIAS;
 		oldcount = atomic_long_fetch_add(adjustment, &sem->count);
 		if (unlikely(oldcount & RWSEM_WRITER_MASK)) {
-			atomic_long_sub(adjustment, &sem->count);
+			/*
+			 * When we've been waiting "too" long (for writers
+			 * to give up the lock), request a HANDOFF to
+			 * force the issue.
+			 */
+			if (!(oldcount & RWSEM_FLAG_HANDOFF) &&
+			    time_after(jiffies, waiter->timeout)) {
+				adjustment -= RWSEM_FLAG_HANDOFF;
+				lockevent_inc(rwsem_rlock_handoff);
+			}
+
+			atomic_long_add(-adjustment, &sem->count);
 			return;
 		}
 		/*
@@ -317,6 +360,13 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
 		adjustment -= RWSEM_FLAG_WAITERS;
 	}
 
+	/*
+	 * When we've woken a reader, we no longer need to force writers
+	 * to give up the lock and we can clear HANDOFF.
+	 */
+	if (woken && (atomic_long_read(&sem->count) & RWSEM_FLAG_HANDOFF))
+		adjustment -= RWSEM_FLAG_HANDOFF;
+
 	if (adjustment)
 		atomic_long_add(adjustment, &sem->count);
 
@@ -346,23 +396,48 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
  * This function must be called with the sem->wait_lock held to prevent
  * race conditions between checking the rwsem wait list and setting the
  * sem->count accordingly.
+ *
+ * If wstate is WRITER_HANDOFF, it will make sure that either the handoff
+ * bit is set or the lock is acquired with handoff bit cleared.
  */
-static inline bool rwsem_try_write_lock(long count, struct rw_semaphore *sem)
+static inline bool rwsem_try_write_lock(long count, struct rw_semaphore *sem,
+					enum writer_wait_state wstate)
 {
 	long new;
 
-	if (count & RWSEM_LOCK_MASK)
-		return false;
+	lockdep_assert_held(&sem->wait_lock);
 
-	new = count + RWSEM_WRITER_LOCKED -
-	     (list_is_singular(&sem->wait_list) ? RWSEM_FLAG_WAITERS : 0);
+	do {
+		bool has_handoff = !!(count & RWSEM_FLAG_HANDOFF);
 
-	if (atomic_long_try_cmpxchg_acquire(&sem->count, &count, new)) {
-		rwsem_set_owner(sem);
-		return true;
-	}
+		if (has_handoff && wstate == WRITER_NOT_FIRST)
+			return false;
 
-	return false;
+		new = count;
+
+		if (count & RWSEM_LOCK_MASK) {
+			if (has_handoff || (wstate != WRITER_HANDOFF))
+				return false;
+
+			new |= RWSEM_FLAG_HANDOFF;
+		} else {
+			new |= RWSEM_WRITER_LOCKED;
+			new &= ~RWSEM_FLAG_HANDOFF;
+
+			if (list_is_singular(&sem->wait_list))
+				new &= ~RWSEM_FLAG_WAITERS;
+		}
+	} while (!atomic_long_try_cmpxchg_acquire(&sem->count, &count, new));
+
+	/*
+	 * We have either acquired the lock with handoff bit cleared or
+	 * set the handoff bit.
+	 */
+	if (new & RWSEM_FLAG_HANDOFF)
+		return false;
+
+	rwsem_set_owner(sem);
+	return true;
 }
 
 #ifdef CONFIG_RWSEM_SPIN_ON_OWNER
@@ -373,9 +448,9 @@ static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem)
 {
 	long count = atomic_long_read(&sem->count);
 
-	while (!(count & RWSEM_LOCK_MASK)) {
+	while (!(count & (RWSEM_LOCK_MASK|RWSEM_FLAG_HANDOFF))) {
 		if (atomic_long_try_cmpxchg_acquire(&sem->count, &count,
-					count + RWSEM_WRITER_LOCKED)) {
+					count | RWSEM_WRITER_LOCKED)) {
 			rwsem_set_owner(sem);
 			lockevent_inc(rwsem_opt_wlock);
 			return true;
@@ -456,6 +531,11 @@ static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
 
 	rcu_read_lock();
 	for (;;) {
+		if (atomic_long_read(&sem->count) & RWSEM_FLAG_HANDOFF) {
+			state = OWNER_NONSPINNABLE;
+			break;
+		}
+
 		tmp = READ_ONCE(sem->owner);
 		if (tmp != owner) {
 			state = rwsem_owner_state((unsigned long)tmp);
@@ -553,16 +633,18 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
 
 	waiter.task = current;
 	waiter.type = RWSEM_WAITING_FOR_READ;
+	waiter.timeout = jiffies + RWSEM_WAIT_TIMEOUT;
 
 	raw_spin_lock_irq(&sem->wait_lock);
 	if (list_empty(&sem->wait_list)) {
 		/*
 		 * In case the wait queue is empty and the lock isn't owned
-		 * by a writer, this reader can exit the slowpath and return
-		 * immediately as its RWSEM_READER_BIAS has already been
-		 * set in the count.
+		 * by a writer or has the handoff bit set, this reader can
+		 * exit the slowpath and return immediately as its
+		 * RWSEM_READER_BIAS has already been set in the count.
 		 */
-		if (!(atomic_long_read(&sem->count) & RWSEM_WRITER_MASK)) {
+		if (!(atomic_long_read(&sem->count) &
+		     (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
 			raw_spin_unlock_irq(&sem->wait_lock);
 			rwsem_set_reader_owned(sem);
 			lockevent_inc(rwsem_rlock_fast);
@@ -609,8 +691,10 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
 	return sem;
 out_nolock:
 	list_del(&waiter.list);
-	if (list_empty(&sem->wait_list))
-		atomic_long_andnot(RWSEM_FLAG_WAITERS, &sem->count);
+	if (list_empty(&sem->wait_list)) {
+		atomic_long_andnot(RWSEM_FLAG_WAITERS|RWSEM_FLAG_HANDOFF,
+				   &sem->count);
+	}
 	raw_spin_unlock_irq(&sem->wait_lock);
 	__set_current_state(TASK_RUNNING);
 	lockevent_inc(rwsem_rlock_fail);
@@ -624,7 +708,7 @@ static struct rw_semaphore *
 rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 {
 	long count;
-	bool waiting = true; /* any queued threads before us */
+	enum writer_wait_state wstate;
 	struct rwsem_waiter waiter;
 	struct rw_semaphore *ret = sem;
 	DEFINE_WAKE_Q(wake_q);
@@ -639,66 +723,95 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 	 */
 	waiter.task = current;
 	waiter.type = RWSEM_WAITING_FOR_WRITE;
+	waiter.timeout = jiffies + RWSEM_WAIT_TIMEOUT;
 
 	raw_spin_lock_irq(&sem->wait_lock);
 
 	/* account for this before adding a new element to the list */
-	if (list_empty(&sem->wait_list))
-		waiting = false;
+	wstate = list_empty(&sem->wait_list) ? WRITER_FIRST : WRITER_NOT_FIRST;
 
 	list_add_tail(&waiter.list, &sem->wait_list);
 
 	/* we're now waiting on the lock */
-	if (waiting) {
+	if (wstate == WRITER_NOT_FIRST) {
 		count = atomic_long_read(&sem->count);
 
 		/*
-		 * If there were already threads queued before us and there are
-		 * no active writers and some readers, the lock must be read
-		 * owned; so we try to  any read locks that were queued ahead
-		 * of us.
+		 * If there were already threads queued before us and:
+		 *  1) there are no active locks, wake the front
+		 *     queued process(es) as the handoff bit might be set.
+		 *  2) there are no active writers and some readers, the lock
+		 *     must be read owned; so we try to wake any read lock
+		 *     waiters that were queued ahead of us.
 		 */
-		if (!(count & RWSEM_WRITER_MASK) &&
-		     (count & RWSEM_READER_MASK)) {
-			rwsem_mark_wake(sem, RWSEM_WAKE_READERS, &wake_q);
-			/*
-			 * The wakeup is normally called _after_ the wait_lock
-			 * is released, but given that we are proactively waking
-			 * readers we can deal with the wake_q overhead as it is
-			 * similar to releasing and taking the wait_lock again
-			 * for attempting rwsem_try_write_lock().
-			 */
-			wake_up_q(&wake_q);
+		if (count & RWSEM_WRITER_MASK)
+			goto wait;
 
-			/*
-			 * Reinitialize wake_q after use.
-			 */
-			wake_q_init(&wake_q);
-		}
+		rwsem_mark_wake(sem, (count & RWSEM_READER_MASK)
+					? RWSEM_WAKE_READERS
+					: RWSEM_WAKE_ANY, &wake_q);
 
+		/*
+		 * The wakeup is normally called _after_ the wait_lock
+		 * is released, but given that we are proactively waking
+		 * readers we can deal with the wake_q overhead as it is
+		 * similar to releasing and taking the wait_lock again
+		 * for attempting rwsem_try_write_lock().
+		 */
+		wake_up_q(&wake_q);
+
+		/* We need wake_q again below, reinitialize */
+		wake_q_init(&wake_q);
 	} else {
 		count = atomic_long_add_return(RWSEM_FLAG_WAITERS, &sem->count);
 	}
 
+wait:
 	/* wait until we successfully acquire the lock */
 	set_current_state(state);
 	while (true) {
-		if (rwsem_try_write_lock(count, sem))
+		if (rwsem_try_write_lock(count, sem, wstate))
 			break;
+
 		raw_spin_unlock_irq(&sem->wait_lock);
 
 		/* Block until there are no active lockers. */
-		do {
+		for (;;) {
 			if (signal_pending_state(state, current))
 				goto out_nolock;
 
 			schedule();
 			lockevent_inc(rwsem_sleep_writer);
 			set_current_state(state);
+			/*
+			 * If HANDOFF bit is set, unconditionally do
+			 * a trylock.
+			 */
+			if (wstate == WRITER_HANDOFF)
+				break;
+
+			if ((wstate == WRITER_NOT_FIRST) &&
+			    (rwsem_first_waiter(sem) == &waiter))
+				wstate = WRITER_FIRST;
+
 			count = atomic_long_read(&sem->count);
-		} while (count & RWSEM_LOCK_MASK);
+			if (!(count & RWSEM_LOCK_MASK))
+				break;
+
+			/*
+			 * The setting of the handoff bit is deferred
+			 * until rwsem_try_write_lock() is called.
+			 */
+			if ((wstate == WRITER_FIRST) && (rt_task(current) ||
+			    time_after(jiffies, waiter.timeout))) {
+				wstate = WRITER_HANDOFF;
+				lockevent_inc(rwsem_wlock_handoff);
+				break;
+			}
+		}
 
 		raw_spin_lock_irq(&sem->wait_lock);
+		count = atomic_long_read(&sem->count);
 	}
 	__set_current_state(TASK_RUNNING);
 	list_del(&waiter.list);
@@ -711,6 +824,10 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 	__set_current_state(TASK_RUNNING);
 	raw_spin_lock_irq(&sem->wait_lock);
 	list_del(&waiter.list);
+
+	if (unlikely(wstate == WRITER_HANDOFF))
+		atomic_long_add(-RWSEM_FLAG_HANDOFF,  &sem->count);
+
 	if (list_empty(&sem->wait_list))
 		atomic_long_andnot(RWSEM_FLAG_WAITERS, &sem->count);
 	else
@@ -726,7 +843,7 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
  * handle waking up a waiter on the semaphore
  * - up_read/up_write has decremented the active part of count if we come here
  */
-static struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
+static struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem, long count)
 {
 	unsigned long flags;
 	DEFINE_WAKE_Q(wake_q);
@@ -859,7 +976,7 @@ inline void __up_read(struct rw_semaphore *sem)
 	tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
 	if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS)) ==
 		      RWSEM_FLAG_WAITERS))
-		rwsem_wake(sem);
+		rwsem_wake(sem, tmp);
 }
 
 /*
@@ -873,7 +990,7 @@ static inline void __up_write(struct rw_semaphore *sem)
 	rwsem_clear_owner(sem);
 	tmp = atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED, &sem->count);
 	if (unlikely(tmp & RWSEM_FLAG_WAITERS))
-		rwsem_wake(sem);
+		rwsem_wake(sem, tmp);
 }
 
 /*
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v8 08/19] locking/rwsem: Always release wait_lock before waking up tasks
  2019-05-20 20:58 [PATCH v8 00/19] locking/rwsem: Rwsem rearchitecture part 2 Waiman Long
                   ` (6 preceding siblings ...)
  2019-05-20 20:59 ` [PATCH v8 07/19] locking/rwsem: Implement lock handoff to prevent lock starvation Waiman Long
@ 2019-05-20 20:59 ` Waiman Long
  2019-06-17 14:28   ` [tip:locking/core] " tip-bot for Waiman Long
  2019-05-20 20:59 ` [PATCH v8 09/19] locking/rwsem: More optimal RT task handling of null owner Waiman Long
                   ` (10 subsequent siblings)
  18 siblings, 1 reply; 83+ messages in thread
From: Waiman Long @ 2019-05-20 20:59 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin
  Cc: linux-kernel, x86, Davidlohr Bueso, Linus Torvalds, Tim Chen,
	huang ying, Waiman Long

With the use of wake_q, we can do task wakeups without holding the
wait_lock. There is one exception in the rwsem code, though. It is
when the writer in the slowpath detects that there are waiters ahead
but the rwsem is not held by a writer. This can lead to a long wait_lock
hold time especially when a large number of readers are to be woken up.

Remediate this situation by releasing the wait_lock before waking
up tasks and re-acquiring it afterward. The rwsem_try_write_lock()
function is also modified to read the rwsem count directly to avoid
using a stale count value.
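
The shape of that change is the classic "collect under the lock, wake
up after dropping it" pattern. A small POSIX-threads sketch of the idea
follows; the helper names are hypothetical stand-ins for
rwsem_mark_wake() and wake_up_q(), not the kernel functions themselves:

#include <pthread.h>

static pthread_mutex_t wait_lock = PTHREAD_MUTEX_INITIALIZER;

struct wake_list { int nr; };	/* stand-in for a wake_q */

static void collect_waiters(struct wake_list *wl) { wl->nr = 0; }
static void wake_all(const struct wake_list *wl) { (void)wl; }

static void writer_slowpath_wakeup_sketch(void)
{
	struct wake_list wl;

	pthread_mutex_lock(&wait_lock);
	collect_waiters(&wl);		/* cheap: only builds a local list */
	pthread_mutex_unlock(&wait_lock);

	wake_all(&wl);			/* potentially slow; lock not held */

	pthread_mutex_lock(&wait_lock);	/* re-acquire before retrying */
	/* ... re-read the count and retry the write lock attempt here ... */
	pthread_mutex_unlock(&wait_lock);
}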

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/sched/wake_q.h |  5 +++++
 kernel/locking/rwsem.c       | 31 +++++++++++++++----------------
 2 files changed, 20 insertions(+), 16 deletions(-)

diff --git a/include/linux/sched/wake_q.h b/include/linux/sched/wake_q.h
index ad826d2a4557..26a2013ac39c 100644
--- a/include/linux/sched/wake_q.h
+++ b/include/linux/sched/wake_q.h
@@ -51,6 +51,11 @@ static inline void wake_q_init(struct wake_q_head *head)
 	head->lastp = &head->first;
 }
 
+static inline bool wake_q_empty(struct wake_q_head *head)
+{
+	return head->first == WAKE_Q_TAIL;
+}
+
 extern void wake_q_add(struct wake_q_head *head, struct task_struct *task);
 extern void wake_q_add_safe(struct wake_q_head *head, struct task_struct *task);
 extern void wake_up_q(struct wake_q_head *head);
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 0c8aef065acb..36aed5236bd2 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -400,13 +400,14 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
  * If wstate is WRITER_HANDOFF, it will make sure that either the handoff
  * bit is set or the lock is acquired with handoff bit cleared.
  */
-static inline bool rwsem_try_write_lock(long count, struct rw_semaphore *sem,
+static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
 					enum writer_wait_state wstate)
 {
-	long new;
+	long count, new;
 
 	lockdep_assert_held(&sem->wait_lock);
 
+	count = atomic_long_read(&sem->count);
 	do {
 		bool has_handoff = !!(count & RWSEM_FLAG_HANDOFF);
 
@@ -751,26 +752,25 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 					? RWSEM_WAKE_READERS
 					: RWSEM_WAKE_ANY, &wake_q);
 
-		/*
-		 * The wakeup is normally called _after_ the wait_lock
-		 * is released, but given that we are proactively waking
-		 * readers we can deal with the wake_q overhead as it is
-		 * similar to releasing and taking the wait_lock again
-		 * for attempting rwsem_try_write_lock().
-		 */
-		wake_up_q(&wake_q);
-
-		/* We need wake_q again below, reinitialize */
-		wake_q_init(&wake_q);
+		if (!wake_q_empty(&wake_q)) {
+			/*
+			 * We want to minimize wait_lock hold time especially
+			 * when a large number of readers are to be woken up.
+			 */
+			raw_spin_unlock_irq(&sem->wait_lock);
+			wake_up_q(&wake_q);
+			wake_q_init(&wake_q);	/* Used again, reinit */
+			raw_spin_lock_irq(&sem->wait_lock);
+		}
 	} else {
-		count = atomic_long_add_return(RWSEM_FLAG_WAITERS, &sem->count);
+		atomic_long_or(RWSEM_FLAG_WAITERS, &sem->count);
 	}
 
 wait:
 	/* wait until we successfully acquire the lock */
 	set_current_state(state);
 	while (true) {
-		if (rwsem_try_write_lock(count, sem, wstate))
+		if (rwsem_try_write_lock(sem, wstate))
 			break;
 
 		raw_spin_unlock_irq(&sem->wait_lock);
@@ -811,7 +811,6 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 		}
 
 		raw_spin_lock_irq(&sem->wait_lock);
-		count = atomic_long_read(&sem->count);
 	}
 	__set_current_state(TASK_RUNNING);
 	list_del(&waiter.list);
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v8 09/19] locking/rwsem: More optimal RT task handling of null owner
  2019-05-20 20:58 [PATCH v8 00/19] locking/rwsem: Rwsem rearchitecture part 2 Waiman Long
                   ` (7 preceding siblings ...)
  2019-05-20 20:59 ` [PATCH v8 08/19] locking/rwsem: Always release wait_lock before waking up tasks Waiman Long
@ 2019-05-20 20:59 ` Waiman Long
  2019-06-17 14:29   ` [tip:locking/core] " tip-bot for Waiman Long
  2019-05-20 20:59 ` [PATCH v8 10/19] locking/rwsem: Wake up almost all readers in wait queue Waiman Long
                   ` (9 subsequent siblings)
  18 siblings, 1 reply; 83+ messages in thread
From: Waiman Long @ 2019-05-20 20:59 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin
  Cc: linux-kernel, x86, Davidlohr Bueso, Linus Torvalds, Tim Chen,
	huang ying, Waiman Long

An RT task can do optimistic spinning only if the lock holder is
actually running. If the state of the lock holder isn't known, there
is a possibility that the high-priority RT task may block forward
progress of the lock holder if it happens to reside on the same CPU,
leading to deadlock. So we have to make sure that an RT task
will not spin on a reader-owned rwsem.

When the owner is temporarily set to NULL, there are two cases
where we may want to continue spinning:

 1) The lock owner is in the process of releasing the lock, sem->owner
    is cleared but the lock has not been released yet.

 2) The lock was free and owner cleared, but another task just comes
    in and acquires the lock before we try to get it. The new owner may
    be a spinnable writer.

So an RT task is now made to retry one more time to see if it can
acquire the lock or continue spinning on the new owning writer.
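
A hedged sketch of that decision rule is shown below, with stand-in
types and helpers rather than the kernel's owner_state, need_resched()
or rt_task():

enum owner_state { OWNER_NULL, OWNER_WRITER, OWNER_READER, OWNER_NONSPINNABLE };

static int need_resched_stub(void) { return 0; }	/* assumption */

static int keep_spinning(int is_rt_task, enum owner_state owner,
			 enum owner_state prev_owner)
{
	if (owner == OWNER_NONSPINNABLE)
		return 0;

	if (owner == OWNER_WRITER)
		return 1;	/* spinning on a running writer is fine */

	if (need_resched_stub())
		return 0;

	/*
	 * An RT task tolerates a non-writer owner only once: if the
	 * previous observation was already not a writer, give up so a
	 * preempted lock holder on this CPU cannot be live-locked.
	 */
	if (is_rt_task && prev_owner != OWNER_WRITER)
		return 0;

	return 1;
}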

When testing on an 8-socket IvyBridge-EX system, the one additional retry
seems to improve locking performance of RT write locking threads under
heavy contentions. The table below shows the locking rates (in kops/s)
with various write locking threads before and after the patch.

    Locking threads     Pre-patch     Post-patch
    ---------------     ---------     -----------
            4             2,753          2,608
            8             2,529          2,520
           16             1,727          1,918
           32             1,263          1,956
           64               889          1,343

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/rwsem.c | 51 ++++++++++++++++++++++++++++++++++++------
 1 file changed, 44 insertions(+), 7 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 36aed5236bd2..eb43201b89b4 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -566,6 +566,7 @@ static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
 static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 {
 	bool taken = false;
+	int prev_owner_state = OWNER_NULL;
 
 	preempt_disable();
 
@@ -583,7 +584,12 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 	 *  2) readers own the lock as we can't determine if they are
 	 *     actively running or not.
 	 */
-	while (rwsem_spin_on_owner(sem) & OWNER_SPINNABLE) {
+	for (;;) {
+		enum owner_state owner_state = rwsem_spin_on_owner(sem);
+
+		if (!(owner_state & OWNER_SPINNABLE))
+			break;
+
 		/*
 		 * Try to acquire the lock
 		 */
@@ -593,13 +599,44 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 		}
 
 		/*
-		 * When there's no owner, we might have preempted between the
-		 * owner acquiring the lock and setting the owner field. If
-		 * we're an RT task that will live-lock because we won't let
-		 * the owner complete.
+		 * An RT task cannot do optimistic spinning if it cannot
+		 * be sure the lock holder is running or live-lock may
+		 * happen if the current task and the lock holder happen
+		 * to run on the same CPU. However, aborting optimistic
+		 * spinning while a NULL owner is detected may miss some
+		 * opportunity where spinning can continue without causing
+		 * any problem.
+		 *
+		 * There are 2 possible cases where an RT task may be able
+		 * to continue spinning.
+		 *
+		 * 1) The lock owner is in the process of releasing the
+		 *    lock, sem->owner is cleared but the lock has not
+		 *    been released yet.
+		 * 2) The lock was free and owner cleared, but another
+		 *    task just comes in and acquires the lock before
+		 *    we try to get it. The new owner may be a spinnable
+		 *    writer.
+		 *
+		 * To take advantage of the two scenarios listed above, the RT
+		 * task is made to retry one more time to see if it can
+		 * acquire the lock or continue spinning on the new owning
+		 * writer. Of course, if the time lag is long enough or the
+		 * new owner is not a spinnable writer, the RT task will
+		 * quit spinning.
+		 *
+		 * If the owner is a writer, the need_resched() check is
+		 * done inside rwsem_spin_on_owner(). If the owner is not
+		 * a writer, need_resched() check needs to be done here.
 		 */
-		if (!sem->owner && (need_resched() || rt_task(current)))
-			break;
+		if (owner_state != OWNER_WRITER) {
+			if (need_resched())
+				break;
+			if (rt_task(current) &&
+			   (prev_owner_state != OWNER_WRITER))
+				break;
+		}
+		prev_owner_state = owner_state;
 
 		/*
 		 * The cpu_relax() call is a compiler barrier which forces
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v8 10/19] locking/rwsem: Wake up almost all readers in wait queue
  2019-05-20 20:58 [PATCH v8 00/19] locking/rwsem: Rwsem rearchitecture part 2 Waiman Long
                   ` (8 preceding siblings ...)
  2019-05-20 20:59 ` [PATCH v8 09/19] locking/rwsem: More optimal RT task handling of null owner Waiman Long
@ 2019-05-20 20:59 ` Waiman Long
  2019-06-17 14:29   ` [tip:locking/core] " tip-bot for Waiman Long
  2019-05-20 20:59 ` [PATCH v8 11/19] locking/rwsem: Clarify usage of owner's nonspinnable bit Waiman Long
                   ` (8 subsequent siblings)
  18 siblings, 1 reply; 83+ messages in thread
From: Waiman Long @ 2019-05-20 20:59 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin
  Cc: linux-kernel, x86, Davidlohr Bueso, Linus Torvalds, Tim Chen,
	huang ying, Waiman Long

When the front of the wait queue is a reader, other readers
immediately following the first reader will also be woken up at the
same time. However, if there is a writer in between, those readers
behind the writer will not be woken up.

Because of optimistic spinning, the lock acquisition order is not FIFO
anyway. The lock handoff mechanism will ensure that lock starvation
will not happen.

Assuming that the lock hold times of the other readers still in the
queue are about the same as those of the readers being woken up, there
is really not much additional cost other than the extra latency the
waker incurs in waking up more tasks. Therefore, up to a maximum of
256 readers in the queue are woken up when the first waiter is a
reader, to improve reader throughput. This is somewhat similar in
concept to a phase-fair R/W lock.
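
The wakeup scan in rwsem_mark_wake() then becomes, in essence (a
condensed view of the hunk below):

  INIT_LIST_HEAD(&wlist);
  list_for_each_entry_safe(waiter, tmp, &sem->wait_list, list) {
          if (waiter->type == RWSEM_WAITING_FOR_WRITE)
                  continue;       /* skip writers, keep scanning */

          woken++;
          list_move_tail(&waiter->list, &wlist);

          if (woken >= MAX_READERS_WAKEUP)
                  break;          /* cap the batch at 256 readers */
  }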

With a locking microbenchmark running on a 5.1-based kernel, the total
locking rates (in kops/s) on an 8-socket IvyBridge-EX system with
equal numbers of readers and writers before and after this patch were
as follows:

   # of Threads  Pre-Patch   Post-patch
   ------------  ---------   ----------
        4          1,641        1,674
        8            731        1,062
       16            564          924
       32             78          300
       64             38          195
      240             50          149

There is no performance gain at low contention levels. At high contention
levels, however, this patch gives a pretty decent performance boost.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/rwsem.c | 31 ++++++++++++++++++++++++++-----
 1 file changed, 26 insertions(+), 5 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index eb43201b89b4..b8e209c5fa55 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -254,6 +254,14 @@ enum writer_wait_state {
  */
 #define RWSEM_WAIT_TIMEOUT	DIV_ROUND_UP(HZ, 250)
 
+/*
+ * Magic number to batch-wakeup waiting readers, even when writers are
+ * also present in the queue. This both limits the amount of work the
+ * waking thread must do and also prevents any potential counter overflow,
+ * however unlikely.
+ */
+#define MAX_READERS_WAKEUP	0x100
+
 /*
  * handle the lock release when processes blocked on it that can now run
  * - if we come here from up_xxxx(), then the RWSEM_FLAG_WAITERS bit must
@@ -329,11 +337,17 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
 	}
 
 	/*
-	 * Grant an infinite number of read locks to the readers at the front
-	 * of the queue. We know that woken will be at least 1 as we accounted
+	 * Grant up to MAX_READERS_WAKEUP read locks to all the readers in the
+	 * queue. We know that woken will be at least 1 as we accounted
 	 * for above. Note we increment the 'active part' of the count by the
 	 * number of readers before waking any processes up.
 	 *
+	 * This is an adaptation of the phase-fair R/W locks where at the
+	 * reader phase (first waiter is a reader), all readers are eligible
+	 * to acquire the lock at the same time irrespective of their order
+	 * in the queue. The writers acquire the lock according to their
+	 * order in the queue.
+	 *
 	 * We have to do wakeup in 2 passes to prevent the possibility that
 	 * the reader count may be decremented before it is incremented. It
 	 * is because the to-be-woken waiter may not have slept yet. So it
@@ -345,13 +359,20 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
 	 * 2) For each waiters in the new list, clear waiter->task and
 	 *    put them into wake_q to be woken up later.
 	 */
-	list_for_each_entry(waiter, &sem->wait_list, list) {
+	INIT_LIST_HEAD(&wlist);
+	list_for_each_entry_safe(waiter, tmp, &sem->wait_list, list) {
 		if (waiter->type == RWSEM_WAITING_FOR_WRITE)
-			break;
+			continue;
 
 		woken++;
+		list_move_tail(&waiter->list, &wlist);
+
+		/*
+		 * Limit # of readers that can be woken up per wakeup call.
+		 */
+		if (woken >= MAX_READERS_WAKEUP)
+			break;
 	}
-	list_cut_before(&wlist, &sem->wait_list, &waiter->list);
 
 	adjustment = woken * RWSEM_READER_BIAS - adjustment;
 	lockevent_cond_inc(rwsem_wake_reader, woken);
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v8 11/19] locking/rwsem: Clarify usage of owner's nonspinnable bit
  2019-05-20 20:58 [PATCH v8 00/19] locking/rwsem: Rwsem rearchitecture part 2 Waiman Long
                   ` (9 preceding siblings ...)
  2019-05-20 20:59 ` [PATCH v8 10/19] locking/rwsem: Wake up almost all readers in wait queue Waiman Long
@ 2019-05-20 20:59 ` Waiman Long
  2019-06-17 14:30   ` [tip:locking/core] " tip-bot for Waiman Long
  2019-05-20 20:59 ` [PATCH v8 12/19] locking/rwsem: Enable readers spinning on writer Waiman Long
                   ` (7 subsequent siblings)
  18 siblings, 1 reply; 83+ messages in thread
From: Waiman Long @ 2019-05-20 20:59 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin
  Cc: linux-kernel, x86, Davidlohr Bueso, Linus Torvalds, Tim Chen,
	huang ying, Waiman Long

Bit 1 of sem->owner (RWSEM_ANONYMOUSLY_OWNED) is used to designate an
anonymous owner - readers or an anonymous writer. The setting of this
anonymous bit is used as an indicator that optimistic spinning cannot
be done on this rwsem.

With the upcoming reader optimistic spinning patches, a reader-owned
rwsem can be spun on for a limited period of time. We still need
this bit to indicate that a rwsem is nonspinnable, but a cleared bit
will no longer imply that the owner is known. So rename the bit
to RWSEM_NONSPINNABLE to clarify its meaning.

This patch also fixes a DEBUG_RWSEMS_WARN_ON() bug in __up_write().

Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/rwsem.h  |  2 +-
 kernel/locking/rwsem.c | 43 +++++++++++++++++++++---------------------
 2 files changed, 22 insertions(+), 23 deletions(-)

diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
index 148983e21d47..bb76e82398b2 100644
--- a/include/linux/rwsem.h
+++ b/include/linux/rwsem.h
@@ -50,7 +50,7 @@ struct rw_semaphore {
 };
 
 /*
- * Setting bit 1 of the owner field but not bit 0 will indicate
+ * Setting all bits of the owner field except bit 0 will indicate
  * that the rwsem is writer-owned with an unknown owner.
  */
 #define RWSEM_OWNER_UNKNOWN	((struct task_struct *)-2L)
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index b8e209c5fa55..be939accd60c 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -33,17 +33,18 @@
 /*
  * The least significant 2 bits of the owner value has the following
  * meanings when set.
- *  - RWSEM_READER_OWNED (bit 0): The rwsem is owned by readers
- *  - RWSEM_ANONYMOUSLY_OWNED (bit 1): The rwsem is anonymously owned,
- *    i.e. the owner(s) cannot be readily determined. It can be reader
- *    owned or the owning writer is indeterminate.
+ *  - Bit 0: RWSEM_READER_OWNED - The rwsem is owned by readers
+ *  - Bit 1: RWSEM_NONSPINNABLE - Waiters cannot spin on the rwsem
+ *    The rwsem is anonymously owned, i.e. the owner(s) cannot be
+ *    readily determined. It can be reader owned or the owning writer
+ *    is indeterminate.
  *
  * When a writer acquires a rwsem, it puts its task_struct pointer
  * into the owner field. It is cleared after an unlock.
  *
  * When a reader acquires a rwsem, it will also puts its task_struct
  * pointer into the owner field with both the RWSEM_READER_OWNED and
- * RWSEM_ANONYMOUSLY_OWNED bits set. On unlock, the owner field will
+ * RWSEM_NONSPINNABLE bits set. On unlock, the owner field will
  * largely be left untouched. So for a free or reader-owned rwsem,
  * the owner value may contain information about the last reader that
  * acquires the rwsem. The anonymous bit is set because that particular
@@ -55,7 +56,8 @@
  * a rwsem, but the overhead is simply too big.
  */
 #define RWSEM_READER_OWNED	(1UL << 0)
-#define RWSEM_ANONYMOUSLY_OWNED	(1UL << 1)
+#define RWSEM_NONSPINNABLE	(1UL << 1)
+#define RWSEM_OWNER_FLAGS_MASK	(RWSEM_READER_OWNED | RWSEM_NONSPINNABLE)
 
 #ifdef CONFIG_DEBUG_RWSEMS
 # define DEBUG_RWSEMS_WARN_ON(c, sem)	do {			\
@@ -132,7 +134,7 @@ static inline void __rwsem_set_reader_owned(struct rw_semaphore *sem,
 					    struct task_struct *owner)
 {
 	unsigned long val = (unsigned long)owner | RWSEM_READER_OWNED
-						 | RWSEM_ANONYMOUSLY_OWNED;
+						 | RWSEM_NONSPINNABLE;
 
 	WRITE_ONCE(sem->owner, (struct task_struct *)val);
 }
@@ -144,20 +146,12 @@ static inline void rwsem_set_reader_owned(struct rw_semaphore *sem)
 
 /*
  * Return true if the a rwsem waiter can spin on the rwsem's owner
- * and steal the lock, i.e. the lock is not anonymously owned.
+ * and steal the lock.
  * N.B. !owner is considered spinnable.
  */
 static inline bool is_rwsem_owner_spinnable(struct task_struct *owner)
 {
-	return !((unsigned long)owner & RWSEM_ANONYMOUSLY_OWNED);
-}
-
-/*
- * Return true if rwsem is owned by an anonymous writer or readers.
- */
-static inline bool rwsem_has_anonymous_owner(struct task_struct *owner)
-{
-	return (unsigned long)owner & RWSEM_ANONYMOUSLY_OWNED;
+	return !((unsigned long)owner & RWSEM_NONSPINNABLE);
 }
 
 #ifdef CONFIG_DEBUG_RWSEMS
@@ -170,10 +164,10 @@ static inline bool rwsem_has_anonymous_owner(struct task_struct *owner)
 static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
 {
 	unsigned long val = (unsigned long)current | RWSEM_READER_OWNED
-						   | RWSEM_ANONYMOUSLY_OWNED;
+						   | RWSEM_NONSPINNABLE;
 	if (READ_ONCE(sem->owner) == (struct task_struct *)val)
 		cmpxchg_relaxed((unsigned long *)&sem->owner, val,
-				RWSEM_READER_OWNED | RWSEM_ANONYMOUSLY_OWNED);
+				RWSEM_READER_OWNED | RWSEM_NONSPINNABLE);
 }
 #else
 static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
@@ -495,7 +489,7 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
 	struct task_struct *owner;
 	bool ret = true;
 
-	BUILD_BUG_ON(!rwsem_has_anonymous_owner(RWSEM_OWNER_UNKNOWN));
+	BUILD_BUG_ON(is_rwsem_owner_spinnable(RWSEM_OWNER_UNKNOWN));
 
 	if (need_resched())
 		return false;
@@ -534,7 +528,7 @@ static inline enum owner_state rwsem_owner_state(unsigned long owner)
 	if (!owner)
 		return OWNER_NULL;
 
-	if (owner & RWSEM_ANONYMOUSLY_OWNED)
+	if (owner & RWSEM_NONSPINNABLE)
 		return OWNER_NONSPINNABLE;
 
 	if (owner & RWSEM_READER_OWNED)
@@ -1043,7 +1037,12 @@ static inline void __up_write(struct rw_semaphore *sem)
 {
 	long tmp;
 
-	DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
+	/*
+	 * sem->owner may differ from current if the ownership is transferred
+	 * to an anonymous writer by setting the RWSEM_NONSPINNABLE bits.
+	 */
+	DEBUG_RWSEMS_WARN_ON((sem->owner != current) &&
+			    !((long)sem->owner & RWSEM_NONSPINNABLE), sem);
 	rwsem_clear_owner(sem);
 	tmp = atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED, &sem->count);
 	if (unlikely(tmp & RWSEM_FLAG_WAITERS))
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v8 12/19] locking/rwsem: Enable readers spinning on writer
  2019-05-20 20:58 [PATCH v8 00/19] locking/rwsem: Rwsem rearchitecture part 2 Waiman Long
                   ` (10 preceding siblings ...)
  2019-05-20 20:59 ` [PATCH v8 11/19] locking/rwsem: Clarify usage of owner's nonspinnable bit Waiman Long
@ 2019-05-20 20:59 ` Waiman Long
  2019-06-17 14:31   ` [tip:locking/core] " tip-bot for Waiman Long
  2019-05-20 20:59 ` [PATCH v8 13/19] locking/rwsem: Make rwsem->owner an atomic_long_t Waiman Long
                   ` (6 subsequent siblings)
  18 siblings, 1 reply; 83+ messages in thread
From: Waiman Long @ 2019-05-20 20:59 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin
  Cc: linux-kernel, x86, Davidlohr Bueso, Linus Torvalds, Tim Chen,
	huang ying, Waiman Long

This patch enables readers to optimistically spin on a
rwsem when it is owned by a writer instead of going to sleep
directly.  The rwsem_can_spin_on_owner() function is extracted
out of rwsem_optimistic_spin() and is called directly by
rwsem_down_read_slowpath() and rwsem_down_write_slowpath().
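
In other words, both slowpaths now gate the spin attempt on
rwsem_can_spin_on_owner() and pass a flag selecting read or write
acquisition (a distilled fragment of the hunks below):

  /* writer slowpath */
  if (rwsem_can_spin_on_owner(sem) && rwsem_optimistic_spin(sem, true))
          return sem;             /* write lock stolen by spinning */

  /* reader slowpath */
  if (rwsem_can_spin_on_owner(sem)) {
          atomic_long_add(-RWSEM_READER_BIAS, &sem->count);
          if (rwsem_optimistic_spin(sem, false))
                  return sem;     /* read lock acquired by spinning */
  }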

With a locking microbenchmark running on a 5.1-based kernel, the total
locking rates (in kops/s) on an 8-socket IvyBridge-EX system with equal
numbers of readers and writers before and after the patch were as
follows:

   # of Threads  Pre-patch    Post-patch
   ------------  ---------    ----------
        4          1,674        1,684
        8          1,062        1,074
       16            924          900
       32            300          458
       64            195          208
      128            164          168
      240            149          143

The performance change wasn't significant in this case, but this change
is required by a follow-on patch.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/lock_events_list.h |  1 +
 kernel/locking/rwsem.c            | 86 ++++++++++++++++++++++++++-----
 2 files changed, 75 insertions(+), 12 deletions(-)

diff --git a/kernel/locking/lock_events_list.h b/kernel/locking/lock_events_list.h
index 634b47fd8b5e..ca954e4e00e4 100644
--- a/kernel/locking/lock_events_list.h
+++ b/kernel/locking/lock_events_list.h
@@ -56,6 +56,7 @@ LOCK_EVENT(rwsem_sleep_reader)	/* # of reader sleeps			*/
 LOCK_EVENT(rwsem_sleep_writer)	/* # of writer sleeps			*/
 LOCK_EVENT(rwsem_wake_reader)	/* # of reader wakeups			*/
 LOCK_EVENT(rwsem_wake_writer)	/* # of writer wakeups			*/
+LOCK_EVENT(rwsem_opt_rlock)	/* # of read locks opt-spin acquired	*/
 LOCK_EVENT(rwsem_opt_wlock)	/* # of write locks opt-spin acquired	*/
 LOCK_EVENT(rwsem_opt_fail)	/* # of failed opt-spinnings		*/
 LOCK_EVENT(rwsem_rlock)		/* # of read locks acquired		*/
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index be939accd60c..9eb46ab9edaa 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -457,6 +457,30 @@ static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
 }
 
 #ifdef CONFIG_RWSEM_SPIN_ON_OWNER
+/*
+ * Try to acquire read lock before the reader is put on wait queue.
+ * Lock acquisition isn't allowed if the rwsem is locked or a writer handoff
+ * is ongoing.
+ */
+static inline bool rwsem_try_read_lock_unqueued(struct rw_semaphore *sem)
+{
+	long count = atomic_long_read(&sem->count);
+
+	if (count & (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))
+		return false;
+
+	count = atomic_long_fetch_add_acquire(RWSEM_READER_BIAS, &sem->count);
+	if (!(count & (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
+		rwsem_set_reader_owned(sem);
+		lockevent_inc(rwsem_opt_rlock);
+		return true;
+	}
+
+	/* Back out the change */
+	atomic_long_add(-RWSEM_READER_BIAS, &sem->count);
+	return false;
+}
+
 /*
  * Try to acquire write lock before the writer has been put on wait queue.
  */
@@ -491,9 +515,12 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
 
 	BUILD_BUG_ON(is_rwsem_owner_spinnable(RWSEM_OWNER_UNKNOWN));
 
-	if (need_resched())
+	if (need_resched()) {
+		lockevent_inc(rwsem_opt_fail);
 		return false;
+	}
 
+	preempt_disable();
 	rcu_read_lock();
 	owner = READ_ONCE(sem->owner);
 	if (owner) {
@@ -501,6 +528,9 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
 		      owner_on_cpu(owner);
 	}
 	rcu_read_unlock();
+	preempt_enable();
+
+	lockevent_cond_inc(rwsem_opt_fail, !ret);
 	return ret;
 }
 
@@ -578,7 +608,7 @@ static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
 	return state;
 }
 
-static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
+static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
 {
 	bool taken = false;
 	int prev_owner_state = OWNER_NULL;
@@ -586,9 +616,6 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 	preempt_disable();
 
 	/* sem->wait_lock should not be held when doing optimistic spinning */
-	if (!rwsem_can_spin_on_owner(sem))
-		goto done;
-
 	if (!osq_lock(&sem->osq))
 		goto done;
 
@@ -608,10 +635,11 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 		/*
 		 * Try to acquire the lock
 		 */
-		if (rwsem_try_write_lock_unqueued(sem)) {
-			taken = true;
+		taken = wlock ? rwsem_try_write_lock_unqueued(sem)
+			      : rwsem_try_read_lock_unqueued(sem);
+
+		if (taken)
 			break;
-		}
 
 		/*
 		 * An RT task cannot do optimistic spinning if it cannot
@@ -668,7 +696,12 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 	return taken;
 }
 #else
-static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
+static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
+{
+	return false;
+}
+
+static inline bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
 {
 	return false;
 }
@@ -684,6 +717,31 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
 	struct rwsem_waiter waiter;
 	DEFINE_WAKE_Q(wake_q);
 
+	if (!rwsem_can_spin_on_owner(sem))
+		goto queue;
+
+	/*
+	 * Undo read bias from down_read() and do optimistic spinning.
+	 */
+	atomic_long_add(-RWSEM_READER_BIAS, &sem->count);
+	adjustment = 0;
+	if (rwsem_optimistic_spin(sem, false)) {
+		/*
+		 * Wake up other readers in the wait list if the front
+		 * waiter is a reader.
+		 */
+		if ((atomic_long_read(&sem->count) & RWSEM_FLAG_WAITERS)) {
+			raw_spin_lock_irq(&sem->wait_lock);
+			if (!list_empty(&sem->wait_list))
+				rwsem_mark_wake(sem, RWSEM_WAKE_READ_OWNED,
+						&wake_q);
+			raw_spin_unlock_irq(&sem->wait_lock);
+			wake_up_q(&wake_q);
+		}
+		return sem;
+	}
+
+queue:
 	waiter.task = current;
 	waiter.type = RWSEM_WAITING_FOR_READ;
 	waiter.timeout = jiffies + RWSEM_WAIT_TIMEOUT;
@@ -696,7 +754,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
 		 * exit the slowpath and return immediately as its
 		 * RWSEM_READER_BIAS has already been set in the count.
 		 */
-		if (!(atomic_long_read(&sem->count) &
+		if (adjustment && !(atomic_long_read(&sem->count) &
 		     (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
 			raw_spin_unlock_irq(&sem->wait_lock);
 			rwsem_set_reader_owned(sem);
@@ -708,7 +766,10 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
 	list_add_tail(&waiter.list, &sem->wait_list);
 
 	/* we're now waiting on the lock, but no longer actively locking */
-	count = atomic_long_add_return(adjustment, &sem->count);
+	if (adjustment)
+		count = atomic_long_add_return(adjustment, &sem->count);
+	else
+		count = atomic_long_read(&sem->count);
 
 	/*
 	 * If there are no active locks, wake the front queued process(es).
@@ -767,7 +828,8 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 	DEFINE_WAKE_Q(wake_q);
 
 	/* do optimistic spinning and steal lock if possible */
-	if (rwsem_optimistic_spin(sem))
+	if (rwsem_can_spin_on_owner(sem) &&
+	    rwsem_optimistic_spin(sem, true))
 		return sem;
 
 	/*
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v8 13/19] locking/rwsem: Make rwsem->owner an atomic_long_t
  2019-05-20 20:58 [PATCH v8 00/19] locking/rwsem: Rwsem rearchitecture part 2 Waiman Long
                   ` (11 preceding siblings ...)
  2019-05-20 20:59 ` [PATCH v8 12/19] locking/rwsem: Enable readers spinning on writer Waiman Long
@ 2019-05-20 20:59 ` Waiman Long
  2019-06-04  8:52   ` Peter Zijlstra
                     ` (2 more replies)
  2019-05-20 20:59 ` [PATCH v8 14/19] locking/rwsem: Enable time-based spinning on reader-owned rwsem Waiman Long
                   ` (5 subsequent siblings)
  18 siblings, 3 replies; 83+ messages in thread
From: Waiman Long @ 2019-05-20 20:59 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin
  Cc: linux-kernel, x86, Davidlohr Bueso, Linus Torvalds, Tim Chen,
	huang ying, Waiman Long

The rwsem->owner field contains not just the task structure pointer; it
also holds some flags storing the current state of the rwsem. Some of
these flags may have to be updated atomically. To reflect the new
reality, the owner field is now changed to an atomic_long_t type.

New helper functions are added to properly separate out the task
structure pointer and the embedded flags.
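
For example, the packed owner word can be decoded as follows (a
fragment mirroring the rwsem_owner_state() logic in the diff below):

  long flags;
  struct task_struct *owner = rwsem_read_owner_flags(sem, &flags);

  if (flags & RWSEM_NONSPINNABLE)
          return OWNER_NONSPINNABLE;      /* optimistic spinning disallowed */
  if (flags & RWSEM_READER_OWNED)
          return OWNER_READER;            /* reader(s) hold the lock */
  return owner ? OWNER_WRITER : OWNER_NULL;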

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/percpu-rwsem.h |   4 +-
 include/linux/rwsem.h        |  11 +--
 kernel/locking/rwsem.c       | 125 ++++++++++++++++++++++-------------
 3 files changed, 88 insertions(+), 52 deletions(-)

diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
index 03cb4b6f842e..0a43830f1932 100644
--- a/include/linux/percpu-rwsem.h
+++ b/include/linux/percpu-rwsem.h
@@ -117,7 +117,7 @@ static inline void percpu_rwsem_release(struct percpu_rw_semaphore *sem,
 	lock_release(&sem->rw_sem.dep_map, 1, ip);
 #ifdef CONFIG_RWSEM_SPIN_ON_OWNER
 	if (!read)
-		sem->rw_sem.owner = RWSEM_OWNER_UNKNOWN;
+		atomic_long_set(&sem->rw_sem.owner, RWSEM_OWNER_UNKNOWN);
 #endif
 }
 
@@ -127,7 +127,7 @@ static inline void percpu_rwsem_acquire(struct percpu_rw_semaphore *sem,
 	lock_acquire(&sem->rw_sem.dep_map, 0, 1, read, 1, NULL, ip);
 #ifdef CONFIG_RWSEM_SPIN_ON_OWNER
 	if (!read)
-		sem->rw_sem.owner = current;
+		atomic_long_set(&sem->rw_sem.owner, (long)current);
 #endif
 }
 
diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
index bb76e82398b2..e401358c4e7e 100644
--- a/include/linux/rwsem.h
+++ b/include/linux/rwsem.h
@@ -35,10 +35,11 @@
 struct rw_semaphore {
 	atomic_long_t count;
 	/*
-	 * Write owner or one of the read owners. Can be used as a
-	 * speculative check to see if the owner is running on the cpu.
+	 * Write owner or one of the read owners as well as flags regarding
+	 * the current state of the rwsem. Can be used as a speculative
+	 * check to see if the write owner is running on the cpu.
 	 */
-	struct task_struct *owner;
+	atomic_long_t owner;
 #ifdef CONFIG_RWSEM_SPIN_ON_OWNER
 	struct optimistic_spin_queue osq; /* spinner MCS lock */
 #endif
@@ -53,7 +54,7 @@ struct rw_semaphore {
  * Setting all bits of the owner field except bit 0 will indicate
  * that the rwsem is writer-owned with an unknown owner.
  */
-#define RWSEM_OWNER_UNKNOWN	((struct task_struct *)-2L)
+#define RWSEM_OWNER_UNKNOWN	(-2L)
 
 /* In all implementations count != 0 means locked */
 static inline int rwsem_is_locked(struct rw_semaphore *sem)
@@ -80,7 +81,7 @@ static inline int rwsem_is_locked(struct rw_semaphore *sem)
 
 #define __RWSEM_INITIALIZER(name)				\
 	{ __RWSEM_INIT_COUNT(name),				\
-	  .owner = NULL,					\
+	  .owner = ATOMIC_LONG_INIT(0),				\
 	  .wait_list = LIST_HEAD_INIT((name).wait_list),	\
 	  .wait_lock = __RAW_SPIN_LOCK_UNLOCKED(name.wait_lock)	\
 	  __RWSEM_OPT_INIT(name)				\
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 9eb46ab9edaa..555da4868e54 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -64,7 +64,7 @@
 	if (!debug_locks_silent &&				\
 	    WARN_ONCE(c, "DEBUG_RWSEMS_WARN_ON(%s): count = 0x%lx, owner = 0x%lx, curr 0x%lx, list %sempty\n",\
 		#c, atomic_long_read(&(sem)->count),		\
-		(long)((sem)->owner), (long)current,		\
+		atomic_long_read(&(sem)->owner), (long)current,	\
 		list_empty(&(sem)->wait_list) ? "" : "not "))	\
 			debug_locks_off();			\
 	} while (0)
@@ -114,12 +114,20 @@
  */
 static inline void rwsem_set_owner(struct rw_semaphore *sem)
 {
-	WRITE_ONCE(sem->owner, current);
+	atomic_long_set(&sem->owner, (long)current);
 }
 
 static inline void rwsem_clear_owner(struct rw_semaphore *sem)
 {
-	WRITE_ONCE(sem->owner, NULL);
+	atomic_long_set(&sem->owner, 0);
+}
+
+/*
+ * Test the flags in the owner field.
+ */
+static inline bool rwsem_test_oflags(struct rw_semaphore *sem, long flags)
+{
+	return atomic_long_read(&sem->owner) & flags;
 }
 
 /*
@@ -133,10 +141,9 @@ static inline void rwsem_clear_owner(struct rw_semaphore *sem)
 static inline void __rwsem_set_reader_owned(struct rw_semaphore *sem,
 					    struct task_struct *owner)
 {
-	unsigned long val = (unsigned long)owner | RWSEM_READER_OWNED
-						 | RWSEM_NONSPINNABLE;
+	long val = (long)owner | RWSEM_READER_OWNED | RWSEM_NONSPINNABLE;
 
-	WRITE_ONCE(sem->owner, (struct task_struct *)val);
+	atomic_long_set(&sem->owner, val);
 }
 
 static inline void rwsem_set_reader_owned(struct rw_semaphore *sem)
@@ -145,13 +152,20 @@ static inline void rwsem_set_reader_owned(struct rw_semaphore *sem)
 }
 
 /*
- * Return true if the a rwsem waiter can spin on the rwsem's owner
- * and steal the lock.
- * N.B. !owner is considered spinnable.
+ * Return true if the rwsem is owned by a reader.
  */
-static inline bool is_rwsem_owner_spinnable(struct task_struct *owner)
+static inline bool is_rwsem_reader_owned(struct rw_semaphore *sem)
 {
-	return !((unsigned long)owner & RWSEM_NONSPINNABLE);
+#ifdef CONFIG_DEBUG_RWSEMS
+	/*
+	 * Check the count to see if it is write-locked.
+	 */
+	long count = atomic_long_read(&sem->count);
+
+	if (count & RWSEM_WRITER_MASK)
+		return false;
+#endif
+	return rwsem_test_oflags(sem, RWSEM_READER_OWNED);
 }
 
 #ifdef CONFIG_DEBUG_RWSEMS
@@ -163,11 +177,13 @@ static inline bool is_rwsem_owner_spinnable(struct task_struct *owner)
  */
 static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
 {
-	unsigned long val = (unsigned long)current | RWSEM_READER_OWNED
-						   | RWSEM_NONSPINNABLE;
-	if (READ_ONCE(sem->owner) == (struct task_struct *)val)
-		cmpxchg_relaxed((unsigned long *)&sem->owner, val,
-				RWSEM_READER_OWNED | RWSEM_NONSPINNABLE);
+	long val = atomic_long_read(&sem->owner);
+
+	while ((val & ~RWSEM_OWNER_FLAGS_MASK) == (long)current) {
+		if (atomic_long_try_cmpxchg(&sem->owner, &val,
+					    val & RWSEM_OWNER_FLAGS_MASK))
+			return;
+	}
 }
 #else
 static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
@@ -175,6 +191,28 @@ static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
 }
 #endif
 
+/*
+ * Return just the real task structure pointer of the owner
+ */
+static inline struct task_struct *rwsem_read_owner(struct rw_semaphore *sem)
+{
+	return (struct task_struct *)(atomic_long_read(&sem->owner) &
+					~RWSEM_OWNER_FLAGS_MASK);
+}
+
+/*
+ * Return the real task structure pointer of the owner and the embedded
+ * flags in the owner. pflags must be non-NULL.
+ */
+static inline struct task_struct *
+rwsem_read_owner_flags(struct rw_semaphore *sem, long *pflags)
+{
+	long owner = atomic_long_read(&sem->owner);
+
+	*pflags = owner & RWSEM_OWNER_FLAGS_MASK;
+	return (struct task_struct *)(owner & ~RWSEM_OWNER_FLAGS_MASK);
+}
+
 /*
  * Guide to the rw_semaphore's count field.
  *
@@ -208,7 +246,7 @@ void __init_rwsem(struct rw_semaphore *sem, const char *name,
 	atomic_long_set(&sem->count, RWSEM_UNLOCKED_VALUE);
 	raw_spin_lock_init(&sem->wait_lock);
 	INIT_LIST_HEAD(&sem->wait_list);
-	sem->owner = NULL;
+	atomic_long_set(&sem->owner, 0L);
 #ifdef CONFIG_RWSEM_SPIN_ON_OWNER
 	osq_lock_init(&sem->osq);
 #endif
@@ -511,9 +549,10 @@ static inline bool owner_on_cpu(struct task_struct *owner)
 static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
 {
 	struct task_struct *owner;
+	long flags;
 	bool ret = true;
 
-	BUILD_BUG_ON(is_rwsem_owner_spinnable(RWSEM_OWNER_UNKNOWN));
+	BUILD_BUG_ON(!(RWSEM_OWNER_UNKNOWN & RWSEM_NONSPINNABLE));
 
 	if (need_resched()) {
 		lockevent_inc(rwsem_opt_fail);
@@ -522,11 +561,9 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
 
 	preempt_disable();
 	rcu_read_lock();
-	owner = READ_ONCE(sem->owner);
-	if (owner) {
-		ret = is_rwsem_owner_spinnable(owner) &&
-		      owner_on_cpu(owner);
-	}
+	owner = rwsem_read_owner_flags(sem, &flags);
+	if ((flags & RWSEM_NONSPINNABLE) || (owner && !owner_on_cpu(owner)))
+		ret = false;
 	rcu_read_unlock();
 	preempt_enable();
 
@@ -553,25 +590,26 @@ enum owner_state {
 };
 #define OWNER_SPINNABLE		(OWNER_NULL | OWNER_WRITER)
 
-static inline enum owner_state rwsem_owner_state(unsigned long owner)
+static inline enum owner_state rwsem_owner_state(struct task_struct *owner,
+						 long flags)
 {
-	if (!owner)
-		return OWNER_NULL;
-
-	if (owner & RWSEM_NONSPINNABLE)
+	if (flags & RWSEM_NONSPINNABLE)
 		return OWNER_NONSPINNABLE;
 
-	if (owner & RWSEM_READER_OWNED)
+	if (flags & RWSEM_READER_OWNED)
 		return OWNER_READER;
 
-	return OWNER_WRITER;
+	return owner ? OWNER_WRITER : OWNER_NULL;
 }
 
 static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
 {
-	struct task_struct *tmp, *owner = READ_ONCE(sem->owner);
-	enum owner_state state = rwsem_owner_state((unsigned long)owner);
+	struct task_struct *new, *owner;
+	long flags, new_flags;
+	enum owner_state state;
 
+	owner = rwsem_read_owner_flags(sem, &flags);
+	state = rwsem_owner_state(owner, flags);
 	if (state != OWNER_WRITER)
 		return state;
 
@@ -582,9 +620,9 @@ static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
 			break;
 		}
 
-		tmp = READ_ONCE(sem->owner);
-		if (tmp != owner) {
-			state = rwsem_owner_state((unsigned long)tmp);
+		new = rwsem_read_owner_flags(sem, &new_flags);
+		if ((new != owner) || (new_flags != flags)) {
+			state = rwsem_owner_state(new, new_flags);
 			break;
 		}
 
@@ -1001,8 +1039,7 @@ inline void __down_read(struct rw_semaphore *sem)
 	if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
 			&sem->count) & RWSEM_READ_FAILED_MASK)) {
 		rwsem_down_read_slowpath(sem, TASK_UNINTERRUPTIBLE);
-		DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
-					RWSEM_READER_OWNED), sem);
+		DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
 	} else {
 		rwsem_set_reader_owned(sem);
 	}
@@ -1014,8 +1051,7 @@ static inline int __down_read_killable(struct rw_semaphore *sem)
 			&sem->count) & RWSEM_READ_FAILED_MASK)) {
 		if (IS_ERR(rwsem_down_read_slowpath(sem, TASK_KILLABLE)))
 			return -EINTR;
-		DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
-					RWSEM_READER_OWNED), sem);
+		DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
 	} else {
 		rwsem_set_reader_owned(sem);
 	}
@@ -1084,7 +1120,7 @@ inline void __up_read(struct rw_semaphore *sem)
 {
 	long tmp;
 
-	DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner & RWSEM_READER_OWNED), sem);
+	DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
 	rwsem_clear_reader_owned(sem);
 	tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
 	if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS)) ==
@@ -1103,8 +1139,8 @@ static inline void __up_write(struct rw_semaphore *sem)
 	 * sem->owner may differ from current if the ownership is transferred
 	 * to an anonymous writer by setting the RWSEM_NONSPINNABLE bits.
 	 */
-	DEBUG_RWSEMS_WARN_ON((sem->owner != current) &&
-			    !((long)sem->owner & RWSEM_NONSPINNABLE), sem);
+	DEBUG_RWSEMS_WARN_ON((rwsem_read_owner(sem) != current) &&
+			    !rwsem_test_oflags(sem, RWSEM_NONSPINNABLE), sem);
 	rwsem_clear_owner(sem);
 	tmp = atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED, &sem->count);
 	if (unlikely(tmp & RWSEM_FLAG_WAITERS))
@@ -1125,7 +1161,7 @@ static inline void __downgrade_write(struct rw_semaphore *sem)
 	 * read-locked region is ok to be re-ordered into the
 	 * write side. As such, rely on RELEASE semantics.
 	 */
-	DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
+	DEBUG_RWSEMS_WARN_ON(rwsem_read_owner(sem) != current, sem);
 	tmp = atomic_long_fetch_add_release(
 		-RWSEM_WRITER_LOCKED+RWSEM_READER_BIAS, &sem->count);
 	rwsem_set_reader_owned(sem);
@@ -1296,8 +1332,7 @@ EXPORT_SYMBOL(down_write_killable_nested);
 
 void up_read_non_owner(struct rw_semaphore *sem)
 {
-	DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner & RWSEM_READER_OWNED),
-				sem);
+	DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
 	__up_read(sem);
 }
 EXPORT_SYMBOL(up_read_non_owner);
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v8 14/19] locking/rwsem: Enable time-based spinning on reader-owned rwsem
  2019-05-20 20:58 [PATCH v8 00/19] locking/rwsem: Rwsem rearchitecture part 2 Waiman Long
                   ` (12 preceding siblings ...)
  2019-05-20 20:59 ` [PATCH v8 13/19] locking/rwsem: Make rwsem->owner an atomic_long_t Waiman Long
@ 2019-05-20 20:59 ` Waiman Long
  2019-06-04  9:03   ` Peter Zijlstra
  2019-06-17 14:32   ` [tip:locking/core] " tip-bot for Waiman Long
  2019-05-20 20:59 ` [PATCH v8 15/19] locking/rwsem: Adaptive disabling of reader optimistic spinning Waiman Long
                   ` (4 subsequent siblings)
  18 siblings, 2 replies; 83+ messages in thread
From: Waiman Long @ 2019-05-20 20:59 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin
  Cc: linux-kernel, x86, Davidlohr Bueso, Linus Torvalds, Tim Chen,
	huang ying, Waiman Long

When the rwsem is owned by readers, writers stop optimistic spinning
simply because there is no easy way to figure out if all the readers
are actively running or not. However, there are scenarios where
the readers are unlikely to sleep and optimistic spinning can help
performance.

This patch provides a simple mechanism for spinning on a reader-owned
rwsem by a writer. It is time-threshold-based spinning where the
allowable spinning time can vary from 10us to 25us depending on the
condition of the rwsem.

When the time threshold is exceeded, the nonspinnable bits will be set
in the owner field to indicate that no more optimistic spinning will
be allowed on this rwsem until it becomes writer-owned again. Not even
readers are allowed to acquire the reader-locked rwsem by optimistic
spinning, for fairness.

We also want a writer to acquire the lock after the readers have held
the lock for a relatively long time. In order to give preference to
writers under such a circumstance, the single RWSEM_NONSPINNABLE bit is
now split into two - one for readers and one for writers. When optimistic
spinning is disabled, both bits will be set. When the reader count drops
down to 0, the writer nonspinnable bit will be cleared to allow writers
to spin on the lock, but not the readers. When a writer acquires the
lock, it will write its own task structure pointer into sem->owner and
clear the reader nonspinnable bit in the process.

The time taken for each iteration of the reader-owned rwsem spinning
loop varies. Below are sample minimum elapsed times for 16 iterations
of the loop.

      System                 Time for 16 Iterations
      ------                 ----------------------
  1-socket Skylake                  ~800ns
  4-socket Broadwell                ~300ns
  2-socket ThunderX2 (arm64)        ~250ns

When the lock cacheline is contended, we can see up to almost a 10X
increase in elapsed time. So the 25us cap corresponds to at most 500,
1300 and 1600 uncontended iterations on each of the above systems.
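
For reference, those iteration counts come from dividing the 25us cap
by the per-iteration cost (the measured 16-iteration time divided by 16):

  Skylake    : 25,000ns / (800ns/16) =  500 iterations
  Broadwell  : 25,000ns / (300ns/16) ~ 1,333 iterations
  ThunderX2  : 25,000ns / (250ns/16) = 1,600 iterations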

With a locking microbenchmark running on a 5.1-based kernel, the total
locking rates (in kops/s) on an 8-socket IvyBridge-EX system with
equal numbers of readers and writers before and after this patch were
as follows:

   # of Threads  Pre-patch    Post-patch
   ------------  ---------    ----------
        2          1,759        6,684
        4          1,684        6,738
        8          1,074        7,222
       16            900        7,163
       32            458        7,316
       64            208          520
      128            168          425
      240            143          474

This patch gives a big boost in performance for mixed reader/writer
workloads.

With 32 locking threads, the rwsem lock event data were:

rwsem_opt_fail=79850
rwsem_opt_nospin=5069
rwsem_opt_rlock=597484
rwsem_opt_wlock=957339
rwsem_sleep_reader=57782
rwsem_sleep_writer=55663

With 64 locking threads, the data looked like:

rwsem_opt_fail=346723
rwsem_opt_nospin=6293
rwsem_opt_rlock=1127119
rwsem_opt_wlock=1400628
rwsem_sleep_reader=308201
rwsem_sleep_writer=72281

So a lot more threads acquired the lock in the slowpath and more threads
went to sleep.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/lock_events_list.h |   1 +
 kernel/locking/rwsem.c            | 173 ++++++++++++++++++++++++------
 2 files changed, 144 insertions(+), 30 deletions(-)

diff --git a/kernel/locking/lock_events_list.h b/kernel/locking/lock_events_list.h
index ca954e4e00e4..baa998401052 100644
--- a/kernel/locking/lock_events_list.h
+++ b/kernel/locking/lock_events_list.h
@@ -59,6 +59,7 @@ LOCK_EVENT(rwsem_wake_writer)	/* # of writer wakeups			*/
 LOCK_EVENT(rwsem_opt_rlock)	/* # of read locks opt-spin acquired	*/
 LOCK_EVENT(rwsem_opt_wlock)	/* # of write locks opt-spin acquired	*/
 LOCK_EVENT(rwsem_opt_fail)	/* # of failed opt-spinnings		*/
+LOCK_EVENT(rwsem_opt_nospin)	/* # of disabled reader opt-spinnings	*/
 LOCK_EVENT(rwsem_rlock)		/* # of read locks acquired		*/
 LOCK_EVENT(rwsem_rlock_fast)	/* # of fast read locks acquired	*/
 LOCK_EVENT(rwsem_rlock_fail)	/* # of failed read lock acquisitions	*/
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 555da4868e54..ec4c26b353c9 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -23,6 +23,7 @@
 #include <linux/sched/debug.h>
 #include <linux/sched/wake_q.h>
 #include <linux/sched/signal.h>
+#include <linux/sched/clock.h>
 #include <linux/export.h>
 #include <linux/rwsem.h>
 #include <linux/atomic.h>
@@ -31,24 +32,28 @@
 #include "lock_events.h"
 
 /*
- * The least significant 2 bits of the owner value has the following
+ * The least significant 3 bits of the owner value has the following
  * meanings when set.
  *  - Bit 0: RWSEM_READER_OWNED - The rwsem is owned by readers
- *  - Bit 1: RWSEM_NONSPINNABLE - Waiters cannot spin on the rwsem
- *    The rwsem is anonymously owned, i.e. the owner(s) cannot be
- *    readily determined. It can be reader owned or the owning writer
- *    is indeterminate.
+ *  - Bit 1: RWSEM_RD_NONSPINNABLE - Readers cannot spin on this lock.
+ *  - Bit 2: RWSEM_WR_NONSPINNABLE - Writers cannot spin on this lock.
  *
+ * When the rwsem is either owned by an anonymous writer, or it is
+ * reader-owned, but a spinning writer has timed out, both nonspinnable
+ * bits will be set to disable optimistic spinning by readers and writers.
+ * In the latter case, the last unlocking reader should then check the
+ * writer nonspinnable bit and clear only that bit. This gives writers,
+ * but not readers, preference to acquire the lock via optimistic
+ * spinning. Similar action is also done in the reader slowpath.
+ *
  * When a writer acquires a rwsem, it puts its task_struct pointer
  * into the owner field. It is cleared after an unlock.
  *
  * When a reader acquires a rwsem, it will also puts its task_struct
- * pointer into the owner field with both the RWSEM_READER_OWNED and
- * RWSEM_NONSPINNABLE bits set. On unlock, the owner field will
- * largely be left untouched. So for a free or reader-owned rwsem,
- * the owner value may contain information about the last reader that
- * acquires the rwsem. The anonymous bit is set because that particular
- * reader may or may not still own the lock.
+ * pointer into the owner field with the RWSEM_READER_OWNED bit set.
+ * On unlock, the owner field will largely be left untouched. So
+ * for a free or reader-owned rwsem, the owner value may contain
+ * information about the last reader that acquired the rwsem.
  *
  * That information may be helpful in debugging cases where the system
  * seems to hang on a reader owned rwsem especially if only one reader
@@ -56,7 +61,9 @@
  * a rwsem, but the overhead is simply too big.
  */
 #define RWSEM_READER_OWNED	(1UL << 0)
-#define RWSEM_NONSPINNABLE	(1UL << 1)
+#define RWSEM_RD_NONSPINNABLE	(1UL << 1)
+#define RWSEM_WR_NONSPINNABLE	(1UL << 2)
+#define RWSEM_NONSPINNABLE	(RWSEM_RD_NONSPINNABLE | RWSEM_WR_NONSPINNABLE)
 #define RWSEM_OWNER_FLAGS_MASK	(RWSEM_READER_OWNED | RWSEM_NONSPINNABLE)
 
 #ifdef CONFIG_DEBUG_RWSEMS
@@ -141,7 +148,7 @@ static inline bool rwsem_test_oflags(struct rw_semaphore *sem, long flags)
 static inline void __rwsem_set_reader_owned(struct rw_semaphore *sem,
 					    struct task_struct *owner)
 {
-	long val = (long)owner | RWSEM_READER_OWNED | RWSEM_NONSPINNABLE;
+	long val = (long)owner | RWSEM_READER_OWNED;
 
 	atomic_long_set(&sem->owner, val);
 }
@@ -191,6 +198,22 @@ static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
 }
 #endif
 
+/*
+ * Set the RWSEM_NONSPINNABLE bits if the RWSEM_READER_OWNED flag
+ * remains set. Otherwise, the operation will be aborted.
+ */
+static inline void rwsem_set_nonspinnable(struct rw_semaphore *sem)
+{
+	long owner = atomic_long_read(&sem->owner);
+
+	while (owner & RWSEM_READER_OWNED) {
+		if (owner & RWSEM_NONSPINNABLE)
+			break;
+		owner = atomic_long_cmpxchg(&sem->owner, owner,
+					    owner | RWSEM_NONSPINNABLE);
+	}
+}
+
 /*
  * Return just the real task structure pointer of the owner
  */
@@ -546,7 +569,8 @@ static inline bool owner_on_cpu(struct task_struct *owner)
 	return owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
 }
 
-static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
+static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem,
+					   long nonspinnable)
 {
 	struct task_struct *owner;
 	long flags;
@@ -562,7 +586,7 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
 	preempt_disable();
 	rcu_read_lock();
 	owner = rwsem_read_owner_flags(sem, &flags);
-	if ((flags & RWSEM_NONSPINNABLE) || (owner && !owner_on_cpu(owner)))
+	if ((flags & nonspinnable) || (owner && !owner_on_cpu(owner)))
 		ret = false;
 	rcu_read_unlock();
 	preempt_enable();
@@ -588,12 +612,12 @@ enum owner_state {
 	OWNER_READER		= 1 << 2,
 	OWNER_NONSPINNABLE	= 1 << 3,
 };
-#define OWNER_SPINNABLE		(OWNER_NULL | OWNER_WRITER)
+#define OWNER_SPINNABLE		(OWNER_NULL | OWNER_WRITER | OWNER_READER)
 
 static inline enum owner_state rwsem_owner_state(struct task_struct *owner,
-						 long flags)
+						 long flags, long nonspinnable)
 {
-	if (flags & RWSEM_NONSPINNABLE)
+	if (flags & nonspinnable)
 		return OWNER_NONSPINNABLE;
 
 	if (flags & RWSEM_READER_OWNED)
@@ -602,14 +626,15 @@ static inline enum owner_state rwsem_owner_state(struct task_struct *owner,
 	return owner ? OWNER_WRITER : OWNER_NULL;
 }
 
-static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
+static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem,
+						     long nonspinnable)
 {
 	struct task_struct *new, *owner;
 	long flags, new_flags;
 	enum owner_state state;
 
 	owner = rwsem_read_owner_flags(sem, &flags);
-	state = rwsem_owner_state(owner, flags);
+	state = rwsem_owner_state(owner, flags, nonspinnable);
 	if (state != OWNER_WRITER)
 		return state;
 
@@ -622,7 +647,7 @@ static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
 
 		new = rwsem_read_owner_flags(sem, &new_flags);
 		if ((new != owner) || (new_flags != flags)) {
-			state = rwsem_owner_state(new, new_flags);
+			state = rwsem_owner_state(new, new_flags, nonspinnable);
 			break;
 		}
 
@@ -646,10 +671,39 @@ static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
 	return state;
 }
 
+/*
+ * Calculate reader-owned rwsem spinning threshold for writer
+ *
+ * The more readers own the rwsem, the longer it will take for them to
+ * wind down and free the rwsem. So the empirical formula used to
+ * determine the actual spinning time limit here is:
+ *
+ *   Spinning threshold = (10 + nr_readers/2)us
+ *
+ * The limit is capped to a maximum of 25us (30 readers). This is just
+ * a heuristic and is subject to change in the future.
+ */
+static inline u64 rwsem_rspin_threshold(struct rw_semaphore *sem)
+{
+	long count = atomic_long_read(&sem->count);
+	int readers = count >> RWSEM_READER_SHIFT;
+	u64 delta;
+
+	if (readers > 30)
+		readers = 30;
+	delta = (20 + readers) * NSEC_PER_USEC / 2;
+
+	return sched_clock() + delta;
+}
+
 static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
 {
 	bool taken = false;
 	int prev_owner_state = OWNER_NULL;
+	int loop = 0;
+	u64 rspin_threshold = 0;
+	long nonspinnable = wlock ? RWSEM_WR_NONSPINNABLE
+				  : RWSEM_RD_NONSPINNABLE;
 
 	preempt_disable();
 
@@ -661,12 +715,12 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
 	 * Optimistically spin on the owner field and attempt to acquire the
 	 * lock whenever the owner changes. Spinning will be stopped when:
 	 *  1) the owning writer isn't running; or
-	 *  2) readers own the lock as we can't determine if they are
-	 *     actively running or not.
+	 *  2) readers own the lock and spinning time has exceeded limit.
 	 */
 	for (;;) {
-		enum owner_state owner_state = rwsem_spin_on_owner(sem);
+		enum owner_state owner_state;
 
+		owner_state = rwsem_spin_on_owner(sem, nonspinnable);
 		if (!(owner_state & OWNER_SPINNABLE))
 			break;
 
@@ -679,6 +733,39 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
 		if (taken)
 			break;
 
+		/*
+		 * Time-based reader-owned rwsem optimistic spinning
+		 */
+		if (wlock && (owner_state == OWNER_READER)) {
+			/*
+			 * Re-initialize rspin_threshold every time when
+			 * the owner state changes from non-reader to reader.
+			 * This allows a writer to steal the lock in between
+			 * 2 reader phases and have the threshold reset at
+			 * the beginning of the 2nd reader phase.
+			 */
+			if (prev_owner_state != OWNER_READER) {
+				if (rwsem_test_oflags(sem, nonspinnable))
+					break;
+				rspin_threshold = rwsem_rspin_threshold(sem);
+				loop = 0;
+			}
+
+			/*
+			 * Check time threshold once every 16 iterations to
+			 * avoid calling sched_clock() too frequently so
+			 * as to reduce the average latency between the times
+			 * when the lock becomes free and when the spinner
+			 * is ready to do a trylock.
+			 */
+			else if (!(++loop & 0xf) &&
+				 (sched_clock() > rspin_threshold)) {
+				rwsem_set_nonspinnable(sem);
+				lockevent_inc(rwsem_opt_nospin);
+				break;
+			}
+		}
+
 		/*
 		 * An RT task cannot do optimistic spinning if it cannot
 		 * be sure the lock holder is running or live-lock may
@@ -733,8 +820,25 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
 	lockevent_cond_inc(rwsem_opt_fail, !taken);
 	return taken;
 }
+
+/*
+ * Clear the owner's RWSEM_WR_NONSPINNABLE bit if it is set. This should
+ * only be called when the reader count reaches 0.
+ *
+ * This gives writers a better chance to acquire the rwsem first before
+ * readers when the rwsem has been held by readers for a relatively long
+ * period of time. A race can happen where an optimistic spinner may have
+ * just stolen the rwsem and set the owner, but just clearing the
+ * RWSEM_WR_NONSPINNABLE bit will do no harm anyway.
+ */
+static inline void clear_wr_nonspinnable(struct rw_semaphore *sem)
+{
+	if (rwsem_test_oflags(sem, RWSEM_WR_NONSPINNABLE))
+		atomic_long_andnot(RWSEM_WR_NONSPINNABLE, &sem->owner);
+}
 #else
-static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
+static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem,
+					   long nonspinnable)
 {
 	return false;
 }
@@ -743,6 +847,8 @@ static inline bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
 {
 	return false;
 }
+
+static inline void clear_wr_nonspinnable(struct rw_semaphore *sem) { }
 #endif
 
 /*
@@ -752,10 +858,11 @@ static struct rw_semaphore __sched *
 rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
 {
 	long count, adjustment = -RWSEM_READER_BIAS;
+	bool wake = false;
 	struct rwsem_waiter waiter;
 	DEFINE_WAKE_Q(wake_q);
 
-	if (!rwsem_can_spin_on_owner(sem))
+	if (!rwsem_can_spin_on_owner(sem, RWSEM_RD_NONSPINNABLE))
 		goto queue;
 
 	/*
@@ -815,8 +922,12 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
 	 * If there are no writers and we are first in the queue,
 	 * wake our own waiter to join the existing active readers !
 	 */
-	if (!(count & RWSEM_LOCK_MASK) ||
-	   (!(count & RWSEM_WRITER_MASK) && (adjustment & RWSEM_FLAG_WAITERS)))
+	if (!(count & RWSEM_LOCK_MASK)) {
+		clear_wr_nonspinnable(sem);
+		wake = true;
+	}
+	if (wake || (!(count & RWSEM_WRITER_MASK) &&
+		    (adjustment & RWSEM_FLAG_WAITERS)))
 		rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
 
 	raw_spin_unlock_irq(&sem->wait_lock);
@@ -866,7 +977,7 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 	DEFINE_WAKE_Q(wake_q);
 
 	/* do optimistic spinning and steal lock if possible */
-	if (rwsem_can_spin_on_owner(sem) &&
+	if (rwsem_can_spin_on_owner(sem, RWSEM_WR_NONSPINNABLE) &&
 	    rwsem_optimistic_spin(sem, true))
 		return sem;
 
@@ -1124,8 +1235,10 @@ inline void __up_read(struct rw_semaphore *sem)
 	rwsem_clear_reader_owned(sem);
 	tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
 	if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS)) ==
-		      RWSEM_FLAG_WAITERS))
+		      RWSEM_FLAG_WAITERS)) {
+		clear_wr_nonspinnable(sem);
 		rwsem_wake(sem, tmp);
+	}
 }
 
 /*
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v8 15/19] locking/rwsem: Adaptive disabling of reader optimistic spinning
  2019-05-20 20:58 [PATCH v8 00/19] locking/rwsem: Rwsem rearchitecture part 2 Waiman Long
                   ` (13 preceding siblings ...)
  2019-05-20 20:59 ` [PATCH v8 14/19] locking/rwsem: Enable time-based spinning on reader-owned rwsem Waiman Long
@ 2019-05-20 20:59 ` Waiman Long
  2019-06-04  9:10   ` Peter Zijlstra
                     ` (4 more replies)
  2019-05-20 20:59 ` [PATCH v8 16/19] locking/rwsem: Guard against making count negative Waiman Long
                   ` (3 subsequent siblings)
  18 siblings, 5 replies; 83+ messages in thread
From: Waiman Long @ 2019-05-20 20:59 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin
  Cc: linux-kernel, x86, Davidlohr Bueso, Linus Torvalds, Tim Chen,
	huang ying, Waiman Long

Reader optimistic spinning is helpful when the reader critical section
is short and there aren't that many readers around. It makes readers
relatively more preferred than writers. When a writer times out spinning
on a reader-owned lock and sets the nonspinnable bits, there are two main
reasons for that.

 1) The reader critical section is long, perhaps the task sleeps after
    acquiring the read lock.
 2) There are just too many readers contending the lock causing it to
    take a while to service all of them.

In the former case, a long reader critical section will impede the
progress of writers, which is usually more important for system
performance. In the latter case, reader optimistic spinning tends to
make the reader groups (readers that acquire the lock together)
smaller, leading to more of them. That may hurt performance in some
cases. In other words, the setting of the nonspinnable bits indicates
that reader optimistic spinning may not be helpful for those workloads
that cause it.

Therefore, any writers that have observed the setting of the writer
nonspinnable bit for a given rwsem after they fail to acquire the lock
via optimistic spinning will set the reader nonspinnable bit once they
acquire the write lock. Similarly, readers that observe the setting
of the reader nonspinnable bit at slowpath entry will also set the reader
nonspinnable bit when they acquire the read lock via the wakeup path.

Once the reader nonspinnable bit is on, it will only be reset when
a writer is able to acquire the rwsem in the fast path or somehow a
reader or writer in the slowpath doesn't observe the nonspinnable bit.

This is to discourage reader optimistic spinning on that particular
rwsem and make writers more preferred. This adaptive disabling of reader
optimistic spinning will alleviate some of the negative side effects of
this feature.
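
A rough sketch of that rule (illustrative only; the helper name below
is hypothetical and the actual diff may structure this differently):

  /*
   * Hypothetical helper: a writer that had observed RWSEM_WR_NONSPINNABLE
   * while failing to spin calls this after acquiring the write lock;
   * a reader woken up in the slowpath that had observed
   * RWSEM_RD_NONSPINNABLE does the same after getting its read lock.
   */
  static inline void rwsem_disable_reader_optspin(struct rw_semaphore *sem,
                                                  bool disable)
  {
          if (unlikely(disable)) {
                  atomic_long_or(RWSEM_RD_NONSPINNABLE, &sem->owner);
                  lockevent_inc(rwsem_opt_norspin);
          }
  }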

In addition, this patch tries to make readers in the spinning queue
follow the phase-fair principle after quitting optimistic spinning
by checking if another reader has somehow acquired a read lock after
this reader entered the optimistic spinning queue. If so, and the rwsem
is still reader-owned, this reader is in the right read-phase and can
attempt to acquire the lock.

On a 2-socket 40-core 80-thread Skylake system, the page_fault1 test of
the will-it-scale benchmark was run with various numbers of threads. The
number of operations done before the reader optimistic spinning patches,
before this patch, and after this patch were:

  Threads  Before rspin  Before patch  After patch    %change
  -------  ------------  ------------  -----------    -------
    20        5541068      5345484       5455667    -3.5%/ +2.1%
    40       10185150      7292313       9219276   -28.5%/+26.4%
    60        8196733      6460517       7181209   -21.2%/+11.2%
    80        9508864      6739559       8107025   -29.1%/+20.3%

This patch doesn't recover all the lost performance, but it recovers
more than half of it. Given that reader optimistic spinning does benefit
some workloads, this is a good compromise.

Using the rwsem locking microbenchmark with very short critical sections,
this patch doesn't have too much impact on locking performance as shown
by the locking rates (kops/s) below with equal numbers of readers and
writers before and after this patch:

   # of Threads  Pre-patch    Post-patch
   ------------  ---------    ----------
        2          4,730        4,969
        4          4,814        4,786
        8          4,866        4,815
       16          4,715        4,511
       32          3,338        3,500
       64          3,212        3,389
       80          3,110        3,044

When running the locking microbenchmark with 40 dedicated reader and writer
threads, however, the reader performance is curtailed to favor the writer.

Before patch:

  40 readers, Iterations Min/Mean/Max = 204,026/234,309/254,816
  40 writers, Iterations Min/Mean/Max = 88,515/95,884/115,644

After patch:

  40 readers, Iterations Min/Mean/Max = 33,813/35,260/36,791
  40 writers, Iterations Min/Mean/Max = 95,368/96,565/97,798

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/lock_events_list.h |  10 ++-
 kernel/locking/rwsem.c            | 134 +++++++++++++++++++++++++++++-
 2 files changed, 136 insertions(+), 8 deletions(-)

diff --git a/kernel/locking/lock_events_list.h b/kernel/locking/lock_events_list.h
index baa998401052..239039d0ce21 100644
--- a/kernel/locking/lock_events_list.h
+++ b/kernel/locking/lock_events_list.h
@@ -56,10 +56,12 @@ LOCK_EVENT(rwsem_sleep_reader)	/* # of reader sleeps			*/
 LOCK_EVENT(rwsem_sleep_writer)	/* # of writer sleeps			*/
 LOCK_EVENT(rwsem_wake_reader)	/* # of reader wakeups			*/
 LOCK_EVENT(rwsem_wake_writer)	/* # of writer wakeups			*/
-LOCK_EVENT(rwsem_opt_rlock)	/* # of read locks opt-spin acquired	*/
-LOCK_EVENT(rwsem_opt_wlock)	/* # of write locks opt-spin acquired	*/
-LOCK_EVENT(rwsem_opt_fail)	/* # of failed opt-spinnings		*/
-LOCK_EVENT(rwsem_opt_nospin)	/* # of disabled reader opt-spinnings	*/
+LOCK_EVENT(rwsem_opt_rlock)	/* # of opt-acquired read locks		*/
+LOCK_EVENT(rwsem_opt_wlock)	/* # of opt-acquired write locks	*/
+LOCK_EVENT(rwsem_opt_fail)	/* # of failed optspins			*/
+LOCK_EVENT(rwsem_opt_nospin)	/* # of disabled optspins		*/
+LOCK_EVENT(rwsem_opt_norspin)	/* # of disabled reader-only optspins	*/
+LOCK_EVENT(rwsem_opt_rlock2)	/* # of opt-acquired 2ndary read locks	*/
 LOCK_EVENT(rwsem_rlock)		/* # of read locks acquired		*/
 LOCK_EVENT(rwsem_rlock_fast)	/* # of fast read locks acquired	*/
 LOCK_EVENT(rwsem_rlock_fail)	/* # of failed read lock acquisitions	*/
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index ec4c26b353c9..743476f386b2 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -59,6 +59,42 @@
  * seems to hang on a reader owned rwsem especially if only one reader
  * is involved. Ideally we would like to track all the readers that own
  * a rwsem, but the overhead is simply too big.
+ *
+ * Reader optimistic spinning is helpful when the reader critical section
+ * is short and there aren't that many readers around. It makes readers
+ * relatively more preferred than writers. When a writer times out spinning
+ * on a reader-owned lock and sets the nonspinnable bits, there are two main
+ * reasons for that.
+ *
+ *  1) The reader critical section is long, perhaps the task sleeps after
+ *     acquiring the read lock.
+ *  2) There are just too many readers contending the lock causing it to
+ *     take a while to service all of them.
+ *
+ * In the former case, a long reader critical section will impede the
+ * progress of writers, which is usually more important for system
+ * performance. In the latter case, reader optimistic spinning tends to
+ * make the reader groups that contain readers that acquire the lock
+ * together smaller, leading to more of them. That may hurt performance
+ * in some cases. In other words, the setting of the nonspinnable bits
+ * indicates that reader optimistic spinning may not be helpful for those
+ * workloads that cause it.
+ *
+ * Therefore, any writers that have observed the setting of the writer
+ * nonspinnable bit for a given rwsem after they fail to acquire the lock
+ * via optimistic spinning will set the reader nonspinnable bit once they
+ * acquire the write lock. Similarly, readers that observe the setting
+ * of the reader nonspinnable bit at slowpath entry will set the reader
+ * nonspinnable bit when they acquire the read lock via the wakeup path.
+ *
+ * Once the reader nonspinnable bit is on, it will only be reset when
+ * a writer is able to acquire the rwsem in the fast path or somehow a
+ * reader or writer in the slowpath doesn't observe the nonspinnable bit.
+ *
+ * This is to discourage reader optimistic spinning on that particular
+ * rwsem and make writers more preferred. This adaptive disabling of reader
+ * optimistic spinning will alleviate the negative side effects of this
+ * feature.
  */
 #define RWSEM_READER_OWNED	(1UL << 0)
 #define RWSEM_RD_NONSPINNABLE	(1UL << 1)
@@ -144,11 +180,14 @@ static inline bool rwsem_test_oflags(struct rw_semaphore *sem, long flags)
  * Note that the owner value just indicates the task has owned the rwsem
  * previously, it may not be the real owner or one of the real owners
  * anymore when that field is examined, so take it with a grain of salt.
+ *
+ * The reader non-spinnable bit is preserved.
  */
 static inline void __rwsem_set_reader_owned(struct rw_semaphore *sem,
 					    struct task_struct *owner)
 {
-	long val = (long)owner | RWSEM_READER_OWNED;
+	long val = (long)owner | RWSEM_READER_OWNED |
+		   (atomic_long_read(&sem->owner) & RWSEM_RD_NONSPINNABLE);
 
 	atomic_long_set(&sem->owner, val);
 }
@@ -286,6 +325,7 @@ struct rwsem_waiter {
 	struct task_struct *task;
 	enum rwsem_waiter_type type;
 	unsigned long timeout;
+	long last_rowner;
 };
 #define rwsem_first_waiter(sem) \
 	list_first_entry(&sem->wait_list, struct rwsem_waiter, list)
@@ -367,6 +407,8 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
 	 * so we can bail out early if a writer stole the lock.
 	 */
 	if (wake_type != RWSEM_WAKE_READ_OWNED) {
+		struct task_struct *owner;
+
 		adjustment = RWSEM_READER_BIAS;
 		oldcount = atomic_long_fetch_add(adjustment, &sem->count);
 		if (unlikely(oldcount & RWSEM_WRITER_MASK)) {
@@ -387,8 +429,15 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
 		/*
 		 * Set it to reader-owned to give spinners an early
 		 * indication that readers now have the lock.
+		 * The reader nonspinnable bit seen at slowpath entry of
+		 * the reader is copied over.
 		 */
-		__rwsem_set_reader_owned(sem, waiter->task);
+		owner = waiter->task;
+		if (waiter->last_rowner & RWSEM_RD_NONSPINNABLE) {
+			owner = (void *)((long)owner | RWSEM_RD_NONSPINNABLE);
+			lockevent_inc(rwsem_opt_norspin);
+		}
+		__rwsem_set_reader_owned(sem, owner);
 	}
 
 	/*
@@ -836,6 +885,43 @@ static inline void clear_wr_nonspinnable(struct rw_semaphore *sem)
 	if (rwsem_test_oflags(sem, RWSEM_WR_NONSPINNABLE))
 		atomic_long_andnot(RWSEM_WR_NONSPINNABLE, &sem->owner);
 }
+
+/*
+ * This function is called when the reader fails to acquire the lock via
+ * optimistic spinning. In this case we will still attempt to do a trylock
+ * when comparing the rwsem state right now with the state when entering
+ * the slowpath indicates that the reader is still in a valid reader phase.
+ * This happens when the following conditions are true:
+ *
+ * 1) The lock is currently reader owned, and
+ * 2) The lock was previously not reader-owned or the last read owner changed.
+ *
+ * In the former case, we have transitioned from a writer phase to a
+ * reader-phase while spinning. In the latter case, it means the reader
+ * phase hasn't ended when we entered the optimistic spinning loop. In
+ * both cases, the reader is eligible to acquire the lock. This is the
+ * secondary path where a read lock is acquired optimistically.
+ *
+ * The reader nonspinnable bit wasn't set at the time of entry, or the
+ * reader would not be here at all.
+ */
+static inline bool rwsem_reader_phase_trylock(struct rw_semaphore *sem,
+					      long last_rowner)
+{
+	long owner = atomic_long_read(&sem->owner);
+
+	if (!(owner & RWSEM_READER_OWNED))
+		return false;
+
+	owner	    &= ~RWSEM_OWNER_FLAGS_MASK;
+	last_rowner &= ~RWSEM_OWNER_FLAGS_MASK;
+	if ((owner != last_rowner) && rwsem_try_read_lock_unqueued(sem)) {
+		lockevent_inc(rwsem_opt_rlock2);
+		lockevent_add(rwsem_opt_fail, -1);
+		return true;
+	}
+	return false;
+}
 #else
 static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem,
 					   long nonspinnable)
@@ -849,6 +935,12 @@ static inline bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
 }
 
 static inline void clear_wr_nonspinnable(struct rw_semaphore *sem) { }
+
+static inline bool rwsem_reader_phase_trylock(struct rw_semaphore *sem,
+					      unsigned long last_rowner)
+{
+	return false;
+}
 #endif
 
 /*
@@ -862,6 +954,14 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
 	struct rwsem_waiter waiter;
 	DEFINE_WAKE_Q(wake_q);
 
+	/*
+	 * Save the current read-owner of rwsem, if available, and the
+	 * reader nonspinnable bit.
+	 */
+	waiter.last_rowner = atomic_long_read(&sem->owner);
+	if (!(waiter.last_rowner & RWSEM_READER_OWNED))
+		waiter.last_rowner &= RWSEM_RD_NONSPINNABLE;
+
 	if (!rwsem_can_spin_on_owner(sem, RWSEM_RD_NONSPINNABLE))
 		goto queue;
 
@@ -884,6 +984,8 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
 			wake_up_q(&wake_q);
 		}
 		return sem;
+	} else if (rwsem_reader_phase_trylock(sem, waiter.last_rowner)) {
+		return sem;
 	}
 
 queue:
@@ -964,6 +1066,19 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
 	return ERR_PTR(-EINTR);
 }
 
+/*
+ * This function is called by a write lock owner, so the owner value
+ * won't get changed by others.
+ */
+static inline void rwsem_disable_reader_optspin(struct rw_semaphore *sem,
+						bool disable)
+{
+	if (unlikely(disable)) {
+		atomic_long_or(RWSEM_RD_NONSPINNABLE, &sem->owner);
+		lockevent_inc(rwsem_opt_norspin);
+	}
+}
+
 /*
  * Wait until we successfully acquire the write lock
  */
@@ -971,6 +1086,7 @@ static struct rw_semaphore *
 rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 {
 	long count;
+	bool disable_rspin;
 	enum writer_wait_state wstate;
 	struct rwsem_waiter waiter;
 	struct rw_semaphore *ret = sem;
@@ -981,6 +1097,13 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 	    rwsem_optimistic_spin(sem, true))
 		return sem;
 
+	/*
+	 * Disable reader optimistic spinning for this rwsem after
+	 * acquiring the write lock when the setting of the nonspinnable
+	 * bits is observed.
+	 */
+	disable_rspin = atomic_long_read(&sem->owner) & RWSEM_NONSPINNABLE;
+
 	/*
 	 * Optimistic spinning failed, proceed to the slowpath
 	 * and block until we can acquire the sem.
@@ -1077,6 +1200,7 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 	}
 	__set_current_state(TASK_RUNNING);
 	list_del(&waiter.list);
+	rwsem_disable_reader_optspin(sem, disable_rspin);
 	raw_spin_unlock_irq(&sem->wait_lock);
 	lockevent_inc(rwsem_wlock);
 
@@ -1196,7 +1320,8 @@ static inline void __down_write(struct rw_semaphore *sem)
 	if (unlikely(!atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
 						      RWSEM_WRITER_LOCKED)))
 		rwsem_down_write_slowpath(sem, TASK_UNINTERRUPTIBLE);
-	rwsem_set_owner(sem);
+	else
+		rwsem_set_owner(sem);
 }
 
 static inline int __down_write_killable(struct rw_semaphore *sem)
@@ -1207,8 +1332,9 @@ static inline int __down_write_killable(struct rw_semaphore *sem)
 						      RWSEM_WRITER_LOCKED))) {
 		if (IS_ERR(rwsem_down_write_slowpath(sem, TASK_KILLABLE)))
 			return -EINTR;
+	} else {
+		rwsem_set_owner(sem);
 	}
-	rwsem_set_owner(sem);
 	return 0;
 }
 
-- 
2.18.1



* [PATCH v8 16/19] locking/rwsem: Guard against making count negative
  2019-05-20 20:58 [PATCH v8 00/19] locking/rwsem: Rwsem rearchitecture part 2 Waiman Long
                   ` (14 preceding siblings ...)
  2019-05-20 20:59 ` [PATCH v8 15/19] locking/rwsem: Adaptive disabling of reader optimistic spinning Waiman Long
@ 2019-05-20 20:59 ` Waiman Long
  2019-06-11 13:11   ` Peter Zijlstra
                     ` (2 more replies)
  2019-05-20 20:59 ` [PATCH v8 17/19] locking/rwsem: Merge owner into count on x86-64 Waiman Long
                   ` (2 subsequent siblings)
  18 siblings, 3 replies; 83+ messages in thread
From: Waiman Long @ 2019-05-20 20:59 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin
  Cc: linux-kernel, x86, Davidlohr Bueso, Linus Torvalds, Tim Chen,
	huang ying, Waiman Long

The upper bits of the count field are used as the reader count. When a
sufficient number of active readers are present, the most significant
bit will be set and the count becomes negative. If the number of active
readers keeps on piling up, we may eventually overflow the reader count.
This is not likely to happen unless the number of bits reserved for the
reader count is reduced because those bits are needed for other purposes.

To prevent this count overflow from happening, the most significant
bit is now treated as a guard bit (RWSEM_FLAG_READFAIL). Read-lock
attempts will now fail for both the fast and slow paths whenever this
bit is set. So all those extra readers will be put to sleep in the wait
list. Wakeup will not happen until the reader count reaches 0.
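
A rough user-space sketch of this inc-check-dec idea, using C11 atomics and
an illustrative bias value; the actual helper added below,
rwsem_read_trylock(), also returns the adjustment to be applied in the
slowpath:

  #include <stdatomic.h>
  #include <stdbool.h>

  #define READER_BIAS   (1L << 8)         /* illustrative reader increment */

  /*
   * Add the reader bias. If the old count was already negative, the guard
   * (read fail) bit was set: back the increment out immediately and let
   * the caller fall into the slowpath/wait list.
   */
  static bool read_trylock_sketch(atomic_long *count)
  {
          long old = atomic_fetch_add(count, READER_BIAS);

          if (old < 0) {
                  atomic_fetch_sub(count, READER_BIAS);
                  return false;
          }
          return true;
  }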

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/rwsem.c | 95 +++++++++++++++++++++++++++++++++++-------
 1 file changed, 80 insertions(+), 15 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 743476f386b2..028f29b39045 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -116,13 +116,28 @@
 #endif
 
 /*
- * The definition of the atomic counter in the semaphore:
+ * On 64-bit architectures, the bit definitions of the count are:
  *
- * Bit  0   - writer locked bit
- * Bit  1   - waiters present bit
- * Bit  2   - lock handoff bit
- * Bits 3-7 - reserved
- * Bits 8-X - 24-bit (32-bit) or 56-bit reader count
+ * Bit  0    - writer locked bit
+ * Bit  1    - waiters present bit
+ * Bit  2    - lock handoff bit
+ * Bits 3-7  - reserved
+ * Bits 8-62 - 55-bit reader count
+ * Bit  63   - read fail bit
+ *
+ * On 32-bit architectures, the bit definitions of the count are:
+ *
+ * Bit  0    - writer locked bit
+ * Bit  1    - waiters present bit
+ * Bit  2    - lock handoff bit
+ * Bits 3-7  - reserved
+ * Bits 8-30 - 23-bit reader count
+ * Bit  31   - read fail bit
+ *
+ * It is not likely that the most significant bit (read fail bit) will ever
+ * be set. This guard bit is still checked anyway in the down_read() fastpath
+ * just in case we need to use up more of the reader bits for other purposes
+ * in the future.
  *
  * atomic_long_fetch_add() is used to obtain reader lock, whereas
  * atomic_long_cmpxchg() will be used to obtain writer lock.
@@ -139,6 +154,7 @@
 #define RWSEM_WRITER_LOCKED	(1UL << 0)
 #define RWSEM_FLAG_WAITERS	(1UL << 1)
 #define RWSEM_FLAG_HANDOFF	(1UL << 2)
+#define RWSEM_FLAG_READFAIL	(1UL << (BITS_PER_LONG - 1))
 
 #define RWSEM_READER_SHIFT	8
 #define RWSEM_READER_BIAS	(1UL << RWSEM_READER_SHIFT)
@@ -146,7 +162,7 @@
 #define RWSEM_WRITER_MASK	RWSEM_WRITER_LOCKED
 #define RWSEM_LOCK_MASK		(RWSEM_WRITER_MASK|RWSEM_READER_MASK)
 #define RWSEM_READ_FAILED_MASK	(RWSEM_WRITER_MASK|RWSEM_FLAG_WAITERS|\
-				 RWSEM_FLAG_HANDOFF)
+				 RWSEM_FLAG_HANDOFF|RWSEM_FLAG_READFAIL)
 
 /*
  * All writes to owner are protected by WRITE_ONCE() to make sure that
@@ -253,6 +269,28 @@ static inline void rwsem_set_nonspinnable(struct rw_semaphore *sem)
 	}
 }
 
+/*
+ * This function does a read trylock by incrementing the reader count
+ * and then decrementing it immediately if too many readers are present
+ * (count becomes negative) in order to prevent the remote possibility
+ * of overflowing the count with minimal delay between the increment
+ * and decrement.
+ *
+ * It returns the adjustment that should be added back to the count
+ * in the slowpath.
+ */
+static inline long rwsem_read_trylock(struct rw_semaphore *sem, long *cnt)
+{
+	long adjustment = -RWSEM_READER_BIAS;
+
+	*cnt = atomic_long_fetch_add_acquire(RWSEM_READER_BIAS, &sem->count);
+	if (unlikely(*cnt < 0)) {
+		atomic_long_add(-RWSEM_READER_BIAS, &sem->count);
+		adjustment = 0;
+	}
+	return adjustment;
+}
+
 /*
  * Return just the real task structure pointer of the owner
  */
@@ -401,6 +439,12 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
 		return;
 	}
 
+	/*
+	 * No reader wakeup if there are too many of them already.
+	 */
+	if (unlikely(atomic_long_read(&sem->count) < 0))
+		return;
+
 	/*
 	 * Writers might steal the lock before we grant it to the next reader.
 	 * We prefer to do the first reader grant before counting readers
@@ -947,13 +991,30 @@ static inline bool rwsem_reader_phase_trylock(struct rw_semaphore *sem,
  * Wait for the read lock to be granted
  */
 static struct rw_semaphore __sched *
-rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
+rwsem_down_read_slowpath(struct rw_semaphore *sem, int state, long adjustment)
 {
-	long count, adjustment = -RWSEM_READER_BIAS;
+	long count;
 	bool wake = false;
 	struct rwsem_waiter waiter;
 	DEFINE_WAKE_Q(wake_q);
 
+	if (unlikely(!adjustment)) {
+		/*
+		 * This shouldn't happen. If it does, there is probably
+		 * something wrong in the system.
+		 */
+		WARN_ON_ONCE(1);
+
+		/*
+		 * An adjustment of 0 means that there are too many readers
+		 * holding or trying to acquire the lock. So disable
+		 * optimistic spinning and go directly into the wait list.
+		 */
+		if (rwsem_test_oflags(sem, RWSEM_RD_NONSPINNABLE))
+			rwsem_set_nonspinnable(sem);
+		goto queue;
+	}
+
 	/*
 	 * Save the current read-owner of rwsem, if available, and the
 	 * reader nonspinnable bit.
@@ -1271,9 +1332,10 @@ static struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem)
  */
 inline void __down_read(struct rw_semaphore *sem)
 {
-	if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
-			&sem->count) & RWSEM_READ_FAILED_MASK)) {
-		rwsem_down_read_slowpath(sem, TASK_UNINTERRUPTIBLE);
+	long tmp, adjustment = rwsem_read_trylock(sem, &tmp);
+
+	if (unlikely(tmp & RWSEM_READ_FAILED_MASK)) {
+		rwsem_down_read_slowpath(sem, TASK_UNINTERRUPTIBLE, adjustment);
 		DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
 	} else {
 		rwsem_set_reader_owned(sem);
@@ -1282,9 +1344,11 @@ inline void __down_read(struct rw_semaphore *sem)
 
 static inline int __down_read_killable(struct rw_semaphore *sem)
 {
-	if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
-			&sem->count) & RWSEM_READ_FAILED_MASK)) {
-		if (IS_ERR(rwsem_down_read_slowpath(sem, TASK_KILLABLE)))
+	long tmp, adjustment = rwsem_read_trylock(sem, &tmp);
+
+	if (unlikely(tmp & RWSEM_READ_FAILED_MASK)) {
+		if (IS_ERR(rwsem_down_read_slowpath(sem, TASK_KILLABLE,
+						    adjustment)))
 			return -EINTR;
 		DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
 	} else {
@@ -1360,6 +1424,7 @@ inline void __up_read(struct rw_semaphore *sem)
 	DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
 	rwsem_clear_reader_owned(sem);
 	tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
+	DEBUG_RWSEMS_WARN_ON(tmp < 0, sem);
 	if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS)) ==
 		      RWSEM_FLAG_WAITERS)) {
 		clear_wr_nonspinnable(sem);
-- 
2.18.1



* [PATCH v8 17/19] locking/rwsem: Merge owner into count on x86-64
  2019-05-20 20:58 [PATCH v8 00/19] locking/rwsem: Rwsem rearchitecture part 2 Waiman Long
                   ` (15 preceding siblings ...)
  2019-05-20 20:59 ` [PATCH v8 16/19] locking/rwsem: Guard against making count negative Waiman Long
@ 2019-05-20 20:59 ` Waiman Long
  2019-06-04  9:45   ` Peter Zijlstra
  2019-05-20 20:59 ` [PATCH v8 18/19] locking/rwsem: Remove redundant computation of writer lock word Waiman Long
  2019-05-20 20:59 ` [PATCH v8 19/19] locking/rwsem: Disable preemption in down_read*() if owner in count Waiman Long
  18 siblings, 1 reply; 83+ messages in thread
From: Waiman Long @ 2019-05-20 20:59 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin
  Cc: linux-kernel, x86, Davidlohr Bueso, Linus Torvalds, Tim Chen,
	huang ying, Waiman Long

With separate count and owner fields, there are timing windows where the two
values are inconsistent. That can cause problems when trying to figure
out the exact state of the rwsem. For instance, an RT task will stop
optimistic spinning if the lock is acquired by a writer but the owner
field isn't set yet. That can be solved by combining the count and
owner together in a single atomic value.

On 32-bit architectures, there aren't enough bits to hold both.
64-bit architectures, however, have enough bits to do that. For
x86-64, the physical address can use up to 52 bits, which is 4PB of
memory. That leaves 12 bits available for other uses. The task structure
pointer is aligned to the L1 cache size, so another 6 bits (64-byte
cacheline) are available. Reserving 2 bits for status flags leaves
16 bits for the reader count and the read fail bit. That supports up
to (32k-1) readers. Without 5-level page tables, up to (2M-1) readers
can be supported.

The owner value will still be duplicated in the owner field, as that will
ease debugging when looking at a core dump. There may be a slight overhead
in transforming the task pointer to fit into a smaller number of bits,
but that shouldn't be noticeable in real workloads.

This change is currently enabled for x86-64 only. Other 64-bit
architectures may be enabled in the future if the need arises.
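
To illustrate just the compression arithmetic, here is a stand-alone sketch
with made-up values for the page offset and cache-line shift; the real
constants, the status flag bits and the init_task special case are handled
in the patch below:

  #include <stdint.h>

  #define FAKE_PAGE_OFFSET  0xffff888000000000UL  /* illustrative only */
  #define FAKE_L1_SHIFT     6                     /* 64-byte cacheline */

  /*
   * A cacheline-aligned, linearly mapped pointer has its low 6 bits clear.
   * Shifting right by (shift - 2) keeps the low 2 bits of the result clear,
   * so they stay free for status flags when the value is stored in count.
   */
  static inline uint64_t compress_owner(uintptr_t owner)
  {
          return (owner - FAKE_PAGE_OFFSET) >> (FAKE_L1_SHIFT - 2);
  }

  static inline uintptr_t uncompress_owner(uint64_t val)
  {
          return (uintptr_t)(val << (FAKE_L1_SHIFT - 2)) + FAKE_PAGE_OFFSET;
  }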

With a locking microbenchmark running on 5.1 based kernel, the total
locking rates (in kops/s) on a 8-socket IvyBridge-EX system with
writer-only locking threads and then equal numbers of readers and writers
(mixed) before patch and after this and subsequent related patches were
as follows:

                  Before Patch      After Patch
   # of Threads  wlock    mixed    wlock    mixed
   ------------  -----    -----    -----    -----
        1        30,422   31,034   30,323   30,379
        2         6,427    6,684    7,804    9,436
        4         6,742    6,738    7,568    8,268
        8         7,092    7,222    5,679    7,041
       16         6,882    7,163    6,848    7,652
       32         7,458    7,316    7,975    2,189
       64         7,906      520    8,269      534
      128         1,680      425    8,047      448

In the single thread case, the complex write-locking operation does
introduce a little bit of overhead (about 0.3%). For the contended cases,
except for some anomalies in the data, there is no evidence that this
change will adversely impact performance.

When running the same microbenchmark with RT locking threads instead,
we got the following results:

                  Before Patch      After Patch
   # of Threads  wlock    mixed    wlock    mixed
   ------------  -----    -----    -----    -----
        2         4,065    3,642    4,756    5,062
        4         2,254    1,907    3,460    2,496
        8         2,386      964    3,012    1,964
       16         2,095    1,596    3,083    1,862
       32         2,388      530    3,717      359
       64         1,424      322    4,060      401
      128         1,642      510    4,488      628

It is obvious that RT tasks can benefit pretty significantly from this set
of patches.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 arch/x86/Kconfig       |   6 ++
 kernel/Kconfig.locks   |  12 +++
 kernel/locking/rwsem.c | 162 ++++++++++++++++++++++++++++++++++++++---
 3 files changed, 171 insertions(+), 9 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2bbbd4d1ba31..141be11a3a6a 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -91,6 +91,7 @@ config X86
 	select ARCH_USE_BUILTIN_BSWAP
 	select ARCH_USE_QUEUED_RWLOCKS
 	select ARCH_USE_QUEUED_SPINLOCKS
+	select ARCH_USE_RWSEM_OWNER_COUNT	if X86_64
 	select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
 	select ARCH_WANTS_DYNAMIC_TASK_STRUCT
 	select ARCH_WANTS_THP_SWAP		if X86_64
@@ -353,6 +354,11 @@ config PGTABLE_LEVELS
 	default 3 if X86_PAE
 	default 2
 
+config RWSEM_OWNER_COUNT_PA_BITS
+	int
+	default 52 if X86_5LEVEL
+	default 46 if X86_64
+
 config CC_HAS_SANE_STACKPROTECTOR
 	bool
 	default $(success,$(srctree)/scripts/gcc-x86_64-has-stack-protector.sh $(CC)) if 64BIT
diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index bf770d7556f7..9cd5f8547674 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -258,3 +258,15 @@ config ARCH_HAS_MMIOWB
 config MMIOWB
 	def_bool y if ARCH_HAS_MMIOWB
 	depends on SMP
+
+#
+# A 64-bit architecture that wants to merge the rwsem write-owner into
+# count should select ARCH_USE_RWSEM_OWNER_COUNT and define
+# RWSEM_OWNER_COUNT_PA_BITS as the correct number of physical address
+# bits. In addition, the number of bits available for reader count
+# should allow all the CPUs as defined in NR_CPUS to acquire the same
+# read lock without overflowing it.
+#
+config ARCH_USE_RWSEM_OWNER_COUNT
+	bool
+	depends on SMP && 64BIT
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 028f29b39045..8196ace2d4a2 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -116,7 +116,38 @@
 #endif
 
 /*
- * On 64-bit architectures, the bit definitions of the count are:
+ * With separate count and owner, there are timing windows where the two
+ * values are inconsistent. That can cause problems when trying to figure
+ * out the exact state of the rwsem. That can be solved by combining
+ * the count and owner together in a single atomic value.
+ *
+ * On 64-bit architectures, the owner task structure pointer can be
+ * compressed and combined with reader count and other status flags.
+ * A simple compression method is to map the virtual address back to
+ * the physical address by subtracting PAGE_OFFSET. On 32-bit
+ * architectures, the long integer value just isn't big enough for
+ * combining owner and count. So they remain separate.
+ *
+ * For x86-64, the physical address can use up to 52 bits if
+ * CONFIG_X86_5LEVEL. That is 4PB of memory. That leaves 12 bits
+ * available for other use. The task structure pointer is also aligned
+ * to the L1 cache size. That means another 6 bits (64 bytes cacheline)
+ * will be available. Reserving 2 bits for status flags, we will have
+ * 16 bits for the reader count and read fail bit. That supports up
+ * to (32k-1) active readers. If 5-level page table support isn't
+ * configured, up to (2M-1) active readers can be supported.
+ *
+ * On x86-64 with CONFIG_X86_5LEVEL and CONFIG_ARCH_USE_RWSEM_OWNER_COUNT,
+ * the bit definitions of the count are:
+ *
+ * Bit   0    - waiters present bit
+ * Bit   1    - lock handoff bit
+ * Bits  2-47 - compressed task structure pointer
+ * Bits 48-62 - 15-bit reader counts
+ * Bit  63    - read fail bit
+ *
+ * On other 64-bit architectures without MERGE_OWNER_INTO_COUNT, the bit
+ * definitions are:
  *
  * Bit  0    - writer locked bit
  * Bit  1    - waiters present bit
@@ -151,26 +182,81 @@
  * be the first one in the wait_list to be eligible for setting the handoff
  * bit. So concurrent setting/clearing of handoff bit is not possible.
  */
-#define RWSEM_WRITER_LOCKED	(1UL << 0)
-#define RWSEM_FLAG_WAITERS	(1UL << 1)
-#define RWSEM_FLAG_HANDOFF	(1UL << 2)
+#define RWSEM_FLAG_WAITERS	(1UL << 0)
+#define RWSEM_FLAG_HANDOFF	(1UL << 1)
 #define RWSEM_FLAG_READFAIL	(1UL << (BITS_PER_LONG - 1))
 
+/*
+ * The MERGE_OWNER_INTO_COUNT macro will only be defined if the following
+ * conditions are true:
+ *  1) Both CONFIG_ARCH_USE_RWSEM_OWNER_COUNT and
+ *     CONFIG_RWSEM_OWNER_COUNT_PA_BITS are defined.
+ *  2) The number of reader count bits available is able to hold the
+ *     maximum number of CPUs as defined in NR_CPUS.
+ */
+#if	defined(CONFIG_ARCH_USE_RWSEM_OWNER_COUNT) && \
+	defined(CONFIG_RWSEM_OWNER_COUNT_PA_BITS)
+# define __READER_SHIFT		(CONFIG_RWSEM_OWNER_COUNT_PA_BITS -\
+				 L1_CACHE_SHIFT + 2)
+# define __READER_COUNT_BITS	(BITS_PER_LONG - __READER_SHIFT - 1)
+# define __READER_COUNT_MAX	((1UL << __READER_COUNT_BITS) - 1)
+# if (NR_CPUS <= __READER_COUNT_MAX)
+#  define MERGE_OWNER_INTO_COUNT
+# endif
+#endif
+
+#ifdef MERGE_OWNER_INTO_COUNT
+#define RWSEM_READER_SHIFT	__READER_SHIFT
+#define RWSEM_WRITER_MASK	((1UL << RWSEM_READER_SHIFT) - 4)
+#define RWSEM_WRITER_LOCKED	rwsem_owner_count(current)
+#else /* !MERGE_OWNER_INTO_COUNT */
 #define RWSEM_READER_SHIFT	8
+#define RWSEM_WRITER_MASK	(1UL << 7)
+#define RWSEM_WRITER_LOCKED	RWSEM_WRITER_MASK
+#endif /* MERGE_OWNER_INTO_COUNT */
+
 #define RWSEM_READER_BIAS	(1UL << RWSEM_READER_SHIFT)
 #define RWSEM_READER_MASK	(~(RWSEM_READER_BIAS - 1))
-#define RWSEM_WRITER_MASK	RWSEM_WRITER_LOCKED
 #define RWSEM_LOCK_MASK		(RWSEM_WRITER_MASK|RWSEM_READER_MASK)
 #define RWSEM_READ_FAILED_MASK	(RWSEM_WRITER_MASK|RWSEM_FLAG_WAITERS|\
 				 RWSEM_FLAG_HANDOFF|RWSEM_FLAG_READFAIL)
 
+/*
+ * Task structure pointer compression (64-bit only):
+ * (owner - PAGE_OFFSET) >> (L1_CACHE_SHIFT - 2)
+ *
+ * However, init_task may lie outside of the linearly mapped physical
+ * to virtual memory range and so has to be handled separately.
+ */
+static inline unsigned long rwsem_owner_count(struct task_struct *owner)
+{
+	if (unlikely(owner == &init_task))
+		return RWSEM_WRITER_MASK;
+
+	return ((unsigned long)owner - PAGE_OFFSET) >> (L1_CACHE_SHIFT - 2);
+}
+
+static inline unsigned long rwsem_count_owner(long count)
+{
+	unsigned long writer = (unsigned long)count & RWSEM_WRITER_MASK;
+
+	if (unlikely(writer == RWSEM_WRITER_MASK))
+		return (unsigned long)&init_task;
+
+	return writer ? (writer << (L1_CACHE_SHIFT - 2)) + PAGE_OFFSET : 0;
+}
+
 /*
  * All writes to owner are protected by WRITE_ONCE() to make sure that
  * store tearing can't happen as optimistic spinners may read and use
  * the owner value concurrently without lock. Read from owner, however,
  * may not need READ_ONCE() as long as the pointer value is only used
  * for comparison and isn't being dereferenced.
+ *
+ * With MERGE_OWNER_INTO_COUNT defined, the writer task structure pointer
+ * is written to the count in addition to the owner field.
  */
+
 static inline void rwsem_set_owner(struct rw_semaphore *sem)
 {
 	atomic_long_set(&sem->owner, (long)current);
@@ -269,6 +355,27 @@ static inline void rwsem_set_nonspinnable(struct rw_semaphore *sem)
 	}
 }
 
+#ifdef MERGE_OWNER_INTO_COUNT
+/*
+ * Get the owner value from count to have early access to the task structure.
+ */
+static inline struct task_struct *rwsem_read_owner(struct rw_semaphore *sem)
+{
+	return (struct task_struct *)
+		rwsem_count_owner(atomic_long_read(&sem->count));
+}
+
+/*
+ * Return the real task structure pointer of the owner and the embedded
+ * flags in the owner.
+ */
+static inline struct task_struct *
+rwsem_read_owner_flags(struct rw_semaphore *sem, long *pflags)
+{
+	*pflags = atomic_long_read(&sem->owner) & RWSEM_OWNER_FLAGS_MASK;
+	return rwsem_read_owner(sem);
+}
+
 /*
  * This function does a read trylock by incrementing the reader count
  * and then decrementing it immediately if too many readers are present
@@ -276,6 +383,18 @@ static inline void rwsem_set_nonspinnable(struct rw_semaphore *sem)
  * of overflowing the count with minimal delay between the increment
  * and decrement.
  *
+ * When the owner task structure pointer is merged into count, fewer bits
+ * will be available for readers (down to 15 bits for x86-64). There is a
+ * very slight chance that preemption may happen in the middle of the
+ * inc-check-dec sequence, leaving the reader count incremented for a
+ * certain period of time until the reader wakes up and moves on. Still,
+ * the chance of having enough of these unfortunate sequences of events to
+ * overflow the reader count is infinitesimally small.
+ *
+ * If MERGE_OWNER_INTO_COUNT isn't defined, we don't really need to
+ * worry about the possibility of overflowing the reader counts even
+ * for 32-bit architectures which can support up to 8M readers.
+ *
  * It returns the adjustment that should be added back to the count
  * in the slowpath.
  */
@@ -291,6 +410,16 @@ static inline long rwsem_read_trylock(struct rw_semaphore *sem, long *cnt)
 	return adjustment;
 }
 
+static int __init rwsem_show_count_status(void)
+{
+	pr_info("RW Semaphores: Write-owner in count & %d bits for readers.\n",
+		__READER_COUNT_BITS);
+	return 0;
+}
+late_initcall(rwsem_show_count_status);
+
+#else /* !MERGE_OWNER_INTO_COUNT */
+
 /*
  * Return just the real task structure pointer of the owner
  */
@@ -313,14 +442,21 @@ rwsem_read_owner_flags(struct rw_semaphore *sem, long *pflags)
 	return (struct task_struct *)(owner & ~RWSEM_OWNER_FLAGS_MASK);
 }
 
+static inline long rwsem_read_trylock(struct rw_semaphore *sem, long *cnt)
+{
+	*cnt = atomic_long_fetch_add_acquire(RWSEM_READER_BIAS, &sem->count);
+	return -RWSEM_READER_BIAS;
+}
+#endif /* MERGE_OWNER_INTO_COUNT */
+
 /*
  * Guide to the rw_semaphore's count field.
  *
- * When the RWSEM_WRITER_LOCKED bit in count is set, the lock is owned
- * by a writer.
+ * When any of the RWSEM_WRITER_MASK bits in count is set, the lock is
+ * owned by a writer.
  *
  * The lock is owned by readers when
- * (1) the RWSEM_WRITER_LOCKED isn't set in count,
+ * (1) none of the RWSEM_WRITER_MASK bits is set in count,
  * (2) some of the reader bits are set in count, and
  * (3) the owner field has RWSEM_READ_OWNED bit set.
  *
@@ -1386,6 +1522,14 @@ static inline void __down_write(struct rw_semaphore *sem)
 		rwsem_down_write_slowpath(sem, TASK_UNINTERRUPTIBLE);
 	else
 		rwsem_set_owner(sem);
+#ifdef MERGE_OWNER_INTO_COUNT
+	/*
+	 * Make sure that count<=>owner translation is correct.
+	 */
+	DEBUG_RWSEMS_WARN_ON(
+		(atomic_long_read(&sem->owner) & ~RWSEM_OWNER_FLAGS_MASK) !=
+		(long)rwsem_read_owner(sem), sem);
+#endif
 }
 
 static inline int __down_write_killable(struct rw_semaphore *sem)
@@ -1446,7 +1590,7 @@ static inline void __up_write(struct rw_semaphore *sem)
 	DEBUG_RWSEMS_WARN_ON((rwsem_read_owner(sem) != current) &&
 			    !rwsem_test_oflags(sem, RWSEM_NONSPINNABLE), sem);
 	rwsem_clear_owner(sem);
-	tmp = atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED, &sem->count);
+	tmp = atomic_long_fetch_and_release(~RWSEM_WRITER_MASK, &sem->count);
 	if (unlikely(tmp & RWSEM_FLAG_WAITERS))
 		rwsem_wake(sem, tmp);
 }
-- 
2.18.1



* [PATCH v8 18/19] locking/rwsem: Remove redundant computation of writer lock word
  2019-05-20 20:58 [PATCH v8 00/19] locking/rwsem: Rwsem rearchitecture part 2 Waiman Long
                   ` (16 preceding siblings ...)
  2019-05-20 20:59 ` [PATCH v8 17/19] locking/rwsem: Merge owner into count on x86-64 Waiman Long
@ 2019-05-20 20:59 ` Waiman Long
  2019-05-20 20:59 ` [PATCH v8 19/19] locking/rwsem: Disable preemption in down_read*() if owner in count Waiman Long
  18 siblings, 0 replies; 83+ messages in thread
From: Waiman Long @ 2019-05-20 20:59 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin
  Cc: linux-kernel, x86, Davidlohr Bueso, Linus Torvalds, Tim Chen,
	huang ying, Waiman Long

On 64-bit architectures, each rwsem writer will have its unique lock
word for acquiring the lock. Right now, the writer code recomputes the
lock word every time it tries to acquire the lock. This is a waste of
time. The lock word is now cached and reused when it is needed.

When CONFIG_RWSEM_OWNER_COUNT isn't defined, the extra constant argument
to rwsem_try_write_lock() and rwsem_try_write_lock_unqueued() should
be optimized out by the compiler.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/rwsem.c | 20 ++++++++++++--------
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 8196ace2d4a2..29f0e0e5b62e 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -706,6 +706,7 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
  * bit is set or the lock is acquired with handoff bit cleared.
  */
 static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
+					const long wlock,
 					enum writer_wait_state wstate)
 {
 	long count, new;
@@ -727,7 +728,7 @@ static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
 
 			new |= RWSEM_FLAG_HANDOFF;
 		} else {
-			new |= RWSEM_WRITER_LOCKED;
+			new |= wlock;
 			new &= ~RWSEM_FLAG_HANDOFF;
 
 			if (list_is_singular(&sem->wait_list))
@@ -774,13 +775,14 @@ static inline bool rwsem_try_read_lock_unqueued(struct rw_semaphore *sem)
 /*
  * Try to acquire write lock before the writer has been put on wait queue.
  */
-static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem)
+static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem,
+						 const long wlock)
 {
 	long count = atomic_long_read(&sem->count);
 
 	while (!(count & (RWSEM_LOCK_MASK|RWSEM_FLAG_HANDOFF))) {
 		if (atomic_long_try_cmpxchg_acquire(&sem->count, &count,
-					count | RWSEM_WRITER_LOCKED)) {
+					count | wlock)) {
 			rwsem_set_owner(sem);
 			lockevent_inc(rwsem_opt_wlock);
 			return true;
@@ -925,7 +927,7 @@ static inline u64 rwsem_rspin_threshold(struct rw_semaphore *sem)
 	return sched_clock() + delta;
 }
 
-static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
+static bool rwsem_optimistic_spin(struct rw_semaphore *sem, const long wlock)
 {
 	bool taken = false;
 	int prev_owner_state = OWNER_NULL;
@@ -956,7 +958,7 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
 		/*
 		 * Try to acquire the lock
 		 */
-		taken = wlock ? rwsem_try_write_lock_unqueued(sem)
+		taken = wlock ? rwsem_try_write_lock_unqueued(sem, wlock)
 			      : rwsem_try_read_lock_unqueued(sem);
 
 		if (taken)
@@ -1109,7 +1111,8 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem,
 	return false;
 }
 
-static inline bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
+static inline bool rwsem_optimistic_spin(struct rw_semaphore *sem,
+					 const long wlock)
 {
 	return false;
 }
@@ -1288,10 +1291,11 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 	struct rwsem_waiter waiter;
 	struct rw_semaphore *ret = sem;
 	DEFINE_WAKE_Q(wake_q);
+	const long wlock = RWSEM_WRITER_LOCKED;
 
 	/* do optimistic spinning and steal lock if possible */
 	if (rwsem_can_spin_on_owner(sem, RWSEM_WR_NONSPINNABLE) &&
-	    rwsem_optimistic_spin(sem, true))
+	    rwsem_optimistic_spin(sem, wlock))
 		return sem;
 
 	/*
@@ -1353,7 +1357,7 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 	/* wait until we successfully acquire the lock */
 	set_current_state(state);
 	while (true) {
-		if (rwsem_try_write_lock(sem, wstate))
+		if (rwsem_try_write_lock(sem, wlock, wstate))
 			break;
 
 		raw_spin_unlock_irq(&sem->wait_lock);
-- 
2.18.1



* [PATCH v8 19/19] locking/rwsem: Disable preemption in down_read*() if owner in count
  2019-05-20 20:58 [PATCH v8 00/19] locking/rwsem: Rwsem rearchitecture part 2 Waiman Long
                   ` (17 preceding siblings ...)
  2019-05-20 20:59 ` [PATCH v8 18/19] locking/rwsem: Remove redundant computation of writer lock word Waiman Long
@ 2019-05-20 20:59 ` Waiman Long
  18 siblings, 0 replies; 83+ messages in thread
From: Waiman Long @ 2019-05-20 20:59 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin
  Cc: linux-kernel, x86, Davidlohr Bueso, Linus Torvalds, Tim Chen,
	huang ying, Waiman Long

It is very unlikely that successive preemption in the middle of
down_read's inc-check-dec sequence will cause the reader count to
overflow. For absolute correctness, however, we still need to prevent
that possibility from happening. So preemption will be disabled during
the down_read*() call.

For PREEMPT=n kernels, there isn't much overhead in doing that.
For PREEMPT=y kernels, there will be some additional cost. RT kernels
have their own rwsem code, so it will not be a problem for them.

If MERGE_OWNER_INTO_COUNT isn't defined, we don't need to worry about
reader count overflow and so we don't need to disable preemption.
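
A condensed, stand-alone sketch of how the fastpath is bracketed; the
helpers are stubbed out here, while the patch below maps them to
preempt_disable()/preempt_enable() only when MERGE_OWNER_INTO_COUNT is
defined:

  #include <stdbool.h>

  /* No-op stubs so this sketch stands alone. */
  #define rwsem_preempt_disable()  do { } while (0)
  #define rwsem_preempt_enable()   do { } while (0)

  struct rw_semaphore { int dummy; };

  static bool read_trylock_fast(struct rw_semaphore *sem) { (void)sem; return true; }
  static void read_lock_slowpath(struct rw_semaphore *sem) { (void)sem; }

  /*
   * Keep the whole inc-check-dec window non-preemptible so a preempted
   * reader cannot leave the count inflated for an extended period.
   */
  static void down_read_sketch(struct rw_semaphore *sem)
  {
          rwsem_preempt_disable();
          if (!read_trylock_fast(sem))
                  read_lock_slowpath(sem);
          rwsem_preempt_enable();
  }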

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/rwsem.c | 38 ++++++++++++++++++++++++++++++++++----
 1 file changed, 34 insertions(+), 4 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 29f0e0e5b62e..cede2f99220b 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -356,6 +356,24 @@ static inline void rwsem_set_nonspinnable(struct rw_semaphore *sem)
 }
 
 #ifdef MERGE_OWNER_INTO_COUNT
+/*
+ * It is very unlikely that successive preemption at the middle of
+ * down_read's inc-check-dec sequence will cause the reader count to
+ * overflow, For absolute correctness, we still need to prevent
+ * that possibility from happening. So preemption will be disabled
+ * during the down_read*() call.
+ *
+ * For PREEMPT=n kernels, there isn't much overhead in doing that.
+ * For PREEMPT=y kernels, there will be some additional cost.
+ *
+ * If MERGE_OWNER_INTO_COUNT isn't defined, we don't need to worry
+ * about reader count overflow and so we don't need to disable
+ * preemption.
+ */
+#define rwsem_preempt_disable()			preempt_disable()
+#define rwsem_preempt_enable()			preempt_enable()
+#define rwsem_schedule_preempt_disabled()	schedule_preempt_disabled()
+
 /*
  * Get the owner value from count to have early access to the task structure.
  */
@@ -420,6 +438,10 @@ late_initcall(rwsem_show_count_status);
 
 #else /* !MERGE_OWNER_INTO_COUNT */
 
+#define rwsem_preempt_disable()
+#define rwsem_preempt_enable()
+#define rwsem_schedule_preempt_disabled()	schedule()
+
 /*
  * Return just the real task structure pointer of the owner
  */
@@ -1247,7 +1269,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state, long adjustment)
 			raw_spin_unlock_irq(&sem->wait_lock);
 			break;
 		}
-		schedule();
+		rwsem_schedule_preempt_disabled();
 		lockevent_inc(rwsem_sleep_reader);
 	}
 
@@ -1472,28 +1494,36 @@ static struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem)
  */
 inline void __down_read(struct rw_semaphore *sem)
 {
-	long tmp, adjustment = rwsem_read_trylock(sem, &tmp);
+	long tmp, adjustment;
 
+	rwsem_preempt_disable();
+	adjustment = rwsem_read_trylock(sem, &tmp);
 	if (unlikely(tmp & RWSEM_READ_FAILED_MASK)) {
 		rwsem_down_read_slowpath(sem, TASK_UNINTERRUPTIBLE, adjustment);
 		DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
 	} else {
 		rwsem_set_reader_owned(sem);
 	}
+	rwsem_preempt_enable();
 }
 
 static inline int __down_read_killable(struct rw_semaphore *sem)
 {
-	long tmp, adjustment = rwsem_read_trylock(sem, &tmp);
+	long tmp, adjustment;
 
+	rwsem_preempt_disable();
+	adjustment = rwsem_read_trylock(sem, &tmp);
 	if (unlikely(tmp & RWSEM_READ_FAILED_MASK)) {
 		if (IS_ERR(rwsem_down_read_slowpath(sem, TASK_KILLABLE,
-						    adjustment)))
+						    adjustment))) {
+			rwsem_preempt_enable();
 			return -EINTR;
+		}
 		DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
 	} else {
 		rwsem_set_reader_owned(sem);
 	}
+	rwsem_preempt_enable();
 	return 0;
 }
 
-- 
2.18.1



* Re: [PATCH v8 07/19] locking/rwsem: Implement lock handoff to prevent lock starvation
  2019-05-20 20:59 ` [PATCH v8 07/19] locking/rwsem: Implement lock handoff to prevent lock starvation Waiman Long
@ 2019-06-04  3:03   ` Yuyang Du
  2019-06-04  3:26     ` Yuyang Du
  2019-06-17 14:27   ` [tip:locking/core] " tip-bot for Waiman Long
  1 sibling, 1 reply; 83+ messages in thread
From: Yuyang Du @ 2019-06-04  3:03 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin, LKML, x86, Davidlohr Bueso,
	Linus Torvalds, Tim Chen, huang ying, Boqun Feng

Hi Waiman,

On Tue, 21 May 2019 at 05:01, Waiman Long <longman@redhat.com> wrote:
>
> Because of writer lock stealing, it is possible that a constant
> stream of incoming writers will cause a waiting writer or reader to
> wait indefinitely leading to lock starvation.
>
> This patch implements a lock handoff mechanism to disable lock stealing
> and force lock handoff to the first waiter or waiters (for readers)
> in the queue after at least a 4ms waiting period unless it is a RT
> writer task which doesn't need to wait. The waiting period is used to
> avoid discouraging lock stealing too much to affect performance.

I was working on a patchset to solve the read-write lock deadlock
detection problem (https://lkml.org/lkml/2019/5/16/93).

One of the mistakes in that work is that I considered the following
case as deadlock:

  T1            T2
  --            --

  down_read1    down_write2

  down_write2   down_read1

So I was trying to understand what really went wrong, and the problem I
found is that, if I understand correctly, the current rwsem design doesn't
provide real fairness but rather priority in favor of write locks, and thus
one of the bad effects is that read locks can be starved if write locks
keep coming.

Luckily, I noticed you are revamping rwsem and seem to have thought
about it already. I am not entirely sure what your work's ramifications
are for the above case, so I hope that you can shed some light and
perhaps share your thoughts on this.

Thanks,
Yuyang


* Re: [PATCH v8 07/19] locking/rwsem: Implement lock handoff to prevent lock starvation
  2019-06-04  3:03   ` Yuyang Du
@ 2019-06-04  3:26     ` Yuyang Du
  2019-06-04  9:12       ` Boqun Feng
  2019-06-04 13:21       ` Waiman Long
  0 siblings, 2 replies; 83+ messages in thread
From: Yuyang Du @ 2019-06-04  3:26 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin, LKML, x86, Davidlohr Bueso,
	Linus Torvalds, Tim Chen, huang ying, Boqun Feng

On Tue, 4 Jun 2019 at 11:03, Yuyang Du <duyuyang@gmail.com> wrote:
>
> Hi Waiman,
>
> On Tue, 21 May 2019 at 05:01, Waiman Long <longman@redhat.com> wrote:
> >
> > Because of writer lock stealing, it is possible that a constant
> > stream of incoming writers will cause a waiting writer or reader to
> > wait indefinitely leading to lock starvation.
> >
> > This patch implements a lock handoff mechanism to disable lock stealing
> > and force lock handoff to the first waiter or waiters (for readers)
> > in the queue after at least a 4ms waiting period unless it is a RT
> > writer task which doesn't need to wait. The waiting period is used to
> > avoid discouraging lock stealing too much to affect performance.
>
> I was working on a patchset to solve read-write lock deadlock
> detection problem (https://lkml.org/lkml/2019/5/16/93).
>
> One of the mistakes in that work is that I considered the following
> case as deadlock:

Sorry everyone, but let me rephrase:

One of the mistakes in that work is that I considered the following
case as no deadlock:

>
>   T1            T2
>   --            --
>
>   down_read1    down_write2
>
>   down_write2   down_read1
>
> So I was trying to understand what really went wrong and find the
> problem is that if I understand correctly the current rwsem design
> isn't showing real fairness but priority in favor of write locks, and
> thus one of the bad effects is that read locks can be starved if write
> locks keep coming.
>
> Luckily, I noticed you are revamping rwsem and seem to have thought
> about it already. I am not crystal sure what is your work's
> ramification on the above case, so hope that you can shed some light
> and perhaps share your thoughts on this.
>
> Thanks,
> Yuyang


* Re: [PATCH v8 13/19] locking/rwsem: Make rwsem->owner an atomic_long_t
  2019-05-20 20:59 ` [PATCH v8 13/19] locking/rwsem: Make rwsem->owner an atomic_long_t Waiman Long
@ 2019-06-04  8:52   ` Peter Zijlstra
  2019-06-04 15:44     ` Waiman Long
  2019-06-17 14:32   ` [tip:locking/core] " tip-bot for Waiman Long
  2019-07-19 18:45   ` [PATCH v8 13/19] " Luis Henriques
  2 siblings, 1 reply; 83+ messages in thread
From: Peter Zijlstra @ 2019-06-04  8:52 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, Borislav Petkov,
	H. Peter Anvin, linux-kernel, x86, Davidlohr Bueso,
	Linus Torvalds, Tim Chen, huang ying

On Mon, May 20, 2019 at 04:59:12PM -0400, Waiman Long wrote:
> +static inline struct task_struct *rwsem_read_owner(struct rw_semaphore *sem)
> +{
> +	return (struct task_struct *)(atomic_long_read(&sem->owner) &
> +					~RWSEM_OWNER_FLAGS_MASK);
> +}
> +
> +/*
> + * Return the real task structure pointer of the owner and the embedded
> + * flags in the owner. pflags must be non-NULL.
> + */
> +static inline struct task_struct *
> +rwsem_read_owner_flags(struct rw_semaphore *sem, long *pflags)
> +{
> +	long owner = atomic_long_read(&sem->owner);
> +
> +	*pflags = owner & RWSEM_OWNER_FLAGS_MASK;
> +	return (struct task_struct *)(owner & ~RWSEM_OWNER_FLAGS_MASK);
> +}

I got confused by the 'read' part in those names; I initially thought
they paired with rwsem_set_reader_owned().

So I've done 's/rwsem_read_owner/rwsem_owner/g' on it.

--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -194,10 +194,10 @@ static inline void rwsem_clear_reader_ow
 /*
  * Return just the real task structure pointer of the owner
  */
-static inline struct task_struct *rwsem_read_owner(struct rw_semaphore *sem)
+static inline struct task_struct *rwsem_owner(struct rw_semaphore *sem)
 {
-	return (struct task_struct *)(atomic_long_read(&sem->owner) &
-					~RWSEM_OWNER_FLAGS_MASK);
+	return (struct task_struct *)
+		(atomic_long_read(&sem->owner) & ~RWSEM_OWNER_FLAGS_MASK);
 }
 
 /*
@@ -205,7 +205,7 @@ static inline struct task_struct *rwsem_
  * flags in the owner. pflags must be non-NULL.
  */
 static inline struct task_struct *
-rwsem_read_owner_flags(struct rw_semaphore *sem, long *pflags)
+rwsem_owner_flags(struct rw_semaphore *sem, long *pflags)
 {
 	long owner = atomic_long_read(&sem->owner);
 
@@ -561,7 +561,7 @@ static inline bool rwsem_can_spin_on_own
 
 	preempt_disable();
 	rcu_read_lock();
-	owner = rwsem_read_owner_flags(sem, &flags);
+	owner = rwsem_owner_flags(sem, &flags);
 	if ((flags & RWSEM_NONSPINNABLE) || (owner && !owner_on_cpu(owner)))
 		ret = false;
 	rcu_read_unlock();
@@ -590,8 +590,8 @@ enum owner_state {
 };
 #define OWNER_SPINNABLE		(OWNER_NULL | OWNER_WRITER)
 
-static inline enum owner_state rwsem_owner_state(struct task_struct *owner,
-						 long flags)
+static inline enum owner_state
+rwsem_owner_state(struct task_struct *owner, long flags)
 {
 	if (flags & RWSEM_NONSPINNABLE)
 		return OWNER_NONSPINNABLE;
@@ -608,7 +608,7 @@ static noinline enum owner_state rwsem_s
 	long flags, new_flags;
 	enum owner_state state;
 
-	owner = rwsem_read_owner_flags(sem, &flags);
+	owner = rwsem_owner_flags(sem, &flags);
 	state = rwsem_owner_state(owner, flags);
 	if (state != OWNER_WRITER)
 		return state;
@@ -620,7 +620,7 @@ static noinline enum owner_state rwsem_s
 			break;
 		}
 
-		new = rwsem_read_owner_flags(sem, &new_flags);
+		new = rwsem_owner_flags(sem, &new_flags);
 		if ((new != owner) || (new_flags != flags)) {
 			state = rwsem_owner_state(new, new_flags);
 			break;
@@ -1139,7 +1139,7 @@ static inline void __up_write(struct rw_
 	 * sem->owner may differ from current if the ownership is transferred
 	 * to an anonymous writer by setting the RWSEM_NONSPINNABLE bits.
 	 */
-	DEBUG_RWSEMS_WARN_ON((rwsem_read_owner(sem) != current) &&
+	DEBUG_RWSEMS_WARN_ON((rwsem_owner(sem) != current) &&
 			    !rwsem_test_oflags(sem, RWSEM_NONSPINNABLE), sem);
 	rwsem_clear_owner(sem);
 	tmp = atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED, &sem->count);
@@ -1161,7 +1161,7 @@ static inline void __downgrade_write(str
 	 * read-locked region is ok to be re-ordered into the
 	 * write side. As such, rely on RELEASE semantics.
 	 */
-	DEBUG_RWSEMS_WARN_ON(rwsem_read_owner(sem) != current, sem);
+	DEBUG_RWSEMS_WARN_ON(rwsem_owner(sem) != current, sem);
 	tmp = atomic_long_fetch_add_release(
 		-RWSEM_WRITER_LOCKED+RWSEM_READER_BIAS, &sem->count);
 	rwsem_set_reader_owned(sem);


* Re: [PATCH v8 14/19] locking/rwsem: Enable time-based spinning on reader-owned rwsem
  2019-05-20 20:59 ` [PATCH v8 14/19] locking/rwsem: Enable time-based spinning on reader-owned rwsem Waiman Long
@ 2019-06-04  9:03   ` Peter Zijlstra
  2019-06-04 16:54     ` Waiman Long
  2019-06-17 14:32   ` [tip:locking/core] " tip-bot for Waiman Long
  1 sibling, 1 reply; 83+ messages in thread
From: Peter Zijlstra @ 2019-06-04  9:03 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, Borislav Petkov,
	H. Peter Anvin, linux-kernel, x86, Davidlohr Bueso,
	Linus Torvalds, Tim Chen, huang ying

On Mon, May 20, 2019 at 04:59:13PM -0400, Waiman Long wrote:
> +static inline void rwsem_set_nonspinnable(struct rw_semaphore *sem)
> +{
> +	long owner = atomic_long_read(&sem->owner);
> +
> +	while (owner & RWSEM_READER_OWNED) {
> +		if (owner & RWSEM_NONSPINNABLE)
> +			break;
> +		owner = atomic_long_cmpxchg(&sem->owner, owner,
> +					    owner | RWSEM_NONSPINNABLE);
> +	}
> +}

--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -206,12 +206,13 @@ static inline void rwsem_set_nonspinnabl
 {
 	long owner = atomic_long_read(&sem->owner);
 
-	while (owner & RWSEM_READER_OWNED) {
+	do {
+		if (!(owner & RWSEM_READER_OWNED))
+			break;
 		if (owner & RWSEM_NONSPINNABLE)
 			break;
-		owner = atomic_long_cmpxchg(&sem->owner, owner,
-					    owner | RWSEM_NONSPINNABLE);
-	}
+	} while (!atomic_long_try_cmpxchg(&sem->owner, &owner,
+					  owner | RWSEM_NONSPINNABLE));
 }
 
 /*


* Re: [PATCH v8 15/19] locking/rwsem: Adaptive disabling of reader optimistic spinning
  2019-05-20 20:59 ` [PATCH v8 15/19] locking/rwsem: Adaptive disabling of reader optimistic spinning Waiman Long
@ 2019-06-04  9:10   ` Peter Zijlstra
  2019-06-04 17:28     ` Waiman Long
  2019-06-04  9:14   ` Peter Zijlstra
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 83+ messages in thread
From: Peter Zijlstra @ 2019-06-04  9:10 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, Borislav Petkov,
	H. Peter Anvin, linux-kernel, x86, Davidlohr Bueso,
	Linus Torvalds, Tim Chen, huang ying

On Mon, May 20, 2019 at 04:59:14PM -0400, Waiman Long wrote:
> Reader optimistic spinning is helpful when the reader critical section
> is short and there aren't that many readers around. It makes readers
> relatively more preferred than writers. When a writer times out spinning
> on a reader-owned lock and set the nospinnable bits, there are two main
> reasons for that.
> 
>  1) The reader critical section is long, perhaps the task sleeps after
>     acquiring the read lock.
>  2) There are just too many readers contending the lock causing it to
>     take a while to service all of them.
> 
> In the former case, long reader critical section will impede the progress
> of writers which is usually more important for system performance.
> In the later case, reader optimistic spinning tends to make the reader
> groups that contain readers that acquire the lock together smaller
> leading to more of them. That may hurt performance in some cases. In
> other words, the setting of nonspinnable bits indicates that reader
> optimistic spinning may not be helpful for those workloads that cause it.
> 
> Therefore, any writers that have observed the setting of the writer
> nonspinnable bit for a given rwsem after they fail to acquire the lock
> via optimistic spinning will set the reader nonspinnable bit once they
> acquire the write lock. Similarly, readers that observe the setting
> of reader nonspinnable bit at slowpath entry will also set the reader
> nonspinnable bit when they acquire the read lock via the wakeup path.

So both cases set the _reader_ nonspinnable bit?


* Re: [PATCH v8 07/19] locking/rwsem: Implement lock handoff to prevent lock starvation
  2019-06-04  3:26     ` Yuyang Du
@ 2019-06-04  9:12       ` Boqun Feng
  2019-06-04 16:00         ` Waiman Long
  2019-06-04 13:21       ` Waiman Long
  1 sibling, 1 reply; 83+ messages in thread
From: Boqun Feng @ 2019-06-04  9:12 UTC (permalink / raw)
  To: Yuyang Du
  Cc: Waiman Long, Peter Zijlstra, Ingo Molnar, Will Deacon,
	Thomas Gleixner, Borislav Petkov, H. Peter Anvin, LKML, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying


On Tue, Jun 04, 2019 at 11:26:30AM +0800, Yuyang Du wrote:
> On Tue, 4 Jun 2019 at 11:03, Yuyang Du <duyuyang@gmail.com> wrote:
> >
> > Hi Waiman,
> >
> > On Tue, 21 May 2019 at 05:01, Waiman Long <longman@redhat.com> wrote:
> > >
> > > Because of writer lock stealing, it is possible that a constant
> > > stream of incoming writers will cause a waiting writer or reader to
> > > wait indefinitely leading to lock starvation.
> > >
> > > This patch implements a lock handoff mechanism to disable lock stealing
> > > and force lock handoff to the first waiter or waiters (for readers)
> > > in the queue after at least a 4ms waiting period unless it is a RT
> > > writer task which doesn't need to wait. The waiting period is used to
> > > avoid discouraging lock stealing too much to affect performance.
> >
> > I was working on a patchset to solve read-write lock deadlock
> > detection problem (https://lkml.org/lkml/2019/5/16/93).
> >
> > One of the mistakes in that work is that I considered the following
> > case as deadlock:
> 
> Sorry everyone, but let me rephrase:
> 
> One of the mistakes in that work is that I considered the following
> case as no deadlock:
> 
> >
> >   T1            T2
> >   --            --
> >
> >   down_read1    down_write2
> >
> >   down_write2   down_read1
> >

Not sure I understand the whole context here, but doesn't adding a third
independent task make this a deadlock?

	 T1            T2		T3
	 --            --		--

	 down_read1    down_write2
	 				down_write1
	 down_write2   down_read1

from the perspective of lockdep, we cannot be sure whether there will
be a T3 or not.

In case I misunderstood you, maybe your point is whether, in the above
case, "down_read1" on T2 is *guaranteed* to be able to steal (in the
sense of breaking the fairness) the read lock after Waiman's modification?
If so, I will wait for Waiman's response ;-)

Regards,
Boqun

> > So I was trying to understand what really went wrong, and the problem
> > I found is that, if I understand correctly, the current rwsem design
> > doesn't provide real fairness but rather priority in favor of write
> > locks, and thus one of the bad effects is that read locks can be
> > starved if write locks keep coming.
> >
> > Luckily, I noticed you are revamping rwsem and seem to have thought
> > about it already. I am not crystal clear about your work's ramifications
> > for the above case, so I hope that you can shed some light and perhaps
> > share your thoughts on this.
> >
> > Thanks,
> > Yuyang



* Re: [PATCH v8 15/19] locking/rwsem: Adaptive disabling of reader optimistic spinning
  2019-05-20 20:59 ` [PATCH v8 15/19] locking/rwsem: Adaptive disabling of reader optimistic spinning Waiman Long
  2019-06-04  9:10   ` Peter Zijlstra
@ 2019-06-04  9:14   ` Peter Zijlstra
  2019-06-04 17:29     ` Waiman Long
  2019-06-04  9:20   ` Peter Zijlstra
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 83+ messages in thread
From: Peter Zijlstra @ 2019-06-04  9:14 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, Borislav Petkov,
	H. Peter Anvin, linux-kernel, x86, Davidlohr Bueso,
	Linus Torvalds, Tim Chen, huang ying

On Mon, May 20, 2019 at 04:59:14PM -0400, Waiman Long wrote:
> On a 2-socket 40-core 80-thread Skylake system, the page_fault1 test of
> the will-it-scale benchmark was run with various number of threads. The
> number of operations done before reader optimistic spinning patches,
> this patch and after this patch were:
> 
>   Threads  Before rspin  Before patch  After patch    %change
>   -------  ------------  ------------  -----------    -------
>     20        5541068      5345484       5455667    -3.5%/ +2.1%
>     40       10185150      7292313       9219276   -28.5%/+26.4%
>     60        8196733      6460517       7181209   -21.2%/+11.2%
>     80        9508864      6739559       8107025   -29.1%/+20.3%

'rspin' is patch 12 in this series, right?


* Re: [PATCH v8 15/19] locking/rwsem: Adaptive disabling of reader optimistic spinning
  2019-05-20 20:59 ` [PATCH v8 15/19] locking/rwsem: Adaptive disabling of reader optimistic spinning Waiman Long
  2019-06-04  9:10   ` Peter Zijlstra
  2019-06-04  9:14   ` Peter Zijlstra
@ 2019-06-04  9:20   ` Peter Zijlstra
  2019-06-04 17:30     ` Waiman Long
  2019-06-04 10:58   ` Peter Zijlstra
  2019-06-17 14:33   ` [tip:locking/core] " tip-bot for Waiman Long
  4 siblings, 1 reply; 83+ messages in thread
From: Peter Zijlstra @ 2019-06-04  9:20 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, Borislav Petkov,
	H. Peter Anvin, linux-kernel, x86, Davidlohr Bueso,
	Linus Torvalds, Tim Chen, huang ying

On Mon, May 20, 2019 at 04:59:14PM -0400, Waiman Long wrote:
> @@ -286,6 +325,7 @@ struct rwsem_waiter {
>  	struct task_struct *task;
>  	enum rwsem_waiter_type type;
>  	unsigned long timeout;
> +	long last_rowner;
>  };
>  #define rwsem_first_waiter(sem) \
>  	list_first_entry(&sem->wait_list, struct rwsem_waiter, list)

> +static inline bool rwsem_reader_phase_trylock(struct rw_semaphore *sem,
> +					      long last_rowner)

> +static inline bool rwsem_reader_phase_trylock(struct rw_semaphore *sem,
> +					      unsigned long last_rowner)

> +	waiter.last_rowner = atomic_long_read(&sem->owner);

That's somewhat inconsistent wrt the type. I'll make it unsigned long,
as that is what makes most sense, given there's a pointer inside.


* Re: [PATCH v8 17/19] locking/rwsem: Merge owner into count on x86-64
  2019-05-20 20:59 ` [PATCH v8 17/19] locking/rwsem: Merge owner into count on x86-64 Waiman Long
@ 2019-06-04  9:45   ` Peter Zijlstra
  2019-06-04 15:47     ` Waiman Long
  0 siblings, 1 reply; 83+ messages in thread
From: Peter Zijlstra @ 2019-06-04  9:45 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, Borislav Petkov,
	H. Peter Anvin, linux-kernel, x86, Davidlohr Bueso,
	Linus Torvalds, Tim Chen, huang ying

On Mon, May 20, 2019 at 04:59:16PM -0400, Waiman Long wrote:
> With separate count and owner, there are timing windows where the two
> values are inconsistent. That can cause problem when trying to figure
> out the exact state of the rwsem. For instance, a RT task will stop
> optimistic spinning if the lock is acquired by a writer but the owner
> field isn't set yet. That can be solved by combining the count and
> owner together in a single atomic value.

I just realized we can use cmpxchg_double() here (where available of
course).



* Re: [PATCH v8 15/19] locking/rwsem: Adaptive disabling of reader optimistic spinning
  2019-05-20 20:59 ` [PATCH v8 15/19] locking/rwsem: Adaptive disabling of reader optimistic spinning Waiman Long
                     ` (2 preceding siblings ...)
  2019-06-04  9:20   ` Peter Zijlstra
@ 2019-06-04 10:58   ` Peter Zijlstra
  2019-06-17 14:33   ` [tip:locking/core] " tip-bot for Waiman Long
  4 siblings, 0 replies; 83+ messages in thread
From: Peter Zijlstra @ 2019-06-04 10:58 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, Borislav Petkov,
	H. Peter Anvin, linux-kernel, x86, Davidlohr Bueso,
	Linus Torvalds, Tim Chen, huang ying

On Mon, May 20, 2019 at 04:59:14PM -0400, Waiman Long wrote:
> +static inline bool rwsem_reader_phase_trylock(struct rw_semaphore *sem,
> +					      long last_rowner)
> +{
> +	long owner = atomic_long_read(&sem->owner);
> +
> +	if (!(owner & RWSEM_READER_OWNED))
> +		return false;
> +
> +	owner	    &= ~RWSEM_OWNER_FLAGS_MASK;
> +	last_rowner &= ~RWSEM_OWNER_FLAGS_MASK;
> +	if ((owner != last_rowner) && rwsem_try_read_lock_unqueued(sem)) {

just because I'm struggling with sleep deprivation and the big picture
isn't making sense... you can write that like (a quick equivalence check
follows the quoted hunk):

	((owner ^ last_rowner) & ~RWSEM_OWNER_FLAGS_MASK)

> +		lockevent_inc(rwsem_opt_rlock2);
> +		lockevent_add(rwsem_opt_fail, -1);
> +		return true;
> +	}
> +	return false;
> +}
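A quick standalone check (plain userspace C, with an assumed value for
RWSEM_OWNER_FLAGS_MASK) that the masked-compare form and the
xor-then-mask form always agree:

	/* Equivalence check only; the mask value below is assumed. */
	#include <assert.h>
	#include <stdio.h>

	#define RWSEM_OWNER_FLAGS_MASK	0x7UL

	int main(void)
	{
		unsigned long owner = 0x12345601UL;	/* fake task ptr | flag bits */
		unsigned long last  = 0x12345604UL;	/* same task, other flag bits */

		int masked = (owner & ~RWSEM_OWNER_FLAGS_MASK) !=
			     (last  & ~RWSEM_OWNER_FLAGS_MASK);
		int xored  = !!((owner ^ last) & ~RWSEM_OWNER_FLAGS_MASK);

		assert(masked == xored);	/* holds for any two values */
		printf("masked=%d xored=%d\n", masked, xored);
		return 0;
	}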


* Re: [PATCH v8 07/19] locking/rwsem: Implement lock handoff to prevent lock starvation
  2019-06-04  3:26     ` Yuyang Du
  2019-06-04  9:12       ` Boqun Feng
@ 2019-06-04 13:21       ` Waiman Long
  1 sibling, 0 replies; 83+ messages in thread
From: Waiman Long @ 2019-06-04 13:21 UTC (permalink / raw)
  To: Yuyang Du
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin, LKML, x86, Davidlohr Bueso,
	Linus Torvalds, Tim Chen, huang ying, Boqun Feng

On 6/3/19 11:26 PM, Yuyang Du wrote:
> On Tue, 4 Jun 2019 at 11:03, Yuyang Du <duyuyang@gmail.com> wrote:
>> Hi Waiman,
>>
>> On Tue, 21 May 2019 at 05:01, Waiman Long <longman@redhat.com> wrote:
>>> Because of writer lock stealing, it is possible that a constant
>>> stream of incoming writers will cause a waiting writer or reader to
>>> wait indefinitely, leading to lock starvation.
>>>
>>> This patch implements a lock handoff mechanism to disable lock stealing
>>> and force lock handoff to the first waiter or waiters (for readers)
>>> in the queue after at least a 4ms waiting period, unless it is an RT
>>> writer task, which doesn't need to wait. The waiting period is used to
>>> avoid discouraging lock stealing so much that it affects performance.
>> I was working on a patchset to solve read-write lock deadlock
>> detection problem (https://lkml.org/lkml/2019/5/16/93).
>>
>> One of the mistakes in that work is that I considered the following
>> case as deadlock:
> Sorry everyone, but let me rephrase:
>
> One of the mistakes in that work is that I considered the following
> case as no deadlock:
>
>>   T1            T2
>>   --            --
>>
>>   down_read1    down_write2
>>
>>   down_write2   down_read1

Yes, that combination shouldn't cause a deadlock. However, the lockdep
code isn't able to recognize this case, so you may still see a splat
about a possible deadlock scenario when lockdep checking is enabled. So
the general advice is still to try to rearrange the lock ordering, if
possible.
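For reference, a minimal sketch of the ordering being discussed (the
semaphore and function names here are made up, not taken from the
patchset):

	#include <linux/rwsem.h>

	static DECLARE_RWSEM(sem1);
	static DECLARE_RWSEM(sem2);

	static void t1_path(void)		/* T1: read 1, then write 2 */
	{
		down_read(&sem1);
		down_write(&sem2);
		up_write(&sem2);
		up_read(&sem1);
	}

	static void t2_path(void)		/* T2: write 2, then read 1 */
	{
		down_write(&sem2);
		down_read(&sem1);
		up_read(&sem1);
		up_write(&sem2);
	}

lockdep will flag this pair as a possible deadlock, but with only these
two tasks it resolves as long as T2's down_read(&sem1) is allowed in
while T1 holds the read lock; it only becomes a real deadlock once a
third task queues a down_write(&sem1), as in Boqun's example.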

>> So I was trying to understand what really went wrong, and the problem
>> I found is that, if I understand correctly, the current rwsem design
>> doesn't provide real fairness but rather priority in favor of write
>> locks, and thus one of the bad effects is that read locks can be
>> starved if write locks keep coming.
>>
>> Luckily, I noticed you are revamping rwsem and seem to have thought
>> about it already. I am not crystal clear about your work's ramifications
>> for the above case, so I hope that you can shed some light and perhaps
>> share your thoughts on this.

Lock starvation is certainly possible with the current rwsem code. Why
don't you try applying the patch to see if it remedies your problem?

Cheers,
Longman



* Re: [PATCH v8 13/19] locking/rwsem: Make rwsem->owner an atomic_long_t
  2019-06-04  8:52   ` Peter Zijlstra
@ 2019-06-04 15:44     ` Waiman Long
  0 siblings, 0 replies; 83+ messages in thread
From: Waiman Long @ 2019-06-04 15:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, Borislav Petkov,
	H. Peter Anvin, linux-kernel, x86, Davidlohr Bueso,
	Linus Torvalds, Tim Chen, huang ying

On 6/4/19 4:52 AM, Peter Zijlstra wrote:
> On Mon, May 20, 2019 at 04:59:12PM -0400, Waiman Long wrote:
>> +static inline struct task_struct *rwsem_read_owner(struct rw_semaphore *sem)
>> +{
>> +	return (struct task_struct *)(atomic_long_read(&sem->owner) &
>> +					~RWSEM_OWNER_FLAGS_MASK);
>> +}
>> +
>> +/*
>> + * Return the real task structure pointer of the owner and the embedded
>> + * flags in the owner. pflags must be non-NULL.
>> + */
>> +static inline struct task_struct *
>> +rwsem_read_owner_flags(struct rw_semaphore *sem, long *pflags)
>> +{
>> +	long owner = atomic_long_read(&sem->owner);
>> +
>> +	*pflags = owner & RWSEM_OWNER_FLAGS_MASK;
>> +	return (struct task_struct *)(owner & ~RWSEM_OWNER_FLAGS_MASK);
>> +}
> I got confused by the 'read' part in those names; I initially thought
> they paired with rwsem_set_reader_owned().

Sorry for the confusion. I initially used "get", but I was afraid that it
might get confused with "get/put" for reference counting, so I used "read"
instead.


> So I've done 's/rwsem_read_owner/rwsem_owner/g' on it.

I am fine with getting rid of the "read" from the function names.

Cheers,
Longman



* Re: [PATCH v8 17/19] locking/rwsem: Merge owner into count on x86-64
  2019-06-04  9:45   ` Peter Zijlstra
@ 2019-06-04 15:47     ` Waiman Long
  2019-06-04 17:02       ` Peter Zijlstra
  0 siblings, 1 reply; 83+ messages in thread
From: Waiman Long @ 2019-06-04 15:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, Borislav Petkov,
	H. Peter Anvin, linux-kernel, x86, Davidlohr Bueso,
	Linus Torvalds, Tim Chen, huang ying

On 6/4/19 5:45 AM, Peter Zijlstra wrote:
> On Mon, May 20, 2019 at 04:59:16PM -0400, Waiman Long wrote:
>> With separate count and owner, there are timing windows where the two
>> values are inconsistent. That can cause problem when trying to figure
>> out the exact state of the rwsem. For instance, a RT task will stop
>> optimistic spinning if the lock is acquired by a writer but the owner
>> field isn't set yet. That can be solved by combining the count and
>> owner together in a single atomic value.
> I just realized we can use cmpxchg_double() here (where available of
> course).

Do the two values need to be 128-bit aligned to use cmpxchg_double()? I
don't think we can guarantee that unless we explicitly set this alignment.

Cheers,
Longman



* Re: [PATCH v8 07/19] locking/rwsem: Implement lock handoff to prevent lock starvation
  2019-06-04  9:12       ` Boqun Feng
@ 2019-06-04 16:00         ` Waiman Long
  2019-06-05  7:48           ` Yuyang Du
  0 siblings, 1 reply; 83+ messages in thread
From: Waiman Long @ 2019-06-04 16:00 UTC (permalink / raw)
  To: Boqun Feng, Yuyang Du
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin, LKML, x86, Davidlohr Bueso,
	Linus Torvalds, Tim Chen, huang ying

On 6/4/19 5:12 AM, Boqun Feng wrote:
> On Tue, Jun 04, 2019 at 11:26:30AM +0800, Yuyang Du wrote:
>> On Tue, 4 Jun 2019 at 11:03, Yuyang Du <duyuyang@gmail.com> wrote:
>>> Hi Waiman,
>>>
>>> On Tue, 21 May 2019 at 05:01, Waiman Long <longman@redhat.com> wrote:
>>>> Because of writer lock stealing, it is possible that a constant
>>>> stream of incoming writers will cause a waiting writer or reader to
>>>> wait indefinitely, leading to lock starvation.
>>>>
>>>> This patch implements a lock handoff mechanism to disable lock stealing
>>>> and force lock handoff to the first waiter or waiters (for readers)
>>>> in the queue after at least a 4ms waiting period, unless it is an RT
>>>> writer task, which doesn't need to wait. The waiting period is used to
>>>> avoid discouraging lock stealing so much that it affects performance.
>>> I was working on a patchset to solve read-write lock deadlock
>>> detection problem (https://lkml.org/lkml/2019/5/16/93).
>>>
>>> One of the mistakes in that work is that I considered the following
>>> case as deadlock:
>> Sorry everyone, but let me rephrase:
>>
>> One of the mistakes in that work is that I considered the following
>> case as no deadlock:
>>
>>>   T1            T2
>>>   --            --
>>>
>>>   down_read1    down_write2
>>>
>>>   down_write2   down_read1
>>>
> Not sure I understand the whole context here, but doesn't adding a third
> independent task make this a deadlock?
>
> 	 T1            T2		T3
> 	 --            --		--
>
> 	 down_read1    down_write2
> 	 				down_write1
> 	 down_write2   down_read1
>
> from the perspective of lockdep, we cannot be sure whether there will
> be a T3 or not.

Yes, that will be a deadlock even with my rwsem patch applied, as it
will still try to preserve the reader-writer ordering. So it will
certainly be safer to have the same lock ordering for both tasks.

>
> In case I misunderstood you: maybe your point is whether, in the above
> case, "down_read1" on T2 is *guaranteed* to be able to steal (in the
> sense of breaking fairness) the read lock after Waiman's modification?
> If so, I will wait for Waiman's response ;-)

With my patchset applied, the reader-writer ordering is still supposed
to be preserved. Of course, there can be exceptions depending on the
exact timing, but we can't rely on that to prevent deadlock.

Cheers,
Longman



* Re: [PATCH v8 14/19] locking/rwsem: Enable time-based spinning on reader-owned rwsem
  2019-06-04  9:03   ` Peter Zijlstra
@ 2019-06-04 16:54     ` Waiman Long
  0 siblings, 0 replies; 83+ messages in thread
From: Waiman Long @ 2019-06-04 16:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, Borislav Petkov,
	H. Peter Anvin, linux-kernel, x86, Davidlohr Bueso,
	Linus Torvalds, Tim Chen, huang ying

On 6/4/19 5:03 AM, Peter Zijlstra wrote:
> On Mon, May 20, 2019 at 04:59:13PM -0400, Waiman Long wrote:
>> +static inline void rwsem_set_nonspinnable(struct rw_semaphore *sem)
>> +{
>> +	long owner = atomic_long_read(&sem->owner);
>> +
>> +	while (owner & RWSEM_READER_OWNED) {
>> +		if (owner & RWSEM_NONSPINNABLE)
>> +			break;
>> +		owner = atomic_long_cmpxchg(&sem->owner, owner,
>> +					    owner | RWSEM_NONSPINNABLE);
>> +	}
>> +}
> --- a/kernel/locking/rwsem.c
> +++ b/kernel/locking/rwsem.c
> @@ -206,12 +206,13 @@ static inline void rwsem_set_nonspinnabl
>  {
>  	long owner = atomic_long_read(&sem->owner);
>  
> -	while (owner & RWSEM_READER_OWNED) {
> +	do {
> +		if (!(owner & RWSEM_READER_OWNED))
> +			break;
>  		if (owner & RWSEM_NONSPINNABLE)
>  			break;
> -		owner = atomic_long_cmpxchg(&sem->owner, owner,
> -					    owner | RWSEM_NONSPINNABLE);
> -	}
> +	} while (!atomic_long_try_cmpxchg(&sem->owner, &owner,
> +					  owner | RWSEM_NONSPINNABLE));
>  }
>  
>  /*

Sure.

Thanks,
Longman



* Re: [PATCH v8 17/19] locking/rwsem: Merge owner into count on x86-64
  2019-06-04 15:47     ` Waiman Long
@ 2019-06-04 17:02       ` Peter Zijlstra
  2019-06-04 17:06         ` Waiman Long
  0 siblings, 1 reply; 83+ messages in thread
From: Peter Zijlstra @ 2019-06-04 17:02 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, Borislav Petkov,
	H. Peter Anvin, linux-kernel, x86, Davidlohr Bueso,
	Linus Torvalds, Tim Chen, huang ying

On Tue, Jun 04, 2019 at 11:47:21AM -0400, Waiman Long wrote:
> On 6/4/19 5:45 AM, Peter Zijlstra wrote:
> > On Mon, May 20, 2019 at 04:59:16PM -0400, Waiman Long wrote:
> >> With separate count and owner, there are timing windows where the two
> >> values are inconsistent. That can cause problem when trying to figure
> >> out the exact state of the rwsem. For instance, a RT task will stop
> >> optimistic spinning if the lock is acquired by a writer but the owner
> >> field isn't set yet. That can be solved by combining the count and
> >> owner together in a single atomic value.
> > I just realized we can use cmpxchg_double() here (where available of
> > course).
> 
> Do the two values need to be 128-bit aligned to use cmpxchg_double()? I
> don't think we can guarantee that unless we explicitly set this alignment.

It does :/ and yes, we'd need to play games with __align(2*sizeof(long))
and such.


* Re: [PATCH v8 17/19] locking/rwsem: Merge owner into count on x86-64
  2019-06-04 17:02       ` Peter Zijlstra
@ 2019-06-04 17:06         ` Waiman Long
  2019-06-04 17:18           ` Peter Zijlstra
  0 siblings, 1 reply; 83+ messages in thread
From: Waiman Long @ 2019-06-04 17:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, Borislav Petkov,
	H. Peter Anvin, linux-kernel, x86, Davidlohr Bueso,
	Linus Torvalds, Tim Chen, huang ying

On 6/4/19 1:02 PM, Peter Zijlstra wrote:
> On Tue, Jun 04, 2019 at 11:47:21AM -0400, Waiman Long wrote:
>> On 6/4/19 5:45 AM, Peter Zijlstra wrote:
>>> On Mon, May 20, 2019 at 04:59:16PM -0400, Waiman Long wrote:
>>>> With separate count and owner, there are timing windows where the two
>>>> values are inconsistent. That can cause problem when trying to figure
>>>> out the exact state of the rwsem. For instance, a RT task will stop
>>>> optimistic spinning if the lock is acquired by a writer but the owner
>>>> field isn't set yet. That can be solved by combining the count and
>>>> owner together in a single atomic value.
>>> I just realized we can use cmpxchg_double() here (where available of
>>> course).
>> Do the two values need to be 128-bit aligned to use cmpxchg_double()? I
>> don't think we can guarantee that unless we explicitly set this alignment.
> It does :/ and yes, we'd need to play games with __align(2*sizeof(long))
> and such.

So do you want this as an option now as it will be x86 specific? Or we
can do that as a follow-up if we want to.

Cheers,
Longman



* Re: [PATCH v8 17/19] locking/rwsem: Merge owner into count on x86-64
  2019-06-04 17:06         ` Waiman Long
@ 2019-06-04 17:18           ` Peter Zijlstra
  0 siblings, 0 replies; 83+ messages in thread
From: Peter Zijlstra @ 2019-06-04 17:18 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, Borislav Petkov,
	H. Peter Anvin, linux-kernel, x86, Davidlohr Bueso,
	Linus Torvalds, Tim Chen, huang ying

On Tue, Jun 04, 2019 at 01:06:11PM -0400, Waiman Long wrote:
> On 6/4/19 1:02 PM, Peter Zijlstra wrote:
> > On Tue, Jun 04, 2019 at 11:47:21AM -0400, Waiman Long wrote:
> >> On 6/4/19 5:45 AM, Peter Zijlstra wrote:
> >>> On Mon, May 20, 2019 at 04:59:16PM -0400, Waiman Long wrote:
> >>>> With separate count and owner, there are timing windows where the two
> >>>> values are inconsistent. That can cause problem when trying to figure
> >>>> out the exact state of the rwsem. For instance, a RT task will stop
> >>>> optimistic spinning if the lock is acquired by a writer but the owner
> >>>> field isn't set yet. That can be solved by combining the count and
> >>>> owner together in a single atomic value.
> >>> I just realized we can use cmpxchg_double() here (where available of
> >>> course).
> >> Do the two values need to be 128-bit aligned to use cmpxchg_double()? I
> >> don't think we can guarantee that unless we explicitly set this alignment.
> > It does :/ and yes, we'd need to play games with __align(2*sizeof(long))
> > and such.
> 
> So do you want this as an option now as it will be x86 specific? Or we
> can do that as a follow-up if we want to.

x86, s390 and arm64 have cmpxchg_double().

I was going to have a look (but like I wrote, I'm pretty useless today
so I didn't actually get anywhere) at the exact race that's a problem
here and see if there's not another solution too.



* Re: [PATCH v8 15/19] locking/rwsem: Adaptive disabling of reader optimistic spinning
  2019-06-04  9:10   ` Peter Zijlstra
@ 2019-06-04 17:28     ` Waiman Long
  0 siblings, 0 replies; 83+ messages in thread
From: Waiman Long @ 2019-06-04 17:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, Borislav Petkov,
	H. Peter Anvin, linux-kernel, x86, Davidlohr Bueso,
	Linus Torvalds, Tim Chen, huang ying

On 6/4/19 5:10 AM, Peter Zijlstra wrote:
> On Mon, May 20, 2019 at 04:59:14PM -0400, Waiman Long wrote:
>> Reader optimistic spinning is helpful when the reader critical section
>> is short and there aren't that many readers around. It makes readers
>> relatively preferred over writers. When a writer times out spinning
>> on a reader-owned lock and sets the nonspinnable bits, there are two main
>> reasons for that.
>>
>>  1) The reader critical section is long, perhaps the task sleeps after
>>     acquiring the read lock.
>>  2) There are just too many readers contending the lock causing it to
>>     take a while to service all of them.
>>
>> In the former case, a long reader critical section will impede the progress
>> of writers, which is usually more important for system performance.
>> In the latter case, reader optimistic spinning tends to make the reader
>> groups (readers that acquire the lock together) smaller, leading to more
>> of them. That may hurt performance in some cases. In other words, the
>> setting of the nonspinnable bits indicates that reader optimistic spinning
>> may not be helpful for the workloads that cause it.
>>
>> Therefore, any writers that have observed the setting of the writer
>> nonspinnable bit for a given rwsem after they fail to acquire the lock
>> via optimistic spinning will set the reader nonspinnable bit once they
>> acquire the write lock. Similarly, readers that observe the setting
>> of reader nonspinnable bit at slowpath entry will also set the reader
>> nonspinnable bit when they acquire the read lock via the wakeup path.
> So both cases set the _reader_ nonspinnable bit?

Yes.

-Longman



* Re: [PATCH v8 15/19] locking/rwsem: Adaptive disabling of reader optimistic spinning
  2019-06-04  9:14   ` Peter Zijlstra
@ 2019-06-04 17:29     ` Waiman Long
  0 siblings, 0 replies; 83+ messages in thread
From: Waiman Long @ 2019-06-04 17:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, Borislav Petkov,
	H. Peter Anvin, linux-kernel, x86, Davidlohr Bueso,
	Linus Torvalds, Tim Chen, huang ying

On 6/4/19 5:14 AM, Peter Zijlstra wrote:
> On Mon, May 20, 2019 at 04:59:14PM -0400, Waiman Long wrote:
>> On a 2-socket 40-core 80-thread Skylake system, the page_fault1 test of
>> the will-it-scale benchmark was run with various number of threads. The
>> number of operations done before reader optimistic spinning patches,
>> this patch and after this patch were:
>>
>>   Threads  Before rspin  Before patch  After patch    %change
>>   -------  ------------  ------------  -----------    -------
>>     20        5541068      5345484       5455667    -3.5%/ +2.1%
>>     40       10185150      7292313       9219276   -28.5%/+26.4%
>>     60        8196733      6460517       7181209   -21.2%/+11.2%
>>     80        9508864      6739559       8107025   -29.1%/+20.3%
> 'rspin' is patch 12 in this series, right?

Yes, I should have spelled out the patch name.

-Longman



* Re: [PATCH v8 15/19] locking/rwsem: Adaptive disabling of reader optimistic spinning
  2019-06-04  9:20   ` Peter Zijlstra
@ 2019-06-04 17:30     ` Waiman Long
  2019-06-04 17:38       ` Peter Zijlstra
  0 siblings, 1 reply; 83+ messages in thread
From: Waiman Long @ 2019-06-04 17:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, Borislav Petkov,
	H. Peter Anvin, linux-kernel, x86, Davidlohr Bueso,
	Linus Torvalds, Tim Chen, huang ying

On 6/4/19 5:20 AM, Peter Zijlstra wrote:
> On Mon, May 20, 2019 at 04:59:14PM -0400, Waiman Long wrote:
>> @@ -286,6 +325,7 @@ struct rwsem_waiter {
>>  	struct task_struct *task;
>>  	enum rwsem_waiter_type type;
>>  	unsigned long timeout;
>> +	long last_rowner;
>>  };
>>  #define rwsem_first_waiter(sem) \
>>  	list_first_entry(&sem->wait_list, struct rwsem_waiter, list)
>> +static inline bool rwsem_reader_phase_trylock(struct rw_semaphore *sem,
>> +					      long last_rowner)
>> +static inline bool rwsem_reader_phase_trylock(struct rw_semaphore *sem,
>> +					      unsigned long last_rowner)
>> +	waiter.last_rowner = atomic_long_read(&sem->owner);
> That's somewhat inconsistent wrt the type. I'll make it unsigned long,
> as that is what makes most sense, given there's a pointer inside.

Thanks for spotting that, I will fix it.

-Longman



* Re: [PATCH v8 15/19] locking/rwsem: Adaptive disabling of reader optimistic spinning
  2019-06-04 17:30     ` Waiman Long
@ 2019-06-04 17:38       ` Peter Zijlstra
  2019-06-04 18:04         ` Waiman Long
  0 siblings, 1 reply; 83+ messages in thread
From: Peter Zijlstra @ 2019-06-04 17:38 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, Borislav Petkov,
	H. Peter Anvin, linux-kernel, x86, Davidlohr Bueso,
	Linus Torvalds, Tim Chen, huang ying

On Tue, Jun 04, 2019 at 01:30:00PM -0400, Waiman Long wrote:
> > That's somewhat inconsistent wrt the type. I'll make it unsigned long,
> > as that is what makes most sense, given there's a pointer inside.
> 
> Thanks for spotting that, I will fix it.

I fixed a whole bunch of them; please find the modified patches here:

  https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=locking/core


* Re: [PATCH v8 15/19] locking/rwsem: Adaptive disabling of reader optimistic spinning
  2019-06-04 17:38       ` Peter Zijlstra
@ 2019-06-04 18:04         ` Waiman Long
  2019-06-04 18:14           ` Peter Zijlstra
  0 siblings, 1 reply; 83+ messages in thread
From: Waiman Long @ 2019-06-04 18:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, Borislav Petkov,
	H. Peter Anvin, linux-kernel, x86, Davidlohr Bueso,
	Linus Torvalds, Tim Chen, huang ying

On 6/4/19 1:38 PM, Peter Zijlstra wrote:
> On Tue, Jun 04, 2019 at 01:30:00PM -0400, Waiman Long wrote:
>>> That's somewhat inconsistent wrt the type. I'll make it unsigned long,
>>> as that is what makes most sense, given there's a pointer inside.
>> Thanks for spotting that, I will fix it.
> I fixed a whole bunch of them; please find the modified patches here:
>
>   https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=locking/core

Thanks for reviewing the patches.

So what do you think about the overall state of this patchset? Do you
think it is mature enough to go into 5.3?

Or if you want more time to think about solving the RT thread issue, we
can merge just patches 1-16 and play with the last three for some more
time. I am fine with that too, as improving RT tasks is not my main
focus. I like patch 16 as it led me to discover the rwsem reader wakeup
bug when I hit the negative dentry count WARN_ON message in my testing.

I worked on this owner merging patch mainly to alleviate the need to use
cmpxchg for reader lock. cmpxchg_double() is certainly one possible
solution though it won't work on older CPUs. We can have a config option
to use cmpxchg_double as it may increase the size of other structures
that embedded rwsem and impose additional alignment constraint.

Cheers,
Longman




* Re: [PATCH v8 15/19] locking/rwsem: Adaptive disabling of reader optimistic spinning
  2019-06-04 18:04         ` Waiman Long
@ 2019-06-04 18:14           ` Peter Zijlstra
  2019-06-04 18:21             ` Waiman Long
  0 siblings, 1 reply; 83+ messages in thread
From: Peter Zijlstra @ 2019-06-04 18:14 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, Borislav Petkov,
	H. Peter Anvin, linux-kernel, x86, Davidlohr Bueso,
	Linus Torvalds, Tim Chen, huang ying

On Tue, Jun 04, 2019 at 02:04:34PM -0400, Waiman Long wrote:
> On 6/4/19 1:38 PM, Peter Zijlstra wrote:
> > On Tue, Jun 04, 2019 at 01:30:00PM -0400, Waiman Long wrote:
> >>> That's somewhat inconsistent wrt the type. I'll make it unsigned long,
> >>> as that is what makes most sense, given there's a pointer inside.
> >> Thanks for spotting that, I will fix it.
> > I fixed a whole bunch of them; please find the modified patches here:
> >
> >   https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=locking/core
> 
> Thanks for reviewing the patches.
> 
> So what do you think about the overall state of this patchset? Do you
> think it is mature enough to go into 5.3?

So far so good :-)

> Or if you want more time to think about solving the RT thread issue, we
> can merge just patches 1-16 and play with the last three for some more
> time. I am fine with that too, as improving RT tasks is not my main
> focus. I like patch 16 as it led me to discover the rwsem reader wakeup
> bug when I hit the negative dentry count WARN_ON message in my testing.

My brain gave out around patch 14.. I'll try again tomorrow. But I'm
thinking we should be able to do as you suggest and get -16 merged.

> I worked on this owner merging patch mainly to alleviate the need to use
> cmpxchg for reader lock. cmpxchg_double() is certainly one possible
> solution though it won't work on older CPUs. We can have a config option
> to use cmpxchg_double as it may increase the size of other structures
> that embedded rwsem and impose additional alignment constraint.

cmpxchg8b was introduced with the Pentium (for PAE IIRC, it enabled
atomic 64bit PTEs, but Linux never used it for that) and every Intel/AMD
thereafter has had it. AFAIK there's no x86_64 chip without cmpxchg16b.


* Re: [PATCH v8 15/19] locking/rwsem: Adaptive disabling of reader optimistic spinning
  2019-06-04 18:14           ` Peter Zijlstra
@ 2019-06-04 18:21             ` Waiman Long
  2019-06-05 18:13               ` Waiman Long
  0 siblings, 1 reply; 83+ messages in thread
From: Waiman Long @ 2019-06-04 18:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, Borislav Petkov,
	H. Peter Anvin, linux-kernel, x86, Davidlohr Bueso,
	Linus Torvalds, Tim Chen, huang ying

On 6/4/19 2:14 PM, Peter Zijlstra wrote:
>> I worked on this owner merging patch mainly to alleviate the need to use
>> cmpxchg for reader lock. cmpxchg_double() is certainly one possible
>> solution though it won't work on older CPUs. We can have a config option
>> to use cmpxchg_double as it may increase the size of other structures
>> that embedded rwsem and impose additional alignment constraint.
> cmpxchg8b was introduced with the Pentium (for PAE IIRC, it enabled
> atomic 64bit PTEs, but Linux never used it for that) and every Intel/AMD
> thereafter has had it. AFAIK there's no x86_64 chip without cmpxchg16b.

Thanks for the clarification. I actually didn't check when cmpxchg8b was
introduced. I know it is a bit slower than regular cmpxchg. So we may
still need to do some performance analysis to see how it compares with
my current approach.

Cheers,
Longman



* Re: [PATCH v8 07/19] locking/rwsem: Implement lock handoff to prevent lock starvation
  2019-06-04 16:00         ` Waiman Long
@ 2019-06-05  7:48           ` Yuyang Du
  0 siblings, 0 replies; 83+ messages in thread
From: Yuyang Du @ 2019-06-05  7:48 UTC (permalink / raw)
  To: Waiman Long
  Cc: Boqun Feng, Peter Zijlstra, Ingo Molnar, Will Deacon,
	Thomas Gleixner, Borislav Petkov, H. Peter Anvin, LKML, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

Hi Waiman,

On Wed, 5 Jun 2019 at 00:00, Waiman Long <longman@redhat.com> wrote:
> With my patchset applied, the reader-writer ordering is still supposed
> to be preserved. Of course, there can be exceptions depending on the
> exact timing, but we can't rely on that to prevent deadlock.

This is exactly what I want to know. Thanks for the reply.

Thanks,
Yuyang


* Re: [PATCH v8 15/19] locking/rwsem: Adaptive disabling of reader optimistic spinning
  2019-06-04 18:21             ` Waiman Long
@ 2019-06-05 18:13               ` Waiman Long
  2019-06-05 20:19                 ` Peter Zijlstra
  0 siblings, 1 reply; 83+ messages in thread
From: Waiman Long @ 2019-06-05 18:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, Borislav Petkov,
	H. Peter Anvin, linux-kernel, x86, Davidlohr Bueso,
	Linus Torvalds, Tim Chen, huang ying

On 6/4/19 2:21 PM, Waiman Long wrote:
> On 6/4/19 2:14 PM, Peter Zijlstra wrote:
>>> I worked on this owner merging patch mainly to alleviate the need to use
>>> cmpxchg for reader lock. cmpxchg_double() is certainly one possible
>>> solution though it won't work on older CPUs. We can have a config option
>>> to use cmpxchg_double as it may increase the size of other structures
>>> that embedded rwsem and impose additional alignment constraint.
>> cmpxchg8b was introduced with the Pentium (for PAE IIRC, it enabled
>> atomic 64bit PTEs, but Linux never used it for that) and every Intel/AMD
>> thereafter has had it. AFAIK there's no x86_64 chip without cmpxchg16b.
> Thanks for the clarification. I actually didn't check when cmpxchg8b was
> introduced. I know it is a bit slower than regular cmpxchg. So we may
> still need to do some performance analysis to see how it compares with
> my current approach.

Using cmpxchg_double is actually more risky than I thought. I have been
trying to use cmpxchg_double for down_write, but I kept getting
kernel panics because the rwsem wasn't 16b-aligned. As rwsem is embedded
in quite a large number of structures, they all have to align properly
to make that work or the kernel will panic. That does seem too risky to
me. So I am dropping the idea of trying to use it.

Cheers,
Longman



* Re: [PATCH v8 15/19] locking/rwsem: Adaptive disabling of reader optimistic spinning
  2019-06-05 18:13               ` Waiman Long
@ 2019-06-05 20:19                 ` Peter Zijlstra
  2019-06-05 20:52                   ` Linus Torvalds
  0 siblings, 1 reply; 83+ messages in thread
From: Peter Zijlstra @ 2019-06-05 20:19 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, Borislav Petkov,
	H. Peter Anvin, linux-kernel, x86, Davidlohr Bueso,
	Linus Torvalds, Tim Chen, huang ying

On Wed, Jun 05, 2019 at 02:13:27PM -0400, Waiman Long wrote:

> Using cmpxchg_double is actually more risky than I thought. I have been
> trying to use cmpxchg_double for down_write, but I kept getting
> kernel panics because the rwsem wasn't 16b-aligned. As rwsem is embedded
> in quite a large number of structures, they all have to align properly
> to make that work or the kernel will panic. That does seem too risky to
> me. So I am dropping the idea of trying to use it.

Urgh, that's another thing that's been on the TODO list for a long, long
time: write code to verify the alignment of allocations :/ I'm
suspecting quite a lot of that goes wrong all over the place.



* Re: [PATCH v8 15/19] locking/rwsem: Adaptive disabling of reader optimistic spinning
  2019-06-05 20:19                 ` Peter Zijlstra
@ 2019-06-05 20:52                   ` Linus Torvalds
  2019-06-06  8:03                     ` Peter Zijlstra
  0 siblings, 1 reply; 83+ messages in thread
From: Linus Torvalds @ 2019-06-05 20:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Waiman Long, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin, Linux List Kernel Mailing,
	the arch/x86 maintainers, Davidlohr Bueso, Tim Chen, huang ying

On Wed, Jun 5, 2019 at 1:19 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> Urgh, that's another thing that's been on the TODO list for a long, long
> time: write code to verify the alignment of allocations :/ I'm
> suspecting quite a lot of that goes wrong all over the place.

On x86, we only guarantee 8-byte alignment from things like kmalloc(), iirc.

That ends up actually being a useful thing for small allocations,
which do happen.

On the whole, I would suggest against cmpxchg2 unless it's something
_really_ special. And would definitely strongly suggest against it for
something like a rwsem. Requiring 16-byte alignment just because your
data structure has a lock is nasty. Of course, we could probably
fairly easily change our kmalloc alignment rules to be "still just 8
bytes for small allocations, 16 bytes for anything that is >=64 bytes"
or whatever.

At least nobody is hopefully crazy enough to put one of those things
on the stack, where we *definitely* don't want to increase alignment
issues.

And before people say "surely small allocations aren't normal" - take
a look at slaballoc. Small allocations (<= 32 bytes) are actually not
all that uncommon, and you want them dense in the cache and dense in
memory to not waste either. arm64 has some insane alignment issues
(128 byte alignment due to DMA coherency issues, iirc), and it hurts
them badly.

Right now my machine has 400k 8-byte allocations, if I read things right.

You also find some core slab caches that are small and that don't need
16-byte alignment. A quick script finds things like
ext4_extent_status, which is 40 bytes, not horribly uncommon (I've
apparently got 250k of those things on my system), and currently fits
102 entries per page *because* it's  not excessively aligned. Or
Acpi-Parse, which I apparently have 350k of, and is 56 bytes, and fits
73 per page exactly because it only needs 8-byte alignment (but
admittedly a 16-byte alignment would waste some memory, but guarantee
it doesn't cross a cacheline, so _maybe_ it would be ok).

16-byte alignment really isn't a good idea when you have data sizes
that are clearly smaller than even a cacheline.

So I *really* don't want to force excessive alignment. We'd have to
add some special static tooling to say "this kmalloc is assigned to a
pointer which requires 16-byte alignment" and make it use a separate
slab cache with that explicit alignment for that.
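For completeness, the "separate slab cache with explicit alignment"
variant might look roughly like this (structure and cache names are made
up, just to show the knob that already exists):

	#include <linux/errno.h>
	#include <linux/init.h>
	#include <linux/rwsem.h>
	#include <linux/slab.h>

	struct sem_container {			/* hypothetical rwsem user */
		struct rw_semaphore sem;
		unsigned long data;
	};

	static struct kmem_cache *sem_container_cache;

	static int __init sem_container_cache_init(void)
	{
		/* Third argument: minimum alignment for every object. */
		sem_container_cache = kmem_cache_create("sem_container",
						sizeof(struct sem_container),
						16, 0, NULL);
		return sem_container_cache ? 0 : -ENOMEM;
	}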

                Linus


* Re: [PATCH v8 15/19] locking/rwsem: Adaptive disabling of reader optimistic spinning
  2019-06-05 20:52                   ` Linus Torvalds
@ 2019-06-06  8:03                     ` Peter Zijlstra
  2019-06-06  8:11                       ` Peter Zijlstra
  0 siblings, 1 reply; 83+ messages in thread
From: Peter Zijlstra @ 2019-06-06  8:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Waiman Long, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin, Linux List Kernel Mailing,
	the arch/x86 maintainers, Davidlohr Bueso, Tim Chen, huang ying,
	dvyukov, glider, aryabinin

On Wed, Jun 05, 2019 at 01:52:15PM -0700, Linus Torvalds wrote:
> On Wed, Jun 5, 2019 at 1:19 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > Urgh, that's another thing that's been on the TODO list for a long, long
> > time: write code to verify the alignment of allocations :/ I'm
> > suspecting quite a lot of that goes wrong all over the place.
> 
> On x86, we only guarantee 8-byte alignment from things like kmalloc(), iirc.

Oh sure, and I'm not proposing to change that. I was more thinking of
having a GCC plugin that verifies, for every ptr assignment:

	ptr = foo;

that the actual alignment matches:

	assert(!((uintptr_t)ptr % __alignof(*ptr)));

That would catch bugs like:

struct bar {
	int ponies;
	int peaches ____cacheline_aligned_in_smp;
};

	struct bar *barp = kmalloc(sizeof(*barp), GFP_KERNEL);

Blatantly violating alignment can't be right; either the alignment
constraints put on the data structures are not important and they should
be fixed, or we should respect them and fix the allocation, either way,
we should not silently violate things like we do today.
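A runtime (rather than compile-time) flavour of the same check could be
wrapped around the allocation itself; the helper below is hypothetical
and only meant to show the shape of the assertion:

	#include <linux/kernel.h>
	#include <linux/slab.h>

	/* Hypothetical helper: kmalloc() an object of @type and warn once
	 * if the returned pointer does not satisfy the type's alignment.
	 */
	#define kmalloc_typed(type, flags)					\
	({									\
		type *__p = kmalloc(sizeof(type), (flags));			\
		WARN_ON_ONCE(__p && !IS_ALIGNED((unsigned long)__p,		\
						__alignof__(type)));		\
		__p;								\
	})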




* Re: [PATCH v8 15/19] locking/rwsem: Adaptive disabling of reader optimistic spinning
  2019-06-06  8:03                     ` Peter Zijlstra
@ 2019-06-06  8:11                       ` Peter Zijlstra
  0 siblings, 0 replies; 83+ messages in thread
From: Peter Zijlstra @ 2019-06-06  8:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Waiman Long, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin, Linux List Kernel Mailing,
	the arch/x86 maintainers, Davidlohr Bueso, Tim Chen, huang ying,
	dvyukov, glider, aryabinin

On Thu, Jun 06, 2019 at 10:03:15AM +0200, Peter Zijlstra wrote:
> On Wed, Jun 05, 2019 at 01:52:15PM -0700, Linus Torvalds wrote:
> > On Wed, Jun 5, 2019 at 1:19 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > Urgh, that's another thing that's been on the TODO list for a long, long
> > > time: write code to verify the alignment of allocations :/ I'm
> > > suspecting quite a lot of that goes wrong all over the place.
> > 
> > On x86, we only guarantee 8-byte alignment from things like kmalloc(), iirc.
> 
> Oh sure, and I'm not proposing to change that. I was more thinking of
> having a GCC plugin that verifies, for every ptr assignment:
> 
> 	ptr = foo;

To better qualify: 'for every ptr assignment that includes a type cast',
and since allocators return 'void *' and (typically/eventually) assign
to a typed pointer, that would be the place to check.

This avoids having to instrument every single pointer assignment.

> that the actual alignment matches:
> 
> 	assert(!((uintptr_t)ptr % __alignof(*ptr)));
> 
> That would catch bugs like:
> 
> struct bar {
> 	int ponies;
> 	int peaches ____cacheline_aligned_in_smp;
> };
> 
> 	struct bar *barp = kmalloc(sizeof(*barp), GFP_KERNEL);
> 
> Blatantly violating alignment can't be right; either the alignment
> constraints put on the data structures are not important and they should
> be fixed, or we should respect them and fix the allocation, either way,
> we should not silently violate things like we do today.
> 
> 


* Re: [PATCH v8 16/19] locking/rwsem: Guard against making count negative
  2019-05-20 20:59 ` [PATCH v8 16/19] locking/rwsem: Guard against making count negative Waiman Long
@ 2019-06-11 13:11   ` Peter Zijlstra
  2019-06-11 13:27     ` Peter Zijlstra
  2019-06-11 13:13   ` Peter Zijlstra
  2019-06-17 14:34   ` [tip:locking/core] " tip-bot for Waiman Long
  2 siblings, 1 reply; 83+ messages in thread
From: Peter Zijlstra @ 2019-06-11 13:11 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, Borislav Petkov,
	H. Peter Anvin, linux-kernel, x86, Davidlohr Bueso,
	Linus Torvalds, Tim Chen, huang ying

On Mon, May 20, 2019 at 04:59:15PM -0400, Waiman Long wrote:

> +static inline long rwsem_read_trylock(struct rw_semaphore *sem, long *cnt)
> +{
> +	long adjustment = -RWSEM_READER_BIAS;
> +
> +	*cnt = atomic_long_fetch_add_acquire(RWSEM_READER_BIAS, &sem->count);

I'm thinking we'd actually want add_return_acquire() here.

> +	if (unlikely(*cnt < 0)) {
> +		atomic_long_add(-RWSEM_READER_BIAS, &sem->count);
> +		adjustment = 0;
> +	}
> +	return adjustment;
> +}

> @@ -1271,9 +1332,10 @@ static struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem)
>   */
>  inline void __down_read(struct rw_semaphore *sem)
>  {
> +	long tmp, adjustment = rwsem_read_trylock(sem, &tmp);
> +
> +	if (unlikely(tmp & RWSEM_READ_FAILED_MASK)) {
> +		rwsem_down_read_slowpath(sem, TASK_UNINTERRUPTIBLE, adjustment);
>  		DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
>  	} else {
>  		rwsem_set_reader_owned(sem);
> @@ -1282,9 +1344,11 @@ inline void __down_read(struct rw_semaphore *sem)
>  
>  static inline int __down_read_killable(struct rw_semaphore *sem)
>  {
> +	long tmp, adjustment = rwsem_read_trylock(sem, &tmp);
> +
> +	if (unlikely(tmp & RWSEM_READ_FAILED_MASK)) {
> +		if (IS_ERR(rwsem_down_read_slowpath(sem, TASK_KILLABLE,
> +						    adjustment)))
>  			return -EINTR;
>  		DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
>  	} else {

I'm confused by the need for @tmp; isn't that returning the exact same
state !adjustment is?

Also; half the patch seems to do cnt<0, while the other half (above)
does &READ_FAILED, what gives?
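On the fetch_add vs. add_return point above, the difference is just which
value the sign/flag test gets to see (tiny illustrative helper, not patch
code):

	#include <linux/atomic.h>
	#include <linux/bug.h>

	static void reader_bias_demo(atomic_long_t *count, long bias)
	{
		long before, after;

		before = atomic_long_fetch_add_acquire(bias, count);	/* value before the add */
		after  = atomic_long_add_return_acquire(bias, count);	/* value after this add */

		/* The second add sees the first one's bias as well. */
		WARN_ON_ONCE(after != before + 2 * bias);
	}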


* Re: [PATCH v8 16/19] locking/rwsem: Guard against making count negative
  2019-05-20 20:59 ` [PATCH v8 16/19] locking/rwsem: Guard against making count negative Waiman Long
  2019-06-11 13:11   ` Peter Zijlstra
@ 2019-06-11 13:13   ` Peter Zijlstra
  2019-06-17 14:34   ` [tip:locking/core] " tip-bot for Waiman Long
  2 siblings, 0 replies; 83+ messages in thread
From: Peter Zijlstra @ 2019-06-11 13:13 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, Borislav Petkov,
	H. Peter Anvin, linux-kernel, x86, Davidlohr Bueso,
	Linus Torvalds, Tim Chen, huang ying

On Mon, May 20, 2019 at 04:59:15PM -0400, Waiman Long wrote:
>  static struct rw_semaphore __sched *
> +rwsem_down_read_slowpath(struct rw_semaphore *sem, int state, long adjustment)
>  {
> +	long count;
>  	bool wake = false;
>  	struct rwsem_waiter waiter;
>  	DEFINE_WAKE_Q(wake_q);
>  
> +	if (unlikely(!adjustment)) {
> +		/*
> +		 * This shouldn't happen. If it does, there is probably
> +		 * something wrong in the system.
> +		 */
> +		WARN_ON_ONCE(1);

	if (WARN_ON_ONCE(!adjustment)) {

> +
> +		/*
> +		 * An adjustment of 0 means that there are too many readers
> +		 * holding or trying to acquire the lock. So disable
> +		 * optimistic spinning and go directly into the wait list.
> +		 */
> +		if (rwsem_test_oflags(sem, RWSEM_RD_NONSPINNABLE))
> +			rwsem_set_nonspinnable(sem);

ISTR rwsem_set_nonspinnable() already does that test, so no need to do
it again, right?

> +		goto queue;
> +	}
> +
>  	/*
>  	 * Save the current read-owner of rwsem, if available, and the
>  	 * reader nonspinnable bit.


* Re: [PATCH v8 16/19] locking/rwsem: Guard against making count negative
  2019-06-11 13:11   ` Peter Zijlstra
@ 2019-06-11 13:27     ` Peter Zijlstra
  0 siblings, 0 replies; 83+ messages in thread
From: Peter Zijlstra @ 2019-06-11 13:27 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, Borislav Petkov,
	H. Peter Anvin, linux-kernel, x86, Davidlohr Bueso,
	Linus Torvalds, Tim Chen, huang ying

On Tue, Jun 11, 2019 at 03:11:31PM +0200, Peter Zijlstra wrote:
> On Mon, May 20, 2019 at 04:59:15PM -0400, Waiman Long wrote:
> 
> > +static inline long rwsem_read_trylock(struct rw_semaphore *sem, long *cnt)
> > +{
> > +	long adjustment = -RWSEM_READER_BIAS;
> > +
> > +	*cnt = atomic_long_fetch_add_acquire(RWSEM_READER_BIAS, &sem->count);
> 
> I'm thinking we'd actually want add_return_acquire() here.
> 
> > +	if (unlikely(*cnt < 0)) {
> > +		atomic_long_add(-RWSEM_READER_BIAS, &sem->count);
> > +		adjustment = 0;
> > +	}
> > +	return adjustment;
> > +}
> 
> > @@ -1271,9 +1332,10 @@ static struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem)
> >   */
> >  inline void __down_read(struct rw_semaphore *sem)
> >  {
> > +	long tmp, adjustment = rwsem_read_trylock(sem, &tmp);
> > +
> > +	if (unlikely(tmp & RWSEM_READ_FAILED_MASK)) {
> > +		rwsem_down_read_slowpath(sem, TASK_UNINTERRUPTIBLE, adjustment);
> >  		DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
> >  	} else {
> >  		rwsem_set_reader_owned(sem);
> > @@ -1282,9 +1344,11 @@ inline void __down_read(struct rw_semaphore *sem)
> >  
> >  static inline int __down_read_killable(struct rw_semaphore *sem)
> >  {
> > +	long tmp, adjustment = rwsem_read_trylock(sem, &tmp);
> > +
> > +	if (unlikely(tmp & RWSEM_READ_FAILED_MASK)) {
> > +		if (IS_ERR(rwsem_down_read_slowpath(sem, TASK_KILLABLE,
> > +						    adjustment)))
> >  			return -EINTR;
> >  		DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
> >  	} else {
> 
> I'm confused by the need for @tmp; isn't that returning the exact same
> state !adjustment is?

Argh.. READ_FAILED_MASK isn't just the MSB. Bah, this is confusing.

Maybe something like so?

--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -116,13 +116,28 @@
 #endif
 
 /*
- * The definition of the atomic counter in the semaphore:
+ * On 64-bit architectures, the bit definitions of the count are:
  *
- * Bit  0   - writer locked bit
- * Bit  1   - waiters present bit
- * Bit  2   - lock handoff bit
- * Bits 3-7 - reserved
- * Bits 8-X - 24-bit (32-bit) or 56-bit reader count
+ * Bit  0    - writer locked bit
+ * Bit  1    - waiters present bit
+ * Bit  2    - lock handoff bit
+ * Bits 3-7  - reserved
+ * Bits 8-62 - 55-bit reader count
+ * Bit  63   - read fail bit
+ *
+ * On 32-bit architectures, the bit definitions of the count are:
+ *
+ * Bit  0    - writer locked bit
+ * Bit  1    - waiters present bit
+ * Bit  2    - lock handoff bit
+ * Bits 3-7  - reserved
+ * Bits 8-30 - 23-bit reader count
+ * Bit  31   - read fail bit
+ *
+ * It is not likely that the most significant bit (read fail bit) will ever
+ * be set. This guard bit is still checked anyway in the down_read() fastpath
+ * just in case we need to use up more of the reader bits for other purpose
+ * in the future.
  *
  * atomic_long_fetch_add() is used to obtain reader lock, whereas
  * atomic_long_cmpxchg() will be used to obtain writer lock.
@@ -139,6 +154,7 @@
 #define RWSEM_WRITER_LOCKED	(1UL << 0)
 #define RWSEM_FLAG_WAITERS	(1UL << 1)
 #define RWSEM_FLAG_HANDOFF	(1UL << 2)
+#define RWSEM_FLAG_READFAIL	(1UL << (BITS_PER_LONG - 1))
 
 #define RWSEM_READER_SHIFT	8
 #define RWSEM_READER_BIAS	(1UL << RWSEM_READER_SHIFT)
@@ -146,7 +162,7 @@
 #define RWSEM_WRITER_MASK	RWSEM_WRITER_LOCKED
 #define RWSEM_LOCK_MASK		(RWSEM_WRITER_MASK|RWSEM_READER_MASK)
 #define RWSEM_READ_FAILED_MASK	(RWSEM_WRITER_MASK|RWSEM_FLAG_WAITERS|\
-				 RWSEM_FLAG_HANDOFF)
+				 RWSEM_FLAG_HANDOFF|RWSEM_FLAG_READFAIL)
 
 /*
  * All writes to owner are protected by WRITE_ONCE() to make sure that
@@ -254,6 +270,14 @@ static inline void rwsem_set_nonspinnabl
 					  owner | RWSEM_NONSPINNABLE));
 }
 
+static inline bool rwsem_read_trylock(struct rw_semaphore *sem)
+{
+	long cnt = atomic_long_add_return_acquire(RWSEM_READER_BIAS, &sem->count);
+
+	WARN_ON_ONCE(cnt < 0);
+	return !(cnt & RWSEM_READ_FAILED_MASK);
+}
+
 /*
  * Return just the real task structure pointer of the owner
  */
@@ -403,6 +427,12 @@ static void rwsem_mark_wake(struct rw_se
 	}
 
 	/*
+	 * No reader wakeup if there are too many of them already.
+	 */
+	if (unlikely(atomic_long_read(&sem->count) < 0))
+		return;
+
+	/*
 	 * Writers might steal the lock before we grant it to the next reader.
 	 * We prefer to do the first reader grant before counting readers
 	 * so we can bail out early if a writer stole the lock.
@@ -949,9 +979,9 @@ static struct rw_semaphore __sched *
 rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
 {
 	long count, adjustment = -RWSEM_READER_BIAS;
-	bool wake = false;
 	struct rwsem_waiter waiter;
 	DEFINE_WAKE_Q(wake_q);
+	bool wake = false;
 
 	/*
 	 * Save the current read-owner of rwsem, if available, and the
@@ -1270,8 +1300,7 @@ static struct rw_semaphore *rwsem_downgr
  */
 inline void __down_read(struct rw_semaphore *sem)
 {
-	if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
-			&sem->count) & RWSEM_READ_FAILED_MASK)) {
+	if (!rwsem_read_trylock(sem)) {
 		rwsem_down_read_slowpath(sem, TASK_UNINTERRUPTIBLE);
 		DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
 	} else {
@@ -1281,9 +1310,8 @@ inline void __down_read(struct rw_semaph
 
 static inline int __down_read_killable(struct rw_semaphore *sem)
 {
-	if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
-			&sem->count) & RWSEM_READ_FAILED_MASK)) {
-		if (IS_ERR(rwsem_down_read_slowpath(sem, TASK_KILLABLE)))
+	if (!rwsem_read_trylock(sem)) {
+		if (IS_ERR(rwsem_down_read_slowpath(sem, TASK_KILLABLE)))
 			return -EINTR;
 		DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
 	} else {
@@ -1359,6 +1387,7 @@ inline void __up_read(struct rw_semaphor
 	DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
 	rwsem_clear_reader_owned(sem);
 	tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
+	DEBUG_RWSEMS_WARN_ON(tmp < 0, sem);
 	if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS)) ==
 		      RWSEM_FLAG_WAITERS)) {
 		clear_wr_nonspinnable(sem);


* [tip:locking/core] locking/rwsem: Make owner available even if !CONFIG_RWSEM_SPIN_ON_OWNER
  2019-05-20 20:59 ` [PATCH v8 01/19] locking/rwsem: Make owner available even if !CONFIG_RWSEM_SPIN_ON_OWNER Waiman Long
@ 2019-06-17 14:23   ` tip-bot for Waiman Long
  0 siblings, 0 replies; 83+ messages in thread
From: tip-bot for Waiman Long @ 2019-06-17 14:23 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: dave, longman, hpa, linux-kernel, tim.c.chen, bp, torvalds,
	huang.ying.caritas, peterz, tglx, mingo, will.deacon

Commit-ID:  c71fd893f614f205dbc050d60299cc5496491c19
Gitweb:     https://git.kernel.org/tip/c71fd893f614f205dbc050d60299cc5496491c19
Author:     Waiman Long <longman@redhat.com>
AuthorDate: Mon, 20 May 2019 16:59:00 -0400
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 17 Jun 2019 12:27:54 +0200

locking/rwsem: Make owner available even if !CONFIG_RWSEM_SPIN_ON_OWNER

The owner field in the rw_semaphore structure is used primarily for
optimistic spinning. However, identifying the rwsem owner can also be
helpful in debugging as well as tracing locking related issues when
analyzing crash dump. The owner field may also store state information
that can be important to the operation of the rwsem.

So the owner field is now made a permanent member of the rw_semaphore
structure irrespective of CONFIG_RWSEM_SPIN_ON_OWNER.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-2-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/rwsem.h       |  9 +++++----
 kernel/locking/rwsem-xadd.c |  2 +-
 kernel/locking/rwsem.h      | 23 -----------------------
 lib/Kconfig.debug           |  8 ++++----
 4 files changed, 10 insertions(+), 32 deletions(-)

diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
index 2ea18a3def04..148983e21d47 100644
--- a/include/linux/rwsem.h
+++ b/include/linux/rwsem.h
@@ -34,12 +34,12 @@
  */
 struct rw_semaphore {
 	atomic_long_t count;
-#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
 	/*
-	 * Write owner. Used as a speculative check to see
-	 * if the owner is running on the cpu.
+	 * Write owner or one of the read owners. Can be used as a
+	 * speculative check to see if the owner is running on the cpu.
 	 */
 	struct task_struct *owner;
+#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
 	struct optimistic_spin_queue osq; /* spinner MCS lock */
 #endif
 	raw_spinlock_t wait_lock;
@@ -73,13 +73,14 @@ static inline int rwsem_is_locked(struct rw_semaphore *sem)
 #endif
 
 #ifdef CONFIG_RWSEM_SPIN_ON_OWNER
-#define __RWSEM_OPT_INIT(lockname) , .osq = OSQ_LOCK_UNLOCKED, .owner = NULL
+#define __RWSEM_OPT_INIT(lockname) , .osq = OSQ_LOCK_UNLOCKED
 #else
 #define __RWSEM_OPT_INIT(lockname)
 #endif
 
 #define __RWSEM_INITIALIZER(name)				\
 	{ __RWSEM_INIT_COUNT(name),				\
+	  .owner = NULL,					\
 	  .wait_list = LIST_HEAD_INIT((name).wait_list),	\
 	  .wait_lock = __RAW_SPIN_LOCK_UNLOCKED(name.wait_lock)	\
 	  __RWSEM_OPT_INIT(name)				\
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 0b1f77957240..c0500679fd2f 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -86,8 +86,8 @@ void __init_rwsem(struct rw_semaphore *sem, const char *name,
 	atomic_long_set(&sem->count, RWSEM_UNLOCKED_VALUE);
 	raw_spin_lock_init(&sem->wait_lock);
 	INIT_LIST_HEAD(&sem->wait_list);
-#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
 	sem->owner = NULL;
+#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
 	osq_lock_init(&sem->osq);
 #endif
 }
diff --git a/kernel/locking/rwsem.h b/kernel/locking/rwsem.h
index 64877f5294e3..eb9c8534299b 100644
--- a/kernel/locking/rwsem.h
+++ b/kernel/locking/rwsem.h
@@ -61,7 +61,6 @@
 #define RWSEM_ACTIVE_READ_BIAS		RWSEM_ACTIVE_BIAS
 #define RWSEM_ACTIVE_WRITE_BIAS		(RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
 
-#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
 /*
  * All writes to owner are protected by WRITE_ONCE() to make sure that
  * store tearing can't happen as optimistic spinners may read and use
@@ -126,7 +125,6 @@ static inline bool rwsem_has_anonymous_owner(struct task_struct *owner)
  * real owner or one of the real owners. The only exception is when the
  * unlock is done by up_read_non_owner().
  */
-#define rwsem_clear_reader_owned rwsem_clear_reader_owned
 static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
 {
 	unsigned long val = (unsigned long)current | RWSEM_READER_OWNED
@@ -135,28 +133,7 @@ static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
 		cmpxchg_relaxed((unsigned long *)&sem->owner, val,
 				RWSEM_READER_OWNED | RWSEM_ANONYMOUSLY_OWNED);
 }
-#endif
-
 #else
-static inline void rwsem_set_owner(struct rw_semaphore *sem)
-{
-}
-
-static inline void rwsem_clear_owner(struct rw_semaphore *sem)
-{
-}
-
-static inline void __rwsem_set_reader_owned(struct rw_semaphore *sem,
-					   struct task_struct *owner)
-{
-}
-
-static inline void rwsem_set_reader_owned(struct rw_semaphore *sem)
-{
-}
-#endif
-
-#ifndef rwsem_clear_reader_owned
 static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
 {
 }
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index cbdfae379896..417bdd9e80fb 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1095,7 +1095,7 @@ config PROVE_LOCKING
 	select DEBUG_SPINLOCK
 	select DEBUG_MUTEXES
 	select DEBUG_RT_MUTEXES if RT_MUTEXES
-	select DEBUG_RWSEMS if RWSEM_SPIN_ON_OWNER
+	select DEBUG_RWSEMS
 	select DEBUG_WW_MUTEX_SLOWPATH
 	select DEBUG_LOCK_ALLOC
 	select TRACE_IRQFLAGS
@@ -1199,10 +1199,10 @@ config DEBUG_WW_MUTEX_SLOWPATH
 
 config DEBUG_RWSEMS
 	bool "RW Semaphore debugging: basic checks"
-	depends on DEBUG_KERNEL && RWSEM_SPIN_ON_OWNER
+	depends on DEBUG_KERNEL
 	help
-	  This debugging feature allows mismatched rw semaphore locks and unlocks
-	  to be detected and reported.
+	  This debugging feature allows mismatched rw semaphore locks
+	  and unlocks to be detected and reported.
 
 config DEBUG_LOCK_ALLOC
 	bool "Lock debugging: detect incorrect freeing of live locks"

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [tip:locking/core] locking/rwsem: Remove rwsem_wake() wakeup optimization
  2019-05-20 20:59 ` [PATCH v8 02/19] locking/rwsem: Remove rwsem_wake() wakeup optimization Waiman Long
@ 2019-06-17 14:24   ` tip-bot for Waiman Long
  0 siblings, 0 replies; 83+ messages in thread
From: tip-bot for Waiman Long @ 2019-06-17 14:24 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: bp, huang.ying.caritas, dave, mingo, linux-kernel, tglx, peterz,
	tim.c.chen, longman, torvalds, hpa, will.deacon

Commit-ID:  5c1ec49b60cdb31e51010f8a647f3189b774bddf
Gitweb:     https://git.kernel.org/tip/5c1ec49b60cdb31e51010f8a647f3189b774bddf
Author:     Waiman Long <longman@redhat.com>
AuthorDate: Mon, 20 May 2019 16:59:01 -0400
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 17 Jun 2019 12:27:55 +0200

locking/rwsem: Remove rwsem_wake() wakeup optimization

After the following commit:

  59aabfc7e959 ("locking/rwsem: Reduce spinlock contention in wakeup after up_read()/up_write()")

rwsem_wake() forgoes doing a wakeup if the wait_lock cannot be directly
acquired and an optimistic spinning locker is present.  This can help
performance by avoiding spinning on the wait_lock when it is contended.

With the later commit:

  133e89ef5ef3 ("locking/rwsem: Enable lockless waiter wakeup(s)")

the performance advantage of the above optimization diminishes as the average
wait_lock hold time becomes much shorter.

With a later patch that supports rwsem lock handoff, we can no
longer rely on the fact that the presence of an optimistic spinning
locker will ensure that the lock will be acquired by a task soon and
that rwsem_wake() will be called later on to wake up the waiters. This
can lead to missed wakeups and application hangs.

So the original 59aabfc7e959 commit has to be reverted.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-3-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/locking/rwsem-xadd.c | 72 ---------------------------------------------
 1 file changed, 72 deletions(-)

diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index c0500679fd2f..3083fdf50447 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -411,25 +411,11 @@ done:
 	lockevent_cond_inc(rwsem_opt_fail, !taken);
 	return taken;
 }
-
-/*
- * Return true if the rwsem has active spinner
- */
-static inline bool rwsem_has_spinner(struct rw_semaphore *sem)
-{
-	return osq_is_locked(&sem->osq);
-}
-
 #else
 static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 {
 	return false;
 }
-
-static inline bool rwsem_has_spinner(struct rw_semaphore *sem)
-{
-	return false;
-}
 #endif
 
 /*
@@ -651,65 +637,7 @@ struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
 	unsigned long flags;
 	DEFINE_WAKE_Q(wake_q);
 
-	/*
-	* __rwsem_down_write_failed_common(sem)
-	*   rwsem_optimistic_spin(sem)
-	*     osq_unlock(sem->osq)
-	*   ...
-	*   atomic_long_add_return(&sem->count)
-	*
-	*      - VS -
-	*
-	*              __up_write()
-	*                if (atomic_long_sub_return_release(&sem->count) < 0)
-	*                  rwsem_wake(sem)
-	*                    osq_is_locked(&sem->osq)
-	*
-	* And __up_write() must observe !osq_is_locked() when it observes the
-	* atomic_long_add_return() in order to not miss a wakeup.
-	*
-	* This boils down to:
-	*
-	* [S.rel] X = 1                [RmW] r0 = (Y += 0)
-	*         MB                         RMB
-	* [RmW]   Y += 1               [L]   r1 = X
-	*
-	* exists (r0=1 /\ r1=0)
-	*/
-	smp_rmb();
-
-	/*
-	 * If a spinner is present, it is not necessary to do the wakeup.
-	 * Try to do wakeup only if the trylock succeeds to minimize
-	 * spinlock contention which may introduce too much delay in the
-	 * unlock operation.
-	 *
-	 *    spinning writer		up_write/up_read caller
-	 *    ---------------		-----------------------
-	 * [S]   osq_unlock()		[L]   osq
-	 *	 MB			      RMB
-	 * [RmW] rwsem_try_write_lock() [RmW] spin_trylock(wait_lock)
-	 *
-	 * Here, it is important to make sure that there won't be a missed
-	 * wakeup while the rwsem is free and the only spinning writer goes
-	 * to sleep without taking the rwsem. Even when the spinning writer
-	 * is just going to break out of the waiting loop, it will still do
-	 * a trylock in rwsem_down_write_failed() before sleeping. IOW, if
-	 * rwsem_has_spinner() is true, it will guarantee at least one
-	 * trylock attempt on the rwsem later on.
-	 */
-	if (rwsem_has_spinner(sem)) {
-		/*
-		 * The smp_rmb() here is to make sure that the spinner
-		 * state is consulted before reading the wait_lock.
-		 */
-		smp_rmb();
-		if (!raw_spin_trylock_irqsave(&sem->wait_lock, flags))
-			return sem;
-		goto locked;
-	}
 	raw_spin_lock_irqsave(&sem->wait_lock, flags);
-locked:
 
 	if (!list_empty(&sem->wait_list))
 		__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [tip:locking/core] locking/rwsem: Implement a new locking scheme
  2019-05-20 20:59 ` [PATCH v8 03/19] locking/rwsem: Implement a new locking scheme Waiman Long
@ 2019-06-17 14:24   ` tip-bot for Waiman Long
  0 siblings, 0 replies; 83+ messages in thread
From: tip-bot for Waiman Long @ 2019-06-17 14:24 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, will.deacon, peterz, dave, hpa, tglx,
	huang.ying.caritas, bp, mingo, tim.c.chen, torvalds, longman

Commit-ID:  64489e78004cb5623211c75790cac90bd25ff5e9
Gitweb:     https://git.kernel.org/tip/64489e78004cb5623211c75790cac90bd25ff5e9
Author:     Waiman Long <longman@redhat.com>
AuthorDate: Mon, 20 May 2019 16:59:02 -0400
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 17 Jun 2019 12:27:56 +0200

locking/rwsem: Implement a new locking scheme

The current way of using various reader, writer and waiting biases
in the rwsem code is confusing and hard to understand. I have to
reread the rwsem count guide in the rwsem-xadd.c file from time to
time to remind myself how this whole thing works. It also makes the
rwsem code harder to optimize.

To make rwsem more sane, a new locking scheme similar to the one in
qrwlock is now being used.  The atomic long count has the following
bit definitions:

  Bit  0   - writer locked bit
  Bit  1   - waiters present bit
  Bits 2-7 - reserved for future extension
  Bits 8-X - reader count (24/56 bits)

The cmpxchg instruction is now used to acquire the write lock. The read
lock is still acquired with xadd instruction, so there is no change here.
This scheme will allow up to 16M (32-bit count) or 64P (64-bit count)
active readers, which should be more than enough. We can always use some
more reserved bits if necessary.

With that change, we can deterministically know if a rwsem has been
write-locked. Looking at the count alone, however, one cannot determine
for certain if a rwsem is owned by readers or not as the readers that
set the reader count bits may be in the process of backing out. So we
still need the reader-owned bit in the owner field to be sure.
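
A short worked example of the new count layout, using the bit
definitions added by this patch (an illustrative decoder, not code
from the patch itself): a value of (3 << RWSEM_READER_SHIFT) |
RWSEM_FLAG_WAITERS, for instance, means three readers hold the lock
and at least one task is queued on the wait list.

  static void rwsem_decode_count(long count)
  {
  	/* Bit 0: a writer holds the lock. */
  	pr_info("writer locked : %d\n", !!(count & RWSEM_WRITER_LOCKED));
  	/* Bit 1: tasks are queued on the wait list. */
  	pr_info("waiters queued: %d\n", !!(count & RWSEM_FLAG_WAITERS));
  	/* Bits 8-X: number of readers holding or attempting the lock. */
  	pr_info("reader count  : %ld\n", count >> RWSEM_READER_SHIFT);
  }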

With a locking microbenchmark running on 5.1 based kernel, the total
locking rates (in kops/s) of the benchmark on a 8-socket 120-core
IvyBridge-EX system before and after the patch were as follows:

                  Before Patch      After Patch
   # of Threads  wlock    rlock    wlock    rlock
   ------------  -----    -----    -----    -----
        1        30,659   31,341   31,055   31,283
        2         8,909   16,457    9,884   17,659
        4         9,028   15,823    8,933   20,233
        8         8,410   14,212    7,230   17,140
       16         8,217   25,240    7,479   24,607

The locking rates of the benchmark on a Power8 system were as follows:

                  Before Patch      After Patch
   # of Threads  wlock    rlock    wlock    rlock
   ------------  -----    -----    -----    -----
        1        12,963   13,647   13,275   13,601
        2         7,570   11,569    7,902   10,829
        4         5,232    5,516    5,466    5,435
        8         5,233    3,386    5,467    3,168

The locking rates of the benchmark on a 2-socket ARM64 system were
as follows:

                  Before Patch      After Patch
   # of Threads  wlock    rlock    wlock    rlock
   ------------  -----    -----    -----    -----
        1        21,495   21,046   21,524   21,074
        2         5,293   10,502    5,333   10,504
        4         5,325   11,463    5,358   11,631
        8         5,391   11,712    5,470   11,680

The performance is roughly the same before and after the patch. There
are run-to-run variations in performance. Runs with higher variances
usually have higher throughput.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-4-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/locking/rwsem-xadd.c | 147 +++++++++++++++-----------------------------
 kernel/locking/rwsem.h      |  74 +++++++++++-----------
 2 files changed, 85 insertions(+), 136 deletions(-)

diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 3083fdf50447..7d537b50a849 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -9,6 +9,8 @@
  *
  * Optimistic spinning by Tim Chen <tim.c.chen@intel.com>
  * and Davidlohr Bueso <davidlohr@hp.com>. Based on mutexes.
+ *
+ * Rwsem count bit fields re-definition by Waiman Long <longman@redhat.com>.
  */
 #include <linux/rwsem.h>
 #include <linux/init.h>
@@ -22,52 +24,20 @@
 #include "rwsem.h"
 
 /*
- * Guide to the rw_semaphore's count field for common values.
- * (32-bit case illustrated, similar for 64-bit)
- *
- * 0x0000000X	(1) X readers active or attempting lock, no writer waiting
- *		    X = #active_readers + #readers attempting to lock
- *		    (X*ACTIVE_BIAS)
- *
- * 0x00000000	rwsem is unlocked, and no one is waiting for the lock or
- *		attempting to read lock or write lock.
- *
- * 0xffff000X	(1) X readers active or attempting lock, with waiters for lock
- *		    X = #active readers + # readers attempting lock
- *		    (X*ACTIVE_BIAS + WAITING_BIAS)
- *		(2) 1 writer attempting lock, no waiters for lock
- *		    X-1 = #active readers + #readers attempting lock
- *		    ((X-1)*ACTIVE_BIAS + ACTIVE_WRITE_BIAS)
- *		(3) 1 writer active, no waiters for lock
- *		    X-1 = #active readers + #readers attempting lock
- *		    ((X-1)*ACTIVE_BIAS + ACTIVE_WRITE_BIAS)
- *
- * 0xffff0001	(1) 1 reader active or attempting lock, waiters for lock
- *		    (WAITING_BIAS + ACTIVE_BIAS)
- *		(2) 1 writer active or attempting lock, no waiters for lock
- *		    (ACTIVE_WRITE_BIAS)
+ * Guide to the rw_semaphore's count field.
  *
- * 0xffff0000	(1) There are writers or readers queued but none active
- *		    or in the process of attempting lock.
- *		    (WAITING_BIAS)
- *		Note: writer can attempt to steal lock for this count by adding
- *		ACTIVE_WRITE_BIAS in cmpxchg and checking the old count
+ * When the RWSEM_WRITER_LOCKED bit in count is set, the lock is owned
+ * by a writer.
  *
- * 0xfffe0001	(1) 1 writer active, or attempting lock. Waiters on queue.
- *		    (ACTIVE_WRITE_BIAS + WAITING_BIAS)
- *
- * Note: Readers attempt to lock by adding ACTIVE_BIAS in down_read and checking
- *	 the count becomes more than 0 for successful lock acquisition,
- *	 i.e. the case where there are only readers or nobody has lock.
- *	 (1st and 2nd case above).
- *
- *	 Writers attempt to lock by adding ACTIVE_WRITE_BIAS in down_write and
- *	 checking the count becomes ACTIVE_WRITE_BIAS for successful lock
- *	 acquisition (i.e. nobody else has lock or attempts lock).  If
- *	 unsuccessful, in rwsem_down_write_failed, we'll check to see if there
- *	 are only waiters but none active (5th case above), and attempt to
- *	 steal the lock.
+ * The lock is owned by readers when
+ * (1) the RWSEM_WRITER_LOCKED isn't set in count,
+ * (2) some of the reader bits are set in count, and
+ * (3) the owner field has RWSEM_READ_OWNED bit set.
  *
+ * Having some reader bits set is not enough to guarantee a readers owned
+ * lock as the readers may be in the process of backing out from the count
+ * and a writer has just released the lock. So another writer may steal
+ * the lock immediately after that.
  */
 
 /*
@@ -113,9 +83,8 @@ enum rwsem_wake_type {
 
 /*
  * handle the lock release when processes blocked on it that can now run
- * - if we come here from up_xxxx(), then:
- *   - the 'active part' of count (&0x0000ffff) reached 0 (but may have changed)
- *   - the 'waiting part' of count (&0xffff0000) is -ve (and will still be so)
+ * - if we come here from up_xxxx(), then the RWSEM_FLAG_WAITERS bit must
+ *   have been set.
  * - there must be someone on the queue
  * - the wait_lock must be held by the caller
  * - tasks are marked for wakeup, the caller must later invoke wake_up_q()
@@ -160,22 +129,11 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
 	 * so we can bail out early if a writer stole the lock.
 	 */
 	if (wake_type != RWSEM_WAKE_READ_OWNED) {
-		adjustment = RWSEM_ACTIVE_READ_BIAS;
- try_reader_grant:
+		adjustment = RWSEM_READER_BIAS;
 		oldcount = atomic_long_fetch_add(adjustment, &sem->count);
-		if (unlikely(oldcount < RWSEM_WAITING_BIAS)) {
-			/*
-			 * If the count is still less than RWSEM_WAITING_BIAS
-			 * after removing the adjustment, it is assumed that
-			 * a writer has stolen the lock. We have to undo our
-			 * reader grant.
-			 */
-			if (atomic_long_add_return(-adjustment, &sem->count) <
-			    RWSEM_WAITING_BIAS)
-				return;
-
-			/* Last active locker left. Retry waking readers. */
-			goto try_reader_grant;
+		if (unlikely(oldcount & RWSEM_WRITER_MASK)) {
+			atomic_long_sub(adjustment, &sem->count);
+			return;
 		}
 		/*
 		 * Set it to reader-owned to give spinners an early
@@ -209,11 +167,11 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
 	}
 	list_cut_before(&wlist, &sem->wait_list, &waiter->list);
 
-	adjustment = woken * RWSEM_ACTIVE_READ_BIAS - adjustment;
+	adjustment = woken * RWSEM_READER_BIAS - adjustment;
 	lockevent_cond_inc(rwsem_wake_reader, woken);
 	if (list_empty(&sem->wait_list)) {
 		/* hit end of list above */
-		adjustment -= RWSEM_WAITING_BIAS;
+		adjustment -= RWSEM_FLAG_WAITERS;
 	}
 
 	if (adjustment)
@@ -248,22 +206,15 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
  */
 static inline bool rwsem_try_write_lock(long count, struct rw_semaphore *sem)
 {
-	/*
-	 * Avoid trying to acquire write lock if count isn't RWSEM_WAITING_BIAS.
-	 */
-	if (count != RWSEM_WAITING_BIAS)
+	long new;
+
+	if (count & RWSEM_LOCK_MASK)
 		return false;
 
-	/*
-	 * Acquire the lock by trying to set it to ACTIVE_WRITE_BIAS. If there
-	 * are other tasks on the wait list, we need to add on WAITING_BIAS.
-	 */
-	count = list_is_singular(&sem->wait_list) ?
-			RWSEM_ACTIVE_WRITE_BIAS :
-			RWSEM_ACTIVE_WRITE_BIAS + RWSEM_WAITING_BIAS;
+	new = count + RWSEM_WRITER_LOCKED -
+	     (list_is_singular(&sem->wait_list) ? RWSEM_FLAG_WAITERS : 0);
 
-	if (atomic_long_cmpxchg_acquire(&sem->count, RWSEM_WAITING_BIAS, count)
-							== RWSEM_WAITING_BIAS) {
+	if (atomic_long_try_cmpxchg_acquire(&sem->count, &count, new)) {
 		rwsem_set_owner(sem);
 		return true;
 	}
@@ -279,9 +230,9 @@ static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem)
 {
 	long count = atomic_long_read(&sem->count);
 
-	while (!count || count == RWSEM_WAITING_BIAS) {
+	while (!(count & RWSEM_LOCK_MASK)) {
 		if (atomic_long_try_cmpxchg_acquire(&sem->count, &count,
-					count + RWSEM_ACTIVE_WRITE_BIAS)) {
+					count + RWSEM_WRITER_LOCKED)) {
 			rwsem_set_owner(sem);
 			lockevent_inc(rwsem_opt_wlock);
 			return true;
@@ -424,7 +375,7 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 static inline struct rw_semaphore __sched *
 __rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
 {
-	long count, adjustment = -RWSEM_ACTIVE_READ_BIAS;
+	long count, adjustment = -RWSEM_READER_BIAS;
 	struct rwsem_waiter waiter;
 	DEFINE_WAKE_Q(wake_q);
 
@@ -436,16 +387,16 @@ __rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
 		/*
 		 * In case the wait queue is empty and the lock isn't owned
 		 * by a writer, this reader can exit the slowpath and return
-		 * immediately as its RWSEM_ACTIVE_READ_BIAS has already
-		 * been set in the count.
+		 * immediately as its RWSEM_READER_BIAS has already been
+		 * set in the count.
 		 */
-		if (atomic_long_read(&sem->count) >= 0) {
+		if (!(atomic_long_read(&sem->count) & RWSEM_WRITER_MASK)) {
 			raw_spin_unlock_irq(&sem->wait_lock);
 			rwsem_set_reader_owned(sem);
 			lockevent_inc(rwsem_rlock_fast);
 			return sem;
 		}
-		adjustment += RWSEM_WAITING_BIAS;
+		adjustment += RWSEM_FLAG_WAITERS;
 	}
 	list_add_tail(&waiter.list, &sem->wait_list);
 
@@ -458,9 +409,8 @@ __rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
 	 * If there are no writers and we are first in the queue,
 	 * wake our own waiter to join the existing active readers !
 	 */
-	if (count == RWSEM_WAITING_BIAS ||
-	    (count > RWSEM_WAITING_BIAS &&
-	     adjustment != -RWSEM_ACTIVE_READ_BIAS))
+	if (!(count & RWSEM_LOCK_MASK) ||
+	   (!(count & RWSEM_WRITER_MASK) && (adjustment & RWSEM_FLAG_WAITERS)))
 		__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
 
 	raw_spin_unlock_irq(&sem->wait_lock);
@@ -488,7 +438,7 @@ __rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
 out_nolock:
 	list_del(&waiter.list);
 	if (list_empty(&sem->wait_list))
-		atomic_long_add(-RWSEM_WAITING_BIAS, &sem->count);
+		atomic_long_andnot(RWSEM_FLAG_WAITERS, &sem->count);
 	raw_spin_unlock_irq(&sem->wait_lock);
 	__set_current_state(TASK_RUNNING);
 	lockevent_inc(rwsem_rlock_fail);
@@ -521,9 +471,6 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
 	struct rw_semaphore *ret = sem;
 	DEFINE_WAKE_Q(wake_q);
 
-	/* undo write bias from down_write operation, stop active locking */
-	count = atomic_long_sub_return(RWSEM_ACTIVE_WRITE_BIAS, &sem->count);
-
 	/* do optimistic spinning and steal lock if possible */
 	if (rwsem_optimistic_spin(sem))
 		return sem;
@@ -543,16 +490,18 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
 
 	list_add_tail(&waiter.list, &sem->wait_list);
 
-	/* we're now waiting on the lock, but no longer actively locking */
+	/* we're now waiting on the lock */
 	if (waiting) {
 		count = atomic_long_read(&sem->count);
 
 		/*
 		 * If there were already threads queued before us and there are
-		 * no active writers, the lock must be read owned; so we try to
-		 * wake any read locks that were queued ahead of us.
+		 * no active writers and some readers, the lock must be read
+		 * owned; so we try to wake any read locks that were queued ahead
+		 * of us.
 		 */
-		if (count > RWSEM_WAITING_BIAS) {
+		if (!(count & RWSEM_WRITER_MASK) &&
+		     (count & RWSEM_READER_MASK)) {
 			__rwsem_mark_wake(sem, RWSEM_WAKE_READERS, &wake_q);
 			/*
 			 * The wakeup is normally called _after_ the wait_lock
@@ -569,8 +518,9 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
 			wake_q_init(&wake_q);
 		}
 
-	} else
-		count = atomic_long_add_return(RWSEM_WAITING_BIAS, &sem->count);
+	} else {
+		count = atomic_long_add_return(RWSEM_FLAG_WAITERS, &sem->count);
+	}
 
 	/* wait until we successfully acquire the lock */
 	set_current_state(state);
@@ -587,7 +537,8 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
 			schedule();
 			lockevent_inc(rwsem_sleep_writer);
 			set_current_state(state);
-		} while ((count = atomic_long_read(&sem->count)) & RWSEM_ACTIVE_MASK);
+			count = atomic_long_read(&sem->count);
+		} while (count & RWSEM_LOCK_MASK);
 
 		raw_spin_lock_irq(&sem->wait_lock);
 	}
@@ -603,7 +554,7 @@ out_nolock:
 	raw_spin_lock_irq(&sem->wait_lock);
 	list_del(&waiter.list);
 	if (list_empty(&sem->wait_list))
-		atomic_long_add(-RWSEM_WAITING_BIAS, &sem->count);
+		atomic_long_andnot(RWSEM_FLAG_WAITERS, &sem->count);
 	else
 		__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
 	raw_spin_unlock_irq(&sem->wait_lock);
diff --git a/kernel/locking/rwsem.h b/kernel/locking/rwsem.h
index eb9c8534299b..499a9b2bda82 100644
--- a/kernel/locking/rwsem.h
+++ b/kernel/locking/rwsem.h
@@ -42,24 +42,24 @@
 #endif
 
 /*
- * R/W semaphores originally for PPC using the stuff in lib/rwsem.c.
- * Adapted largely from include/asm-i386/rwsem.h
- * by Paul Mackerras <paulus@samba.org>.
- */
-
-/*
- * the semaphore definition
+ * The definition of the atomic counter in the semaphore:
+ *
+ * Bit  0   - writer locked bit
+ * Bit  1   - waiters present bit
+ * Bits 2-7 - reserved
+ * Bits 8-X - 24-bit (32-bit) or 56-bit reader count
+ *
+ * atomic_long_fetch_add() is used to obtain reader lock, whereas
+ * atomic_long_cmpxchg() will be used to obtain writer lock.
  */
-#ifdef CONFIG_64BIT
-# define RWSEM_ACTIVE_MASK		0xffffffffL
-#else
-# define RWSEM_ACTIVE_MASK		0x0000ffffL
-#endif
-
-#define RWSEM_ACTIVE_BIAS		0x00000001L
-#define RWSEM_WAITING_BIAS		(-RWSEM_ACTIVE_MASK-1)
-#define RWSEM_ACTIVE_READ_BIAS		RWSEM_ACTIVE_BIAS
-#define RWSEM_ACTIVE_WRITE_BIAS		(RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
+#define RWSEM_WRITER_LOCKED	(1UL << 0)
+#define RWSEM_FLAG_WAITERS	(1UL << 1)
+#define RWSEM_READER_SHIFT	8
+#define RWSEM_READER_BIAS	(1UL << RWSEM_READER_SHIFT)
+#define RWSEM_READER_MASK	(~(RWSEM_READER_BIAS - 1))
+#define RWSEM_WRITER_MASK	RWSEM_WRITER_LOCKED
+#define RWSEM_LOCK_MASK		(RWSEM_WRITER_MASK|RWSEM_READER_MASK)
+#define RWSEM_READ_FAILED_MASK	(RWSEM_WRITER_MASK|RWSEM_FLAG_WAITERS)
 
 /*
  * All writes to owner are protected by WRITE_ONCE() to make sure that
@@ -151,7 +151,8 @@ extern struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem);
  */
 static inline void __down_read(struct rw_semaphore *sem)
 {
-	if (unlikely(atomic_long_inc_return_acquire(&sem->count) <= 0)) {
+	if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
+			&sem->count) & RWSEM_READ_FAILED_MASK)) {
 		rwsem_down_read_failed(sem);
 		DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
 					RWSEM_READER_OWNED), sem);
@@ -162,7 +163,8 @@ static inline void __down_read(struct rw_semaphore *sem)
 
 static inline int __down_read_killable(struct rw_semaphore *sem)
 {
-	if (unlikely(atomic_long_inc_return_acquire(&sem->count) <= 0)) {
+	if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
+			&sem->count) & RWSEM_READ_FAILED_MASK)) {
 		if (IS_ERR(rwsem_down_read_failed_killable(sem)))
 			return -EINTR;
 		DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
@@ -183,11 +185,11 @@ static inline int __down_read_trylock(struct rw_semaphore *sem)
 	lockevent_inc(rwsem_rtrylock);
 	do {
 		if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
-					tmp + RWSEM_ACTIVE_READ_BIAS)) {
+					tmp + RWSEM_READER_BIAS)) {
 			rwsem_set_reader_owned(sem);
 			return 1;
 		}
-	} while (tmp >= 0);
+	} while (!(tmp & RWSEM_READ_FAILED_MASK));
 	return 0;
 }
 
@@ -196,22 +198,16 @@ static inline int __down_read_trylock(struct rw_semaphore *sem)
  */
 static inline void __down_write(struct rw_semaphore *sem)
 {
-	long tmp;
-
-	tmp = atomic_long_add_return_acquire(RWSEM_ACTIVE_WRITE_BIAS,
-					     &sem->count);
-	if (unlikely(tmp != RWSEM_ACTIVE_WRITE_BIAS))
+	if (unlikely(atomic_long_cmpxchg_acquire(&sem->count, 0,
+						 RWSEM_WRITER_LOCKED)))
 		rwsem_down_write_failed(sem);
 	rwsem_set_owner(sem);
 }
 
 static inline int __down_write_killable(struct rw_semaphore *sem)
 {
-	long tmp;
-
-	tmp = atomic_long_add_return_acquire(RWSEM_ACTIVE_WRITE_BIAS,
-					     &sem->count);
-	if (unlikely(tmp != RWSEM_ACTIVE_WRITE_BIAS))
+	if (unlikely(atomic_long_cmpxchg_acquire(&sem->count, 0,
+						 RWSEM_WRITER_LOCKED)))
 		if (IS_ERR(rwsem_down_write_failed_killable(sem)))
 			return -EINTR;
 	rwsem_set_owner(sem);
@@ -224,7 +220,7 @@ static inline int __down_write_trylock(struct rw_semaphore *sem)
 
 	lockevent_inc(rwsem_wtrylock);
 	tmp = atomic_long_cmpxchg_acquire(&sem->count, RWSEM_UNLOCKED_VALUE,
-		      RWSEM_ACTIVE_WRITE_BIAS);
+					  RWSEM_WRITER_LOCKED);
 	if (tmp == RWSEM_UNLOCKED_VALUE) {
 		rwsem_set_owner(sem);
 		return true;
@@ -242,8 +238,9 @@ static inline void __up_read(struct rw_semaphore *sem)
 	DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner & RWSEM_READER_OWNED),
 				sem);
 	rwsem_clear_reader_owned(sem);
-	tmp = atomic_long_dec_return_release(&sem->count);
-	if (unlikely(tmp < -1 && (tmp & RWSEM_ACTIVE_MASK) == 0))
+	tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
+	if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS))
+			== RWSEM_FLAG_WAITERS))
 		rwsem_wake(sem);
 }
 
@@ -254,8 +251,8 @@ static inline void __up_write(struct rw_semaphore *sem)
 {
 	DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
 	rwsem_clear_owner(sem);
-	if (unlikely(atomic_long_sub_return_release(RWSEM_ACTIVE_WRITE_BIAS,
-						    &sem->count) < 0))
+	if (unlikely(atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED,
+			&sem->count) & RWSEM_FLAG_WAITERS))
 		rwsem_wake(sem);
 }
 
@@ -274,8 +271,9 @@ static inline void __downgrade_write(struct rw_semaphore *sem)
 	 * write side. As such, rely on RELEASE semantics.
 	 */
 	DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
-	tmp = atomic_long_add_return_release(-RWSEM_WAITING_BIAS, &sem->count);
+	tmp = atomic_long_fetch_add_release(
+		-RWSEM_WRITER_LOCKED+RWSEM_READER_BIAS, &sem->count);
 	rwsem_set_reader_owned(sem);
-	if (tmp < 0)
+	if (tmp & RWSEM_FLAG_WAITERS)
 		rwsem_downgrade_wake(sem);
 }

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [tip:locking/core] locking/rwsem: Merge rwsem.h and rwsem-xadd.c into rwsem.c
  2019-05-20 20:59 ` [PATCH v8 04/19] locking/rwsem: Merge rwsem.h and rwsem-xadd.c into rwsem.c Waiman Long
@ 2019-06-17 14:25   ` tip-bot for Waiman Long
  0 siblings, 0 replies; 83+ messages in thread
From: tip-bot for Waiman Long @ 2019-06-17 14:25 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tglx, hpa, dave, torvalds, tim.c.chen, bp, linux-kernel, peterz,
	longman, mingo, will.deacon, huang.ying.caritas

Commit-ID:  5dec94d4923683b1dd6a09dc62427a24d79ee7b4
Gitweb:     https://git.kernel.org/tip/5dec94d4923683b1dd6a09dc62427a24d79ee7b4
Author:     Waiman Long <longman@redhat.com>
AuthorDate: Mon, 20 May 2019 16:59:03 -0400
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 17 Jun 2019 12:27:57 +0200

locking/rwsem: Merge rwsem.h and rwsem-xadd.c into rwsem.c

Now we only have one implementation of rwsem. Even though we still use
xadd to handle reader locking, we use cmpxchg for the writer instead. So
the filename rwsem-xadd.c is no longer strictly correct. Also, no one
outside of the rwsem code needs to know the internal implementation
other than the function prototypes of the two internal functions that
are called directly from percpu-rwsem.c.

So the rwsem-xadd.c and rwsem.h files are now merged into rwsem.c in
the following order:

  <upper part of rwsem.h>
  <rwsem-xadd.c>
  <lower part of rwsem.h>
  <rwsem.c>

The rwsem.h file now contains only 2 function declarations for
__up_read() and __down_read().

This is a code relocation patch with no code change at all except
making __up_read() and __down_read() non-static functions so they
can be used by percpu-rwsem.c.
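
Roughly, the surviving kernel/locking/rwsem.h then reduces to the two
prototypes needed by percpu-rwsem.c, along the lines of the sketch
below (the include-guard name is illustrative, not taken from the
patch):

  #ifndef __INTERNAL_RWSEM_H
  #define __INTERNAL_RWSEM_H

  #include <linux/rwsem.h>

  /* Only percpu-rwsem.c needs these internal entry points. */
  extern void __down_read(struct rw_semaphore *sem);
  extern void __up_read(struct rw_semaphore *sem);

  #endif /* __INTERNAL_RWSEM_H */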

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-5-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/locking/Makefile     |   2 +-
 kernel/locking/rwsem-xadd.c | 624 -------------------------------
 kernel/locking/rwsem.c      | 884 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/locking/rwsem.h      | 281 +-------------
 4 files changed, 891 insertions(+), 900 deletions(-)

diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile
index 6fe2f333aecb..45452facff3b 100644
--- a/kernel/locking/Makefile
+++ b/kernel/locking/Makefile
@@ -3,7 +3,7 @@
 # and is generally not a function of system call inputs.
 KCOV_INSTRUMENT		:= n
 
-obj-y += mutex.o semaphore.o rwsem.o percpu-rwsem.o rwsem-xadd.o
+obj-y += mutex.o semaphore.o rwsem.o percpu-rwsem.o
 
 ifdef CONFIG_FUNCTION_TRACER
 CFLAGS_REMOVE_lockdep.o = $(CC_FLAGS_FTRACE)
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
deleted file mode 100644
index 7d537b50a849..000000000000
--- a/kernel/locking/rwsem-xadd.c
+++ /dev/null
@@ -1,624 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0
-/* rwsem.c: R/W semaphores: contention handling functions
- *
- * Written by David Howells (dhowells@redhat.com).
- * Derived from arch/i386/kernel/semaphore.c
- *
- * Writer lock-stealing by Alex Shi <alex.shi@intel.com>
- * and Michel Lespinasse <walken@google.com>
- *
- * Optimistic spinning by Tim Chen <tim.c.chen@intel.com>
- * and Davidlohr Bueso <davidlohr@hp.com>. Based on mutexes.
- *
- * Rwsem count bit fields re-definition by Waiman Long <longman@redhat.com>.
- */
-#include <linux/rwsem.h>
-#include <linux/init.h>
-#include <linux/export.h>
-#include <linux/sched/signal.h>
-#include <linux/sched/rt.h>
-#include <linux/sched/wake_q.h>
-#include <linux/sched/debug.h>
-#include <linux/osq_lock.h>
-
-#include "rwsem.h"
-
-/*
- * Guide to the rw_semaphore's count field.
- *
- * When the RWSEM_WRITER_LOCKED bit in count is set, the lock is owned
- * by a writer.
- *
- * The lock is owned by readers when
- * (1) the RWSEM_WRITER_LOCKED isn't set in count,
- * (2) some of the reader bits are set in count, and
- * (3) the owner field has RWSEM_READ_OWNED bit set.
- *
- * Having some reader bits set is not enough to guarantee a readers owned
- * lock as the readers may be in the process of backing out from the count
- * and a writer has just released the lock. So another writer may steal
- * the lock immediately after that.
- */
-
-/*
- * Initialize an rwsem:
- */
-void __init_rwsem(struct rw_semaphore *sem, const char *name,
-		  struct lock_class_key *key)
-{
-#ifdef CONFIG_DEBUG_LOCK_ALLOC
-	/*
-	 * Make sure we are not reinitializing a held semaphore:
-	 */
-	debug_check_no_locks_freed((void *)sem, sizeof(*sem));
-	lockdep_init_map(&sem->dep_map, name, key, 0);
-#endif
-	atomic_long_set(&sem->count, RWSEM_UNLOCKED_VALUE);
-	raw_spin_lock_init(&sem->wait_lock);
-	INIT_LIST_HEAD(&sem->wait_list);
-	sem->owner = NULL;
-#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
-	osq_lock_init(&sem->osq);
-#endif
-}
-
-EXPORT_SYMBOL(__init_rwsem);
-
-enum rwsem_waiter_type {
-	RWSEM_WAITING_FOR_WRITE,
-	RWSEM_WAITING_FOR_READ
-};
-
-struct rwsem_waiter {
-	struct list_head list;
-	struct task_struct *task;
-	enum rwsem_waiter_type type;
-};
-
-enum rwsem_wake_type {
-	RWSEM_WAKE_ANY,		/* Wake whatever's at head of wait list */
-	RWSEM_WAKE_READERS,	/* Wake readers only */
-	RWSEM_WAKE_READ_OWNED	/* Waker thread holds the read lock */
-};
-
-/*
- * handle the lock release when processes blocked on it that can now run
- * - if we come here from up_xxxx(), then the RWSEM_FLAG_WAITERS bit must
- *   have been set.
- * - there must be someone on the queue
- * - the wait_lock must be held by the caller
- * - tasks are marked for wakeup, the caller must later invoke wake_up_q()
- *   to actually wakeup the blocked task(s) and drop the reference count,
- *   preferably when the wait_lock is released
- * - woken process blocks are discarded from the list after having task zeroed
- * - writers are only marked woken if downgrading is false
- */
-static void __rwsem_mark_wake(struct rw_semaphore *sem,
-			      enum rwsem_wake_type wake_type,
-			      struct wake_q_head *wake_q)
-{
-	struct rwsem_waiter *waiter, *tmp;
-	long oldcount, woken = 0, adjustment = 0;
-	struct list_head wlist;
-
-	/*
-	 * Take a peek at the queue head waiter such that we can determine
-	 * the wakeup(s) to perform.
-	 */
-	waiter = list_first_entry(&sem->wait_list, struct rwsem_waiter, list);
-
-	if (waiter->type == RWSEM_WAITING_FOR_WRITE) {
-		if (wake_type == RWSEM_WAKE_ANY) {
-			/*
-			 * Mark writer at the front of the queue for wakeup.
-			 * Until the task is actually later awoken later by
-			 * the caller, other writers are able to steal it.
-			 * Readers, on the other hand, will block as they
-			 * will notice the queued writer.
-			 */
-			wake_q_add(wake_q, waiter->task);
-			lockevent_inc(rwsem_wake_writer);
-		}
-
-		return;
-	}
-
-	/*
-	 * Writers might steal the lock before we grant it to the next reader.
-	 * We prefer to do the first reader grant before counting readers
-	 * so we can bail out early if a writer stole the lock.
-	 */
-	if (wake_type != RWSEM_WAKE_READ_OWNED) {
-		adjustment = RWSEM_READER_BIAS;
-		oldcount = atomic_long_fetch_add(adjustment, &sem->count);
-		if (unlikely(oldcount & RWSEM_WRITER_MASK)) {
-			atomic_long_sub(adjustment, &sem->count);
-			return;
-		}
-		/*
-		 * Set it to reader-owned to give spinners an early
-		 * indication that readers now have the lock.
-		 */
-		__rwsem_set_reader_owned(sem, waiter->task);
-	}
-
-	/*
-	 * Grant an infinite number of read locks to the readers at the front
-	 * of the queue. We know that woken will be at least 1 as we accounted
-	 * for above. Note we increment the 'active part' of the count by the
-	 * number of readers before waking any processes up.
-	 *
-	 * We have to do wakeup in 2 passes to prevent the possibility that
-	 * the reader count may be decremented before it is incremented. It
-	 * is because the to-be-woken waiter may not have slept yet. So it
-	 * may see waiter->task got cleared, finish its critical section and
-	 * do an unlock before the reader count increment.
-	 *
-	 * 1) Collect the read-waiters in a separate list, count them and
-	 *    fully increment the reader count in rwsem.
-	 * 2) For each waiters in the new list, clear waiter->task and
-	 *    put them into wake_q to be woken up later.
-	 */
-	list_for_each_entry(waiter, &sem->wait_list, list) {
-		if (waiter->type == RWSEM_WAITING_FOR_WRITE)
-			break;
-
-		woken++;
-	}
-	list_cut_before(&wlist, &sem->wait_list, &waiter->list);
-
-	adjustment = woken * RWSEM_READER_BIAS - adjustment;
-	lockevent_cond_inc(rwsem_wake_reader, woken);
-	if (list_empty(&sem->wait_list)) {
-		/* hit end of list above */
-		adjustment -= RWSEM_FLAG_WAITERS;
-	}
-
-	if (adjustment)
-		atomic_long_add(adjustment, &sem->count);
-
-	/* 2nd pass */
-	list_for_each_entry_safe(waiter, tmp, &wlist, list) {
-		struct task_struct *tsk;
-
-		tsk = waiter->task;
-		get_task_struct(tsk);
-
-		/*
-		 * Ensure calling get_task_struct() before setting the reader
-		 * waiter to nil such that rwsem_down_read_failed() cannot
-		 * race with do_exit() by always holding a reference count
-		 * to the task to wakeup.
-		 */
-		smp_store_release(&waiter->task, NULL);
-		/*
-		 * Ensure issuing the wakeup (either by us or someone else)
-		 * after setting the reader waiter to nil.
-		 */
-		wake_q_add_safe(wake_q, tsk);
-	}
-}
-
-/*
- * This function must be called with the sem->wait_lock held to prevent
- * race conditions between checking the rwsem wait list and setting the
- * sem->count accordingly.
- */
-static inline bool rwsem_try_write_lock(long count, struct rw_semaphore *sem)
-{
-	long new;
-
-	if (count & RWSEM_LOCK_MASK)
-		return false;
-
-	new = count + RWSEM_WRITER_LOCKED -
-	     (list_is_singular(&sem->wait_list) ? RWSEM_FLAG_WAITERS : 0);
-
-	if (atomic_long_try_cmpxchg_acquire(&sem->count, &count, new)) {
-		rwsem_set_owner(sem);
-		return true;
-	}
-
-	return false;
-}
-
-#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
-/*
- * Try to acquire write lock before the writer has been put on wait queue.
- */
-static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem)
-{
-	long count = atomic_long_read(&sem->count);
-
-	while (!(count & RWSEM_LOCK_MASK)) {
-		if (atomic_long_try_cmpxchg_acquire(&sem->count, &count,
-					count + RWSEM_WRITER_LOCKED)) {
-			rwsem_set_owner(sem);
-			lockevent_inc(rwsem_opt_wlock);
-			return true;
-		}
-	}
-	return false;
-}
-
-static inline bool owner_on_cpu(struct task_struct *owner)
-{
-	/*
-	 * As lock holder preemption issue, we both skip spinning if
-	 * task is not on cpu or its cpu is preempted
-	 */
-	return owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
-}
-
-static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
-{
-	struct task_struct *owner;
-	bool ret = true;
-
-	BUILD_BUG_ON(!rwsem_has_anonymous_owner(RWSEM_OWNER_UNKNOWN));
-
-	if (need_resched())
-		return false;
-
-	rcu_read_lock();
-	owner = READ_ONCE(sem->owner);
-	if (owner) {
-		ret = is_rwsem_owner_spinnable(owner) &&
-		      owner_on_cpu(owner);
-	}
-	rcu_read_unlock();
-	return ret;
-}
-
-/*
- * Return true only if we can still spin on the owner field of the rwsem.
- */
-static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem)
-{
-	struct task_struct *owner = READ_ONCE(sem->owner);
-
-	if (!is_rwsem_owner_spinnable(owner))
-		return false;
-
-	rcu_read_lock();
-	while (owner && (READ_ONCE(sem->owner) == owner)) {
-		/*
-		 * Ensure we emit the owner->on_cpu, dereference _after_
-		 * checking sem->owner still matches owner, if that fails,
-		 * owner might point to free()d memory, if it still matches,
-		 * the rcu_read_lock() ensures the memory stays valid.
-		 */
-		barrier();
-
-		/*
-		 * abort spinning when need_resched or owner is not running or
-		 * owner's cpu is preempted.
-		 */
-		if (need_resched() || !owner_on_cpu(owner)) {
-			rcu_read_unlock();
-			return false;
-		}
-
-		cpu_relax();
-	}
-	rcu_read_unlock();
-
-	/*
-	 * If there is a new owner or the owner is not set, we continue
-	 * spinning.
-	 */
-	return is_rwsem_owner_spinnable(READ_ONCE(sem->owner));
-}
-
-static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
-{
-	bool taken = false;
-
-	preempt_disable();
-
-	/* sem->wait_lock should not be held when doing optimistic spinning */
-	if (!rwsem_can_spin_on_owner(sem))
-		goto done;
-
-	if (!osq_lock(&sem->osq))
-		goto done;
-
-	/*
-	 * Optimistically spin on the owner field and attempt to acquire the
-	 * lock whenever the owner changes. Spinning will be stopped when:
-	 *  1) the owning writer isn't running; or
-	 *  2) readers own the lock as we can't determine if they are
-	 *     actively running or not.
-	 */
-	while (rwsem_spin_on_owner(sem)) {
-		/*
-		 * Try to acquire the lock
-		 */
-		if (rwsem_try_write_lock_unqueued(sem)) {
-			taken = true;
-			break;
-		}
-
-		/*
-		 * When there's no owner, we might have preempted between the
-		 * owner acquiring the lock and setting the owner field. If
-		 * we're an RT task that will live-lock because we won't let
-		 * the owner complete.
-		 */
-		if (!sem->owner && (need_resched() || rt_task(current)))
-			break;
-
-		/*
-		 * The cpu_relax() call is a compiler barrier which forces
-		 * everything in this loop to be re-loaded. We don't need
-		 * memory barriers as we'll eventually observe the right
-		 * values at the cost of a few extra spins.
-		 */
-		cpu_relax();
-	}
-	osq_unlock(&sem->osq);
-done:
-	preempt_enable();
-	lockevent_cond_inc(rwsem_opt_fail, !taken);
-	return taken;
-}
-#else
-static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
-{
-	return false;
-}
-#endif
-
-/*
- * Wait for the read lock to be granted
- */
-static inline struct rw_semaphore __sched *
-__rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
-{
-	long count, adjustment = -RWSEM_READER_BIAS;
-	struct rwsem_waiter waiter;
-	DEFINE_WAKE_Q(wake_q);
-
-	waiter.task = current;
-	waiter.type = RWSEM_WAITING_FOR_READ;
-
-	raw_spin_lock_irq(&sem->wait_lock);
-	if (list_empty(&sem->wait_list)) {
-		/*
-		 * In case the wait queue is empty and the lock isn't owned
-		 * by a writer, this reader can exit the slowpath and return
-		 * immediately as its RWSEM_READER_BIAS has already been
-		 * set in the count.
-		 */
-		if (!(atomic_long_read(&sem->count) & RWSEM_WRITER_MASK)) {
-			raw_spin_unlock_irq(&sem->wait_lock);
-			rwsem_set_reader_owned(sem);
-			lockevent_inc(rwsem_rlock_fast);
-			return sem;
-		}
-		adjustment += RWSEM_FLAG_WAITERS;
-	}
-	list_add_tail(&waiter.list, &sem->wait_list);
-
-	/* we're now waiting on the lock, but no longer actively locking */
-	count = atomic_long_add_return(adjustment, &sem->count);
-
-	/*
-	 * If there are no active locks, wake the front queued process(es).
-	 *
-	 * If there are no writers and we are first in the queue,
-	 * wake our own waiter to join the existing active readers !
-	 */
-	if (!(count & RWSEM_LOCK_MASK) ||
-	   (!(count & RWSEM_WRITER_MASK) && (adjustment & RWSEM_FLAG_WAITERS)))
-		__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
-
-	raw_spin_unlock_irq(&sem->wait_lock);
-	wake_up_q(&wake_q);
-
-	/* wait to be given the lock */
-	while (true) {
-		set_current_state(state);
-		if (!waiter.task)
-			break;
-		if (signal_pending_state(state, current)) {
-			raw_spin_lock_irq(&sem->wait_lock);
-			if (waiter.task)
-				goto out_nolock;
-			raw_spin_unlock_irq(&sem->wait_lock);
-			break;
-		}
-		schedule();
-		lockevent_inc(rwsem_sleep_reader);
-	}
-
-	__set_current_state(TASK_RUNNING);
-	lockevent_inc(rwsem_rlock);
-	return sem;
-out_nolock:
-	list_del(&waiter.list);
-	if (list_empty(&sem->wait_list))
-		atomic_long_andnot(RWSEM_FLAG_WAITERS, &sem->count);
-	raw_spin_unlock_irq(&sem->wait_lock);
-	__set_current_state(TASK_RUNNING);
-	lockevent_inc(rwsem_rlock_fail);
-	return ERR_PTR(-EINTR);
-}
-
-__visible struct rw_semaphore * __sched
-rwsem_down_read_failed(struct rw_semaphore *sem)
-{
-	return __rwsem_down_read_failed_common(sem, TASK_UNINTERRUPTIBLE);
-}
-EXPORT_SYMBOL(rwsem_down_read_failed);
-
-__visible struct rw_semaphore * __sched
-rwsem_down_read_failed_killable(struct rw_semaphore *sem)
-{
-	return __rwsem_down_read_failed_common(sem, TASK_KILLABLE);
-}
-EXPORT_SYMBOL(rwsem_down_read_failed_killable);
-
-/*
- * Wait until we successfully acquire the write lock
- */
-static inline struct rw_semaphore *
-__rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
-{
-	long count;
-	bool waiting = true; /* any queued threads before us */
-	struct rwsem_waiter waiter;
-	struct rw_semaphore *ret = sem;
-	DEFINE_WAKE_Q(wake_q);
-
-	/* do optimistic spinning and steal lock if possible */
-	if (rwsem_optimistic_spin(sem))
-		return sem;
-
-	/*
-	 * Optimistic spinning failed, proceed to the slowpath
-	 * and block until we can acquire the sem.
-	 */
-	waiter.task = current;
-	waiter.type = RWSEM_WAITING_FOR_WRITE;
-
-	raw_spin_lock_irq(&sem->wait_lock);
-
-	/* account for this before adding a new element to the list */
-	if (list_empty(&sem->wait_list))
-		waiting = false;
-
-	list_add_tail(&waiter.list, &sem->wait_list);
-
-	/* we're now waiting on the lock */
-	if (waiting) {
-		count = atomic_long_read(&sem->count);
-
-		/*
-		 * If there were already threads queued before us and there are
-		 * no active writers and some readers, the lock must be read
-		 * owned; so we try to  any read locks that were queued ahead
-		 * of us.
-		 */
-		if (!(count & RWSEM_WRITER_MASK) &&
-		     (count & RWSEM_READER_MASK)) {
-			__rwsem_mark_wake(sem, RWSEM_WAKE_READERS, &wake_q);
-			/*
-			 * The wakeup is normally called _after_ the wait_lock
-			 * is released, but given that we are proactively waking
-			 * readers we can deal with the wake_q overhead as it is
-			 * similar to releasing and taking the wait_lock again
-			 * for attempting rwsem_try_write_lock().
-			 */
-			wake_up_q(&wake_q);
-
-			/*
-			 * Reinitialize wake_q after use.
-			 */
-			wake_q_init(&wake_q);
-		}
-
-	} else {
-		count = atomic_long_add_return(RWSEM_FLAG_WAITERS, &sem->count);
-	}
-
-	/* wait until we successfully acquire the lock */
-	set_current_state(state);
-	while (true) {
-		if (rwsem_try_write_lock(count, sem))
-			break;
-		raw_spin_unlock_irq(&sem->wait_lock);
-
-		/* Block until there are no active lockers. */
-		do {
-			if (signal_pending_state(state, current))
-				goto out_nolock;
-
-			schedule();
-			lockevent_inc(rwsem_sleep_writer);
-			set_current_state(state);
-			count = atomic_long_read(&sem->count);
-		} while (count & RWSEM_LOCK_MASK);
-
-		raw_spin_lock_irq(&sem->wait_lock);
-	}
-	__set_current_state(TASK_RUNNING);
-	list_del(&waiter.list);
-	raw_spin_unlock_irq(&sem->wait_lock);
-	lockevent_inc(rwsem_wlock);
-
-	return ret;
-
-out_nolock:
-	__set_current_state(TASK_RUNNING);
-	raw_spin_lock_irq(&sem->wait_lock);
-	list_del(&waiter.list);
-	if (list_empty(&sem->wait_list))
-		atomic_long_andnot(RWSEM_FLAG_WAITERS, &sem->count);
-	else
-		__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
-	raw_spin_unlock_irq(&sem->wait_lock);
-	wake_up_q(&wake_q);
-	lockevent_inc(rwsem_wlock_fail);
-
-	return ERR_PTR(-EINTR);
-}
-
-__visible struct rw_semaphore * __sched
-rwsem_down_write_failed(struct rw_semaphore *sem)
-{
-	return __rwsem_down_write_failed_common(sem, TASK_UNINTERRUPTIBLE);
-}
-EXPORT_SYMBOL(rwsem_down_write_failed);
-
-__visible struct rw_semaphore * __sched
-rwsem_down_write_failed_killable(struct rw_semaphore *sem)
-{
-	return __rwsem_down_write_failed_common(sem, TASK_KILLABLE);
-}
-EXPORT_SYMBOL(rwsem_down_write_failed_killable);
-
-/*
- * handle waking up a waiter on the semaphore
- * - up_read/up_write has decremented the active part of count if we come here
- */
-__visible
-struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
-{
-	unsigned long flags;
-	DEFINE_WAKE_Q(wake_q);
-
-	raw_spin_lock_irqsave(&sem->wait_lock, flags);
-
-	if (!list_empty(&sem->wait_list))
-		__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
-
-	raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
-	wake_up_q(&wake_q);
-
-	return sem;
-}
-EXPORT_SYMBOL(rwsem_wake);
-
-/*
- * downgrade a write lock into a read lock
- * - caller incremented waiting part of count and discovered it still negative
- * - just wake up any readers at the front of the queue
- */
-__visible
-struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem)
-{
-	unsigned long flags;
-	DEFINE_WAKE_Q(wake_q);
-
-	raw_spin_lock_irqsave(&sem->wait_lock, flags);
-
-	if (!list_empty(&sem->wait_list))
-		__rwsem_mark_wake(sem, RWSEM_WAKE_READ_OWNED, &wake_q);
-
-	raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
-	wake_up_q(&wake_q);
-
-	return sem;
-}
-EXPORT_SYMBOL(rwsem_downgrade_wake);
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index ccbf18f560ff..8317bcdf063b 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -3,17 +3,901 @@
  *
  * Written by David Howells (dhowells@redhat.com).
  * Derived from asm-i386/semaphore.h
+ *
+ * Writer lock-stealing by Alex Shi <alex.shi@intel.com>
+ * and Michel Lespinasse <walken@google.com>
+ *
+ * Optimistic spinning by Tim Chen <tim.c.chen@intel.com>
+ * and Davidlohr Bueso <davidlohr@hp.com>. Based on mutexes.
+ *
+ * Rwsem count bit fields re-definition and rwsem rearchitecture
+ * by Waiman Long <longman@redhat.com>.
  */
 
 #include <linux/types.h>
 #include <linux/kernel.h>
 #include <linux/sched.h>
+#include <linux/sched/rt.h>
+#include <linux/sched/task.h>
 #include <linux/sched/debug.h>
+#include <linux/sched/wake_q.h>
+#include <linux/sched/signal.h>
 #include <linux/export.h>
 #include <linux/rwsem.h>
 #include <linux/atomic.h>
 
 #include "rwsem.h"
+#include "lock_events.h"
+
+/*
+ * The least significant 2 bits of the owner value has the following
+ * meanings when set.
+ *  - RWSEM_READER_OWNED (bit 0): The rwsem is owned by readers
+ *  - RWSEM_ANONYMOUSLY_OWNED (bit 1): The rwsem is anonymously owned,
+ *    i.e. the owner(s) cannot be readily determined. It can be reader
+ *    owned or the owning writer is indeterminate.
+ *
+ * When a writer acquires a rwsem, it puts its task_struct pointer
+ * into the owner field. It is cleared after an unlock.
+ *
+ * When a reader acquires a rwsem, it will also put its task_struct
+ * pointer into the owner field with both the RWSEM_READER_OWNED and
+ * RWSEM_ANONYMOUSLY_OWNED bits set. On unlock, the owner field will
+ * largely be left untouched. So for a free or reader-owned rwsem,
+ * the owner value may contain information about the last reader that
+ * acquires the rwsem. The anonymous bit is set because that particular
+ * reader may or may not still own the lock.
+ *
+ * That information may be helpful in debugging cases where the system
+ * seems to hang on a reader owned rwsem especially if only one reader
+ * is involved. Ideally we would like to track all the readers that own
+ * a rwsem, but the overhead is simply too big.
+ */
+#define RWSEM_READER_OWNED	(1UL << 0)
+#define RWSEM_ANONYMOUSLY_OWNED	(1UL << 1)
+
+#ifdef CONFIG_DEBUG_RWSEMS
+# define DEBUG_RWSEMS_WARN_ON(c, sem)	do {			\
+	if (!debug_locks_silent &&				\
+	    WARN_ONCE(c, "DEBUG_RWSEMS_WARN_ON(%s): count = 0x%lx, owner = 0x%lx, curr 0x%lx, list %sempty\n",\
+		#c, atomic_long_read(&(sem)->count),		\
+		(long)((sem)->owner), (long)current,		\
+		list_empty(&(sem)->wait_list) ? "" : "not "))	\
+			debug_locks_off();			\
+	} while (0)
+#else
+# define DEBUG_RWSEMS_WARN_ON(c, sem)
+#endif
+
+/*
+ * The definition of the atomic counter in the semaphore:
+ *
+ * Bit  0   - writer locked bit
+ * Bit  1   - waiters present bit
+ * Bits 2-7 - reserved
+ * Bits 8-X - 24-bit (32-bit) or 56-bit reader count
+ *
+ * atomic_long_fetch_add() is used to obtain reader lock, whereas
+ * atomic_long_cmpxchg() will be used to obtain writer lock.
+ */
+#define RWSEM_WRITER_LOCKED	(1UL << 0)
+#define RWSEM_FLAG_WAITERS	(1UL << 1)
+#define RWSEM_READER_SHIFT	8
+#define RWSEM_READER_BIAS	(1UL << RWSEM_READER_SHIFT)
+#define RWSEM_READER_MASK	(~(RWSEM_READER_BIAS - 1))
+#define RWSEM_WRITER_MASK	RWSEM_WRITER_LOCKED
+#define RWSEM_LOCK_MASK		(RWSEM_WRITER_MASK|RWSEM_READER_MASK)
+#define RWSEM_READ_FAILED_MASK	(RWSEM_WRITER_MASK|RWSEM_FLAG_WAITERS)
+
+/*
+ * All writes to owner are protected by WRITE_ONCE() to make sure that
+ * store tearing can't happen as optimistic spinners may read and use
+ * the owner value concurrently without lock. Read from owner, however,
+ * may not need READ_ONCE() as long as the pointer value is only used
+ * for comparison and isn't being dereferenced.
+ */
+static inline void rwsem_set_owner(struct rw_semaphore *sem)
+{
+	WRITE_ONCE(sem->owner, current);
+}
+
+static inline void rwsem_clear_owner(struct rw_semaphore *sem)
+{
+	WRITE_ONCE(sem->owner, NULL);
+}
+
+/*
+ * The task_struct pointer of the last owning reader will be left in
+ * the owner field.
+ *
+ * Note that the owner value just indicates the task has owned the rwsem
+ * previously, it may not be the real owner or one of the real owners
+ * anymore when that field is examined, so take it with a grain of salt.
+ */
+static inline void __rwsem_set_reader_owned(struct rw_semaphore *sem,
+					    struct task_struct *owner)
+{
+	unsigned long val = (unsigned long)owner | RWSEM_READER_OWNED
+						 | RWSEM_ANONYMOUSLY_OWNED;
+
+	WRITE_ONCE(sem->owner, (struct task_struct *)val);
+}
+
+static inline void rwsem_set_reader_owned(struct rw_semaphore *sem)
+{
+	__rwsem_set_reader_owned(sem, current);
+}
+
+/*
+ * Return true if a rwsem waiter can spin on the rwsem's owner
+ * and steal the lock, i.e. the lock is not anonymously owned.
+ * N.B. !owner is considered spinnable.
+ */
+static inline bool is_rwsem_owner_spinnable(struct task_struct *owner)
+{
+	return !((unsigned long)owner & RWSEM_ANONYMOUSLY_OWNED);
+}
+
+/*
+ * Return true if rwsem is owned by an anonymous writer or readers.
+ */
+static inline bool rwsem_has_anonymous_owner(struct task_struct *owner)
+{
+	return (unsigned long)owner & RWSEM_ANONYMOUSLY_OWNED;
+}
+
+#ifdef CONFIG_DEBUG_RWSEMS
+/*
+ * With CONFIG_DEBUG_RWSEMS configured, it is ensured that a task
+ * pointer left in the owner field of a reader-owned rwsem is the
+ * real owner or one of the real owners. The only exception is when the
+ * unlock is done by up_read_non_owner().
+ */
+static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
+{
+	unsigned long val = (unsigned long)current | RWSEM_READER_OWNED
+						   | RWSEM_ANONYMOUSLY_OWNED;
+	if (READ_ONCE(sem->owner) == (struct task_struct *)val)
+		cmpxchg_relaxed((unsigned long *)&sem->owner, val,
+				RWSEM_READER_OWNED | RWSEM_ANONYMOUSLY_OWNED);
+}
+#else
+static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
+{
+}
+#endif
+
+/*
+ * Guide to the rw_semaphore's count field.
+ *
+ * When the RWSEM_WRITER_LOCKED bit in count is set, the lock is owned
+ * by a writer.
+ *
+ * The lock is owned by readers when
+ * (1) the RWSEM_WRITER_LOCKED bit isn't set in count,
+ * (2) some of the reader bits are set in count, and
+ * (3) the owner field has the RWSEM_READER_OWNED bit set.
+ *
+ * Having some reader bits set is not enough to guarantee a reader-owned
+ * lock as the readers may be in the process of backing out from the count
+ * and a writer has just released the lock. So another writer may steal
+ * the lock immediately after that.
+ */
+
+/*
+ * Initialize an rwsem:
+ */
+void __init_rwsem(struct rw_semaphore *sem, const char *name,
+		  struct lock_class_key *key)
+{
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	/*
+	 * Make sure we are not reinitializing a held semaphore:
+	 */
+	debug_check_no_locks_freed((void *)sem, sizeof(*sem));
+	lockdep_init_map(&sem->dep_map, name, key, 0);
+#endif
+	atomic_long_set(&sem->count, RWSEM_UNLOCKED_VALUE);
+	raw_spin_lock_init(&sem->wait_lock);
+	INIT_LIST_HEAD(&sem->wait_list);
+	sem->owner = NULL;
+#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
+	osq_lock_init(&sem->osq);
+#endif
+}
+
+EXPORT_SYMBOL(__init_rwsem);
+
+enum rwsem_waiter_type {
+	RWSEM_WAITING_FOR_WRITE,
+	RWSEM_WAITING_FOR_READ
+};
+
+struct rwsem_waiter {
+	struct list_head list;
+	struct task_struct *task;
+	enum rwsem_waiter_type type;
+};
+
+enum rwsem_wake_type {
+	RWSEM_WAKE_ANY,		/* Wake whatever's at head of wait list */
+	RWSEM_WAKE_READERS,	/* Wake readers only */
+	RWSEM_WAKE_READ_OWNED	/* Waker thread holds the read lock */
+};
+
+/*
+ * handle the lock release when processes blocked on it that can now run
+ * - if we come here from up_xxxx(), then the RWSEM_FLAG_WAITERS bit must
+ *   have been set.
+ * - there must be someone on the queue
+ * - the wait_lock must be held by the caller
+ * - tasks are marked for wakeup, the caller must later invoke wake_up_q()
+ *   to actually wakeup the blocked task(s) and drop the reference count,
+ *   preferably when the wait_lock is released
+ * - woken process blocks are discarded from the list after having task zeroed
+ * - writers are only marked woken if downgrading is false
+ */
+static void __rwsem_mark_wake(struct rw_semaphore *sem,
+			      enum rwsem_wake_type wake_type,
+			      struct wake_q_head *wake_q)
+{
+	struct rwsem_waiter *waiter, *tmp;
+	long oldcount, woken = 0, adjustment = 0;
+	struct list_head wlist;
+
+	/*
+	 * Take a peek at the queue head waiter such that we can determine
+	 * the wakeup(s) to perform.
+	 */
+	waiter = list_first_entry(&sem->wait_list, struct rwsem_waiter, list);
+
+	if (waiter->type == RWSEM_WAITING_FOR_WRITE) {
+		if (wake_type == RWSEM_WAKE_ANY) {
+			/*
+			 * Mark writer at the front of the queue for wakeup.
+			 * Until the task is actually awoken later by
+			 * the caller, other writers are able to steal it.
+			 * Readers, on the other hand, will block as they
+			 * will notice the queued writer.
+			 */
+			wake_q_add(wake_q, waiter->task);
+			lockevent_inc(rwsem_wake_writer);
+		}
+
+		return;
+	}
+
+	/*
+	 * Writers might steal the lock before we grant it to the next reader.
+	 * We prefer to do the first reader grant before counting readers
+	 * so we can bail out early if a writer stole the lock.
+	 */
+	if (wake_type != RWSEM_WAKE_READ_OWNED) {
+		adjustment = RWSEM_READER_BIAS;
+		oldcount = atomic_long_fetch_add(adjustment, &sem->count);
+		if (unlikely(oldcount & RWSEM_WRITER_MASK)) {
+			atomic_long_sub(adjustment, &sem->count);
+			return;
+		}
+		/*
+		 * Set it to reader-owned to give spinners an early
+		 * indication that readers now have the lock.
+		 */
+		__rwsem_set_reader_owned(sem, waiter->task);
+	}
+
+	/*
+	 * Grant an infinite number of read locks to the readers at the front
+	 * of the queue. We know that woken will be at least 1 as we accounted
+	 * for above. Note we increment the 'active part' of the count by the
+	 * number of readers before waking any processes up.
+	 *
+	 * We have to do wakeup in 2 passes to prevent the possibility that
+	 * the reader count may be decremented before it is incremented. This
+	 * is because the to-be-woken waiter may not have slept yet. So it
+	 * may see waiter->task cleared, finish its critical section and
+	 * do an unlock before the reader count increment.
+	 *
+	 * 1) Collect the read-waiters in a separate list, count them and
+	 *    fully increment the reader count in rwsem.
+	 * 2) For each waiter in the new list, clear waiter->task and
+	 *    put them into wake_q to be woken up later.
+	 */
+	list_for_each_entry(waiter, &sem->wait_list, list) {
+		if (waiter->type == RWSEM_WAITING_FOR_WRITE)
+			break;
+
+		woken++;
+	}
+	list_cut_before(&wlist, &sem->wait_list, &waiter->list);
+
+	adjustment = woken * RWSEM_READER_BIAS - adjustment;
+	lockevent_cond_inc(rwsem_wake_reader, woken);
+	if (list_empty(&sem->wait_list)) {
+		/* hit end of list above */
+		adjustment -= RWSEM_FLAG_WAITERS;
+	}
+
+	if (adjustment)
+		atomic_long_add(adjustment, &sem->count);
+
+	/* 2nd pass */
+	list_for_each_entry_safe(waiter, tmp, &wlist, list) {
+		struct task_struct *tsk;
+
+		tsk = waiter->task;
+		get_task_struct(tsk);
+
+		/*
+		 * Ensure calling get_task_struct() before setting the reader
+		 * waiter to nil such that rwsem_down_read_failed() cannot
+		 * race with do_exit() by always holding a reference count
+		 * to the task to wakeup.
+		 */
+		smp_store_release(&waiter->task, NULL);
+		/*
+		 * Ensure issuing the wakeup (either by us or someone else)
+		 * after setting the reader waiter to nil.
+		 */
+		wake_q_add_safe(wake_q, tsk);
+	}
+}
+
+/*
+ * This function must be called with the sem->wait_lock held to prevent
+ * race conditions between checking the rwsem wait list and setting the
+ * sem->count accordingly.
+ */
+static inline bool rwsem_try_write_lock(long count, struct rw_semaphore *sem)
+{
+	long new;
+
+	if (count & RWSEM_LOCK_MASK)
+		return false;
+
+	new = count + RWSEM_WRITER_LOCKED -
+	     (list_is_singular(&sem->wait_list) ? RWSEM_FLAG_WAITERS : 0);
+
+	if (atomic_long_try_cmpxchg_acquire(&sem->count, &count, new)) {
+		rwsem_set_owner(sem);
+		return true;
+	}
+
+	return false;
+}
+
+#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
+/*
+ * Try to acquire write lock before the writer has been put on wait queue.
+ */
+static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem)
+{
+	long count = atomic_long_read(&sem->count);
+
+	while (!(count & RWSEM_LOCK_MASK)) {
+		if (atomic_long_try_cmpxchg_acquire(&sem->count, &count,
+					count + RWSEM_WRITER_LOCKED)) {
+			rwsem_set_owner(sem);
+			lockevent_inc(rwsem_opt_wlock);
+			return true;
+		}
+	}
+	return false;
+}
+
+static inline bool owner_on_cpu(struct task_struct *owner)
+{
+	/*
+	 * To guard against lock holder preemption, skip spinning if the
+	 * task is not on a cpu or its cpu is preempted.
+	 */
+	return owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
+}
+
+static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
+{
+	struct task_struct *owner;
+	bool ret = true;
+
+	BUILD_BUG_ON(!rwsem_has_anonymous_owner(RWSEM_OWNER_UNKNOWN));
+
+	if (need_resched())
+		return false;
+
+	rcu_read_lock();
+	owner = READ_ONCE(sem->owner);
+	if (owner) {
+		ret = is_rwsem_owner_spinnable(owner) &&
+		      owner_on_cpu(owner);
+	}
+	rcu_read_unlock();
+	return ret;
+}
+
+/*
+ * Return true only if we can still spin on the owner field of the rwsem.
+ */
+static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem)
+{
+	struct task_struct *owner = READ_ONCE(sem->owner);
+
+	if (!is_rwsem_owner_spinnable(owner))
+		return false;
+
+	rcu_read_lock();
+	while (owner && (READ_ONCE(sem->owner) == owner)) {
+		/*
+		 * Ensure we emit the owner->on_cpu, dereference _after_
+		 * checking sem->owner still matches owner, if that fails,
+		 * owner might point to free()d memory, if it still matches,
+		 * the rcu_read_lock() ensures the memory stays valid.
+		 */
+		barrier();
+
+		/*
+		 * abort spinning when need_resched or owner is not running or
+		 * owner's cpu is preempted.
+		 */
+		if (need_resched() || !owner_on_cpu(owner)) {
+			rcu_read_unlock();
+			return false;
+		}
+
+		cpu_relax();
+	}
+	rcu_read_unlock();
+
+	/*
+	 * If there is a new owner or the owner is not set, we continue
+	 * spinning.
+	 */
+	return is_rwsem_owner_spinnable(READ_ONCE(sem->owner));
+}
+
+static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
+{
+	bool taken = false;
+
+	preempt_disable();
+
+	/* sem->wait_lock should not be held when doing optimistic spinning */
+	if (!rwsem_can_spin_on_owner(sem))
+		goto done;
+
+	if (!osq_lock(&sem->osq))
+		goto done;
+
+	/*
+	 * Optimistically spin on the owner field and attempt to acquire the
+	 * lock whenever the owner changes. Spinning will be stopped when:
+	 *  1) the owning writer isn't running; or
+	 *  2) readers own the lock as we can't determine if they are
+	 *     actively running or not.
+	 */
+	while (rwsem_spin_on_owner(sem)) {
+		/*
+		 * Try to acquire the lock
+		 */
+		if (rwsem_try_write_lock_unqueued(sem)) {
+			taken = true;
+			break;
+		}
+
+		/*
+		 * When there's no owner, we might have been preempted between
+		 * the owner acquiring the lock and setting the owner field. If
+		 * we're an RT task, that will live-lock because we won't let
+		 * the owner complete.
+		 */
+		if (!sem->owner && (need_resched() || rt_task(current)))
+			break;
+
+		/*
+		 * The cpu_relax() call is a compiler barrier which forces
+		 * everything in this loop to be re-loaded. We don't need
+		 * memory barriers as we'll eventually observe the right
+		 * values at the cost of a few extra spins.
+		 */
+		cpu_relax();
+	}
+	osq_unlock(&sem->osq);
+done:
+	preempt_enable();
+	lockevent_cond_inc(rwsem_opt_fail, !taken);
+	return taken;
+}
+#else
+static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
+{
+	return false;
+}
+#endif
+
+/*
+ * Wait for the read lock to be granted
+ */
+static inline struct rw_semaphore __sched *
+__rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
+{
+	long count, adjustment = -RWSEM_READER_BIAS;
+	struct rwsem_waiter waiter;
+	DEFINE_WAKE_Q(wake_q);
+
+	waiter.task = current;
+	waiter.type = RWSEM_WAITING_FOR_READ;
+
+	raw_spin_lock_irq(&sem->wait_lock);
+	if (list_empty(&sem->wait_list)) {
+		/*
+		 * In case the wait queue is empty and the lock isn't owned
+		 * by a writer, this reader can exit the slowpath and return
+		 * immediately as its RWSEM_READER_BIAS has already been
+		 * set in the count.
+		 */
+		if (!(atomic_long_read(&sem->count) & RWSEM_WRITER_MASK)) {
+			raw_spin_unlock_irq(&sem->wait_lock);
+			rwsem_set_reader_owned(sem);
+			lockevent_inc(rwsem_rlock_fast);
+			return sem;
+		}
+		adjustment += RWSEM_FLAG_WAITERS;
+	}
+	list_add_tail(&waiter.list, &sem->wait_list);
+
+	/* we're now waiting on the lock, but no longer actively locking */
+	count = atomic_long_add_return(adjustment, &sem->count);
+
+	/*
+	 * If there are no active locks, wake the front queued process(es).
+	 *
+	 * If there are no writers and we are first in the queue,
+	 * wake our own waiter to join the existing active readers!
+	 */
+	if (!(count & RWSEM_LOCK_MASK) ||
+	   (!(count & RWSEM_WRITER_MASK) && (adjustment & RWSEM_FLAG_WAITERS)))
+		__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
+
+	raw_spin_unlock_irq(&sem->wait_lock);
+	wake_up_q(&wake_q);
+
+	/* wait to be given the lock */
+	while (true) {
+		set_current_state(state);
+		if (!waiter.task)
+			break;
+		if (signal_pending_state(state, current)) {
+			raw_spin_lock_irq(&sem->wait_lock);
+			if (waiter.task)
+				goto out_nolock;
+			raw_spin_unlock_irq(&sem->wait_lock);
+			break;
+		}
+		schedule();
+		lockevent_inc(rwsem_sleep_reader);
+	}
+
+	__set_current_state(TASK_RUNNING);
+	lockevent_inc(rwsem_rlock);
+	return sem;
+out_nolock:
+	list_del(&waiter.list);
+	if (list_empty(&sem->wait_list))
+		atomic_long_andnot(RWSEM_FLAG_WAITERS, &sem->count);
+	raw_spin_unlock_irq(&sem->wait_lock);
+	__set_current_state(TASK_RUNNING);
+	lockevent_inc(rwsem_rlock_fail);
+	return ERR_PTR(-EINTR);
+}
+
+__visible struct rw_semaphore * __sched
+rwsem_down_read_failed(struct rw_semaphore *sem)
+{
+	return __rwsem_down_read_failed_common(sem, TASK_UNINTERRUPTIBLE);
+}
+EXPORT_SYMBOL(rwsem_down_read_failed);
+
+__visible struct rw_semaphore * __sched
+rwsem_down_read_failed_killable(struct rw_semaphore *sem)
+{
+	return __rwsem_down_read_failed_common(sem, TASK_KILLABLE);
+}
+EXPORT_SYMBOL(rwsem_down_read_failed_killable);
+
+/*
+ * Wait until we successfully acquire the write lock
+ */
+static inline struct rw_semaphore *
+__rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
+{
+	long count;
+	bool waiting = true; /* any queued threads before us */
+	struct rwsem_waiter waiter;
+	struct rw_semaphore *ret = sem;
+	DEFINE_WAKE_Q(wake_q);
+
+	/* do optimistic spinning and steal lock if possible */
+	if (rwsem_optimistic_spin(sem))
+		return sem;
+
+	/*
+	 * Optimistic spinning failed, proceed to the slowpath
+	 * and block until we can acquire the sem.
+	 */
+	waiter.task = current;
+	waiter.type = RWSEM_WAITING_FOR_WRITE;
+
+	raw_spin_lock_irq(&sem->wait_lock);
+
+	/* account for this before adding a new element to the list */
+	if (list_empty(&sem->wait_list))
+		waiting = false;
+
+	list_add_tail(&waiter.list, &sem->wait_list);
+
+	/* we're now waiting on the lock */
+	if (waiting) {
+		count = atomic_long_read(&sem->count);
+
+		/*
+		 * If there were already threads queued before us and there are
+		 * no active writers and some readers, the lock must be read
+		 * owned; so we try to wake any read locks that were queued ahead
+		 * of us.
+		 */
+		if (!(count & RWSEM_WRITER_MASK) &&
+		     (count & RWSEM_READER_MASK)) {
+			__rwsem_mark_wake(sem, RWSEM_WAKE_READERS, &wake_q);
+			/*
+			 * The wakeup is normally called _after_ the wait_lock
+			 * is released, but given that we are proactively waking
+			 * readers we can deal with the wake_q overhead as it is
+			 * similar to releasing and taking the wait_lock again
+			 * for attempting rwsem_try_write_lock().
+			 */
+			wake_up_q(&wake_q);
+
+			/*
+			 * Reinitialize wake_q after use.
+			 */
+			wake_q_init(&wake_q);
+		}
+
+	} else {
+		count = atomic_long_add_return(RWSEM_FLAG_WAITERS, &sem->count);
+	}
+
+	/* wait until we successfully acquire the lock */
+	set_current_state(state);
+	while (true) {
+		if (rwsem_try_write_lock(count, sem))
+			break;
+		raw_spin_unlock_irq(&sem->wait_lock);
+
+		/* Block until there are no active lockers. */
+		do {
+			if (signal_pending_state(state, current))
+				goto out_nolock;
+
+			schedule();
+			lockevent_inc(rwsem_sleep_writer);
+			set_current_state(state);
+			count = atomic_long_read(&sem->count);
+		} while (count & RWSEM_LOCK_MASK);
+
+		raw_spin_lock_irq(&sem->wait_lock);
+	}
+	__set_current_state(TASK_RUNNING);
+	list_del(&waiter.list);
+	raw_spin_unlock_irq(&sem->wait_lock);
+	lockevent_inc(rwsem_wlock);
+
+	return ret;
+
+out_nolock:
+	__set_current_state(TASK_RUNNING);
+	raw_spin_lock_irq(&sem->wait_lock);
+	list_del(&waiter.list);
+	if (list_empty(&sem->wait_list))
+		atomic_long_andnot(RWSEM_FLAG_WAITERS, &sem->count);
+	else
+		__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
+	raw_spin_unlock_irq(&sem->wait_lock);
+	wake_up_q(&wake_q);
+	lockevent_inc(rwsem_wlock_fail);
+
+	return ERR_PTR(-EINTR);
+}
+
+__visible struct rw_semaphore * __sched
+rwsem_down_write_failed(struct rw_semaphore *sem)
+{
+	return __rwsem_down_write_failed_common(sem, TASK_UNINTERRUPTIBLE);
+}
+EXPORT_SYMBOL(rwsem_down_write_failed);
+
+__visible struct rw_semaphore * __sched
+rwsem_down_write_failed_killable(struct rw_semaphore *sem)
+{
+	return __rwsem_down_write_failed_common(sem, TASK_KILLABLE);
+}
+EXPORT_SYMBOL(rwsem_down_write_failed_killable);
+
+/*
+ * handle waking up a waiter on the semaphore
+ * - up_read/up_write has decremented the active part of count if we come here
+ */
+__visible
+struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
+{
+	unsigned long flags;
+	DEFINE_WAKE_Q(wake_q);
+
+	raw_spin_lock_irqsave(&sem->wait_lock, flags);
+
+	if (!list_empty(&sem->wait_list))
+		__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
+
+	raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
+	wake_up_q(&wake_q);
+
+	return sem;
+}
+EXPORT_SYMBOL(rwsem_wake);
+
+/*
+ * downgrade a write lock into a read lock
+ * - caller incremented waiting part of count and discovered it still negative
+ * - just wake up any readers at the front of the queue
+ */
+__visible
+struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem)
+{
+	unsigned long flags;
+	DEFINE_WAKE_Q(wake_q);
+
+	raw_spin_lock_irqsave(&sem->wait_lock, flags);
+
+	if (!list_empty(&sem->wait_list))
+		__rwsem_mark_wake(sem, RWSEM_WAKE_READ_OWNED, &wake_q);
+
+	raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
+	wake_up_q(&wake_q);
+
+	return sem;
+}
+EXPORT_SYMBOL(rwsem_downgrade_wake);
+
+/*
+ * lock for reading
+ */
+inline void __down_read(struct rw_semaphore *sem)
+{
+	if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
+			&sem->count) & RWSEM_READ_FAILED_MASK)) {
+		rwsem_down_read_failed(sem);
+		DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
+					RWSEM_READER_OWNED), sem);
+	} else {
+		rwsem_set_reader_owned(sem);
+	}
+}
+
+static inline int __down_read_killable(struct rw_semaphore *sem)
+{
+	if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
+			&sem->count) & RWSEM_READ_FAILED_MASK)) {
+		if (IS_ERR(rwsem_down_read_failed_killable(sem)))
+			return -EINTR;
+		DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
+					RWSEM_READER_OWNED), sem);
+	} else {
+		rwsem_set_reader_owned(sem);
+	}
+	return 0;
+}
+
+static inline int __down_read_trylock(struct rw_semaphore *sem)
+{
+	/*
+	 * Optimize for the case when the rwsem is not locked at all.
+	 */
+	long tmp = RWSEM_UNLOCKED_VALUE;
+
+	lockevent_inc(rwsem_rtrylock);
+	do {
+		if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
+					tmp + RWSEM_READER_BIAS)) {
+			rwsem_set_reader_owned(sem);
+			return 1;
+		}
+	} while (!(tmp & RWSEM_READ_FAILED_MASK));
+	return 0;
+}
+
+/*
+ * lock for writing
+ */
+static inline void __down_write(struct rw_semaphore *sem)
+{
+	if (unlikely(atomic_long_cmpxchg_acquire(&sem->count, 0,
+						 RWSEM_WRITER_LOCKED)))
+		rwsem_down_write_failed(sem);
+	rwsem_set_owner(sem);
+}
+
+static inline int __down_write_killable(struct rw_semaphore *sem)
+{
+	if (unlikely(atomic_long_cmpxchg_acquire(&sem->count, 0,
+						 RWSEM_WRITER_LOCKED)))
+		if (IS_ERR(rwsem_down_write_failed_killable(sem)))
+			return -EINTR;
+	rwsem_set_owner(sem);
+	return 0;
+}
+
+static inline int __down_write_trylock(struct rw_semaphore *sem)
+{
+	long tmp;
+
+	lockevent_inc(rwsem_wtrylock);
+	tmp = atomic_long_cmpxchg_acquire(&sem->count, RWSEM_UNLOCKED_VALUE,
+					  RWSEM_WRITER_LOCKED);
+	if (tmp == RWSEM_UNLOCKED_VALUE) {
+		rwsem_set_owner(sem);
+		return true;
+	}
+	return false;
+}
+
+/*
+ * unlock after reading
+ */
+inline void __up_read(struct rw_semaphore *sem)
+{
+	long tmp;
+
+	DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner & RWSEM_READER_OWNED),
+				sem);
+	rwsem_clear_reader_owned(sem);
+	tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
+	if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS))
+			== RWSEM_FLAG_WAITERS))
+		rwsem_wake(sem);
+}
+
+/*
+ * unlock after writing
+ */
+static inline void __up_write(struct rw_semaphore *sem)
+{
+	DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
+	rwsem_clear_owner(sem);
+	if (unlikely(atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED,
+			&sem->count) & RWSEM_FLAG_WAITERS))
+		rwsem_wake(sem);
+}
+
+/*
+ * downgrade write lock to read lock
+ */
+static inline void __downgrade_write(struct rw_semaphore *sem)
+{
+	long tmp;
+
+	/*
+	 * When downgrading from exclusive to shared ownership,
+	 * anything inside the write-locked region cannot leak
+	 * into the read side. In contrast, anything in the
+	 * read-locked region is ok to be re-ordered into the
+	 * write side. As such, rely on RELEASE semantics.
+	 */
+	DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
+	tmp = atomic_long_fetch_add_release(
+		-RWSEM_WRITER_LOCKED+RWSEM_READER_BIAS, &sem->count);
+	rwsem_set_reader_owned(sem);
+	if (tmp & RWSEM_FLAG_WAITERS)
+		rwsem_downgrade_wake(sem);
+}
 
 /*
  * lock for reading
diff --git a/kernel/locking/rwsem.h b/kernel/locking/rwsem.h
index 499a9b2bda82..2534ce49f648 100644
--- a/kernel/locking/rwsem.h
+++ b/kernel/locking/rwsem.h
@@ -1,279 +1,10 @@
 /* SPDX-License-Identifier: GPL-2.0 */
-/*
- * The least significant 2 bits of the owner value has the following
- * meanings when set.
- *  - RWSEM_READER_OWNED (bit 0): The rwsem is owned by readers
- *  - RWSEM_ANONYMOUSLY_OWNED (bit 1): The rwsem is anonymously owned,
- *    i.e. the owner(s) cannot be readily determined. It can be reader
- *    owned or the owning writer is indeterminate.
- *
- * When a writer acquires a rwsem, it puts its task_struct pointer
- * into the owner field. It is cleared after an unlock.
- *
- * When a reader acquires a rwsem, it will also puts its task_struct
- * pointer into the owner field with both the RWSEM_READER_OWNED and
- * RWSEM_ANONYMOUSLY_OWNED bits set. On unlock, the owner field will
- * largely be left untouched. So for a free or reader-owned rwsem,
- * the owner value may contain information about the last reader that
- * acquires the rwsem. The anonymous bit is set because that particular
- * reader may or may not still own the lock.
- *
- * That information may be helpful in debugging cases where the system
- * seems to hang on a reader owned rwsem especially if only one reader
- * is involved. Ideally we would like to track all the readers that own
- * a rwsem, but the overhead is simply too big.
- */
-#include "lock_events.h"
 
-#define RWSEM_READER_OWNED	(1UL << 0)
-#define RWSEM_ANONYMOUSLY_OWNED	(1UL << 1)
+#ifndef __INTERNAL_RWSEM_H
+#define __INTERNAL_RWSEM_H
+#include <linux/rwsem.h>
 
-#ifdef CONFIG_DEBUG_RWSEMS
-# define DEBUG_RWSEMS_WARN_ON(c, sem)	do {			\
-	if (!debug_locks_silent &&				\
-	    WARN_ONCE(c, "DEBUG_RWSEMS_WARN_ON(%s): count = 0x%lx, owner = 0x%lx, curr 0x%lx, list %sempty\n",\
-		#c, atomic_long_read(&(sem)->count),		\
-		(long)((sem)->owner), (long)current,		\
-		list_empty(&(sem)->wait_list) ? "" : "not "))	\
-			debug_locks_off();			\
-	} while (0)
-#else
-# define DEBUG_RWSEMS_WARN_ON(c, sem)
-#endif
+extern void __down_read(struct rw_semaphore *sem);
+extern void __up_read(struct rw_semaphore *sem);
 
-/*
- * The definition of the atomic counter in the semaphore:
- *
- * Bit  0   - writer locked bit
- * Bit  1   - waiters present bit
- * Bits 2-7 - reserved
- * Bits 8-X - 24-bit (32-bit) or 56-bit reader count
- *
- * atomic_long_fetch_add() is used to obtain reader lock, whereas
- * atomic_long_cmpxchg() will be used to obtain writer lock.
- */
-#define RWSEM_WRITER_LOCKED	(1UL << 0)
-#define RWSEM_FLAG_WAITERS	(1UL << 1)
-#define RWSEM_READER_SHIFT	8
-#define RWSEM_READER_BIAS	(1UL << RWSEM_READER_SHIFT)
-#define RWSEM_READER_MASK	(~(RWSEM_READER_BIAS - 1))
-#define RWSEM_WRITER_MASK	RWSEM_WRITER_LOCKED
-#define RWSEM_LOCK_MASK		(RWSEM_WRITER_MASK|RWSEM_READER_MASK)
-#define RWSEM_READ_FAILED_MASK	(RWSEM_WRITER_MASK|RWSEM_FLAG_WAITERS)
-
-/*
- * All writes to owner are protected by WRITE_ONCE() to make sure that
- * store tearing can't happen as optimistic spinners may read and use
- * the owner value concurrently without lock. Read from owner, however,
- * may not need READ_ONCE() as long as the pointer value is only used
- * for comparison and isn't being dereferenced.
- */
-static inline void rwsem_set_owner(struct rw_semaphore *sem)
-{
-	WRITE_ONCE(sem->owner, current);
-}
-
-static inline void rwsem_clear_owner(struct rw_semaphore *sem)
-{
-	WRITE_ONCE(sem->owner, NULL);
-}
-
-/*
- * The task_struct pointer of the last owning reader will be left in
- * the owner field.
- *
- * Note that the owner value just indicates the task has owned the rwsem
- * previously, it may not be the real owner or one of the real owners
- * anymore when that field is examined, so take it with a grain of salt.
- */
-static inline void __rwsem_set_reader_owned(struct rw_semaphore *sem,
-					    struct task_struct *owner)
-{
-	unsigned long val = (unsigned long)owner | RWSEM_READER_OWNED
-						 | RWSEM_ANONYMOUSLY_OWNED;
-
-	WRITE_ONCE(sem->owner, (struct task_struct *)val);
-}
-
-static inline void rwsem_set_reader_owned(struct rw_semaphore *sem)
-{
-	__rwsem_set_reader_owned(sem, current);
-}
-
-/*
- * Return true if the a rwsem waiter can spin on the rwsem's owner
- * and steal the lock, i.e. the lock is not anonymously owned.
- * N.B. !owner is considered spinnable.
- */
-static inline bool is_rwsem_owner_spinnable(struct task_struct *owner)
-{
-	return !((unsigned long)owner & RWSEM_ANONYMOUSLY_OWNED);
-}
-
-/*
- * Return true if rwsem is owned by an anonymous writer or readers.
- */
-static inline bool rwsem_has_anonymous_owner(struct task_struct *owner)
-{
-	return (unsigned long)owner & RWSEM_ANONYMOUSLY_OWNED;
-}
-
-#ifdef CONFIG_DEBUG_RWSEMS
-/*
- * With CONFIG_DEBUG_RWSEMS configured, it will make sure that if there
- * is a task pointer in owner of a reader-owned rwsem, it will be the
- * real owner or one of the real owners. The only exception is when the
- * unlock is done by up_read_non_owner().
- */
-static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
-{
-	unsigned long val = (unsigned long)current | RWSEM_READER_OWNED
-						   | RWSEM_ANONYMOUSLY_OWNED;
-	if (READ_ONCE(sem->owner) == (struct task_struct *)val)
-		cmpxchg_relaxed((unsigned long *)&sem->owner, val,
-				RWSEM_READER_OWNED | RWSEM_ANONYMOUSLY_OWNED);
-}
-#else
-static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
-{
-}
-#endif
-
-extern struct rw_semaphore *rwsem_down_read_failed(struct rw_semaphore *sem);
-extern struct rw_semaphore *rwsem_down_read_failed_killable(struct rw_semaphore *sem);
-extern struct rw_semaphore *rwsem_down_write_failed(struct rw_semaphore *sem);
-extern struct rw_semaphore *rwsem_down_write_failed_killable(struct rw_semaphore *sem);
-extern struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem);
-extern struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem);
-
-/*
- * lock for reading
- */
-static inline void __down_read(struct rw_semaphore *sem)
-{
-	if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
-			&sem->count) & RWSEM_READ_FAILED_MASK)) {
-		rwsem_down_read_failed(sem);
-		DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
-					RWSEM_READER_OWNED), sem);
-	} else {
-		rwsem_set_reader_owned(sem);
-	}
-}
-
-static inline int __down_read_killable(struct rw_semaphore *sem)
-{
-	if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
-			&sem->count) & RWSEM_READ_FAILED_MASK)) {
-		if (IS_ERR(rwsem_down_read_failed_killable(sem)))
-			return -EINTR;
-		DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
-					RWSEM_READER_OWNED), sem);
-	} else {
-		rwsem_set_reader_owned(sem);
-	}
-	return 0;
-}
-
-static inline int __down_read_trylock(struct rw_semaphore *sem)
-{
-	/*
-	 * Optimize for the case when the rwsem is not locked at all.
-	 */
-	long tmp = RWSEM_UNLOCKED_VALUE;
-
-	lockevent_inc(rwsem_rtrylock);
-	do {
-		if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
-					tmp + RWSEM_READER_BIAS)) {
-			rwsem_set_reader_owned(sem);
-			return 1;
-		}
-	} while (!(tmp & RWSEM_READ_FAILED_MASK));
-	return 0;
-}
-
-/*
- * lock for writing
- */
-static inline void __down_write(struct rw_semaphore *sem)
-{
-	if (unlikely(atomic_long_cmpxchg_acquire(&sem->count, 0,
-						 RWSEM_WRITER_LOCKED)))
-		rwsem_down_write_failed(sem);
-	rwsem_set_owner(sem);
-}
-
-static inline int __down_write_killable(struct rw_semaphore *sem)
-{
-	if (unlikely(atomic_long_cmpxchg_acquire(&sem->count, 0,
-						 RWSEM_WRITER_LOCKED)))
-		if (IS_ERR(rwsem_down_write_failed_killable(sem)))
-			return -EINTR;
-	rwsem_set_owner(sem);
-	return 0;
-}
-
-static inline int __down_write_trylock(struct rw_semaphore *sem)
-{
-	long tmp;
-
-	lockevent_inc(rwsem_wtrylock);
-	tmp = atomic_long_cmpxchg_acquire(&sem->count, RWSEM_UNLOCKED_VALUE,
-					  RWSEM_WRITER_LOCKED);
-	if (tmp == RWSEM_UNLOCKED_VALUE) {
-		rwsem_set_owner(sem);
-		return true;
-	}
-	return false;
-}
-
-/*
- * unlock after reading
- */
-static inline void __up_read(struct rw_semaphore *sem)
-{
-	long tmp;
-
-	DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner & RWSEM_READER_OWNED),
-				sem);
-	rwsem_clear_reader_owned(sem);
-	tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
-	if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS))
-			== RWSEM_FLAG_WAITERS))
-		rwsem_wake(sem);
-}
-
-/*
- * unlock after writing
- */
-static inline void __up_write(struct rw_semaphore *sem)
-{
-	DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
-	rwsem_clear_owner(sem);
-	if (unlikely(atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED,
-			&sem->count) & RWSEM_FLAG_WAITERS))
-		rwsem_wake(sem);
-}
-
-/*
- * downgrade write lock to read lock
- */
-static inline void __downgrade_write(struct rw_semaphore *sem)
-{
-	long tmp;
-
-	/*
-	 * When downgrading from exclusive to shared ownership,
-	 * anything inside the write-locked region cannot leak
-	 * into the read side. In contrast, anything in the
-	 * read-locked region is ok to be re-ordered into the
-	 * write side. As such, rely on RELEASE semantics.
-	 */
-	DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
-	tmp = atomic_long_fetch_add_release(
-		-RWSEM_WRITER_LOCKED+RWSEM_READER_BIAS, &sem->count);
-	rwsem_set_reader_owned(sem);
-	if (tmp & RWSEM_FLAG_WAITERS)
-		rwsem_downgrade_wake(sem);
-}
+#endif /* __INTERNAL_RWSEM_H */
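
For illustration of the count layout documented in this patch (writer
bit 0, waiters bit 1, reader count from bit 8 upward), a raw count
value could be decoded by a small debug helper along these lines. This
is only a sketch built on the macros introduced above, not part of the
patch, and the helper name is hypothetical:

  /* Decode a raw rwsem count into its writer/waiters/reader parts. */
  static void rwsem_decode_count(long count)
  {
	pr_info("writer=%d waiters=%d readers=%lu\n",
		!!(count & RWSEM_WRITER_LOCKED),
		!!(count & RWSEM_FLAG_WAITERS),
		(count & RWSEM_READER_MASK) >> RWSEM_READER_SHIFT);
  }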

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [tip:locking/core] locking/rwsem: Code cleanup after files merging
  2019-05-20 20:59 ` [PATCH v8 05/19] locking/rwsem: Code cleanup after files merging Waiman Long
@ 2019-06-17 14:26   ` tip-bot for Waiman Long
  0 siblings, 0 replies; 83+ messages in thread
From: tip-bot for Waiman Long @ 2019-06-17 14:26 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: bp, longman, tim.c.chen, linux-kernel, peterz, torvalds, mingo,
	tglx, hpa, will.deacon, huang.ying.caritas, dave

Commit-ID:  6cef7ff6e43cbdb9fa8eb91eb9a6b25d45ae11e3
Gitweb:     https://git.kernel.org/tip/6cef7ff6e43cbdb9fa8eb91eb9a6b25d45ae11e3
Author:     Waiman Long <longman@redhat.com>
AuthorDate: Mon, 20 May 2019 16:59:04 -0400
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 17 Jun 2019 12:27:58 +0200

locking/rwsem: Code cleanup after files merging

After merging all the relevant rwsem code into one single file, there
are a number of optimizations and cleanups that can be done:

 1) Remove all the EXPORT_SYMBOL() calls for functions that are not
    accessed elsewhere.
 2) Remove all the __visible tags as none of the functions will be
    called from assembly code anymore.
 3) Make all the internal functions static.
 4) Remove some unneeded blank lines.
 5) Remove the intermediate rwsem_down_{read|write}_failed*() functions
    and rename __rwsem_down_{read|write}_failed_common() to
    rwsem_down_{read|write}_slowpath().
 6) Remove "__" prefix of __rwsem_mark_wake().
 7) Use atomic_long_try_cmpxchg_acquire() as much as possible.
 8) Remove the rwsem_rtrylock and rwsem_wtrylock lock events as they
    are not that useful.

That enables the compiler to do better optimization and reduce code
size. The text+data size of rwsem.o on an x86-64 machine with gcc8 was
reduced from 10237 bytes to 5030 bytes with this change.
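
For reference, item 7 above amounts to folding the old-value check into
the atomic primitive itself. A minimal sketch of the before/after shape
of __down_write_trylock(), mirroring the hunk below and assuming a valid
sem pointer, looks like:

  /* Before: compare the returned old value by hand. */
  long old = atomic_long_cmpxchg_acquire(&sem->count, RWSEM_UNLOCKED_VALUE,
					 RWSEM_WRITER_LOCKED);
  if (old == RWSEM_UNLOCKED_VALUE)
	rwsem_set_owner(sem);

  /* After: the try variant returns a bool and updates 'tmp' on failure. */
  long tmp = RWSEM_UNLOCKED_VALUE;

  if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp, RWSEM_WRITER_LOCKED))
	rwsem_set_owner(sem);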

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-6-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/locking/lock_events_list.h |   2 -
 kernel/locking/rwsem.c            | 135 ++++++++++++--------------------------
 2 files changed, 42 insertions(+), 95 deletions(-)

diff --git a/kernel/locking/lock_events_list.h b/kernel/locking/lock_events_list.h
index ad7668cfc9da..11187a1d40b8 100644
--- a/kernel/locking/lock_events_list.h
+++ b/kernel/locking/lock_events_list.h
@@ -61,7 +61,5 @@ LOCK_EVENT(rwsem_opt_fail)	/* # of failed opt-spinnings		*/
 LOCK_EVENT(rwsem_rlock)		/* # of read locks acquired		*/
 LOCK_EVENT(rwsem_rlock_fast)	/* # of fast read locks acquired	*/
 LOCK_EVENT(rwsem_rlock_fail)	/* # of failed read lock acquisitions	*/
-LOCK_EVENT(rwsem_rtrylock)	/* # of read trylock calls		*/
 LOCK_EVENT(rwsem_wlock)		/* # of write locks acquired		*/
 LOCK_EVENT(rwsem_wlock_fail)	/* # of failed write lock acquisitions	*/
-LOCK_EVENT(rwsem_wtrylock)	/* # of write trylock calls		*/
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 8317bcdf063b..f56329240ef1 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -205,7 +205,6 @@ void __init_rwsem(struct rw_semaphore *sem, const char *name,
 	osq_lock_init(&sem->osq);
 #endif
 }
-
 EXPORT_SYMBOL(__init_rwsem);
 
 enum rwsem_waiter_type {
@@ -237,9 +236,9 @@ enum rwsem_wake_type {
  * - woken process blocks are discarded from the list after having task zeroed
  * - writers are only marked woken if downgrading is false
  */
-static void __rwsem_mark_wake(struct rw_semaphore *sem,
-			      enum rwsem_wake_type wake_type,
-			      struct wake_q_head *wake_q)
+static void rwsem_mark_wake(struct rw_semaphore *sem,
+			    enum rwsem_wake_type wake_type,
+			    struct wake_q_head *wake_q)
 {
 	struct rwsem_waiter *waiter, *tmp;
 	long oldcount, woken = 0, adjustment = 0;
@@ -330,7 +329,7 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
 
 		/*
 		 * Ensure calling get_task_struct() before setting the reader
-		 * waiter to nil such that rwsem_down_read_failed() cannot
+		 * waiter to nil such that rwsem_down_read_slowpath() cannot
 		 * race with do_exit() by always holding a reference count
 		 * to the task to wakeup.
 		 */
@@ -516,8 +515,8 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 /*
  * Wait for the read lock to be granted
  */
-static inline struct rw_semaphore __sched *
-__rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
+static struct rw_semaphore __sched *
+rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
 {
 	long count, adjustment = -RWSEM_READER_BIAS;
 	struct rwsem_waiter waiter;
@@ -555,7 +554,7 @@ __rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
 	 */
 	if (!(count & RWSEM_LOCK_MASK) ||
 	   (!(count & RWSEM_WRITER_MASK) && (adjustment & RWSEM_FLAG_WAITERS)))
-		__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
+		rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
 
 	raw_spin_unlock_irq(&sem->wait_lock);
 	wake_up_q(&wake_q);
@@ -589,25 +588,11 @@ out_nolock:
 	return ERR_PTR(-EINTR);
 }
 
-__visible struct rw_semaphore * __sched
-rwsem_down_read_failed(struct rw_semaphore *sem)
-{
-	return __rwsem_down_read_failed_common(sem, TASK_UNINTERRUPTIBLE);
-}
-EXPORT_SYMBOL(rwsem_down_read_failed);
-
-__visible struct rw_semaphore * __sched
-rwsem_down_read_failed_killable(struct rw_semaphore *sem)
-{
-	return __rwsem_down_read_failed_common(sem, TASK_KILLABLE);
-}
-EXPORT_SYMBOL(rwsem_down_read_failed_killable);
-
 /*
  * Wait until we successfully acquire the write lock
  */
-static inline struct rw_semaphore *
-__rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
+static struct rw_semaphore *
+rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 {
 	long count;
 	bool waiting = true; /* any queued threads before us */
@@ -646,7 +631,7 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
 		 */
 		if (!(count & RWSEM_WRITER_MASK) &&
 		     (count & RWSEM_READER_MASK)) {
-			__rwsem_mark_wake(sem, RWSEM_WAKE_READERS, &wake_q);
+			rwsem_mark_wake(sem, RWSEM_WAKE_READERS, &wake_q);
 			/*
 			 * The wakeup is normally called _after_ the wait_lock
 			 * is released, but given that we are proactively waking
@@ -700,7 +685,7 @@ out_nolock:
 	if (list_empty(&sem->wait_list))
 		atomic_long_andnot(RWSEM_FLAG_WAITERS, &sem->count);
 	else
-		__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
+		rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
 	raw_spin_unlock_irq(&sem->wait_lock);
 	wake_up_q(&wake_q);
 	lockevent_inc(rwsem_wlock_fail);
@@ -708,26 +693,11 @@ out_nolock:
 	return ERR_PTR(-EINTR);
 }
 
-__visible struct rw_semaphore * __sched
-rwsem_down_write_failed(struct rw_semaphore *sem)
-{
-	return __rwsem_down_write_failed_common(sem, TASK_UNINTERRUPTIBLE);
-}
-EXPORT_SYMBOL(rwsem_down_write_failed);
-
-__visible struct rw_semaphore * __sched
-rwsem_down_write_failed_killable(struct rw_semaphore *sem)
-{
-	return __rwsem_down_write_failed_common(sem, TASK_KILLABLE);
-}
-EXPORT_SYMBOL(rwsem_down_write_failed_killable);
-
 /*
  * handle waking up a waiter on the semaphore
  * - up_read/up_write has decremented the active part of count if we come here
  */
-__visible
-struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
+static struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
 {
 	unsigned long flags;
 	DEFINE_WAKE_Q(wake_q);
@@ -735,22 +705,20 @@ struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
 	raw_spin_lock_irqsave(&sem->wait_lock, flags);
 
 	if (!list_empty(&sem->wait_list))
-		__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
+		rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
 
 	raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
 	wake_up_q(&wake_q);
 
 	return sem;
 }
-EXPORT_SYMBOL(rwsem_wake);
 
 /*
  * downgrade a write lock into a read lock
  * - caller incremented waiting part of count and discovered it still negative
  * - just wake up any readers at the front of the queue
  */
-__visible
-struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem)
+static struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem)
 {
 	unsigned long flags;
 	DEFINE_WAKE_Q(wake_q);
@@ -758,14 +726,13 @@ struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem)
 	raw_spin_lock_irqsave(&sem->wait_lock, flags);
 
 	if (!list_empty(&sem->wait_list))
-		__rwsem_mark_wake(sem, RWSEM_WAKE_READ_OWNED, &wake_q);
+		rwsem_mark_wake(sem, RWSEM_WAKE_READ_OWNED, &wake_q);
 
 	raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
 	wake_up_q(&wake_q);
 
 	return sem;
 }
-EXPORT_SYMBOL(rwsem_downgrade_wake);
 
 /*
  * lock for reading
@@ -774,7 +741,7 @@ inline void __down_read(struct rw_semaphore *sem)
 {
 	if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
 			&sem->count) & RWSEM_READ_FAILED_MASK)) {
-		rwsem_down_read_failed(sem);
+		rwsem_down_read_slowpath(sem, TASK_UNINTERRUPTIBLE);
 		DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
 					RWSEM_READER_OWNED), sem);
 	} else {
@@ -786,7 +753,7 @@ static inline int __down_read_killable(struct rw_semaphore *sem)
 {
 	if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
 			&sem->count) & RWSEM_READ_FAILED_MASK)) {
-		if (IS_ERR(rwsem_down_read_failed_killable(sem)))
+		if (IS_ERR(rwsem_down_read_slowpath(sem, TASK_KILLABLE)))
 			return -EINTR;
 		DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
 					RWSEM_READER_OWNED), sem);
@@ -803,7 +770,6 @@ static inline int __down_read_trylock(struct rw_semaphore *sem)
 	 */
 	long tmp = RWSEM_UNLOCKED_VALUE;
 
-	lockevent_inc(rwsem_rtrylock);
 	do {
 		if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
 					tmp + RWSEM_READER_BIAS)) {
@@ -819,30 +785,33 @@ static inline int __down_read_trylock(struct rw_semaphore *sem)
  */
 static inline void __down_write(struct rw_semaphore *sem)
 {
-	if (unlikely(atomic_long_cmpxchg_acquire(&sem->count, 0,
-						 RWSEM_WRITER_LOCKED)))
-		rwsem_down_write_failed(sem);
+	long tmp = RWSEM_UNLOCKED_VALUE;
+
+	if (unlikely(!atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
+						      RWSEM_WRITER_LOCKED)))
+		rwsem_down_write_slowpath(sem, TASK_UNINTERRUPTIBLE);
 	rwsem_set_owner(sem);
 }
 
 static inline int __down_write_killable(struct rw_semaphore *sem)
 {
-	if (unlikely(atomic_long_cmpxchg_acquire(&sem->count, 0,
-						 RWSEM_WRITER_LOCKED)))
-		if (IS_ERR(rwsem_down_write_failed_killable(sem)))
+	long tmp = RWSEM_UNLOCKED_VALUE;
+
+	if (unlikely(!atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
+						      RWSEM_WRITER_LOCKED))) {
+		if (IS_ERR(rwsem_down_write_slowpath(sem, TASK_KILLABLE)))
 			return -EINTR;
+	}
 	rwsem_set_owner(sem);
 	return 0;
 }
 
 static inline int __down_write_trylock(struct rw_semaphore *sem)
 {
-	long tmp;
+	long tmp = RWSEM_UNLOCKED_VALUE;
 
-	lockevent_inc(rwsem_wtrylock);
-	tmp = atomic_long_cmpxchg_acquire(&sem->count, RWSEM_UNLOCKED_VALUE,
-					  RWSEM_WRITER_LOCKED);
-	if (tmp == RWSEM_UNLOCKED_VALUE) {
+	if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
+					    RWSEM_WRITER_LOCKED)) {
 		rwsem_set_owner(sem);
 		return true;
 	}
@@ -856,12 +825,11 @@ inline void __up_read(struct rw_semaphore *sem)
 {
 	long tmp;
 
-	DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner & RWSEM_READER_OWNED),
-				sem);
+	DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner & RWSEM_READER_OWNED), sem);
 	rwsem_clear_reader_owned(sem);
 	tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
-	if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS))
-			== RWSEM_FLAG_WAITERS))
+	if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS)) ==
+		      RWSEM_FLAG_WAITERS))
 		rwsem_wake(sem);
 }
 
@@ -870,10 +838,12 @@ inline void __up_read(struct rw_semaphore *sem)
  */
 static inline void __up_write(struct rw_semaphore *sem)
 {
+	long tmp;
+
 	DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
 	rwsem_clear_owner(sem);
-	if (unlikely(atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED,
-			&sem->count) & RWSEM_FLAG_WAITERS))
+	tmp = atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED, &sem->count);
+	if (unlikely(tmp & RWSEM_FLAG_WAITERS))
 		rwsem_wake(sem);
 }
 
@@ -909,7 +879,6 @@ void __sched down_read(struct rw_semaphore *sem)
 
 	LOCK_CONTENDED(sem, __down_read_trylock, __down_read);
 }
-
 EXPORT_SYMBOL(down_read);
 
 int __sched down_read_killable(struct rw_semaphore *sem)
@@ -924,7 +893,6 @@ int __sched down_read_killable(struct rw_semaphore *sem)
 
 	return 0;
 }
-
 EXPORT_SYMBOL(down_read_killable);
 
 /*
@@ -938,7 +906,6 @@ int down_read_trylock(struct rw_semaphore *sem)
 		rwsem_acquire_read(&sem->dep_map, 0, 1, _RET_IP_);
 	return ret;
 }
-
 EXPORT_SYMBOL(down_read_trylock);
 
 /*
@@ -948,10 +915,8 @@ void __sched down_write(struct rw_semaphore *sem)
 {
 	might_sleep();
 	rwsem_acquire(&sem->dep_map, 0, 0, _RET_IP_);
-
 	LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
 }
-
 EXPORT_SYMBOL(down_write);
 
 /*
@@ -962,14 +927,14 @@ int __sched down_write_killable(struct rw_semaphore *sem)
 	might_sleep();
 	rwsem_acquire(&sem->dep_map, 0, 0, _RET_IP_);
 
-	if (LOCK_CONTENDED_RETURN(sem, __down_write_trylock, __down_write_killable)) {
+	if (LOCK_CONTENDED_RETURN(sem, __down_write_trylock,
+				  __down_write_killable)) {
 		rwsem_release(&sem->dep_map, 1, _RET_IP_);
 		return -EINTR;
 	}
 
 	return 0;
 }
-
 EXPORT_SYMBOL(down_write_killable);
 
 /*
@@ -984,7 +949,6 @@ int down_write_trylock(struct rw_semaphore *sem)
 
 	return ret;
 }
-
 EXPORT_SYMBOL(down_write_trylock);
 
 /*
@@ -993,10 +957,8 @@ EXPORT_SYMBOL(down_write_trylock);
 void up_read(struct rw_semaphore *sem)
 {
 	rwsem_release(&sem->dep_map, 1, _RET_IP_);
-
 	__up_read(sem);
 }
-
 EXPORT_SYMBOL(up_read);
 
 /*
@@ -1005,10 +967,8 @@ EXPORT_SYMBOL(up_read);
 void up_write(struct rw_semaphore *sem)
 {
 	rwsem_release(&sem->dep_map, 1, _RET_IP_);
-
 	__up_write(sem);
 }
-
 EXPORT_SYMBOL(up_write);
 
 /*
@@ -1017,10 +977,8 @@ EXPORT_SYMBOL(up_write);
 void downgrade_write(struct rw_semaphore *sem)
 {
 	lock_downgrade(&sem->dep_map, _RET_IP_);
-
 	__downgrade_write(sem);
 }
-
 EXPORT_SYMBOL(downgrade_write);
 
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
@@ -1029,40 +987,32 @@ void down_read_nested(struct rw_semaphore *sem, int subclass)
 {
 	might_sleep();
 	rwsem_acquire_read(&sem->dep_map, subclass, 0, _RET_IP_);
-
 	LOCK_CONTENDED(sem, __down_read_trylock, __down_read);
 }
-
 EXPORT_SYMBOL(down_read_nested);
 
 void _down_write_nest_lock(struct rw_semaphore *sem, struct lockdep_map *nest)
 {
 	might_sleep();
 	rwsem_acquire_nest(&sem->dep_map, 0, 0, nest, _RET_IP_);
-
 	LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
 }
-
 EXPORT_SYMBOL(_down_write_nest_lock);
 
 void down_read_non_owner(struct rw_semaphore *sem)
 {
 	might_sleep();
-
 	__down_read(sem);
 	__rwsem_set_reader_owned(sem, NULL);
 }
-
 EXPORT_SYMBOL(down_read_non_owner);
 
 void down_write_nested(struct rw_semaphore *sem, int subclass)
 {
 	might_sleep();
 	rwsem_acquire(&sem->dep_map, subclass, 0, _RET_IP_);
-
 	LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
 }
-
 EXPORT_SYMBOL(down_write_nested);
 
 int __sched down_write_killable_nested(struct rw_semaphore *sem, int subclass)
@@ -1070,14 +1020,14 @@ int __sched down_write_killable_nested(struct rw_semaphore *sem, int subclass)
 	might_sleep();
 	rwsem_acquire(&sem->dep_map, subclass, 0, _RET_IP_);
 
-	if (LOCK_CONTENDED_RETURN(sem, __down_write_trylock, __down_write_killable)) {
+	if (LOCK_CONTENDED_RETURN(sem, __down_write_trylock,
+				  __down_write_killable)) {
 		rwsem_release(&sem->dep_map, 1, _RET_IP_);
 		return -EINTR;
 	}
 
 	return 0;
 }
-
 EXPORT_SYMBOL(down_write_killable_nested);
 
 void up_read_non_owner(struct rw_semaphore *sem)
@@ -1086,7 +1036,6 @@ void up_read_non_owner(struct rw_semaphore *sem)
 				sem);
 	__up_read(sem);
 }
-
 EXPORT_SYMBOL(up_read_non_owner);
 
 #endif

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [tip:locking/core] locking/rwsem: Make rwsem_spin_on_owner() return owner state
  2019-05-20 20:59 ` [PATCH v8 06/19] locking/rwsem: Make rwsem_spin_on_owner() return owner state Waiman Long
@ 2019-06-17 14:27   ` tip-bot for Waiman Long
  0 siblings, 0 replies; 83+ messages in thread
From: tip-bot for Waiman Long @ 2019-06-17 14:27 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: dave, hpa, linux-kernel, bp, mingo, torvalds, tim.c.chen,
	will.deacon, tglx, huang.ying.caritas, longman, peterz

Commit-ID:  3f6d517a3ece6e6ced7abcbe798ff332ac5ca586
Gitweb:     https://git.kernel.org/tip/3f6d517a3ece6e6ced7abcbe798ff332ac5ca586
Author:     Waiman Long <longman@redhat.com>
AuthorDate: Mon, 20 May 2019 16:59:05 -0400
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 17 Jun 2019 12:27:59 +0200

locking/rwsem: Make rwsem_spin_on_owner() return owner state

This patch modifies rwsem_spin_on_owner() to return four possible
values to better reflect the state of the lock holder, which enables us
to make a better decision about what to do next.
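
As a rough illustration of how the richer return value is consumed (the
authoritative loop is in the diff below), an optimistic-spin loop keeps
spinning only while the owner state is still considered spinnable:

  /* Sketch only: OWNER_SPINNABLE is (OWNER_NULL | OWNER_WRITER). */
  while (rwsem_spin_on_owner(sem) & OWNER_SPINNABLE) {
	if (rwsem_try_write_lock_unqueued(sem)) {
		taken = true;	/* acquired the write lock */
		break;
	}
	cpu_relax();
  }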

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-7-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/locking/rwsem.c | 65 ++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 47 insertions(+), 18 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index f56329240ef1..8d0f2acfe13d 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -414,17 +414,54 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
 }
 
 /*
- * Return true only if we can still spin on the owner field of the rwsem.
+ * The rwsem_spin_on_owner() function returns the following 4 values
+ * depending on the lock owner state.
+ *   OWNER_NULL  : owner is currently NULL
+ *   OWNER_WRITER: when owner changes and is a writer
+ *   OWNER_READER: when owner changes and the new owner may be a reader.
+ *   OWNER_NONSPINNABLE:
+ *		   when optimistic spinning has to stop because either the
+ *		   owner stops running, is unknown, or its timeslice has
+ *		   been used up.
  */
-static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem)
+enum owner_state {
+	OWNER_NULL		= 1 << 0,
+	OWNER_WRITER		= 1 << 1,
+	OWNER_READER		= 1 << 2,
+	OWNER_NONSPINNABLE	= 1 << 3,
+};
+#define OWNER_SPINNABLE		(OWNER_NULL | OWNER_WRITER)
+
+static inline enum owner_state rwsem_owner_state(unsigned long owner)
 {
-	struct task_struct *owner = READ_ONCE(sem->owner);
+	if (!owner)
+		return OWNER_NULL;
 
-	if (!is_rwsem_owner_spinnable(owner))
-		return false;
+	if (owner & RWSEM_ANONYMOUSLY_OWNED)
+		return OWNER_NONSPINNABLE;
+
+	if (owner & RWSEM_READER_OWNED)
+		return OWNER_READER;
+
+	return OWNER_WRITER;
+}
+
+static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
+{
+	struct task_struct *tmp, *owner = READ_ONCE(sem->owner);
+	enum owner_state state = rwsem_owner_state((unsigned long)owner);
+
+	if (state != OWNER_WRITER)
+		return state;
 
 	rcu_read_lock();
-	while (owner && (READ_ONCE(sem->owner) == owner)) {
+	for (;;) {
+		tmp = READ_ONCE(sem->owner);
+		if (tmp != owner) {
+			state = rwsem_owner_state((unsigned long)tmp);
+			break;
+		}
+
 		/*
 		 * Ensure we emit the owner->on_cpu, dereference _after_
 		 * checking sem->owner still matches owner, if that fails,
@@ -433,24 +470,16 @@ static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem)
 		 */
 		barrier();
 
-		/*
-		 * abort spinning when need_resched or owner is not running or
-		 * owner's cpu is preempted.
-		 */
 		if (need_resched() || !owner_on_cpu(owner)) {
-			rcu_read_unlock();
-			return false;
+			state = OWNER_NONSPINNABLE;
+			break;
 		}
 
 		cpu_relax();
 	}
 	rcu_read_unlock();
 
-	/*
-	 * If there is a new owner or the owner is not set, we continue
-	 * spinning.
-	 */
-	return is_rwsem_owner_spinnable(READ_ONCE(sem->owner));
+	return state;
 }
 
 static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
@@ -473,7 +502,7 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 	 *  2) readers own the lock as we can't determine if they are
 	 *     actively running or not.
 	 */
-	while (rwsem_spin_on_owner(sem)) {
+	while (rwsem_spin_on_owner(sem) & OWNER_SPINNABLE) {
 		/*
 		 * Try to acquire the lock
 		 */

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [tip:locking/core] locking/rwsem: Implement lock handoff to prevent lock starvation
  2019-05-20 20:59 ` [PATCH v8 07/19] locking/rwsem: Implement lock handoff to prevent lock starvation Waiman Long
  2019-06-04  3:03   ` Yuyang Du
@ 2019-06-17 14:27   ` tip-bot for Waiman Long
  1 sibling, 0 replies; 83+ messages in thread
From: tip-bot for Waiman Long @ 2019-06-17 14:27 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: will.deacon, dave, tim.c.chen, hpa, torvalds, bp, peterz, mingo,
	huang.ying.caritas, longman, linux-kernel, tglx

Commit-ID:  4f23dbc1e657951e5d94c60369bc1db065961fb3
Gitweb:     https://git.kernel.org/tip/4f23dbc1e657951e5d94c60369bc1db065961fb3
Author:     Waiman Long <longman@redhat.com>
AuthorDate: Mon, 20 May 2019 16:59:06 -0400
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 17 Jun 2019 12:27:59 +0200

locking/rwsem: Implement lock handoff to prevent lock starvation

Because of writer lock stealing, it is possible that a constant
stream of incoming writers will cause a waiting writer or reader to
wait indefinitely, leading to lock starvation.

This patch implements a lock handoff mechanism to disable lock stealing
and force lock handoff to the first waiter or waiters (for readers) in
the queue after at least a 4ms waiting period, unless it is an RT writer
task which doesn't need to wait. The waiting period ensures that lock
stealing is not discouraged so much that performance suffers.

The setting and clearing of the handoff bit is serialized by the
wait_lock. So racing is not possible.
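
On the reader side, the decision reduces to a jiffies comparison done
under wait_lock. A hedged sketch of that check, using the fields and
macros introduced by the patch below (the helper name itself is
hypothetical), is:

  /*
   * A waiter records jiffies + RWSEM_WAIT_TIMEOUT (~4ms) when it is
   * queued; handoff is requested only if no handoff is already pending
   * and that deadline has passed.
   */
  static bool rwsem_should_request_handoff(long oldcount,
					   struct rwsem_waiter *waiter)
  {
	return !(oldcount & RWSEM_FLAG_HANDOFF) &&
	       time_after(jiffies, waiter->timeout);
  }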

A rwsem microbenchmark was run for 5 seconds on a 2-socket 40-core
80-thread Skylake system with a v5.1 based kernel and 240 write_lock
threads with 5us sleep critical section.

Before the patch, the min/mean/max numbers of locking operations for
the locking threads were 1/7,792/173,696. After the patch, the figures
became 5,842/6,542/7,458. The rwsem clearly became much more fair,
though the mean number of locking operations dropped by about 16% as
the trade-off for the better fairness.

Making the waiter set the handoff bit right after the first wakeup can
impact performance, especially with a mixed reader/writer workload. With
the same microbenchmark with short critical section and equal number of
reader and writer threads (40/40), the reader/writer locking operation
counts with the current patch were:

  40 readers, Iterations Min/Mean/Max = 1,793/1,794/1,796
  40 writers, Iterations Min/Mean/Max = 1,793/34,956/86,081

By making waiter set handoff bit immediately after wakeup:

  40 readers, Iterations Min/Mean/Max = 43/44/46
  40 writers, Iterations Min/Mean/Max = 43/1,263/3,191

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-8-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/locking/lock_events_list.h |   2 +
 kernel/locking/rwsem.c            | 225 +++++++++++++++++++++++++++++---------
 2 files changed, 173 insertions(+), 54 deletions(-)

diff --git a/kernel/locking/lock_events_list.h b/kernel/locking/lock_events_list.h
index 11187a1d40b8..634b47fd8b5e 100644
--- a/kernel/locking/lock_events_list.h
+++ b/kernel/locking/lock_events_list.h
@@ -61,5 +61,7 @@ LOCK_EVENT(rwsem_opt_fail)	/* # of failed opt-spinnings		*/
 LOCK_EVENT(rwsem_rlock)		/* # of read locks acquired		*/
 LOCK_EVENT(rwsem_rlock_fast)	/* # of fast read locks acquired	*/
 LOCK_EVENT(rwsem_rlock_fail)	/* # of failed read lock acquisitions	*/
+LOCK_EVENT(rwsem_rlock_handoff)	/* # of read lock handoffs		*/
 LOCK_EVENT(rwsem_wlock)		/* # of write locks acquired		*/
 LOCK_EVENT(rwsem_wlock_fail)	/* # of failed write lock acquisitions	*/
+LOCK_EVENT(rwsem_wlock_handoff)	/* # of write lock handoffs		*/
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 8d0f2acfe13d..decda9fb8c6d 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -10,8 +10,9 @@
  * Optimistic spinning by Tim Chen <tim.c.chen@intel.com>
  * and Davidlohr Bueso <davidlohr@hp.com>. Based on mutexes.
  *
- * Rwsem count bit fields re-definition and rwsem rearchitecture
- * by Waiman Long <longman@redhat.com>.
+ * Rwsem count bit fields re-definition and rwsem rearchitecture by
+ * Waiman Long <longman@redhat.com> and
+ * Peter Zijlstra <peterz@infradead.org>.
  */
 
 #include <linux/types.h>
@@ -74,20 +75,33 @@
  *
  * Bit  0   - writer locked bit
  * Bit  1   - waiters present bit
- * Bits 2-7 - reserved
+ * Bit  2   - lock handoff bit
+ * Bits 3-7 - reserved
  * Bits 8-X - 24-bit (32-bit) or 56-bit reader count
  *
  * atomic_long_fetch_add() is used to obtain reader lock, whereas
  * atomic_long_cmpxchg() will be used to obtain writer lock.
+ *
+ * There are three places where the lock handoff bit may be set or cleared.
+ * 1) rwsem_mark_wake() for readers.
+ * 2) rwsem_try_write_lock() for writers.
+ * 3) Error path of rwsem_down_write_slowpath().
+ *
+ * For all the above cases, wait_lock will be held. A writer must also
+ * be the first one in the wait_list to be eligible for setting the handoff
+ * bit. So concurrent setting/clearing of handoff bit is not possible.
  */
 #define RWSEM_WRITER_LOCKED	(1UL << 0)
 #define RWSEM_FLAG_WAITERS	(1UL << 1)
+#define RWSEM_FLAG_HANDOFF	(1UL << 2)
+
 #define RWSEM_READER_SHIFT	8
 #define RWSEM_READER_BIAS	(1UL << RWSEM_READER_SHIFT)
 #define RWSEM_READER_MASK	(~(RWSEM_READER_BIAS - 1))
 #define RWSEM_WRITER_MASK	RWSEM_WRITER_LOCKED
 #define RWSEM_LOCK_MASK		(RWSEM_WRITER_MASK|RWSEM_READER_MASK)
-#define RWSEM_READ_FAILED_MASK	(RWSEM_WRITER_MASK|RWSEM_FLAG_WAITERS)
+#define RWSEM_READ_FAILED_MASK	(RWSEM_WRITER_MASK|RWSEM_FLAG_WAITERS|\
+				 RWSEM_FLAG_HANDOFF)
 
 /*
  * All writes to owner are protected by WRITE_ONCE() to make sure that
@@ -216,7 +230,10 @@ struct rwsem_waiter {
 	struct list_head list;
 	struct task_struct *task;
 	enum rwsem_waiter_type type;
+	unsigned long timeout;
 };
+#define rwsem_first_waiter(sem) \
+	list_first_entry(&sem->wait_list, struct rwsem_waiter, list)
 
 enum rwsem_wake_type {
 	RWSEM_WAKE_ANY,		/* Wake whatever's at head of wait list */
@@ -224,6 +241,19 @@ enum rwsem_wake_type {
 	RWSEM_WAKE_READ_OWNED	/* Waker thread holds the read lock */
 };
 
+enum writer_wait_state {
+	WRITER_NOT_FIRST,	/* Writer is not first in wait list */
+	WRITER_FIRST,		/* Writer is first in wait list     */
+	WRITER_HANDOFF		/* Writer is first & handoff needed */
+};
+
+/*
+ * The typical HZ value is either 250 or 1000. So set the minimum waiting
+ * time to at least 4ms or 1 jiffy (if it is higher than 4ms) in the wait
+ * queue before initiating the handoff protocol.
+ */
+#define RWSEM_WAIT_TIMEOUT	DIV_ROUND_UP(HZ, 250)
+
 /*
  * handle the lock release when processes blocked on it that can now run
  * - if we come here from up_xxxx(), then the RWSEM_FLAG_WAITERS bit must
@@ -244,11 +274,13 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
 	long oldcount, woken = 0, adjustment = 0;
 	struct list_head wlist;
 
+	lockdep_assert_held(&sem->wait_lock);
+
 	/*
 	 * Take a peek at the queue head waiter such that we can determine
 	 * the wakeup(s) to perform.
 	 */
-	waiter = list_first_entry(&sem->wait_list, struct rwsem_waiter, list);
+	waiter = rwsem_first_waiter(sem);
 
 	if (waiter->type == RWSEM_WAITING_FOR_WRITE) {
 		if (wake_type == RWSEM_WAKE_ANY) {
@@ -275,7 +307,18 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
 		adjustment = RWSEM_READER_BIAS;
 		oldcount = atomic_long_fetch_add(adjustment, &sem->count);
 		if (unlikely(oldcount & RWSEM_WRITER_MASK)) {
-			atomic_long_sub(adjustment, &sem->count);
+			/*
+			 * When we've been waiting "too" long (for writers
+			 * to give up the lock), request a HANDOFF to
+			 * force the issue.
+			 */
+			if (!(oldcount & RWSEM_FLAG_HANDOFF) &&
+			    time_after(jiffies, waiter->timeout)) {
+				adjustment -= RWSEM_FLAG_HANDOFF;
+				lockevent_inc(rwsem_rlock_handoff);
+			}
+
+			atomic_long_add(-adjustment, &sem->count);
 			return;
 		}
 		/*
@@ -317,6 +360,13 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
 		adjustment -= RWSEM_FLAG_WAITERS;
 	}
 
+	/*
+	 * When we've woken a reader, we no longer need to force writers
+	 * to give up the lock and we can clear HANDOFF.
+	 */
+	if (woken && (atomic_long_read(&sem->count) & RWSEM_FLAG_HANDOFF))
+		adjustment -= RWSEM_FLAG_HANDOFF;
+
 	if (adjustment)
 		atomic_long_add(adjustment, &sem->count);
 
@@ -346,23 +396,48 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
  * This function must be called with the sem->wait_lock held to prevent
  * race conditions between checking the rwsem wait list and setting the
  * sem->count accordingly.
+ *
+ * If wstate is WRITER_HANDOFF, it will make sure that either the handoff
+ * bit is set or the lock is acquired with handoff bit cleared.
  */
-static inline bool rwsem_try_write_lock(long count, struct rw_semaphore *sem)
+static inline bool rwsem_try_write_lock(long count, struct rw_semaphore *sem,
+					enum writer_wait_state wstate)
 {
 	long new;
 
-	if (count & RWSEM_LOCK_MASK)
-		return false;
+	lockdep_assert_held(&sem->wait_lock);
 
-	new = count + RWSEM_WRITER_LOCKED -
-	     (list_is_singular(&sem->wait_list) ? RWSEM_FLAG_WAITERS : 0);
+	do {
+		bool has_handoff = !!(count & RWSEM_FLAG_HANDOFF);
 
-	if (atomic_long_try_cmpxchg_acquire(&sem->count, &count, new)) {
-		rwsem_set_owner(sem);
-		return true;
-	}
+		if (has_handoff && wstate == WRITER_NOT_FIRST)
+			return false;
 
-	return false;
+		new = count;
+
+		if (count & RWSEM_LOCK_MASK) {
+			if (has_handoff || (wstate != WRITER_HANDOFF))
+				return false;
+
+			new |= RWSEM_FLAG_HANDOFF;
+		} else {
+			new |= RWSEM_WRITER_LOCKED;
+			new &= ~RWSEM_FLAG_HANDOFF;
+
+			if (list_is_singular(&sem->wait_list))
+				new &= ~RWSEM_FLAG_WAITERS;
+		}
+	} while (!atomic_long_try_cmpxchg_acquire(&sem->count, &count, new));
+
+	/*
+	 * We have either acquired the lock with handoff bit cleared or
+	 * set the handoff bit.
+	 */
+	if (new & RWSEM_FLAG_HANDOFF)
+		return false;
+
+	rwsem_set_owner(sem);
+	return true;
 }
 
 #ifdef CONFIG_RWSEM_SPIN_ON_OWNER
@@ -373,9 +448,9 @@ static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem)
 {
 	long count = atomic_long_read(&sem->count);
 
-	while (!(count & RWSEM_LOCK_MASK)) {
+	while (!(count & (RWSEM_LOCK_MASK|RWSEM_FLAG_HANDOFF))) {
 		if (atomic_long_try_cmpxchg_acquire(&sem->count, &count,
-					count + RWSEM_WRITER_LOCKED)) {
+					count | RWSEM_WRITER_LOCKED)) {
 			rwsem_set_owner(sem);
 			lockevent_inc(rwsem_opt_wlock);
 			return true;
@@ -456,6 +531,11 @@ static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
 
 	rcu_read_lock();
 	for (;;) {
+		if (atomic_long_read(&sem->count) & RWSEM_FLAG_HANDOFF) {
+			state = OWNER_NONSPINNABLE;
+			break;
+		}
+
 		tmp = READ_ONCE(sem->owner);
 		if (tmp != owner) {
 			state = rwsem_owner_state((unsigned long)tmp);
@@ -553,16 +633,18 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
 
 	waiter.task = current;
 	waiter.type = RWSEM_WAITING_FOR_READ;
+	waiter.timeout = jiffies + RWSEM_WAIT_TIMEOUT;
 
 	raw_spin_lock_irq(&sem->wait_lock);
 	if (list_empty(&sem->wait_list)) {
 		/*
 		 * In case the wait queue is empty and the lock isn't owned
-		 * by a writer, this reader can exit the slowpath and return
-		 * immediately as its RWSEM_READER_BIAS has already been
-		 * set in the count.
+		 * by a writer or has the handoff bit set, this reader can
+		 * exit the slowpath and return immediately as its
+		 * RWSEM_READER_BIAS has already been set in the count.
 		 */
-		if (!(atomic_long_read(&sem->count) & RWSEM_WRITER_MASK)) {
+		if (!(atomic_long_read(&sem->count) &
+		     (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
 			raw_spin_unlock_irq(&sem->wait_lock);
 			rwsem_set_reader_owned(sem);
 			lockevent_inc(rwsem_rlock_fast);
@@ -609,8 +691,10 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
 	return sem;
 out_nolock:
 	list_del(&waiter.list);
-	if (list_empty(&sem->wait_list))
-		atomic_long_andnot(RWSEM_FLAG_WAITERS, &sem->count);
+	if (list_empty(&sem->wait_list)) {
+		atomic_long_andnot(RWSEM_FLAG_WAITERS|RWSEM_FLAG_HANDOFF,
+				   &sem->count);
+	}
 	raw_spin_unlock_irq(&sem->wait_lock);
 	__set_current_state(TASK_RUNNING);
 	lockevent_inc(rwsem_rlock_fail);
@@ -624,7 +708,7 @@ static struct rw_semaphore *
 rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 {
 	long count;
-	bool waiting = true; /* any queued threads before us */
+	enum writer_wait_state wstate;
 	struct rwsem_waiter waiter;
 	struct rw_semaphore *ret = sem;
 	DEFINE_WAKE_Q(wake_q);
@@ -639,66 +723,95 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 	 */
 	waiter.task = current;
 	waiter.type = RWSEM_WAITING_FOR_WRITE;
+	waiter.timeout = jiffies + RWSEM_WAIT_TIMEOUT;
 
 	raw_spin_lock_irq(&sem->wait_lock);
 
 	/* account for this before adding a new element to the list */
-	if (list_empty(&sem->wait_list))
-		waiting = false;
+	wstate = list_empty(&sem->wait_list) ? WRITER_FIRST : WRITER_NOT_FIRST;
 
 	list_add_tail(&waiter.list, &sem->wait_list);
 
 	/* we're now waiting on the lock */
-	if (waiting) {
+	if (wstate == WRITER_NOT_FIRST) {
 		count = atomic_long_read(&sem->count);
 
 		/*
-		 * If there were already threads queued before us and there are
-		 * no active writers and some readers, the lock must be read
-		 * owned; so we try to  any read locks that were queued ahead
-		 * of us.
+		 * If there were already threads queued before us and:
+		 *  1) there are no no active locks, wake the front
+		 *     queued process(es) as the handoff bit might be set.
+		 *  2) there are no active writers and some readers, the lock
+		 *     must be read owned; so we try to wake any read lock
+		 *     waiters that were queued ahead of us.
 		 */
-		if (!(count & RWSEM_WRITER_MASK) &&
-		     (count & RWSEM_READER_MASK)) {
-			rwsem_mark_wake(sem, RWSEM_WAKE_READERS, &wake_q);
-			/*
-			 * The wakeup is normally called _after_ the wait_lock
-			 * is released, but given that we are proactively waking
-			 * readers we can deal with the wake_q overhead as it is
-			 * similar to releasing and taking the wait_lock again
-			 * for attempting rwsem_try_write_lock().
-			 */
-			wake_up_q(&wake_q);
+		if (count & RWSEM_WRITER_MASK)
+			goto wait;
 
-			/*
-			 * Reinitialize wake_q after use.
-			 */
-			wake_q_init(&wake_q);
-		}
+		rwsem_mark_wake(sem, (count & RWSEM_READER_MASK)
+					? RWSEM_WAKE_READERS
+					: RWSEM_WAKE_ANY, &wake_q);
 
+		/*
+		 * The wakeup is normally called _after_ the wait_lock
+		 * is released, but given that we are proactively waking
+		 * readers we can deal with the wake_q overhead as it is
+		 * similar to releasing and taking the wait_lock again
+		 * for attempting rwsem_try_write_lock().
+		 */
+		wake_up_q(&wake_q);
+
+		/* We need wake_q again below, reinitialize */
+		wake_q_init(&wake_q);
 	} else {
 		count = atomic_long_add_return(RWSEM_FLAG_WAITERS, &sem->count);
 	}
 
+wait:
 	/* wait until we successfully acquire the lock */
 	set_current_state(state);
 	while (true) {
-		if (rwsem_try_write_lock(count, sem))
+		if (rwsem_try_write_lock(count, sem, wstate))
 			break;
+
 		raw_spin_unlock_irq(&sem->wait_lock);
 
 		/* Block until there are no active lockers. */
-		do {
+		for (;;) {
 			if (signal_pending_state(state, current))
 				goto out_nolock;
 
 			schedule();
 			lockevent_inc(rwsem_sleep_writer);
 			set_current_state(state);
+			/*
+			 * If HANDOFF bit is set, unconditionally do
+			 * a trylock.
+			 */
+			if (wstate == WRITER_HANDOFF)
+				break;
+
+			if ((wstate == WRITER_NOT_FIRST) &&
+			    (rwsem_first_waiter(sem) == &waiter))
+				wstate = WRITER_FIRST;
+
 			count = atomic_long_read(&sem->count);
-		} while (count & RWSEM_LOCK_MASK);
+			if (!(count & RWSEM_LOCK_MASK))
+				break;
+
+			/*
+			 * The setting of the handoff bit is deferred
+			 * until rwsem_try_write_lock() is called.
+			 */
+			if ((wstate == WRITER_FIRST) && (rt_task(current) ||
+			    time_after(jiffies, waiter.timeout))) {
+				wstate = WRITER_HANDOFF;
+				lockevent_inc(rwsem_wlock_handoff);
+				break;
+			}
+		}
 
 		raw_spin_lock_irq(&sem->wait_lock);
+		count = atomic_long_read(&sem->count);
 	}
 	__set_current_state(TASK_RUNNING);
 	list_del(&waiter.list);
@@ -711,6 +824,10 @@ out_nolock:
 	__set_current_state(TASK_RUNNING);
 	raw_spin_lock_irq(&sem->wait_lock);
 	list_del(&waiter.list);
+
+	if (unlikely(wstate == WRITER_HANDOFF))
+		atomic_long_add(-RWSEM_FLAG_HANDOFF,  &sem->count);
+
 	if (list_empty(&sem->wait_list))
 		atomic_long_andnot(RWSEM_FLAG_WAITERS, &sem->count);
 	else
@@ -726,7 +843,7 @@ out_nolock:
  * handle waking up a waiter on the semaphore
  * - up_read/up_write has decremented the active part of count if we come here
  */
-static struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
+static struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem, long count)
 {
 	unsigned long flags;
 	DEFINE_WAKE_Q(wake_q);
@@ -859,7 +976,7 @@ inline void __up_read(struct rw_semaphore *sem)
 	tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
 	if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS)) ==
 		      RWSEM_FLAG_WAITERS))
-		rwsem_wake(sem);
+		rwsem_wake(sem, tmp);
 }
 
 /*
@@ -873,7 +990,7 @@ static inline void __up_write(struct rw_semaphore *sem)
 	rwsem_clear_owner(sem);
 	tmp = atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED, &sem->count);
 	if (unlikely(tmp & RWSEM_FLAG_WAITERS))
-		rwsem_wake(sem);
+		rwsem_wake(sem, tmp);
 }
 
 /*

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [tip:locking/core] locking/rwsem: Always release wait_lock before waking up tasks
  2019-05-20 20:59 ` [PATCH v8 08/19] locking/rwsem: Always release wait_lock before waking up tasks Waiman Long
@ 2019-06-17 14:28   ` tip-bot for Waiman Long
  0 siblings, 0 replies; 83+ messages in thread
From: tip-bot for Waiman Long @ 2019-06-17 14:28 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, huang.ying.caritas, longman, peterz, will.deacon,
	mingo, hpa, tglx, tim.c.chen, dave, bp, torvalds

Commit-ID:  00f3c5a3df2c1e3dab14d0dd2b71f852d46be97f
Gitweb:     https://git.kernel.org/tip/00f3c5a3df2c1e3dab14d0dd2b71f852d46be97f
Author:     Waiman Long <longman@redhat.com>
AuthorDate: Mon, 20 May 2019 16:59:07 -0400
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 17 Jun 2019 12:28:00 +0200

locking/rwsem: Always release wait_lock before waking up tasks

With the use of wake_q, we can do task wakeups without holding the
wait_lock. There is one exception in the rwsem code, though. It is
when the writer in the slowpath detects that there are waiters ahead
but the rwsem is not held by a writer. This can lead to a long wait_lock
hold time, especially when a large number of readers are to be woken up.

Remediate this situation by releasing the wait_lock before waking
up tasks and re-acquiring it afterward. The rwsem_try_write_lock()
function is also modified to read the rwsem count directly to avoid
using a stale count value.
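
The shape of that change can be sketched in plain C (a simplified
userspace model; a pthread mutex stands in for the raw wait_lock
spinlock, and collect_readers()/wake_waiters() are invented stand-ins,
not kernel functions):

#include <pthread.h>
#include <stdio.h>

/* Minimal stand-ins for the real structures (illustrative only). */
struct wake_q { int nr; };			/* pretend list of tasks to wake */
struct xsem   { pthread_mutex_t wait_lock; };

static void collect_readers(struct xsem *s, struct wake_q *q)
{
	(void)s;
	q->nr = 3;	/* pretend 3 readers were queued ahead of us */
}

static void wake_waiters(struct wake_q *q)
{
	printf("waking %d tasks\n", q->nr);	/* potentially long work */
}

/*
 * Gather waiters under wait_lock, do the (potentially long) wakeup
 * loop with the lock dropped, then retake it before retrying the
 * write-lock attempt with a freshly read count.
 */
static void wake_ahead_then_retry(struct xsem *s)
{
	struct wake_q q = { 0 };

	pthread_mutex_lock(&s->wait_lock);
	collect_readers(s, &q);
	pthread_mutex_unlock(&s->wait_lock);

	wake_waiters(&q);			/* no wait_lock held here */

	pthread_mutex_lock(&s->wait_lock);
	/* ... re-read sem->count and attempt the write lock here ... */
	pthread_mutex_unlock(&s->wait_lock);
}

int main(void)
{
	struct xsem s = { .wait_lock = PTHREAD_MUTEX_INITIALIZER };

	wake_ahead_then_retry(&s);
	return 0;
}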

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-9-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched/wake_q.h |  5 +++++
 kernel/locking/rwsem.c       | 31 +++++++++++++++----------------
 2 files changed, 20 insertions(+), 16 deletions(-)

diff --git a/include/linux/sched/wake_q.h b/include/linux/sched/wake_q.h
index ad826d2a4557..26a2013ac39c 100644
--- a/include/linux/sched/wake_q.h
+++ b/include/linux/sched/wake_q.h
@@ -51,6 +51,11 @@ static inline void wake_q_init(struct wake_q_head *head)
 	head->lastp = &head->first;
 }
 
+static inline bool wake_q_empty(struct wake_q_head *head)
+{
+	return head->first == WAKE_Q_TAIL;
+}
+
 extern void wake_q_add(struct wake_q_head *head, struct task_struct *task);
 extern void wake_q_add_safe(struct wake_q_head *head, struct task_struct *task);
 extern void wake_up_q(struct wake_q_head *head);
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index decda9fb8c6d..5532304406f7 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -400,13 +400,14 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
  * If wstate is WRITER_HANDOFF, it will make sure that either the handoff
  * bit is set or the lock is acquired with handoff bit cleared.
  */
-static inline bool rwsem_try_write_lock(long count, struct rw_semaphore *sem,
+static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
 					enum writer_wait_state wstate)
 {
-	long new;
+	long count, new;
 
 	lockdep_assert_held(&sem->wait_lock);
 
+	count = atomic_long_read(&sem->count);
 	do {
 		bool has_handoff = !!(count & RWSEM_FLAG_HANDOFF);
 
@@ -751,26 +752,25 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 					? RWSEM_WAKE_READERS
 					: RWSEM_WAKE_ANY, &wake_q);
 
-		/*
-		 * The wakeup is normally called _after_ the wait_lock
-		 * is released, but given that we are proactively waking
-		 * readers we can deal with the wake_q overhead as it is
-		 * similar to releasing and taking the wait_lock again
-		 * for attempting rwsem_try_write_lock().
-		 */
-		wake_up_q(&wake_q);
-
-		/* We need wake_q again below, reinitialize */
-		wake_q_init(&wake_q);
+		if (!wake_q_empty(&wake_q)) {
+			/*
+			 * We want to minimize wait_lock hold time especially
+			 * when a large number of readers are to be woken up.
+			 */
+			raw_spin_unlock_irq(&sem->wait_lock);
+			wake_up_q(&wake_q);
+			wake_q_init(&wake_q);	/* Used again, reinit */
+			raw_spin_lock_irq(&sem->wait_lock);
+		}
 	} else {
-		count = atomic_long_add_return(RWSEM_FLAG_WAITERS, &sem->count);
+		atomic_long_or(RWSEM_FLAG_WAITERS, &sem->count);
 	}
 
 wait:
 	/* wait until we successfully acquire the lock */
 	set_current_state(state);
 	while (true) {
-		if (rwsem_try_write_lock(count, sem, wstate))
+		if (rwsem_try_write_lock(sem, wstate))
 			break;
 
 		raw_spin_unlock_irq(&sem->wait_lock);
@@ -811,7 +811,6 @@ wait:
 		}
 
 		raw_spin_lock_irq(&sem->wait_lock);
-		count = atomic_long_read(&sem->count);
 	}
 	__set_current_state(TASK_RUNNING);
 	list_del(&waiter.list);

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [tip:locking/core] locking/rwsem: More optimal RT task handling of null owner
  2019-05-20 20:59 ` [PATCH v8 09/19] locking/rwsem: More optimal RT task handling of null owner Waiman Long
@ 2019-06-17 14:29   ` tip-bot for Waiman Long
  0 siblings, 0 replies; 83+ messages in thread
From: tip-bot for Waiman Long @ 2019-06-17 14:29 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: peterz, longman, huang.ying.caritas, bp, tim.c.chen, hpa,
	linux-kernel, torvalds, will.deacon, dave, tglx, mingo

Commit-ID:  990fa7384a3057a3298bcf493651c6e14416c47c
Gitweb:     https://git.kernel.org/tip/990fa7384a3057a3298bcf493651c6e14416c47c
Author:     Waiman Long <longman@redhat.com>
AuthorDate: Mon, 20 May 2019 16:59:08 -0400
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 17 Jun 2019 12:28:01 +0200

locking/rwsem: More optimal RT task handling of null owner

An RT task can do optimistic spinning only if the lock holder is
actually running. If the state of the lock holder isn't known, there
is a possibility that the high priority of the RT task may block the
forward progress of the lock holder if it happens to reside on the
same CPU, which will lead to deadlock. So we have to make sure that an
RT task will not spin on a reader-owned rwsem.

When the owner is temporarily set to NULL, there are two cases
where we may want to continue spinning:

 1) The lock owner is in the process of releasing the lock, sem->owner
    is cleared but the lock has not been released yet.

 2) The lock was free and the owner cleared, but another task just
    came in and acquired the lock before we tried to get it. The new
    owner may be a spinnable writer.

So an RT task is now made to retry one more time to see if it can
acquire the lock or continue spinning on the new owning writer.
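
As a rough sketch of that retry decision (an invented userspace model,
not the kernel function), the RT task keeps spinning while the owner is
a writer, and gets exactly one extra look right after a writer owner
goes away:

#include <assert.h>
#include <stdbool.h>

enum owner_state { OWNER_NULL, OWNER_WRITER, OWNER_READER, OWNER_NONSPINNABLE };

/*
 * prev is the owner state seen on the previous iteration, cur the one
 * seen now.  The single "free pass" for an RT task is when the owner
 * was a writer a moment ago and is now NULL: the old writer may be
 * mid-release or a new writer mid-acquire, so one more look is taken.
 */
static bool rt_keep_spinning(enum owner_state prev, enum owner_state cur,
			     bool need_resched)
{
	if (cur == OWNER_WRITER)
		return true;		/* spin-on-owner path handles this */
	if (need_resched)
		return false;
	return prev == OWNER_WRITER;	/* retry exactly once after a writer */
}

int main(void)
{
	/* Owner just went from writer to NULL: retry once. */
	assert(rt_keep_spinning(OWNER_WRITER, OWNER_NULL, false));
	/* Still NULL on the next look: give up. */
	assert(!rt_keep_spinning(OWNER_NULL, OWNER_NULL, false));
	return 0;
}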

When testing on an 8-socket IvyBridge-EX system, the one additional retry
seems to improve the locking performance of RT write locking threads under
heavy contention. The table below shows the locking rates (in kops/s)
with various write locking threads before and after the patch.

    Locking threads     Pre-patch     Post-patch
    ---------------     ---------     -----------
            4             2,753          2,608
            8             2,529          2,520
           16             1,727          1,918
           32             1,263          1,956
           64               889          1,343

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-10-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/locking/rwsem.c | 51 +++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 44 insertions(+), 7 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 5532304406f7..e1840b7c5310 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -566,6 +566,7 @@ static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
 static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 {
 	bool taken = false;
+	int prev_owner_state = OWNER_NULL;
 
 	preempt_disable();
 
@@ -583,7 +584,12 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 	 *  2) readers own the lock as we can't determine if they are
 	 *     actively running or not.
 	 */
-	while (rwsem_spin_on_owner(sem) & OWNER_SPINNABLE) {
+	for (;;) {
+		enum owner_state owner_state = rwsem_spin_on_owner(sem);
+
+		if (!(owner_state & OWNER_SPINNABLE))
+			break;
+
 		/*
 		 * Try to acquire the lock
 		 */
@@ -593,13 +599,44 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 		}
 
 		/*
-		 * When there's no owner, we might have preempted between the
-		 * owner acquiring the lock and setting the owner field. If
-		 * we're an RT task that will live-lock because we won't let
-		 * the owner complete.
+		 * An RT task cannot do optimistic spinning if it cannot
+		 * be sure the lock holder is running or live-lock may
+		 * happen if the current task and the lock holder happen
+		 * to run in the same CPU. However, aborting optimistic
+		 * spinning while a NULL owner is detected may miss some
+		 * opportunity where spinning can continue without causing
+		 * problem.
+		 *
+		 * There are 2 possible cases where an RT task may be able
+		 * to continue spinning.
+		 *
+		 * 1) The lock owner is in the process of releasing the
+		 *    lock, sem->owner is cleared but the lock has not
+		 *    been released yet.
+		 * 2) The lock was free and owner cleared, but another
+		 *    task just comes in and acquire the lock before
+		 *    we try to get it. The new owner may be a spinnable
+		 *    writer.
+		 *
+		 * To take advantage of two scenarios listed agove, the RT
+		 * task is made to retry one more time to see if it can
+		 * acquire the lock or continue spinning on the new owning
+		 * writer. Of course, if the time lag is long enough or the
+		 * new owner is not a writer or spinnable, the RT task will
+		 * quit spinning.
+		 *
+		 * If the owner is a writer, the need_resched() check is
+		 * done inside rwsem_spin_on_owner(). If the owner is not
+		 * a writer, need_resched() check needs to be done here.
 		 */
-		if (!sem->owner && (need_resched() || rt_task(current)))
-			break;
+		if (owner_state != OWNER_WRITER) {
+			if (need_resched())
+				break;
+			if (rt_task(current) &&
+			   (prev_owner_state != OWNER_WRITER))
+				break;
+		}
+		prev_owner_state = owner_state;
 
 		/*
 		 * The cpu_relax() call is a compiler barrier which forces

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [tip:locking/core] locking/rwsem: Wake up almost all readers in wait queue
  2019-05-20 20:59 ` [PATCH v8 10/19] locking/rwsem: Wake up almost all readers in wait queue Waiman Long
@ 2019-06-17 14:29   ` tip-bot for Waiman Long
  0 siblings, 0 replies; 83+ messages in thread
From: tip-bot for Waiman Long @ 2019-06-17 14:29 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, will.deacon, dave, bp, tglx, peterz, longman,
	huang.ying.caritas, mingo, torvalds, tim.c.chen

Commit-ID:  d3681e269fff84048c94012342c3434b227c4706
Gitweb:     https://git.kernel.org/tip/d3681e269fff84048c94012342c3434b227c4706
Author:     Waiman Long <longman@redhat.com>
AuthorDate: Mon, 20 May 2019 16:59:09 -0400
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 17 Jun 2019 12:28:02 +0200

locking/rwsem: Wake up almost all readers in wait queue

When the front of the wait queue is a reader, other readers
immediately following the first reader will also be woken up at the
same time. However, if there is a writer in between, the readers
behind that writer will not be woken up.

Because of optimistic spinning, the lock acquisition order is not FIFO
anyway. The lock handoff mechanism will ensure that lock starvation
will not happen.

Assuming that the lock hold times of the other readers still in the
queue will be about the same as those of the readers being woken up,
there is really not much extra cost other than the additional latency
of having the waker wake up more tasks. Therefore, when the first
waiter is a reader, all the readers in the queue, up to a maximum of
256, are woken up to improve reader throughput. This is somewhat
similar in concept to a phase-fair R/W lock.
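
A stripped-down model of that wakeup pass might look like this (an
array stands in for the wait list; only the skip-writers-and-cap-at-256
selection is shown, not the bias accounting or the actual wakeups):

#include <stdio.h>

#define MAX_READERS_WAKEUP	256

enum wtype { READER, WRITER };

struct waiter { enum wtype type; };

/*
 * Walk the wait queue in order and pick up to MAX_READERS_WAKEUP
 * readers, skipping over any writers in between (they stay queued in
 * their original order).  Returns the number of readers selected.
 */
static int pick_readers(const struct waiter *q, int n, int *picked)
{
	int woken = 0;

	for (int i = 0; i < n && woken < MAX_READERS_WAKEUP; i++) {
		if (q[i].type == WRITER)
			continue;
		picked[woken++] = i;
	}
	return woken;
}

int main(void)
{
	struct waiter q[] = { {READER}, {WRITER}, {READER}, {READER} };
	int picked[MAX_READERS_WAKEUP];

	printf("woke %d readers\n", pick_readers(q, 4, picked));	/* 3 */
	return 0;
}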

With a locking microbenchmark running on a 5.1-based kernel, the total
locking rates (in kops/s) on an 8-socket IvyBridge-EX system with
equal numbers of readers and writers before and after this patch were
as follows:

   # of Threads  Pre-Patch   Post-patch
   ------------  ---------   ----------
        4          1,641        1,674
        8            731        1,062
       16            564          924
       32             78          300
       64             38          195
      240             50          149

There is no performance gain at low contention levels. At high
contention levels, however, this patch gives a pretty decent
performance boost.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-11-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/locking/rwsem.c | 31 ++++++++++++++++++++++++++-----
 1 file changed, 26 insertions(+), 5 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index e1840b7c5310..ded96023f4dc 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -254,6 +254,14 @@ enum writer_wait_state {
  */
 #define RWSEM_WAIT_TIMEOUT	DIV_ROUND_UP(HZ, 250)
 
+/*
+ * Magic number to batch-wakeup waiting readers, even when writers are
+ * also present in the queue. This both limits the amount of work the
+ * waking thread must do and also prevents any potential counter overflow,
+ * however unlikely.
+ */
+#define MAX_READERS_WAKEUP	0x100
+
 /*
  * handle the lock release when processes blocked on it that can now run
  * - if we come here from up_xxxx(), then the RWSEM_FLAG_WAITERS bit must
@@ -329,11 +337,17 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
 	}
 
 	/*
-	 * Grant an infinite number of read locks to the readers at the front
-	 * of the queue. We know that woken will be at least 1 as we accounted
+	 * Grant up to MAX_READERS_WAKEUP read locks to all the readers in the
+	 * queue. We know that the woken will be at least 1 as we accounted
 	 * for above. Note we increment the 'active part' of the count by the
 	 * number of readers before waking any processes up.
 	 *
+	 * This is an adaptation of the phase-fair R/W locks where at the
+	 * reader phase (first waiter is a reader), all readers are eligible
+	 * to acquire the lock at the same time irrespective of their order
+	 * in the queue. The writers acquire the lock according to their
+	 * order in the queue.
+	 *
 	 * We have to do wakeup in 2 passes to prevent the possibility that
 	 * the reader count may be decremented before it is incremented. It
 	 * is because the to-be-woken waiter may not have slept yet. So it
@@ -345,13 +359,20 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
 	 * 2) For each waiters in the new list, clear waiter->task and
 	 *    put them into wake_q to be woken up later.
 	 */
-	list_for_each_entry(waiter, &sem->wait_list, list) {
+	INIT_LIST_HEAD(&wlist);
+	list_for_each_entry_safe(waiter, tmp, &sem->wait_list, list) {
 		if (waiter->type == RWSEM_WAITING_FOR_WRITE)
-			break;
+			continue;
 
 		woken++;
+		list_move_tail(&waiter->list, &wlist);
+
+		/*
+		 * Limit # of readers that can be woken up per wakeup call.
+		 */
+		if (woken >= MAX_READERS_WAKEUP)
+			break;
 	}
-	list_cut_before(&wlist, &sem->wait_list, &waiter->list);
 
 	adjustment = woken * RWSEM_READER_BIAS - adjustment;
 	lockevent_cond_inc(rwsem_wake_reader, woken);

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [tip:locking/core] locking/rwsem: Clarify usage of owner's nonspinnable bit
  2019-05-20 20:59 ` [PATCH v8 11/19] locking/rwsem: Clarify usage of owner's nonspinnable bit Waiman Long
@ 2019-06-17 14:30   ` tip-bot for Waiman Long
  0 siblings, 0 replies; 83+ messages in thread
From: tip-bot for Waiman Long @ 2019-06-17 14:30 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: torvalds, mingo, bp, huang.ying.caritas, will.deacon, tglx,
	peterz, hpa, tim.c.chen, linux-kernel, longman, dave

Commit-ID:  02f1082b003a0cd48f48f12533d969cdbf1c2b63
Gitweb:     https://git.kernel.org/tip/02f1082b003a0cd48f48f12533d969cdbf1c2b63
Author:     Waiman Long <longman@redhat.com>
AuthorDate: Mon, 20 May 2019 16:59:10 -0400
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 17 Jun 2019 12:28:03 +0200

locking/rwsem: Clarify usage of owner's nonspinnable bit

Bit 1 of sem->owner (RWSEM_ANONYMOUSLY_OWNED) is used to designate an
anonymous owner - readers or an anonymous writer. Setting this
anonymous bit indicates that optimistic spinning cannot be done on
this rwsem.

With the upcoming reader optimistic spinning patches, a reader-owned
rwsem can be spun on for a limited period of time. We still need this
bit to indicate that a rwsem is nonspinnable, but a clear bit no longer
implies that the owner is known. So rename the bit to RWSEM_NONSPINNABLE
to clarify its meaning.
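
In terms of the owner word's low bits, the spinnability test reduces to
something like the following sketch (the bit values mirror the
description above; the fake task pointer values are purely
illustrative):

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RWSEM_READER_OWNED	(1UL << 0)	/* owned by readers */
#define RWSEM_NONSPINNABLE	(1UL << 1)	/* waiters must not spin */

/* A waiter may spin on the owner only when the nonspinnable bit is clear. */
static bool owner_spinnable(uintptr_t owner_word)
{
	return !(owner_word & RWSEM_NONSPINNABLE);
}

int main(void)
{
	uintptr_t writer_owned = 0x1000;	/* fake task pointer, no flags */
	uintptr_t reader_owned = 0x2000 | RWSEM_READER_OWNED | RWSEM_NONSPINNABLE;

	assert(owner_spinnable(writer_owned));
	assert(!owner_spinnable(reader_owned));
	return 0;
}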

This patch also fixes a DEBUG_RWSEMS_WARN_ON() bug in __up_write().

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-12-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/rwsem.h  |  2 +-
 kernel/locking/rwsem.c | 43 +++++++++++++++++++++----------------------
 2 files changed, 22 insertions(+), 23 deletions(-)

diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
index 148983e21d47..bb76e82398b2 100644
--- a/include/linux/rwsem.h
+++ b/include/linux/rwsem.h
@@ -50,7 +50,7 @@ struct rw_semaphore {
 };
 
 /*
- * Setting bit 1 of the owner field but not bit 0 will indicate
+ * Setting all bits of the owner field except bit 0 will indicate
  * that the rwsem is writer-owned with an unknown owner.
  */
 #define RWSEM_OWNER_UNKNOWN	((struct task_struct *)-2L)
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index ded96023f4dc..180455b6b0d4 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -33,17 +33,18 @@
 /*
  * The least significant 2 bits of the owner value has the following
  * meanings when set.
- *  - RWSEM_READER_OWNED (bit 0): The rwsem is owned by readers
- *  - RWSEM_ANONYMOUSLY_OWNED (bit 1): The rwsem is anonymously owned,
- *    i.e. the owner(s) cannot be readily determined. It can be reader
- *    owned or the owning writer is indeterminate.
+ *  - Bit 0: RWSEM_READER_OWNED - The rwsem is owned by readers
+ *  - Bit 1: RWSEM_NONSPINNABLE - Waiters cannot spin on the rwsem
+ *    The rwsem is anonymously owned, i.e. the owner(s) cannot be
+ *    readily determined. It can be reader owned or the owning writer
+ *    is indeterminate.
  *
  * When a writer acquires a rwsem, it puts its task_struct pointer
  * into the owner field. It is cleared after an unlock.
  *
  * When a reader acquires a rwsem, it will also puts its task_struct
  * pointer into the owner field with both the RWSEM_READER_OWNED and
- * RWSEM_ANONYMOUSLY_OWNED bits set. On unlock, the owner field will
+ * RWSEM_NONSPINNABLE bits set. On unlock, the owner field will
  * largely be left untouched. So for a free or reader-owned rwsem,
  * the owner value may contain information about the last reader that
  * acquires the rwsem. The anonymous bit is set because that particular
@@ -55,7 +56,8 @@
  * a rwsem, but the overhead is simply too big.
  */
 #define RWSEM_READER_OWNED	(1UL << 0)
-#define RWSEM_ANONYMOUSLY_OWNED	(1UL << 1)
+#define RWSEM_NONSPINNABLE	(1UL << 1)
+#define RWSEM_OWNER_FLAGS_MASK	(RWSEM_READER_OWNED | RWSEM_NONSPINNABLE)
 
 #ifdef CONFIG_DEBUG_RWSEMS
 # define DEBUG_RWSEMS_WARN_ON(c, sem)	do {			\
@@ -132,7 +134,7 @@ static inline void __rwsem_set_reader_owned(struct rw_semaphore *sem,
 					    struct task_struct *owner)
 {
 	unsigned long val = (unsigned long)owner | RWSEM_READER_OWNED
-						 | RWSEM_ANONYMOUSLY_OWNED;
+						 | RWSEM_NONSPINNABLE;
 
 	WRITE_ONCE(sem->owner, (struct task_struct *)val);
 }
@@ -144,20 +146,12 @@ static inline void rwsem_set_reader_owned(struct rw_semaphore *sem)
 
 /*
  * Return true if the a rwsem waiter can spin on the rwsem's owner
- * and steal the lock, i.e. the lock is not anonymously owned.
+ * and steal the lock.
  * N.B. !owner is considered spinnable.
  */
 static inline bool is_rwsem_owner_spinnable(struct task_struct *owner)
 {
-	return !((unsigned long)owner & RWSEM_ANONYMOUSLY_OWNED);
-}
-
-/*
- * Return true if rwsem is owned by an anonymous writer or readers.
- */
-static inline bool rwsem_has_anonymous_owner(struct task_struct *owner)
-{
-	return (unsigned long)owner & RWSEM_ANONYMOUSLY_OWNED;
+	return !((unsigned long)owner & RWSEM_NONSPINNABLE);
 }
 
 #ifdef CONFIG_DEBUG_RWSEMS
@@ -170,10 +164,10 @@ static inline bool rwsem_has_anonymous_owner(struct task_struct *owner)
 static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
 {
 	unsigned long val = (unsigned long)current | RWSEM_READER_OWNED
-						   | RWSEM_ANONYMOUSLY_OWNED;
+						   | RWSEM_NONSPINNABLE;
 	if (READ_ONCE(sem->owner) == (struct task_struct *)val)
 		cmpxchg_relaxed((unsigned long *)&sem->owner, val,
-				RWSEM_READER_OWNED | RWSEM_ANONYMOUSLY_OWNED);
+				RWSEM_READER_OWNED | RWSEM_NONSPINNABLE);
 }
 #else
 static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
@@ -495,7 +489,7 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
 	struct task_struct *owner;
 	bool ret = true;
 
-	BUILD_BUG_ON(!rwsem_has_anonymous_owner(RWSEM_OWNER_UNKNOWN));
+	BUILD_BUG_ON(is_rwsem_owner_spinnable(RWSEM_OWNER_UNKNOWN));
 
 	if (need_resched())
 		return false;
@@ -534,7 +528,7 @@ static inline enum owner_state rwsem_owner_state(unsigned long owner)
 	if (!owner)
 		return OWNER_NULL;
 
-	if (owner & RWSEM_ANONYMOUSLY_OWNED)
+	if (owner & RWSEM_NONSPINNABLE)
 		return OWNER_NONSPINNABLE;
 
 	if (owner & RWSEM_READER_OWNED)
@@ -1043,7 +1037,12 @@ static inline void __up_write(struct rw_semaphore *sem)
 {
 	long tmp;
 
-	DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
+	/*
+	 * sem->owner may differ from current if the ownership is transferred
+	 * to an anonymous writer by setting the RWSEM_NONSPINNABLE bits.
+	 */
+	DEBUG_RWSEMS_WARN_ON((sem->owner != current) &&
+			    !((long)sem->owner & RWSEM_NONSPINNABLE), sem);
 	rwsem_clear_owner(sem);
 	tmp = atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED, &sem->count);
 	if (unlikely(tmp & RWSEM_FLAG_WAITERS))

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [tip:locking/core] locking/rwsem: Enable readers spinning on writer
  2019-05-20 20:59 ` [PATCH v8 12/19] locking/rwsem: Enable readers spinning on writer Waiman Long
@ 2019-06-17 14:31   ` tip-bot for Waiman Long
  0 siblings, 0 replies; 83+ messages in thread
From: tip-bot for Waiman Long @ 2019-06-17 14:31 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tglx, linux-kernel, hpa, dave, peterz, will.deacon, longman,
	torvalds, mingo, huang.ying.caritas, bp, tim.c.chen

Commit-ID:  cf69482d62d996d3ce840eeead8e160de281ac6c
Gitweb:     https://git.kernel.org/tip/cf69482d62d996d3ce840eeead8e160de281ac6c
Author:     Waiman Long <longman@redhat.com>
AuthorDate: Mon, 20 May 2019 16:59:11 -0400
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 17 Jun 2019 12:28:05 +0200

locking/rwsem: Enable readers spinning on writer

This patch enables readers to optimistically spin on a
rwsem when it is owned by a writer instead of going to sleep
directly.  The rwsem_can_spin_on_owner() function is extracted
out of rwsem_optimistic_spin() and is called directly by
rwsem_down_read_slowpath() and rwsem_down_write_slowpath().
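
The reader's optimistic attempt boils down to the pattern sketched
below (a userspace C11 model; the bit positions follow the count layout
described earlier in the series, but this is not the kernel code): add
the reader bias, then back it out if a writer or a pending handoff is
seen.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define WRITER_LOCKED	(1UL << 0)
#define FLAG_HANDOFF	(1UL << 2)
#define READER_BIAS	(1UL << 8)

/*
 * Optimistically add the reader bias; succeed only if neither a writer
 * nor a pending handoff was present, otherwise undo the bias and fail
 * (the caller would keep spinning or fall back to queueing).
 */
static bool try_read_lock(atomic_ulong *count)
{
	unsigned long old = atomic_fetch_add(count, READER_BIAS);

	if (!(old & (WRITER_LOCKED | FLAG_HANDOFF)))
		return true;

	atomic_fetch_sub(count, READER_BIAS);	/* back out the change */
	return false;
}

int main(void)
{
	atomic_ulong count = 0;

	printf("free rwsem:   %d\n", try_read_lock(&count));	/* 1 */
	atomic_fetch_or(&count, WRITER_LOCKED);
	printf("writer-owned: %d\n", try_read_lock(&count));	/* 0 */
	return 0;
}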

With a locking microbenchmark running on a 5.1-based kernel, the total
locking rates (in kops/s) on an 8-socket IvyBridge-EX system with equal
numbers of readers and writers before and after the patch were as
follows:

   # of Threads  Pre-patch    Post-patch
   ------------  ---------    ----------
        4          1,674        1,684
        8          1,062        1,074
       16            924          900
       32            300          458
       64            195          208
      128            164          168
      240            149          143

The performance change wasn't significant in this case, but this change
is required by a follow-on patch.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-13-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/locking/lock_events_list.h |  1 +
 kernel/locking/rwsem.c            | 86 +++++++++++++++++++++++++++++++++------
 2 files changed, 75 insertions(+), 12 deletions(-)

diff --git a/kernel/locking/lock_events_list.h b/kernel/locking/lock_events_list.h
index 634b47fd8b5e..ca954e4e00e4 100644
--- a/kernel/locking/lock_events_list.h
+++ b/kernel/locking/lock_events_list.h
@@ -56,6 +56,7 @@ LOCK_EVENT(rwsem_sleep_reader)	/* # of reader sleeps			*/
 LOCK_EVENT(rwsem_sleep_writer)	/* # of writer sleeps			*/
 LOCK_EVENT(rwsem_wake_reader)	/* # of reader wakeups			*/
 LOCK_EVENT(rwsem_wake_writer)	/* # of writer wakeups			*/
+LOCK_EVENT(rwsem_opt_rlock)	/* # of read locks opt-spin acquired	*/
 LOCK_EVENT(rwsem_opt_wlock)	/* # of write locks opt-spin acquired	*/
 LOCK_EVENT(rwsem_opt_fail)	/* # of failed opt-spinnings		*/
 LOCK_EVENT(rwsem_rlock)		/* # of read locks acquired		*/
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 180455b6b0d4..985a03ad3f8c 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -457,6 +457,30 @@ static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
 }
 
 #ifdef CONFIG_RWSEM_SPIN_ON_OWNER
+/*
+ * Try to acquire read lock before the reader is put on wait queue.
+ * Lock acquisition isn't allowed if the rwsem is locked or a writer handoff
+ * is ongoing.
+ */
+static inline bool rwsem_try_read_lock_unqueued(struct rw_semaphore *sem)
+{
+	long count = atomic_long_read(&sem->count);
+
+	if (count & (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))
+		return false;
+
+	count = atomic_long_fetch_add_acquire(RWSEM_READER_BIAS, &sem->count);
+	if (!(count & (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
+		rwsem_set_reader_owned(sem);
+		lockevent_inc(rwsem_opt_rlock);
+		return true;
+	}
+
+	/* Back out the change */
+	atomic_long_add(-RWSEM_READER_BIAS, &sem->count);
+	return false;
+}
+
 /*
  * Try to acquire write lock before the writer has been put on wait queue.
  */
@@ -491,9 +515,12 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
 
 	BUILD_BUG_ON(is_rwsem_owner_spinnable(RWSEM_OWNER_UNKNOWN));
 
-	if (need_resched())
+	if (need_resched()) {
+		lockevent_inc(rwsem_opt_fail);
 		return false;
+	}
 
+	preempt_disable();
 	rcu_read_lock();
 	owner = READ_ONCE(sem->owner);
 	if (owner) {
@@ -501,6 +528,9 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
 		      owner_on_cpu(owner);
 	}
 	rcu_read_unlock();
+	preempt_enable();
+
+	lockevent_cond_inc(rwsem_opt_fail, !ret);
 	return ret;
 }
 
@@ -578,7 +608,7 @@ static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
 	return state;
 }
 
-static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
+static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
 {
 	bool taken = false;
 	int prev_owner_state = OWNER_NULL;
@@ -586,9 +616,6 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 	preempt_disable();
 
 	/* sem->wait_lock should not be held when doing optimistic spinning */
-	if (!rwsem_can_spin_on_owner(sem))
-		goto done;
-
 	if (!osq_lock(&sem->osq))
 		goto done;
 
@@ -608,10 +635,11 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 		/*
 		 * Try to acquire the lock
 		 */
-		if (rwsem_try_write_lock_unqueued(sem)) {
-			taken = true;
+		taken = wlock ? rwsem_try_write_lock_unqueued(sem)
+			      : rwsem_try_read_lock_unqueued(sem);
+
+		if (taken)
 			break;
-		}
 
 		/*
 		 * An RT task cannot do optimistic spinning if it cannot
@@ -668,7 +696,12 @@ done:
 	return taken;
 }
 #else
-static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
+static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
+{
+	return false;
+}
+
+static inline bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
 {
 	return false;
 }
@@ -684,6 +717,31 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
 	struct rwsem_waiter waiter;
 	DEFINE_WAKE_Q(wake_q);
 
+	if (!rwsem_can_spin_on_owner(sem))
+		goto queue;
+
+	/*
+	 * Undo read bias from down_read() and do optimistic spinning.
+	 */
+	atomic_long_add(-RWSEM_READER_BIAS, &sem->count);
+	adjustment = 0;
+	if (rwsem_optimistic_spin(sem, false)) {
+		/*
+		 * Wake up other readers in the wait list if the front
+		 * waiter is a reader.
+		 */
+		if ((atomic_long_read(&sem->count) & RWSEM_FLAG_WAITERS)) {
+			raw_spin_lock_irq(&sem->wait_lock);
+			if (!list_empty(&sem->wait_list))
+				rwsem_mark_wake(sem, RWSEM_WAKE_READ_OWNED,
+						&wake_q);
+			raw_spin_unlock_irq(&sem->wait_lock);
+			wake_up_q(&wake_q);
+		}
+		return sem;
+	}
+
+queue:
 	waiter.task = current;
 	waiter.type = RWSEM_WAITING_FOR_READ;
 	waiter.timeout = jiffies + RWSEM_WAIT_TIMEOUT;
@@ -696,7 +754,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
 		 * exit the slowpath and return immediately as its
 		 * RWSEM_READER_BIAS has already been set in the count.
 		 */
-		if (!(atomic_long_read(&sem->count) &
+		if (adjustment && !(atomic_long_read(&sem->count) &
 		     (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
 			raw_spin_unlock_irq(&sem->wait_lock);
 			rwsem_set_reader_owned(sem);
@@ -708,7 +766,10 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
 	list_add_tail(&waiter.list, &sem->wait_list);
 
 	/* we're now waiting on the lock, but no longer actively locking */
-	count = atomic_long_add_return(adjustment, &sem->count);
+	if (adjustment)
+		count = atomic_long_add_return(adjustment, &sem->count);
+	else
+		count = atomic_long_read(&sem->count);
 
 	/*
 	 * If there are no active locks, wake the front queued process(es).
@@ -767,7 +828,8 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 	DEFINE_WAKE_Q(wake_q);
 
 	/* do optimistic spinning and steal lock if possible */
-	if (rwsem_optimistic_spin(sem))
+	if (rwsem_can_spin_on_owner(sem) &&
+	    rwsem_optimistic_spin(sem, true))
 		return sem;
 
 	/*

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [tip:locking/core] locking/rwsem: Make rwsem->owner an atomic_long_t
  2019-05-20 20:59 ` [PATCH v8 13/19] locking/rwsem: Make rwsem->owner an atomic_long_t Waiman Long
  2019-06-04  8:52   ` Peter Zijlstra
@ 2019-06-17 14:32   ` tip-bot for Waiman Long
  2019-07-19 18:45   ` [PATCH v8 13/19] " Luis Henriques
  2 siblings, 0 replies; 83+ messages in thread
From: tip-bot for Waiman Long @ 2019-06-17 14:32 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: torvalds, huang.ying.caritas, longman, tglx, linux-kernel, hpa,
	dave, tim.c.chen, mingo, will.deacon, bp, peterz

Commit-ID:  94a9717b3c40e77a54e4afacd8f19a9a86bfeead
Gitweb:     https://git.kernel.org/tip/94a9717b3c40e77a54e4afacd8f19a9a86bfeead
Author:     Waiman Long <longman@redhat.com>
AuthorDate: Mon, 20 May 2019 16:59:12 -0400
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 17 Jun 2019 12:28:06 +0200

locking/rwsem: Make rwsem->owner an atomic_long_t

The rwsem->owner field contains not just the task structure pointer but
also some flags describing the current state of the rwsem. Some of these
flags may have to be updated atomically. To reflect the new reality, the
owner is now changed to an atomic_long_t type.

New helper functions are added to properly separate out the task
structure pointer and the embedded flags.
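
The split between pointer and flags can be pictured with a small C11
sketch (atomic_uintptr_t stands in for the atomic_long_t owner; the
flag values and the fake pointer are illustrative only):

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define READER_OWNED		(1UL << 0)
#define NONSPINNABLE		(1UL << 1)
#define OWNER_FLAGS_MASK	(READER_OWNED | NONSPINNABLE)

struct task;	/* opaque stand-in for task_struct */

/* Read the owner word once, then split it into task pointer and flags. */
static struct task *owner_and_flags(atomic_uintptr_t *owner, uintptr_t *flags)
{
	uintptr_t word = atomic_load(owner);

	*flags = word & OWNER_FLAGS_MASK;
	return (struct task *)(word & ~OWNER_FLAGS_MASK);
}

int main(void)
{
	atomic_uintptr_t owner = 0;
	uintptr_t flags;
	struct task *t;

	atomic_store(&owner, (uintptr_t)0xdeadbe00 | READER_OWNED | NONSPINNABLE);
	t = owner_and_flags(&owner, &flags);
	printf("task=%p flags=%#lx\n", (void *)t, (unsigned long)flags);
	return 0;
}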

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-14-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/percpu-rwsem.h |   4 +-
 include/linux/rwsem.h        |  11 ++--
 kernel/locking/rwsem.c       | 125 +++++++++++++++++++++++++++----------------
 3 files changed, 88 insertions(+), 52 deletions(-)

diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
index 03cb4b6f842e..0a43830f1932 100644
--- a/include/linux/percpu-rwsem.h
+++ b/include/linux/percpu-rwsem.h
@@ -117,7 +117,7 @@ static inline void percpu_rwsem_release(struct percpu_rw_semaphore *sem,
 	lock_release(&sem->rw_sem.dep_map, 1, ip);
 #ifdef CONFIG_RWSEM_SPIN_ON_OWNER
 	if (!read)
-		sem->rw_sem.owner = RWSEM_OWNER_UNKNOWN;
+		atomic_long_set(&sem->rw_sem.owner, RWSEM_OWNER_UNKNOWN);
 #endif
 }
 
@@ -127,7 +127,7 @@ static inline void percpu_rwsem_acquire(struct percpu_rw_semaphore *sem,
 	lock_acquire(&sem->rw_sem.dep_map, 0, 1, read, 1, NULL, ip);
 #ifdef CONFIG_RWSEM_SPIN_ON_OWNER
 	if (!read)
-		sem->rw_sem.owner = current;
+		atomic_long_set(&sem->rw_sem.owner, (long)current);
 #endif
 }
 
diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
index bb76e82398b2..e401358c4e7e 100644
--- a/include/linux/rwsem.h
+++ b/include/linux/rwsem.h
@@ -35,10 +35,11 @@
 struct rw_semaphore {
 	atomic_long_t count;
 	/*
-	 * Write owner or one of the read owners. Can be used as a
-	 * speculative check to see if the owner is running on the cpu.
+	 * Write owner or one of the read owners as well flags regarding
+	 * the current state of the rwsem. Can be used as a speculative
+	 * check to see if the write owner is running on the cpu.
 	 */
-	struct task_struct *owner;
+	atomic_long_t owner;
 #ifdef CONFIG_RWSEM_SPIN_ON_OWNER
 	struct optimistic_spin_queue osq; /* spinner MCS lock */
 #endif
@@ -53,7 +54,7 @@ struct rw_semaphore {
  * Setting all bits of the owner field except bit 0 will indicate
  * that the rwsem is writer-owned with an unknown owner.
  */
-#define RWSEM_OWNER_UNKNOWN	((struct task_struct *)-2L)
+#define RWSEM_OWNER_UNKNOWN	(-2L)
 
 /* In all implementations count != 0 means locked */
 static inline int rwsem_is_locked(struct rw_semaphore *sem)
@@ -80,7 +81,7 @@ static inline int rwsem_is_locked(struct rw_semaphore *sem)
 
 #define __RWSEM_INITIALIZER(name)				\
 	{ __RWSEM_INIT_COUNT(name),				\
-	  .owner = NULL,					\
+	  .owner = ATOMIC_LONG_INIT(0),				\
 	  .wait_list = LIST_HEAD_INIT((name).wait_list),	\
 	  .wait_lock = __RAW_SPIN_LOCK_UNLOCKED(name.wait_lock)	\
 	  __RWSEM_OPT_INIT(name)				\
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 985a03ad3f8c..fae557be8334 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -64,7 +64,7 @@
 	if (!debug_locks_silent &&				\
 	    WARN_ONCE(c, "DEBUG_RWSEMS_WARN_ON(%s): count = 0x%lx, owner = 0x%lx, curr 0x%lx, list %sempty\n",\
 		#c, atomic_long_read(&(sem)->count),		\
-		(long)((sem)->owner), (long)current,		\
+		atomic_long_read(&(sem)->owner), (long)current,	\
 		list_empty(&(sem)->wait_list) ? "" : "not "))	\
 			debug_locks_off();			\
 	} while (0)
@@ -114,12 +114,20 @@
  */
 static inline void rwsem_set_owner(struct rw_semaphore *sem)
 {
-	WRITE_ONCE(sem->owner, current);
+	atomic_long_set(&sem->owner, (long)current);
 }
 
 static inline void rwsem_clear_owner(struct rw_semaphore *sem)
 {
-	WRITE_ONCE(sem->owner, NULL);
+	atomic_long_set(&sem->owner, 0);
+}
+
+/*
+ * Test the flags in the owner field.
+ */
+static inline bool rwsem_test_oflags(struct rw_semaphore *sem, long flags)
+{
+	return atomic_long_read(&sem->owner) & flags;
 }
 
 /*
@@ -133,10 +141,9 @@ static inline void rwsem_clear_owner(struct rw_semaphore *sem)
 static inline void __rwsem_set_reader_owned(struct rw_semaphore *sem,
 					    struct task_struct *owner)
 {
-	unsigned long val = (unsigned long)owner | RWSEM_READER_OWNED
-						 | RWSEM_NONSPINNABLE;
+	unsigned long val = (unsigned long)owner | RWSEM_READER_OWNED | RWSEM_NONSPINNABLE;
 
-	WRITE_ONCE(sem->owner, (struct task_struct *)val);
+	atomic_long_set(&sem->owner, val);
 }
 
 static inline void rwsem_set_reader_owned(struct rw_semaphore *sem)
@@ -145,13 +152,20 @@ static inline void rwsem_set_reader_owned(struct rw_semaphore *sem)
 }
 
 /*
- * Return true if the a rwsem waiter can spin on the rwsem's owner
- * and steal the lock.
- * N.B. !owner is considered spinnable.
+ * Return true if the rwsem is owned by a reader.
  */
-static inline bool is_rwsem_owner_spinnable(struct task_struct *owner)
+static inline bool is_rwsem_reader_owned(struct rw_semaphore *sem)
 {
-	return !((unsigned long)owner & RWSEM_NONSPINNABLE);
+#ifdef CONFIG_DEBUG_RWSEMS
+	/*
+	 * Check the count to see if it is write-locked.
+	 */
+	long count = atomic_long_read(&sem->count);
+
+	if (count & RWSEM_WRITER_MASK)
+		return false;
+#endif
+	return rwsem_test_oflags(sem, RWSEM_READER_OWNED);
 }
 
 #ifdef CONFIG_DEBUG_RWSEMS
@@ -163,11 +177,13 @@ static inline bool is_rwsem_owner_spinnable(struct task_struct *owner)
  */
 static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
 {
-	unsigned long val = (unsigned long)current | RWSEM_READER_OWNED
-						   | RWSEM_NONSPINNABLE;
-	if (READ_ONCE(sem->owner) == (struct task_struct *)val)
-		cmpxchg_relaxed((unsigned long *)&sem->owner, val,
-				RWSEM_READER_OWNED | RWSEM_NONSPINNABLE);
+	unsigned long val = atomic_long_read(&sem->owner);
+
+	while ((val & ~RWSEM_OWNER_FLAGS_MASK) == (unsigned long)current) {
+		if (atomic_long_try_cmpxchg(&sem->owner, &val,
+					    val & RWSEM_OWNER_FLAGS_MASK))
+			return;
+	}
 }
 #else
 static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
@@ -175,6 +191,28 @@ static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
 }
 #endif
 
+/*
+ * Return just the real task structure pointer of the owner
+ */
+static inline struct task_struct *rwsem_owner(struct rw_semaphore *sem)
+{
+	return (struct task_struct *)
+		(atomic_long_read(&sem->owner) & ~RWSEM_OWNER_FLAGS_MASK);
+}
+
+/*
+ * Return the real task structure pointer of the owner and the embedded
+ * flags in the owner. pflags must be non-NULL.
+ */
+static inline struct task_struct *
+rwsem_owner_flags(struct rw_semaphore *sem, unsigned long *pflags)
+{
+	unsigned long owner = atomic_long_read(&sem->owner);
+
+	*pflags = owner & RWSEM_OWNER_FLAGS_MASK;
+	return (struct task_struct *)(owner & ~RWSEM_OWNER_FLAGS_MASK);
+}
+
 /*
  * Guide to the rw_semaphore's count field.
  *
@@ -208,7 +246,7 @@ void __init_rwsem(struct rw_semaphore *sem, const char *name,
 	atomic_long_set(&sem->count, RWSEM_UNLOCKED_VALUE);
 	raw_spin_lock_init(&sem->wait_lock);
 	INIT_LIST_HEAD(&sem->wait_list);
-	sem->owner = NULL;
+	atomic_long_set(&sem->owner, 0L);
 #ifdef CONFIG_RWSEM_SPIN_ON_OWNER
 	osq_lock_init(&sem->osq);
 #endif
@@ -511,9 +549,10 @@ static inline bool owner_on_cpu(struct task_struct *owner)
 static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
 {
 	struct task_struct *owner;
+	unsigned long flags;
 	bool ret = true;
 
-	BUILD_BUG_ON(is_rwsem_owner_spinnable(RWSEM_OWNER_UNKNOWN));
+	BUILD_BUG_ON(!(RWSEM_OWNER_UNKNOWN & RWSEM_NONSPINNABLE));
 
 	if (need_resched()) {
 		lockevent_inc(rwsem_opt_fail);
@@ -522,11 +561,9 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
 
 	preempt_disable();
 	rcu_read_lock();
-	owner = READ_ONCE(sem->owner);
-	if (owner) {
-		ret = is_rwsem_owner_spinnable(owner) &&
-		      owner_on_cpu(owner);
-	}
+	owner = rwsem_owner_flags(sem, &flags);
+	if ((flags & RWSEM_NONSPINNABLE) || (owner && !owner_on_cpu(owner)))
+		ret = false;
 	rcu_read_unlock();
 	preempt_enable();
 
@@ -553,25 +590,26 @@ enum owner_state {
 };
 #define OWNER_SPINNABLE		(OWNER_NULL | OWNER_WRITER)
 
-static inline enum owner_state rwsem_owner_state(unsigned long owner)
+static inline enum owner_state
+rwsem_owner_state(struct task_struct *owner, unsigned long flags)
 {
-	if (!owner)
-		return OWNER_NULL;
-
-	if (owner & RWSEM_NONSPINNABLE)
+	if (flags & RWSEM_NONSPINNABLE)
 		return OWNER_NONSPINNABLE;
 
-	if (owner & RWSEM_READER_OWNED)
+	if (flags & RWSEM_READER_OWNED)
 		return OWNER_READER;
 
-	return OWNER_WRITER;
+	return owner ? OWNER_WRITER : OWNER_NULL;
 }
 
 static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
 {
-	struct task_struct *tmp, *owner = READ_ONCE(sem->owner);
-	enum owner_state state = rwsem_owner_state((unsigned long)owner);
+	struct task_struct *new, *owner;
+	unsigned long flags, new_flags;
+	enum owner_state state;
 
+	owner = rwsem_owner_flags(sem, &flags);
+	state = rwsem_owner_state(owner, flags);
 	if (state != OWNER_WRITER)
 		return state;
 
@@ -582,9 +620,9 @@ static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
 			break;
 		}
 
-		tmp = READ_ONCE(sem->owner);
-		if (tmp != owner) {
-			state = rwsem_owner_state((unsigned long)tmp);
+		new = rwsem_owner_flags(sem, &new_flags);
+		if ((new != owner) || (new_flags != flags)) {
+			state = rwsem_owner_state(new, new_flags);
 			break;
 		}
 
@@ -1001,8 +1039,7 @@ inline void __down_read(struct rw_semaphore *sem)
 	if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
 			&sem->count) & RWSEM_READ_FAILED_MASK)) {
 		rwsem_down_read_slowpath(sem, TASK_UNINTERRUPTIBLE);
-		DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
-					RWSEM_READER_OWNED), sem);
+		DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
 	} else {
 		rwsem_set_reader_owned(sem);
 	}
@@ -1014,8 +1051,7 @@ static inline int __down_read_killable(struct rw_semaphore *sem)
 			&sem->count) & RWSEM_READ_FAILED_MASK)) {
 		if (IS_ERR(rwsem_down_read_slowpath(sem, TASK_KILLABLE)))
 			return -EINTR;
-		DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
-					RWSEM_READER_OWNED), sem);
+		DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
 	} else {
 		rwsem_set_reader_owned(sem);
 	}
@@ -1084,7 +1120,7 @@ inline void __up_read(struct rw_semaphore *sem)
 {
 	long tmp;
 
-	DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner & RWSEM_READER_OWNED), sem);
+	DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
 	rwsem_clear_reader_owned(sem);
 	tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
 	if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS)) ==
@@ -1103,8 +1139,8 @@ static inline void __up_write(struct rw_semaphore *sem)
 	 * sem->owner may differ from current if the ownership is transferred
 	 * to an anonymous writer by setting the RWSEM_NONSPINNABLE bits.
 	 */
-	DEBUG_RWSEMS_WARN_ON((sem->owner != current) &&
-			    !((long)sem->owner & RWSEM_NONSPINNABLE), sem);
+	DEBUG_RWSEMS_WARN_ON((rwsem_owner(sem) != current) &&
+			    !rwsem_test_oflags(sem, RWSEM_NONSPINNABLE), sem);
 	rwsem_clear_owner(sem);
 	tmp = atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED, &sem->count);
 	if (unlikely(tmp & RWSEM_FLAG_WAITERS))
@@ -1125,7 +1161,7 @@ static inline void __downgrade_write(struct rw_semaphore *sem)
 	 * read-locked region is ok to be re-ordered into the
 	 * write side. As such, rely on RELEASE semantics.
 	 */
-	DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
+	DEBUG_RWSEMS_WARN_ON(rwsem_owner(sem) != current, sem);
 	tmp = atomic_long_fetch_add_release(
 		-RWSEM_WRITER_LOCKED+RWSEM_READER_BIAS, &sem->count);
 	rwsem_set_reader_owned(sem);
@@ -1296,8 +1332,7 @@ EXPORT_SYMBOL(down_write_killable_nested);
 
 void up_read_non_owner(struct rw_semaphore *sem)
 {
-	DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner & RWSEM_READER_OWNED),
-				sem);
+	DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
 	__up_read(sem);
 }
 EXPORT_SYMBOL(up_read_non_owner);
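
The new owner helpers above boil down to packing a task_struct pointer and
a few low-order flag bits into one atomic word. A minimal userspace sketch
of that packing, using C11 atomics and illustrative names rather than the
kernel API:

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define OWNER_FLAGS_MASK 0x7UL          /* low 3 bits reserved for flags */

/* One atomic word holding a (sufficiently aligned) pointer plus flag bits. */
struct owner_word { atomic_ulong val; };

static void owner_set(struct owner_word *o, void *task, unsigned long flags)
{
    atomic_store(&o->val,
                 (unsigned long)(uintptr_t)task | (flags & OWNER_FLAGS_MASK));
}

/* A single atomic load gives back both the pointer and the flags. */
static void *owner_get(struct owner_word *o, unsigned long *pflags)
{
    unsigned long v = atomic_load(&o->val);

    *pflags = v & OWNER_FLAGS_MASK;
    return (void *)(v & ~OWNER_FLAGS_MASK);
}

int main(void)
{
    static _Alignas(8) int task;        /* stand-in for a task_struct */
    struct owner_word o = { 0 };
    unsigned long flags;
    void *owner;

    owner_set(&o, &task, 0x1UL);        /* e.g. "reader owned" */
    owner = owner_get(&o, &flags);
    printf("owner=%p flags=%#lx\n", owner, flags);
    return 0;
}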

* [tip:locking/core] locking/rwsem: Enable time-based spinning on reader-owned rwsem
  2019-05-20 20:59 ` [PATCH v8 14/19] locking/rwsem: Enable time-based spinning on reader-owned rwsem Waiman Long
  2019-06-04  9:03   ` Peter Zijlstra
@ 2019-06-17 14:32   ` tip-bot for Waiman Long
  1 sibling, 0 replies; 83+ messages in thread
From: tip-bot for Waiman Long @ 2019-06-17 14:32 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: torvalds, mingo, will.deacon, bp, longman, huang.ying.caritas,
	tim.c.chen, hpa, peterz, dave, linux-kernel, tglx

Commit-ID:  7d43f1ce9dd075d8b2aa3ad1f3970ef386a5c358
Gitweb:     https://git.kernel.org/tip/7d43f1ce9dd075d8b2aa3ad1f3970ef386a5c358
Author:     Waiman Long <longman@redhat.com>
AuthorDate: Mon, 20 May 2019 16:59:13 -0400
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 17 Jun 2019 12:28:07 +0200

locking/rwsem: Enable time-based spinning on reader-owned rwsem

When the rwsem is owned by readers, writers stop optimistic spinning
simply because there is no easy way to figure out if all the readers
are actively running or not. However, there are scenarios where
the readers are unlikely to sleep and optimistic spinning can help
performance.

This patch provides a simple mechanism for spinning on a reader-owned
rwsem by a writer. It is time-threshold-based spinning, where the
allowable spinning time can vary from 10us to 25us depending on the
condition of the rwsem.

When the time threshold is exceeded, the nonspinnable bits will be set
in the owner field to indicate that no more optimistic spinning will
be allowed on this rwsem until it becomes writer-owned again. For
fairness, not even readers are allowed to acquire the reader-locked
rwsem by optimistic spinning.
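
A minimal userspace sketch of that time-bounded spin, assuming a
monotonic-clock helper and placeholder try_lock()/reader_owned() callbacks
instead of the real rwsem primitives; like the patch, it checks the
deadline only once every 16 iterations:

#include <stdbool.h>
#include <time.h>

static unsigned long long now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (unsigned long long)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

/*
 * Spin for the lock while it stays reader-owned, but give up once the
 * time budget (10us..25us here) is exhausted.  The deadline is checked
 * only once every 16 iterations to keep clock reads off the hot path.
 * On timeout the kernel also sets the nonspinnable bits so that later
 * spinners go straight to the slowpath.
 */
static bool spin_with_deadline(unsigned long long budget_ns,
                               bool (*try_lock)(void),
                               bool (*reader_owned)(void))
{
    unsigned long long deadline = now_ns() + budget_ns;
    unsigned int loop = 0;

    while (reader_owned()) {
        if (try_lock())
            return true;                  /* stole the lock */
        if (!(++loop & 0xf) && now_ns() > deadline)
            return false;                 /* timed out: stop spinning */
    }
    return try_lock();                    /* no longer reader-owned */
}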

We also want a writer to acquire the lock after the readers hold the
lock for a relatively long time. In order to give preference to writers
under such a circumstance, the single RWSEM_NONSPINNABLE bit is now split
into two - one for reader and one for writer. When optimistic spinning
is disabled, both bits will be set. When the reader count drops down
to 0, the writer nonspinnable bit will be cleared to allow writers to
spin on the lock, but not the readers. When a writer acquires the lock,
it will write its own task structure pointer into sem->owner and clear
the reader nonspinnable bit in the process.
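
A compact model of the split nonspinnable bits described above
(simplified: the real kernel versions are rwsem_set_nonspinnable() and
clear_wr_nonspinnable() in the diff below, and the set side is
conditional on the lock still being reader-owned):

#include <stdatomic.h>

#define RD_NONSPINNABLE (1UL << 1)      /* readers must not opt-spin */
#define WR_NONSPINNABLE (1UL << 2)      /* writers must not opt-spin */
#define NONSPINNABLE    (RD_NONSPINNABLE | WR_NONSPINNABLE)

/* Writer timed out on a reader-owned lock: stop all optimistic spinning. */
static void set_nonspinnable(atomic_ulong *owner)
{
    atomic_fetch_or(owner, NONSPINNABLE);
}

/*
 * Reader count just dropped to 0: let writers (but not readers) resume
 * optimistic spinning, so a writer gets a better shot at the lock.
 */
static void clear_wr_nonspinnable(atomic_ulong *owner)
{
    if (atomic_load(owner) & WR_NONSPINNABLE)
        atomic_fetch_and(owner, ~WR_NONSPINNABLE);
}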

The time taken for each iteration of the reader-owned rwsem spinning
loop varies. Below are sample minimum elapsed times for 16 iterations
of the loop.

      System                 Time for 16 Iterations
      ------                 ----------------------
  1-socket Skylake                  ~800ns
  4-socket Broadwell                ~300ns
  2-socket ThunderX2 (arm64)        ~250ns

When the lock cacheline is contended, we can see up to almost 10X
increase in elapsed time.  So 25us will be at most 500, 1300 and 1600
iterations for each of the above systems.
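
As a rough cross-check, dividing the 25us budget by the uncontended
per-iteration times above gives:

   25us x 16 / 800ns  =  500 iterations   (1-socket Skylake)
   25us x 16 / 300ns ~= 1333 iterations   (4-socket Broadwell)
   25us x 16 / 250ns  = 1600 iterations   (2-socket ThunderX2)

Contention only lengthens each iteration, so these are upper bounds on
the iteration count.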

With a locking microbenchmark running on a 5.1-based kernel, the total
locking rates (in kops/s) on an 8-socket IvyBridge-EX system with
equal numbers of readers and writers before and after this patch were
as follows:

   # of Threads  Pre-patch    Post-patch
   ------------  ---------    ----------
        2          1,759        6,684
        4          1,684        6,738
        8          1,074        7,222
       16            900        7,163
       32            458        7,316
       64            208          520
      128            168          425
      240            143          474

This patch gives a big boost in performance for mixed reader/writer
workloads.

With 32 locking threads, the rwsem lock event data were:

rwsem_opt_fail=79850
rwsem_opt_nospin=5069
rwsem_opt_rlock=597484
rwsem_opt_wlock=957339
rwsem_sleep_reader=57782
rwsem_sleep_writer=55663

With 64 locking threads, the data looked like:

rwsem_opt_fail=346723
rwsem_opt_nospin=6293
rwsem_opt_rlock=1127119
rwsem_opt_wlock=1400628
rwsem_sleep_reader=308201
rwsem_sleep_writer=72281

So a lot more threads acquired the lock in the slowpath and more threads
went to sleep.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-15-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/locking/lock_events_list.h |   1 +
 kernel/locking/rwsem.c            | 173 +++++++++++++++++++++++++++++++-------
 2 files changed, 144 insertions(+), 30 deletions(-)

diff --git a/kernel/locking/lock_events_list.h b/kernel/locking/lock_events_list.h
index ca954e4e00e4..baa998401052 100644
--- a/kernel/locking/lock_events_list.h
+++ b/kernel/locking/lock_events_list.h
@@ -59,6 +59,7 @@ LOCK_EVENT(rwsem_wake_writer)	/* # of writer wakeups			*/
 LOCK_EVENT(rwsem_opt_rlock)	/* # of read locks opt-spin acquired	*/
 LOCK_EVENT(rwsem_opt_wlock)	/* # of write locks opt-spin acquired	*/
 LOCK_EVENT(rwsem_opt_fail)	/* # of failed opt-spinnings		*/
+LOCK_EVENT(rwsem_opt_nospin)	/* # of disabled reader opt-spinnings	*/
 LOCK_EVENT(rwsem_rlock)		/* # of read locks acquired		*/
 LOCK_EVENT(rwsem_rlock_fast)	/* # of fast read locks acquired	*/
 LOCK_EVENT(rwsem_rlock_fail)	/* # of failed read lock acquisitions	*/
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index fae557be8334..2d7cabcfca50 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -23,6 +23,7 @@
 #include <linux/sched/debug.h>
 #include <linux/sched/wake_q.h>
 #include <linux/sched/signal.h>
+#include <linux/sched/clock.h>
 #include <linux/export.h>
 #include <linux/rwsem.h>
 #include <linux/atomic.h>
@@ -31,24 +32,28 @@
 #include "lock_events.h"
 
 /*
- * The least significant 2 bits of the owner value has the following
+ * The least significant 3 bits of the owner value has the following
  * meanings when set.
  *  - Bit 0: RWSEM_READER_OWNED - The rwsem is owned by readers
- *  - Bit 1: RWSEM_NONSPINNABLE - Waiters cannot spin on the rwsem
- *    The rwsem is anonymously owned, i.e. the owner(s) cannot be
- *    readily determined. It can be reader owned or the owning writer
- *    is indeterminate.
+ *  - Bit 1: RWSEM_RD_NONSPINNABLE - Readers cannot spin on this lock.
+ *  - Bit 2: RWSEM_WR_NONSPINNABLE - Writers cannot spin on this lock.
  *
+ * When the rwsem is either owned by an anonymous writer, or it is
+ * reader-owned, but a spinning writer has timed out, both nonspinnable
+ * bits will be set to disable optimistic spinning by readers and writers.
+ * In the latter case, the last unlocking reader should then check the
+ * writer nonspinnable bit and clear it to give writers, but not readers,
+ * preference to acquire the lock via optimistic spinning. Similar
+ * action is also done in the reader slowpath.
+ *
  * When a writer acquires a rwsem, it puts its task_struct pointer
  * into the owner field. It is cleared after an unlock.
  *
  * When a reader acquires a rwsem, it will also puts its task_struct
- * pointer into the owner field with both the RWSEM_READER_OWNED and
- * RWSEM_NONSPINNABLE bits set. On unlock, the owner field will
- * largely be left untouched. So for a free or reader-owned rwsem,
- * the owner value may contain information about the last reader that
- * acquires the rwsem. The anonymous bit is set because that particular
- * reader may or may not still own the lock.
+ * pointer into the owner field with the RWSEM_READER_OWNED bit set.
+ * On unlock, the owner field will largely be left untouched. So
+ * for a free or reader-owned rwsem, the owner value may contain
+ * information about the last reader that acquires the rwsem.
  *
  * That information may be helpful in debugging cases where the system
  * seems to hang on a reader owned rwsem especially if only one reader
@@ -56,7 +61,9 @@
  * a rwsem, but the overhead is simply too big.
  */
 #define RWSEM_READER_OWNED	(1UL << 0)
-#define RWSEM_NONSPINNABLE	(1UL << 1)
+#define RWSEM_RD_NONSPINNABLE	(1UL << 1)
+#define RWSEM_WR_NONSPINNABLE	(1UL << 2)
+#define RWSEM_NONSPINNABLE	(RWSEM_RD_NONSPINNABLE | RWSEM_WR_NONSPINNABLE)
 #define RWSEM_OWNER_FLAGS_MASK	(RWSEM_READER_OWNED | RWSEM_NONSPINNABLE)
 
 #ifdef CONFIG_DEBUG_RWSEMS
@@ -141,7 +148,7 @@ static inline bool rwsem_test_oflags(struct rw_semaphore *sem, long flags)
 static inline void __rwsem_set_reader_owned(struct rw_semaphore *sem,
 					    struct task_struct *owner)
 {
-	unsigned long val = (unsigned long)owner | RWSEM_READER_OWNED | RWSEM_NONSPINNABLE;
+	unsigned long val = (unsigned long)owner | RWSEM_READER_OWNED;
 
 	atomic_long_set(&sem->owner, val);
 }
@@ -191,6 +198,23 @@ static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
 }
 #endif
 
+/*
+ * Set the RWSEM_NONSPINNABLE bits if the RWSEM_READER_OWNED flag
+ * remains set. Otherwise, the operation will be aborted.
+ */
+static inline void rwsem_set_nonspinnable(struct rw_semaphore *sem)
+{
+	unsigned long owner = atomic_long_read(&sem->owner);
+
+	do {
+		if (!(owner & RWSEM_READER_OWNED))
+			break;
+		if (owner & RWSEM_NONSPINNABLE)
+			break;
+	} while (!atomic_long_try_cmpxchg(&sem->owner, &owner,
+					  owner | RWSEM_NONSPINNABLE));
+}
+
 /*
  * Return just the real task structure pointer of the owner
  */
@@ -546,7 +570,8 @@ static inline bool owner_on_cpu(struct task_struct *owner)
 	return owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
 }
 
-static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
+static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem,
+					   unsigned long nonspinnable)
 {
 	struct task_struct *owner;
 	unsigned long flags;
@@ -562,7 +587,7 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
 	preempt_disable();
 	rcu_read_lock();
 	owner = rwsem_owner_flags(sem, &flags);
-	if ((flags & RWSEM_NONSPINNABLE) || (owner && !owner_on_cpu(owner)))
+	if ((flags & nonspinnable) || (owner && !owner_on_cpu(owner)))
 		ret = false;
 	rcu_read_unlock();
 	preempt_enable();
@@ -588,12 +613,12 @@ enum owner_state {
 	OWNER_READER		= 1 << 2,
 	OWNER_NONSPINNABLE	= 1 << 3,
 };
-#define OWNER_SPINNABLE		(OWNER_NULL | OWNER_WRITER)
+#define OWNER_SPINNABLE		(OWNER_NULL | OWNER_WRITER | OWNER_READER)
 
 static inline enum owner_state
-rwsem_owner_state(struct task_struct *owner, unsigned long flags)
+rwsem_owner_state(struct task_struct *owner, unsigned long flags, unsigned long nonspinnable)
 {
-	if (flags & RWSEM_NONSPINNABLE)
+	if (flags & nonspinnable)
 		return OWNER_NONSPINNABLE;
 
 	if (flags & RWSEM_READER_OWNED)
@@ -602,14 +627,15 @@ rwsem_owner_state(struct task_struct *owner, unsigned long flags)
 	return owner ? OWNER_WRITER : OWNER_NULL;
 }
 
-static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
+static noinline enum owner_state
+rwsem_spin_on_owner(struct rw_semaphore *sem, unsigned long nonspinnable)
 {
 	struct task_struct *new, *owner;
 	unsigned long flags, new_flags;
 	enum owner_state state;
 
 	owner = rwsem_owner_flags(sem, &flags);
-	state = rwsem_owner_state(owner, flags);
+	state = rwsem_owner_state(owner, flags, nonspinnable);
 	if (state != OWNER_WRITER)
 		return state;
 
@@ -622,7 +648,7 @@ static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
 
 		new = rwsem_owner_flags(sem, &new_flags);
 		if ((new != owner) || (new_flags != flags)) {
-			state = rwsem_owner_state(new, new_flags);
+			state = rwsem_owner_state(new, new_flags, nonspinnable);
 			break;
 		}
 
@@ -646,10 +672,39 @@ static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
 	return state;
 }
 
+/*
+ * Calculate reader-owned rwsem spinning threshold for writer
+ *
+ * The more readers own the rwsem, the longer it will take for them to
+ * wind down and free the rwsem. So the empirical formula used to
+ * determine the actual spinning time limit here is:
+ *
+ *   Spinning threshold = (10 + nr_readers/2)us
+ *
+ * The limit is capped to a maximum of 25us (30 readers). This is just
+ * a heuristic and is subjected to change in the future.
+ */
+static inline u64 rwsem_rspin_threshold(struct rw_semaphore *sem)
+{
+	long count = atomic_long_read(&sem->count);
+	int readers = count >> RWSEM_READER_SHIFT;
+	u64 delta;
+
+	if (readers > 30)
+		readers = 30;
+	delta = (20 + readers) * NSEC_PER_USEC / 2;
+
+	return sched_clock() + delta;
+}
+
 static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
 {
 	bool taken = false;
 	int prev_owner_state = OWNER_NULL;
+	int loop = 0;
+	u64 rspin_threshold = 0;
+	unsigned long nonspinnable = wlock ? RWSEM_WR_NONSPINNABLE
+					   : RWSEM_RD_NONSPINNABLE;
 
 	preempt_disable();
 
@@ -661,12 +716,12 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
 	 * Optimistically spin on the owner field and attempt to acquire the
 	 * lock whenever the owner changes. Spinning will be stopped when:
 	 *  1) the owning writer isn't running; or
-	 *  2) readers own the lock as we can't determine if they are
-	 *     actively running or not.
+	 *  2) readers own the lock and spinning time has exceeded limit.
 	 */
 	for (;;) {
-		enum owner_state owner_state = rwsem_spin_on_owner(sem);
+		enum owner_state owner_state;
 
+		owner_state = rwsem_spin_on_owner(sem, nonspinnable);
 		if (!(owner_state & OWNER_SPINNABLE))
 			break;
 
@@ -679,6 +734,38 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
 		if (taken)
 			break;
 
+		/*
+		 * Time-based reader-owned rwsem optimistic spinning
+		 */
+		if (wlock && (owner_state == OWNER_READER)) {
+			/*
+			 * Re-initialize rspin_threshold every time when
+			 * the owner state changes from non-reader to reader.
+			 * This allows a writer to steal the lock in between
+			 * 2 reader phases and have the threshold reset at
+			 * the beginning of the 2nd reader phase.
+			 */
+			if (prev_owner_state != OWNER_READER) {
+				if (rwsem_test_oflags(sem, nonspinnable))
+					break;
+				rspin_threshold = rwsem_rspin_threshold(sem);
+				loop = 0;
+			}
+
+			/*
+			 * Check time threshold once every 16 iterations to
+			 * avoid calling sched_clock() too frequently so
+			 * as to reduce the average latency between the times
+			 * when the lock becomes free and when the spinner
+			 * is ready to do a trylock.
+			 */
+			else if (!(++loop & 0xf) && (sched_clock() > rspin_threshold)) {
+				rwsem_set_nonspinnable(sem);
+				lockevent_inc(rwsem_opt_nospin);
+				break;
+			}
+		}
+
 		/*
 		 * An RT task cannot do optimistic spinning if it cannot
 		 * be sure the lock holder is running or live-lock may
@@ -733,8 +820,25 @@ done:
 	lockevent_cond_inc(rwsem_opt_fail, !taken);
 	return taken;
 }
+
+/*
+ * Clear the owner's RWSEM_WR_NONSPINNABLE bit if it is set. This should
+ * only be called when the reader count reaches 0.
+ *
+ * This give writers better chance to acquire the rwsem first before
+ * readers when the rwsem was being held by readers for a relatively long
+ * period of time. Race can happen that an optimistic spinner may have
+ * just stolen the rwsem and set the owner, but just clearing the
+ * RWSEM_WR_NONSPINNABLE bit will do no harm anyway.
+ */
+static inline void clear_wr_nonspinnable(struct rw_semaphore *sem)
+{
+	if (rwsem_test_oflags(sem, RWSEM_WR_NONSPINNABLE))
+		atomic_long_andnot(RWSEM_WR_NONSPINNABLE, &sem->owner);
+}
 #else
-static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
+static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem,
+					   unsigned long nonspinnable)
 {
 	return false;
 }
@@ -743,6 +847,8 @@ static inline bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
 {
 	return false;
 }
+
+static inline void clear_wr_nonspinnable(struct rw_semaphore *sem) { }
 #endif
 
 /*
@@ -752,10 +858,11 @@ static struct rw_semaphore __sched *
 rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
 {
 	long count, adjustment = -RWSEM_READER_BIAS;
+	bool wake = false;
 	struct rwsem_waiter waiter;
 	DEFINE_WAKE_Q(wake_q);
 
-	if (!rwsem_can_spin_on_owner(sem))
+	if (!rwsem_can_spin_on_owner(sem, RWSEM_RD_NONSPINNABLE))
 		goto queue;
 
 	/*
@@ -815,8 +922,12 @@ queue:
 	 * If there are no writers and we are first in the queue,
 	 * wake our own waiter to join the existing active readers !
 	 */
-	if (!(count & RWSEM_LOCK_MASK) ||
-	   (!(count & RWSEM_WRITER_MASK) && (adjustment & RWSEM_FLAG_WAITERS)))
+	if (!(count & RWSEM_LOCK_MASK)) {
+		clear_wr_nonspinnable(sem);
+		wake = true;
+	}
+	if (wake || (!(count & RWSEM_WRITER_MASK) &&
+		    (adjustment & RWSEM_FLAG_WAITERS)))
 		rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
 
 	raw_spin_unlock_irq(&sem->wait_lock);
@@ -866,7 +977,7 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 	DEFINE_WAKE_Q(wake_q);
 
 	/* do optimistic spinning and steal lock if possible */
-	if (rwsem_can_spin_on_owner(sem) &&
+	if (rwsem_can_spin_on_owner(sem, RWSEM_WR_NONSPINNABLE) &&
 	    rwsem_optimistic_spin(sem, true))
 		return sem;
 
@@ -1124,8 +1235,10 @@ inline void __up_read(struct rw_semaphore *sem)
 	rwsem_clear_reader_owned(sem);
 	tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
 	if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS)) ==
-		      RWSEM_FLAG_WAITERS))
+		      RWSEM_FLAG_WAITERS)) {
+		clear_wr_nonspinnable(sem);
 		rwsem_wake(sem, tmp);
+	}
 }
 
 /*

* [tip:locking/core] locking/rwsem: Adaptive disabling of reader optimistic spinning
  2019-05-20 20:59 ` [PATCH v8 15/19] locking/rwsem: Adaptive disabling of reader optimistic spinning Waiman Long
                     ` (3 preceding siblings ...)
  2019-06-04 10:58   ` Peter Zijlstra
@ 2019-06-17 14:33   ` tip-bot for Waiman Long
  4 siblings, 0 replies; 83+ messages in thread
From: tip-bot for Waiman Long @ 2019-06-17 14:33 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: bp, dave, hpa, peterz, linux-kernel, will.deacon, longman, tglx,
	mingo, torvalds, huang.ying.caritas, tim.c.chen

Commit-ID:  5cfd92e12e13432251981b9d0cd68dbd7aa8d690
Gitweb:     https://git.kernel.org/tip/5cfd92e12e13432251981b9d0cd68dbd7aa8d690
Author:     Waiman Long <longman@redhat.com>
AuthorDate: Mon, 20 May 2019 16:59:14 -0400
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 17 Jun 2019 12:28:09 +0200

locking/rwsem: Adaptive disabling of reader optimistic spinning

Reader optimistic spinning is helpful when the reader critical section
is short and there aren't that many readers around. It makes readers
relatively more preferred than writers. When a writer times out spinning
on a reader-owned lock and sets the nonspinnable bits, there are two main
reasons for that.

 1) The reader critical section is long, perhaps the task sleeps after
    acquiring the read lock.
 2) There are just too many readers contending the lock causing it to
    take a while to service all of them.

In the former case, a long reader critical section will impede the progress
of writers, which is usually more important for system performance.
In the latter case, reader optimistic spinning tends to make the reader
groups (readers that acquire the lock together) smaller, leading to more
of them. That may hurt performance in some cases. In
other words, the setting of nonspinnable bits indicates that reader
optimistic spinning may not be helpful for those workloads that cause it.

Therefore, any writers that have observed the setting of the writer
nonspinnable bit for a given rwsem after they fail to acquire the lock
via optimistic spinning will set the reader nonspinnable bit once they
acquire the write lock. Similarly, readers that observe the setting
of reader nonspinnable bit at slowpath entry will also set the reader
nonspinnable bit when they acquire the read lock via the wakeup path.
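
A minimal sketch of the writer side of that hand-off, with a placeholder
block_until_locked() standing in for the real slowpath wait; it mirrors
the disable_rspin / rwsem_disable_reader_optspin() logic in the diff
below:

#include <stdatomic.h>
#include <stdbool.h>

#define RD_NONSPINNABLE (1UL << 1)
#define WR_NONSPINNABLE (1UL << 2)
#define NONSPINNABLE    (RD_NONSPINNABLE | WR_NONSPINNABLE)

/* Writer slowpath, entered after optimistic spinning has failed. */
static void writer_slowpath(atomic_ulong *owner,
                            void (*block_until_locked)(void))
{
    /* Snapshot: was spinning already disabled when we gave up? */
    bool disable_rspin = atomic_load(owner) & NONSPINNABLE;

    block_until_locked();       /* sleep until the write lock is ours */

    /*
     * We now hold the write lock.  If the nonspinnable state had been
     * observed, also switch off reader optimistic spinning so that
     * writers stay preferred on this lock.
     */
    if (disable_rspin)
        atomic_fetch_or(owner, RD_NONSPINNABLE);
}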

Once the reader nonspinnable bit is on, it will only be reset when
a writer is able to acquire the rwsem in the fast path or somehow a
reader or writer in the slowpath doesn't observe the nonspinnable bit.

This is to discourage reader optimistic spinning on that particular
rwsem and make writers more preferred. This adaptive disabling of reader
optimistic spinning will alleviate some of the negative side effects of
this feature.

In addition, this patch tries to make readers in the spinning queue
follow the phase-fair principle after quitting optimistic spinning
by checking if another reader has somehow acquired a read lock after
this reader enters the optimistic spinning queue. If so and the rwsem
is still reader-owned, this reader is in the right read-phase and can
attempt to acquire the lock.
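
A simplified model of that check, assuming the owner word was snapshotted
at slowpath entry and using a placeholder try_read_lock(); the real
version is rwsem_reader_phase_trylock() in the diff below:

#include <stdatomic.h>
#include <stdbool.h>

#define READER_OWNED      (1UL << 0)
#define OWNER_FLAGS_MASK  0x7UL

/*
 * After a reader gives up optimistic spinning: if the lock is reader-owned
 * and the owning reader has changed since we entered the slowpath, we are
 * still in a valid read phase and may try to grab the lock directly.
 */
static bool reader_phase_trylock(atomic_ulong *owner,
                                 unsigned long last_rowner,
                                 bool (*try_read_lock)(void))
{
    unsigned long cur = atomic_load(owner);

    if (!(cur & READER_OWNED))
        return false;
    if ((cur ^ last_rowner) & ~OWNER_FLAGS_MASK)
        return try_read_lock();
    return false;
}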

On a 2-socket 40-core 80-thread Skylake system, the page_fault1 test of
the will-it-scale benchmark was run with various numbers of threads. The
number of operations done before reader optimistic spinning patches,
this patch and after this patch were:

  Threads  Before rspin  Before patch  After patch    %change
  -------  ------------  ------------  -----------    -------
    20        5541068      5345484       5455667    -3.5%/ +2.1%
    40       10185150      7292313       9219276   -28.5%/+26.4%
    60        8196733      6460517       7181209   -21.2%/+11.2%
    80        9508864      6739559       8107025   -29.1%/+20.3%

This patch doesn't recover all of the lost performance, but it recovers more
than half. Given that reader optimistic spinning does benefit
some workloads, this is a good compromise.

Using the rwsem locking microbenchmark with a very short critical section,
this patch doesn't have too much impact on locking performance as shown
by the locking rates (kops/s) below with equal numbers of readers and
writers before and after this patch:

   # of Threads  Pre-patch    Post-patch
   ------------  ---------    ----------
        2          4,730        4,969
        4          4,814        4,786
        8          4,866        4,815
       16          4,715        4,511
       32          3,338        3,500
       64          3,212        3,389
       80          3,110        3,044

When running the locking microbenchmark with 40 dedicated reader and writer
threads, however, the reader performance is curtailed to favor the writer.

Before patch:

  40 readers, Iterations Min/Mean/Max = 204,026/234,309/254,816
  40 writers, Iterations Min/Mean/Max = 88,515/95,884/115,644

After patch:

  40 readers, Iterations Min/Mean/Max = 33,813/35,260/36,791
  40 writers, Iterations Min/Mean/Max = 95,368/96,565/97,798

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-16-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/locking/lock_events_list.h |  10 +--
 kernel/locking/rwsem.c            | 133 ++++++++++++++++++++++++++++++++++++--
 2 files changed, 135 insertions(+), 8 deletions(-)

diff --git a/kernel/locking/lock_events_list.h b/kernel/locking/lock_events_list.h
index baa998401052..239039d0ce21 100644
--- a/kernel/locking/lock_events_list.h
+++ b/kernel/locking/lock_events_list.h
@@ -56,10 +56,12 @@ LOCK_EVENT(rwsem_sleep_reader)	/* # of reader sleeps			*/
 LOCK_EVENT(rwsem_sleep_writer)	/* # of writer sleeps			*/
 LOCK_EVENT(rwsem_wake_reader)	/* # of reader wakeups			*/
 LOCK_EVENT(rwsem_wake_writer)	/* # of writer wakeups			*/
-LOCK_EVENT(rwsem_opt_rlock)	/* # of read locks opt-spin acquired	*/
-LOCK_EVENT(rwsem_opt_wlock)	/* # of write locks opt-spin acquired	*/
-LOCK_EVENT(rwsem_opt_fail)	/* # of failed opt-spinnings		*/
-LOCK_EVENT(rwsem_opt_nospin)	/* # of disabled reader opt-spinnings	*/
+LOCK_EVENT(rwsem_opt_rlock)	/* # of opt-acquired read locks		*/
+LOCK_EVENT(rwsem_opt_wlock)	/* # of opt-acquired write locks	*/
+LOCK_EVENT(rwsem_opt_fail)	/* # of failed optspins			*/
+LOCK_EVENT(rwsem_opt_nospin)	/* # of disabled optspins		*/
+LOCK_EVENT(rwsem_opt_norspin)	/* # of disabled reader-only optspins	*/
+LOCK_EVENT(rwsem_opt_rlock2)	/* # of opt-acquired 2ndary read locks	*/
 LOCK_EVENT(rwsem_rlock)		/* # of read locks acquired		*/
 LOCK_EVENT(rwsem_rlock_fast)	/* # of fast read locks acquired	*/
 LOCK_EVENT(rwsem_rlock_fail)	/* # of failed read lock acquisitions	*/
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 2d7cabcfca50..e1e0bac957c4 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -59,6 +59,42 @@
  * seems to hang on a reader owned rwsem especially if only one reader
  * is involved. Ideally we would like to track all the readers that own
  * a rwsem, but the overhead is simply too big.
+ *
+ * Reader optimistic spinning is helpful when the reader critical section
+ * is short and there aren't that many readers around. It makes readers
+ * relatively more preferred than writers. When a writer times out spinning
+ * on a reader-owned lock and sets the nonspinnable bits, there are two main
+ * reasons for that.
+ *
+ *  1) The reader critical section is long, perhaps the task sleeps after
+ *     acquiring the read lock.
+ *  2) There are just too many readers contending the lock causing it to
+ *     take a while to service all of them.
+ *
+ * In the former case, long reader critical section will impede the progress
+ * of writers which is usually more important for system performance. In
+ * the latter case, reader optimistic spinning tends to make the reader
+ * groups that contain readers that acquire the lock together smaller
+ * leading to more of them. That may hurt performance in some cases. In
+ * other words, the setting of nonspinnable bits indicates that reader
+ * optimistic spinning may not be helpful for those workloads that cause
+ * it.
+ *
+ * Therefore, any writers that had observed the setting of the writer
+ * nonspinnable bit for a given rwsem after they fail to acquire the lock
+ * via optimistic spinning will set the reader nonspinnable bit once they
+ * acquire the write lock. Similarly, readers that observe the setting
+ * of reader nonspinnable bit at slowpath entry will set the reader
+ * nonspinnable bits when they acquire the read lock via the wakeup path.
+ *
+ * Once the reader nonspinnable bit is on, it will only be reset when
+ * a writer is able to acquire the rwsem in the fast path or somehow a
+ * reader or writer in the slowpath doesn't observe the nonspinnable bit.
+ *
+ * This is to discourage reader optimistic spinning on that particular
+ * rwsem and make writers more preferred. This adaptive disabling of reader
+ * optimistic spinning will alleviate the negative side effect of this
+ * feature.
  */
 #define RWSEM_READER_OWNED	(1UL << 0)
 #define RWSEM_RD_NONSPINNABLE	(1UL << 1)
@@ -144,11 +180,14 @@ static inline bool rwsem_test_oflags(struct rw_semaphore *sem, long flags)
  * Note that the owner value just indicates the task has owned the rwsem
  * previously, it may not be the real owner or one of the real owners
  * anymore when that field is examined, so take it with a grain of salt.
+ *
+ * The reader non-spinnable bit is preserved.
  */
 static inline void __rwsem_set_reader_owned(struct rw_semaphore *sem,
 					    struct task_struct *owner)
 {
-	unsigned long val = (unsigned long)owner | RWSEM_READER_OWNED;
+	unsigned long val = (unsigned long)owner | RWSEM_READER_OWNED |
+		(atomic_long_read(&sem->owner) & RWSEM_RD_NONSPINNABLE);
 
 	atomic_long_set(&sem->owner, val);
 }
@@ -287,6 +326,7 @@ struct rwsem_waiter {
 	struct task_struct *task;
 	enum rwsem_waiter_type type;
 	unsigned long timeout;
+	unsigned long last_rowner;
 };
 #define rwsem_first_waiter(sem) \
 	list_first_entry(&sem->wait_list, struct rwsem_waiter, list)
@@ -368,6 +408,8 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
 	 * so we can bail out early if a writer stole the lock.
 	 */
 	if (wake_type != RWSEM_WAKE_READ_OWNED) {
+		struct task_struct *owner;
+
 		adjustment = RWSEM_READER_BIAS;
 		oldcount = atomic_long_fetch_add(adjustment, &sem->count);
 		if (unlikely(oldcount & RWSEM_WRITER_MASK)) {
@@ -388,8 +430,15 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
 		/*
 		 * Set it to reader-owned to give spinners an early
 		 * indication that readers now have the lock.
+		 * The reader nonspinnable bit seen at slowpath entry of
+		 * the reader is copied over.
 		 */
-		__rwsem_set_reader_owned(sem, waiter->task);
+		owner = waiter->task;
+		if (waiter->last_rowner & RWSEM_RD_NONSPINNABLE) {
+			owner = (void *)((unsigned long)owner | RWSEM_RD_NONSPINNABLE);
+			lockevent_inc(rwsem_opt_norspin);
+		}
+		__rwsem_set_reader_owned(sem, owner);
 	}
 
 	/*
@@ -836,6 +885,42 @@ static inline void clear_wr_nonspinnable(struct rw_semaphore *sem)
 	if (rwsem_test_oflags(sem, RWSEM_WR_NONSPINNABLE))
 		atomic_long_andnot(RWSEM_WR_NONSPINNABLE, &sem->owner);
 }
+
+/*
+ * This function is called when the reader fails to acquire the lock via
+ * optimistic spinning. In this case we will still attempt to do a trylock
+ * when comparing the rwsem state right now with the state when entering
+ * the slowpath indicates that the reader is still in a valid reader phase.
+ * This happens when the following conditions are true:
+ *
+ * 1) The lock is currently reader owned, and
+ * 2) The lock is previously not reader-owned or the last read owner changes.
+ *
+ * In the former case, we have transitioned from a writer phase to a
+ * reader-phase while spinning. In the latter case, it means the reader
+ * phase hasn't ended when we entered the optimistic spinning loop. In
+ * both cases, the reader is eligible to acquire the lock. This is the
+ * secondary path where a read lock is acquired optimistically.
+ *
+ * The reader non-spinnable bit wasn't set at time of entry or it will
+ * not be here at all.
+ */
+static inline bool rwsem_reader_phase_trylock(struct rw_semaphore *sem,
+					      unsigned long last_rowner)
+{
+	unsigned long owner = atomic_long_read(&sem->owner);
+
+	if (!(owner & RWSEM_READER_OWNED))
+		return false;
+
+	if (((owner ^ last_rowner) & ~RWSEM_OWNER_FLAGS_MASK) &&
+	    rwsem_try_read_lock_unqueued(sem)) {
+		lockevent_inc(rwsem_opt_rlock2);
+		lockevent_add(rwsem_opt_fail, -1);
+		return true;
+	}
+	return false;
+}
 #else
 static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem,
 					   unsigned long nonspinnable)
@@ -849,6 +934,12 @@ static inline bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
 }
 
 static inline void clear_wr_nonspinnable(struct rw_semaphore *sem) { }
+
+static inline bool rwsem_reader_phase_trylock(struct rw_semaphore *sem,
+					      unsigned long last_rowner)
+{
+	return false;
+}
 #endif
 
 /*
@@ -862,6 +953,14 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
 	struct rwsem_waiter waiter;
 	DEFINE_WAKE_Q(wake_q);
 
+	/*
+	 * Save the current read-owner of rwsem, if available, and the
+	 * reader nonspinnable bit.
+	 */
+	waiter.last_rowner = atomic_long_read(&sem->owner);
+	if (!(waiter.last_rowner & RWSEM_READER_OWNED))
+		waiter.last_rowner &= RWSEM_RD_NONSPINNABLE;
+
 	if (!rwsem_can_spin_on_owner(sem, RWSEM_RD_NONSPINNABLE))
 		goto queue;
 
@@ -884,6 +983,8 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
 			wake_up_q(&wake_q);
 		}
 		return sem;
+	} else if (rwsem_reader_phase_trylock(sem, waiter.last_rowner)) {
+		return sem;
 	}
 
 queue:
@@ -964,6 +1065,19 @@ out_nolock:
 	return ERR_PTR(-EINTR);
 }
 
+/*
+ * This function is called by the a write lock owner. So the owner value
+ * won't get changed by others.
+ */
+static inline void rwsem_disable_reader_optspin(struct rw_semaphore *sem,
+						bool disable)
+{
+	if (unlikely(disable)) {
+		atomic_long_or(RWSEM_RD_NONSPINNABLE, &sem->owner);
+		lockevent_inc(rwsem_opt_norspin);
+	}
+}
+
 /*
  * Wait until we successfully acquire the write lock
  */
@@ -971,6 +1085,7 @@ static struct rw_semaphore *
 rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 {
 	long count;
+	bool disable_rspin;
 	enum writer_wait_state wstate;
 	struct rwsem_waiter waiter;
 	struct rw_semaphore *ret = sem;
@@ -981,6 +1096,13 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 	    rwsem_optimistic_spin(sem, true))
 		return sem;
 
+	/*
+	 * Disable reader optimistic spinning for this rwsem after
+	 * acquiring the write lock when the setting of the nonspinnable
+	 * bits are observed.
+	 */
+	disable_rspin = atomic_long_read(&sem->owner) & RWSEM_NONSPINNABLE;
+
 	/*
 	 * Optimistic spinning failed, proceed to the slowpath
 	 * and block until we can acquire the sem.
@@ -1077,6 +1199,7 @@ wait:
 	}
 	__set_current_state(TASK_RUNNING);
 	list_del(&waiter.list);
+	rwsem_disable_reader_optspin(sem, disable_rspin);
 	raw_spin_unlock_irq(&sem->wait_lock);
 	lockevent_inc(rwsem_wlock);
 
@@ -1196,7 +1319,8 @@ static inline void __down_write(struct rw_semaphore *sem)
 	if (unlikely(!atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
 						      RWSEM_WRITER_LOCKED)))
 		rwsem_down_write_slowpath(sem, TASK_UNINTERRUPTIBLE);
-	rwsem_set_owner(sem);
+	else
+		rwsem_set_owner(sem);
 }
 
 static inline int __down_write_killable(struct rw_semaphore *sem)
@@ -1207,8 +1331,9 @@ static inline int __down_write_killable(struct rw_semaphore *sem)
 						      RWSEM_WRITER_LOCKED))) {
 		if (IS_ERR(rwsem_down_write_slowpath(sem, TASK_KILLABLE)))
 			return -EINTR;
+	} else {
+		rwsem_set_owner(sem);
 	}
-	rwsem_set_owner(sem);
 	return 0;
 }
 

* [tip:locking/core] locking/rwsem: Guard against making count negative
  2019-05-20 20:59 ` [PATCH v8 16/19] locking/rwsem: Guard against making count negative Waiman Long
  2019-06-11 13:11   ` Peter Zijlstra
  2019-06-11 13:13   ` Peter Zijlstra
@ 2019-06-17 14:34   ` tip-bot for Waiman Long
  2 siblings, 0 replies; 83+ messages in thread
From: tip-bot for Waiman Long @ 2019-06-17 14:34 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tim.c.chen, torvalds, tglx, huang.ying.caritas, longman, dave,
	will.deacon, linux-kernel, mingo, hpa, bp, peterz

Commit-ID:  a15ea1a35f1b2782befc8b958c123c5d6a7cab0a
Gitweb:     https://git.kernel.org/tip/a15ea1a35f1b2782befc8b958c123c5d6a7cab0a
Author:     Waiman Long <longman@redhat.com>
AuthorDate: Mon, 20 May 2019 16:59:15 -0400
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 17 Jun 2019 12:28:11 +0200

locking/rwsem: Guard against making count negative

The upper bits of the count field are used as the reader count. When
a sufficient number of active readers are present, the most significant
bit will be set and the count becomes negative. If the number of active
readers keeps piling up, we may eventually overflow the reader count.
This is not likely to happen unless the number of bits reserved for the
reader count is reduced because those bits are needed for other purposes.

To prevent this count overflow from happening, the most significant
bit is now treated as a guard bit (RWSEM_FLAG_READFAIL). Read-lock
attempts will now fail for both the fast and slow paths whenever this
bit is set. So all those extra readers will be put to sleep in the wait
list. Wakeup will not happen until the reader count reaches 0.
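
A userspace sketch of the resulting fast-path check, with illustrative
bit names; it mirrors rwsem_read_trylock() in the diff below (which
additionally sets the nonspinnable bits when the guard bit trips):

#include <stdatomic.h>
#include <stdbool.h>

#define READER_SHIFT    8
#define READER_BIAS     (1L << READER_SHIFT)
#define WRITER_LOCKED   (1L << 0)
#define FLAG_WAITERS    (1L << 1)
#define FLAG_HANDOFF    (1L << 2)

/*
 * Fast-path read-lock attempt: bump the reader count and inspect the new
 * value.  A negative result means the reader bits have spilled into the
 * top guard bit (RWSEM_FLAG_READFAIL in the patch), so the attempt fails
 * and the caller must take the slowpath and sleep.
 */
static bool read_trylock(atomic_long *count)
{
    long cnt = atomic_fetch_add(count, READER_BIAS) + READER_BIAS;

    if (cnt < 0)
        return false;
    return !(cnt & (WRITER_LOCKED | FLAG_WAITERS | FLAG_HANDOFF));
}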

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-17-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/locking/rwsem.c | 53 ++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 41 insertions(+), 12 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index e1e0bac957c4..37524a47f002 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -116,13 +116,28 @@
 #endif
 
 /*
- * The definition of the atomic counter in the semaphore:
+ * On 64-bit architectures, the bit definitions of the count are:
  *
- * Bit  0   - writer locked bit
- * Bit  1   - waiters present bit
- * Bit  2   - lock handoff bit
- * Bits 3-7 - reserved
- * Bits 8-X - 24-bit (32-bit) or 56-bit reader count
+ * Bit  0    - writer locked bit
+ * Bit  1    - waiters present bit
+ * Bit  2    - lock handoff bit
+ * Bits 3-7  - reserved
+ * Bits 8-62 - 55-bit reader count
+ * Bit  63   - read fail bit
+ *
+ * On 32-bit architectures, the bit definitions of the count are:
+ *
+ * Bit  0    - writer locked bit
+ * Bit  1    - waiters present bit
+ * Bit  2    - lock handoff bit
+ * Bits 3-7  - reserved
+ * Bits 8-30 - 23-bit reader count
+ * Bit  31   - read fail bit
+ *
+ * It is not likely that the most significant bit (read fail bit) will ever
+ * be set. This guard bit is still checked anyway in the down_read() fastpath
+ * just in case we need to use up more of the reader bits for other purpose
+ * in the future.
  *
  * atomic_long_fetch_add() is used to obtain reader lock, whereas
  * atomic_long_cmpxchg() will be used to obtain writer lock.
@@ -139,6 +154,7 @@
 #define RWSEM_WRITER_LOCKED	(1UL << 0)
 #define RWSEM_FLAG_WAITERS	(1UL << 1)
 #define RWSEM_FLAG_HANDOFF	(1UL << 2)
+#define RWSEM_FLAG_READFAIL	(1UL << (BITS_PER_LONG - 1))
 
 #define RWSEM_READER_SHIFT	8
 #define RWSEM_READER_BIAS	(1UL << RWSEM_READER_SHIFT)
@@ -146,7 +162,7 @@
 #define RWSEM_WRITER_MASK	RWSEM_WRITER_LOCKED
 #define RWSEM_LOCK_MASK		(RWSEM_WRITER_MASK|RWSEM_READER_MASK)
 #define RWSEM_READ_FAILED_MASK	(RWSEM_WRITER_MASK|RWSEM_FLAG_WAITERS|\
-				 RWSEM_FLAG_HANDOFF)
+				 RWSEM_FLAG_HANDOFF|RWSEM_FLAG_READFAIL)
 
 /*
  * All writes to owner are protected by WRITE_ONCE() to make sure that
@@ -254,6 +270,14 @@ static inline void rwsem_set_nonspinnable(struct rw_semaphore *sem)
 					  owner | RWSEM_NONSPINNABLE));
 }
 
+static inline bool rwsem_read_trylock(struct rw_semaphore *sem)
+{
+	long cnt = atomic_long_add_return_acquire(RWSEM_READER_BIAS, &sem->count);
+	if (WARN_ON_ONCE(cnt < 0))
+		rwsem_set_nonspinnable(sem);
+	return !(cnt & RWSEM_READ_FAILED_MASK);
+}
+
 /*
  * Return just the real task structure pointer of the owner
  */
@@ -402,6 +426,12 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
 		return;
 	}
 
+	/*
+	 * No reader wakeup if there are too many of them already.
+	 */
+	if (unlikely(atomic_long_read(&sem->count) < 0))
+		return;
+
 	/*
 	 * Writers might steal the lock before we grant it to the next reader.
 	 * We prefer to do the first reader grant before counting readers
@@ -949,9 +979,9 @@ static struct rw_semaphore __sched *
 rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
 {
 	long count, adjustment = -RWSEM_READER_BIAS;
-	bool wake = false;
 	struct rwsem_waiter waiter;
 	DEFINE_WAKE_Q(wake_q);
+	bool wake = false;
 
 	/*
 	 * Save the current read-owner of rwsem, if available, and the
@@ -1270,8 +1300,7 @@ static struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem)
  */
 inline void __down_read(struct rw_semaphore *sem)
 {
-	if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
-			&sem->count) & RWSEM_READ_FAILED_MASK)) {
+	if (!rwsem_read_trylock(sem)) {
 		rwsem_down_read_slowpath(sem, TASK_UNINTERRUPTIBLE);
 		DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
 	} else {
@@ -1281,8 +1310,7 @@ inline void __down_read(struct rw_semaphore *sem)
 
 static inline int __down_read_killable(struct rw_semaphore *sem)
 {
-	if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
-			&sem->count) & RWSEM_READ_FAILED_MASK)) {
+	if (!rwsem_read_trylock(sem)) {
 		if (IS_ERR(rwsem_down_read_slowpath(sem, TASK_KILLABLE)))
 			return -EINTR;
 		DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
@@ -1359,6 +1387,7 @@ inline void __up_read(struct rw_semaphore *sem)
 	DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
 	rwsem_clear_reader_owned(sem);
 	tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
+	DEBUG_RWSEMS_WARN_ON(tmp < 0, sem);
 	if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS)) ==
 		      RWSEM_FLAG_WAITERS)) {
 		clear_wr_nonspinnable(sem);

* Re: [PATCH v8 13/19] locking/rwsem: Make rwsem->owner an atomic_long_t
  2019-05-20 20:59 ` [PATCH v8 13/19] locking/rwsem: Make rwsem->owner an atomic_long_t Waiman Long
  2019-06-04  8:52   ` Peter Zijlstra
  2019-06-17 14:32   ` [tip:locking/core] " tip-bot for Waiman Long
@ 2019-07-19 18:45   ` Luis Henriques
  2019-07-19 19:32     ` Waiman Long
  2 siblings, 1 reply; 83+ messages in thread
From: Luis Henriques @ 2019-07-19 18:45 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On Mon, May 20, 2019 at 04:59:12PM -0400, Waiman Long wrote:
> The rwsem->owner contains not just the task structure pointer, it also
> holds some flags for storing the current state of the rwsem. Some of
> the flags may have to be atomically updated. To reflect the new reality,
> the owner is now changed to an atomic_long_t type.
> 
> New helper functions are added to properly separate out the task
> structure pointer and the embedded flags.

I started seeing KASAN use-after-free with current master, and a bisect
showed me that this commit 94a9717b3c40 ("locking/rwsem: Make
rwsem->owner an atomic_long_t") was the problem.  Does it ring any
bells?  I can easily reproduce it with xfstests (generic/464).

Cheers,
--
Luís

[ 6380.820179] run fstests generic/464 at 2019-07-19 12:04:05
[ 6381.504693] libceph: mon0 (1)192.168.155.1:40786 session established
[ 6381.506790] libceph: client4572 fsid 86b39301-7192-4052-8427-a241af35a591
[ 6381.618830] libceph: mon0 (1)192.168.155.1:40786 session established
[ 6381.619993] libceph: client4573 fsid 86b39301-7192-4052-8427-a241af35a591
[ 6384.464561] ==================================================================
[ 6384.466165] BUG: KASAN: use-after-free in rwsem_down_write_slowpath+0x67d/0x8a0
[ 6384.468288] Read of size 4 at addr ffff8881d5dc9478 by task xfs_io/17238

[ 6384.469545] CPU: 1 PID: 17238 Comm: xfs_io Not tainted 5.2.0+ #444
[ 6384.469550] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58-prebuilt.qemu.org 04/01/2014
[ 6384.469554] Call Trace:
[ 6384.469563]  dump_stack+0x5b/0x90
[ 6384.469569]  print_address_description+0x6f/0x332
[ 6384.469573]  ? rwsem_down_write_slowpath+0x67d/0x8a0
[ 6384.469575]  ? rwsem_down_write_slowpath+0x67d/0x8a0
[ 6384.469579]  __kasan_report.cold+0x1a/0x3e
[ 6384.469583]  ? rwsem_down_write_slowpath+0x67d/0x8a0
[ 6384.469588]  kasan_report+0xe/0x12
[ 6384.469591]  rwsem_down_write_slowpath+0x67d/0x8a0
[ 6384.469596]  ? __ceph_caps_issued_mask+0xe7/0x280
[ 6384.469599]  ? find_held_lock+0xc9/0xf0
[ 6384.469604]  ? __ceph_do_getattr+0x19f/0x290
[ 6384.469608]  ? down_read_non_owner+0x1c0/0x1c0
[ 6384.469612]  ? do_raw_spin_unlock+0xa3/0x130
[ 6384.469617]  ? _raw_spin_unlock+0x24/0x30
[ 6384.469622]  ? __lock_acquire.isra.0+0x486/0x770
[ 6384.469629]  ? path_openat+0x7ef/0xfe0
[ 6384.469635]  ? down_write+0x11e/0x130
[ 6384.469638]  down_write+0x11e/0x130
[ 6384.469642]  ? down_read_killable+0x1e0/0x1e0
[ 6384.469646]  ? __sb_start_write+0x11c/0x170
[ 6384.469650]  ? __mnt_want_write+0xb4/0xd0
[ 6384.469655]  path_openat+0x7ef/0xfe0
[ 6384.469661]  ? path_mountpoint+0x4d0/0x4d0
[ 6384.469667]  ? __is_insn_slot_addr+0x93/0xb0
[ 6384.469671]  ? kernel_text_address+0x113/0x120
[ 6384.469674]  ? __kernel_text_address+0xe/0x30
[ 6384.469679]  ? unwind_get_return_address+0x2f/0x50
[ 6384.469683]  ? swiotlb_map.cold+0x25/0x25
[ 6384.469687]  ? arch_stack_walk+0x8f/0xe0
[ 6384.469692]  do_filp_open+0x12b/0x1c0
[ 6384.469695]  ? may_open_dev+0x50/0x50
[ 6384.469702]  ? __alloc_fd+0x115/0x280
[ 6384.469705]  ? lock_downgrade+0x350/0x350
[ 6384.469709]  ? do_raw_spin_lock+0x113/0x1d0
[ 6384.469713]  ? rwlock_bug.part.0+0x60/0x60
[ 6384.469718]  ? do_raw_spin_unlock+0xa3/0x130
[ 6384.469722]  ? _raw_spin_unlock+0x24/0x30
[ 6384.469725]  ? __alloc_fd+0x115/0x280
[ 6384.469731]  do_sys_open+0x1f0/0x2d0
[ 6384.469735]  ? filp_open+0x50/0x50
[ 6384.469738]  ? switch_fpu_return+0x13e/0x230
[ 6384.469742]  ? __do_page_fault+0x4b5/0x670
[ 6384.469748]  do_syscall_64+0x63/0x1c0
[ 6384.469753]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 6384.469756] RIP: 0033:0x7fe961434528
[ 6384.469760] Code: 00 00 41 00 3d 00 00 41 00 74 47 48 8d 05 20 4d 0d 00 8b 00 85 c0 75 6b 44 89 e2 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 94 00 00 00 48 8b 4c 24 28 64 48 33 0c 25
[ 6384.469762] RSP: 002b:00007ffd9bbabb20 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
[ 6384.469765] RAX: ffffffffffffffda RBX: 0000000000000242 RCX: 00007fe961434528
[ 6384.469767] RDX: 0000000000000242 RSI: 00007ffd9bbae2a5 RDI: 00000000ffffff9c
[ 6384.469769] RBP: 00007ffd9bbae2a5 R08: 0000000000000001 R09: 0000000000000000
[ 6384.469771] R10: 0000000000000180 R11: 0000000000000246 R12: 0000000000000242
[ 6384.469773] R13: 00007ffd9bbabe00 R14: 0000000000000180 R15: 0000000000000060

[ 6384.470018] Allocated by task 16593:
[ 6384.470562]  __kasan_kmalloc.part.0+0x3c/0xa0
[ 6384.470565]  kmem_cache_alloc+0xdc/0x240
[ 6384.470569]  copy_process+0x1dce/0x27b0
[ 6384.470572]  _do_fork+0xec/0x540
[ 6384.470576]  __se_sys_clone+0xb2/0x100
[ 6384.470581]  do_syscall_64+0x63/0x1c0
[ 6384.470586]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

[ 6384.470823] Freed by task 9:
[ 6384.471235]  __kasan_slab_free+0x147/0x200
[ 6384.471240]  kmem_cache_free+0x111/0x330
[ 6384.471246]  rcu_core+0x2f9/0x830
[ 6384.471251]  __do_softirq+0x154/0x486

[ 6384.471493] The buggy address belongs to the object at ffff8881d5dc9440
                which belongs to the cache task_struct of size 4928
[ 6384.473081] The buggy address is located 56 bytes inside of
                4928-byte region [ffff8881d5dc9440, ffff8881d5dca780)
[ 6384.474453] The buggy address belongs to the page:
[ 6384.474989] page:ffffea0007577200 refcount:1 mapcount:0 mapping:ffff8881f6811800 index:0x0 compound_mapcount: 0
[ 6384.474993] flags: 0x8000000000010200(slab|head)
[ 6384.474997] raw: 8000000000010200 0000000000000000 0000000100000001 ffff8881f6811800
[ 6384.475000] raw: 0000000000000000 0000000000060006 00000001ffffffff 0000000000000000
[ 6384.475002] page dumped because: kasan: bad access detected

[ 6384.475176] Memory state around the buggy address:
[ 6384.475744]  ffff8881d5dc9300: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
[ 6384.476571]  ffff8881d5dc9380: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 6384.477390] >ffff8881d5dc9400: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
[ 6384.478214]                                                                 ^
[ 6384.479052]  ffff8881d5dc9480: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 6384.479898]  ffff8881d5dc9500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 6384.481300] ==================================================================
[ 6384.482408] Disabling lock debugging due to kernel taint


> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  include/linux/percpu-rwsem.h |   4 +-
>  include/linux/rwsem.h        |  11 +--
>  kernel/locking/rwsem.c       | 125 ++++++++++++++++++++++-------------
>  3 files changed, 88 insertions(+), 52 deletions(-)
> 
> diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
> index 03cb4b6f842e..0a43830f1932 100644
> --- a/include/linux/percpu-rwsem.h
> +++ b/include/linux/percpu-rwsem.h
> @@ -117,7 +117,7 @@ static inline void percpu_rwsem_release(struct percpu_rw_semaphore *sem,
>  	lock_release(&sem->rw_sem.dep_map, 1, ip);
>  #ifdef CONFIG_RWSEM_SPIN_ON_OWNER
>  	if (!read)
> -		sem->rw_sem.owner = RWSEM_OWNER_UNKNOWN;
> +		atomic_long_set(&sem->rw_sem.owner, RWSEM_OWNER_UNKNOWN);
>  #endif
>  }
>  
> @@ -127,7 +127,7 @@ static inline void percpu_rwsem_acquire(struct percpu_rw_semaphore *sem,
>  	lock_acquire(&sem->rw_sem.dep_map, 0, 1, read, 1, NULL, ip);
>  #ifdef CONFIG_RWSEM_SPIN_ON_OWNER
>  	if (!read)
> -		sem->rw_sem.owner = current;
> +		atomic_long_set(&sem->rw_sem.owner, (long)current);
>  #endif
>  }
>  
> diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
> index bb76e82398b2..e401358c4e7e 100644
> --- a/include/linux/rwsem.h
> +++ b/include/linux/rwsem.h
> @@ -35,10 +35,11 @@
>  struct rw_semaphore {
>  	atomic_long_t count;
>  	/*
> -	 * Write owner or one of the read owners. Can be used as a
> -	 * speculative check to see if the owner is running on the cpu.
> +	 * Write owner or one of the read owners as well flags regarding
> +	 * the current state of the rwsem. Can be used as a speculative
> +	 * check to see if the write owner is running on the cpu.
>  	 */
> -	struct task_struct *owner;
> +	atomic_long_t owner;
>  #ifdef CONFIG_RWSEM_SPIN_ON_OWNER
>  	struct optimistic_spin_queue osq; /* spinner MCS lock */
>  #endif
> @@ -53,7 +54,7 @@ struct rw_semaphore {
>   * Setting all bits of the owner field except bit 0 will indicate
>   * that the rwsem is writer-owned with an unknown owner.
>   */
> -#define RWSEM_OWNER_UNKNOWN	((struct task_struct *)-2L)
> +#define RWSEM_OWNER_UNKNOWN	(-2L)
>  
>  /* In all implementations count != 0 means locked */
>  static inline int rwsem_is_locked(struct rw_semaphore *sem)
> @@ -80,7 +81,7 @@ static inline int rwsem_is_locked(struct rw_semaphore *sem)
>  
>  #define __RWSEM_INITIALIZER(name)				\
>  	{ __RWSEM_INIT_COUNT(name),				\
> -	  .owner = NULL,					\
> +	  .owner = ATOMIC_LONG_INIT(0),				\
>  	  .wait_list = LIST_HEAD_INIT((name).wait_list),	\
>  	  .wait_lock = __RAW_SPIN_LOCK_UNLOCKED(name.wait_lock)	\
>  	  __RWSEM_OPT_INIT(name)				\
> diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
> index 9eb46ab9edaa..555da4868e54 100644
> --- a/kernel/locking/rwsem.c
> +++ b/kernel/locking/rwsem.c
> @@ -64,7 +64,7 @@
>  	if (!debug_locks_silent &&				\
>  	    WARN_ONCE(c, "DEBUG_RWSEMS_WARN_ON(%s): count = 0x%lx, owner = 0x%lx, curr 0x%lx, list %sempty\n",\
>  		#c, atomic_long_read(&(sem)->count),		\
> -		(long)((sem)->owner), (long)current,		\
> +		atomic_long_read(&(sem)->owner), (long)current,	\
>  		list_empty(&(sem)->wait_list) ? "" : "not "))	\
>  			debug_locks_off();			\
>  	} while (0)
> @@ -114,12 +114,20 @@
>   */
>  static inline void rwsem_set_owner(struct rw_semaphore *sem)
>  {
> -	WRITE_ONCE(sem->owner, current);
> +	atomic_long_set(&sem->owner, (long)current);
>  }
>  
>  static inline void rwsem_clear_owner(struct rw_semaphore *sem)
>  {
> -	WRITE_ONCE(sem->owner, NULL);
> +	atomic_long_set(&sem->owner, 0);
> +}
> +
> +/*
> + * Test the flags in the owner field.
> + */
> +static inline bool rwsem_test_oflags(struct rw_semaphore *sem, long flags)
> +{
> +	return atomic_long_read(&sem->owner) & flags;
>  }
>  
>  /*
> @@ -133,10 +141,9 @@ static inline void rwsem_clear_owner(struct rw_semaphore *sem)
>  static inline void __rwsem_set_reader_owned(struct rw_semaphore *sem,
>  					    struct task_struct *owner)
>  {
> -	unsigned long val = (unsigned long)owner | RWSEM_READER_OWNED
> -						 | RWSEM_NONSPINNABLE;
> +	long val = (long)owner | RWSEM_READER_OWNED | RWSEM_NONSPINNABLE;
>  
> -	WRITE_ONCE(sem->owner, (struct task_struct *)val);
> +	atomic_long_set(&sem->owner, val);
>  }
>  
>  static inline void rwsem_set_reader_owned(struct rw_semaphore *sem)
> @@ -145,13 +152,20 @@ static inline void rwsem_set_reader_owned(struct rw_semaphore *sem)
>  }
>  
>  /*
> - * Return true if the a rwsem waiter can spin on the rwsem's owner
> - * and steal the lock.
> - * N.B. !owner is considered spinnable.
> + * Return true if the rwsem is owned by a reader.
>   */
> -static inline bool is_rwsem_owner_spinnable(struct task_struct *owner)
> +static inline bool is_rwsem_reader_owned(struct rw_semaphore *sem)
>  {
> -	return !((unsigned long)owner & RWSEM_NONSPINNABLE);
> +#ifdef CONFIG_DEBUG_RWSEMS
> +	/*
> +	 * Check the count to see if it is write-locked.
> +	 */
> +	long count = atomic_long_read(&sem->count);
> +
> +	if (count & RWSEM_WRITER_MASK)
> +		return false;
> +#endif
> +	return rwsem_test_oflags(sem, RWSEM_READER_OWNED);
>  }
>  
>  #ifdef CONFIG_DEBUG_RWSEMS
> @@ -163,11 +177,13 @@ static inline bool is_rwsem_owner_spinnable(struct task_struct *owner)
>   */
>  static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
>  {
> -	unsigned long val = (unsigned long)current | RWSEM_READER_OWNED
> -						   | RWSEM_NONSPINNABLE;
> -	if (READ_ONCE(sem->owner) == (struct task_struct *)val)
> -		cmpxchg_relaxed((unsigned long *)&sem->owner, val,
> -				RWSEM_READER_OWNED | RWSEM_NONSPINNABLE);
> +	long val = atomic_long_read(&sem->owner);
> +
> +	while ((val & ~RWSEM_OWNER_FLAGS_MASK) == (long)current) {
> +		if (atomic_long_try_cmpxchg(&sem->owner, &val,
> +					    val & RWSEM_OWNER_FLAGS_MASK))
> +			return;
> +	}
>  }
>  #else
>  static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
> @@ -175,6 +191,28 @@ static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
>  }
>  #endif
>  
> +/*
> + * Return just the real task structure pointer of the owner
> + */
> +static inline struct task_struct *rwsem_read_owner(struct rw_semaphore *sem)
> +{
> +	return (struct task_struct *)(atomic_long_read(&sem->owner) &
> +					~RWSEM_OWNER_FLAGS_MASK);
> +}
> +
> +/*
> + * Return the real task structure pointer of the owner and the embedded
> + * flags in the owner. pflags must be non-NULL.
> + */
> +static inline struct task_struct *
> +rwsem_read_owner_flags(struct rw_semaphore *sem, long *pflags)
> +{
> +	long owner = atomic_long_read(&sem->owner);
> +
> +	*pflags = owner & RWSEM_OWNER_FLAGS_MASK;
> +	return (struct task_struct *)(owner & ~RWSEM_OWNER_FLAGS_MASK);
> +}
> +
>  /*
>   * Guide to the rw_semaphore's count field.
>   *
> @@ -208,7 +246,7 @@ void __init_rwsem(struct rw_semaphore *sem, const char *name,
>  	atomic_long_set(&sem->count, RWSEM_UNLOCKED_VALUE);
>  	raw_spin_lock_init(&sem->wait_lock);
>  	INIT_LIST_HEAD(&sem->wait_list);
> -	sem->owner = NULL;
> +	atomic_long_set(&sem->owner, 0L);
>  #ifdef CONFIG_RWSEM_SPIN_ON_OWNER
>  	osq_lock_init(&sem->osq);
>  #endif
> @@ -511,9 +549,10 @@ static inline bool owner_on_cpu(struct task_struct *owner)
>  static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
>  {
>  	struct task_struct *owner;
> +	long flags;
>  	bool ret = true;
>  
> -	BUILD_BUG_ON(is_rwsem_owner_spinnable(RWSEM_OWNER_UNKNOWN));
> +	BUILD_BUG_ON(!(RWSEM_OWNER_UNKNOWN & RWSEM_NONSPINNABLE));
>  
>  	if (need_resched()) {
>  		lockevent_inc(rwsem_opt_fail);
> @@ -522,11 +561,9 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
>  
>  	preempt_disable();
>  	rcu_read_lock();
> -	owner = READ_ONCE(sem->owner);
> -	if (owner) {
> -		ret = is_rwsem_owner_spinnable(owner) &&
> -		      owner_on_cpu(owner);
> -	}
> +	owner = rwsem_read_owner_flags(sem, &flags);
> +	if ((flags & RWSEM_NONSPINNABLE) || (owner && !owner_on_cpu(owner)))
> +		ret = false;
>  	rcu_read_unlock();
>  	preempt_enable();
>  
> @@ -553,25 +590,26 @@ enum owner_state {
>  };
>  #define OWNER_SPINNABLE		(OWNER_NULL | OWNER_WRITER)
>  
> -static inline enum owner_state rwsem_owner_state(unsigned long owner)
> +static inline enum owner_state rwsem_owner_state(struct task_struct *owner,
> +						 long flags)
>  {
> -	if (!owner)
> -		return OWNER_NULL;
> -
> -	if (owner & RWSEM_NONSPINNABLE)
> +	if (flags & RWSEM_NONSPINNABLE)
>  		return OWNER_NONSPINNABLE;
>  
> -	if (owner & RWSEM_READER_OWNED)
> +	if (flags & RWSEM_READER_OWNED)
>  		return OWNER_READER;
>  
> -	return OWNER_WRITER;
> +	return owner ? OWNER_WRITER : OWNER_NULL;
>  }
>  
>  static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
>  {
> -	struct task_struct *tmp, *owner = READ_ONCE(sem->owner);
> -	enum owner_state state = rwsem_owner_state((unsigned long)owner);
> +	struct task_struct *new, *owner;
> +	long flags, new_flags;
> +	enum owner_state state;
>  
> +	owner = rwsem_read_owner_flags(sem, &flags);
> +	state = rwsem_owner_state(owner, flags);
>  	if (state != OWNER_WRITER)
>  		return state;
>  
> @@ -582,9 +620,9 @@ static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
>  			break;
>  		}
>  
> -		tmp = READ_ONCE(sem->owner);
> -		if (tmp != owner) {
> -			state = rwsem_owner_state((unsigned long)tmp);
> +		new = rwsem_read_owner_flags(sem, &new_flags);
> +		if ((new != owner) || (new_flags != flags)) {
> +			state = rwsem_owner_state(new, new_flags);
>  			break;
>  		}
>  
> @@ -1001,8 +1039,7 @@ inline void __down_read(struct rw_semaphore *sem)
>  	if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
>  			&sem->count) & RWSEM_READ_FAILED_MASK)) {
>  		rwsem_down_read_slowpath(sem, TASK_UNINTERRUPTIBLE);
> -		DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
> -					RWSEM_READER_OWNED), sem);
> +		DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
>  	} else {
>  		rwsem_set_reader_owned(sem);
>  	}
> @@ -1014,8 +1051,7 @@ static inline int __down_read_killable(struct rw_semaphore *sem)
>  			&sem->count) & RWSEM_READ_FAILED_MASK)) {
>  		if (IS_ERR(rwsem_down_read_slowpath(sem, TASK_KILLABLE)))
>  			return -EINTR;
> -		DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
> -					RWSEM_READER_OWNED), sem);
> +		DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
>  	} else {
>  		rwsem_set_reader_owned(sem);
>  	}
> @@ -1084,7 +1120,7 @@ inline void __up_read(struct rw_semaphore *sem)
>  {
>  	long tmp;
>  
> -	DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner & RWSEM_READER_OWNED), sem);
> +	DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
>  	rwsem_clear_reader_owned(sem);
>  	tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
>  	if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS)) ==
> @@ -1103,8 +1139,8 @@ static inline void __up_write(struct rw_semaphore *sem)
>  	 * sem->owner may differ from current if the ownership is transferred
>  	 * to an anonymous writer by setting the RWSEM_NONSPINNABLE bits.
>  	 */
> -	DEBUG_RWSEMS_WARN_ON((sem->owner != current) &&
> -			    !((long)sem->owner & RWSEM_NONSPINNABLE), sem);
> +	DEBUG_RWSEMS_WARN_ON((rwsem_read_owner(sem) != current) &&
> +			    !rwsem_test_oflags(sem, RWSEM_NONSPINNABLE), sem);
>  	rwsem_clear_owner(sem);
>  	tmp = atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED, &sem->count);
>  	if (unlikely(tmp & RWSEM_FLAG_WAITERS))
> @@ -1125,7 +1161,7 @@ static inline void __downgrade_write(struct rw_semaphore *sem)
>  	 * read-locked region is ok to be re-ordered into the
>  	 * write side. As such, rely on RELEASE semantics.
>  	 */
> -	DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
> +	DEBUG_RWSEMS_WARN_ON(rwsem_read_owner(sem) != current, sem);
>  	tmp = atomic_long_fetch_add_release(
>  		-RWSEM_WRITER_LOCKED+RWSEM_READER_BIAS, &sem->count);
>  	rwsem_set_reader_owned(sem);
> @@ -1296,8 +1332,7 @@ EXPORT_SYMBOL(down_write_killable_nested);
>  
>  void up_read_non_owner(struct rw_semaphore *sem)
>  {
> -	DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner & RWSEM_READER_OWNED),
> -				sem);
> +	DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
>  	__up_read(sem);
>  }
>  EXPORT_SYMBOL(up_read_non_owner);
> -- 
> 2.18.1
> 


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v8 13/19] locking/rwsem: Make rwsem->owner an atomic_long_t
  2019-07-19 18:45   ` [PATCH v8 13/19] " Luis Henriques
@ 2019-07-19 19:32     ` Waiman Long
  2019-07-19 19:45       ` Luis Henriques
  2019-07-19 19:51       ` Linus Torvalds
  0 siblings, 2 replies; 83+ messages in thread
From: Waiman Long @ 2019-07-19 19:32 UTC (permalink / raw)
  To: Luis Henriques
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On 7/19/19 2:45 PM, Luis Henriques wrote:
> On Mon, May 20, 2019 at 04:59:12PM -0400, Waiman Long wrote:
>> The rwsem->owner contains not just the task structure pointer, it also
>> holds some flags for storing the current state of the rwsem. Some of
>> the flags may have to be atomically updated. To reflect the new reality,
>> the owner is now changed to an atomic_long_t type.
>>
>> New helper functions are added to properly separate out the task
>> structure pointer and the embedded flags.
> I started seeing KASAN use-after-free with current master, and a bisect
> showed me that this commit 94a9717b3c40 ("locking/rwsem: Make
> rwsem->owner an atomic_long_t") was the problem.  Does it ring any
> bells?  I can easily reproduce it with xfstests (generic/464).
>
> Cheers,
> --
> Luís

This patch shouldn't change the behavior of the rwsem code. The code
only accesses data within the rw_semaphore structures, so I don't know
why it would cause a KASAN error. I will have to reproduce it and figure
out exactly which statement is doing the invalid access.

Thanks,
Longman


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v8 13/19] locking/rwsem: Make rwsem->owner an atomic_long_t
  2019-07-19 19:32     ` Waiman Long
@ 2019-07-19 19:45       ` Luis Henriques
  2019-07-19 20:14         ` Waiman Long
  2019-07-19 19:51       ` Linus Torvalds
  1 sibling, 1 reply; 83+ messages in thread
From: Luis Henriques @ 2019-07-19 19:45 UTC (permalink / raw)
  To: Waiman Long
  Cc: Borislav Petkov, Will Deacon, huang ying, Peter Zijlstra, x86,
	Thomas Gleixner, Linus Torvalds, Tim Chen, Ingo Molnar,
	Davidlohr Bueso, linux-kernel, H. Peter Anvin

Waiman Long <longman@redhat.com> writes:

> On 7/19/19 2:45 PM, Luis Henriques wrote:
>> On Mon, May 20, 2019 at 04:59:12PM -0400, Waiman Long wrote:
>>> The rwsem->owner contains not just the task structure pointer, it also
>>> holds some flags for storing the current state of the rwsem. Some of
>>> the flags may have to be atomically updated. To reflect the new reality,
>>> the owner is now changed to an atomic_long_t type.
>>>
>>> New helper functions are added to properly separate out the task
>>> structure pointer and the embedded flags.
>> I started seeing KASAN use-after-free with current master, and a bisect
>> showed me that this commit 94a9717b3c40 ("locking/rwsem: Make
>> rwsem->owner an atomic_long_t") was the problem.  Does it ring any
>> bells?  I can easily reproduce it with xfstests (generic/464).
>>
>> Cheers,
>> --
>> Luís
>
> This patch shouldn't change the behavior of the rwsem code. The code
> only access data within the rw_semaphore structures. I don't know why it
> will cause a KASAN error. I will have to reproduce it and figure out
> exactly which statement is doing the invalid access.

Yeah, screwing up a bisection is something I've done in the past, so I
may have got the wrong commit.  Another detail is that I was running
xfstests against CephFS; I didn't try any other filesystem.  I can try
to reproduce with btrfs or xfs next week.

Cheers,
-- 
Luis

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v8 13/19] locking/rwsem: Make rwsem->owner an atomic_long_t
  2019-07-19 19:32     ` Waiman Long
  2019-07-19 19:45       ` Luis Henriques
@ 2019-07-19 19:51       ` Linus Torvalds
  2019-07-20  8:41         ` Luis Henriques
  1 sibling, 1 reply; 83+ messages in thread
From: Linus Torvalds @ 2019-07-19 19:51 UTC (permalink / raw)
  To: Waiman Long
  Cc: Luis Henriques, Peter Zijlstra, Ingo Molnar, Will Deacon,
	Thomas Gleixner, Borislav Petkov, H. Peter Anvin,
	Linux List Kernel Mailing, the arch/x86 maintainers,
	Davidlohr Bueso, Tim Chen, huang ying

On Fri, Jul 19, 2019 at 12:32 PM Waiman Long <longman@redhat.com> wrote:
>
> This patch shouldn't change the behavior of the rwsem code. The code
> only access data within the rw_semaphore structures. I don't know why it
> will cause a KASAN error. I will have to reproduce it and figure out
> exactly which statement is doing the invalid access.

The stack traces should show line numbers if you run them through
scripts/decode_stacktrace.sh.

You need to have debug info enabled for that, though.
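
Something along these lines usually does the trick (the vmlinux path and
file names here are only illustrative, and the kernel has to be built
with CONFIG_DEBUG_INFO=y):

    $ ./scripts/decode_stacktrace.sh vmlinux < kasan-splat.txt > decoded.txt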

Luis?

             Linus

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v8 13/19] locking/rwsem: Make rwsem->owner an atomic_long_t
  2019-07-19 19:45       ` Luis Henriques
@ 2019-07-19 20:14         ` Waiman Long
  0 siblings, 0 replies; 83+ messages in thread
From: Waiman Long @ 2019-07-19 20:14 UTC (permalink / raw)
  To: Luis Henriques
  Cc: Borislav Petkov, Will Deacon, huang ying, Peter Zijlstra, x86,
	Thomas Gleixner, Linus Torvalds, Tim Chen, Ingo Molnar,
	Davidlohr Bueso, linux-kernel, H. Peter Anvin

On 7/19/19 3:45 PM, Luis Henriques wrote:
> Waiman Long <longman@redhat.com> writes:
>
>> On 7/19/19 2:45 PM, Luis Henriques wrote:
>>> On Mon, May 20, 2019 at 04:59:12PM -0400, Waiman Long wrote:
>>>> The rwsem->owner contains not just the task structure pointer, it also
>>>> holds some flags for storing the current state of the rwsem. Some of
>>>> the flags may have to be atomically updated. To reflect the new reality,
>>>> the owner is now changed to an atomic_long_t type.
>>>>
>>>> New helper functions are added to properly separate out the task
>>>> structure pointer and the embedded flags.
>>> I started seeing KASAN use-after-free with current master, and a bisect
>>> showed me that this commit 94a9717b3c40 ("locking/rwsem: Make
>>> rwsem->owner an atomic_long_t") was the problem.  Does it ring any
>>> bells?  I can easily reproduce it with xfstests (generic/464).
>>>
>>> Cheers,
>>> --
>>> Luís
>> This patch shouldn't change the behavior of the rwsem code. The code
>> only access data within the rw_semaphore structures. I don't know why it
>> will cause a KASAN error. I will have to reproduce it and figure out
>> exactly which statement is doing the invalid access.
> Yeah, screwing the bisection is something I've done in the past so I may
> have got the wrong commit.  Another detail is that I was running
> xfstests against CephFS, I didn't tried with any other filesystem.  I
> can try to reproduce with btrfs or xfs next week.
>
> Cheers,

Oh, I don't have a CephFS setup. Could you run the trace through
scripts/decode_stacktrace.sh to find out which line the offending
statement is on? That will help in figuring out what has gone wrong.

Anyway, it seems like a structure that includes a rwsem is freed while
another cpu is still waiting to acquire the lock. It is probably a
hidden bug somewhere in the filesystem code that the recent changes in
rwsem behavior just make easier to trigger.
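
As a minimal sketch of that suspected pattern (the struct and function
names below are made up, not code from any real filesystem):

    #include <linux/rwsem.h>
    #include <linux/slab.h>

    struct foo {
            struct rw_semaphore sem;
            /* ... */
    };

    /* Task A tears the object down ... */
    static void foo_destroy(struct foo *f)
    {
            up_write(&f->sem);
            kfree(f);               /* frees the embedded rwsem */
    }

    /*
     * ... while task B is still blocked in down_write(&f->sem).
     * When B resumes, the rwsem slowpath pokes at freed memory and
     * KASAN reports a use-after-free.
     */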

Cheers,
Longman


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v8 13/19] locking/rwsem: Make rwsem->owner an atomic_long_t
  2019-07-19 19:51       ` Linus Torvalds
@ 2019-07-20  8:41         ` Luis Henriques
  2019-07-20  9:32           ` Luis Henriques
                             ` (2 more replies)
  0 siblings, 3 replies; 83+ messages in thread
From: Luis Henriques @ 2019-07-20  8:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Waiman Long, Borislav Petkov, Will Deacon, huang ying,
	Peter Zijlstra, the arch/x86 maintainers, Thomas Gleixner,
	Tim Chen, Ingo Molnar, Davidlohr Bueso,
	Linux List Kernel Mailing, H. Peter Anvin, Jeff Layton

"Linus Torvalds" <torvalds@linux-foundation.org> writes:

> On Fri, Jul 19, 2019 at 12:32 PM Waiman Long <longman@redhat.com> wrote:
>>
>> This patch shouldn't change the behavior of the rwsem code. The code
>> only access data within the rw_semaphore structures. I don't know why it
>> will cause a KASAN error. I will have to reproduce it and figure out
>> exactly which statement is doing the invalid access.
>
> The stack traces should show line numbers if you run them through
> scripts/decode_stacktrace.sh.
>
> You need to have debug info enabled for that, though.
>
> Luis?
>
>              Linus

Yep, sure.  And I should have done this in the initial report.  It's a
different trace, as I had to recompile the kernel.

(I'm also adding Jeff to the CC list.)

Cheers,
-- 
Luis

[   39.801179] ==================================================================
[   39.801973] BUG: KASAN: use-after-free in rwsem_down_write_slowpath (/home/miguel/kernel/linux/kernel/locking/rwsem.c:669 /home/miguel/kernel/linux/kernel/locking/rwsem.c:1125) 
[   39.802733] Read of size 4 at addr ffff8881f1f65138 by task xfs_io/2145

[   39.803598] CPU: 0 PID: 2145 Comm: xfs_io Not tainted 5.2.0+ #460
[   39.803600] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58-prebuilt.qemu.org 04/01/2014
[   39.803602] Call Trace:
[   39.803609] dump_stack (/home/miguel/kernel/linux/lib/dump_stack.c:115) 
[   39.803615] print_address_description (/home/miguel/kernel/linux/mm/kasan/report.c:352) 
[   39.803618] ? rwsem_down_write_slowpath (/home/miguel/kernel/linux/kernel/locking/rwsem.c:669 /home/miguel/kernel/linux/kernel/locking/rwsem.c:1125) 
[   39.803621] ? rwsem_down_write_slowpath (/home/miguel/kernel/linux/kernel/locking/rwsem.c:669 /home/miguel/kernel/linux/kernel/locking/rwsem.c:1125) 
[   39.803624] __kasan_report.cold (/home/miguel/kernel/linux/mm/kasan/report.c:483) 
[   39.803629] ? rwsem_down_write_slowpath (/home/miguel/kernel/linux/kernel/locking/rwsem.c:669 /home/miguel/kernel/linux/kernel/locking/rwsem.c:1125) 
[   39.803633] kasan_report (/home/miguel/kernel/linux/./arch/x86/include/asm/smap.h:69 /home/miguel/kernel/linux/mm/kasan/common.c:613) 
[   39.803636] rwsem_down_write_slowpath (/home/miguel/kernel/linux/kernel/locking/rwsem.c:669 /home/miguel/kernel/linux/kernel/locking/rwsem.c:1125) 
[   39.803641] ? __ceph_caps_issued_mask (/home/miguel/kernel/linux/fs/ceph/caps.c:914) 
[   39.803644] ? find_held_lock (/home/miguel/kernel/linux/kernel/locking/lockdep.c:4004) 
[   39.803649] ? __ceph_do_getattr (/home/miguel/kernel/linux/fs/ceph/inode.c:2246) 
[   39.803653] ? down_read_non_owner (/home/miguel/kernel/linux/kernel/locking/rwsem.c:1116) 
[   39.803658] ? do_raw_spin_unlock (/home/miguel/kernel/linux/./include/linux/compiler.h:218 /home/miguel/kernel/linux/./include/asm-generic/qspinlock.h:94 /home/miguel/kernel/linux/kernel/locking/spinlock_debug.c:139) 
[   39.803663] ? _raw_spin_unlock (/home/miguel/kernel/linux/kernel/locking/spinlock.c:184) 
[   39.803667] ? __lock_acquire.isra.0 (/home/miguel/kernel/linux/kernel/locking/lockdep.c:3884) 
[   39.803674] ? path_openat (/home/miguel/kernel/linux/fs/namei.c:3322 /home/miguel/kernel/linux/fs/namei.c:3533) 
[   39.803680] ? down_write (/home/miguel/kernel/linux/kernel/locking/rwsem.c:1486) 
[   39.803683] down_write (/home/miguel/kernel/linux/kernel/locking/rwsem.c:1486) 
[   39.803687] ? down_read_killable (/home/miguel/kernel/linux/kernel/locking/rwsem.c:1482) 
[   39.803690] ? __sb_start_write (/home/miguel/kernel/linux/./include/linux/compiler.h:194 /home/miguel/kernel/linux/./include/linux/rcu_sync.h:38 /home/miguel/kernel/linux/./include/linux/percpu-rwsem.h:52 /home/miguel/kernel/linux/fs/super.c:1608) 
[   39.803694] ? __mnt_want_write (/home/miguel/kernel/linux/fs/namespace.c:253 /home/miguel/kernel/linux/fs/namespace.c:297 /home/miguel/kernel/linux/fs/namespace.c:337) 
[   39.803699] path_openat (/home/miguel/kernel/linux/fs/namei.c:3322 /home/miguel/kernel/linux/fs/namei.c:3533) 
[   39.803706] ? path_mountpoint (/home/miguel/kernel/linux/fs/namei.c:3518) 
[   39.803711] ? __is_insn_slot_addr (/home/miguel/kernel/linux/kernel/kprobes.c:291) 
[   39.803716] ? kernel_text_address (/home/miguel/kernel/linux/kernel/extable.c:113) 
[   39.803719] ? __kernel_text_address (/home/miguel/kernel/linux/kernel/extable.c:95) 
[   39.803724] ? unwind_get_return_address (/home/miguel/kernel/linux/arch/x86/kernel/unwind_orc.c:311 /home/miguel/kernel/linux/arch/x86/kernel/unwind_orc.c:306) 
[   39.803727] ? swiotlb_map.cold (/home/miguel/kernel/linux/kernel/stacktrace.c:83) 
[   39.803730] ? arch_stack_walk (/home/miguel/kernel/linux/arch/x86/kernel/stacktrace.c:26) 
[   39.803735] do_filp_open (/home/miguel/kernel/linux/fs/namei.c:3563) 
[   39.803739] ? may_open_dev (/home/miguel/kernel/linux/fs/namei.c:3557) 
[   39.803746] ? __alloc_fd (/home/miguel/kernel/linux/fs/file.c:536) 
[   39.803749] ? lock_downgrade (/home/miguel/kernel/linux/kernel/locking/lockdep.c:4422) 
[   39.803753] ? do_raw_spin_lock (/home/miguel/kernel/linux/kernel/locking/spinlock_debug.c:92 /home/miguel/kernel/linux/kernel/locking/spinlock_debug.c:115) 
[   39.803757] ? rwlock_bug.part.0 (/home/miguel/kernel/linux/kernel/locking/spinlock_debug.c:111) 
[   39.803762] ? do_raw_spin_unlock (/home/miguel/kernel/linux/./include/linux/compiler.h:218 /home/miguel/kernel/linux/./include/asm-generic/qspinlock.h:94 /home/miguel/kernel/linux/kernel/locking/spinlock_debug.c:139) 
[   39.803766] ? _raw_spin_unlock (/home/miguel/kernel/linux/kernel/locking/spinlock.c:184) 
[   39.803769] ? __alloc_fd (/home/miguel/kernel/linux/fs/file.c:536) 
[   39.803774] do_sys_open (/home/miguel/kernel/linux/fs/open.c:1070) 
[   39.803778] ? filp_open (/home/miguel/kernel/linux/fs/open.c:1056) 
[   39.803781] ? switch_fpu_return (/home/miguel/kernel/linux/./arch/x86/include/asm/bitops.h:76 /home/miguel/kernel/linux/./include/asm-generic/bitops-instrumented.h:57 /home/miguel/kernel/linux/./include/linux/thread_info.h:60 /home/miguel/kernel/linux/./arch/x86/include/asm/fpu/internal.h:547 /home/miguel/kernel/linux/arch/x86/kernel/fpu/core.c:343) 
[   39.803786] ? __do_page_fault (/home/miguel/kernel/linux/./include/linux/compiler.h:194 /home/miguel/kernel/linux/./arch/x86/include/asm/atomic.h:31 /home/miguel/kernel/linux/./include/asm-generic/atomic-instrumented.h:27 /home/miguel/kernel/linux/./include/linux/jump_label.h:254 /home/miguel/kernel/linux/./include/linux/jump_label.h:264 /home/miguel/kernel/linux/./include/linux/perf_event.h:1094 /home/miguel/kernel/linux/arch/x86/mm/fault.c:1485 /home/miguel/kernel/linux/arch/x86/mm/fault.c:1510) 
[   39.803792] do_syscall_64 (/home/miguel/kernel/linux/arch/x86/entry/common.c:296) 
[   39.803796] entry_SYSCALL_64_after_hwframe (/home/miguel/kernel/linux/arch/x86/entry/entry_64.S:184) 
[   39.803799] RIP: 0033:0x7f62b41a2528
[ 39.803803] Code: 00 00 41 00 3d 00 00 41 00 74 47 48 8d 05 20 4d 0d 00 8b 00 85 c0 75 6b 44 89 e2 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 94 00 00 00 48 8b 4c 24 28 64 48 33 0c 25
All code
========
   0:	00 00                	add    %al,(%rax)
   2:	41 00 3d 00 00 41 00 	add    %dil,0x410000(%rip)        # 0x410009
   9:	74 47                	je     0x52
   b:	48 8d 05 20 4d 0d 00 	lea    0xd4d20(%rip),%rax        # 0xd4d32
  12:	8b 00                	mov    (%rax),%eax
  14:	85 c0                	test   %eax,%eax
  16:	75 6b                	jne    0x83
  18:	44 89 e2             	mov    %r12d,%edx
  1b:	48 89 ee             	mov    %rbp,%rsi
  1e:	bf 9c ff ff ff       	mov    $0xffffff9c,%edi
  23:	b8 01 01 00 00       	mov    $0x101,%eax
  28:	0f 05                	syscall 
  2a:*	48 3d 00 f0 ff ff    	cmp    $0xfffffffffffff000,%rax		<-- trapping instruction
  30:	0f 87 94 00 00 00    	ja     0xca
  36:	48 8b 4c 24 28       	mov    0x28(%rsp),%rcx
  3b:	64                   	fs
  3c:	48                   	rex.W
  3d:	33                   	.byte 0x33
  3e:	0c 25                	or     $0x25,%al

Code starting with the faulting instruction
===========================================
   0:	48 3d 00 f0 ff ff    	cmp    $0xfffffffffffff000,%rax
   6:	0f 87 94 00 00 00    	ja     0xa0
   c:	48 8b 4c 24 28       	mov    0x28(%rsp),%rcx
  11:	64                   	fs
  12:	48                   	rex.W
  13:	33                   	.byte 0x33
  14:	0c 25                	or     $0x25,%al
[   39.803805] RSP: 002b:00007ffe6c3359e0 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
[   39.803808] RAX: ffffffffffffffda RBX: 0000000000000242 RCX: 00007f62b41a2528
[   39.803810] RDX: 0000000000000242 RSI: 00007ffe6c3382a5 RDI: 00000000ffffff9c
[   39.803812] RBP: 00007ffe6c3382a5 R08: 0000000000000001 R09: 0000000000000000
[   39.803814] R10: 0000000000000180 R11: 0000000000000246 R12: 0000000000000242
[   39.803816] R13: 00007ffe6c335cc0 R14: 0000000000000180 R15: 0000000000000060

[   39.803996] Allocated by task 2093:
[   39.804373] __kasan_kmalloc.part.0 (/home/miguel/kernel/linux/mm/kasan/common.c:69 /home/miguel/kernel/linux/mm/kasan/common.c:77 /home/miguel/kernel/linux/mm/kasan/common.c:487) 
[   39.804376] kmem_cache_alloc (/home/miguel/kernel/linux/mm/slab.h:522 /home/miguel/kernel/linux/mm/slub.c:2766 /home/miguel/kernel/linux/mm/slub.c:2774 /home/miguel/kernel/linux/mm/slub.c:2779) 
[   39.804380] copy_process (/home/miguel/kernel/linux/kernel/fork.c:852 /home/miguel/kernel/linux/kernel/fork.c:1856) 
[   39.804382] _do_fork (/home/miguel/kernel/linux/kernel/fork.c:2369) 
[   39.804385] __se_sys_clone (/home/miguel/kernel/linux/kernel/fork.c:2505) 
[   39.804387] do_syscall_64 (/home/miguel/kernel/linux/arch/x86/entry/common.c:296) 
[   39.804390] entry_SYSCALL_64_after_hwframe (/home/miguel/kernel/linux/arch/x86/entry/entry_64.S:184) 

[   39.804558] Freed by task 16:
[   39.804871] __kasan_slab_free (/home/miguel/kernel/linux/mm/kasan/common.c:69 /home/miguel/kernel/linux/mm/kasan/common.c:77 /home/miguel/kernel/linux/mm/kasan/common.c:449) 
[   39.804874] kmem_cache_free (/home/miguel/kernel/linux/mm/slub.c:1470 /home/miguel/kernel/linux/mm/slub.c:3012 /home/miguel/kernel/linux/mm/slub.c:3028) 
[   39.804877] rcu_core (/home/miguel/kernel/linux/./include/linux/rcupdate.h:213 /home/miguel/kernel/linux/kernel/rcu/rcu.h:223 /home/miguel/kernel/linux/kernel/rcu/tree.c:2114 /home/miguel/kernel/linux/kernel/rcu/tree.c:2314) 
[   39.804880] __do_softirq (/home/miguel/kernel/linux/./include/asm-generic/atomic-instrumented.h:26 /home/miguel/kernel/linux/./include/linux/jump_label.h:254 /home/miguel/kernel/linux/./include/linux/jump_label.h:264 /home/miguel/kernel/linux/./include/trace/events/irq.h:142 /home/miguel/kernel/linux/kernel/softirq.c:293) 

[   39.805048] The buggy address belongs to the object at ffff8881f1f65100
which belongs to the cache task_struct of size 4928
[   39.806345] The buggy address is located 56 bytes inside of
4928-byte region [ffff8881f1f65100, ffff8881f1f66440)
[   39.807543] The buggy address belongs to the page:
[   39.808045] page:ffffea0007c7d800 refcount:1 mapcount:0 mapping:ffff8881f6811800 index:0x0 compound_mapcount: 0
[   39.808049] flags: 0x8000000000010200(slab|head)
[   39.808053] raw: 8000000000010200 dead000000000100 dead000000000122 ffff8881f6811800
[   39.808056] raw: 0000000000000000 0000000000060006 00000001ffffffff 0000000000000000
[   39.808058] page dumped because: kasan: bad access detected

[   39.808224] Memory state around the buggy address:
[   39.808723]  ffff8881f1f65000: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[   39.809476]  ffff8881f1f65080: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[   39.810220] >ffff8881f1f65100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[   39.810968]                                         ^
[   39.811504]  ffff8881f1f65180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[   39.812237]  ffff8881f1f65200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[   39.812972] ==================================================================
[   39.813710] Disabling lock debugging due to kernel taint

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v8 13/19] locking/rwsem: Make rwsem->owner an atomic_long_t
  2019-07-20  8:41         ` Luis Henriques
@ 2019-07-20  9:32           ` Luis Henriques
  2019-07-20  9:45             ` Luis Henriques
  2019-07-20 11:10           ` Peter Zijlstra
  2019-07-20 15:04           ` Waiman Long
  2 siblings, 1 reply; 83+ messages in thread
From: Luis Henriques @ 2019-07-20  9:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Waiman Long, Borislav Petkov, Will Deacon, huang ying,
	Peter Zijlstra, the arch/x86 maintainers, Thomas Gleixner,
	Tim Chen, Ingo Molnar, Davidlohr Bueso,
	Linux List Kernel Mailing, H. Peter Anvin, Jeff Layton

Luis Henriques <lhenriques@suse.com> writes:

> "Linus Torvalds" <torvalds@linux-foundation.org> writes:
>
>> On Fri, Jul 19, 2019 at 12:32 PM Waiman Long <longman@redhat.com> wrote:
>>>
>>> This patch shouldn't change the behavior of the rwsem code. The code
>>> only access data within the rw_semaphore structures. I don't know why it
>>> will cause a KASAN error. I will have to reproduce it and figure out
>>> exactly which statement is doing the invalid access.
>>
>> The stack traces should show line numbers if you run them through
>> scripts/decode_stacktrace.sh.
>>
>> You need to have debug info enabled for that, though.
>>
>> Luis?
>>
>>              Linus
>
> Yep, sure.  And I should have done this in the initial report.  It's a
> different trace, I had to recompile the kernel.
>
> (I'm also adding Jeff to the CC list.)
>

Ah, and I also managed to reproduce this on btrfs so I guess this rules
out a bug in the filesystem code.

Cheers,
-- 
Luis

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v8 13/19] locking/rwsem: Make rwsem->owner an atomic_long_t
  2019-07-20  9:32           ` Luis Henriques
@ 2019-07-20  9:45             ` Luis Henriques
  0 siblings, 0 replies; 83+ messages in thread
From: Luis Henriques @ 2019-07-20  9:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Waiman Long, Borislav Petkov, Will Deacon, huang ying,
	Peter Zijlstra, the arch/x86 maintainers, Thomas Gleixner,
	Tim Chen, Ingo Molnar, Davidlohr Bueso,
	Linux List Kernel Mailing, H. Peter Anvin, Jeff Layton

Luis Henriques <lhenriques@suse.com> writes:

> Luis Henriques <lhenriques@suse.com> writes:
>
>> "Linus Torvalds" <torvalds@linux-foundation.org> writes:
>>
>>> On Fri, Jul 19, 2019 at 12:32 PM Waiman Long <longman@redhat.com> wrote:
>>>>
>>>> This patch shouldn't change the behavior of the rwsem code. The code
>>>> only access data within the rw_semaphore structures. I don't know why it
>>>> will cause a KASAN error. I will have to reproduce it and figure out
>>>> exactly which statement is doing the invalid access.
>>>
>>> The stack traces should show line numbers if you run them through
>>> scripts/decode_stacktrace.sh.
>>>
>>> You need to have debug info enabled for that, though.
>>>
>>> Luis?
>>>
>>>              Linus
>>
>> Yep, sure.  And I should have done this in the initial report.  It's a
>> different trace, I had to recompile the kernel.
>>
>> (I'm also adding Jeff to the CC list.)
>>
>
> Ah, and I also managed to reproduce this on btrfs so I guess this rules
> out a bug in the filesystem code.

Just another detail (before I go completely offline until tomorrow
evening): in the btrfs case I'm seeing the bug on the
rwsem_down_read_slowpath path, not on rwsem_down_write_slowpath.  But it
seems to be in the same place (i.e. rwsem_can_spin_on_owner).

Cheers,
-- 
Luis

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v8 13/19] locking/rwsem: Make rwsem->owner an atomic_long_t
  2019-07-20  8:41         ` Luis Henriques
  2019-07-20  9:32           ` Luis Henriques
@ 2019-07-20 11:10           ` Peter Zijlstra
  2019-07-20 15:04           ` Waiman Long
  2 siblings, 0 replies; 83+ messages in thread
From: Peter Zijlstra @ 2019-07-20 11:10 UTC (permalink / raw)
  To: Luis Henriques
  Cc: Linus Torvalds, Waiman Long, Borislav Petkov, Will Deacon,
	huang ying, the arch/x86 maintainers, Thomas Gleixner, Tim Chen,
	Ingo Molnar, Davidlohr Bueso, Linux List Kernel Mailing,
	H. Peter Anvin, Jeff Layton

On Sat, Jul 20, 2019 at 09:41:05AM +0100, Luis Henriques wrote:
> [   39.801179] ==================================================================
> [   39.801973] BUG: KASAN: use-after-free in rwsem_down_write_slowpath (/home/miguel/kernel/linux/kernel/locking/rwsem.c:669 /home/miguel/kernel/linux/kernel/locking/rwsem.c:1125) 

That's rwsem_can_spin_on_owner(); specifically, line 669 seems to point
at owner_on_cpu().

So we'd somehow have a dead owner; I'm not immediately seeing how that
can happen.
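
For context, owner_on_cpu() at that point looks roughly like the snippet
below (paraphrased from kernel/locking/rwsem.c of this series, so the
exact wording may differ), which is where a stale task pointer would get
dereferenced:

    static inline bool owner_on_cpu(struct task_struct *owner)
    {
            /*
             * Skip spinning if the owner task is not running on a CPU
             * or its CPU has been preempted; both checks dereference
             * the owner task_struct, which is what trips KASAN if that
             * task has already been freed.
             */
            return owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
    }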


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v8 13/19] locking/rwsem: Make rwsem->owner an atomic_long_t
  2019-07-20  8:41         ` Luis Henriques
  2019-07-20  9:32           ` Luis Henriques
  2019-07-20 11:10           ` Peter Zijlstra
@ 2019-07-20 15:04           ` Waiman Long
  2019-07-21 20:49             ` Luis Henriques
  2019-07-25 15:59             ` [tip:locking/core] locking/rwsem: Don't call owner_on_cpu() on read-owner tip-bot for Waiman Long
  2 siblings, 2 replies; 83+ messages in thread
From: Waiman Long @ 2019-07-20 15:04 UTC (permalink / raw)
  To: Luis Henriques, Linus Torvalds
  Cc: Borislav Petkov, Will Deacon, huang ying, Peter Zijlstra,
	the arch/x86 maintainers, Thomas Gleixner, Tim Chen, Ingo Molnar,
	Davidlohr Bueso, Linux List Kernel Mailing, H. Peter Anvin,
	Jeff Layton

[-- Attachment #1: Type: text/plain, Size: 1002 bytes --]

On 7/20/19 4:41 AM, Luis Henriques wrote:
> "Linus Torvalds" <torvalds@linux-foundation.org> writes:
>
>> On Fri, Jul 19, 2019 at 12:32 PM Waiman Long <longman@redhat.com> wrote:
>>> This patch shouldn't change the behavior of the rwsem code. The code
>>> only access data within the rw_semaphore structures. I don't know why it
>>> will cause a KASAN error. I will have to reproduce it and figure out
>>> exactly which statement is doing the invalid access.
>> The stack traces should show line numbers if you run them through
>> scripts/decode_stacktrace.sh.
>>
>> You need to have debug info enabled for that, though.
>>
>> Luis?
>>
>>              Linus
> Yep, sure.  And I should have done this in the initial report.  It's a
> different trace, I had to recompile the kernel.
>
> (I'm also adding Jeff to the CC list.)
>
> Cheers,

Thanks for the information. I think I know where the problem is. Would
you mind applying the attached patch to see if it fixes the KASAN error?

Thanks,
Longman



[-- Attachment #2: 0001-locking-rwsem-Don-t-call-owner_on_cpu-on-read-owner.patch --]
[-- Type: text/x-patch, Size: 1806 bytes --]

From 445dc58add71f976c20359568d370f5059d30f77 Mon Sep 17 00:00:00 2001
From: Waiman Long <longman@redhat.com>
Date: Sat, 20 Jul 2019 10:45:39 -0400
Subject: [PATCH] locking/rwsem: Don't call owner_on_cpu() on read-owner

For a writer, the owner value is cleared on unlock. For a reader, it is
left intact on unlock to provide a better debugging aid on crash dumps,
and because the unlock of one reader does not mean the lock is free.

As a result, owner_on_cpu() shouldn't be used on a read-owner, as the
task pointer value may not be valid and the task might already have been
freed. That is already handled in rwsem_spin_on_owner(), but not in
rwsem_can_spin_on_owner(). This can lead to a use-after-free error being
reported by KASAN. For example,

  BUG: KASAN: use-after-free in rwsem_down_write_slowpath
  (/home/miguel/kernel/linux/kernel/locking/rwsem.c:669
  /home/miguel/kernel/linux/kernel/locking/rwsem.c:1125)

Fix this by checking for the RWSEM_READER_OWNED flag before calling
owner_on_cpu().

Fixes: 94a9717b3c40e ("locking/rwsem: Make rwsem->owner an atomic_long_t")
Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/rwsem.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 37524a47f002..bc91aacaab58 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -666,7 +666,11 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem,
 	preempt_disable();
 	rcu_read_lock();
 	owner = rwsem_owner_flags(sem, &flags);
-	if ((flags & nonspinnable) || (owner && !owner_on_cpu(owner)))
+	/*
+	 * Don't check the read-owner as the entry may be stale.
+	 */
+	if ((flags & nonspinnable) ||
+	    (owner && !(flags & RWSEM_READER_OWNED) && !owner_on_cpu(owner)))
 		ret = false;
 	rcu_read_unlock();
 	preempt_enable();
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* Re: [PATCH v8 13/19] locking/rwsem: Make rwsem->owner an atomic_long_t
  2019-07-20 15:04           ` Waiman Long
@ 2019-07-21 20:49             ` Luis Henriques
  2019-07-23  2:57               ` Waiman Long
  2019-07-25 15:59             ` [tip:locking/core] locking/rwsem: Don't call owner_on_cpu() on read-owner tip-bot for Waiman Long
  1 sibling, 1 reply; 83+ messages in thread
From: Luis Henriques @ 2019-07-21 20:49 UTC (permalink / raw)
  To: Waiman Long
  Cc: Linus Torvalds, Borislav Petkov, Will Deacon, huang ying,
	Peter Zijlstra, Jeff Layton, the arch/x86 maintainers,
	Thomas Gleixner, Tim Chen, Ingo Molnar, Davidlohr Bueso,
	Linux List Kernel Mailing, H. Peter Anvin

Waiman Long <longman@redhat.com> writes:

> On 7/20/19 4:41 AM, Luis Henriques wrote:
>> "Linus Torvalds" <torvalds@linux-foundation.org> writes:
>>
>>> On Fri, Jul 19, 2019 at 12:32 PM Waiman Long <longman@redhat.com> wrote:
>>>> This patch shouldn't change the behavior of the rwsem code. The code
>>>> only access data within the rw_semaphore structures. I don't know why it
>>>> will cause a KASAN error. I will have to reproduce it and figure out
>>>> exactly which statement is doing the invalid access.
>>> The stack traces should show line numbers if you run them through
>>> scripts/decode_stacktrace.sh.
>>>
>>> You need to have debug info enabled for that, though.
>>>
>>> Luis?
>>>
>>>              Linus
>> Yep, sure.  And I should have done this in the initial report.  It's a
>> different trace, I had to recompile the kernel.
>>
>> (I'm also adding Jeff to the CC list.)
>>
>> Cheers,
>
> Thanks for the information. I think I know where the problem is. Would
> you mind applying the attached patch to see if it can fix the KASAN error.

Yep, that seems to work -- I can't reproduce the error anymore (and
sorry for the delay).  Thanks!  And feel free to add my Tested-by.

Cheers,
-- 
Luis

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v8 13/19] locking/rwsem: Make rwsem->owner an atomic_long_t
  2019-07-21 20:49             ` Luis Henriques
@ 2019-07-23  2:57               ` Waiman Long
  0 siblings, 0 replies; 83+ messages in thread
From: Waiman Long @ 2019-07-23  2:57 UTC (permalink / raw)
  To: Luis Henriques
  Cc: Linus Torvalds, Borislav Petkov, Will Deacon, huang ying,
	Peter Zijlstra, Jeff Layton, the arch/x86 maintainers,
	Thomas Gleixner, Tim Chen, Ingo Molnar, Davidlohr Bueso,
	Linux List Kernel Mailing, H. Peter Anvin

On 7/21/19 4:49 PM, Luis Henriques wrote:
> Waiman Long <longman@redhat.com> writes:
>
>> On 7/20/19 4:41 AM, Luis Henriques wrote:
>>> "Linus Torvalds" <torvalds@linux-foundation.org> writes:
>>>
>>>> On Fri, Jul 19, 2019 at 12:32 PM Waiman Long <longman@redhat.com> wrote:
>>>>> This patch shouldn't change the behavior of the rwsem code. The code
>>>>> only access data within the rw_semaphore structures. I don't know why it
>>>>> will cause a KASAN error. I will have to reproduce it and figure out
>>>>> exactly which statement is doing the invalid access.
>>>> The stack traces should show line numbers if you run them through
>>>> scripts/decode_stacktrace.sh.
>>>>
>>>> You need to have debug info enabled for that, though.
>>>>
>>>> Luis?
>>>>
>>>>              Linus
>>> Yep, sure.  And I should have done this in the initial report.  It's a
>>> different trace, I had to recompile the kernel.
>>>
>>> (I'm also adding Jeff to the CC list.)
>>>
>>> Cheers,
>> Thanks for the information. I think I know where the problem is. Would
>> you mind applying the attached patch to see if it can fix the KASAN error.
> Yep, that seems to work -- I can't reproduce the error anymore (and
> sorry for the delay).  Thanks!  And feel free to add my Tested-by.
>
> Cheers,

Thanks for the testing. I will post the official patch tomorrow.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 83+ messages in thread

* [tip:locking/core] locking/rwsem: Don't call owner_on_cpu() on read-owner
  2019-07-20 15:04           ` Waiman Long
  2019-07-21 20:49             ` Luis Henriques
@ 2019-07-25 15:59             ` tip-bot for Waiman Long
  1 sibling, 0 replies; 83+ messages in thread
From: tip-bot for Waiman Long @ 2019-07-25 15:59 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tglx, will.deacon, tim.c.chen, huang.ying.caritas, linux-kernel,
	lhenriques, hpa, longman, jlayton, bp, peterz, mingo, dave,
	torvalds

Commit-ID:  78134300579a45f527ca173ec8fdb4701b69f16e
Gitweb:     https://git.kernel.org/tip/78134300579a45f527ca173ec8fdb4701b69f16e
Author:     Waiman Long <longman@redhat.com>
AuthorDate: Sat, 20 Jul 2019 11:04:10 -0400
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 25 Jul 2019 15:39:22 +0200

locking/rwsem: Don't call owner_on_cpu() on read-owner

For a writer, the owner value is cleared on unlock. For a reader, it is
left intact on unlock to provide a better debugging aid on crash dumps,
and because the unlock of one reader does not mean the lock is free.

As a result, owner_on_cpu() shouldn't be used on a read-owner, as the
task pointer value may not be valid and the task might already have been
freed. That is already handled in rwsem_spin_on_owner(), but not in
rwsem_can_spin_on_owner(). This can lead to a use-after-free error being
reported by KASAN. For example,

  BUG: KASAN: use-after-free in rwsem_down_write_slowpath
  (/home/miguel/kernel/linux/kernel/locking/rwsem.c:669
  /home/miguel/kernel/linux/kernel/locking/rwsem.c:1125)

Fix this by checking for the RWSEM_READER_OWNED flag before calling
owner_on_cpu().

Reported-by: Luis Henriques <lhenriques@suse.com>
Tested-by: Luis Henriques <lhenriques@suse.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Fixes: 94a9717b3c40e ("locking/rwsem: Make rwsem->owner an atomic_long_t")
Link: https://lkml.kernel.org/r/81e82d5b-5074-77e8-7204-28479bbe0df0@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/locking/rwsem.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 37524a47f002..bc91aacaab58 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -666,7 +666,11 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem,
 	preempt_disable();
 	rcu_read_lock();
 	owner = rwsem_owner_flags(sem, &flags);
-	if ((flags & nonspinnable) || (owner && !owner_on_cpu(owner)))
+	/*
+	 * Don't check the read-owner as the entry may be stale.
+	 */
+	if ((flags & nonspinnable) ||
+	    (owner && !(flags & RWSEM_READER_OWNED) && !owner_on_cpu(owner)))
 		ret = false;
 	rcu_read_unlock();
 	preempt_enable();

^ permalink raw reply related	[flat|nested] 83+ messages in thread

end of thread, other threads:[~2019-07-25 16:00 UTC | newest]

Thread overview: 83+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-05-20 20:58 [PATCH v8 00/19] locking/rwsem: Rwsem rearchitecture part 2 Waiman Long
2019-05-20 20:59 ` [PATCH v8 01/19] locking/rwsem: Make owner available even if !CONFIG_RWSEM_SPIN_ON_OWNER Waiman Long
2019-06-17 14:23   ` [tip:locking/core] " tip-bot for Waiman Long
2019-05-20 20:59 ` [PATCH v8 02/19] locking/rwsem: Remove rwsem_wake() wakeup optimization Waiman Long
2019-06-17 14:24   ` [tip:locking/core] " tip-bot for Waiman Long
2019-05-20 20:59 ` [PATCH v8 03/19] locking/rwsem: Implement a new locking scheme Waiman Long
2019-06-17 14:24   ` [tip:locking/core] " tip-bot for Waiman Long
2019-05-20 20:59 ` [PATCH v8 04/19] locking/rwsem: Merge rwsem.h and rwsem-xadd.c into rwsem.c Waiman Long
2019-06-17 14:25   ` [tip:locking/core] " tip-bot for Waiman Long
2019-05-20 20:59 ` [PATCH v8 05/19] locking/rwsem: Code cleanup after files merging Waiman Long
2019-06-17 14:26   ` [tip:locking/core] " tip-bot for Waiman Long
2019-05-20 20:59 ` [PATCH v8 06/19] locking/rwsem: Make rwsem_spin_on_owner() return owner state Waiman Long
2019-06-17 14:27   ` [tip:locking/core] " tip-bot for Waiman Long
2019-05-20 20:59 ` [PATCH v8 07/19] locking/rwsem: Implement lock handoff to prevent lock starvation Waiman Long
2019-06-04  3:03   ` Yuyang Du
2019-06-04  3:26     ` Yuyang Du
2019-06-04  9:12       ` Boqun Feng
2019-06-04 16:00         ` Waiman Long
2019-06-05  7:48           ` Yuyang Du
2019-06-04 13:21       ` Waiman Long
2019-06-17 14:27   ` [tip:locking/core] " tip-bot for Waiman Long
2019-05-20 20:59 ` [PATCH v8 08/19] locking/rwsem: Always release wait_lock before waking up tasks Waiman Long
2019-06-17 14:28   ` [tip:locking/core] " tip-bot for Waiman Long
2019-05-20 20:59 ` [PATCH v8 09/19] locking/rwsem: More optimal RT task handling of null owner Waiman Long
2019-06-17 14:29   ` [tip:locking/core] " tip-bot for Waiman Long
2019-05-20 20:59 ` [PATCH v8 10/19] locking/rwsem: Wake up almost all readers in wait queue Waiman Long
2019-06-17 14:29   ` [tip:locking/core] " tip-bot for Waiman Long
2019-05-20 20:59 ` [PATCH v8 11/19] locking/rwsem: Clarify usage of owner's nonspinaable bit Waiman Long
2019-06-17 14:30   ` [tip:locking/core] " tip-bot for Waiman Long
2019-05-20 20:59 ` [PATCH v8 12/19] locking/rwsem: Enable readers spinning on writer Waiman Long
2019-06-17 14:31   ` [tip:locking/core] " tip-bot for Waiman Long
2019-05-20 20:59 ` [PATCH v8 13/19] locking/rwsem: Make rwsem->owner an atomic_long_t Waiman Long
2019-06-04  8:52   ` Peter Zijlstra
2019-06-04 15:44     ` Waiman Long
2019-06-17 14:32   ` [tip:locking/core] " tip-bot for Waiman Long
2019-07-19 18:45   ` [PATCH v8 13/19] " Luis Henriques
2019-07-19 19:32     ` Waiman Long
2019-07-19 19:45       ` Luis Henriques
2019-07-19 20:14         ` Waiman Long
2019-07-19 19:51       ` Linus Torvalds
2019-07-20  8:41         ` Luis Henriques
2019-07-20  9:32           ` Luis Henriques
2019-07-20  9:45             ` Luis Henriques
2019-07-20 11:10           ` Peter Zijlstra
2019-07-20 15:04           ` Waiman Long
2019-07-21 20:49             ` Luis Henriques
2019-07-23  2:57               ` Waiman Long
2019-07-25 15:59             ` [tip:locking/core] locking/rwsem: Don't call owner_on_cpu() on read-owner tip-bot for Waiman Long
2019-05-20 20:59 ` [PATCH v8 14/19] locking/rwsem: Enable time-based spinning on reader-owned rwsem Waiman Long
2019-06-04  9:03   ` Peter Zijlstra
2019-06-04 16:54     ` Waiman Long
2019-06-17 14:32   ` [tip:locking/core] " tip-bot for Waiman Long
2019-05-20 20:59 ` [PATCH v8 15/19] locking/rwsem: Adaptive disabling of reader optimistic spinning Waiman Long
2019-06-04  9:10   ` Peter Zijlstra
2019-06-04 17:28     ` Waiman Long
2019-06-04  9:14   ` Peter Zijlstra
2019-06-04 17:29     ` Waiman Long
2019-06-04  9:20   ` Peter Zijlstra
2019-06-04 17:30     ` Waiman Long
2019-06-04 17:38       ` Peter Zijlstra
2019-06-04 18:04         ` Waiman Long
2019-06-04 18:14           ` Peter Zijlstra
2019-06-04 18:21             ` Waiman Long
2019-06-05 18:13               ` Waiman Long
2019-06-05 20:19                 ` Peter Zijlstra
2019-06-05 20:52                   ` Linus Torvalds
2019-06-06  8:03                     ` Peter Zijlstra
2019-06-06  8:11                       ` Peter Zijlstra
2019-06-04 10:58   ` Peter Zijlstra
2019-06-17 14:33   ` [tip:locking/core] " tip-bot for Waiman Long
2019-05-20 20:59 ` [PATCH v8 16/19] locking/rwsem: Guard against making count negative Waiman Long
2019-06-11 13:11   ` Peter Zijlstra
2019-06-11 13:27     ` Peter Zijlstra
2019-06-11 13:13   ` Peter Zijlstra
2019-06-17 14:34   ` [tip:locking/core] " tip-bot for Waiman Long
2019-05-20 20:59 ` [PATCH v8 17/19] locking/rwsem: Merge owner into count on x86-64 Waiman Long
2019-06-04  9:45   ` Peter Zijlstra
2019-06-04 15:47     ` Waiman Long
2019-06-04 17:02       ` Peter Zijlstra
2019-06-04 17:06         ` Waiman Long
2019-06-04 17:18           ` Peter Zijlstra
2019-05-20 20:59 ` [PATCH v8 18/19] locking/rwsem: Remove redundant computation of writer lock word Waiman Long
2019-05-20 20:59 ` [PATCH v8 19/19] locking/rwsem: Disable preemption in down_read*() if owner in count Waiman Long
