* [PATCH v4 00/16] locking/rwsem: Rwsem rearchitecture part 2
@ 2019-04-13 17:22 Waiman Long
  2019-04-13 17:22 ` [PATCH v4 01/16] locking/rwsem: Prevent unneeded warning during locking selftest Waiman Long
                   ` (15 more replies)
  0 siblings, 16 replies; 112+ messages in thread
From: Waiman Long @ 2019-04-13 17:22 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner
  Cc: linux-kernel, x86, Davidlohr Bueso, Linus Torvalds, Tim Chen,
	huang ying, Waiman Long

 v4:
  - Fix the missing initialization bug with !CONFIG_RWSEM_SPIN_ON_OWNER
    in patch 2.
  - Move the "Remove rwsem_wake() wakeup optimization" patch before
    the "Implement a new locking scheme" patch.
  - Add two new patches to merge the relevant content of rwsem.h and
    rwsem-xadd.c into rwsem.c as suggested by PeterZ.
  - Refactor the lock handoff patch to make all setting and clearing
    of the handoff bit serialized by wait_lock to ensure correctness.
  - Adapt the rest of the patches to the new code base.

 v3:
  - Add 2 more patches in front to fix build and testing issues found.
    Patch 1 can actually be merged on top of the patch "locking/rwsem:
    Enhance DEBUG_RWSEMS_WARN_ON() macro" in part 1.
  - Change the handoff patch (now patch 4) to set handoff bit immediately
    after wakeup for RT writers. The timeout limit is also tightened to
    4ms.
  - There are no code changes in the other patches other than resolving
    conflicts with patches 1, 2 and 4.

 v2:
  - Move the negative reader count checking patch (patch 12->10) forward
    to before the merge-owner-into-count patch as suggested by Linus and
    expand the comment.
  - Change the reader-owned rwsem spinning from count based to time
    based to have better control of the max time allowed.

This is part 2 of a 3-part (0/1/2) series to rearchitect the internal
operation of rwsem.

part 0: merged into tip
part 1: https://lore.kernel.org/lkml/20190404174320.22416-1-longman@redhat.com/

This patchset revamps the current rwsem-xadd implementation to make
it saner and easier to work with. It also implements the following 3
new features:

 1) Waiter lock handoff
 2) Reader optimistic spinning
 3) Store write-lock owner in the atomic count (x86-64 only)

Waiter lock handoff is similar to the mechanism currently in the mutex
code. This ensures that lock starvation won't happen.
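
For illustration only, here is a minimal userspace sketch of the handoff
idea (the flag names and bit positions are assumptions made for this
sketch, not the kernel's actual definitions): once a starved waiter has
claimed the handoff bit, optimistic lockers back off instead of stealing
the lock, so the lock is handed to that waiter.

  /* Simplified handoff model -- not the kernel implementation. */
  #include <stdatomic.h>
  #include <stdbool.h>

  #define WRITER_LOCKED  (1UL << 0)   /* assumed bit layout */
  #define FLAG_HANDOFF   (1UL << 2)
  #define READER_MASK    (~0UL << 8)  /* readers counted from bit 8 up */
  #define LOCK_MASK      (WRITER_LOCKED | READER_MASK)

  /*
   * An optimistic locker may only steal the lock if it is free and no
   * starved waiter has claimed the handoff bit.
   */
  static bool try_steal_write_lock(atomic_ulong *count)
  {
          unsigned long old = atomic_load(count);

          while (!(old & (LOCK_MASK | FLAG_HANDOFF))) {
                  if (atomic_compare_exchange_weak(count, &old,
                                                   old | WRITER_LOCKED))
                          return true;    /* lock acquired */
          }
          return false;   /* locked, or handed off to a queued waiter */
  }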

Reader optimistic spinning enables readers to acquire the lock more
quickly.  So workloads that use a mix of readers and writers should
see an increase in performance as long as the reader critical sections
are short.

Finally, storing the write-lock owner into the count will allow
optimistic spinners to get to the lock holder's task structure more
quickly and eliminate the timing gap where the write lock is acquired
but the owner isn't known yet. This is important for RT tasks where
spinning on a lock with an unknown owner is not allowed.
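
As a rough sketch of why this helps (the encoding below is purely an
assumption for illustration and need not match the patch's actual
packing; it assumes the low 8 bits of a task pointer are zero), a single
atomic read of the count then tells a spinner both that the lock is
write-held and which task holds it:

  /* Sketch only: assumed encoding of owner-in-count. */
  #include <stdatomic.h>
  #include <stdbool.h>
  #include <stdint.h>

  #define WRITER_LOCKED  (1UL << 0)
  #define FLAG_BITS      0xffUL          /* low bits reserved for flags */

  struct task;                           /* stand-in for task_struct */

  /* The writer publishes "locked" and "owned by me" in one cmpxchg. */
  static bool write_lock(atomic_ulong *count, struct task *curr)
  {
          unsigned long expected = 0;

          return atomic_compare_exchange_strong(count, &expected,
                          (unsigned long)(uintptr_t)curr | WRITER_LOCKED);
  }

  /* A spinner recovers the owner from the word it already polls. */
  static struct task *owner_from_count(unsigned long c)
  {
          if (!(c & WRITER_LOCKED))
                  return NULL;            /* not write-locked */
          return (struct task *)(uintptr_t)(c & ~FLAG_BITS);
  }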

Because multiple readers can share the same lock, there is a natural
preference for readers when measuring locking throughput, as more
readers are likely to get into the locking fast path than writers. With
waiter lock handoff, we are not going to starve the writers.

On an 8-socket 120-core 240-thread IvyBridge-EX system with 120 reader
and writer locking threads, the min/mean/max locking operations done
in a 5-second testing window before the patchset were:

  120 readers, Iterations Min/Mean/Max = 399/400/401
  120 writers, Iterations Min/Mean/Max = 400/33,389/211,359

After the patchset, they became:

  120 readers, Iterations Min/Mean/Max = 584/10,266/26,609
  120 writers, Iterations Min/Mean/Max = 22,080/29,016/38,728

So it was much fairer to readers. With fewer locking threads, readers
were preferred over writers.

Patch 1 fixes a testing issue with the locking selftest introduced by
the patch "locking/rwsem: Enhance DEBUG_RWSEMS_WARN_ON() macro" in part 1.

Patch 2 makes owner a permanent member of the rw_semaphore structure and
sets it irrespective of CONFIG_RWSEM_SPIN_ON_OWNER.

Patch 3 removes rwsem_wake() wakeup optimization as it doesn't work
with lock handoff.

Patch 4 implements a new rwsem locking scheme similar to what qrwlock
is currently doing. The write lock is acquired by atomic_cmpxchg() while
the read lock is still acquired by atomic_add().

Patch 5 merges the content of rwsem.h and rwsem-xadd.c into rwsem.c, just
like mutex. The rwsem-xadd.c file is removed and a bare-bones rwsem.h is
left for the internal function declarations needed by percpu-rwsem.c.

Patch 6 optimizes the merged rwsem.c file to generate a smaller object
file.

Patch 7 implements lock handoff to prevent lock starvation. It is expected
that throughput will be lower on workloads with highly contended rwsems
as the price of better fairness.

Patch 8 makes rwsem_spin_on_owner() return the owner state.

Patch 9 disallows RT tasks from spinning on a rwsem with an unknown owner.

Patch 10 makes a reader wakeup wake almost all the readers in the wait
queue instead of just those at the front.

Patch 11 enables readers to spin on a writer-owned rwsem.

Patch 12 enables a writer to spin on a reader-owned rwsem for at most
25us.
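
For illustration, a userspace sketch of what such time-bounded spinning
could look like follows (the 25us budget comes from the description
above; the bit layout and helper names are assumptions of the sketch,
not the patch's code):

  /* Illustration only: spin on a reader-owned lock for a bounded time. */
  #include <stdatomic.h>
  #include <stdbool.h>
  #include <time.h>

  #define READER_MASK     (~0UL << 8)    /* assumed: readers from bit 8 up */
  #define SPIN_BUDGET_NS  25000LL        /* ~25us, per the cover letter */

  static long long now_ns(void)
  {
          struct timespec ts;

          clock_gettime(CLOCK_MONOTONIC, &ts);
          return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
  }

  /*
   * Returns true if the readers drained within the budget; otherwise
   * the writer gives up spinning and queues itself as a waiter.
   */
  static bool spin_on_readers(atomic_ulong *count)
  {
          long long deadline = now_ns() + SPIN_BUDGET_NS;

          while (atomic_load(count) & READER_MASK) {
                  if (now_ns() > deadline)
                          return false;
          }
          return true;
  }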

Patch 13 adds some new rwsem owner access helper functions.

Patch 14 handles the case of too many readers by reserving the sign
bit to designate that a reader lock attempt will fail and the locking
reader will be put to sleep. This will ensure that we will not overflow
the reader count.
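
Roughly, the guard can be pictured as in the sketch below (names are
assumed; writer and waiter bits are left out to keep the focus on the
overflow guard, and the real patch folds the failure case into the
reader slowpath): a set sign bit means the reader count is saturated, so
a reader that sees it after its fetch-add must not treat itself as a
lock holder.

  /* Sketch of the sign-bit guard against reader count overflow. */
  #include <stdatomic.h>
  #include <stdbool.h>

  #define READER_BIAS  (1UL << 8)        /* assumed reader increment */

  /*
   * Reader fast path: add the bias, then look at the sign bit of the
   * old value.  A negative value means the count is saturated, so this
   * reader must take the slowpath (which undoes the bias and sleeps).
   */
  static bool down_read_fast(atomic_ulong *count)
  {
          long old = (long)atomic_fetch_add(count, READER_BIAS);

          return old >= 0;
  }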

Patch 15 merges the write-lock owner task pointer into the count.
Only a 64-bit count has enough space to provide a reasonable number of
bits for the reader count. This is for x86-64 only for the time being.

Patch 16 eliminates redundant computation of the merged owner-count.

With a locking microbenchmark running on a 5.1-based kernel, the total
locking rates (in kops/s) on an 8-socket IvyBridge-EX system with equal
numbers of readers and writers (mixed) before and after this patchset
were:

   # of Threads   Before Patch      After Patch
   ------------   ------------      -----------
        2            1,179             9,436
        4            1,505             8,268
        8              721             7,041
       16              575             7,652
       32               70             2,189
       64               39               534

On workloads where the rwsem reader critical section is relatively long
(longer than the spinning period), optimistic spinning of a writer on a
reader-owned rwsem may not be that helpful. In fact, performance may
regress in some cases, like the will-it-scale page_fault1 microbenchmark.
This is likely because larger reader groups, where the readers acquire
the lock together, are broken into smaller ones. So more work will be
needed to better tune the rwsem code to that kind of workload.

Waiman Long (16):
  locking/rwsem: Prevent unneeded warning during locking selftest
  locking/rwsem: Make owner available even if
    !CONFIG_RWSEM_SPIN_ON_OWNER
  locking/rwsem: Remove rwsem_wake() wakeup optimization
  locking/rwsem: Implement a new locking scheme
  locking/rwsem: Merge rwsem.h and rwsem-xadd.c into rwsem.c
  locking/rwsem: Code cleanup after files merging
  locking/rwsem: Implement lock handoff to prevent lock starvation
  locking/rwsem: Make rwsem_spin_on_owner() return owner state
  locking/rwsem: Ensure an RT task will not spin on reader
  locking/rwsem: Wake up almost all readers in wait queue
  locking/rwsem: Enable readers spinning on writer
  locking/rwsem: Enable time-based spinning on reader-owned rwsem
  locking/rwsem: Add more rwsem owner access helpers
  locking/rwsem: Guard against making count negative
  locking/rwsem: Merge owner into count on x86-64
  locking/rwsem: Remove redundant computation of writer lock word

 include/linux/rwsem.h             |    9 +-
 kernel/locking/Makefile           |    2 +-
 kernel/locking/lock_events_list.h |    4 +
 kernel/locking/rwsem-xadd.c       |  729 ---------------
 kernel/locking/rwsem.c            | 1403 ++++++++++++++++++++++++++++-
 kernel/locking/rwsem.h            |  305 +------
 lib/Kconfig.debug                 |    8 +-
 7 files changed, 1396 insertions(+), 1064 deletions(-)
 delete mode 100644 kernel/locking/rwsem-xadd.c

-- 
2.18.1



* [PATCH v4 01/16] locking/rwsem: Prevent unneeded warning during locking selftest
  2019-04-13 17:22 [PATCH v4 00/16] locking/rwsem: Rwsem rearchitecture part 2 Waiman Long
@ 2019-04-13 17:22 ` Waiman Long
  2019-04-18  8:04   ` [tip:locking/core] " tip-bot for Waiman Long
  2019-04-13 17:22 ` [PATCH v4 02/16] locking/rwsem: Make owner available even if !CONFIG_RWSEM_SPIN_ON_OWNER Waiman Long
                   ` (14 subsequent siblings)
  15 siblings, 1 reply; 112+ messages in thread
From: Waiman Long @ 2019-04-13 17:22 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner
  Cc: linux-kernel, x86, Davidlohr Bueso, Linus Torvalds, Tim Chen,
	huang ying, Waiman Long

Disable the DEBUG_RWSEMS check when the locking selftest is running with
the debug_locks_silent flag set.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/rwsem.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/locking/rwsem.h b/kernel/locking/rwsem.h
index 37db17890e36..64877f5294e3 100644
--- a/kernel/locking/rwsem.h
+++ b/kernel/locking/rwsem.h
@@ -30,7 +30,8 @@
 
 #ifdef CONFIG_DEBUG_RWSEMS
 # define DEBUG_RWSEMS_WARN_ON(c, sem)	do {			\
-	if (WARN_ONCE(c, "DEBUG_RWSEMS_WARN_ON(%s): count = 0x%lx, owner = 0x%lx, curr 0x%lx, list %sempty\n",\
+	if (!debug_locks_silent &&				\
+	    WARN_ONCE(c, "DEBUG_RWSEMS_WARN_ON(%s): count = 0x%lx, owner = 0x%lx, curr 0x%lx, list %sempty\n",\
 		#c, atomic_long_read(&(sem)->count),		\
 		(long)((sem)->owner), (long)current,		\
 		list_empty(&(sem)->wait_list) ? "" : "not "))	\
-- 
2.18.1



* [PATCH v4 02/16] locking/rwsem: Make owner available even if !CONFIG_RWSEM_SPIN_ON_OWNER
  2019-04-13 17:22 [PATCH v4 00/16] locking/rwsem: Rwsem rearchitecture part 2 Waiman Long
  2019-04-13 17:22 ` [PATCH v4 01/16] locking/rwsem: Prevent unneeded warning during locking selftest Waiman Long
@ 2019-04-13 17:22 ` Waiman Long
  2019-04-13 17:22 ` [PATCH v4 03/16] locking/rwsem: Remove rwsem_wake() wakeup optimization Waiman Long
                   ` (13 subsequent siblings)
  15 siblings, 0 replies; 112+ messages in thread
From: Waiman Long @ 2019-04-13 17:22 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner
  Cc: linux-kernel, x86, Davidlohr Bueso, Linus Torvalds, Tim Chen,
	huang ying, Waiman Long

The owner field in the rw_semaphore structure is used primarily for
optimistic spinning. However, identifying the rwsem owner can also be
helpful in debugging as well as in tracing locking-related issues when
analyzing crash dumps. The owner field may also store state information
that can be important to the operation of the rwsem.

So the owner field is now made a permanent member of the rw_semaphore
structure irrespective of CONFIG_RWSEM_SPIN_ON_OWNER.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/rwsem.h       |  9 +++++----
 kernel/locking/rwsem-xadd.c |  2 +-
 kernel/locking/rwsem.h      | 23 -----------------------
 lib/Kconfig.debug           |  8 ++++----
 4 files changed, 10 insertions(+), 32 deletions(-)

diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
index 2ea18a3def04..148983e21d47 100644
--- a/include/linux/rwsem.h
+++ b/include/linux/rwsem.h
@@ -34,12 +34,12 @@
  */
 struct rw_semaphore {
 	atomic_long_t count;
-#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
 	/*
-	 * Write owner. Used as a speculative check to see
-	 * if the owner is running on the cpu.
+	 * Write owner or one of the read owners. Can be used as a
+	 * speculative check to see if the owner is running on the cpu.
 	 */
 	struct task_struct *owner;
+#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
 	struct optimistic_spin_queue osq; /* spinner MCS lock */
 #endif
 	raw_spinlock_t wait_lock;
@@ -73,13 +73,14 @@ static inline int rwsem_is_locked(struct rw_semaphore *sem)
 #endif
 
 #ifdef CONFIG_RWSEM_SPIN_ON_OWNER
-#define __RWSEM_OPT_INIT(lockname) , .osq = OSQ_LOCK_UNLOCKED, .owner = NULL
+#define __RWSEM_OPT_INIT(lockname) , .osq = OSQ_LOCK_UNLOCKED
 #else
 #define __RWSEM_OPT_INIT(lockname)
 #endif
 
 #define __RWSEM_INITIALIZER(name)				\
 	{ __RWSEM_INIT_COUNT(name),				\
+	  .owner = NULL,					\
 	  .wait_list = LIST_HEAD_INIT((name).wait_list),	\
 	  .wait_lock = __RAW_SPIN_LOCK_UNLOCKED(name.wait_lock)	\
 	  __RWSEM_OPT_INIT(name)				\
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 6b3ee9948bf1..7fd4f1de794a 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -86,8 +86,8 @@ void __init_rwsem(struct rw_semaphore *sem, const char *name,
 	atomic_long_set(&sem->count, RWSEM_UNLOCKED_VALUE);
 	raw_spin_lock_init(&sem->wait_lock);
 	INIT_LIST_HEAD(&sem->wait_list);
-#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
 	sem->owner = NULL;
+#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
 	osq_lock_init(&sem->osq);
 #endif
 }
diff --git a/kernel/locking/rwsem.h b/kernel/locking/rwsem.h
index 64877f5294e3..eb9c8534299b 100644
--- a/kernel/locking/rwsem.h
+++ b/kernel/locking/rwsem.h
@@ -61,7 +61,6 @@
 #define RWSEM_ACTIVE_READ_BIAS		RWSEM_ACTIVE_BIAS
 #define RWSEM_ACTIVE_WRITE_BIAS		(RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
 
-#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
 /*
  * All writes to owner are protected by WRITE_ONCE() to make sure that
  * store tearing can't happen as optimistic spinners may read and use
@@ -126,7 +125,6 @@ static inline bool rwsem_has_anonymous_owner(struct task_struct *owner)
  * real owner or one of the real owners. The only exception is when the
  * unlock is done by up_read_non_owner().
  */
-#define rwsem_clear_reader_owned rwsem_clear_reader_owned
 static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
 {
 	unsigned long val = (unsigned long)current | RWSEM_READER_OWNED
@@ -135,28 +133,7 @@ static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
 		cmpxchg_relaxed((unsigned long *)&sem->owner, val,
 				RWSEM_READER_OWNED | RWSEM_ANONYMOUSLY_OWNED);
 }
-#endif
-
 #else
-static inline void rwsem_set_owner(struct rw_semaphore *sem)
-{
-}
-
-static inline void rwsem_clear_owner(struct rw_semaphore *sem)
-{
-}
-
-static inline void __rwsem_set_reader_owned(struct rw_semaphore *sem,
-					   struct task_struct *owner)
-{
-}
-
-static inline void rwsem_set_reader_owned(struct rw_semaphore *sem)
-{
-}
-#endif
-
-#ifndef rwsem_clear_reader_owned
 static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
 {
 }
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 0d9e81779e37..2047f3884540 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1067,7 +1067,7 @@ config PROVE_LOCKING
 	select DEBUG_SPINLOCK
 	select DEBUG_MUTEXES
 	select DEBUG_RT_MUTEXES if RT_MUTEXES
-	select DEBUG_RWSEMS if RWSEM_SPIN_ON_OWNER
+	select DEBUG_RWSEMS
 	select DEBUG_WW_MUTEX_SLOWPATH
 	select DEBUG_LOCK_ALLOC
 	select TRACE_IRQFLAGS
@@ -1171,10 +1171,10 @@ config DEBUG_WW_MUTEX_SLOWPATH
 
 config DEBUG_RWSEMS
 	bool "RW Semaphore debugging: basic checks"
-	depends on DEBUG_KERNEL && RWSEM_SPIN_ON_OWNER
+	depends on DEBUG_KERNEL
 	help
-	  This debugging feature allows mismatched rw semaphore locks and unlocks
-	  to be detected and reported.
+	  This debugging feature allows mismatched rw semaphore locks
+	  and unlocks to be detected and reported.
 
 config DEBUG_LOCK_ALLOC
 	bool "Lock debugging: detect incorrect freeing of live locks"
-- 
2.18.1



* [PATCH v4 03/16] locking/rwsem: Remove rwsem_wake() wakeup optimization
  2019-04-13 17:22 [PATCH v4 00/16] locking/rwsem: Rwsem rearchitecture part 2 Waiman Long
  2019-04-13 17:22 ` [PATCH v4 01/16] locking/rwsem: Prevent unneeded warning during locking selftest Waiman Long
  2019-04-13 17:22 ` [PATCH v4 02/16] locking/rwsem: Make owner available even if !CONFIG_RWSEM_SPIN_ON_OWNER Waiman Long
@ 2019-04-13 17:22 ` Waiman Long
  2019-04-13 17:22 ` [PATCH v4 04/16] locking/rwsem: Implement a new locking scheme Waiman Long
                   ` (12 subsequent siblings)
  15 siblings, 0 replies; 112+ messages in thread
From: Waiman Long @ 2019-04-13 17:22 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner
  Cc: linux-kernel, x86, Davidlohr Bueso, Linus Torvalds, Tim Chen,
	huang ying, Waiman Long

With commit 59aabfc7e959 ("locking/rwsem: Reduce spinlock contention
in wakeup after up_read()/up_write()"), rwsem_wake() forgoes doing
a wakeup if the wait_lock cannot be directly acquired and an optimistic
spinning locker is present.  This can help performance by avoiding
spinning on the wait_lock when it is contended.

With the later commit 133e89ef5ef3 ("locking/rwsem: Enable lockless
waiter wakeup(s)"), the performance advantage of the above optimization
diminishes as the average wait_lock hold time becomes much shorter.

With a later patch that supports rwsem lock handoff, we can no
longer rely on the fact that the presence of an optimistic spinning
locker will ensure that the lock will be acquired by a task soon and
rwsem_wake() will be called later on to wake up waiters. This can lead
to a missed wakeup and an application hang. So commit 59aabfc7e959
("locking/rwsem: Reduce spinlock contention in wakeup after
up_read()/up_write()") has to be reverted.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/rwsem-xadd.c | 72 -------------------------------------
 1 file changed, 72 deletions(-)

diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 7fd4f1de794a..98de7f0cfedd 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -395,25 +395,11 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 	lockevent_cond_inc(rwsem_opt_fail, !taken);
 	return taken;
 }
-
-/*
- * Return true if the rwsem has active spinner
- */
-static inline bool rwsem_has_spinner(struct rw_semaphore *sem)
-{
-	return osq_is_locked(&sem->osq);
-}
-
 #else
 static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 {
 	return false;
 }
-
-static inline bool rwsem_has_spinner(struct rw_semaphore *sem)
-{
-	return false;
-}
 #endif
 
 /*
@@ -635,65 +621,7 @@ struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
 	unsigned long flags;
 	DEFINE_WAKE_Q(wake_q);
 
-	/*
-	* __rwsem_down_write_failed_common(sem)
-	*   rwsem_optimistic_spin(sem)
-	*     osq_unlock(sem->osq)
-	*   ...
-	*   atomic_long_add_return(&sem->count)
-	*
-	*      - VS -
-	*
-	*              __up_write()
-	*                if (atomic_long_sub_return_release(&sem->count) < 0)
-	*                  rwsem_wake(sem)
-	*                    osq_is_locked(&sem->osq)
-	*
-	* And __up_write() must observe !osq_is_locked() when it observes the
-	* atomic_long_add_return() in order to not miss a wakeup.
-	*
-	* This boils down to:
-	*
-	* [S.rel] X = 1                [RmW] r0 = (Y += 0)
-	*         MB                         RMB
-	* [RmW]   Y += 1               [L]   r1 = X
-	*
-	* exists (r0=1 /\ r1=0)
-	*/
-	smp_rmb();
-
-	/*
-	 * If a spinner is present, it is not necessary to do the wakeup.
-	 * Try to do wakeup only if the trylock succeeds to minimize
-	 * spinlock contention which may introduce too much delay in the
-	 * unlock operation.
-	 *
-	 *    spinning writer		up_write/up_read caller
-	 *    ---------------		-----------------------
-	 * [S]   osq_unlock()		[L]   osq
-	 *	 MB			      RMB
-	 * [RmW] rwsem_try_write_lock() [RmW] spin_trylock(wait_lock)
-	 *
-	 * Here, it is important to make sure that there won't be a missed
-	 * wakeup while the rwsem is free and the only spinning writer goes
-	 * to sleep without taking the rwsem. Even when the spinning writer
-	 * is just going to break out of the waiting loop, it will still do
-	 * a trylock in rwsem_down_write_failed() before sleeping. IOW, if
-	 * rwsem_has_spinner() is true, it will guarantee at least one
-	 * trylock attempt on the rwsem later on.
-	 */
-	if (rwsem_has_spinner(sem)) {
-		/*
-		 * The smp_rmb() here is to make sure that the spinner
-		 * state is consulted before reading the wait_lock.
-		 */
-		smp_rmb();
-		if (!raw_spin_trylock_irqsave(&sem->wait_lock, flags))
-			return sem;
-		goto locked;
-	}
 	raw_spin_lock_irqsave(&sem->wait_lock, flags);
-locked:
 
 	if (!list_empty(&sem->wait_list))
 		__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
-- 
2.18.1



* [PATCH v4 04/16] locking/rwsem: Implement a new locking scheme
  2019-04-13 17:22 [PATCH v4 00/16] locking/rwsem: Rwsem rearchitecture part 2 Waiman Long
                   ` (2 preceding siblings ...)
  2019-04-13 17:22 ` [PATCH v4 03/16] locking/rwsem: Remove rwsem_wake() wakeup optimization Waiman Long
@ 2019-04-13 17:22 ` Waiman Long
  2019-04-16 13:22   ` Peter Zijlstra
  2019-04-13 17:22 ` [PATCH v4 05/16] locking/rwsem: Merge rwsem.h and rwsem-xadd.c into rwsem.c Waiman Long
                   ` (11 subsequent siblings)
  15 siblings, 1 reply; 112+ messages in thread
From: Waiman Long @ 2019-04-13 17:22 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner
  Cc: linux-kernel, x86, Davidlohr Bueso, Linus Torvalds, Tim Chen,
	huang ying, Waiman Long

The current way of using various reader, writer and waiting biases
in the rwsem code is confusing and hard to understand. I have to
reread the rwsem count guide in the rwsem-xadd.c file from time to
time to remind myself how this whole thing works. It also makes the
rwsem code harder to optimize.

To make rwsem more sane, a new locking scheme similar to the one in
qrwlock is now being used.  The atomic long count has the following
bit definitions:

  Bit  0   - writer locked bit
  Bit  1   - waiters present bit
  Bits 2-7 - reserved for future extension
  Bits 8-X - reader count (24/56 bits)

The cmpxchg instruction is now used to acquire the write lock. The read
lock is still acquired with the xadd instruction, so there is no change
here.  This scheme will allow up to 16M/64P active readers, which should
be more than enough. We can always use some more reserved bits if
necessary.
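
As a self-contained userspace model of the scheme (the bit values follow
the definitions above; everything else, including the fallback policy,
is simplified and should not be read as the patch's exact fast paths):

  /* Simplified model of the new count layout and its two fast paths. */
  #include <stdatomic.h>
  #include <stdbool.h>

  #define RWSEM_WRITER_LOCKED  (1UL << 0)
  #define RWSEM_FLAG_WAITERS   (1UL << 1)
  #define RWSEM_READER_SHIFT   8
  #define RWSEM_READER_BIAS    (1UL << RWSEM_READER_SHIFT)

  /* Write lock: one cmpxchg from "completely unlocked" to "writer". */
  static bool down_write_fast(atomic_ulong *count)
  {
          unsigned long expected = 0;

          return atomic_compare_exchange_strong(count, &expected,
                                                RWSEM_WRITER_LOCKED);
  }

  /*
   * Read lock: unconditional xadd; fall back to the slowpath if a
   * writer holds the lock or waiters are queued.
   */
  static bool down_read_fast(atomic_ulong *count)
  {
          unsigned long old = atomic_fetch_add(count, RWSEM_READER_BIAS);

          return !(old & (RWSEM_WRITER_LOCKED | RWSEM_FLAG_WAITERS));
  }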

With that change, we can deterministically know if a rwsem has been
write-locked. Looking at the count alone, however, one cannot determine
for certain if a rwsem is owned by readers or not as the readers that
set the reader count bits may be in the process of backing out. So we
still need the reader-owned bit in the owner field to be sure.
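
In other words, a "reader owned" decision needs both words, roughly as in
this sketch (names assumed; the actual bit position of the reader-owned
marker in the owner field is defined elsewhere in the series):

  #include <stdbool.h>
  #include <stdint.h>

  #define RWSEM_WRITER_LOCKED  (1UL << 0)
  #define RWSEM_READER_MASK    (~0UL << 8)
  #define RWSEM_READER_OWNED   (1UL << 0)   /* assumed bit in the owner word */

  static bool rwsem_is_reader_owned(unsigned long count, uintptr_t owner)
  {
          /* Reader bits without the writer bit are only a hint ... */
          if ((count & RWSEM_WRITER_LOCKED) || !(count & RWSEM_READER_MASK))
                  return false;
          /* ... the owner word's reader-owned marker confirms it. */
          return owner & RWSEM_READER_OWNED;
  }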

With a locking microbenchmark running on 5.1 based kernel, the total
locking rates (in kops/s) of the benchmark on a 8-socket 120-core
IvyBridge-EX system before and after the patch were as follows:

                  Before Patch      After Patch
   # of Threads  wlock    rlock    wlock    rlock
   ------------  -----    -----    -----    -----
        1        30,659   31,341   31,055   31,283
        2         8,909   16,457    9,884   17,659
        4         9,028   15,823    8,933   20,233
        8         8,410   14,212    7,230   17,140
       16         8,217   25,240    7,479   24,607

The locking rates of the benchmark on a Power8 system were as follows:

                  Before Patch      After Patch
   # of Threads  wlock    rlock    wlock    rlock
   ------------  -----    -----    -----    -----
        1        12,963   13,647   13,275   13,601
        2         7,570   11,569    7,902   10,829
        4         5,232    5,516    5,466    5,435
        8         5,233    3,386    5,467    3,168

The locking rates of the benchmark on a 2-socket ARM64 system were
as follows:

                  Before Patch      After Patch
   # of Threads  wlock    rlock    wlock    rlock
   ------------  -----    -----    -----    -----
        1        21,495   21,046   21,524   21,074
        2         5,293   10,502    5,333   10,504
        4         5,325   11,463    5,358   11,631
        8         5,391   11,712    5,470   11,680

The performance is roughly the same before and after the patch. There
are run-to-run variations in performance; runs with higher variances
usually have higher throughput.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/rwsem-xadd.c | 147 ++++++++++++------------------------
 kernel/locking/rwsem.h      |  74 +++++++++---------
 2 files changed, 86 insertions(+), 135 deletions(-)

diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 98de7f0cfedd..92f7d7b6bfa3 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -9,6 +9,8 @@
  *
  * Optimistic spinning by Tim Chen <tim.c.chen@intel.com>
  * and Davidlohr Bueso <davidlohr@hp.com>. Based on mutexes.
+ *
+ * Rwsem count bit fields re-definition by Waiman Long <longman@redhat.com>.
  */
 #include <linux/rwsem.h>
 #include <linux/init.h>
@@ -22,52 +24,20 @@
 #include "rwsem.h"
 
 /*
- * Guide to the rw_semaphore's count field for common values.
- * (32-bit case illustrated, similar for 64-bit)
- *
- * 0x0000000X	(1) X readers active or attempting lock, no writer waiting
- *		    X = #active_readers + #readers attempting to lock
- *		    (X*ACTIVE_BIAS)
- *
- * 0x00000000	rwsem is unlocked, and no one is waiting for the lock or
- *		attempting to read lock or write lock.
- *
- * 0xffff000X	(1) X readers active or attempting lock, with waiters for lock
- *		    X = #active readers + # readers attempting lock
- *		    (X*ACTIVE_BIAS + WAITING_BIAS)
- *		(2) 1 writer attempting lock, no waiters for lock
- *		    X-1 = #active readers + #readers attempting lock
- *		    ((X-1)*ACTIVE_BIAS + ACTIVE_WRITE_BIAS)
- *		(3) 1 writer active, no waiters for lock
- *		    X-1 = #active readers + #readers attempting lock
- *		    ((X-1)*ACTIVE_BIAS + ACTIVE_WRITE_BIAS)
- *
- * 0xffff0001	(1) 1 reader active or attempting lock, waiters for lock
- *		    (WAITING_BIAS + ACTIVE_BIAS)
- *		(2) 1 writer active or attempting lock, no waiters for lock
- *		    (ACTIVE_WRITE_BIAS)
+ * Guide to the rw_semaphore's count field.
  *
- * 0xffff0000	(1) There are writers or readers queued but none active
- *		    or in the process of attempting lock.
- *		    (WAITING_BIAS)
- *		Note: writer can attempt to steal lock for this count by adding
- *		ACTIVE_WRITE_BIAS in cmpxchg and checking the old count
+ * When the RWSEM_WRITER_LOCKED bit in count is set, the lock is owned
+ * by a writer.
  *
- * 0xfffe0001	(1) 1 writer active, or attempting lock. Waiters on queue.
- *		    (ACTIVE_WRITE_BIAS + WAITING_BIAS)
- *
- * Note: Readers attempt to lock by adding ACTIVE_BIAS in down_read and checking
- *	 the count becomes more than 0 for successful lock acquisition,
- *	 i.e. the case where there are only readers or nobody has lock.
- *	 (1st and 2nd case above).
- *
- *	 Writers attempt to lock by adding ACTIVE_WRITE_BIAS in down_write and
- *	 checking the count becomes ACTIVE_WRITE_BIAS for successful lock
- *	 acquisition (i.e. nobody else has lock or attempts lock).  If
- *	 unsuccessful, in rwsem_down_write_failed, we'll check to see if there
- *	 are only waiters but none active (5th case above), and attempt to
- *	 steal the lock.
+ * The lock is owned by readers when
+ * (1) the RWSEM_WRITER_LOCKED isn't set in count,
+ * (2) some of the reader bits are set in count, and
+ * (3) the owner field has the RWSEM_READER_OWNED bit set.
  *
+ * Having some reader bits set is not enough to guarantee a readers owned
+ * lock as the readers may be in the process of backing out from the count
+ * and a writer has just released the lock. So another writer may steal
+ * the lock immediately after that.
  */
 
 /*
@@ -113,9 +83,8 @@ enum rwsem_wake_type {
 
 /*
  * handle the lock release when processes blocked on it that can now run
- * - if we come here from up_xxxx(), then:
- *   - the 'active part' of count (&0x0000ffff) reached 0 (but may have changed)
- *   - the 'waiting part' of count (&0xffff0000) is -ve (and will still be so)
+ * - if we come here from up_xxxx(), then the RWSEM_FLAG_WAITERS bit must
+ *   have been set.
  * - there must be someone on the queue
  * - the wait_lock must be held by the caller
  * - tasks are marked for wakeup, the caller must later invoke wake_up_q()
@@ -159,22 +128,11 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
 	 * so we can bail out early if a writer stole the lock.
 	 */
 	if (wake_type != RWSEM_WAKE_READ_OWNED) {
-		adjustment = RWSEM_ACTIVE_READ_BIAS;
- try_reader_grant:
+		adjustment = RWSEM_READER_BIAS;
 		oldcount = atomic_long_fetch_add(adjustment, &sem->count);
-		if (unlikely(oldcount < RWSEM_WAITING_BIAS)) {
-			/*
-			 * If the count is still less than RWSEM_WAITING_BIAS
-			 * after removing the adjustment, it is assumed that
-			 * a writer has stolen the lock. We have to undo our
-			 * reader grant.
-			 */
-			if (atomic_long_add_return(-adjustment, &sem->count) <
-			    RWSEM_WAITING_BIAS)
-				return;
-
-			/* Last active locker left. Retry waking readers. */
-			goto try_reader_grant;
+		if (unlikely(oldcount & RWSEM_WRITER_MASK)) {
+			atomic_long_sub(adjustment, &sem->count);
+			return;
 		}
 		/*
 		 * Set it to reader-owned to give spinners an early
@@ -214,11 +172,11 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
 		wake_q_add_safe(wake_q, tsk);
 	}
 
-	adjustment = woken * RWSEM_ACTIVE_READ_BIAS - adjustment;
+	adjustment = woken * RWSEM_READER_BIAS - adjustment;
 	lockevent_cond_inc(rwsem_wake_reader, woken);
 	if (list_empty(&sem->wait_list)) {
 		/* hit end of list above */
-		adjustment -= RWSEM_WAITING_BIAS;
+		adjustment -= RWSEM_FLAG_WAITERS;
 	}
 
 	if (adjustment)
@@ -232,22 +190,15 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
  */
 static inline bool rwsem_try_write_lock(long count, struct rw_semaphore *sem)
 {
-	/*
-	 * Avoid trying to acquire write lock if count isn't RWSEM_WAITING_BIAS.
-	 */
-	if (count != RWSEM_WAITING_BIAS)
+	long new;
+
+	if (RWSEM_COUNT_LOCKED(count))
 		return false;
 
-	/*
-	 * Acquire the lock by trying to set it to ACTIVE_WRITE_BIAS. If there
-	 * are other tasks on the wait list, we need to add on WAITING_BIAS.
-	 */
-	count = list_is_singular(&sem->wait_list) ?
-			RWSEM_ACTIVE_WRITE_BIAS :
-			RWSEM_ACTIVE_WRITE_BIAS + RWSEM_WAITING_BIAS;
+	new = count + RWSEM_WRITER_LOCKED -
+	     (list_is_singular(&sem->wait_list) ? RWSEM_FLAG_WAITERS : 0);
 
-	if (atomic_long_cmpxchg_acquire(&sem->count, RWSEM_WAITING_BIAS, count)
-							== RWSEM_WAITING_BIAS) {
+	if (atomic_long_try_cmpxchg_acquire(&sem->count, &count, new)) {
 		rwsem_set_owner(sem);
 		return true;
 	}
@@ -263,9 +214,9 @@ static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem)
 {
 	long count = atomic_long_read(&sem->count);
 
-	while (!count || count == RWSEM_WAITING_BIAS) {
+	while (!RWSEM_COUNT_LOCKED(count)) {
 		if (atomic_long_try_cmpxchg_acquire(&sem->count, &count,
-					count + RWSEM_ACTIVE_WRITE_BIAS)) {
+					count + RWSEM_WRITER_LOCKED)) {
 			rwsem_set_owner(sem);
 			lockevent_inc(rwsem_opt_wlock);
 			return true;
@@ -408,7 +359,7 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 static inline struct rw_semaphore __sched *
 __rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
 {
-	long count, adjustment = -RWSEM_ACTIVE_READ_BIAS;
+	long count, adjustment = -RWSEM_READER_BIAS;
 	struct rwsem_waiter waiter;
 	DEFINE_WAKE_Q(wake_q);
 
@@ -420,16 +371,16 @@ __rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
 		/*
 		 * In case the wait queue is empty and the lock isn't owned
 		 * by a writer, this reader can exit the slowpath and return
-		 * immediately as its RWSEM_ACTIVE_READ_BIAS has already
-		 * been set in the count.
+		 * immediately as its RWSEM_READER_BIAS has already been
+		 * set in the count.
 		 */
-		if (atomic_long_read(&sem->count) >= 0) {
+		if (!(atomic_long_read(&sem->count) & RWSEM_WRITER_MASK)) {
 			raw_spin_unlock_irq(&sem->wait_lock);
 			rwsem_set_reader_owned(sem);
 			lockevent_inc(rwsem_rlock_fast);
 			return sem;
 		}
-		adjustment += RWSEM_WAITING_BIAS;
+		adjustment += RWSEM_FLAG_WAITERS;
 	}
 	list_add_tail(&waiter.list, &sem->wait_list);
 
@@ -442,9 +393,8 @@ __rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
 	 * If there are no writers and we are first in the queue,
 	 * wake our own waiter to join the existing active readers !
 	 */
-	if (count == RWSEM_WAITING_BIAS ||
-	    (count > RWSEM_WAITING_BIAS &&
-	     adjustment != -RWSEM_ACTIVE_READ_BIAS))
+	if (!RWSEM_COUNT_LOCKED(count) ||
+	   (!(count & RWSEM_WRITER_MASK) && (adjustment & RWSEM_FLAG_WAITERS)))
 		__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
 
 	raw_spin_unlock_irq(&sem->wait_lock);
@@ -472,7 +422,7 @@ __rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
 out_nolock:
 	list_del(&waiter.list);
 	if (list_empty(&sem->wait_list))
-		atomic_long_add(-RWSEM_WAITING_BIAS, &sem->count);
+		atomic_long_andnot(RWSEM_FLAG_WAITERS, &sem->count);
 	raw_spin_unlock_irq(&sem->wait_lock);
 	__set_current_state(TASK_RUNNING);
 	lockevent_inc(rwsem_rlock_fail);
@@ -505,9 +455,6 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
 	struct rw_semaphore *ret = sem;
 	DEFINE_WAKE_Q(wake_q);
 
-	/* undo write bias from down_write operation, stop active locking */
-	count = atomic_long_sub_return(RWSEM_ACTIVE_WRITE_BIAS, &sem->count);
-
 	/* do optimistic spinning and steal lock if possible */
 	if (rwsem_optimistic_spin(sem))
 		return sem;
@@ -527,16 +474,18 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
 
 	list_add_tail(&waiter.list, &sem->wait_list);
 
-	/* we're now waiting on the lock, but no longer actively locking */
+	/* we're now waiting on the lock */
 	if (waiting) {
 		count = atomic_long_read(&sem->count);
 
 		/*
 		 * If there were already threads queued before us and there are
-		 * no active writers, the lock must be read owned; so we try to
-		 * wake any read locks that were queued ahead of us.
+		 * no active writers and some readers, the lock must be read
+		 * owned; so we try to wake any read locks that were queued ahead
+		 * of us.
 		 */
-		if (count > RWSEM_WAITING_BIAS) {
+		if (!(count & RWSEM_WRITER_MASK) &&
+		     (count & RWSEM_READER_MASK)) {
 			__rwsem_mark_wake(sem, RWSEM_WAKE_READERS, &wake_q);
 			/*
 			 * The wakeup is normally called _after_ the wait_lock
@@ -553,8 +502,9 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
 			wake_q_init(&wake_q);
 		}
 
-	} else
-		count = atomic_long_add_return(RWSEM_WAITING_BIAS, &sem->count);
+	} else {
+		count = atomic_long_add_return(RWSEM_FLAG_WAITERS, &sem->count);
+	}
 
 	/* wait until we successfully acquire the lock */
 	set_current_state(state);
@@ -571,7 +521,8 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
 			schedule();
 			lockevent_inc(rwsem_sleep_writer);
 			set_current_state(state);
-		} while ((count = atomic_long_read(&sem->count)) & RWSEM_ACTIVE_MASK);
+			count = atomic_long_read(&sem->count);
+		} while (RWSEM_COUNT_LOCKED(count));
 
 		raw_spin_lock_irq(&sem->wait_lock);
 	}
@@ -587,7 +538,7 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
 	raw_spin_lock_irq(&sem->wait_lock);
 	list_del(&waiter.list);
 	if (list_empty(&sem->wait_list))
-		atomic_long_add(-RWSEM_WAITING_BIAS, &sem->count);
+		atomic_long_andnot(RWSEM_FLAG_WAITERS, &sem->count);
 	else
 		__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
 	raw_spin_unlock_irq(&sem->wait_lock);
diff --git a/kernel/locking/rwsem.h b/kernel/locking/rwsem.h
index eb9c8534299b..e7cbabfe0ad1 100644
--- a/kernel/locking/rwsem.h
+++ b/kernel/locking/rwsem.h
@@ -42,24 +42,26 @@
 #endif
 
 /*
- * R/W semaphores originally for PPC using the stuff in lib/rwsem.c.
- * Adapted largely from include/asm-i386/rwsem.h
- * by Paul Mackerras <paulus@samba.org>.
+ * The definition of the atomic counter in the semaphore:
+ *
+ * Bit  0   - writer locked bit
+ * Bit  1   - waiters present bit
+ * Bits 2-7 - reserved
+ * Bits 8-X - 24-bit (32-bit) or 56-bit reader count
+ *
+ * atomic_long_fetch_add() is used to obtain reader lock, whereas
+ * atomic_long_cmpxchg() will be used to obtain writer lock.
  */
+#define RWSEM_WRITER_LOCKED	(1UL << 0)
+#define RWSEM_FLAG_WAITERS	(1UL << 1)
+#define RWSEM_READER_SHIFT	8
+#define RWSEM_READER_BIAS	(1UL << RWSEM_READER_SHIFT)
+#define RWSEM_READER_MASK	(~(RWSEM_READER_BIAS - 1))
+#define RWSEM_WRITER_MASK	RWSEM_WRITER_LOCKED
+#define RWSEM_LOCK_MASK		(RWSEM_WRITER_MASK|RWSEM_READER_MASK)
+#define RWSEM_READ_FAILED_MASK	(RWSEM_WRITER_MASK|RWSEM_FLAG_WAITERS)
 
-/*
- * the semaphore definition
- */
-#ifdef CONFIG_64BIT
-# define RWSEM_ACTIVE_MASK		0xffffffffL
-#else
-# define RWSEM_ACTIVE_MASK		0x0000ffffL
-#endif
-
-#define RWSEM_ACTIVE_BIAS		0x00000001L
-#define RWSEM_WAITING_BIAS		(-RWSEM_ACTIVE_MASK-1)
-#define RWSEM_ACTIVE_READ_BIAS		RWSEM_ACTIVE_BIAS
-#define RWSEM_ACTIVE_WRITE_BIAS		(RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
+#define RWSEM_COUNT_LOCKED(c)	((c) & RWSEM_LOCK_MASK)
 
 /*
  * All writes to owner are protected by WRITE_ONCE() to make sure that
@@ -151,7 +153,8 @@ extern struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem);
  */
 static inline void __down_read(struct rw_semaphore *sem)
 {
-	if (unlikely(atomic_long_inc_return_acquire(&sem->count) <= 0)) {
+	if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
+			&sem->count) & RWSEM_READ_FAILED_MASK)) {
 		rwsem_down_read_failed(sem);
 		DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
 					RWSEM_READER_OWNED), sem);
@@ -162,7 +165,8 @@ static inline void __down_read(struct rw_semaphore *sem)
 
 static inline int __down_read_killable(struct rw_semaphore *sem)
 {
-	if (unlikely(atomic_long_inc_return_acquire(&sem->count) <= 0)) {
+	if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
+			&sem->count) & RWSEM_READ_FAILED_MASK)) {
 		if (IS_ERR(rwsem_down_read_failed_killable(sem)))
 			return -EINTR;
 		DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
@@ -183,11 +187,11 @@ static inline int __down_read_trylock(struct rw_semaphore *sem)
 	lockevent_inc(rwsem_rtrylock);
 	do {
 		if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
-					tmp + RWSEM_ACTIVE_READ_BIAS)) {
+					tmp + RWSEM_READER_BIAS)) {
 			rwsem_set_reader_owned(sem);
 			return 1;
 		}
-	} while (tmp >= 0);
+	} while (!(tmp & RWSEM_READ_FAILED_MASK));
 	return 0;
 }
 
@@ -196,22 +200,16 @@ static inline int __down_read_trylock(struct rw_semaphore *sem)
  */
 static inline void __down_write(struct rw_semaphore *sem)
 {
-	long tmp;
-
-	tmp = atomic_long_add_return_acquire(RWSEM_ACTIVE_WRITE_BIAS,
-					     &sem->count);
-	if (unlikely(tmp != RWSEM_ACTIVE_WRITE_BIAS))
+	if (unlikely(atomic_long_cmpxchg_acquire(&sem->count, 0,
+						 RWSEM_WRITER_LOCKED)))
 		rwsem_down_write_failed(sem);
 	rwsem_set_owner(sem);
 }
 
 static inline int __down_write_killable(struct rw_semaphore *sem)
 {
-	long tmp;
-
-	tmp = atomic_long_add_return_acquire(RWSEM_ACTIVE_WRITE_BIAS,
-					     &sem->count);
-	if (unlikely(tmp != RWSEM_ACTIVE_WRITE_BIAS))
+	if (unlikely(atomic_long_cmpxchg_acquire(&sem->count, 0,
+						 RWSEM_WRITER_LOCKED)))
 		if (IS_ERR(rwsem_down_write_failed_killable(sem)))
 			return -EINTR;
 	rwsem_set_owner(sem);
@@ -224,7 +222,7 @@ static inline int __down_write_trylock(struct rw_semaphore *sem)
 
 	lockevent_inc(rwsem_wtrylock);
 	tmp = atomic_long_cmpxchg_acquire(&sem->count, RWSEM_UNLOCKED_VALUE,
-		      RWSEM_ACTIVE_WRITE_BIAS);
+					  RWSEM_WRITER_LOCKED);
 	if (tmp == RWSEM_UNLOCKED_VALUE) {
 		rwsem_set_owner(sem);
 		return true;
@@ -242,8 +240,9 @@ static inline void __up_read(struct rw_semaphore *sem)
 	DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner & RWSEM_READER_OWNED),
 				sem);
 	rwsem_clear_reader_owned(sem);
-	tmp = atomic_long_dec_return_release(&sem->count);
-	if (unlikely(tmp < -1 && (tmp & RWSEM_ACTIVE_MASK) == 0))
+	tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
+	if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS))
+			== RWSEM_FLAG_WAITERS))
 		rwsem_wake(sem);
 }
 
@@ -254,8 +253,8 @@ static inline void __up_write(struct rw_semaphore *sem)
 {
 	DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
 	rwsem_clear_owner(sem);
-	if (unlikely(atomic_long_sub_return_release(RWSEM_ACTIVE_WRITE_BIAS,
-						    &sem->count) < 0))
+	if (unlikely(atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED,
+			&sem->count) & RWSEM_FLAG_WAITERS))
 		rwsem_wake(sem);
 }
 
@@ -274,8 +273,9 @@ static inline void __downgrade_write(struct rw_semaphore *sem)
 	 * write side. As such, rely on RELEASE semantics.
 	 */
 	DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
-	tmp = atomic_long_add_return_release(-RWSEM_WAITING_BIAS, &sem->count);
+	tmp = atomic_long_fetch_add_release(
+		-RWSEM_WRITER_LOCKED+RWSEM_READER_BIAS, &sem->count);
 	rwsem_set_reader_owned(sem);
-	if (tmp < 0)
+	if (tmp & RWSEM_FLAG_WAITERS)
 		rwsem_downgrade_wake(sem);
 }
-- 
2.18.1



* [PATCH v4 05/16] locking/rwsem: Merge rwsem.h and rwsem-xadd.c into rwsem.c
  2019-04-13 17:22 [PATCH v4 00/16] locking/rwsem: Rwsem rearchitecture part 2 Waiman Long
                   ` (3 preceding siblings ...)
  2019-04-13 17:22 ` [PATCH v4 04/16] locking/rwsem: Implement a new locking scheme Waiman Long
@ 2019-04-13 17:22 ` Waiman Long
  2019-04-13 17:22 ` [PATCH v4 06/16] locking/rwsem: Code cleanup after files merging Waiman Long
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 112+ messages in thread
From: Waiman Long @ 2019-04-13 17:22 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner
  Cc: linux-kernel, x86, Davidlohr Bueso, Linus Torvalds, Tim Chen,
	huang ying, Waiman Long

Now we only have one implementation of rwsem. Even though we still use
xadd to handle reader locking, we use cmpxchg for the writer lock
instead. So the filename rwsem-xadd.c is no longer strictly correct.
Also, no one outside of the rwsem code needs to know the internal
implementation other than the function prototypes of the two internal
functions that are called directly from percpu-rwsem.c.

So the rwsem-xadd.c and rwsem.h files are now merged into rwsem.c in
the following order:

  <upper part of rwsem.h>
  <rwsem-xadd.c>
  <lower part of rwsem.h>
  <rwsem.c>

The rwsem.h file now contains only 2 function declarations for
__up_read() and __down_read().

This is a code relocation patch with no code change at all except
making __up_read() and __down_read() non-static functions so they
can be used by percpu-rwsem.c.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/Makefile     |   2 +-
 kernel/locking/rwsem-xadd.c | 608 -------------------------
 kernel/locking/rwsem.c      | 870 ++++++++++++++++++++++++++++++++++++
 kernel/locking/rwsem.h      | 283 +-----------
 4 files changed, 877 insertions(+), 886 deletions(-)
 delete mode 100644 kernel/locking/rwsem-xadd.c

diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile
index 6fe2f333aecb..45452facff3b 100644
--- a/kernel/locking/Makefile
+++ b/kernel/locking/Makefile
@@ -3,7 +3,7 @@
 # and is generally not a function of system call inputs.
 KCOV_INSTRUMENT		:= n
 
-obj-y += mutex.o semaphore.o rwsem.o percpu-rwsem.o rwsem-xadd.o
+obj-y += mutex.o semaphore.o rwsem.o percpu-rwsem.o
 
 ifdef CONFIG_FUNCTION_TRACER
 CFLAGS_REMOVE_lockdep.o = $(CC_FLAGS_FTRACE)
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
deleted file mode 100644
index 92f7d7b6bfa3..000000000000
--- a/kernel/locking/rwsem-xadd.c
+++ /dev/null
@@ -1,608 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0
-/* rwsem.c: R/W semaphores: contention handling functions
- *
- * Written by David Howells (dhowells@redhat.com).
- * Derived from arch/i386/kernel/semaphore.c
- *
- * Writer lock-stealing by Alex Shi <alex.shi@intel.com>
- * and Michel Lespinasse <walken@google.com>
- *
- * Optimistic spinning by Tim Chen <tim.c.chen@intel.com>
- * and Davidlohr Bueso <davidlohr@hp.com>. Based on mutexes.
- *
- * Rwsem count bit fields re-definition by Waiman Long <longman@redhat.com>.
- */
-#include <linux/rwsem.h>
-#include <linux/init.h>
-#include <linux/export.h>
-#include <linux/sched/signal.h>
-#include <linux/sched/rt.h>
-#include <linux/sched/wake_q.h>
-#include <linux/sched/debug.h>
-#include <linux/osq_lock.h>
-
-#include "rwsem.h"
-
-/*
- * Guide to the rw_semaphore's count field.
- *
- * When the RWSEM_WRITER_LOCKED bit in count is set, the lock is owned
- * by a writer.
- *
- * The lock is owned by readers when
- * (1) the RWSEM_WRITER_LOCKED isn't set in count,
- * (2) some of the reader bits are set in count, and
- * (3) the owner field has the RWSEM_READER_OWNED bit set.
- *
- * Having some reader bits set is not enough to guarantee a readers owned
- * lock as the readers may be in the process of backing out from the count
- * and a writer has just released the lock. So another writer may steal
- * the lock immediately after that.
- */
-
-/*
- * Initialize an rwsem:
- */
-void __init_rwsem(struct rw_semaphore *sem, const char *name,
-		  struct lock_class_key *key)
-{
-#ifdef CONFIG_DEBUG_LOCK_ALLOC
-	/*
-	 * Make sure we are not reinitializing a held semaphore:
-	 */
-	debug_check_no_locks_freed((void *)sem, sizeof(*sem));
-	lockdep_init_map(&sem->dep_map, name, key, 0);
-#endif
-	atomic_long_set(&sem->count, RWSEM_UNLOCKED_VALUE);
-	raw_spin_lock_init(&sem->wait_lock);
-	INIT_LIST_HEAD(&sem->wait_list);
-	sem->owner = NULL;
-#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
-	osq_lock_init(&sem->osq);
-#endif
-}
-
-EXPORT_SYMBOL(__init_rwsem);
-
-enum rwsem_waiter_type {
-	RWSEM_WAITING_FOR_WRITE,
-	RWSEM_WAITING_FOR_READ
-};
-
-struct rwsem_waiter {
-	struct list_head list;
-	struct task_struct *task;
-	enum rwsem_waiter_type type;
-};
-
-enum rwsem_wake_type {
-	RWSEM_WAKE_ANY,		/* Wake whatever's at head of wait list */
-	RWSEM_WAKE_READERS,	/* Wake readers only */
-	RWSEM_WAKE_READ_OWNED	/* Waker thread holds the read lock */
-};
-
-/*
- * handle the lock release when processes blocked on it that can now run
- * - if we come here from up_xxxx(), then the RWSEM_FLAG_WAITERS bit must
- *   have been set.
- * - there must be someone on the queue
- * - the wait_lock must be held by the caller
- * - tasks are marked for wakeup, the caller must later invoke wake_up_q()
- *   to actually wakeup the blocked task(s) and drop the reference count,
- *   preferably when the wait_lock is released
- * - woken process blocks are discarded from the list after having task zeroed
- * - writers are only marked woken if downgrading is false
- */
-static void __rwsem_mark_wake(struct rw_semaphore *sem,
-			      enum rwsem_wake_type wake_type,
-			      struct wake_q_head *wake_q)
-{
-	struct rwsem_waiter *waiter, *tmp;
-	long oldcount, woken = 0, adjustment = 0;
-
-	/*
-	 * Take a peek at the queue head waiter such that we can determine
-	 * the wakeup(s) to perform.
-	 */
-	waiter = list_first_entry(&sem->wait_list, struct rwsem_waiter, list);
-
-	if (waiter->type == RWSEM_WAITING_FOR_WRITE) {
-		if (wake_type == RWSEM_WAKE_ANY) {
-			/*
-			 * Mark writer at the front of the queue for wakeup.
-			 * Until the task is actually later awoken later by
-			 * the caller, other writers are able to steal it.
-			 * Readers, on the other hand, will block as they
-			 * will notice the queued writer.
-			 */
-			wake_q_add(wake_q, waiter->task);
-			lockevent_inc(rwsem_wake_writer);
-		}
-
-		return;
-	}
-
-	/*
-	 * Writers might steal the lock before we grant it to the next reader.
-	 * We prefer to do the first reader grant before counting readers
-	 * so we can bail out early if a writer stole the lock.
-	 */
-	if (wake_type != RWSEM_WAKE_READ_OWNED) {
-		adjustment = RWSEM_READER_BIAS;
-		oldcount = atomic_long_fetch_add(adjustment, &sem->count);
-		if (unlikely(oldcount & RWSEM_WRITER_MASK)) {
-			atomic_long_sub(adjustment, &sem->count);
-			return;
-		}
-		/*
-		 * Set it to reader-owned to give spinners an early
-		 * indication that readers now have the lock.
-		 */
-		__rwsem_set_reader_owned(sem, waiter->task);
-	}
-
-	/*
-	 * Grant an infinite number of read locks to the readers at the front
-	 * of the queue. We know that woken will be at least 1 as we accounted
-	 * for above. Note we increment the 'active part' of the count by the
-	 * number of readers before waking any processes up.
-	 */
-	list_for_each_entry_safe(waiter, tmp, &sem->wait_list, list) {
-		struct task_struct *tsk;
-
-		if (waiter->type == RWSEM_WAITING_FOR_WRITE)
-			break;
-
-		woken++;
-		tsk = waiter->task;
-
-		get_task_struct(tsk);
-		list_del(&waiter->list);
-		/*
-		 * Ensure calling get_task_struct() before setting the reader
-		 * waiter to nil such that rwsem_down_read_failed() cannot
-		 * race with do_exit() by always holding a reference count
-		 * to the task to wakeup.
-		 */
-		smp_store_release(&waiter->task, NULL);
-		/*
-		 * Ensure issuing the wakeup (either by us or someone else)
-		 * after setting the reader waiter to nil.
-		 */
-		wake_q_add_safe(wake_q, tsk);
-	}
-
-	adjustment = woken * RWSEM_READER_BIAS - adjustment;
-	lockevent_cond_inc(rwsem_wake_reader, woken);
-	if (list_empty(&sem->wait_list)) {
-		/* hit end of list above */
-		adjustment -= RWSEM_FLAG_WAITERS;
-	}
-
-	if (adjustment)
-		atomic_long_add(adjustment, &sem->count);
-}
-
-/*
- * This function must be called with the sem->wait_lock held to prevent
- * race conditions between checking the rwsem wait list and setting the
- * sem->count accordingly.
- */
-static inline bool rwsem_try_write_lock(long count, struct rw_semaphore *sem)
-{
-	long new;
-
-	if (RWSEM_COUNT_LOCKED(count))
-		return false;
-
-	new = count + RWSEM_WRITER_LOCKED -
-	     (list_is_singular(&sem->wait_list) ? RWSEM_FLAG_WAITERS : 0);
-
-	if (atomic_long_try_cmpxchg_acquire(&sem->count, &count, new)) {
-		rwsem_set_owner(sem);
-		return true;
-	}
-
-	return false;
-}
-
-#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
-/*
- * Try to acquire write lock before the writer has been put on wait queue.
- */
-static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem)
-{
-	long count = atomic_long_read(&sem->count);
-
-	while (!RWSEM_COUNT_LOCKED(count)) {
-		if (atomic_long_try_cmpxchg_acquire(&sem->count, &count,
-					count + RWSEM_WRITER_LOCKED)) {
-			rwsem_set_owner(sem);
-			lockevent_inc(rwsem_opt_wlock);
-			return true;
-		}
-	}
-	return false;
-}
-
-static inline bool owner_on_cpu(struct task_struct *owner)
-{
-	/*
-	 * As lock holder preemption issue, we both skip spinning if
-	 * task is not on cpu or its cpu is preempted
-	 */
-	return owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
-}
-
-static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
-{
-	struct task_struct *owner;
-	bool ret = true;
-
-	BUILD_BUG_ON(!rwsem_has_anonymous_owner(RWSEM_OWNER_UNKNOWN));
-
-	if (need_resched())
-		return false;
-
-	rcu_read_lock();
-	owner = READ_ONCE(sem->owner);
-	if (owner) {
-		ret = is_rwsem_owner_spinnable(owner) &&
-		      owner_on_cpu(owner);
-	}
-	rcu_read_unlock();
-	return ret;
-}
-
-/*
- * Return true only if we can still spin on the owner field of the rwsem.
- */
-static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem)
-{
-	struct task_struct *owner = READ_ONCE(sem->owner);
-
-	if (!is_rwsem_owner_spinnable(owner))
-		return false;
-
-	rcu_read_lock();
-	while (owner && (READ_ONCE(sem->owner) == owner)) {
-		/*
-		 * Ensure we emit the owner->on_cpu, dereference _after_
-		 * checking sem->owner still matches owner, if that fails,
-		 * owner might point to free()d memory, if it still matches,
-		 * the rcu_read_lock() ensures the memory stays valid.
-		 */
-		barrier();
-
-		/*
-		 * abort spinning when need_resched or owner is not running or
-		 * owner's cpu is preempted.
-		 */
-		if (need_resched() || !owner_on_cpu(owner)) {
-			rcu_read_unlock();
-			return false;
-		}
-
-		cpu_relax();
-	}
-	rcu_read_unlock();
-
-	/*
-	 * If there is a new owner or the owner is not set, we continue
-	 * spinning.
-	 */
-	return is_rwsem_owner_spinnable(READ_ONCE(sem->owner));
-}
-
-static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
-{
-	bool taken = false;
-
-	preempt_disable();
-
-	/* sem->wait_lock should not be held when doing optimistic spinning */
-	if (!rwsem_can_spin_on_owner(sem))
-		goto done;
-
-	if (!osq_lock(&sem->osq))
-		goto done;
-
-	/*
-	 * Optimistically spin on the owner field and attempt to acquire the
-	 * lock whenever the owner changes. Spinning will be stopped when:
-	 *  1) the owning writer isn't running; or
-	 *  2) readers own the lock as we can't determine if they are
-	 *     actively running or not.
-	 */
-	while (rwsem_spin_on_owner(sem)) {
-		/*
-		 * Try to acquire the lock
-		 */
-		if (rwsem_try_write_lock_unqueued(sem)) {
-			taken = true;
-			break;
-		}
-
-		/*
-		 * When there's no owner, we might have preempted between the
-		 * owner acquiring the lock and setting the owner field. If
-		 * we're an RT task that will live-lock because we won't let
-		 * the owner complete.
-		 */
-		if (!sem->owner && (need_resched() || rt_task(current)))
-			break;
-
-		/*
-		 * The cpu_relax() call is a compiler barrier which forces
-		 * everything in this loop to be re-loaded. We don't need
-		 * memory barriers as we'll eventually observe the right
-		 * values at the cost of a few extra spins.
-		 */
-		cpu_relax();
-	}
-	osq_unlock(&sem->osq);
-done:
-	preempt_enable();
-	lockevent_cond_inc(rwsem_opt_fail, !taken);
-	return taken;
-}
-#else
-static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
-{
-	return false;
-}
-#endif
-
-/*
- * Wait for the read lock to be granted
- */
-static inline struct rw_semaphore __sched *
-__rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
-{
-	long count, adjustment = -RWSEM_READER_BIAS;
-	struct rwsem_waiter waiter;
-	DEFINE_WAKE_Q(wake_q);
-
-	waiter.task = current;
-	waiter.type = RWSEM_WAITING_FOR_READ;
-
-	raw_spin_lock_irq(&sem->wait_lock);
-	if (list_empty(&sem->wait_list)) {
-		/*
-		 * In case the wait queue is empty and the lock isn't owned
-		 * by a writer, this reader can exit the slowpath and return
-		 * immediately as its RWSEM_READER_BIAS has already been
-		 * set in the count.
-		 */
-		if (!(atomic_long_read(&sem->count) & RWSEM_WRITER_MASK)) {
-			raw_spin_unlock_irq(&sem->wait_lock);
-			rwsem_set_reader_owned(sem);
-			lockevent_inc(rwsem_rlock_fast);
-			return sem;
-		}
-		adjustment += RWSEM_FLAG_WAITERS;
-	}
-	list_add_tail(&waiter.list, &sem->wait_list);
-
-	/* we're now waiting on the lock, but no longer actively locking */
-	count = atomic_long_add_return(adjustment, &sem->count);
-
-	/*
-	 * If there are no active locks, wake the front queued process(es).
-	 *
-	 * If there are no writers and we are first in the queue,
-	 * wake our own waiter to join the existing active readers !
-	 */
-	if (!RWSEM_COUNT_LOCKED(count) ||
-	   (!(count & RWSEM_WRITER_MASK) && (adjustment & RWSEM_FLAG_WAITERS)))
-		__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
-
-	raw_spin_unlock_irq(&sem->wait_lock);
-	wake_up_q(&wake_q);
-
-	/* wait to be given the lock */
-	while (true) {
-		set_current_state(state);
-		if (!waiter.task)
-			break;
-		if (signal_pending_state(state, current)) {
-			raw_spin_lock_irq(&sem->wait_lock);
-			if (waiter.task)
-				goto out_nolock;
-			raw_spin_unlock_irq(&sem->wait_lock);
-			break;
-		}
-		schedule();
-		lockevent_inc(rwsem_sleep_reader);
-	}
-
-	__set_current_state(TASK_RUNNING);
-	lockevent_inc(rwsem_rlock);
-	return sem;
-out_nolock:
-	list_del(&waiter.list);
-	if (list_empty(&sem->wait_list))
-		atomic_long_andnot(RWSEM_FLAG_WAITERS, &sem->count);
-	raw_spin_unlock_irq(&sem->wait_lock);
-	__set_current_state(TASK_RUNNING);
-	lockevent_inc(rwsem_rlock_fail);
-	return ERR_PTR(-EINTR);
-}
-
-__visible struct rw_semaphore * __sched
-rwsem_down_read_failed(struct rw_semaphore *sem)
-{
-	return __rwsem_down_read_failed_common(sem, TASK_UNINTERRUPTIBLE);
-}
-EXPORT_SYMBOL(rwsem_down_read_failed);
-
-__visible struct rw_semaphore * __sched
-rwsem_down_read_failed_killable(struct rw_semaphore *sem)
-{
-	return __rwsem_down_read_failed_common(sem, TASK_KILLABLE);
-}
-EXPORT_SYMBOL(rwsem_down_read_failed_killable);
-
-/*
- * Wait until we successfully acquire the write lock
- */
-static inline struct rw_semaphore *
-__rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
-{
-	long count;
-	bool waiting = true; /* any queued threads before us */
-	struct rwsem_waiter waiter;
-	struct rw_semaphore *ret = sem;
-	DEFINE_WAKE_Q(wake_q);
-
-	/* do optimistic spinning and steal lock if possible */
-	if (rwsem_optimistic_spin(sem))
-		return sem;
-
-	/*
-	 * Optimistic spinning failed, proceed to the slowpath
-	 * and block until we can acquire the sem.
-	 */
-	waiter.task = current;
-	waiter.type = RWSEM_WAITING_FOR_WRITE;
-
-	raw_spin_lock_irq(&sem->wait_lock);
-
-	/* account for this before adding a new element to the list */
-	if (list_empty(&sem->wait_list))
-		waiting = false;
-
-	list_add_tail(&waiter.list, &sem->wait_list);
-
-	/* we're now waiting on the lock */
-	if (waiting) {
-		count = atomic_long_read(&sem->count);
-
-		/*
-		 * If there were already threads queued before us and there are
-		 * no active writers and some readers, the lock must be read
-		 * owned; so we try to  any read locks that were queued ahead
-		 * of us.
-		 */
-		if (!(count & RWSEM_WRITER_MASK) &&
-		     (count & RWSEM_READER_MASK)) {
-			__rwsem_mark_wake(sem, RWSEM_WAKE_READERS, &wake_q);
-			/*
-			 * The wakeup is normally called _after_ the wait_lock
-			 * is released, but given that we are proactively waking
-			 * readers we can deal with the wake_q overhead as it is
-			 * similar to releasing and taking the wait_lock again
-			 * for attempting rwsem_try_write_lock().
-			 */
-			wake_up_q(&wake_q);
-
-			/*
-			 * Reinitialize wake_q after use.
-			 */
-			wake_q_init(&wake_q);
-		}
-
-	} else {
-		count = atomic_long_add_return(RWSEM_FLAG_WAITERS, &sem->count);
-	}
-
-	/* wait until we successfully acquire the lock */
-	set_current_state(state);
-	while (true) {
-		if (rwsem_try_write_lock(count, sem))
-			break;
-		raw_spin_unlock_irq(&sem->wait_lock);
-
-		/* Block until there are no active lockers. */
-		do {
-			if (signal_pending_state(state, current))
-				goto out_nolock;
-
-			schedule();
-			lockevent_inc(rwsem_sleep_writer);
-			set_current_state(state);
-			count = atomic_long_read(&sem->count);
-		} while (RWSEM_COUNT_LOCKED(count));
-
-		raw_spin_lock_irq(&sem->wait_lock);
-	}
-	__set_current_state(TASK_RUNNING);
-	list_del(&waiter.list);
-	raw_spin_unlock_irq(&sem->wait_lock);
-	lockevent_inc(rwsem_wlock);
-
-	return ret;
-
-out_nolock:
-	__set_current_state(TASK_RUNNING);
-	raw_spin_lock_irq(&sem->wait_lock);
-	list_del(&waiter.list);
-	if (list_empty(&sem->wait_list))
-		atomic_long_andnot(RWSEM_FLAG_WAITERS, &sem->count);
-	else
-		__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
-	raw_spin_unlock_irq(&sem->wait_lock);
-	wake_up_q(&wake_q);
-	lockevent_inc(rwsem_wlock_fail);
-
-	return ERR_PTR(-EINTR);
-}
-
-__visible struct rw_semaphore * __sched
-rwsem_down_write_failed(struct rw_semaphore *sem)
-{
-	return __rwsem_down_write_failed_common(sem, TASK_UNINTERRUPTIBLE);
-}
-EXPORT_SYMBOL(rwsem_down_write_failed);
-
-__visible struct rw_semaphore * __sched
-rwsem_down_write_failed_killable(struct rw_semaphore *sem)
-{
-	return __rwsem_down_write_failed_common(sem, TASK_KILLABLE);
-}
-EXPORT_SYMBOL(rwsem_down_write_failed_killable);
-
-/*
- * handle waking up a waiter on the semaphore
- * - up_read/up_write has decremented the active part of count if we come here
- */
-__visible
-struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
-{
-	unsigned long flags;
-	DEFINE_WAKE_Q(wake_q);
-
-	raw_spin_lock_irqsave(&sem->wait_lock, flags);
-
-	if (!list_empty(&sem->wait_list))
-		__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
-
-	raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
-	wake_up_q(&wake_q);
-
-	return sem;
-}
-EXPORT_SYMBOL(rwsem_wake);
-
-/*
- * downgrade a write lock into a read lock
- * - caller incremented waiting part of count and discovered it still negative
- * - just wake up any readers at the front of the queue
- */
-__visible
-struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem)
-{
-	unsigned long flags;
-	DEFINE_WAKE_Q(wake_q);
-
-	raw_spin_lock_irqsave(&sem->wait_lock, flags);
-
-	if (!list_empty(&sem->wait_list))
-		__rwsem_mark_wake(sem, RWSEM_WAKE_READ_OWNED, &wake_q);
-
-	raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
-	wake_up_q(&wake_q);
-
-	return sem;
-}
-EXPORT_SYMBOL(rwsem_downgrade_wake);
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index ccbf18f560ff..5f06b0601eb6 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -3,17 +3,887 @@
  *
  * Written by David Howells (dhowells@redhat.com).
  * Derived from asm-i386/semaphore.h
+ *
+ * Writer lock-stealing by Alex Shi <alex.shi@intel.com>
+ * and Michel Lespinasse <walken@google.com>
+ *
+ * Optimistic spinning by Tim Chen <tim.c.chen@intel.com>
+ * and Davidlohr Bueso <davidlohr@hp.com>. Based on mutexes.
+ *
+ * Rwsem count bit fields re-definition and rwsem rearchitecture
+ * by Waiman Long <longman@redhat.com>.
  */
 
 #include <linux/types.h>
 #include <linux/kernel.h>
 #include <linux/sched.h>
+#include <linux/sched/rt.h>
+#include <linux/sched/task.h>
 #include <linux/sched/debug.h>
+#include <linux/sched/wake_q.h>
+#include <linux/sched/signal.h>
 #include <linux/export.h>
 #include <linux/rwsem.h>
 #include <linux/atomic.h>
 
 #include "rwsem.h"
+#include "lock_events.h"
+
+/*
+ * The least significant 2 bits of the owner value has the following
+ * meanings when set.
+ *  - RWSEM_READER_OWNED (bit 0): The rwsem is owned by readers
+ *  - RWSEM_ANONYMOUSLY_OWNED (bit 1): The rwsem is anonymously owned,
+ *    i.e. the owner(s) cannot be readily determined. It can be reader
+ *    owned or the owning writer is indeterminate.
+ *
+ * When a writer acquires a rwsem, it puts its task_struct pointer
+ * into the owner field. It is cleared after an unlock.
+ *
+ * When a reader acquires a rwsem, it will also put its task_struct
+ * pointer into the owner field with both the RWSEM_READER_OWNED and
+ * RWSEM_ANONYMOUSLY_OWNED bits set. On unlock, the owner field will
+ * largely be left untouched. So for a free or reader-owned rwsem,
+ * the owner value may contain information about the last reader that
+ * acquired the rwsem. The anonymous bit is set because that particular
+ * reader may or may not still own the lock.
+ *
+ * That information may be helpful in debugging cases where the system
+ * seems to hang on a reader owned rwsem especially if only one reader
+ * is involved. Ideally we would like to track all the readers that own
+ * a rwsem, but the overhead is simply too big.
+ */
+#define RWSEM_READER_OWNED	(1UL << 0)
+#define RWSEM_ANONYMOUSLY_OWNED	(1UL << 1)
+
+#ifdef CONFIG_DEBUG_RWSEMS
+# define DEBUG_RWSEMS_WARN_ON(c, sem)	do {			\
+	if (!debug_locks_silent &&				\
+	    WARN_ONCE(c, "DEBUG_RWSEMS_WARN_ON(%s): count = 0x%lx, owner = 0x%lx, curr 0x%lx, list %sempty\n",\
+		#c, atomic_long_read(&(sem)->count),		\
+		(long)((sem)->owner), (long)current,		\
+		list_empty(&(sem)->wait_list) ? "" : "not "))	\
+			debug_locks_off();			\
+	} while (0)
+#else
+# define DEBUG_RWSEMS_WARN_ON(c, sem)
+#endif
+
+/*
+ * The definition of the atomic counter in the semaphore:
+ *
+ * Bit  0   - writer locked bit
+ * Bit  1   - waiters present bit
+ * Bits 2-7 - reserved
+ * Bits 8-X - 24-bit (32-bit) or 56-bit reader count
+ *
+ * atomic_long_fetch_add() is used to obtain reader lock, whereas
+ * atomic_long_cmpxchg() will be used to obtain writer lock.
+ */
+#define RWSEM_WRITER_LOCKED	(1UL << 0)
+#define RWSEM_FLAG_WAITERS	(1UL << 1)
+#define RWSEM_READER_SHIFT	8
+#define RWSEM_READER_BIAS	(1UL << RWSEM_READER_SHIFT)
+#define RWSEM_READER_MASK	(~(RWSEM_READER_BIAS - 1))
+#define RWSEM_WRITER_MASK	RWSEM_WRITER_LOCKED
+#define RWSEM_LOCK_MASK		(RWSEM_WRITER_MASK|RWSEM_READER_MASK)
+#define RWSEM_READ_FAILED_MASK	(RWSEM_WRITER_MASK|RWSEM_FLAG_WAITERS)
+
+#define RWSEM_COUNT_LOCKED(c)	((c) & RWSEM_LOCK_MASK)
+
+/*
+ * All writes to owner are protected by WRITE_ONCE() to make sure that
+ * store tearing can't happen as optimistic spinners may read and use
+ * the owner value concurrently without lock. Read from owner, however,
+ * may not need READ_ONCE() as long as the pointer value is only used
+ * for comparison and isn't being dereferenced.
+ */
+static inline void rwsem_set_owner(struct rw_semaphore *sem)
+{
+	WRITE_ONCE(sem->owner, current);
+}
+
+static inline void rwsem_clear_owner(struct rw_semaphore *sem)
+{
+	WRITE_ONCE(sem->owner, NULL);
+}
+
+/*
+ * The task_struct pointer of the last owning reader will be left in
+ * the owner field.
+ *
+ * Note that the owner value just indicates the task has owned the rwsem
+ * previously, it may not be the real owner or one of the real owners
+ * anymore when that field is examined, so take it with a grain of salt.
+ */
+static inline void __rwsem_set_reader_owned(struct rw_semaphore *sem,
+					    struct task_struct *owner)
+{
+	unsigned long val = (unsigned long)owner | RWSEM_READER_OWNED
+						 | RWSEM_ANONYMOUSLY_OWNED;
+
+	WRITE_ONCE(sem->owner, (struct task_struct *)val);
+}
+
+static inline void rwsem_set_reader_owned(struct rw_semaphore *sem)
+{
+	__rwsem_set_reader_owned(sem, current);
+}
+
+/*
+ * Return true if a rwsem waiter can spin on the rwsem's owner
+ * and steal the lock, i.e. the lock is not anonymously owned.
+ * N.B. !owner is considered spinnable.
+ */
+static inline bool is_rwsem_owner_spinnable(struct task_struct *owner)
+{
+	return !((unsigned long)owner & RWSEM_ANONYMOUSLY_OWNED);
+}
+
+/*
+ * Return true if rwsem is owned by an anonymous writer or readers.
+ */
+static inline bool rwsem_has_anonymous_owner(struct task_struct *owner)
+{
+	return (unsigned long)owner & RWSEM_ANONYMOUSLY_OWNED;
+}
+
+#ifdef CONFIG_DEBUG_RWSEMS
+/*
+ * With CONFIG_DEBUG_RWSEMS configured, this makes sure that if there
+ * is a task pointer in the owner field of a reader-owned rwsem, it is the
+ * real owner or one of the real owners. The only exception is when the
+ * unlock is done by up_read_non_owner().
+ */
+static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
+{
+	unsigned long val = (unsigned long)current | RWSEM_READER_OWNED
+						   | RWSEM_ANONYMOUSLY_OWNED;
+	if (READ_ONCE(sem->owner) == (struct task_struct *)val)
+		cmpxchg_relaxed((unsigned long *)&sem->owner, val,
+				RWSEM_READER_OWNED | RWSEM_ANONYMOUSLY_OWNED);
+}
+#else
+static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
+{
+}
+#endif
+
+/*
+ * Guide to the rw_semaphore's count field.
+ *
+ * When the RWSEM_WRITER_LOCKED bit in count is set, the lock is owned
+ * by a writer.
+ *
+ * The lock is owned by readers when
+ * (1) the RWSEM_WRITER_LOCKED isn't set in count,
+ * (2) some of the reader bits are set in count, and
+ * (3) the owner field has the RWSEM_READER_OWNED bit set.
+ *
+ * Having some reader bits set is not enough to guarantee a reader-owned
+ * lock as the readers may be in the process of backing out from the count
+ * and a writer has just released the lock. So another writer may steal
+ * the lock immediately after that.
+ */
+
+/*
+ * Initialize an rwsem:
+ */
+void __init_rwsem(struct rw_semaphore *sem, const char *name,
+		  struct lock_class_key *key)
+{
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	/*
+	 * Make sure we are not reinitializing a held semaphore:
+	 */
+	debug_check_no_locks_freed((void *)sem, sizeof(*sem));
+	lockdep_init_map(&sem->dep_map, name, key, 0);
+#endif
+	atomic_long_set(&sem->count, RWSEM_UNLOCKED_VALUE);
+	raw_spin_lock_init(&sem->wait_lock);
+	INIT_LIST_HEAD(&sem->wait_list);
+	sem->owner = NULL;
+#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
+	osq_lock_init(&sem->osq);
+#endif
+}
+
+EXPORT_SYMBOL(__init_rwsem);
+
+enum rwsem_waiter_type {
+	RWSEM_WAITING_FOR_WRITE,
+	RWSEM_WAITING_FOR_READ
+};
+
+struct rwsem_waiter {
+	struct list_head list;
+	struct task_struct *task;
+	enum rwsem_waiter_type type;
+};
+
+enum rwsem_wake_type {
+	RWSEM_WAKE_ANY,		/* Wake whatever's at head of wait list */
+	RWSEM_WAKE_READERS,	/* Wake readers only */
+	RWSEM_WAKE_READ_OWNED	/* Waker thread holds the read lock */
+};
+
+/*
+ * handle the lock release when processes blocked on it can now run
+ * - if we come here from up_xxxx(), then the RWSEM_FLAG_WAITERS bit must
+ *   have been set.
+ * - there must be someone on the queue
+ * - the wait_lock must be held by the caller
+ * - tasks are marked for wakeup, the caller must later invoke wake_up_q()
+ *   to actually wakeup the blocked task(s) and drop the reference count,
+ *   preferably when the wait_lock is released
+ * - woken process blocks are discarded from the list after having task zeroed
+ * - writers are only marked woken if downgrading is false
+ */
+static void __rwsem_mark_wake(struct rw_semaphore *sem,
+			      enum rwsem_wake_type wake_type,
+			      struct wake_q_head *wake_q)
+{
+	struct rwsem_waiter *waiter, *tmp;
+	long oldcount, woken = 0, adjustment = 0;
+
+	/*
+	 * Take a peek at the queue head waiter such that we can determine
+	 * the wakeup(s) to perform.
+	 */
+	waiter = list_first_entry(&sem->wait_list, struct rwsem_waiter, list);
+
+	if (waiter->type == RWSEM_WAITING_FOR_WRITE) {
+		if (wake_type == RWSEM_WAKE_ANY) {
+			/*
+			 * Mark writer at the front of the queue for wakeup.
+			 * Until the task is actually awoken later by
+			 * the caller, other writers are able to steal it.
+			 * Readers, on the other hand, will block as they
+			 * will notice the queued writer.
+			 */
+			wake_q_add(wake_q, waiter->task);
+			lockevent_inc(rwsem_wake_writer);
+		}
+
+		return;
+	}
+
+	/*
+	 * Writers might steal the lock before we grant it to the next reader.
+	 * We prefer to do the first reader grant before counting readers
+	 * so we can bail out early if a writer stole the lock.
+	 */
+	if (wake_type != RWSEM_WAKE_READ_OWNED) {
+		adjustment = RWSEM_READER_BIAS;
+		oldcount = atomic_long_fetch_add(adjustment, &sem->count);
+		if (unlikely(oldcount & RWSEM_WRITER_MASK)) {
+			atomic_long_sub(adjustment, &sem->count);
+			return;
+		}
+		/*
+		 * Set it to reader-owned to give spinners an early
+		 * indication that readers now have the lock.
+		 */
+		__rwsem_set_reader_owned(sem, waiter->task);
+	}
+
+	/*
+	 * Grant an infinite number of read locks to the readers at the front
+	 * of the queue. We know that woken will be at least 1 as we accounted
+	 * for above. Note we increment the 'active part' of the count by the
+	 * number of readers before waking any processes up.
+	 */
+	list_for_each_entry_safe(waiter, tmp, &sem->wait_list, list) {
+		struct task_struct *tsk;
+
+		if (waiter->type == RWSEM_WAITING_FOR_WRITE)
+			break;
+
+		woken++;
+		tsk = waiter->task;
+
+		get_task_struct(tsk);
+		list_del(&waiter->list);
+		/*
+		 * Ensure calling get_task_struct() before setting the reader
+		 * waiter to nil such that rwsem_down_read_failed() cannot
+		 * race with do_exit() by always holding a reference count
+		 * to the task to wakeup.
+		 */
+		smp_store_release(&waiter->task, NULL);
+		/*
+		 * Ensure issuing the wakeup (either by us or someone else)
+		 * after setting the reader waiter to nil.
+		 */
+		wake_q_add_safe(wake_q, tsk);
+	}
+
+	adjustment = woken * RWSEM_READER_BIAS - adjustment;
+	lockevent_cond_inc(rwsem_wake_reader, woken);
+	if (list_empty(&sem->wait_list)) {
+		/* hit end of list above */
+		adjustment -= RWSEM_FLAG_WAITERS;
+	}
+
+	if (adjustment)
+		atomic_long_add(adjustment, &sem->count);
+}
+
+/*
+ * This function must be called with the sem->wait_lock held to prevent
+ * race conditions between checking the rwsem wait list and setting the
+ * sem->count accordingly.
+ */
+static inline bool rwsem_try_write_lock(long count, struct rw_semaphore *sem)
+{
+	long new;
+
+	if (RWSEM_COUNT_LOCKED(count))
+		return false;
+
+	new = count + RWSEM_WRITER_LOCKED -
+	     (list_is_singular(&sem->wait_list) ? RWSEM_FLAG_WAITERS : 0);
+
+	if (atomic_long_try_cmpxchg_acquire(&sem->count, &count, new)) {
+		rwsem_set_owner(sem);
+		return true;
+	}
+
+	return false;
+}
+
+#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
+/*
+ * Try to acquire the write lock before the writer has been put on the wait queue.
+ */
+static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem)
+{
+	long count = atomic_long_read(&sem->count);
+
+	while (!RWSEM_COUNT_LOCKED(count)) {
+		if (atomic_long_try_cmpxchg_acquire(&sem->count, &count,
+					count + RWSEM_WRITER_LOCKED)) {
+			rwsem_set_owner(sem);
+			lockevent_inc(rwsem_opt_wlock);
+			return true;
+		}
+	}
+	return false;
+}
+
+static inline bool owner_on_cpu(struct task_struct *owner)
+{
+	/*
+	 * Due to the lock holder preemption issue, we skip spinning if the
+	 * task is not on a CPU or its CPU is preempted.
+	 */
+	return owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
+}
+
+static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
+{
+	struct task_struct *owner;
+	bool ret = true;
+
+	BUILD_BUG_ON(!rwsem_has_anonymous_owner(RWSEM_OWNER_UNKNOWN));
+
+	if (need_resched())
+		return false;
+
+	rcu_read_lock();
+	owner = READ_ONCE(sem->owner);
+	if (owner) {
+		ret = is_rwsem_owner_spinnable(owner) &&
+		      owner_on_cpu(owner);
+	}
+	rcu_read_unlock();
+	return ret;
+}
+
+/*
+ * Return true only if we can still spin on the owner field of the rwsem.
+ */
+static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem)
+{
+	struct task_struct *owner = READ_ONCE(sem->owner);
+
+	if (!is_rwsem_owner_spinnable(owner))
+		return false;
+
+	rcu_read_lock();
+	while (owner && (READ_ONCE(sem->owner) == owner)) {
+		/*
+		 * Ensure we emit the owner->on_cpu, dereference _after_
+		 * checking sem->owner still matches owner, if that fails,
+		 * owner might point to free()d memory, if it still matches,
+		 * the rcu_read_lock() ensures the memory stays valid.
+		 */
+		barrier();
+
+		/*
+		 * abort spinning when need_resched or owner is not running or
+		 * owner's cpu is preempted.
+		 */
+		if (need_resched() || !owner_on_cpu(owner)) {
+			rcu_read_unlock();
+			return false;
+		}
+
+		cpu_relax();
+	}
+	rcu_read_unlock();
+
+	/*
+	 * If there is a new owner or the owner is not set, we continue
+	 * spinning.
+	 */
+	return is_rwsem_owner_spinnable(READ_ONCE(sem->owner));
+}
+
+static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
+{
+	bool taken = false;
+
+	preempt_disable();
+
+	/* sem->wait_lock should not be held when doing optimistic spinning */
+	if (!rwsem_can_spin_on_owner(sem))
+		goto done;
+
+	if (!osq_lock(&sem->osq))
+		goto done;
+
+	/*
+	 * Optimistically spin on the owner field and attempt to acquire the
+	 * lock whenever the owner changes. Spinning will be stopped when:
+	 *  1) the owning writer isn't running; or
+	 *  2) readers own the lock as we can't determine if they are
+	 *     actively running or not.
+	 */
+	while (rwsem_spin_on_owner(sem)) {
+		/*
+		 * Try to acquire the lock
+		 */
+		if (rwsem_try_write_lock_unqueued(sem)) {
+			taken = true;
+			break;
+		}
+
+		/*
+		 * When there's no owner, we might have preempted between the
+		 * owner acquiring the lock and setting the owner field. If
+		 * we're an RT task, that will live-lock because we won't let
+		 * the owner complete.
+		 */
+		if (!sem->owner && (need_resched() || rt_task(current)))
+			break;
+
+		/*
+		 * The cpu_relax() call is a compiler barrier which forces
+		 * everything in this loop to be re-loaded. We don't need
+		 * memory barriers as we'll eventually observe the right
+		 * values at the cost of a few extra spins.
+		 */
+		cpu_relax();
+	}
+	osq_unlock(&sem->osq);
+done:
+	preempt_enable();
+	lockevent_cond_inc(rwsem_opt_fail, !taken);
+	return taken;
+}
+#else
+static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
+{
+	return false;
+}
+#endif
+
+/*
+ * Wait for the read lock to be granted
+ */
+static inline struct rw_semaphore __sched *
+__rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
+{
+	long count, adjustment = -RWSEM_READER_BIAS;
+	struct rwsem_waiter waiter;
+	DEFINE_WAKE_Q(wake_q);
+
+	waiter.task = current;
+	waiter.type = RWSEM_WAITING_FOR_READ;
+
+	raw_spin_lock_irq(&sem->wait_lock);
+	if (list_empty(&sem->wait_list)) {
+		/*
+		 * In case the wait queue is empty and the lock isn't owned
+		 * by a writer, this reader can exit the slowpath and return
+		 * immediately as its RWSEM_READER_BIAS has already been
+		 * set in the count.
+		 */
+		if (!(atomic_long_read(&sem->count) & RWSEM_WRITER_MASK)) {
+			raw_spin_unlock_irq(&sem->wait_lock);
+			rwsem_set_reader_owned(sem);
+			lockevent_inc(rwsem_rlock_fast);
+			return sem;
+		}
+		adjustment += RWSEM_FLAG_WAITERS;
+	}
+	list_add_tail(&waiter.list, &sem->wait_list);
+
+	/* we're now waiting on the lock, but no longer actively locking */
+	count = atomic_long_add_return(adjustment, &sem->count);
+
+	/*
+	 * If there are no active locks, wake the front queued process(es).
+	 *
+	 * If there are no writers and we are first in the queue,
+	 * wake our own waiter to join the existing active readers!
+	 */
+	if (!RWSEM_COUNT_LOCKED(count) ||
+	   (!(count & RWSEM_WRITER_MASK) && (adjustment & RWSEM_FLAG_WAITERS)))
+		__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
+
+	raw_spin_unlock_irq(&sem->wait_lock);
+	wake_up_q(&wake_q);
+
+	/* wait to be given the lock */
+	while (true) {
+		set_current_state(state);
+		if (!waiter.task)
+			break;
+		if (signal_pending_state(state, current)) {
+			raw_spin_lock_irq(&sem->wait_lock);
+			if (waiter.task)
+				goto out_nolock;
+			raw_spin_unlock_irq(&sem->wait_lock);
+			break;
+		}
+		schedule();
+		lockevent_inc(rwsem_sleep_reader);
+	}
+
+	__set_current_state(TASK_RUNNING);
+	lockevent_inc(rwsem_rlock);
+	return sem;
+out_nolock:
+	list_del(&waiter.list);
+	if (list_empty(&sem->wait_list))
+		atomic_long_andnot(RWSEM_FLAG_WAITERS, &sem->count);
+	raw_spin_unlock_irq(&sem->wait_lock);
+	__set_current_state(TASK_RUNNING);
+	lockevent_inc(rwsem_rlock_fail);
+	return ERR_PTR(-EINTR);
+}
+
+__visible struct rw_semaphore * __sched
+rwsem_down_read_failed(struct rw_semaphore *sem)
+{
+	return __rwsem_down_read_failed_common(sem, TASK_UNINTERRUPTIBLE);
+}
+EXPORT_SYMBOL(rwsem_down_read_failed);
+
+__visible struct rw_semaphore * __sched
+rwsem_down_read_failed_killable(struct rw_semaphore *sem)
+{
+	return __rwsem_down_read_failed_common(sem, TASK_KILLABLE);
+}
+EXPORT_SYMBOL(rwsem_down_read_failed_killable);
+
+/*
+ * Wait until we successfully acquire the write lock
+ */
+static inline struct rw_semaphore *
+__rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
+{
+	long count;
+	bool waiting = true; /* any queued threads before us */
+	struct rwsem_waiter waiter;
+	struct rw_semaphore *ret = sem;
+	DEFINE_WAKE_Q(wake_q);
+
+	/* do optimistic spinning and steal lock if possible */
+	if (rwsem_optimistic_spin(sem))
+		return sem;
+
+	/*
+	 * Optimistic spinning failed, proceed to the slowpath
+	 * and block until we can acquire the sem.
+	 */
+	waiter.task = current;
+	waiter.type = RWSEM_WAITING_FOR_WRITE;
+
+	raw_spin_lock_irq(&sem->wait_lock);
+
+	/* account for this before adding a new element to the list */
+	if (list_empty(&sem->wait_list))
+		waiting = false;
+
+	list_add_tail(&waiter.list, &sem->wait_list);
+
+	/* we're now waiting on the lock */
+	if (waiting) {
+		count = atomic_long_read(&sem->count);
+
+		/*
+		 * If there were already threads queued before us and there are
+		 * no active writers and some readers, the lock must be read
+		 * owned; so we try to wake any read locks that were queued ahead
+		 * of us.
+		 */
+		if (!(count & RWSEM_WRITER_MASK) &&
+		     (count & RWSEM_READER_MASK)) {
+			__rwsem_mark_wake(sem, RWSEM_WAKE_READERS, &wake_q);
+			/*
+			 * The wakeup is normally called _after_ the wait_lock
+			 * is released, but given that we are proactively waking
+			 * readers we can deal with the wake_q overhead as it is
+			 * similar to releasing and taking the wait_lock again
+			 * for attempting rwsem_try_write_lock().
+			 */
+			wake_up_q(&wake_q);
+
+			/*
+			 * Reinitialize wake_q after use.
+			 */
+			wake_q_init(&wake_q);
+		}
+
+	} else {
+		count = atomic_long_add_return(RWSEM_FLAG_WAITERS, &sem->count);
+	}
+
+	/* wait until we successfully acquire the lock */
+	set_current_state(state);
+	while (true) {
+		if (rwsem_try_write_lock(count, sem))
+			break;
+		raw_spin_unlock_irq(&sem->wait_lock);
+
+		/* Block until there are no active lockers. */
+		do {
+			if (signal_pending_state(state, current))
+				goto out_nolock;
+
+			schedule();
+			lockevent_inc(rwsem_sleep_writer);
+			set_current_state(state);
+			count = atomic_long_read(&sem->count);
+		} while (RWSEM_COUNT_LOCKED(count));
+
+		raw_spin_lock_irq(&sem->wait_lock);
+	}
+	__set_current_state(TASK_RUNNING);
+	list_del(&waiter.list);
+	raw_spin_unlock_irq(&sem->wait_lock);
+	lockevent_inc(rwsem_wlock);
+
+	return ret;
+
+out_nolock:
+	__set_current_state(TASK_RUNNING);
+	raw_spin_lock_irq(&sem->wait_lock);
+	list_del(&waiter.list);
+	if (list_empty(&sem->wait_list))
+		atomic_long_andnot(RWSEM_FLAG_WAITERS, &sem->count);
+	else
+		__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
+	raw_spin_unlock_irq(&sem->wait_lock);
+	wake_up_q(&wake_q);
+	lockevent_inc(rwsem_wlock_fail);
+
+	return ERR_PTR(-EINTR);
+}
+
+__visible struct rw_semaphore * __sched
+rwsem_down_write_failed(struct rw_semaphore *sem)
+{
+	return __rwsem_down_write_failed_common(sem, TASK_UNINTERRUPTIBLE);
+}
+EXPORT_SYMBOL(rwsem_down_write_failed);
+
+__visible struct rw_semaphore * __sched
+rwsem_down_write_failed_killable(struct rw_semaphore *sem)
+{
+	return __rwsem_down_write_failed_common(sem, TASK_KILLABLE);
+}
+EXPORT_SYMBOL(rwsem_down_write_failed_killable);
+
+/*
+ * handle waking up a waiter on the semaphore
+ * - up_read/up_write has decremented the active part of count if we come here
+ */
+__visible
+struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
+{
+	unsigned long flags;
+	DEFINE_WAKE_Q(wake_q);
+
+	raw_spin_lock_irqsave(&sem->wait_lock, flags);
+
+	if (!list_empty(&sem->wait_list))
+		__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
+
+	raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
+	wake_up_q(&wake_q);
+
+	return sem;
+}
+EXPORT_SYMBOL(rwsem_wake);
+
+/*
+ * downgrade a write lock into a read lock
+ * - caller incremented waiting part of count and discovered it still negative
+ * - just wake up any readers at the front of the queue
+ */
+__visible
+struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem)
+{
+	unsigned long flags;
+	DEFINE_WAKE_Q(wake_q);
+
+	raw_spin_lock_irqsave(&sem->wait_lock, flags);
+
+	if (!list_empty(&sem->wait_list))
+		__rwsem_mark_wake(sem, RWSEM_WAKE_READ_OWNED, &wake_q);
+
+	raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
+	wake_up_q(&wake_q);
+
+	return sem;
+}
+EXPORT_SYMBOL(rwsem_downgrade_wake);
+
+/*
+ * lock for reading
+ */
+inline void __down_read(struct rw_semaphore *sem)
+{
+	if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
+			&sem->count) & RWSEM_READ_FAILED_MASK)) {
+		rwsem_down_read_failed(sem);
+		DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
+					RWSEM_READER_OWNED), sem);
+	} else {
+		rwsem_set_reader_owned(sem);
+	}
+}
+
+static inline int __down_read_killable(struct rw_semaphore *sem)
+{
+	if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
+			&sem->count) & RWSEM_READ_FAILED_MASK)) {
+		if (IS_ERR(rwsem_down_read_failed_killable(sem)))
+			return -EINTR;
+		DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
+					RWSEM_READER_OWNED), sem);
+	} else {
+		rwsem_set_reader_owned(sem);
+	}
+	return 0;
+}
+
+static inline int __down_read_trylock(struct rw_semaphore *sem)
+{
+	/*
+	 * Optimize for the case when the rwsem is not locked at all.
+	 */
+	long tmp = RWSEM_UNLOCKED_VALUE;
+
+	lockevent_inc(rwsem_rtrylock);
+	do {
+		if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
+					tmp + RWSEM_READER_BIAS)) {
+			rwsem_set_reader_owned(sem);
+			return 1;
+		}
+	} while (!(tmp & RWSEM_READ_FAILED_MASK));
+	return 0;
+}
+
+/*
+ * lock for writing
+ */
+static inline void __down_write(struct rw_semaphore *sem)
+{
+	if (unlikely(atomic_long_cmpxchg_acquire(&sem->count, 0,
+						 RWSEM_WRITER_LOCKED)))
+		rwsem_down_write_failed(sem);
+	rwsem_set_owner(sem);
+}
+
+static inline int __down_write_killable(struct rw_semaphore *sem)
+{
+	if (unlikely(atomic_long_cmpxchg_acquire(&sem->count, 0,
+						 RWSEM_WRITER_LOCKED)))
+		if (IS_ERR(rwsem_down_write_failed_killable(sem)))
+			return -EINTR;
+	rwsem_set_owner(sem);
+	return 0;
+}
+
+static inline int __down_write_trylock(struct rw_semaphore *sem)
+{
+	long tmp;
+
+	lockevent_inc(rwsem_wtrylock);
+	tmp = atomic_long_cmpxchg_acquire(&sem->count, RWSEM_UNLOCKED_VALUE,
+					  RWSEM_WRITER_LOCKED);
+	if (tmp == RWSEM_UNLOCKED_VALUE) {
+		rwsem_set_owner(sem);
+		return true;
+	}
+	return false;
+}
+
+/*
+ * unlock after reading
+ */
+inline void __up_read(struct rw_semaphore *sem)
+{
+	long tmp;
+
+	DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner & RWSEM_READER_OWNED),
+				sem);
+	rwsem_clear_reader_owned(sem);
+	tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
+	if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS))
+			== RWSEM_FLAG_WAITERS))
+		rwsem_wake(sem);
+}
+
+/*
+ * unlock after writing
+ */
+static inline void __up_write(struct rw_semaphore *sem)
+{
+	DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
+	rwsem_clear_owner(sem);
+	if (unlikely(atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED,
+			&sem->count) & RWSEM_FLAG_WAITERS))
+		rwsem_wake(sem);
+}
+
+/*
+ * downgrade write lock to read lock
+ */
+static inline void __downgrade_write(struct rw_semaphore *sem)
+{
+	long tmp;
+
+	/*
+	 * When downgrading from exclusive to shared ownership,
+	 * anything inside the write-locked region cannot leak
+	 * into the read side. In contrast, anything in the
+	 * read-locked region is ok to be re-ordered into the
+	 * write side. As such, rely on RELEASE semantics.
+	 */
+	DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
+	tmp = atomic_long_fetch_add_release(
+		-RWSEM_WRITER_LOCKED+RWSEM_READER_BIAS, &sem->count);
+	rwsem_set_reader_owned(sem);
+	if (tmp & RWSEM_FLAG_WAITERS)
+		rwsem_downgrade_wake(sem);
+}
 
 /*
  * lock for reading
diff --git a/kernel/locking/rwsem.h b/kernel/locking/rwsem.h
index e7cbabfe0ad1..2534ce49f648 100644
--- a/kernel/locking/rwsem.h
+++ b/kernel/locking/rwsem.h
@@ -1,281 +1,10 @@
 /* SPDX-License-Identifier: GPL-2.0 */
-/*
- * The least significant 2 bits of the owner value has the following
- * meanings when set.
- *  - RWSEM_READER_OWNED (bit 0): The rwsem is owned by readers
- *  - RWSEM_ANONYMOUSLY_OWNED (bit 1): The rwsem is anonymously owned,
- *    i.e. the owner(s) cannot be readily determined. It can be reader
- *    owned or the owning writer is indeterminate.
- *
- * When a writer acquires a rwsem, it puts its task_struct pointer
- * into the owner field. It is cleared after an unlock.
- *
- * When a reader acquires a rwsem, it will also puts its task_struct
- * pointer into the owner field with both the RWSEM_READER_OWNED and
- * RWSEM_ANONYMOUSLY_OWNED bits set. On unlock, the owner field will
- * largely be left untouched. So for a free or reader-owned rwsem,
- * the owner value may contain information about the last reader that
- * acquires the rwsem. The anonymous bit is set because that particular
- * reader may or may not still own the lock.
- *
- * That information may be helpful in debugging cases where the system
- * seems to hang on a reader owned rwsem especially if only one reader
- * is involved. Ideally we would like to track all the readers that own
- * a rwsem, but the overhead is simply too big.
- */
-#include "lock_events.h"
 
-#define RWSEM_READER_OWNED	(1UL << 0)
-#define RWSEM_ANONYMOUSLY_OWNED	(1UL << 1)
+#ifndef __INTERNAL_RWSEM_H
+#define __INTERNAL_RWSEM_H
+#include <linux/rwsem.h>
 
-#ifdef CONFIG_DEBUG_RWSEMS
-# define DEBUG_RWSEMS_WARN_ON(c, sem)	do {			\
-	if (!debug_locks_silent &&				\
-	    WARN_ONCE(c, "DEBUG_RWSEMS_WARN_ON(%s): count = 0x%lx, owner = 0x%lx, curr 0x%lx, list %sempty\n",\
-		#c, atomic_long_read(&(sem)->count),		\
-		(long)((sem)->owner), (long)current,		\
-		list_empty(&(sem)->wait_list) ? "" : "not "))	\
-			debug_locks_off();			\
-	} while (0)
-#else
-# define DEBUG_RWSEMS_WARN_ON(c, sem)
-#endif
+extern void __down_read(struct rw_semaphore *sem);
+extern void __up_read(struct rw_semaphore *sem);
 
-/*
- * The definition of the atomic counter in the semaphore:
- *
- * Bit  0   - writer locked bit
- * Bit  1   - waiters present bit
- * Bits 2-7 - reserved
- * Bits 8-X - 24-bit (32-bit) or 56-bit reader count
- *
- * atomic_long_fetch_add() is used to obtain reader lock, whereas
- * atomic_long_cmpxchg() will be used to obtain writer lock.
- */
-#define RWSEM_WRITER_LOCKED	(1UL << 0)
-#define RWSEM_FLAG_WAITERS	(1UL << 1)
-#define RWSEM_READER_SHIFT	8
-#define RWSEM_READER_BIAS	(1UL << RWSEM_READER_SHIFT)
-#define RWSEM_READER_MASK	(~(RWSEM_READER_BIAS - 1))
-#define RWSEM_WRITER_MASK	RWSEM_WRITER_LOCKED
-#define RWSEM_LOCK_MASK		(RWSEM_WRITER_MASK|RWSEM_READER_MASK)
-#define RWSEM_READ_FAILED_MASK	(RWSEM_WRITER_MASK|RWSEM_FLAG_WAITERS)
-
-#define RWSEM_COUNT_LOCKED(c)	((c) & RWSEM_LOCK_MASK)
-
-/*
- * All writes to owner are protected by WRITE_ONCE() to make sure that
- * store tearing can't happen as optimistic spinners may read and use
- * the owner value concurrently without lock. Read from owner, however,
- * may not need READ_ONCE() as long as the pointer value is only used
- * for comparison and isn't being dereferenced.
- */
-static inline void rwsem_set_owner(struct rw_semaphore *sem)
-{
-	WRITE_ONCE(sem->owner, current);
-}
-
-static inline void rwsem_clear_owner(struct rw_semaphore *sem)
-{
-	WRITE_ONCE(sem->owner, NULL);
-}
-
-/*
- * The task_struct pointer of the last owning reader will be left in
- * the owner field.
- *
- * Note that the owner value just indicates the task has owned the rwsem
- * previously, it may not be the real owner or one of the real owners
- * anymore when that field is examined, so take it with a grain of salt.
- */
-static inline void __rwsem_set_reader_owned(struct rw_semaphore *sem,
-					    struct task_struct *owner)
-{
-	unsigned long val = (unsigned long)owner | RWSEM_READER_OWNED
-						 | RWSEM_ANONYMOUSLY_OWNED;
-
-	WRITE_ONCE(sem->owner, (struct task_struct *)val);
-}
-
-static inline void rwsem_set_reader_owned(struct rw_semaphore *sem)
-{
-	__rwsem_set_reader_owned(sem, current);
-}
-
-/*
- * Return true if the a rwsem waiter can spin on the rwsem's owner
- * and steal the lock, i.e. the lock is not anonymously owned.
- * N.B. !owner is considered spinnable.
- */
-static inline bool is_rwsem_owner_spinnable(struct task_struct *owner)
-{
-	return !((unsigned long)owner & RWSEM_ANONYMOUSLY_OWNED);
-}
-
-/*
- * Return true if rwsem is owned by an anonymous writer or readers.
- */
-static inline bool rwsem_has_anonymous_owner(struct task_struct *owner)
-{
-	return (unsigned long)owner & RWSEM_ANONYMOUSLY_OWNED;
-}
-
-#ifdef CONFIG_DEBUG_RWSEMS
-/*
- * With CONFIG_DEBUG_RWSEMS configured, it will make sure that if there
- * is a task pointer in owner of a reader-owned rwsem, it will be the
- * real owner or one of the real owners. The only exception is when the
- * unlock is done by up_read_non_owner().
- */
-static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
-{
-	unsigned long val = (unsigned long)current | RWSEM_READER_OWNED
-						   | RWSEM_ANONYMOUSLY_OWNED;
-	if (READ_ONCE(sem->owner) == (struct task_struct *)val)
-		cmpxchg_relaxed((unsigned long *)&sem->owner, val,
-				RWSEM_READER_OWNED | RWSEM_ANONYMOUSLY_OWNED);
-}
-#else
-static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
-{
-}
-#endif
-
-extern struct rw_semaphore *rwsem_down_read_failed(struct rw_semaphore *sem);
-extern struct rw_semaphore *rwsem_down_read_failed_killable(struct rw_semaphore *sem);
-extern struct rw_semaphore *rwsem_down_write_failed(struct rw_semaphore *sem);
-extern struct rw_semaphore *rwsem_down_write_failed_killable(struct rw_semaphore *sem);
-extern struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem);
-extern struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem);
-
-/*
- * lock for reading
- */
-static inline void __down_read(struct rw_semaphore *sem)
-{
-	if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
-			&sem->count) & RWSEM_READ_FAILED_MASK)) {
-		rwsem_down_read_failed(sem);
-		DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
-					RWSEM_READER_OWNED), sem);
-	} else {
-		rwsem_set_reader_owned(sem);
-	}
-}
-
-static inline int __down_read_killable(struct rw_semaphore *sem)
-{
-	if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
-			&sem->count) & RWSEM_READ_FAILED_MASK)) {
-		if (IS_ERR(rwsem_down_read_failed_killable(sem)))
-			return -EINTR;
-		DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
-					RWSEM_READER_OWNED), sem);
-	} else {
-		rwsem_set_reader_owned(sem);
-	}
-	return 0;
-}
-
-static inline int __down_read_trylock(struct rw_semaphore *sem)
-{
-	/*
-	 * Optimize for the case when the rwsem is not locked at all.
-	 */
-	long tmp = RWSEM_UNLOCKED_VALUE;
-
-	lockevent_inc(rwsem_rtrylock);
-	do {
-		if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
-					tmp + RWSEM_READER_BIAS)) {
-			rwsem_set_reader_owned(sem);
-			return 1;
-		}
-	} while (!(tmp & RWSEM_READ_FAILED_MASK));
-	return 0;
-}
-
-/*
- * lock for writing
- */
-static inline void __down_write(struct rw_semaphore *sem)
-{
-	if (unlikely(atomic_long_cmpxchg_acquire(&sem->count, 0,
-						 RWSEM_WRITER_LOCKED)))
-		rwsem_down_write_failed(sem);
-	rwsem_set_owner(sem);
-}
-
-static inline int __down_write_killable(struct rw_semaphore *sem)
-{
-	if (unlikely(atomic_long_cmpxchg_acquire(&sem->count, 0,
-						 RWSEM_WRITER_LOCKED)))
-		if (IS_ERR(rwsem_down_write_failed_killable(sem)))
-			return -EINTR;
-	rwsem_set_owner(sem);
-	return 0;
-}
-
-static inline int __down_write_trylock(struct rw_semaphore *sem)
-{
-	long tmp;
-
-	lockevent_inc(rwsem_wtrylock);
-	tmp = atomic_long_cmpxchg_acquire(&sem->count, RWSEM_UNLOCKED_VALUE,
-					  RWSEM_WRITER_LOCKED);
-	if (tmp == RWSEM_UNLOCKED_VALUE) {
-		rwsem_set_owner(sem);
-		return true;
-	}
-	return false;
-}
-
-/*
- * unlock after reading
- */
-static inline void __up_read(struct rw_semaphore *sem)
-{
-	long tmp;
-
-	DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner & RWSEM_READER_OWNED),
-				sem);
-	rwsem_clear_reader_owned(sem);
-	tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
-	if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS))
-			== RWSEM_FLAG_WAITERS))
-		rwsem_wake(sem);
-}
-
-/*
- * unlock after writing
- */
-static inline void __up_write(struct rw_semaphore *sem)
-{
-	DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
-	rwsem_clear_owner(sem);
-	if (unlikely(atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED,
-			&sem->count) & RWSEM_FLAG_WAITERS))
-		rwsem_wake(sem);
-}
-
-/*
- * downgrade write lock to read lock
- */
-static inline void __downgrade_write(struct rw_semaphore *sem)
-{
-	long tmp;
-
-	/*
-	 * When downgrading from exclusive to shared ownership,
-	 * anything inside the write-locked region cannot leak
-	 * into the read side. In contrast, anything in the
-	 * read-locked region is ok to be re-ordered into the
-	 * write side. As such, rely on RELEASE semantics.
-	 */
-	DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
-	tmp = atomic_long_fetch_add_release(
-		-RWSEM_WRITER_LOCKED+RWSEM_READER_BIAS, &sem->count);
-	rwsem_set_reader_owned(sem);
-	if (tmp & RWSEM_FLAG_WAITERS)
-		rwsem_downgrade_wake(sem);
-}
+#endif /* __INTERNAL_RWSEM_H */
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 112+ messages in thread

* [PATCH v4 06/16] locking/rwsem: Code cleanup after files merging
  2019-04-13 17:22 [PATCH v4 00/16] locking/rwsem: Rwsem rearchitecture part 2 Waiman Long
                   ` (4 preceding siblings ...)
  2019-04-13 17:22 ` [PATCH v4 05/16] locking/rwsem: Merge rwsem.h and rwsem-xadd.c into rwsem.c Waiman Long
@ 2019-04-13 17:22 ` Waiman Long
  2019-04-16 16:01   ` Peter Zijlstra
  2019-04-13 17:22 ` [PATCH v4 07/16] locking/rwsem: Implement lock handoff to prevent lock starvation Waiman Long
                   ` (9 subsequent siblings)
  15 siblings, 1 reply; 112+ messages in thread
From: Waiman Long @ 2019-04-13 17:22 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner
  Cc: linux-kernel, x86, Davidlohr Bueso, Linus Torvalds, Tim Chen,
	huang ying, Waiman Long

After merging all the relevant rwsem code into a single file, there
are a number of optimizations and cleanups that can be done:

 1) Remove all the EXPORT_SYMBOL() calls for functions that are not
    accessed elsewhere.
 2) Remove all the __visible tags as none of the functions will be
    called from assembly code anymore.
 3) Make all the internal functions static.
 4) Remove some unneeded blank lines.

That enables the compiler to do better optimization and reduce code
size. The text+data size of rwsem.o on an x86-64 machine was reduced
from 8945 bytes to 4651 bytes with this change.
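For reference, the text+data figure above is presumably what binutils'
size(1) reports for the built object (the exact numbers will of course
vary with the kernel configuration and compiler):

  $ size kernel/locking/rwsem.o

Adding the "text" and "data" columns of its output gives the quoted
byte counts.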

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/rwsem.c | 50 +++++++++---------------------------------
 1 file changed, 10 insertions(+), 40 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 5f06b0601eb6..c1a089ab19fd 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -207,7 +207,6 @@ void __init_rwsem(struct rw_semaphore *sem, const char *name,
 	osq_lock_init(&sem->osq);
 #endif
 }
-
 EXPORT_SYMBOL(__init_rwsem);
 
 enum rwsem_waiter_type {
@@ -575,19 +574,17 @@ __rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
 	return ERR_PTR(-EINTR);
 }
 
-__visible struct rw_semaphore * __sched
+static inline struct rw_semaphore * __sched
 rwsem_down_read_failed(struct rw_semaphore *sem)
 {
 	return __rwsem_down_read_failed_common(sem, TASK_UNINTERRUPTIBLE);
 }
-EXPORT_SYMBOL(rwsem_down_read_failed);
 
-__visible struct rw_semaphore * __sched
+static inline struct rw_semaphore * __sched
 rwsem_down_read_failed_killable(struct rw_semaphore *sem)
 {
 	return __rwsem_down_read_failed_common(sem, TASK_KILLABLE);
 }
-EXPORT_SYMBOL(rwsem_down_read_failed_killable);
 
 /*
  * Wait until we successfully acquire the write lock
@@ -694,26 +691,23 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
 	return ERR_PTR(-EINTR);
 }
 
-__visible struct rw_semaphore * __sched
+static inline struct rw_semaphore * __sched
 rwsem_down_write_failed(struct rw_semaphore *sem)
 {
 	return __rwsem_down_write_failed_common(sem, TASK_UNINTERRUPTIBLE);
 }
-EXPORT_SYMBOL(rwsem_down_write_failed);
 
-__visible struct rw_semaphore * __sched
+static inline struct rw_semaphore * __sched
 rwsem_down_write_failed_killable(struct rw_semaphore *sem)
 {
 	return __rwsem_down_write_failed_common(sem, TASK_KILLABLE);
 }
-EXPORT_SYMBOL(rwsem_down_write_failed_killable);
 
 /*
  * handle waking up a waiter on the semaphore
  * - up_read/up_write has decremented the active part of count if we come here
  */
-__visible
-struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
+static struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
 {
 	unsigned long flags;
 	DEFINE_WAKE_Q(wake_q);
@@ -728,15 +722,13 @@ struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
 
 	return sem;
 }
-EXPORT_SYMBOL(rwsem_wake);
 
 /*
  * downgrade a write lock into a read lock
  * - caller incremented waiting part of count and discovered it still negative
  * - just wake up any readers at the front of the queue
  */
-__visible
-struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem)
+static struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem)
 {
 	unsigned long flags;
 	DEFINE_WAKE_Q(wake_q);
@@ -751,7 +743,6 @@ struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem)
 
 	return sem;
 }
-EXPORT_SYMBOL(rwsem_downgrade_wake);
 
 /*
  * lock for reading
@@ -895,7 +886,6 @@ void __sched down_read(struct rw_semaphore *sem)
 
 	LOCK_CONTENDED(sem, __down_read_trylock, __down_read);
 }
-
 EXPORT_SYMBOL(down_read);
 
 int __sched down_read_killable(struct rw_semaphore *sem)
@@ -910,7 +900,6 @@ int __sched down_read_killable(struct rw_semaphore *sem)
 
 	return 0;
 }
-
 EXPORT_SYMBOL(down_read_killable);
 
 /*
@@ -924,7 +913,6 @@ int down_read_trylock(struct rw_semaphore *sem)
 		rwsem_acquire_read(&sem->dep_map, 0, 1, _RET_IP_);
 	return ret;
 }
-
 EXPORT_SYMBOL(down_read_trylock);
 
 /*
@@ -934,10 +922,8 @@ void __sched down_write(struct rw_semaphore *sem)
 {
 	might_sleep();
 	rwsem_acquire(&sem->dep_map, 0, 0, _RET_IP_);
-
 	LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
 }
-
 EXPORT_SYMBOL(down_write);
 
 /*
@@ -948,14 +934,14 @@ int __sched down_write_killable(struct rw_semaphore *sem)
 	might_sleep();
 	rwsem_acquire(&sem->dep_map, 0, 0, _RET_IP_);
 
-	if (LOCK_CONTENDED_RETURN(sem, __down_write_trylock, __down_write_killable)) {
+	if (LOCK_CONTENDED_RETURN(sem, __down_write_trylock,
+				  __down_write_killable)) {
 		rwsem_release(&sem->dep_map, 1, _RET_IP_);
 		return -EINTR;
 	}
 
 	return 0;
 }
-
 EXPORT_SYMBOL(down_write_killable);
 
 /*
@@ -970,7 +956,6 @@ int down_write_trylock(struct rw_semaphore *sem)
 
 	return ret;
 }
-
 EXPORT_SYMBOL(down_write_trylock);
 
 /*
@@ -979,10 +964,8 @@ EXPORT_SYMBOL(down_write_trylock);
 void up_read(struct rw_semaphore *sem)
 {
 	rwsem_release(&sem->dep_map, 1, _RET_IP_);
-
 	__up_read(sem);
 }
-
 EXPORT_SYMBOL(up_read);
 
 /*
@@ -991,10 +974,8 @@ EXPORT_SYMBOL(up_read);
 void up_write(struct rw_semaphore *sem)
 {
 	rwsem_release(&sem->dep_map, 1, _RET_IP_);
-
 	__up_write(sem);
 }
-
 EXPORT_SYMBOL(up_write);
 
 /*
@@ -1003,10 +984,8 @@ EXPORT_SYMBOL(up_write);
 void downgrade_write(struct rw_semaphore *sem)
 {
 	lock_downgrade(&sem->dep_map, _RET_IP_);
-
 	__downgrade_write(sem);
 }
-
 EXPORT_SYMBOL(downgrade_write);
 
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
@@ -1015,40 +994,32 @@ void down_read_nested(struct rw_semaphore *sem, int subclass)
 {
 	might_sleep();
 	rwsem_acquire_read(&sem->dep_map, subclass, 0, _RET_IP_);
-
 	LOCK_CONTENDED(sem, __down_read_trylock, __down_read);
 }
-
 EXPORT_SYMBOL(down_read_nested);
 
 void _down_write_nest_lock(struct rw_semaphore *sem, struct lockdep_map *nest)
 {
 	might_sleep();
 	rwsem_acquire_nest(&sem->dep_map, 0, 0, nest, _RET_IP_);
-
 	LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
 }
-
 EXPORT_SYMBOL(_down_write_nest_lock);
 
 void down_read_non_owner(struct rw_semaphore *sem)
 {
 	might_sleep();
-
 	__down_read(sem);
 	__rwsem_set_reader_owned(sem, NULL);
 }
-
 EXPORT_SYMBOL(down_read_non_owner);
 
 void down_write_nested(struct rw_semaphore *sem, int subclass)
 {
 	might_sleep();
 	rwsem_acquire(&sem->dep_map, subclass, 0, _RET_IP_);
-
 	LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
 }
-
 EXPORT_SYMBOL(down_write_nested);
 
 int __sched down_write_killable_nested(struct rw_semaphore *sem, int subclass)
@@ -1056,14 +1027,14 @@ int __sched down_write_killable_nested(struct rw_semaphore *sem, int subclass)
 	might_sleep();
 	rwsem_acquire(&sem->dep_map, subclass, 0, _RET_IP_);
 
-	if (LOCK_CONTENDED_RETURN(sem, __down_write_trylock, __down_write_killable)) {
+	if (LOCK_CONTENDED_RETURN(sem, __down_write_trylock,
+				  __down_write_killable)) {
 		rwsem_release(&sem->dep_map, 1, _RET_IP_);
 		return -EINTR;
 	}
 
 	return 0;
 }
-
 EXPORT_SYMBOL(down_write_killable_nested);
 
 void up_read_non_owner(struct rw_semaphore *sem)
@@ -1072,7 +1043,6 @@ void up_read_non_owner(struct rw_semaphore *sem)
 				sem);
 	__up_read(sem);
 }
-
 EXPORT_SYMBOL(up_read_non_owner);
 
 #endif
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 112+ messages in thread

* [PATCH v4 07/16] locking/rwsem: Implement lock handoff to prevent lock starvation
  2019-04-13 17:22 [PATCH v4 00/16] locking/rwsem: Rwsem rearchitecture part 2 Waiman Long
                   ` (5 preceding siblings ...)
  2019-04-13 17:22 ` [PATCH v4 06/16] locking/rwsem: Code cleanup after files merging Waiman Long
@ 2019-04-13 17:22 ` Waiman Long
  2019-04-16 14:12   ` Peter Zijlstra
                     ` (2 more replies)
  2019-04-13 17:22 ` [PATCH v4 08/16] locking/rwsem: Make rwsem_spin_on_owner() return owner state Waiman Long
                   ` (8 subsequent siblings)
  15 siblings, 3 replies; 112+ messages in thread
From: Waiman Long @ 2019-04-13 17:22 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner
  Cc: linux-kernel, x86, Davidlohr Bueso, Linus Torvalds, Tim Chen,
	huang ying, Waiman Long

Because of writer lock stealing, it is possible that a constant
stream of incoming writers will cause a waiting writer or reader to
wait indefinitely, leading to lock starvation.

This patch implements a lock handoff mechanism to disable lock stealing
and force lock handoff to the first waiter, or waiters in the case of
readers, in the queue after at least a 4ms waiting period, unless it is
an RT writer task which doesn't need to wait. The waiting period keeps
handoff from discouraging lock stealing so much that performance would
suffer.

The setting and clearing of the handoff bit is serialized by the
wait_lock. So racing is not possible.
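
As a rough illustration only (a user-space sketch, not the kernel code;
the constants mirror this patch but the can_steal() helper name is made
up), the extra condition that the handoff bit adds to the lock-stealing
paths boils down to:

  #include <stdbool.h>

  #define RWSEM_WRITER_LOCKED	(1UL << 0)
  #define RWSEM_FLAG_HANDOFF	(1UL << 2)
  #define RWSEM_READER_SHIFT	8
  #define RWSEM_READER_BIAS	(1UL << RWSEM_READER_SHIFT)
  #define RWSEM_READER_MASK	(~(RWSEM_READER_BIAS - 1))
  #define RWSEM_LOCK_MASK	(RWSEM_WRITER_LOCKED | RWSEM_READER_MASK)

  /*
   * An optimistic spinner or a waiter that is not first in the queue
   * may only take the lock when it is completely free and no handoff
   * has been granted to the waiter at the head of the wait list.
   */
  static bool can_steal(unsigned long count)
  {
  	return !(count & (RWSEM_LOCK_MASK | RWSEM_FLAG_HANDOFF));
  }

The first waiter itself only requests handoff, i.e. sets
RWSEM_FLAG_HANDOFF under wait_lock, once time_after(jiffies,
waiter->timeout) is true, which is the 4ms wait described above.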

A rwsem microbenchmark was run for 5 seconds on a 2-socket 40-core
80-thread Skylake system with a v5.1 based kernel and 240 write_lock
threads with 5us sleep critical section.

Before the patch, the min/mean/max numbers of locking operations for
the locking threads were 1/7,792/173,696. After the patch, the figures
became 5,842/6,542/7,458. The rwsem became much fairer, though the mean
number of locking operations done dropped by about 16%; that is the
tradeoff for better fairness.
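
The microbenchmark itself is not part of this series; as a very rough
sketch (module-scope names such as test_rwsem and bench_writer are made
up here), each of the 240 write_lock threads essentially runs a loop
like the following and reports its iteration count at the end of the
5-second window:

  #include <linux/rwsem.h>
  #include <linux/kthread.h>
  #include <linux/delay.h>

  static DECLARE_RWSEM(test_rwsem);

  static int bench_writer(void *arg)
  {
  	unsigned long *iterations = arg;	/* per-thread counter */

  	while (!kthread_should_stop()) {
  		down_write(&test_rwsem);
  		usleep_range(5, 10);		/* ~5us sleep critical section */
  		up_write(&test_rwsem);
  		(*iterations)++;
  	}
  	return 0;
  }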

Making the waiter set the handoff bit right after the first wakeup can
hurt performance, especially with a mixed reader/writer workload. With
the same microbenchmark, a short critical section and an equal number of
reader and writer threads (40/40), the reader/writer locking operation
counts with the current patch were:

  40 readers, Iterations Min/Mean/Max = 1,793/1,794/1,796
  40 writers, Iterations Min/Mean/Max = 1,793/34,956/86,081

By making waiter set handoff bit immediately after wakeup:

  40 readers, Iterations Min/Mean/Max = 43/44/46
  40 writers, Iterations Min/Mean/Max = 43/1,263/3,191

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/lock_events_list.h |   2 +
 kernel/locking/rwsem.c            | 205 +++++++++++++++++++++++-------
 2 files changed, 164 insertions(+), 43 deletions(-)

diff --git a/kernel/locking/lock_events_list.h b/kernel/locking/lock_events_list.h
index ad7668cfc9da..29e5c52197fa 100644
--- a/kernel/locking/lock_events_list.h
+++ b/kernel/locking/lock_events_list.h
@@ -61,7 +61,9 @@ LOCK_EVENT(rwsem_opt_fail)	/* # of failed opt-spinnings		*/
 LOCK_EVENT(rwsem_rlock)		/* # of read locks acquired		*/
 LOCK_EVENT(rwsem_rlock_fast)	/* # of fast read locks acquired	*/
 LOCK_EVENT(rwsem_rlock_fail)	/* # of failed read lock acquisitions	*/
+LOCK_EVENT(rwsem_rlock_handoff)	/* # of read lock handoffs		*/
 LOCK_EVENT(rwsem_rtrylock)	/* # of read trylock calls		*/
 LOCK_EVENT(rwsem_wlock)		/* # of write locks acquired		*/
 LOCK_EVENT(rwsem_wlock_fail)	/* # of failed write lock acquisitions	*/
+LOCK_EVENT(rwsem_wlock_handoff)	/* # of write lock handoffs		*/
 LOCK_EVENT(rwsem_wtrylock)	/* # of write trylock calls		*/
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index c1a089ab19fd..aaab546a890d 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -74,22 +74,38 @@
  *
  * Bit  0   - writer locked bit
  * Bit  1   - waiters present bit
- * Bits 2-7 - reserved
+ * Bit  2   - lock handoff bit
+ * Bits 3-7 - reserved
  * Bits 8-X - 24-bit (32-bit) or 56-bit reader count
  *
  * atomic_long_fetch_add() is used to obtain reader lock, whereas
  * atomic_long_cmpxchg() will be used to obtain writer lock.
+ *
+ * There are three places where the lock handoff bit may be set or cleared.
+ * 1) __rwsem_mark_wake() for readers.
+ * 2) rwsem_try_write_lock() for writers.
+ * 3) Error path of __rwsem_down_write_failed_common().
+ *
+ * For all the above cases, wait_lock will be held. A writer must also
+ * be the first one in the wait_list to be eligible for setting the handoff
+ * bit. So concurrent setting/clearing of handoff bit is not possible.
  */
 #define RWSEM_WRITER_LOCKED	(1UL << 0)
 #define RWSEM_FLAG_WAITERS	(1UL << 1)
+#define RWSEM_FLAG_HANDOFF	(1UL << 2)
+
 #define RWSEM_READER_SHIFT	8
 #define RWSEM_READER_BIAS	(1UL << RWSEM_READER_SHIFT)
 #define RWSEM_READER_MASK	(~(RWSEM_READER_BIAS - 1))
 #define RWSEM_WRITER_MASK	RWSEM_WRITER_LOCKED
 #define RWSEM_LOCK_MASK		(RWSEM_WRITER_MASK|RWSEM_READER_MASK)
-#define RWSEM_READ_FAILED_MASK	(RWSEM_WRITER_MASK|RWSEM_FLAG_WAITERS)
+#define RWSEM_READ_FAILED_MASK	(RWSEM_WRITER_MASK|RWSEM_FLAG_WAITERS|\
+				 RWSEM_FLAG_HANDOFF)
 
 #define RWSEM_COUNT_LOCKED(c)	((c) & RWSEM_LOCK_MASK)
+#define RWSEM_COUNT_HANDOFF(c)	((c) & RWSEM_FLAG_HANDOFF)
+#define RWSEM_COUNT_LOCKED_OR_HANDOFF(c)	\
+	((c) & (RWSEM_LOCK_MASK|RWSEM_FLAG_HANDOFF))
 
 /*
  * All writes to owner are protected by WRITE_ONCE() to make sure that
@@ -218,6 +234,7 @@ struct rwsem_waiter {
 	struct list_head list;
 	struct task_struct *task;
 	enum rwsem_waiter_type type;
+	unsigned long timeout;
 };
 
 enum rwsem_wake_type {
@@ -226,6 +243,18 @@ enum rwsem_wake_type {
 	RWSEM_WAKE_READ_OWNED	/* Waker thread holds the read lock */
 };
 
+enum writer_wait_state {
+	WRITER_NOT_FIRST,	/* Writer is not first in wait list */
+	WRITER_FIRST,		/* Writer is first in wait list     */
+	WRITER_HANDOFF		/* Writer is first & handoff needed */
+};
+
+/*
+ * The typical HZ value is either 250 or 1000. So set the minimum waiting
+ * time to 4ms in the wait queue before initiating the handoff protocol.
+ */
+#define RWSEM_WAIT_TIMEOUT	(HZ/250)
+
 /*
  * handle the lock release when processes blocked on it that can now run
  * - if we come here from up_xxxx(), then the RWSEM_FLAG_WAITERS bit must
@@ -245,6 +274,8 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
 	struct rwsem_waiter *waiter, *tmp;
 	long oldcount, woken = 0, adjustment = 0;
 
+	lockdep_assert_held(&sem->wait_lock);
+
 	/*
 	 * Take a peek at the queue head waiter such that we can determine
 	 * the wakeup(s) to perform.
@@ -276,6 +307,15 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
 		adjustment = RWSEM_READER_BIAS;
 		oldcount = atomic_long_fetch_add(adjustment, &sem->count);
 		if (unlikely(oldcount & RWSEM_WRITER_MASK)) {
+			/*
+			 * Initiate handoff to reader, if applicable.
+			 */
+			if (!(oldcount & RWSEM_FLAG_HANDOFF) &&
+			    time_after(jiffies, waiter->timeout)) {
+				adjustment -= RWSEM_FLAG_HANDOFF;
+				lockevent_inc(rwsem_rlock_handoff);
+			}
+
 			atomic_long_sub(adjustment, &sem->count);
 			return;
 		}
@@ -324,6 +364,12 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
 		adjustment -= RWSEM_FLAG_WAITERS;
 	}
 
+	/*
+	 * Clear the handoff flag
+	 */
+	if (woken && RWSEM_COUNT_HANDOFF(atomic_long_read(&sem->count)))
+		adjustment -= RWSEM_FLAG_HANDOFF;
+
 	if (adjustment)
 		atomic_long_add(adjustment, &sem->count);
 }
@@ -332,22 +378,42 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
  * This function must be called with the sem->wait_lock held to prevent
  * race conditions between checking the rwsem wait list and setting the
  * sem->count accordingly.
+ *
+ * If wstate is WRITER_HANDOFF, it will make sure that either the handoff
+ * bit is set or the lock is acquired.
  */
-static inline bool rwsem_try_write_lock(long count, struct rw_semaphore *sem)
+static inline bool rwsem_try_write_lock(long count, struct rw_semaphore *sem,
+					enum writer_wait_state wstate)
 {
 	long new;
 
-	if (RWSEM_COUNT_LOCKED(count))
+retry:
+	if (RWSEM_COUNT_LOCKED(count)) {
+		if (RWSEM_COUNT_HANDOFF(count) || (wstate != WRITER_HANDOFF))
+			return false;
+		/*
+		 * The lock may become free just before setting the handoff bit.
+		 * It would be simpler if atomic_long_or_return() were available.
+		 */
+		atomic_long_or(RWSEM_FLAG_HANDOFF, &sem->count);
+		count = atomic_long_read(&sem->count);
+		goto retry;
+	}
+
+	if ((wstate == WRITER_NOT_FIRST) && RWSEM_COUNT_HANDOFF(count))
 		return false;
 
-	new = count + RWSEM_WRITER_LOCKED -
-	     (list_is_singular(&sem->wait_list) ? RWSEM_FLAG_WAITERS : 0);
+	new = (count & ~RWSEM_FLAG_HANDOFF) + RWSEM_WRITER_LOCKED -
+	      (list_is_singular(&sem->wait_list) ? RWSEM_FLAG_WAITERS : 0);
 
 	if (atomic_long_try_cmpxchg_acquire(&sem->count, &count, new)) {
 		rwsem_set_owner(sem);
 		return true;
 	}
 
+	if (unlikely((wstate == WRITER_HANDOFF) && !RWSEM_COUNT_HANDOFF(count)))
+		goto retry;
+
 	return false;
 }
 
@@ -359,7 +425,7 @@ static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem)
 {
 	long count = atomic_long_read(&sem->count);
 
-	while (!RWSEM_COUNT_LOCKED(count)) {
+	while (!RWSEM_COUNT_LOCKED_OR_HANDOFF(count)) {
 		if (atomic_long_try_cmpxchg_acquire(&sem->count, &count,
 					count + RWSEM_WRITER_LOCKED)) {
 			rwsem_set_owner(sem);
@@ -498,6 +564,16 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 }
 #endif
 
+/*
+ * This is safe to be called without holding the wait_lock.
+ */
+static inline bool
+rwsem_waiter_is_first(struct rw_semaphore *sem, struct rwsem_waiter *waiter)
+{
+	return list_first_entry(&sem->wait_list, struct rwsem_waiter, list)
+			== waiter;
+}
+
 /*
  * Wait for the read lock to be granted
  */
@@ -510,16 +586,18 @@ __rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
 
 	waiter.task = current;
 	waiter.type = RWSEM_WAITING_FOR_READ;
+	waiter.timeout = jiffies + RWSEM_WAIT_TIMEOUT;
 
 	raw_spin_lock_irq(&sem->wait_lock);
 	if (list_empty(&sem->wait_list)) {
 		/*
 		 * In case the wait queue is empty and the lock isn't owned
-		 * by a writer, this reader can exit the slowpath and return
-		 * immediately as its RWSEM_READER_BIAS has already been
-		 * set in the count.
+		 * by a writer or has the handoff bit set, this reader can
+		 * exit the slowpath and return immediately as its
+		 * RWSEM_READER_BIAS has already been set in the count.
 		 */
-		if (!(atomic_long_read(&sem->count) & RWSEM_WRITER_MASK)) {
+		if (!(atomic_long_read(&sem->count) &
+		     (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
 			raw_spin_unlock_irq(&sem->wait_lock);
 			rwsem_set_reader_owned(sem);
 			lockevent_inc(rwsem_rlock_fast);
@@ -567,7 +645,8 @@ __rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
 out_nolock:
 	list_del(&waiter.list);
 	if (list_empty(&sem->wait_list))
-		atomic_long_andnot(RWSEM_FLAG_WAITERS, &sem->count);
+		atomic_long_andnot(RWSEM_FLAG_WAITERS|RWSEM_FLAG_HANDOFF,
+				   &sem->count);
 	raw_spin_unlock_irq(&sem->wait_lock);
 	__set_current_state(TASK_RUNNING);
 	lockevent_inc(rwsem_rlock_fail);
@@ -593,7 +672,7 @@ static inline struct rw_semaphore *
 __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
 {
 	long count;
-	bool waiting = true; /* any queued threads before us */
+	enum writer_wait_state wstate;
 	struct rwsem_waiter waiter;
 	struct rw_semaphore *ret = sem;
 	DEFINE_WAKE_Q(wake_q);
@@ -608,56 +687,63 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
 	 */
 	waiter.task = current;
 	waiter.type = RWSEM_WAITING_FOR_WRITE;
+	waiter.timeout = jiffies + RWSEM_WAIT_TIMEOUT;
 
 	raw_spin_lock_irq(&sem->wait_lock);
 
 	/* account for this before adding a new element to the list */
-	if (list_empty(&sem->wait_list))
-		waiting = false;
+	wstate = list_empty(&sem->wait_list) ? WRITER_FIRST : WRITER_NOT_FIRST;
 
 	list_add_tail(&waiter.list, &sem->wait_list);
 
 	/* we're now waiting on the lock */
-	if (waiting) {
+	if (wstate == WRITER_NOT_FIRST) {
 		count = atomic_long_read(&sem->count);
 
 		/*
-		 * If there were already threads queued before us and there are
-		 * no active writers and some readers, the lock must be read
-		 * owned; so we try to  any read locks that were queued ahead
-		 * of us.
+		 * If there were already threads queued before us and:
+		 *  1) there are no active locks, wake the front
+		 *     queued process(es) as the handoff bit might be set.
+		 *  2) there are no active writers and some readers, the lock
+		 *     must be read owned; so we try to wake any read lock
+		 *     waiters that were queued ahead of us.
 		 */
-		if (!(count & RWSEM_WRITER_MASK) &&
-		     (count & RWSEM_READER_MASK)) {
+		if (!RWSEM_COUNT_LOCKED(count))
+			__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
+		else if (!(count & RWSEM_WRITER_MASK) &&
+			  (count & RWSEM_READER_MASK))
 			__rwsem_mark_wake(sem, RWSEM_WAKE_READERS, &wake_q);
-			/*
-			 * The wakeup is normally called _after_ the wait_lock
-			 * is released, but given that we are proactively waking
-			 * readers we can deal with the wake_q overhead as it is
-			 * similar to releasing and taking the wait_lock again
-			 * for attempting rwsem_try_write_lock().
-			 */
-			wake_up_q(&wake_q);
+		else
+			goto wait;
 
-			/*
-			 * Reinitialize wake_q after use.
-			 */
-			wake_q_init(&wake_q);
-		}
+		/*
+		 * The wakeup is normally called _after_ the wait_lock
+		 * is released, but given that we are proactively waking
+		 * readers we can deal with the wake_q overhead as it is
+		 * similar to releasing and taking the wait_lock again
+		 * for attempting rwsem_try_write_lock().
+		 */
+		wake_up_q(&wake_q);
 
+		/*
+		 * Reinitialize wake_q after use.
+		 */
+		wake_q_init(&wake_q);
 	} else {
 		count = atomic_long_add_return(RWSEM_FLAG_WAITERS, &sem->count);
 	}
 
+wait:
 	/* wait until we successfully acquire the lock */
 	set_current_state(state);
 	while (true) {
-		if (rwsem_try_write_lock(count, sem))
+		if (rwsem_try_write_lock(count, sem, wstate))
 			break;
+
 		raw_spin_unlock_irq(&sem->wait_lock);
 
 		/* Block until there are no active lockers. */
-		do {
+		for (;;) {
 			if (signal_pending_state(state, current))
 				goto out_nolock;
 
@@ -665,9 +751,34 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
 			lockevent_inc(rwsem_sleep_writer);
 			set_current_state(state);
 			count = atomic_long_read(&sem->count);
-		} while (RWSEM_COUNT_LOCKED(count));
+
+			if ((wstate == WRITER_NOT_FIRST) &&
+			    rwsem_waiter_is_first(sem, &waiter))
+				wstate = WRITER_FIRST;
+
+			if (!RWSEM_COUNT_LOCKED(count))
+				break;
+
+			/*
+			 * An RT task sets the HANDOFF bit immediately.
+			 * Non-RT task will wait a while before doing so.
+			 *
+			 * The setting of the handoff bit is deferred
+			 * until rwsem_try_write_lock() is called.
+			 */
+			if ((wstate == WRITER_FIRST) && (rt_task(current) ||
+			    time_after(jiffies, waiter.timeout))) {
+				wstate = WRITER_HANDOFF;
+				lockevent_inc(rwsem_wlock_handoff);
+				/*
+				 * Break out to call rwsem_try_write_lock().
+				 */
+				break;
+			}
+		}
 
 		raw_spin_lock_irq(&sem->wait_lock);
+		count = atomic_long_read(&sem->count);
 	}
 	__set_current_state(TASK_RUNNING);
 	list_del(&waiter.list);
@@ -680,6 +791,12 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
 	__set_current_state(TASK_RUNNING);
 	raw_spin_lock_irq(&sem->wait_lock);
 	list_del(&waiter.list);
+	/*
+	 * If handoff bit has been set by this waiter, make sure that the
+	 * clearing of it is seen by others before proceeding.
+	 */
+	if (unlikely(wstate == WRITER_HANDOFF))
+		atomic_long_add_return(-RWSEM_FLAG_HANDOFF,  &sem->count);
 	if (list_empty(&sem->wait_list))
 		atomic_long_andnot(RWSEM_FLAG_WAITERS, &sem->count);
 	else
@@ -707,7 +824,7 @@ rwsem_down_write_failed_killable(struct rw_semaphore *sem)
  * handle waking up a waiter on the semaphore
  * - up_read/up_write has decremented the active part of count if we come here
  */
-static struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
+static struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem, long count)
 {
 	unsigned long flags;
 	DEFINE_WAKE_Q(wake_q);
@@ -839,7 +956,7 @@ inline void __up_read(struct rw_semaphore *sem)
 	tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
 	if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS))
 			== RWSEM_FLAG_WAITERS))
-		rwsem_wake(sem);
+		rwsem_wake(sem, tmp);
 }
 
 /*
@@ -847,11 +964,13 @@ inline void __up_read(struct rw_semaphore *sem)
  */
 static inline void __up_write(struct rw_semaphore *sem)
 {
+	long tmp;
+
 	DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
 	rwsem_clear_owner(sem);
-	if (unlikely(atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED,
-			&sem->count) & RWSEM_FLAG_WAITERS))
-		rwsem_wake(sem);
+	tmp = atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED, &sem->count);
+	if (unlikely(tmp & RWSEM_FLAG_WAITERS))
+		rwsem_wake(sem, tmp);
 }
 
 /*
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 112+ messages in thread

* [PATCH v4 08/16] locking/rwsem: Make rwsem_spin_on_owner() return owner state
  2019-04-13 17:22 [PATCH v4 00/16] locking/rwsem: Rwsem rearchitecture part 2 Waiman Long
                   ` (6 preceding siblings ...)
  2019-04-13 17:22 ` [PATCH v4 07/16] locking/rwsem: Implement lock handoff to prevent lock starvation Waiman Long
@ 2019-04-13 17:22 ` Waiman Long
  2019-04-17  9:00   ` Peter Zijlstra
                     ` (2 more replies)
  2019-04-13 17:22 ` [PATCH v4 09/16] locking/rwsem: Ensure an RT task will not spin on reader Waiman Long
                   ` (7 subsequent siblings)
  15 siblings, 3 replies; 112+ messages in thread
From: Waiman Long @ 2019-04-13 17:22 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner
  Cc: linux-kernel, x86, Davidlohr Bueso, Linus Torvalds, Tim Chen,
	huang ying, Waiman Long

This patch modifies rwsem_spin_on_owner() to return four possible
values to better reflect the state of the lock holder, which enables us
to make a better decision about what to do next.

In the special case that there is no active lock and the handoff bit
is set, optimistic spinning has to be stopped.
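
For readers following along, here is a minimal stand-alone user-space
model of how a caller consumes the four states. The enum values and the
OWNER_SPINNABLE mask mirror the patch below, while spin_on_owner_model()
is a made-up stand-in for rwsem_spin_on_owner() that just cycles through
canned states:

/* Illustrative user-space model; the real code is in kernel/locking/rwsem.c */
#include <stdio.h>

enum owner_state {
	OWNER_NULL		= 1 << 0,
	OWNER_WRITER		= 1 << 1,
	OWNER_READER		= 1 << 2,
	OWNER_NONSPINNABLE	= 1 << 3,
};
#define OWNER_SPINNABLE		(OWNER_NULL | OWNER_WRITER)

/* Hypothetical stand-in returning a canned sequence of owner states. */
static enum owner_state spin_on_owner_model(int step)
{
	static const enum owner_state states[] = {
		OWNER_WRITER, OWNER_NULL, OWNER_READER, OWNER_NONSPINNABLE,
	};

	return states[step & 3];
}

int main(void)
{
	for (int i = 0; ; i++) {
		enum owner_state s = spin_on_owner_model(i);

		if (!(s & OWNER_SPINNABLE)) {
			printf("step %d: state %d -> stop spinning\n", i, s);
			break;
		}
		printf("step %d: state %d -> keep spinning\n", i, s);
	}
	return 0;
}

The loop keeps "spinning" through writer and NULL owners and stops at the
first reader or non-spinnable state, which is the decision the patched
rwsem_optimistic_spin() makes with its OWNER_SPINNABLE test.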

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/rwsem.c | 45 +++++++++++++++++++++++++++++++++++-------
 1 file changed, 38 insertions(+), 7 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index aaab546a890d..2d6850c3e77b 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -156,6 +156,11 @@ static inline bool is_rwsem_owner_spinnable(struct task_struct *owner)
 	return !((unsigned long)owner & RWSEM_ANONYMOUSLY_OWNED);
 }
 
+static inline bool is_rwsem_owner_reader(struct task_struct *owner)
+{
+	return (unsigned long)owner & RWSEM_READER_OWNED;
+}
+
 /*
  * Return true if rwsem is owned by an anonymous writer or readers.
  */
@@ -466,14 +471,30 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
 }
 
 /*
- * Return true only if we can still spin on the owner field of the rwsem.
+ * Return the following 4 values depending on the lock owner state.
+ *   OWNER_NULL  : owner is currently NULL
+ *   OWNER_WRITER: when owner changes and is a writer
+ *   OWNER_READER: when owner changes and the new owner may be a reader.
+ *   OWNER_NONSPINNABLE:
+ *		   when optimistic spinning has to stop because either the
+ *		   owner stops running, is unknown, or its timeslice has
+ *		   been used up.
  */
-static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem)
+enum owner_state {
+	OWNER_NULL		= 1 << 0,
+	OWNER_WRITER		= 1 << 1,
+	OWNER_READER		= 1 << 2,
+	OWNER_NONSPINNABLE	= 1 << 3,
+};
+#define OWNER_SPINNABLE		(OWNER_NULL | OWNER_WRITER)
+
+static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
 {
 	struct task_struct *owner = READ_ONCE(sem->owner);
+	long count;
 
 	if (!is_rwsem_owner_spinnable(owner))
-		return false;
+		return OWNER_NONSPINNABLE;
 
 	rcu_read_lock();
 	while (owner && (READ_ONCE(sem->owner) == owner)) {
@@ -491,7 +512,7 @@ static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem)
 		 */
 		if (need_resched() || !owner_on_cpu(owner)) {
 			rcu_read_unlock();
-			return false;
+			return OWNER_NONSPINNABLE;
 		}
 
 		cpu_relax();
@@ -500,9 +521,19 @@ static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem)
 
 	/*
 	 * If there is a new owner or the owner is not set, we continue
-	 * spinning.
+	 * spinning except when there are no active locks and the handoff bit
+	 * is set. In this case, we have to stop spinning.
 	 */
-	return is_rwsem_owner_spinnable(READ_ONCE(sem->owner));
+	owner = READ_ONCE(sem->owner);
+	if (!is_rwsem_owner_spinnable(owner))
+		return OWNER_NONSPINNABLE;
+	if (owner && !is_rwsem_owner_reader(owner))
+		return OWNER_WRITER;
+
+	count = atomic_long_read(&sem->count);
+	if (RWSEM_COUNT_HANDOFF(count) && !RWSEM_COUNT_LOCKED(count))
+		return OWNER_NONSPINNABLE;
+	return !owner ? OWNER_NULL : OWNER_READER;
 }
 
 static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
@@ -525,7 +556,7 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 	 *  2) readers own the lock as we can't determine if they are
 	 *     actively running or not.
 	 */
-	while (rwsem_spin_on_owner(sem)) {
+	while (rwsem_spin_on_owner(sem) & OWNER_SPINNABLE) {
 		/*
 		 * Try to acquire the lock
 		 */
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 112+ messages in thread

* [PATCH v4 09/16] locking/rwsem: Ensure an RT task will not spin on reader
  2019-04-13 17:22 [PATCH v4 00/16] locking/rwsem: Rwsem rearchitecture part 2 Waiman Long
                   ` (7 preceding siblings ...)
  2019-04-13 17:22 ` [PATCH v4 08/16] locking/rwsem: Make rwsem_spin_on_owner() return owner state Waiman Long
@ 2019-04-13 17:22 ` Waiman Long
  2019-04-17 13:18   ` Peter Zijlstra
  2019-04-13 17:22 ` [PATCH v4 10/16] locking/rwsem: Wake up almost all readers in wait queue Waiman Long
                   ` (6 subsequent siblings)
  15 siblings, 1 reply; 112+ messages in thread
From: Waiman Long @ 2019-04-13 17:22 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner
  Cc: linux-kernel, x86, Davidlohr Bueso, Linus Torvalds, Tim Chen,
	huang ying, Waiman Long

An RT task can do optimistic spinning only if the lock holder is
actually running. If the state of the lock holder isn't known, there
is a possibility that the high priority of the RT task may block forward
progress of the lock holder if the two happen to reside on the same CPU.
This will lead to a deadlock. So we have to make sure that an RT task
will not spin on a reader-owned rwsem.

When the owner is temporarily set to NULL, it is trickier to decide if
an RT task should stop spinning, as this may be a transient state where
another writer has just stolen the lock, causing the task's trylock
attempt to fail. So one more retry is allowed to make sure that the lock
really isn't spinnable before the RT task gives up.
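
The retry rule can be condensed as: keep spinning while the owner is a
spinnable writer, and grant exactly one extra iteration when the owner
turns out to be NULL or a reader. Below is a minimal sketch of that
decision, reusing the owner_state enum introduced in the previous patch;
rt_should_stop_spinning() is an illustrative helper, not a function in
the patch, and the need_resched() check and enclosing spin loop are
elided:

/* Illustrative helper: should an RT task stop spinning this iteration? */
static bool rt_should_stop_spinning(enum owner_state owner_state,
				    enum owner_state prev_owner_state)
{
	/* A running writer owner is already tracked in rwsem_spin_on_owner(). */
	if (owner_state == OWNER_WRITER)
		return false;

	/*
	 * Owner is NULL or a reader: allow one more iteration only if the
	 * previous pass still saw a writer; otherwise the RT task gives up.
	 */
	return prev_owner_state != OWNER_WRITER;
}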

When testing on an 8-socket IvyBridge-EX system, the one additional retry
seems to improve locking performance of RT write locking threads under
heavy contentions. The table below shows the locking rates (in kops/s)
with various write locking threads before and after the patch.

    Locking threads     Pre-patch     Post-patch
    ---------------     ---------     -----------
            4             2,753          2,608
            8             2,529          2,520
           16             1,727          1,918
           32             1,263          1,956
           64               889          1,343

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/rwsem.c | 36 +++++++++++++++++++++++++++++-------
 1 file changed, 29 insertions(+), 7 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 2d6850c3e77b..8e19b5141595 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -539,6 +539,8 @@ static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
 static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 {
 	bool taken = false;
+	bool is_rt_task = rt_task(current);
+	int prev_owner_state = OWNER_NULL;
 
 	preempt_disable();
 
@@ -556,7 +558,12 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 	 *  2) readers own the lock as we can't determine if they are
 	 *     actively running or not.
 	 */
-	while (rwsem_spin_on_owner(sem) & OWNER_SPINNABLE) {
+	for (;;) {
+		enum owner_state owner_state = rwsem_spin_on_owner(sem);
+
+		if (!(owner_state & OWNER_SPINNABLE))
+			break;
+
 		/*
 		 * Try to acquire the lock
 		 */
@@ -566,13 +573,28 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 		}
 
 		/*
-		 * When there's no owner, we might have preempted between the
-		 * owner acquiring the lock and setting the owner field. If
-		 * we're an RT task that will live-lock because we won't let
-		 * the owner complete.
+		 * An RT task cannot do optimistic spinning if it cannot
+		 * be sure the lock holder is running or live-lock may
+		 * happen if the current task and the lock holder happen
+		 * to run in the same CPU.
+		 *
+		 * When there's no owner or the lock is reader-owned, an RT
+		 * task will stop spinning if the owner state was not a
+		 * writer in the previous iteration of the loop. This allows
+		 * the RT task to recheck whether the task that stole the
+		 * lock is a spinnable writer. If so, it can keep on spinning.
+		 *
+		 * If the owner is a writer, the need_resched() check is
+		 * done inside rwsem_spin_on_owner(). If the owner is not
+		 * a writer, need_resched() check needs to be done here.
 		 */
-		if (!sem->owner && (need_resched() || rt_task(current)))
-			break;
+		if (owner_state != OWNER_WRITER) {
+			if (need_resched())
+				break;
+			if (is_rt_task && (prev_owner_state != OWNER_WRITER))
+				break;
+		}
+		prev_owner_state = owner_state;
 
 		/*
 		 * The cpu_relax() call is a compiler barrier which forces
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 112+ messages in thread

* [PATCH v4 10/16] locking/rwsem: Wake up almost all readers in wait queue
  2019-04-13 17:22 [PATCH v4 00/16] locking/rwsem: Rwsem rearchitecture part 2 Waiman Long
                   ` (8 preceding siblings ...)
  2019-04-13 17:22 ` [PATCH v4 09/16] locking/rwsem: Ensure an RT task will not spin on reader Waiman Long
@ 2019-04-13 17:22 ` Waiman Long
  2019-04-16 16:50   ` Davidlohr Bueso
  2019-04-17 13:39   ` Peter Zijlstra
  2019-04-13 17:22 ` [PATCH v4 11/16] locking/rwsem: Enable readers spinning on writer Waiman Long
                   ` (5 subsequent siblings)
  15 siblings, 2 replies; 112+ messages in thread
From: Waiman Long @ 2019-04-13 17:22 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner
  Cc: linux-kernel, x86, Davidlohr Bueso, Linus Torvalds, Tim Chen,
	huang ying, Waiman Long

When the front of the wait queue is a reader, other readers
immediately following the first reader will also be woken up at the
same time. However, if there is a writer in between, those readers
behind the writer will not be woken up.

Because of optimistic spinning, the lock acquisition order is not FIFO
anyway. The lock handoff mechanism will ensure that lock starvation
will not happen.

Assuming that the lock hold times of the other readers still in the
queue will be about the same as the readers that are being woken up,
there is really not much additional cost other than the additional
latency due to the wakeup of additional tasks by the waker. Therefore
all the readers up to a maximum of 256 in the queue are woken up when
the first waiter is a reader to improve reader throughput.
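
A small stand-alone model of the behavioural change, contrasting the old
rule (stop at the first queued writer) with the new one (skip writers and
cap the number of woken readers); the queue contents are made up purely
for illustration:

/* Toy model of the old vs. new reader wakeup rules; not kernel code. */
#include <stdio.h>

#define MAX_READERS_WAKEUP	0x100

enum waiter_type { READER, WRITER };

static int woken_old(const enum waiter_type *q, int n)
{
	int woken = 0;

	for (int i = 0; i < n; i++) {
		if (q[i] == WRITER)
			break;		/* old rule: stop at first writer */
		woken++;
	}
	return woken;
}

static int woken_new(const enum waiter_type *q, int n)
{
	int woken = 0;

	for (int i = 0; i < n; i++) {
		if (q[i] == WRITER)
			continue;	/* new rule: skip writers */
		if (++woken >= MAX_READERS_WAKEUP)
			break;		/* cap the waker's work */
	}
	return woken;
}

int main(void)
{
	enum waiter_type q[] = { READER, READER, WRITER, READER, READER };
	int n = (int)(sizeof(q) / sizeof(q[0]));

	/* Prints "old rule wakes 2, new rule wakes 4". */
	printf("old rule wakes %d, new rule wakes %d\n",
	       woken_old(q, n), woken_new(q, n));
	return 0;
}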

With a locking microbenchmark running on 5.1 based kernel, the total
locking rates (in kops/s) on an 8-socket IvyBridge-EX system with
equal numbers of readers and writers before and after this patch were
as follows:

   # of Threads  Pre-Patch   Post-patch
   ------------  ---------   ----------
        4          1,641        1,674
        8            731        1,062
       16            564          924
       32             78          300
       64             38          195
      240             50          149

There is no performance gain at low contention level. At high contention
level, however, this patch gives a pretty decent performance boost.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/rwsem.c | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 8e19b5141595..cf0a90d251aa 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -260,6 +260,13 @@ enum writer_wait_state {
  */
 #define RWSEM_WAIT_TIMEOUT	(HZ/250)
 
+/*
+ * We limit the maximum number of readers that can be woken up for a
+ * wake-up call so as not to penalize the waking thread for spending too
+ * much time doing it.
+ */
+#define MAX_READERS_WAKEUP	0x100
+
 /*
  * handle the lock release when processes blocked on it that can now run
  * - if we come here from up_xxxx(), then the RWSEM_FLAG_WAITERS bit must
@@ -332,16 +339,16 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
 	}
 
 	/*
-	 * Grant an infinite number of read locks to the readers at the front
-	 * of the queue. We know that woken will be at least 1 as we accounted
-	 * for above. Note we increment the 'active part' of the count by the
+	 * Grant up to MAX_READERS_WAKEUP read locks to all the readers in the
+	 * queue. We know that woken will be at least 1 as we accounted for
+	 * above. Note we increment the 'active part' of the count by the
 	 * number of readers before waking any processes up.
 	 */
 	list_for_each_entry_safe(waiter, tmp, &sem->wait_list, list) {
 		struct task_struct *tsk;
 
 		if (waiter->type == RWSEM_WAITING_FOR_WRITE)
-			break;
+			continue;
 
 		woken++;
 		tsk = waiter->task;
@@ -360,6 +367,12 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
 		 * after setting the reader waiter to nil.
 		 */
 		wake_q_add_safe(wake_q, tsk);
+
+		/*
+		 * Limit # of readers that can be woken up per wakeup call.
+		 */
+		if (woken >= MAX_READERS_WAKEUP)
+			break;
 	}
 
 	adjustment = woken * RWSEM_READER_BIAS - adjustment;
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 112+ messages in thread

* [PATCH v4 11/16] locking/rwsem: Enable readers spinning on writer
  2019-04-13 17:22 [PATCH v4 00/16] locking/rwsem: Rwsem rearchitecture part 2 Waiman Long
                   ` (9 preceding siblings ...)
  2019-04-13 17:22 ` [PATCH v4 10/16] locking/rwsem: Wake up almost all readers in wait queue Waiman Long
@ 2019-04-13 17:22 ` Waiman Long
  2019-04-17 13:56   ` Peter Zijlstra
                     ` (2 more replies)
  2019-04-13 17:22 ` [PATCH v4 12/16] locking/rwsem: Enable time-based spinning on reader-owned rwsem Waiman Long
                   ` (4 subsequent siblings)
  15 siblings, 3 replies; 112+ messages in thread
From: Waiman Long @ 2019-04-13 17:22 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner
  Cc: linux-kernel, x86, Davidlohr Bueso, Linus Torvalds, Tim Chen,
	huang ying, Waiman Long

This patch enables readers to optimistically spin on a
rwsem when it is owned by a writer instead of going to sleep
directly.  The rwsem_can_spin_on_owner() function is extracted
out of rwsem_optimistic_spin() and is called directly by
__rwsem_down_read_failed_common() and __rwsem_down_write_failed_common().
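
In outline, the reader slowpath now undoes the RWSEM_READER_BIAS added by
the fastpath, tries optimistic spinning, and only joins the wait queue if
spinning fails. The following is a condensed sketch of that control flow,
distilled from the patch hunk below (the opportunistic wakeup of already
queued readers is elided):

	if (!rwsem_can_spin_on_owner(sem))
		goto queue;

	/* Undo the read bias from the down_read() fastpath, then spin. */
	atomic_long_add(-RWSEM_READER_BIAS, &sem->count);
	adjustment = 0;
	if (rwsem_optimistic_spin(sem, false))
		return sem;		/* read lock acquired by spinning */

queue:
	/* fall back to the regular wait-queue slowpath */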

With a locking microbenchmark running on 5.1 based kernel, the total
locking rates (in kops/s) on an 8-socket IvyBridge-EX system with equal
numbers of readers and writers before and after the patch were as
follows:

   # of Threads  Pre-patch    Post-patch
   ------------  ---------    ----------
        4          1,674        1,684
        8          1,062        1,074
       16            924          900
       32            300          458
       64            195          208
      128            164          168
      240            149          143

The performance change wasn't significant in this case, but this change
is required by a follow-on patch.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/lock_events_list.h |  1 +
 kernel/locking/rwsem.c            | 91 +++++++++++++++++++++++++++----
 2 files changed, 80 insertions(+), 12 deletions(-)

diff --git a/kernel/locking/lock_events_list.h b/kernel/locking/lock_events_list.h
index 29e5c52197fa..333ed5fda333 100644
--- a/kernel/locking/lock_events_list.h
+++ b/kernel/locking/lock_events_list.h
@@ -56,6 +56,7 @@ LOCK_EVENT(rwsem_sleep_reader)	/* # of reader sleeps			*/
 LOCK_EVENT(rwsem_sleep_writer)	/* # of writer sleeps			*/
 LOCK_EVENT(rwsem_wake_reader)	/* # of reader wakeups			*/
 LOCK_EVENT(rwsem_wake_writer)	/* # of writer wakeups			*/
+LOCK_EVENT(rwsem_opt_rlock)	/* # of read locks opt-spin acquired	*/
 LOCK_EVENT(rwsem_opt_wlock)	/* # of write locks opt-spin acquired	*/
 LOCK_EVENT(rwsem_opt_fail)	/* # of failed opt-spinnings		*/
 LOCK_EVENT(rwsem_rlock)		/* # of read locks acquired		*/
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index cf0a90d251aa..3cf8355252d1 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -103,9 +103,12 @@
 				 RWSEM_FLAG_HANDOFF)
 
 #define RWSEM_COUNT_LOCKED(c)	((c) & RWSEM_LOCK_MASK)
+#define RWSEM_COUNT_WLOCKED(c)	((c) & RWSEM_WRITER_MASK)
 #define RWSEM_COUNT_HANDOFF(c)	((c) & RWSEM_FLAG_HANDOFF)
 #define RWSEM_COUNT_LOCKED_OR_HANDOFF(c)	\
 	((c) & (RWSEM_LOCK_MASK|RWSEM_FLAG_HANDOFF))
+#define RWSEM_COUNT_WLOCKED_OR_HANDOFF(c)	\
+	((c) & (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))
 
 /*
  * All writes to owner are protected by WRITE_ONCE() to make sure that
@@ -436,6 +439,30 @@ static inline bool rwsem_try_write_lock(long count, struct rw_semaphore *sem,
 }
 
 #ifdef CONFIG_RWSEM_SPIN_ON_OWNER
+/*
+ * Try to acquire read lock before the reader is put on wait queue.
+ * Lock acquisition isn't allowed if the rwsem is write-locked or a
+ * writer handoff is ongoing.
+ */
+static inline bool rwsem_try_read_lock_unqueued(struct rw_semaphore *sem)
+{
+	long count = atomic_long_read(&sem->count);
+
+	if (RWSEM_COUNT_WLOCKED_OR_HANDOFF(count))
+		return false;
+
+	count = atomic_long_fetch_add_acquire(RWSEM_READER_BIAS, &sem->count);
+	if (!RWSEM_COUNT_WLOCKED_OR_HANDOFF(count)) {
+		rwsem_set_reader_owned(sem);
+		lockevent_inc(rwsem_opt_rlock);
+		return true;
+	}
+
+	/* Back out the change */
+	atomic_long_add(-RWSEM_READER_BIAS, &sem->count);
+	return false;
+}
+
 /*
  * Try to acquire write lock before the writer has been put on wait queue.
  */
@@ -470,9 +497,12 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
 
 	BUILD_BUG_ON(!rwsem_has_anonymous_owner(RWSEM_OWNER_UNKNOWN));
 
-	if (need_resched())
+	if (need_resched()) {
+		lockevent_inc(rwsem_opt_fail);
 		return false;
+	}
 
+	preempt_disable();
 	rcu_read_lock();
 	owner = READ_ONCE(sem->owner);
 	if (owner) {
@@ -480,6 +510,9 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
 		      owner_on_cpu(owner);
 	}
 	rcu_read_unlock();
+	preempt_enable();
+
+	lockevent_cond_inc(rwsem_opt_fail, !ret);
 	return ret;
 }
 
@@ -549,7 +582,7 @@ static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
 	return !owner ? OWNER_NULL : OWNER_READER;
 }
 
-static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
+static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
 {
 	bool taken = false;
 	bool is_rt_task = rt_task(current);
@@ -558,9 +591,6 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 	preempt_disable();
 
 	/* sem->wait_lock should not be held when doing optimistic spinning */
-	if (!rwsem_can_spin_on_owner(sem))
-		goto done;
-
 	if (!osq_lock(&sem->osq))
 		goto done;
 
@@ -580,10 +610,11 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 		/*
 		 * Try to acquire the lock
 		 */
-		if (rwsem_try_write_lock_unqueued(sem)) {
-			taken = true;
+		taken = wlock ? rwsem_try_write_lock_unqueued(sem)
+			      : rwsem_try_read_lock_unqueued(sem);
+
+		if (taken)
 			break;
-		}
 
 		/*
 		 * An RT task cannot do optimistic spinning if it cannot
@@ -624,7 +655,12 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 	return taken;
 }
 #else
-static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
+static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
+{
+	return false;
+}
+
+static inline bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
 {
 	return false;
 }
@@ -650,6 +686,33 @@ __rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
 	struct rwsem_waiter waiter;
 	DEFINE_WAKE_Q(wake_q);
 
+	if (!rwsem_can_spin_on_owner(sem))
+		goto queue;
+
+	/*
+	 * Undo read bias from down_read() and do optimistic spinning.
+	 */
+	atomic_long_add(-RWSEM_READER_BIAS, &sem->count);
+	adjustment = 0;
+	if (rwsem_optimistic_spin(sem, false)) {
+		unsigned long flags;
+
+		/*
+		 * Opportunistically wake up other readers in the wait queue.
+		 * It has another chance of wakeup at unlock time.
+		 */
+		if ((atomic_long_read(&sem->count) & RWSEM_FLAG_WAITERS) &&
+		    raw_spin_trylock_irqsave(&sem->wait_lock, flags)) {
+			if (!list_empty(&sem->wait_list))
+				__rwsem_mark_wake(sem, RWSEM_WAKE_READ_OWNED,
+						  &wake_q);
+			raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
+			wake_up_q(&wake_q);
+		}
+		return sem;
+	}
+
+queue:
 	waiter.task = current;
 	waiter.type = RWSEM_WAITING_FOR_READ;
 	waiter.timeout = jiffies + RWSEM_WAIT_TIMEOUT;
@@ -662,7 +725,7 @@ __rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
 		 * exit the slowpath and return immediately as its
 		 * RWSEM_READER_BIAS has already been set in the count.
 		 */
-		if (!(atomic_long_read(&sem->count) &
+		if (adjustment && !(atomic_long_read(&sem->count) &
 		     (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
 			raw_spin_unlock_irq(&sem->wait_lock);
 			rwsem_set_reader_owned(sem);
@@ -674,7 +737,10 @@ __rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
 	list_add_tail(&waiter.list, &sem->wait_list);
 
 	/* we're now waiting on the lock, but no longer actively locking */
-	count = atomic_long_add_return(adjustment, &sem->count);
+	if (adjustment)
+		count = atomic_long_add_return(adjustment, &sem->count);
+	else
+		count = atomic_long_read(&sem->count);
 
 	/*
 	 * If there are no active locks, wake the front queued process(es).
@@ -744,7 +810,8 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
 	DEFINE_WAKE_Q(wake_q);
 
 	/* do optimistic spinning and steal lock if possible */
-	if (rwsem_optimistic_spin(sem))
+	if (rwsem_can_spin_on_owner(sem) &&
+	    rwsem_optimistic_spin(sem, true))
 		return sem;
 
 	/*
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 112+ messages in thread

* [PATCH v4 12/16] locking/rwsem: Enable time-based spinning on reader-owned rwsem
  2019-04-13 17:22 [PATCH v4 00/16] locking/rwsem: Rwsem rearchitecture part 2 Waiman Long
                   ` (10 preceding siblings ...)
  2019-04-13 17:22 ` [PATCH v4 11/16] locking/rwsem: Enable readers spinning on writer Waiman Long
@ 2019-04-13 17:22 ` Waiman Long
  2019-04-18 13:06   ` Peter Zijlstra
  2019-04-13 17:22 ` [PATCH v4 13/16] locking/rwsem: Add more rwsem owner access helpers Waiman Long
                   ` (3 subsequent siblings)
  15 siblings, 1 reply; 112+ messages in thread
From: Waiman Long @ 2019-04-13 17:22 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner
  Cc: linux-kernel, x86, Davidlohr Bueso, Linus Torvalds, Tim Chen,
	huang ying, Waiman Long

When the rwsem is owned by readers, writers stop optimistic spinning
simply because there is no easy way to figure out if all the readers
are actively running or not. However, there are scenarios where
the readers are unlikely to sleep and optimistic spinning can help
performance.

This patch provides a simple mechanism for spinning on a reader-owned
rwsem by a writer. It is a time threshold based spinning where the
allowable spinning time can vary from 10us to 25us depending on the
condition of the rwsem.

When the time threshold is exceeded, a bit will be set in the owner field
to indicate that no more optimistic spinning will be allowed on this
rwsem until it becomes writer owned again. For fairness, not even readers
are allowed to acquire the reader-locked rwsem by optimistic spinning.
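
As a concrete illustration of the threshold formula used by
rwsem_rspin_threshold() later in the patch (a 10us base plus 0.5us per
reader, capped at 25us, when waiters are queued; a flat 25us otherwise),
here is a small stand-alone model of the budget calculation that the
kernel code performs against sched_clock():

/* Stand-alone model of the writer's spin-time budget; values in ns. */
#include <stdio.h>

#define NSEC_PER_USEC	1000ULL

static unsigned long long rspin_budget_ns(int nr_readers, int waiters_queued)
{
	if (!waiters_queued)
		return 25 * NSEC_PER_USEC;	/* new readers may still join */

	if (nr_readers > 30)
		nr_readers = 30;		/* caps the budget at 25us */
	return 10 * NSEC_PER_USEC + nr_readers * NSEC_PER_USEC / 2;
}

int main(void)
{
	printf("4 readers, waiters queued:  %llu ns\n", rspin_budget_ns(4, 1));	/* 12000 */
	printf("40 readers, waiters queued: %llu ns\n", rspin_budget_ns(40, 1));	/* 25000 */
	printf("no waiters queued:          %llu ns\n", rspin_budget_ns(8, 0));	/* 25000 */
	return 0;
}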

The time taken for each iteration of the reader-owned rwsem spinning
loop varies. Below are sample minimum elapsed times for 16 iterations
of the loop.

      System                 Time for 16 Iterations
      ------                 ----------------------
  1-socket Skylake                  ~800ns
  4-socket Broadwell                ~300ns
  2-socket ThunderX2 (arm64)        ~250ns

When the lock cacheline is contended, we can see up to almost 10X
increase in elapsed time.  So 25us will be at most 500, 1300 and 1600
iterations for each of the above systems.

With a locking microbenchmark running on 5.1 based kernel, the total
locking rates (in kops/s) on an 8-socket IvyBridge-EX system with
equal numbers of readers and writers before and after this patch were
as follows:

   # of Threads  Pre-patch    Post-patch
   ------------  ---------    ----------
        2          1,759        6,684
        4          1,684        6,738
        8          1,074        7,222
       16            900        7,163
       32            458        7,316
       64            208          520
      128            168          425
      240            143          474

This patch gives a big boost in performance for mixed reader/writer
workloads.

With 32 locking threads, the rwsem lock event data were:

rwsem_opt_fail=79850
rwsem_opt_nospin=5069
rwsem_opt_rlock=597484
rwsem_opt_wlock=957339
rwsem_sleep_reader=57782
rwsem_sleep_writer=55663

With 64 locking threads, the data looked like:

rwsem_opt_fail=346723
rwsem_opt_nospin=6293
rwsem_opt_rlock=1127119
rwsem_opt_wlock=1400628
rwsem_sleep_reader=308201
rwsem_sleep_writer=72281

So a lot more threads acquired the lock in the slowpath and more threads
went to sleep.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/lock_events_list.h |   1 +
 kernel/locking/rwsem.c            | 121 ++++++++++++++++++++++++++----
 2 files changed, 107 insertions(+), 15 deletions(-)

diff --git a/kernel/locking/lock_events_list.h b/kernel/locking/lock_events_list.h
index 333ed5fda333..f3550aa5866a 100644
--- a/kernel/locking/lock_events_list.h
+++ b/kernel/locking/lock_events_list.h
@@ -59,6 +59,7 @@ LOCK_EVENT(rwsem_wake_writer)	/* # of writer wakeups			*/
 LOCK_EVENT(rwsem_opt_rlock)	/* # of read locks opt-spin acquired	*/
 LOCK_EVENT(rwsem_opt_wlock)	/* # of write locks opt-spin acquired	*/
 LOCK_EVENT(rwsem_opt_fail)	/* # of failed opt-spinnings		*/
+LOCK_EVENT(rwsem_opt_nospin)	/* # of disabled reader opt-spinnings	*/
 LOCK_EVENT(rwsem_rlock)		/* # of read locks acquired		*/
 LOCK_EVENT(rwsem_rlock_fast)	/* # of fast read locks acquired	*/
 LOCK_EVENT(rwsem_rlock_fail)	/* # of failed read lock acquisitions	*/
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 3cf8355252d1..8b23009e6b2c 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -22,6 +22,7 @@
 #include <linux/sched/debug.h>
 #include <linux/sched/wake_q.h>
 #include <linux/sched/signal.h>
+#include <linux/sched/clock.h>
 #include <linux/export.h>
 #include <linux/rwsem.h>
 #include <linux/atomic.h>
@@ -35,18 +36,20 @@
  *  - RWSEM_READER_OWNED (bit 0): The rwsem is owned by readers
  *  - RWSEM_ANONYMOUSLY_OWNED (bit 1): The rwsem is anonymously owned,
  *    i.e. the owner(s) cannot be readily determined. It can be reader
- *    owned or the owning writer is indeterminate.
+ *    owned or the owning writer is indeterminate. Optimistic spinning
+ *    should be disabled if this flag is set.
  *
  * When a writer acquires a rwsem, it puts its task_struct pointer
- * into the owner field. It is cleared after an unlock.
+ * into the owner field or the count itself (64-bit only). It should
+ * be cleared after an unlock.
  *
  * When a reader acquires a rwsem, it will also puts its task_struct
- * pointer into the owner field with both the RWSEM_READER_OWNED and
- * RWSEM_ANONYMOUSLY_OWNED bits set. On unlock, the owner field will
- * largely be left untouched. So for a free or reader-owned rwsem,
- * the owner value may contain information about the last reader that
- * acquires the rwsem. The anonymous bit is set because that particular
- * reader may or may not still own the lock.
+ * pointer into the owner field with the RWSEM_READER_OWNED bit set.
+ * On unlock, the owner field will largely be left untouched. So
+ * for a free or reader-owned rwsem, the owner value may contain
+ * information about the last reader that acquires the rwsem. The
+ * anonymous bit may also be set to permanently disable optimistic
+ * spinning on a reader-owned rwsem until a writer comes along.
  *
  * That information may be helpful in debugging cases where the system
  * seems to hang on a reader owned rwsem especially if only one reader
@@ -138,8 +141,7 @@ static inline void rwsem_clear_owner(struct rw_semaphore *sem)
 static inline void __rwsem_set_reader_owned(struct rw_semaphore *sem,
 					    struct task_struct *owner)
 {
-	unsigned long val = (unsigned long)owner | RWSEM_READER_OWNED
-						 | RWSEM_ANONYMOUSLY_OWNED;
+	unsigned long val = (unsigned long)owner | RWSEM_READER_OWNED;
 
 	WRITE_ONCE(sem->owner, (struct task_struct *)val);
 }
@@ -164,6 +166,14 @@ static inline bool is_rwsem_owner_reader(struct task_struct *owner)
 	return (unsigned long)owner & RWSEM_READER_OWNED;
 }
 
+/*
+ * Return true if the rwsem is spinnable.
+ */
+static inline bool is_rwsem_spinnable(struct rw_semaphore *sem)
+{
+	return is_rwsem_owner_spinnable(READ_ONCE(sem->owner));
+}
+
 /*
  * Return true if rwsem is owned by an anonymous writer or readers.
  */
@@ -193,6 +203,22 @@ static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
 }
 #endif
 
+/*
+ * Set the RWSEM_ANONYMOUSLY_OWNED flag if the RWSEM_READER_OWNED flag
+ * remains set. Otherwise, the operation will be aborted.
+ */
+static inline void rwsem_set_nonspinnable(struct rw_semaphore *sem)
+{
+	long owner = (long)READ_ONCE(sem->owner);
+
+	while (is_rwsem_owner_reader((struct task_struct *)owner)) {
+		if (!is_rwsem_owner_spinnable((struct task_struct *)owner))
+			break;
+		owner = cmpxchg((long *)&sem->owner, owner,
+				owner | RWSEM_ANONYMOUSLY_OWNED);
+	}
+}
+
 /*
  * Guide to the rw_semaphore's count field.
  *
@@ -507,7 +533,7 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
 	owner = READ_ONCE(sem->owner);
 	if (owner) {
 		ret = is_rwsem_owner_spinnable(owner) &&
-		      owner_on_cpu(owner);
+		     (is_rwsem_owner_reader(owner) || owner_on_cpu(owner));
 	}
 	rcu_read_unlock();
 	preempt_enable();
@@ -532,7 +558,7 @@ enum owner_state {
 	OWNER_READER		= 1 << 2,
 	OWNER_NONSPINNABLE	= 1 << 3,
 };
-#define OWNER_SPINNABLE		(OWNER_NULL | OWNER_WRITER)
+#define OWNER_SPINNABLE		(OWNER_NULL | OWNER_WRITER | OWNER_READER)
 
 static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
 {
@@ -543,7 +569,8 @@ static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
 		return OWNER_NONSPINNABLE;
 
 	rcu_read_lock();
-	while (owner && (READ_ONCE(sem->owner) == owner)) {
+	while (owner && !is_rwsem_owner_reader(owner)
+		     && (READ_ONCE(sem->owner) == owner)) {
 		/*
 		 * Ensure we emit the owner->on_cpu, dereference _after_
 		 * checking sem->owner still matches owner, if that fails,
@@ -582,11 +609,47 @@ static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
 	return !owner ? OWNER_NULL : OWNER_READER;
 }
 
+/*
+ * Calculate reader-owned rwsem spinning threshold for writer
+ *
+ * It is assumed that the more readers own the rwsem, the longer it will
+ * take for them to wind down and free the rwsem. So the formula to
+ * determine the actual spinning time limit is:
+ *
+ * 1) RWSEM_FLAG_WAITERS set
+ *    Spinning threshold = (10 + nr_readers/2)us
+ *
+ * 2) RWSEM_FLAG_WAITERS not set
+ *    Spinning threshold = 25us
+ *
+ * In the first case when RWSEM_FLAG_WAITERS is set, no new reader can
+ * become rwsem owner. It is assumed that the more readers own the rwsem,
+ * the longer it will take for them to wind down and free the rwsem. This
+ * is subject to a maximum value of 25us.
+ *
+ * In the second case with RWSEM_FLAG_WAITERS off, new readers can join
+ * and become one of the owners. So assuming for the worst case and spin
+ * for at most 25us.
+ */
+static inline u64 rwsem_rspin_threshold(struct rw_semaphore *sem)
+{
+	long count = atomic_long_read(&sem->count);
+	int reader_cnt = atomic_long_read(&sem->count) >> RWSEM_READER_SHIFT;
+
+	if (reader_cnt > 30)
+		reader_cnt = 30;
+	return sched_clock() + ((count & RWSEM_FLAG_WAITERS)
+		? 10 * NSEC_PER_USEC + reader_cnt * NSEC_PER_USEC/2
+		: 25 * NSEC_PER_USEC);
+}
+
 static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
 {
 	bool taken = false;
 	bool is_rt_task = rt_task(current);
 	int prev_owner_state = OWNER_NULL;
+	int loop = 0;
+	u64 rspin_threshold = 0;
 
 	preempt_disable();
 
@@ -598,8 +661,7 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
 	 * Optimistically spin on the owner field and attempt to acquire the
 	 * lock whenever the owner changes. Spinning will be stopped when:
 	 *  1) the owning writer isn't running; or
-	 *  2) readers own the lock as we can't determine if they are
-	 *     actively running or not.
+	 *  2) readers own the lock and spinning count has reached 0.
 	 */
 	for (;;) {
 		enum owner_state owner_state = rwsem_spin_on_owner(sem);
@@ -616,6 +678,35 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
 		if (taken)
 			break;
 
+		/*
+		 * Time-based reader-owned rwsem optimistic spinning
+		 */
+		if (wlock && (owner_state == OWNER_READER)) {
+			/*
+			 * Initialize rspin_threshold when the owner
+			 * state changes from non-reader to reader.
+			 */
+			if (prev_owner_state != OWNER_READER) {
+				if (!is_rwsem_spinnable(sem))
+					break;
+				rspin_threshold = rwsem_rspin_threshold(sem);
+				loop = 0;
+			}
+
+			/*
+			 * Check time threshold every 16 iterations to
+			 * avoid calling sched_clock() too frequently.
+			 * This will make the actual spinning time a
+			 * bit more than that specified in the threshold.
+			 */
+			else if (!(++loop & 0xf) &&
+				 (sched_clock() > rspin_threshold)) {
+				rwsem_set_nonspinnable(sem);
+				lockevent_inc(rwsem_opt_nospin);
+				break;
+			}
+		}
+
 		/*
 		 * An RT task cannot do optimistic spinning if it cannot
 		 * be sure the lock holder is running or live-lock may
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 112+ messages in thread

* [PATCH v4 13/16] locking/rwsem: Add more rwsem owner access helpers
  2019-04-13 17:22 [PATCH v4 00/16] locking/rwsem: Rwsem rearchitecture part 2 Waiman Long
                   ` (11 preceding siblings ...)
  2019-04-13 17:22 ` [PATCH v4 12/16] locking/rwsem: Enable time-based spinning on reader-owned rwsem Waiman Long
@ 2019-04-13 17:22 ` Waiman Long
  2019-04-13 17:22 ` [PATCH v4 14/16] locking/rwsem: Guard against making count negative Waiman Long
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 112+ messages in thread
From: Waiman Long @ 2019-04-13 17:22 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner
  Cc: linux-kernel, x86, Davidlohr Bueso, Linus Torvalds, Tim Chen,
	huang ying, Waiman Long

Before combining owner and count, we are adding two new helpers for
accessing the owner value in the rwsem.

 1) struct task_struct *rwsem_get_owner(struct rw_semaphore *sem)
 2) bool is_rwsem_reader_owned(struct rw_semaphore *sem)
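
A minimal usage sketch of the two accessors, mirroring how the patch
itself converts rwsem_can_spin_on_owner() and the DEBUG_RWSEMS_WARN_ON()
call sites:

	/* Read the owner through the accessor instead of sem->owner. */
	struct task_struct *owner = rwsem_get_owner(sem);
	bool can_spin = is_rwsem_owner_spinnable(owner);

	/* Warn if the rwsem is not actually reader-owned at this point. */
	DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);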

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/rwsem.c | 50 +++++++++++++++++++++++++++++++-----------
 1 file changed, 37 insertions(+), 13 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 8b23009e6b2c..ab26aba65371 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -130,6 +130,11 @@ static inline void rwsem_clear_owner(struct rw_semaphore *sem)
 	WRITE_ONCE(sem->owner, NULL);
 }
 
+static inline struct task_struct *rwsem_get_owner(struct rw_semaphore *sem)
+{
+	return READ_ONCE(sem->owner);
+}
+
 /*
  * The task_struct pointer of the last owning reader will be left in
  * the owner field.
@@ -174,6 +179,23 @@ static inline bool is_rwsem_spinnable(struct rw_semaphore *sem)
 	return is_rwsem_owner_spinnable(READ_ONCE(sem->owner));
 }
 
+/*
+ * Return true if the rwsem is owned by a reader.
+ */
+static inline bool is_rwsem_reader_owned(struct rw_semaphore *sem)
+{
+#ifdef CONFIG_DEBUG_RWSEMS
+	/*
+	 * Check the count to see if it is write-locked.
+	 */
+	long count = atomic_long_read(&sem->count);
+
+	if (count & RWSEM_WRITER_MASK)
+		return false;
+#endif
+	return (unsigned long)sem->owner & RWSEM_READER_OWNED;
+}
+
 /*
  * Return true if rwsem is owned by an anonymous writer or readers.
  */
@@ -193,6 +215,7 @@ static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
 {
 	unsigned long val = (unsigned long)current | RWSEM_READER_OWNED
 						   | RWSEM_ANONYMOUSLY_OWNED;
+
 	if (READ_ONCE(sem->owner) == (struct task_struct *)val)
 		cmpxchg_relaxed((unsigned long *)&sem->owner, val,
 				RWSEM_READER_OWNED | RWSEM_ANONYMOUSLY_OWNED);
@@ -530,7 +553,7 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
 
 	preempt_disable();
 	rcu_read_lock();
-	owner = READ_ONCE(sem->owner);
+	owner = rwsem_get_owner(sem);
 	if (owner) {
 		ret = is_rwsem_owner_spinnable(owner) &&
 		     (is_rwsem_owner_reader(owner) || owner_on_cpu(owner));
@@ -562,15 +585,21 @@ enum owner_state {
 
 static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
 {
-	struct task_struct *owner = READ_ONCE(sem->owner);
+	struct task_struct *owner = rwsem_get_owner(sem);
 	long count;
 
 	if (!is_rwsem_owner_spinnable(owner))
 		return OWNER_NONSPINNABLE;
 
 	rcu_read_lock();
-	while (owner && !is_rwsem_owner_reader(owner)
-		     && (READ_ONCE(sem->owner) == owner)) {
+	while (owner && !is_rwsem_owner_reader(owner)) {
+		struct task_struct *new_owner = rwsem_get_owner(sem);
+
+		if (new_owner != owner) {
+			owner = new_owner;
+			break;	/* The owner has changed */
+		}
+
 		/*
 		 * Ensure we emit the owner->on_cpu, dereference _after_
 		 * checking sem->owner still matches owner, if that fails,
@@ -597,7 +626,6 @@ static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
 	 * spinning except when here is no active locks and the handoff bit
 	 * is set. In this case, we have to stop spinning.
 	 */
-	owner = READ_ONCE(sem->owner);
 	if (!is_rwsem_owner_spinnable(owner))
 		return OWNER_NONSPINNABLE;
 	if (owner && !is_rwsem_owner_reader(owner))
@@ -1093,8 +1121,7 @@ inline void __down_read(struct rw_semaphore *sem)
 	if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
 			&sem->count) & RWSEM_READ_FAILED_MASK)) {
 		rwsem_down_read_failed(sem);
-		DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
-					RWSEM_READER_OWNED), sem);
+		DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
 	} else {
 		rwsem_set_reader_owned(sem);
 	}
@@ -1106,8 +1133,7 @@ static inline int __down_read_killable(struct rw_semaphore *sem)
 			&sem->count) & RWSEM_READ_FAILED_MASK)) {
 		if (IS_ERR(rwsem_down_read_failed_killable(sem)))
 			return -EINTR;
-		DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
-					RWSEM_READER_OWNED), sem);
+		DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
 	} else {
 		rwsem_set_reader_owned(sem);
 	}
@@ -1174,8 +1200,7 @@ inline void __up_read(struct rw_semaphore *sem)
 {
 	long tmp;
 
-	DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner & RWSEM_READER_OWNED),
-				sem);
+	DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
 	rwsem_clear_reader_owned(sem);
 	tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
 	if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS))
@@ -1382,8 +1407,7 @@ EXPORT_SYMBOL(down_write_killable_nested);
 
 void up_read_non_owner(struct rw_semaphore *sem)
 {
-	DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner & RWSEM_READER_OWNED),
-				sem);
+	DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
 	__up_read(sem);
 }
 EXPORT_SYMBOL(up_read_non_owner);
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 112+ messages in thread

* [PATCH v4 14/16] locking/rwsem: Guard against making count negative
  2019-04-13 17:22 [PATCH v4 00/16] locking/rwsem: Rwsem rearchitecture part 2 Waiman Long
                   ` (12 preceding siblings ...)
  2019-04-13 17:22 ` [PATCH v4 13/16] locking/rwsem: Add more rwsem owner access helpers Waiman Long
@ 2019-04-13 17:22 ` Waiman Long
  2019-04-18 13:51   ` Peter Zijlstra
  2019-04-13 17:22 ` [PATCH v4 15/16] locking/rwsem: Merge owner into count on x86-64 Waiman Long
  2019-04-13 17:22 ` [PATCH v4 16/16] locking/rwsem: Remove redundant computation of writer lock word Waiman Long
  15 siblings, 1 reply; 112+ messages in thread
From: Waiman Long @ 2019-04-13 17:22 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner
  Cc: linux-kernel, x86, Davidlohr Bueso, Linus Torvalds, Tim Chen,
	huang ying, Waiman Long

The upper bits of the count field are used as the reader count. When a
sufficient number of active readers are present, the most significant
bit will be set and the count becomes negative. If the number of active
readers keeps on piling up, we may eventually overflow the reader count.
This is not likely to happen unless the number of bits reserved for the
reader count is reduced because those bits are needed for other purposes.

To prevent this count overflow from happening, the most significant bit
is now treated as a guard bit (RWSEM_FLAG_READFAIL). Read-lock attempts
will now fail for both the fast and optimistic spinning paths whenever
this bit is set. So all those extra readers will be put to sleep in
the wait queue. Wakeup will not happen until the reader count reaches 0.
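
A small stand-alone illustration (assuming a 64-bit long, as on x86-64)
of why the guard bit works: the most significant bit doubles as the sign
bit, so a reader count that has climbed into bit 63 shows up as a
negative count and also trips the RWSEM_READ_FAILED_MASK test in the
down_read() fastpath. The constants mirror the patch:

/* Stand-alone illustration of the read-fail guard bit; not kernel code. */
#include <stdio.h>

#define RWSEM_WRITER_LOCKED	(1UL << 0)
#define RWSEM_FLAG_WAITERS	(1UL << 1)
#define RWSEM_FLAG_HANDOFF	(1UL << 2)
#define RWSEM_FLAG_READFAIL	(1UL << 63)

#define RWSEM_READ_FAILED_MASK	(RWSEM_WRITER_LOCKED | RWSEM_FLAG_WAITERS | \
				 RWSEM_FLAG_HANDOFF | RWSEM_FLAG_READFAIL)

int main(void)
{
	/*
	 * The count after readers have carried into bit 63, i.e. 2^55
	 * increments of RWSEM_READER_BIAS on the 64-bit layout above.
	 */
	long count = (long)RWSEM_FLAG_READFAIL;

	printf("count is negative:   %s\n", count < 0 ? "yes" : "no");
	printf("fastpath read fails: %s\n",
	       (count & RWSEM_READ_FAILED_MASK) ? "yes" : "no");
	return 0;
}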

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/rwsem.c | 84 ++++++++++++++++++++++++++++++++----------
 1 file changed, 64 insertions(+), 20 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index ab26aba65371..f37ab6358fe0 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -73,13 +73,28 @@
 #endif
 
 /*
- * The definition of the atomic counter in the semaphore:
+ * On 64-bit architectures, the bit definitions of the count are:
  *
- * Bit  0   - writer locked bit
- * Bit  1   - waiters present bit
- * Bit  2   - lock handoff bit
- * Bits 3-7 - reserved
- * Bits 8-X - 24-bit (32-bit) or 56-bit reader count
+ * Bit  0    - writer locked bit
+ * Bit  1    - waiters present bit
+ * Bit  2    - lock handoff bit
+ * Bits 3-7  - reserved
+ * Bits 8-62 - 55-bit reader count
+ * Bit  63   - read fail bit
+ *
+ * On 32-bit architectures, the bit definitions of the count are:
+ *
+ * Bit  0    - writer locked bit
+ * Bit  1    - waiters present bit
+ * Bit  2    - lock handoff bit
+ * Bits 3-7  - reserved
+ * Bits 8-30 - 23-bit reader count
+ * Bit  31   - read fail bit
+ *
+ * It is not likely that the most significant bit (read fail bit) will ever
+ * be set. This guard bit is still checked anyway in the down_read() fastpath
+ * just in case we need to use up more of the reader bits for other purposes
+ * in the future.
  *
  * atomic_long_fetch_add() is used to obtain reader lock, whereas
  * atomic_long_cmpxchg() will be used to obtain writer lock.
@@ -96,6 +111,7 @@
 #define RWSEM_WRITER_LOCKED	(1UL << 0)
 #define RWSEM_FLAG_WAITERS	(1UL << 1)
 #define RWSEM_FLAG_HANDOFF	(1UL << 2)
+#define RWSEM_FLAG_READFAIL	(1UL << (BITS_PER_LONG - 1))
 
 #define RWSEM_READER_SHIFT	8
 #define RWSEM_READER_BIAS	(1UL << RWSEM_READER_SHIFT)
@@ -103,7 +119,7 @@
 #define RWSEM_WRITER_MASK	RWSEM_WRITER_LOCKED
 #define RWSEM_LOCK_MASK		(RWSEM_WRITER_MASK|RWSEM_READER_MASK)
 #define RWSEM_READ_FAILED_MASK	(RWSEM_WRITER_MASK|RWSEM_FLAG_WAITERS|\
-				 RWSEM_FLAG_HANDOFF)
+				 RWSEM_FLAG_HANDOFF|RWSEM_FLAG_READFAIL)
 
 #define RWSEM_COUNT_LOCKED(c)	((c) & RWSEM_LOCK_MASK)
 #define RWSEM_COUNT_WLOCKED(c)	((c) & RWSEM_WRITER_MASK)
@@ -315,7 +331,8 @@ enum writer_wait_state {
 /*
  * We limit the maximum number of readers that can be woken up for a
 * wake-up call so as not to penalize the waking thread for spending too
- * much time doing it.
+ * much time doing it as well as the unlikely possibility of overflowing
+ * the reader count.
  */
 #define MAX_READERS_WAKEUP	0x100
 
@@ -799,12 +816,35 @@ rwsem_waiter_is_first(struct rw_semaphore *sem, struct rwsem_waiter *waiter)
  * Wait for the read lock to be granted
  */
 static inline struct rw_semaphore __sched *
-__rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
+__rwsem_down_read_failed_common(struct rw_semaphore *sem, int state, long count)
 {
-	long count, adjustment = -RWSEM_READER_BIAS;
+	long adjustment = -RWSEM_READER_BIAS;
 	struct rwsem_waiter waiter;
 	DEFINE_WAKE_Q(wake_q);
 
+	if (unlikely(count < 0)) {
+		/*
+		 * The sign bit has been set, meaning that too many
+		 * active readers are present. We need to decrement the
+		 * reader count and enter the wait queue immediately to
+		 * avoid overflowing the reader count.
+		 *
+		 * As preemption is not disabled, there is a remote
+		 * possibility that preemption can happen in the narrow
+		 * timing window between incrementing and decrementing
+		 * the reader count and the task is put to sleep for a
+		 * considerable amount of time. If a sufficient number
+		 * of such unfortunate sequences of events happen, we
+		 * may still overflow the reader count. It is extremely
+		 * unlikely, though. If this is a concern, we should
+		 * consider disabling preemption during this timing
+		 * window to make sure that such an unfortunate event
+		 * will not happen.
+		 */
+		atomic_long_add(-RWSEM_READER_BIAS, &sem->count);
+		adjustment = 0;
+		goto queue;
+	}
+
 	if (!rwsem_can_spin_on_owner(sem))
 		goto queue;
 
@@ -905,15 +945,15 @@ __rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
 }
 
 static inline struct rw_semaphore * __sched
-rwsem_down_read_failed(struct rw_semaphore *sem)
+rwsem_down_read_failed(struct rw_semaphore *sem, long cnt)
 {
-	return __rwsem_down_read_failed_common(sem, TASK_UNINTERRUPTIBLE);
+	return __rwsem_down_read_failed_common(sem, TASK_UNINTERRUPTIBLE, cnt);
 }
 
 static inline struct rw_semaphore * __sched
-rwsem_down_read_failed_killable(struct rw_semaphore *sem)
+rwsem_down_read_failed_killable(struct rw_semaphore *sem, long cnt)
 {
-	return __rwsem_down_read_failed_common(sem, TASK_KILLABLE);
+	return __rwsem_down_read_failed_common(sem, TASK_KILLABLE, cnt);
 }
 
 /*
@@ -1118,9 +1158,11 @@ static struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem)
  */
 inline void __down_read(struct rw_semaphore *sem)
 {
-	if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
-			&sem->count) & RWSEM_READ_FAILED_MASK)) {
-		rwsem_down_read_failed(sem);
+	long count = atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
+						   &sem->count);
+
+	if (unlikely(count & RWSEM_READ_FAILED_MASK)) {
+		rwsem_down_read_failed(sem, count);
 		DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
 	} else {
 		rwsem_set_reader_owned(sem);
@@ -1129,9 +1171,11 @@ inline void __down_read(struct rw_semaphore *sem)
 
 static inline int __down_read_killable(struct rw_semaphore *sem)
 {
-	if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
-			&sem->count) & RWSEM_READ_FAILED_MASK)) {
-		if (IS_ERR(rwsem_down_read_failed_killable(sem)))
+	long count = atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
+						   &sem->count);
+
+	if (unlikely(count & RWSEM_READ_FAILED_MASK)) {
+		if (IS_ERR(rwsem_down_read_failed_killable(sem, count)))
 			return -EINTR;
 		DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
 	} else {
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 112+ messages in thread

* [PATCH v4 15/16] locking/rwsem: Merge owner into count on x86-64
  2019-04-13 17:22 [PATCH v4 00/16] locking/rwsem: Rwsem rearchitecture part 2 Waiman Long
                   ` (13 preceding siblings ...)
  2019-04-13 17:22 ` [PATCH v4 14/16] locking/rwsem: Guard against making count negative Waiman Long
@ 2019-04-13 17:22 ` Waiman Long
  2019-04-18 14:28   ` Peter Zijlstra
  2019-04-13 17:22 ` [PATCH v4 16/16] locking/rwsem: Remove redundant computation of writer lock word Waiman Long
  15 siblings, 1 reply; 112+ messages in thread
From: Waiman Long @ 2019-04-13 17:22 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner
  Cc: linux-kernel, x86, Davidlohr Bueso, Linus Torvalds, Tim Chen,
	huang ying, Waiman Long

With separate count and owner, there are timing windows where the two
values are inconsistent. That can cause problems when trying to figure
out the exact state of the rwsem. For instance, a RT task will stop
optimistic spinning if the lock is acquired by a writer but the owner
field isn't set yet. That can be solved by combining the count and
owner together in a single atomic value.

On 32-bit architectures, there aren't enough bits to hold both.
64-bit architectures, however, can have enough bits to do that. For
x86-64, the physical address can use up to 52 bits. That is 4PB of
memory. That leaves 12 bits available for other use. The task structure
pointer is aligned to the L1 cache size. That means another 6 bits
(64-byte cacheline) will be available. Reserving 2 bits for status
flags, we will have 16 bits for the reader count and the read fail bit.
That supports up to (32k-1) readers.
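
As an illustration, the bit budget works out as follows (a userspace
sketch with assumed x86-64 constants; not part of the patch):

	#include <stdio.h>

	#define PA_MASK_SHIFT	52	/* assumed __PHYSICAL_MASK_SHIFT */
	#define L1_CACHE_SHIFT	6	/* assumed 64-byte cachelines */
	/* mirrors RWSEM_READER_SHIFT in the patch: 52 - 6 + 2 = 48 */
	#define READER_SHIFT	(PA_MASK_SHIFT - L1_CACHE_SHIFT + 2)

	int main(void)
	{
		int ptr_bits    = READER_SHIFT - 2;	 /* bits 2..47 */
		int reader_bits = 64 - READER_SHIFT - 1; /* bit 63 = read fail */

		printf("compressed owner pointer: %d bits (bits 2-%d)\n",
		       ptr_bits, READER_SHIFT - 1);
		printf("reader count: %d bits -> up to %d readers\n",
		       reader_bits, (1 << reader_bits) - 1);
		return 0;
	}

This prints a 46-bit compressed pointer and a 15-bit reader count,
i.e. the (32k-1) reader limit described above.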

The owner value will still be duplicated in the owner field as that
will ease debugging when looking at a core dump.

This change is currently enabled for x86-64 only. Other 64-bit
architectures may be enabled in the future if the need arises.

With a locking microbenchmark running on a 5.1-based kernel, the total
locking rates (in kops/s) on an 8-socket IvyBridge-EX system with
writer-only locking threads and then equal numbers of readers and writers
(mixed) before patch and after this and subsequent related patches were
as follows:

                  Before Patch      After Patch
   # of Threads  wlock    mixed    wlock    mixed
   ------------  -----    -----    -----    -----
        1        30,422   31,034   30,323   30,379
        2         6,427    6,684    7,804    9,436
        4         6,742    6,738    7,568    8,268
        8         7,092    7,222    5,679    7,041
       16         6,882    7,163    6,848    7,652
       32         7,458    7,316    7,975    2,189
       64         7,906      520    8,269      534
      128         1,680      425    8,047      448

In the single thread case, the complex write-locking operation does
introduce a little bit of overhead (about 0.3%). For the contended cases,
except for some anomalies in the data, there is no evidence that this
change will adversely impact performance.

When running the same microbenchmark with RT locking threads instead,
we got the following results:

                  Before Patch      After Patch
   # of Threads  wlock    mixed    wlock    mixed
   ------------  -----    -----    -----    -----
        2         4,065    3,642    4,756    5,062
        4         2,254    1,907    3,460    2,496
        8         2,386      964    3,012    1,964
       16         2,095    1,596    3,083    1,862
       32         2,388      530    3,717      359
       64         1,424      322    4,060      401
      128         1,642      510    4,488      628

It is obvious that RT tasks can benefit pretty significantly from this set
of patches.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/rwsem.c | 112 +++++++++++++++++++++++++++++++++++++----
 1 file changed, 103 insertions(+), 9 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index f37ab6358fe0..27219abb8bb6 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -73,7 +73,41 @@
 #endif
 
 /*
- * On 64-bit architectures, the bit definitions of the count are:
+ * Enable the merging of owner into count for x86-64 only.
+ */
+#ifdef CONFIG_X86_64
+#define RWSEM_MERGE_OWNER_TO_COUNT
+#endif
+
+/*
+ * With separate count and owner, there are timing windows where the two
+ * values are inconsistent. That can cause problems when trying to figure
+ * out the exact state of the rwsem. That can be solved by combining
+ * the count and owner together in a single atomic value.
+ *
+ * On 64-bit architectures, the owner task structure pointer can be
+ * compressed and combined with reader count and other status flags.
+ * A simple compression method is to map the virtual address back to
+ * the physical address by subtracting PAGE_OFFSET. On 32-bit
+ * architectures, the long integer value just isn't big enough for
+ * combining owner and count. So they remain separate.
+ *
+ * For x86-64, the physical address can use up to 52 bits. That is 4PB
+ * of memory. That leaves 12 bits available for other use. The task
+ * structure pointer is also aligned to the L1 cache size. That means
+ * another 6 bits (64-byte cacheline) will be available. Reserving
+ * 2 bits for status flags, we will have 16 bits for the reader count
+ * and read fail bit. That supports up to (32k-1) active readers.
+ *
+ * On x86-64, the bit definitions of the count are:
+ *
+ * Bit   0    - waiters present bit
+ * Bit   1    - lock handoff bit
+ * Bits  2-47 - compressed task structure pointer
+ * Bits 48-62 - 15-bit reader counts
+ * Bit  63    - read fail bit
+ *
+ * On other 64-bit architectures, the bit definitions are:
  *
  * Bit  0    - writer locked bit
  * Bit  1    - waiters present bit
@@ -108,15 +142,30 @@
  * be the first one in the wait_list to be eligible for setting the handoff
  * bit. So concurrent setting/clearing of handoff bit is not possible.
  */
-#define RWSEM_WRITER_LOCKED	(1UL << 0)
-#define RWSEM_FLAG_WAITERS	(1UL << 1)
-#define RWSEM_FLAG_HANDOFF	(1UL << 2)
+#define RWSEM_FLAG_WAITERS	(1UL << 0)
+#define RWSEM_FLAG_HANDOFF	(1UL << 1)
 #define RWSEM_FLAG_READFAIL	(1UL << (BITS_PER_LONG - 1))
 
+
+#ifdef RWSEM_MERGE_OWNER_TO_COUNT
+
+#ifdef __PHYSICAL_MASK_SHIFT
+#define RWSEM_PA_MASK_SHIFT	__PHYSICAL_MASK_SHIFT
+#else
+#define RWSEM_PA_MASK_SHIFT	52
+#endif
+#define RWSEM_READER_SHIFT	(RWSEM_PA_MASK_SHIFT - L1_CACHE_SHIFT + 2)
+#define RWSEM_WRITER_MASK	((1UL << RWSEM_READER_SHIFT) - 4)
+#define RWSEM_WRITER_LOCKED	rwsem_owner_count(current)
+
+#else /* RWSEM_MERGE_OWNER_TO_COUNT */
 #define RWSEM_READER_SHIFT	8
+#define RWSEM_WRITER_MASK	(1UL << 7)
+#define RWSEM_WRITER_LOCKED	RWSEM_WRITER_MASK
+#endif /* RWSEM_MERGE_OWNER_TO_COUNT */
+
 #define RWSEM_READER_BIAS	(1UL << RWSEM_READER_SHIFT)
 #define RWSEM_READER_MASK	(~(RWSEM_READER_BIAS - 1))
-#define RWSEM_WRITER_MASK	RWSEM_WRITER_LOCKED
 #define RWSEM_LOCK_MASK		(RWSEM_WRITER_MASK|RWSEM_READER_MASK)
 #define RWSEM_READ_FAILED_MASK	(RWSEM_WRITER_MASK|RWSEM_FLAG_WAITERS|\
 				 RWSEM_FLAG_HANDOFF|RWSEM_FLAG_READFAIL)
@@ -129,13 +178,34 @@
 #define RWSEM_COUNT_WLOCKED_OR_HANDOFF(c)	\
 	((c) & (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))
 
+/*
+ * Task structure pointer compression (64-bit only):
+ * (owner - PAGE_OFFSET) >> (L1_CACHE_SHIFT - 2)
+ */
+static inline unsigned long rwsem_owner_count(struct task_struct *owner)
+{
+	return ((unsigned long)owner - PAGE_OFFSET) >> (L1_CACHE_SHIFT - 2);
+}
+
+static inline unsigned long rwsem_count_owner(long count)
+{
+	unsigned long writer = (unsigned long)count & RWSEM_WRITER_MASK;
+
+	return writer ? (writer << (L1_CACHE_SHIFT - 2)) + PAGE_OFFSET : 0;
+}
+
 /*
  * All writes to owner are protected by WRITE_ONCE() to make sure that
  * store tearing can't happen as optimistic spinners may read and use
  * the owner value concurrently without lock. Read from owner, however,
  * may not need READ_ONCE() as long as the pointer value is only used
  * for comparison and isn't being dereferenced.
+ *
+ * On 32-bit architectures, the owner and count are separate. On 64-bit
+ * architectures, however, the writer task structure pointer is written
+ * to the count in addition to the owner field.
  */
+
 static inline void rwsem_set_owner(struct rw_semaphore *sem)
 {
 	WRITE_ONCE(sem->owner, current);
@@ -146,10 +216,26 @@ static inline void rwsem_clear_owner(struct rw_semaphore *sem)
 	WRITE_ONCE(sem->owner, NULL);
 }
 
+#ifdef RWSEM_MERGE_OWNER_TO_COUNT
+/*
+ * Get the owner value from count to have early access to the task structure.
+ * Owner from sem->count should include the RWSEM_ANONYMOUSLY_OWNED bit
+ * from sem->owner.
+ */
+static inline struct task_struct *rwsem_get_owner(struct rw_semaphore *sem)
+{
+	unsigned long cowner = rwsem_count_owner(atomic_long_read(&sem->count));
+	unsigned long sowner = (unsigned long)READ_ONCE(sem->owner);
+
+	return (struct task_struct *) (cowner
+		? cowner | (sowner & RWSEM_ANONYMOUSLY_OWNED) : sowner);
+}
+#else /* !RWSEM_MERGE_OWNER_TO_COUNT */
 static inline struct task_struct *rwsem_get_owner(struct rw_semaphore *sem)
 {
 	return READ_ONCE(sem->owner);
 }
+#endif /* RWSEM_MERGE_OWNER_TO_COUNT */
 
 /*
  * The task_struct pointer of the last owning reader will be left in
@@ -261,11 +347,11 @@ static inline void rwsem_set_nonspinnable(struct rw_semaphore *sem)
 /*
  * Guide to the rw_semaphore's count field.
  *
- * When the RWSEM_WRITER_LOCKED bit in count is set, the lock is owned
- * by a writer.
+ * When any of the RWSEM_WRITER_MASK bits in count is set, the lock is
+ * owned by a writer.
  *
  * The lock is owned by readers when
- * (1) the RWSEM_WRITER_LOCKED isn't set in count,
+ * (1) none of the RWSEM_WRITER_MASK bits is set in count,
  * (2) some of the reader bits are set in count, and
  * (3) the owner field has RWSEM_READ_OWNED bit set.
  *
@@ -281,6 +367,11 @@ static inline void rwsem_set_nonspinnable(struct rw_semaphore *sem)
 void __init_rwsem(struct rw_semaphore *sem, const char *name,
 		  struct lock_class_key *key)
 {
+	/*
+	 * We should support at least (4k-1) concurrent readers
+	 */
+	BUILD_BUG_ON(sizeof(long) * 8 - RWSEM_READER_SHIFT < 12);
+
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 	/*
 	 * Make sure we are not reinitializing a held semaphore:
@@ -1211,6 +1302,9 @@ static inline void __down_write(struct rw_semaphore *sem)
 						 RWSEM_WRITER_LOCKED)))
 		rwsem_down_write_failed(sem);
 	rwsem_set_owner(sem);
+#ifdef RWSEM_MERGE_OWNER_TO_COUNT
+	DEBUG_RWSEMS_WARN_ON(sem->owner != rwsem_get_owner(sem), sem);
+#endif
 }
 
 static inline int __down_write_killable(struct rw_semaphore *sem)
@@ -1261,7 +1355,7 @@ static inline void __up_write(struct rw_semaphore *sem)
 
 	DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
 	rwsem_clear_owner(sem);
-	tmp = atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED, &sem->count);
+	tmp = atomic_long_fetch_and_release(~RWSEM_WRITER_MASK, &sem->count);
 	if (unlikely(tmp & RWSEM_FLAG_WAITERS))
 		rwsem_wake(sem, tmp);
 }
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 112+ messages in thread

* [PATCH v4 16/16] locking/rwsem: Remove redundant computation of writer lock word
  2019-04-13 17:22 [PATCH v4 00/16] locking/rwsem: Rwsem rearchitecture part 2 Waiman Long
                   ` (14 preceding siblings ...)
  2019-04-13 17:22 ` [PATCH v4 15/16] locking/rwsem: Merge owner into count on x86-64 Waiman Long
@ 2019-04-13 17:22 ` Waiman Long
  15 siblings, 0 replies; 112+ messages in thread
From: Waiman Long @ 2019-04-13 17:22 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner
  Cc: linux-kernel, x86, Davidlohr Bueso, Linus Torvalds, Tim Chen,
	huang ying, Waiman Long

On 64-bit architectures, each rwsem writer will have its unique lock
word for acquiring the lock. Right now, the writer code recomputes the
lock word every time it tries to acquire the lock. This is a waste of
time. The lock word is now cached and reused when it is needed.

On 32-bit architectures, the extra constant argument to
rwsem_try_write_lock() and rwsem_try_write_lock_unqueued() should be
optimized out by the compiler.
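
Condensed, the change looks like this (a sketch; the actual diff is
below):

	/* compute the task-dependent write-lock word once on slowpath entry */
	const long wlock = RWSEM_WRITER_LOCKED;

	/* ... then pass it down instead of recomputing it at each attempt */
	rwsem_optimistic_spin(sem, wlock);
	rwsem_try_write_lock(count, sem, wlock, wstate);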

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/rwsem.c | 22 +++++++++++++---------
 1 file changed, 13 insertions(+), 9 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 27219abb8bb6..2c8187690c7c 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -561,6 +561,7 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
  * bit is set or the lock is acquired.
  */
 static inline bool rwsem_try_write_lock(long count, struct rw_semaphore *sem,
+					const long wlock,
 					enum writer_wait_state wstate)
 {
 	long new;
@@ -581,7 +582,7 @@ static inline bool rwsem_try_write_lock(long count, struct rw_semaphore *sem,
 	if ((wstate == WRITER_NOT_FIRST) && RWSEM_COUNT_HANDOFF(count))
 		return false;
 
-	new = (count & ~RWSEM_FLAG_HANDOFF) + RWSEM_WRITER_LOCKED -
+	new = (count & ~RWSEM_FLAG_HANDOFF) + wlock -
 	      (list_is_singular(&sem->wait_list) ? RWSEM_FLAG_WAITERS : 0);
 
 	if (atomic_long_try_cmpxchg_acquire(&sem->count, &count, new)) {
@@ -623,13 +624,14 @@ static inline bool rwsem_try_read_lock_unqueued(struct rw_semaphore *sem)
 /*
  * Try to acquire write lock before the writer has been put on wait queue.
  */
-static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem)
+static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem,
+						 const long wlock)
 {
 	long count = atomic_long_read(&sem->count);
 
 	while (!RWSEM_COUNT_LOCKED_OR_HANDOFF(count)) {
 		if (atomic_long_try_cmpxchg_acquire(&sem->count, &count,
-					count + RWSEM_WRITER_LOCKED)) {
+						    count + wlock)) {
 			rwsem_set_owner(sem);
 			lockevent_inc(rwsem_opt_wlock);
 			return true;
@@ -779,7 +781,7 @@ static inline u64 rwsem_rspin_threshold(struct rw_semaphore *sem)
 		: 25 * NSEC_PER_USEC);
 }
 
-static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
+static bool rwsem_optimistic_spin(struct rw_semaphore *sem, const long wlock)
 {
 	bool taken = false;
 	bool is_rt_task = rt_task(current);
@@ -808,7 +810,7 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
 		/*
 		 * Try to acquire the lock
 		 */
-		taken = wlock ? rwsem_try_write_lock_unqueued(sem)
+		taken = wlock ? rwsem_try_write_lock_unqueued(sem, wlock)
 			      : rwsem_try_read_lock_unqueued(sem);
 
 		if (taken)
@@ -887,7 +889,8 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
 	return false;
 }
 
-static inline bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
+static inline bool rwsem_optimistic_spin(struct rw_semaphore *sem,
+					 const long wlock)
 {
 	return false;
 }
@@ -944,7 +947,7 @@ __rwsem_down_read_failed_common(struct rw_semaphore *sem, int state, long count)
 	 */
 	atomic_long_add(-RWSEM_READER_BIAS, &sem->count);
 	adjustment = 0;
-	if (rwsem_optimistic_spin(sem, false)) {
+	if (rwsem_optimistic_spin(sem, 0)) {
 		unsigned long flags;
 
 		/*
@@ -1058,10 +1061,11 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
 	struct rwsem_waiter waiter;
 	struct rw_semaphore *ret = sem;
 	DEFINE_WAKE_Q(wake_q);
+	const long wlock = RWSEM_WRITER_LOCKED;
 
 	/* do optimistic spinning and steal lock if possible */
 	if (rwsem_can_spin_on_owner(sem) &&
-	    rwsem_optimistic_spin(sem, true))
+	    rwsem_optimistic_spin(sem, wlock))
 		return sem;
 
 	/*
@@ -1120,7 +1124,7 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
 	/* wait until we successfully acquire the lock */
 	set_current_state(state);
 	while (true) {
-		if (rwsem_try_write_lock(count, sem, wstate))
+		if (rwsem_try_write_lock(count, sem, wlock, wstate))
 			break;
 
 		raw_spin_unlock_irq(&sem->wait_lock);
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 04/16] locking/rwsem: Implement a new locking scheme
  2019-04-13 17:22 ` [PATCH v4 04/16] locking/rwsem: Implement a new locking scheme Waiman Long
@ 2019-04-16 13:22   ` Peter Zijlstra
  2019-04-16 13:32     ` Waiman Long
  0 siblings, 1 reply; 112+ messages in thread
From: Peter Zijlstra @ 2019-04-16 13:22 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On Sat, Apr 13, 2019 at 01:22:47PM -0400, Waiman Long wrote:
> +#define RWSEM_COUNT_LOCKED(c)	((c) & RWSEM_LOCK_MASK)

The above doesn't seem to make it more readable or shorter.

--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -192,7 +192,7 @@ static inline bool rwsem_try_write_lock(
 {
 	long new;
 
-	if (RWSEM_COUNT_LOCKED(count))
+	if (count & RWSEM_LOCK_MASK)
 		return false;
 
 	new = count + RWSEM_WRITER_LOCKED -
@@ -214,7 +214,7 @@ static inline bool rwsem_try_write_lock_
 {
 	long count = atomic_long_read(&sem->count);
 
-	while (!RWSEM_COUNT_LOCKED(count)) {
+	while (!(count & RWSEM_LOCK_MASK)) {
 		if (atomic_long_try_cmpxchg_acquire(&sem->count, &count,
 					count + RWSEM_WRITER_LOCKED)) {
 			rwsem_set_owner(sem);
@@ -393,7 +393,7 @@ __rwsem_down_read_failed_common(struct r
 	 * If there are no writers and we are first in the queue,
 	 * wake our own waiter to join the existing active readers !
 	 */
-	if (!RWSEM_COUNT_LOCKED(count) ||
+	if (!(count & RWSEM_LOCK_MASK) ||
 	   (!(count & RWSEM_WRITER_MASK) && (adjustment & RWSEM_FLAG_WAITERS)))
 		__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
 
@@ -522,7 +522,7 @@ __rwsem_down_write_failed_common(struct
 			lockevent_inc(rwsem_sleep_writer);
 			set_current_state(state);
 			count = atomic_long_read(&sem->count);
-		} while (RWSEM_COUNT_LOCKED(count));
+		} while (count & RWSEM_LOCK_MASK);
 
 		raw_spin_lock_irq(&sem->wait_lock);
 	}

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 04/16] locking/rwsem: Implement a new locking scheme
  2019-04-16 13:22   ` Peter Zijlstra
@ 2019-04-16 13:32     ` Waiman Long
  2019-04-16 14:18       ` Peter Zijlstra
  0 siblings, 1 reply; 112+ messages in thread
From: Waiman Long @ 2019-04-16 13:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On 04/16/2019 09:22 AM, Peter Zijlstra wrote:
> On Sat, Apr 13, 2019 at 01:22:47PM -0400, Waiman Long wrote:
>> +#define RWSEM_COUNT_LOCKED(c)	((c) & RWSEM_LOCK_MASK)
> The above doesn't seem to make it more readable or shorter.

Fair enough. I can remove that macro.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 07/16] locking/rwsem: Implement lock handoff to prevent lock starvation
  2019-04-13 17:22 ` [PATCH v4 07/16] locking/rwsem: Implement lock handoff to prevent lock starvation Waiman Long
@ 2019-04-16 14:12   ` Peter Zijlstra
  2019-04-16 20:26     ` Waiman Long
  2019-04-16 15:49   ` Peter Zijlstra
  2019-04-17  8:17   ` Peter Zijlstra
  2 siblings, 1 reply; 112+ messages in thread
From: Peter Zijlstra @ 2019-04-16 14:12 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On Sat, Apr 13, 2019 at 01:22:50PM -0400, Waiman Long wrote:
> +/*
> + * The typical HZ value is either 250 or 1000. So set the minimum waiting
> + * time to 4ms in the wait queue before initiating the handoff protocol.
> + */
> +#define RWSEM_WAIT_TIMEOUT	(HZ/250)

That seems equally unfortunate. For HZ=100 that results in 0ms, and for
HZ=300 that results in 3 1/3-rd ms.

(and this is not considering Alpha,ARM and MIPS, who all have various
other 'creative' HZ values)

In general aiming for sub 10ms timing using jiffies seems 'optimistic'.
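
A quick userspace sketch of that rounding (illustrative only, not
kernel code; HZ values picked from the common configs):

	#include <stdio.h>

	int main(void)
	{
		const int hz[] = { 100, 250, 300, 1000 };

		for (unsigned int i = 0; i < sizeof(hz) / sizeof(hz[0]); i++) {
			int timeout = hz[i] / 250;	/* RWSEM_WAIT_TIMEOUT, integer division */

			printf("HZ=%4d: %d jiffies = %.2f ms\n",
			       hz[i], timeout, timeout * 1000.0 / hz[i]);
		}
		return 0;
	}

which gives 0 ms for HZ=100, 4 ms for HZ=250 and HZ=1000, and 3.33 ms
for HZ=300.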

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 04/16] locking/rwsem: Implement a new locking scheme
  2019-04-16 13:32     ` Waiman Long
@ 2019-04-16 14:18       ` Peter Zijlstra
  2019-04-16 14:42         ` Peter Zijlstra
  0 siblings, 1 reply; 112+ messages in thread
From: Peter Zijlstra @ 2019-04-16 14:18 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On Tue, Apr 16, 2019 at 09:32:38AM -0400, Waiman Long wrote:
> On 04/16/2019 09:22 AM, Peter Zijlstra wrote:
> > On Sat, Apr 13, 2019 at 01:22:47PM -0400, Waiman Long wrote:
> >> +#define RWSEM_COUNT_LOCKED(c)	((c) & RWSEM_LOCK_MASK)
> > The above doesn't seem to make it more readable or shorter.
> 
> Fair enough. I can remove that macro.

I did the same for the HANDOFF patch but seem to have misplaced the
delta and have already refreshed the patch.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 04/16] locking/rwsem: Implement a new locking scheme
  2019-04-16 14:18       ` Peter Zijlstra
@ 2019-04-16 14:42         ` Peter Zijlstra
  0 siblings, 0 replies; 112+ messages in thread
From: Peter Zijlstra @ 2019-04-16 14:42 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On Tue, Apr 16, 2019 at 04:18:20PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 16, 2019 at 09:32:38AM -0400, Waiman Long wrote:
> > On 04/16/2019 09:22 AM, Peter Zijlstra wrote:
> > > On Sat, Apr 13, 2019 at 01:22:47PM -0400, Waiman Long wrote:
> > >> +#define RWSEM_COUNT_LOCKED(c)	((c) & RWSEM_LOCK_MASK)
> > > The above doesn't seem to make it more readable or shorter.
> > 
> > Fair enough. I can remove that macro.
> 
> I did the same for the HANDOFF patch but seem to have misplaced the
> delta and have already refreshed the patch.

Had to redo it, so here goes:

---
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -102,9 +102,6 @@
 #define RWSEM_READ_FAILED_MASK	(RWSEM_WRITER_MASK|RWSEM_FLAG_WAITERS|\
 				 RWSEM_FLAG_HANDOFF)
 
-#define RWSEM_COUNT_HANDOFF(c)	((c) & RWSEM_FLAG_HANDOFF)
-#define RWSEM_COUNT_LOCKED_OR_HANDOFF(c)	\
-	((c) & (RWSEM_LOCK_MASK|RWSEM_FLAG_HANDOFF))
 /*
  * All writes to owner are protected by WRITE_ONCE() to make sure that
  * store tearing can't happen as optimistic spinners may read and use
@@ -365,7 +362,7 @@ static void __rwsem_mark_wake(struct rw_
 	/*
 	 * Clear the handoff flag
 	 */
-	if (woken && RWSEM_COUNT_HANDOFF(atomic_long_read(&sem->count)))
+	if (woken && (atomic_long_read(&sem->count) & RWSEM_FLAG_HANDOFF))
 		adjustment -= RWSEM_FLAG_HANDOFF;
 
 	if (adjustment)
@@ -387,7 +384,7 @@ static inline bool rwsem_try_write_lock(
 
 retry:
 	if (count & RWSEM_LOCK_MASM)) {
-		if (RWSEM_COUNT_HANDOFF(count) || (wstate != WRITER_HANDOFF))
+		if ((count & RWSEM_FLAG_HANDOFF) || (wstate != WRITER_HANDOFF))
 			return false;
 		/*
 		 * The lock may become free just before setting handoff bit.
@@ -398,7 +395,7 @@ static inline bool rwsem_try_write_lock(
 		goto retry;
 	}
 
-	if ((wstate == WRITER_NOT_FIRST) && RWSEM_COUNT_HANDOFF(count))
+	if ((wstate == WRITER_NOT_FIRST) && (count & RWSEM_FLAG_HANDOFF))
 		return false;
 
 	new = (count & ~RWSEM_FLAG_HANDOFF) + RWSEM_WRITER_LOCKED -
@@ -409,7 +406,7 @@ static inline bool rwsem_try_write_lock(
 		return true;
 	}
 
-	if (unlikely((wstate == WRITER_HANDOFF) && !RWSEM_COUNT_HANDOFF(count)))
+	if (unlikely((wstate == WRITER_HANDOFF) && !(count & RWSEM_FLAG_HANDOFF)))
 		goto retry;
 
 	return false;
@@ -704,7 +701,7 @@ __rwsem_down_write_failed_common(struct
 			    rwsem_waiter_is_first(sem, &waiter))
 				wstate = WRITER_FIRST;
 
-			if (!RWSEM_COUNT_LOCKED(count))
+			if (!(count & RWSEM_LOCK_MASK))
 				break;
 
 			/*

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 07/16] locking/rwsem: Implement lock handoff to prevent lock starvation
  2019-04-13 17:22 ` [PATCH v4 07/16] locking/rwsem: Implement lock handoff to prevent lock starvation Waiman Long
  2019-04-16 14:12   ` Peter Zijlstra
@ 2019-04-16 15:49   ` Peter Zijlstra
  2019-04-16 16:15     ` Peter Zijlstra
  2019-04-16 18:16     ` Waiman Long
  2019-04-17  8:17   ` Peter Zijlstra
  2 siblings, 2 replies; 112+ messages in thread
From: Peter Zijlstra @ 2019-04-16 15:49 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On Sat, Apr 13, 2019 at 01:22:50PM -0400, Waiman Long wrote:

> +#define RWSEM_COUNT_HANDOFF(c)	((c) & RWSEM_FLAG_HANDOFF)
> +#define RWSEM_COUNT_LOCKED_OR_HANDOFF(c)	\
> +	((c) & (RWSEM_LOCK_MASK|RWSEM_FLAG_HANDOFF))

Like said before, I also made these go away.

> @@ -245,6 +274,8 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
>  	struct rwsem_waiter *waiter, *tmp;
>  	long oldcount, woken = 0, adjustment = 0;
>  
> +	lockdep_assert_held(&sem->wait_lock);
> +
>  	/*
>  	 * Take a peek at the queue head waiter such that we can determine
>  	 * the wakeup(s) to perform.
> @@ -276,6 +307,15 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
>  		adjustment = RWSEM_READER_BIAS;
>  		oldcount = atomic_long_fetch_add(adjustment, &sem->count);
>  		if (unlikely(oldcount & RWSEM_WRITER_MASK)) {
> +			/*
> +			 * Initiate handoff to reader, if applicable.
> +			 */
> +			if (!(oldcount & RWSEM_FLAG_HANDOFF) &&
> +			    time_after(jiffies, waiter->timeout)) {
> +				adjustment -= RWSEM_FLAG_HANDOFF;
> +				lockevent_inc(rwsem_rlock_handoff);
> +			}

			/*
			 * When we've been waiting 'too' long (for
			 * writers to give up the lock) request a
			 * HANDOFF to force the issue.
			 */

?

> +
>  			atomic_long_sub(adjustment, &sem->count);

Can we change this to: atomic_long_add() please? The below loop that
wakes all remaining readers does use add(), so it is a bit 'weird' to
have the adjustment being negated on handover.

>  			return;
>  		}
> @@ -324,6 +364,12 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
>  		adjustment -= RWSEM_FLAG_WAITERS;
>  	}
>  
> +	/*
> +	 * Clear the handoff flag
> +	 */

Right, but that is a trivial comment in the 'increment i' style, it
clearly states what the code does, but completely fails to elucidate the
code.

Maybe:

	/*
	 * When we've woken a reader, we no longer need to force writers
	 * to give up the lock and we can clear HANDOFF.
	 */

And I suppose this is required if we were the pickup of the handoff set
above, but is there a guarantee that the HANDOFF was not set by a
writer?

> +	if (woken && RWSEM_COUNT_HANDOFF(atomic_long_read(&sem->count)))
> +		adjustment -= RWSEM_FLAG_HANDOFF;
> +
>  	if (adjustment)
>  		atomic_long_add(adjustment, &sem->count);
>  }
> @@ -332,22 +378,42 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
>   * This function must be called with the sem->wait_lock held to prevent
>   * race conditions between checking the rwsem wait list and setting the
>   * sem->count accordingly.
> + *
> + * If wstate is WRITER_HANDOFF, it will make sure that either the handoff
> + * bit is set or the lock is acquired.
>   */
> +static inline bool rwsem_try_write_lock(long count, struct rw_semaphore *sem,
> +					enum writer_wait_state wstate)
>  {
>  	long new;
>  
	lockdep_assert_held(&sem->wait_lock);

> +retry:
> +	if (RWSEM_COUNT_LOCKED(count)) {
> +		if (RWSEM_COUNT_HANDOFF(count) || (wstate != WRITER_HANDOFF))
> +			return false;
> +		/*
> +		 * The lock may become free just before setting handoff bit.
> +		 * It will be simpler if atomic_long_or_return() is available.
> +		 */
> +		atomic_long_or(RWSEM_FLAG_HANDOFF, &sem->count);
> +		count = atomic_long_read(&sem->count);
> +		goto retry;
> +	}
> +
> +	if ((wstate == WRITER_NOT_FIRST) && RWSEM_COUNT_HANDOFF(count))
>  		return false;
>  
> +	new = (count & ~RWSEM_FLAG_HANDOFF) + RWSEM_WRITER_LOCKED -
> +	      (list_is_singular(&sem->wait_list) ? RWSEM_FLAG_WAITERS : 0);
>  
>  	if (atomic_long_try_cmpxchg_acquire(&sem->count, &count, new)) {
>  		rwsem_set_owner(sem);
>  		return true;
>  	}
>  
> +	if (unlikely((wstate == WRITER_HANDOFF) && !RWSEM_COUNT_HANDOFF(count)))
> +		goto retry;
> +
>  	return false;
>  }

This function gives me heartburn. Don't you just feel something readable
trying to struggle free from that?

See, if you first write that function in the form:

	long new;

	do {
		new = count | RWSEM_WRITER_LOCKED;

		if (count & RWSEM_LOCK_MASK)
			return false;

		if (list_is_singular(&sem->wait_list))
			new &= ~RWSEM_FLAG_WAITERS;

	} while (atomic_long_try_cmpxchg_acquire(&sem->count, &count, new));

	rwsem_set_owner(sem);
	return true;

And then add the HANDOFF bits like:

	long new;

	do {
+		bool has_handoff = !!(count & RWSEM_FLAG_HANDOFF);

+		new = (count | RWSEM_WRITER_LOCKED) & ~RWSEM_FLAG_HANDOFF;

		if (count & RWSEM_LOCK_MASK) {
+			if (has_handoff && wstate != WRITER_HANDOFF)
+				return false;
			new |= RWSEM_FLAG_HANDOFF;
		}

+		if (has_handoff && wstate == WRITER_NOT_FIRST)
+			return false;

		if (list_is_singular(&sem->wait_list))
			new &= ~RWSEM_FLAG_WAITERS;

	} while (atomic_long_try_cmpxchg_acquire(&sem->count, &count, new));

	rwsem_set_owner(sem);
	return true;

it almost looks like sensible code.

>  
> @@ -359,7 +425,7 @@ static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem)
>  {
>  	long count = atomic_long_read(&sem->count);
>  
> -	while (!RWSEM_COUNT_LOCKED(count)) {
> +	while (!RWSEM_COUNT_LOCKED_OR_HANDOFF(count)) {
>  		if (atomic_long_try_cmpxchg_acquire(&sem->count, &count,
>  					count + RWSEM_WRITER_LOCKED)) {

RWSEM_WRITER_LOCKED really should be RWSEM_FLAG_WRITER or something like
that, and since it is a flag, that really should've been | not +.

>  			rwsem_set_owner(sem);
> @@ -498,6 +564,16 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
>  }
>  #endif
>  
> +/*
> + * This is safe to be called without holding the wait_lock.
> + */
> +static inline bool
> +rwsem_waiter_is_first(struct rw_semaphore *sem, struct rwsem_waiter *waiter)
> +{
> +	return list_first_entry(&sem->wait_list, struct rwsem_waiter, list)
> +			== waiter;

Just bust the line limit on that, this is silly. If you feel strongly
about the 80 char thing, we could do:

#define rwsem_first_waiter(sem) \
	list_first_entry(&sem->wait_list, struct rwsem_waiter, list)

and use that in both locations. (and one could even write the
list_for_each_entry_safe() loop in the form:

	while (!list_empty(&sem->wait_list)) {
		entry = rwsem_first_waiter(sem);

		...

		list_del();

		...
	}

Although I suppose that gets you confused later on where you want to
wake more readers still... I'll get there,.. eventually.

> +}
> +
>  /*
>   * Wait for the read lock to be granted
>   */
> @@ -510,16 +586,18 @@ __rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
>  
>  	waiter.task = current;
>  	waiter.type = RWSEM_WAITING_FOR_READ;
> +	waiter.timeout = jiffies + RWSEM_WAIT_TIMEOUT;
>  
>  	raw_spin_lock_irq(&sem->wait_lock);
>  	if (list_empty(&sem->wait_list)) {
>  		/*
>  		 * In case the wait queue is empty and the lock isn't owned
> +		 * by a writer or has the handoff bit set, this reader can
> +		 * exit the slowpath and return immediately as its
> +		 * RWSEM_READER_BIAS has already been set in the count.
>  		 */
> +		if (!(atomic_long_read(&sem->count) &
> +		     (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
>  			raw_spin_unlock_irq(&sem->wait_lock);
>  			rwsem_set_reader_owned(sem);
>  			lockevent_inc(rwsem_rlock_fast);
> @@ -567,7 +645,8 @@ __rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
>  out_nolock:
>  	list_del(&waiter.list);
>  	if (list_empty(&sem->wait_list))
> +		atomic_long_andnot(RWSEM_FLAG_WAITERS|RWSEM_FLAG_HANDOFF,
> +				   &sem->count);

If you split the line, this wants { }.

>  	raw_spin_unlock_irq(&sem->wait_lock);
>  	__set_current_state(TASK_RUNNING);
>  	lockevent_inc(rwsem_rlock_fail);
> @@ -593,7 +672,7 @@ static inline struct rw_semaphore *
>  __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
>  {
>  	long count;
> +	enum writer_wait_state wstate;
>  	struct rwsem_waiter waiter;
>  	struct rw_semaphore *ret = sem;
>  	DEFINE_WAKE_Q(wake_q);
> @@ -608,56 +687,63 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
>  	 */
>  	waiter.task = current;
>  	waiter.type = RWSEM_WAITING_FOR_WRITE;
> +	waiter.timeout = jiffies + RWSEM_WAIT_TIMEOUT;
>  
>  	raw_spin_lock_irq(&sem->wait_lock);
>  
>  	/* account for this before adding a new element to the list */
> +	wstate = list_empty(&sem->wait_list) ? WRITER_FIRST : WRITER_NOT_FIRST;
>  
>  	list_add_tail(&waiter.list, &sem->wait_list);
>  
>  	/* we're now waiting on the lock */
> +	if (wstate == WRITER_NOT_FIRST) {
>  		count = atomic_long_read(&sem->count);
>  
>  		/*
> +		 * If there were already threads queued before us and:
> +		 *  1) there are no no active locks, wake the front
> +		 *     queued process(es) as the handoff bit might be set.
> +		 *  2) there are no active writers and some readers, the lock
> +		 *     must be read owned; so we try to wake any read lock
> +		 *     waiters that were queued ahead of us.
>  		 */
> +		if (!RWSEM_COUNT_LOCKED(count))
> +			__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
> +		else if (!(count & RWSEM_WRITER_MASK) &&
> +			  (count & RWSEM_READER_MASK))
>  			__rwsem_mark_wake(sem, RWSEM_WAKE_READERS, &wake_q);

That RWSEM_WRITER_MASK is another layer of obfuscation we can do
without.

Does the above want to be something like:

		if (!(count & RWSEM_WRITER_LOCKED)) {
			__rwsem_mark_wake(sem, (count & RWSEM_READER_MASK) ?
					       RWSEM_WAKE_READERS :
					       RWSEM_WAKE_ANY, &wake_q);
		}

> +		else
> +			goto wait;
>  
> +		/*
> +		 * The wakeup is normally called _after_ the wait_lock
> +		 * is released, but given that we are proactively waking
> +		 * readers we can deal with the wake_q overhead as it is
> +		 * similar to releasing and taking the wait_lock again
> +		 * for attempting rwsem_try_write_lock().
> +		 */
> +		wake_up_q(&wake_q);

Hurmph.. the reason we do wake_up_q() outside of wait_lock is such that
those tasks don't bounce on wait_lock. Also, it removes a great deal of
hold-time from wait_lock.

So I'm not sure I buy your argument here.

> +		/*
> +		 * Reinitialize wake_q after use.
> +		 */

Or:
		/* we need wake_q again below, reinitialize */

> +		wake_q_init(&wake_q);
>  	} else {
>  		count = atomic_long_add_return(RWSEM_FLAG_WAITERS, &sem->count);
>  	}
>  
> +wait:
>  	/* wait until we successfully acquire the lock */
>  	set_current_state(state);
>  	while (true) {
> +		if (rwsem_try_write_lock(count, sem, wstate))
>  			break;
> +
>  		raw_spin_unlock_irq(&sem->wait_lock);
>  
>  		/* Block until there are no active lockers. */
> +		for (;;) {
>  			if (signal_pending_state(state, current))
>  				goto out_nolock;
>  
> @@ -665,9 +751,34 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
>  			lockevent_inc(rwsem_sleep_writer);
>  			set_current_state(state);
>  			count = atomic_long_read(&sem->count);
> +
> +			if ((wstate == WRITER_NOT_FIRST) &&
> +			    rwsem_waiter_is_first(sem, &waiter))
> +				wstate = WRITER_FIRST;
> +
> +			if (!RWSEM_COUNT_LOCKED(count))
> +				break;
> +
> +			/*
> +			 * An RT task sets the HANDOFF bit immediately.
> +			 * Non-RT task will wait a while before doing so.

Again, this describes what we already read the code to do; but doesn't
add anything.

> +			 *
> +			 * The setting of the handoff bit is deferred
> +			 * until rwsem_try_write_lock() is called.
> +			 */
> +			if ((wstate == WRITER_FIRST) && (rt_task(current) ||
> +			    time_after(jiffies, waiter.timeout))) {
> +				wstate = WRITER_HANDOFF;
> +				lockevent_inc(rwsem_wlock_handoff);
> +				/*
> +				 * Break out to call rwsem_try_write_lock().
> +				 */

Another exceedingly useful comment.

> +				break;
> +			}
> +		}
>  
>  		raw_spin_lock_irq(&sem->wait_lock);
> +		count = atomic_long_read(&sem->count);
>  	}
>  	__set_current_state(TASK_RUNNING);
>  	list_del(&waiter.list);
> @@ -680,6 +791,12 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
>  	__set_current_state(TASK_RUNNING);
>  	raw_spin_lock_irq(&sem->wait_lock);
>  	list_del(&waiter.list);
> +	/*
> +	 * If handoff bit has been set by this waiter, make sure that the
> +	 * clearing of it is seen by others before proceeding.
> +	 */
> +	if (unlikely(wstate == WRITER_HANDOFF))
> +		atomic_long_add_return(-RWSEM_FLAG_HANDOFF,  &sem->count);

_AGAIN_ no explanation what so ff'ing ever.

And why add_return() if you ignore the return value.

>  	if (list_empty(&sem->wait_list))
>  		atomic_long_andnot(RWSEM_FLAG_WAITERS, &sem->count);

And you could've easily combined the two flags in a single andnot op.

>  	else

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 06/16] locking/rwsem: Code cleanup after files merging
  2019-04-13 17:22 ` [PATCH v4 06/16] locking/rwsem: Code cleanup after files merging Waiman Long
@ 2019-04-16 16:01   ` Peter Zijlstra
  2019-04-16 16:17     ` Peter Zijlstra
  0 siblings, 1 reply; 112+ messages in thread
From: Peter Zijlstra @ 2019-04-16 16:01 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying


More cleanups..

---
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -303,7 +303,7 @@ static void __rwsem_mark_wake(struct rw_
 		list_del(&waiter->list);
 		/*
 		 * Ensure calling get_task_struct() before setting the reader
-		 * waiter to nil such that rwsem_down_read_failed() cannot
+		 * waiter to nil such that rwsem_down_read_slow() cannot
 		 * race with do_exit() by always holding a reference count
 		 * to the task to wakeup.
 		 */
@@ -500,7 +500,7 @@ static bool rwsem_optimistic_spin(struct
  * Wait for the read lock to be granted
  */
 static inline struct rw_semaphore __sched *
-__rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
+rwsem_down_read_slow(struct rw_semaphore *sem, int state)
 {
 	long count, adjustment = -RWSEM_READER_BIAS;
 	struct rwsem_waiter waiter;
@@ -572,23 +572,11 @@ __rwsem_down_read_failed_common(struct r
 	return ERR_PTR(-EINTR);
 }
 
-static inline struct rw_semaphore * __sched
-rwsem_down_read_failed(struct rw_semaphore *sem)
-{
-	return __rwsem_down_read_failed_common(sem, TASK_UNINTERRUPTIBLE);
-}
-
-static inline struct rw_semaphore * __sched
-rwsem_down_read_failed_killable(struct rw_semaphore *sem)
-{
-	return __rwsem_down_read_failed_common(sem, TASK_KILLABLE);
-}
-
 /*
  * Wait until we successfully acquire the write lock
  */
 static inline struct rw_semaphore *
-__rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
+rwsem_down_write_slow(struct rw_semaphore *sem, int state)
 {
 	long count;
 	bool waiting = true; /* any queued threads before us */
@@ -689,18 +677,6 @@ __rwsem_down_write_failed_common(struct
 	return ERR_PTR(-EINTR);
 }
 
-static inline struct rw_semaphore * __sched
-rwsem_down_write_failed(struct rw_semaphore *sem)
-{
-	return __rwsem_down_write_failed_common(sem, TASK_UNINTERRUPTIBLE);
-}
-
-static inline struct rw_semaphore * __sched
-rwsem_down_write_failed_killable(struct rw_semaphore *sem)
-{
-	return __rwsem_down_write_failed_common(sem, TASK_KILLABLE);
-}
-
 /*
  * handle waking up a waiter on the semaphore
  * - up_read/up_write has decremented the active part of count if we come here
@@ -749,7 +725,7 @@ inline void __down_read(struct rw_semaph
 {
 	if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
 			&sem->count) & RWSEM_READ_FAILED_MASK)) {
-		rwsem_down_read_failed(sem);
+		rwsem_down_read_slow(sem, TASK_UNINTERRUPTIBLE);
 		DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
 					RWSEM_READER_OWNED), sem);
 	} else {
@@ -761,7 +737,7 @@ static inline int __down_read_killable(s
 {
 	if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
 			&sem->count) & RWSEM_READ_FAILED_MASK)) {
-		if (IS_ERR(rwsem_down_read_failed_killable(sem)))
+		if (IS_ERR(rwsem_down_read_slow(sem, TASK_KILLABLE)))
 			return -EINTR;
 		DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
 					RWSEM_READER_OWNED), sem);
@@ -794,34 +770,38 @@ static inline int __down_read_trylock(st
  */
 static inline void __down_write(struct rw_semaphore *sem)
 {
-	if (unlikely(atomic_long_cmpxchg_acquire(&sem->count, 0,
-						 RWSEM_WRITER_LOCKED)))
-		rwsem_down_write_failed(sem);
+	long tmp = RWSEM_UNLOCKED_VALUE;
+
+	if (unlikely(atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
+						     RWSEM_WRITER_LOCKED)))
+		rwsem_down_write_slow(sem, TASK_UNINTERRUPTIBLE);
 	rwsem_set_owner(sem);
 }
 
 static inline int __down_write_killable(struct rw_semaphore *sem)
 {
-	if (unlikely(atomic_long_cmpxchg_acquire(&sem->count, 0,
-						 RWSEM_WRITER_LOCKED)))
-		if (IS_ERR(rwsem_down_write_failed_killable(sem)))
+	long tmp = RWSEM_UNLOCKED_VALUE;
+
+	if (unlikely(atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
+						     RWSEM_WRITER_LOCKED))) {
+		if (IS_ERR(rwsem_down_write_slow(sem, TASK_KILLABLE)))
 			return -EINTR;
+	}
 	rwsem_set_owner(sem);
 	return 0;
 }
 
 static inline int __down_write_trylock(struct rw_semaphore *sem)
 {
-	long tmp;
+	long tmp = RWSEM_UNLOCKED_VALUE;
 
 	lockevent_inc(rwsem_wtrylock);
-	tmp = atomic_long_cmpxchg_acquire(&sem->count, RWSEM_UNLOCKED_VALUE,
-					  RWSEM_WRITER_LOCKED);
-	if (tmp == RWSEM_UNLOCKED_VALUE) {
-		rwsem_set_owner(sem);
-		return true;
-	}
-	return false;
+	if (!atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
+					     RWSEM_WRITER_LOCKED))
+		return false;
+
+	rwsem_set_owner(sem);
+	return true;
 }
 
 /*
@@ -831,12 +811,11 @@ inline void __up_read(struct rw_semaphor
 {
 	long tmp;
 
-	DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner & RWSEM_READER_OWNED),
-				sem);
+	DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner & RWSEM_READER_OWNED), sem);
 	rwsem_clear_reader_owned(sem);
 	tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
-	if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS))
-			== RWSEM_FLAG_WAITERS))
+	if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS)) ==
+		     RWSEM_FLAG_WAITERS))
 		rwsem_wake(sem);
 }
 
@@ -848,7 +827,7 @@ static inline void __up_write(struct rw_
 	DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
 	rwsem_clear_owner(sem);
 	if (unlikely(atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED,
-			&sem->count) & RWSEM_FLAG_WAITERS))
+						   &sem->count) & RWSEM_FLAG_WAITERS))
 		rwsem_wake(sem);
 }
 

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 07/16] locking/rwsem: Implement lock handoff to prevent lock starvation
  2019-04-16 15:49   ` Peter Zijlstra
@ 2019-04-16 16:15     ` Peter Zijlstra
  2019-04-16 18:41       ` Waiman Long
  2019-04-16 18:16     ` Waiman Long
  1 sibling, 1 reply; 112+ messages in thread
From: Peter Zijlstra @ 2019-04-16 16:15 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On Tue, Apr 16, 2019 at 05:49:37PM +0200, Peter Zijlstra wrote:
> See, if you first write that function in the form:
> 
> 	long new;
> 
> 	do {
> 		new = count | RWSEM_WRITER_LOCKED;
> 
> 		if (count & RWSEM_LOCK_MASK)
> 			return false;
> 
> 		if (list_is_singular(&sem->wait_list))
> 			new &= ~RWSEM_FLAG_WAITERS;
> 
> 	} while (atomic_long_try_cmpxchg_acquire(&sem->count, &count, new));
> 
> 	rwsem_set_owner(sem);
> 	return true;
> 
> And then add the HANDOFF bits like:
> 
> 	long new;
> 
> 	do {
> +		bool has_handoff = !!(count & RWSEM_FLAG_HANDOFF);
> 
> +		new = (count | RWSEM_WRITER_LOCKED) & ~RWSEM_FLAG_HANDOFF;
> 
> 		if (count & RWSEM_LOCK_MASK) {
> +			if (has_handoff && wstate != WRITER_HANDOFF)
> +				return false;
> 			new |= RWSEM_FLAG_HANDOFF;
> 		}
> 
> +		if (has_handoff && wstate == WRITER_NOT_FIRST)
> +			return false;
> 
> 		if (list_is_singular(&sem->wait_list))
> 			new &= ~RWSEM_FLAG_WAITERS;
> 
> 	} while (atomic_long_try_cmpxchg_acquire(&sem->count, &count, new));

obviously that should be !

> 
> 	rwsem_set_owner(sem);
> 	return true;
> 
> it almost looks like sensible code.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 06/16] locking/rwsem: Code cleanup after files merging
  2019-04-16 16:01   ` Peter Zijlstra
@ 2019-04-16 16:17     ` Peter Zijlstra
  2019-04-16 19:45       ` Waiman Long
  0 siblings, 1 reply; 112+ messages in thread
From: Peter Zijlstra @ 2019-04-16 16:17 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On Tue, Apr 16, 2019 at 06:01:13PM +0200, Peter Zijlstra wrote:
> @@ -794,34 +770,38 @@ static inline int __down_read_trylock(st
>   */
>  static inline void __down_write(struct rw_semaphore *sem)
>  {
> -	if (unlikely(atomic_long_cmpxchg_acquire(&sem->count, 0,
> -						 RWSEM_WRITER_LOCKED)))
> -		rwsem_down_write_failed(sem);
> +	long tmp = RWSEM_UNLOCKED_VALUE;
> +
> +	if (unlikely(atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
> +						     RWSEM_WRITER_LOCKED)))

!

> +		rwsem_down_write_slow(sem, TASK_UNINTERRUPTIBLE);
>  	rwsem_set_owner(sem);
>  }
>  
>  static inline int __down_write_killable(struct rw_semaphore *sem)
>  {
> -	if (unlikely(atomic_long_cmpxchg_acquire(&sem->count, 0,
> -						 RWSEM_WRITER_LOCKED)))
> -		if (IS_ERR(rwsem_down_write_failed_killable(sem)))
> +	long tmp = RWSEM_UNLOCKED_VALUE;
> +
> +	if (unlikely(atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
> +						     RWSEM_WRITER_LOCKED))) {

also !

> +		if (IS_ERR(rwsem_down_write_slow(sem, TASK_KILLABLE)))
>  			return -EINTR;
> +	}
>  	rwsem_set_owner(sem);
>  	return 0;
>  }

I'm having a great day it seems, it's like back in uni, trying to find
all the missing - signs in this page-long DE.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 10/16] locking/rwsem: Wake up almost all readers in wait queue
  2019-04-13 17:22 ` [PATCH v4 10/16] locking/rwsem: Wake up almost all readers in wait queue Waiman Long
@ 2019-04-16 16:50   ` Davidlohr Bueso
  2019-04-16 17:37     ` Waiman Long
  2019-04-17 13:39   ` Peter Zijlstra
  1 sibling, 1 reply; 112+ messages in thread
From: Davidlohr Bueso @ 2019-04-16 16:50 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	linux-kernel, x86, Linus Torvalds, Tim Chen, huang ying

On Sat, 13 Apr 2019, Waiman Long wrote:
>+/*
>+ * We limit the maximum number of readers that can be woken up for a
>+ * wake-up call to not penalizing the waking thread for spending too
>+ * much time doing it.
>+ */
>+#define MAX_READERS_WAKEUP	0x100

Although with wake_q this is not really so... Could it at least be
rewritten, dunno something like so:

/*
 * Magic number to batch-wakeup waiting readers, even when writers
 * are also present in the queue. This both limits the amount of
 * work the waking thread must do (albeit wake_q)  and also prevents
 * any potential counter overflow, however unlikely.
 */

I'm still not crazy about this artificial limit for the readers-only
case, but won't argue. I certainly like the reader/writer case.

Thanks,
Davidlohr

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 10/16] locking/rwsem: Wake up almost all readers in wait queue
  2019-04-16 16:50   ` Davidlohr Bueso
@ 2019-04-16 17:37     ` Waiman Long
  0 siblings, 0 replies; 112+ messages in thread
From: Waiman Long @ 2019-04-16 17:37 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	linux-kernel, x86, Linus Torvalds, Tim Chen, huang ying

On 04/16/2019 12:50 PM, Davidlohr Bueso wrote:
> On Sat, 13 Apr 2019, Waiman Long wrote:
>> +/*
>> + * We limit the maximum number of readers that can be woken up for a
>> + * wake-up call to not penalizing the waking thread for spending too
>> + * much time doing it.
>> + */
>> +#define MAX_READERS_WAKEUP    0x100
>
> Although with wake_q this is not really so... Could it at least be
> rewritten, dunno something like so:
>
> /*
> * Magic number to batch-wakeup waiting readers, even when writers
> * are also present in the queue. This both limits the amount of
> * work the waking thread must do (albeit wake_q)  and also prevents
> * any potential counter overflow, however unlikely.
> */
>

The wording looks good to me. Will modify that for the next version.
BTW, wake_q_add() has low overhead and so the lock hold time should be
short. Outside the wait_lock, wake_up_q() still has a high overhead if
there are many tasks to be woken up.
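
For reference, the batching pattern under discussion looks roughly like
this (a simplified sketch of the generic wake_q usage, not code from
this series):

	DEFINE_WAKE_Q(wake_q);
	struct rwsem_waiter *waiter;

	raw_spin_lock_irq(&sem->wait_lock);
	list_for_each_entry(waiter, &sem->wait_list, list)
		wake_q_add(&wake_q, waiter->task);	/* cheap: only queues the task */
	raw_spin_unlock_irq(&sem->wait_lock);

	wake_up_q(&wake_q);	/* the expensive wakeups happen outside wait_lock */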

Cheers,
Longman

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 07/16] locking/rwsem: Implement lock handoff to prevent lock starvation
  2019-04-16 15:49   ` Peter Zijlstra
  2019-04-16 16:15     ` Peter Zijlstra
@ 2019-04-16 18:16     ` Waiman Long
  2019-04-16 18:32       ` Peter Zijlstra
                         ` (2 more replies)
  1 sibling, 3 replies; 112+ messages in thread
From: Waiman Long @ 2019-04-16 18:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On 04/16/2019 11:49 AM, Peter Zijlstra wrote:
> On Sat, Apr 13, 2019 at 01:22:50PM -0400, Waiman Long wrote:
>
>> +#define RWSEM_COUNT_HANDOFF(c)	((c) & RWSEM_FLAG_HANDOFF)
>> +#define RWSEM_COUNT_LOCKED_OR_HANDOFF(c)	\
>> +	((c) & (RWSEM_LOCK_MASK|RWSEM_FLAG_HANDOFF))
> Like said before, I also made these go away.

Yes, my refactored patches will remove all those trivial macros.

>
>> @@ -245,6 +274,8 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
>>  	struct rwsem_waiter *waiter, *tmp;
>>  	long oldcount, woken = 0, adjustment = 0;
>>  
>> +	lockdep_assert_held(&sem->wait_lock);
>> +
>>  	/*
>>  	 * Take a peek at the queue head waiter such that we can determine
>>  	 * the wakeup(s) to perform.
>> @@ -276,6 +307,15 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
>>  		adjustment = RWSEM_READER_BIAS;
>>  		oldcount = atomic_long_fetch_add(adjustment, &sem->count);
>>  		if (unlikely(oldcount & RWSEM_WRITER_MASK)) {
>> +			/*
>> +			 * Initiate handoff to reader, if applicable.
>> +			 */
>> +			if (!(oldcount & RWSEM_FLAG_HANDOFF) &&
>> +			    time_after(jiffies, waiter->timeout)) {
>> +				adjustment -= RWSEM_FLAG_HANDOFF;
>> +				lockevent_inc(rwsem_rlock_handoff);
>> +			}
> 			/*
> 			 * When we've been waiting 'too' long (for
> 			 * writers to give up the lock) request a
> 			 * HANDOFF to force the issue.
> 			 */
>
> ?

Sure.

>
>> +
>>  			atomic_long_sub(adjustment, &sem->count);
> Can we change this to: atomic_long_add() please? The below loop that
> wakes all remaining readers does use add(), so it is a bit 'weird' to
> have the adjustment being negated on handover.
>
>>  			return;
>>  		}
>> @@ -324,6 +364,12 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
>>  		adjustment -= RWSEM_FLAG_WAITERS;
>>  	}
>>  
>> +	/*
>> +	 * Clear the handoff flag
>> +	 */
> Right, but that is a trivial comment in the 'increment i' style, it
> clearly states what the code does, but completely fails to elucidate the
> code.
>
> Maybe:
>
> 	/*
> 	 * When we've woken a reader, we no longer need to force writers
> 	 * to give up the lock and we can clear HANDOFF.
> 	 */
>
> And I suppose this is required if we were the pickup of the handoff set
> above, but is there a guarantee that the HANDOFF was not set by a
> writer?

I can change the comment. The handoff bit is always cleared in
rwsem_try_write_lock() when the lock is successfully acquired. Will add a
comment to document that.

>
>> +	if (woken && RWSEM_COUNT_HANDOFF(atomic_long_read(&sem->count)))
>> +		adjustment -= RWSEM_FLAG_HANDOFF;
>> +
>>  	if (adjustment)
>>  		atomic_long_add(adjustment, &sem->count);
>>  }
>> @@ -332,22 +378,42 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
>>   * This function must be called with the sem->wait_lock held to prevent
>>   * race conditions between checking the rwsem wait list and setting the
>>   * sem->count accordingly.
>> + *
>> + * If wstate is WRITER_HANDOFF, it will make sure that either the handoff
>> + * bit is set or the lock is acquired.
>>   */
>> +static inline bool rwsem_try_write_lock(long count, struct rw_semaphore *sem,
>> +					enum writer_wait_state wstate)
>>  {
>>  	long new;
>>  
> 	lockdep_assert_held(&sem->wait_lock);

Sure.

>
>> +retry:
>> +	if (RWSEM_COUNT_LOCKED(count)) {
>> +		if (RWSEM_COUNT_HANDOFF(count) || (wstate != WRITER_HANDOFF))
>> +			return false;
>> +		/*
>> +		 * The lock may become free just before setting handoff bit.
>> +		 * It will be simpler if atomic_long_or_return() is available.
>> +		 */
>> +		atomic_long_or(RWSEM_FLAG_HANDOFF, &sem->count);
>> +		count = atomic_long_read(&sem->count);
>> +		goto retry;
>> +	}
>> +
>> +	if ((wstate == WRITER_NOT_FIRST) && RWSEM_COUNT_HANDOFF(count))
>>  		return false;
>>  
>> +	new = (count & ~RWSEM_FLAG_HANDOFF) + RWSEM_WRITER_LOCKED -
>> +	      (list_is_singular(&sem->wait_list) ? RWSEM_FLAG_WAITERS : 0);
>>  
>>  	if (atomic_long_try_cmpxchg_acquire(&sem->count, &count, new)) {
>>  		rwsem_set_owner(sem);
>>  		return true;
>>  	}
>>  
>> +	if (unlikely((wstate == WRITER_HANDOFF) && !RWSEM_COUNT_HANDOFF(count)))
>> +		goto retry;
>> +
>>  	return false;
>>  }
> This function gives me heartburn. Don't you just feel something readable
> trying to struggle free from that?
>
> See, if you first write that function in the form:
>
> 	long new;
>
> 	do {
> 		new = count | RWSEM_WRITER_LOCKED;
>
> 		if (count & RWSEM_LOCK_MASK)
> 			return false;
>
> 		if (list_is_singular(&sem->wait_list))
> 			new &= ~RWSEM_FLAG_WAITERS;
>
> 	} while (atomic_long_try_cmpxchg_acquire(&sem->count, &count, new));
>
> 	rwsem_set_owner(sem);
> 	return true;
>
> And then add the HANDOFF bits like:
>
> 	long new;
>
> 	do {
> +		bool has_handoff = !!(count & RWSEM_FLAG_HANDOFF);
>
> +		new = (count | RWSEM_WRITER_LOCKED) & ~RWSEM_FLAG_HANDOFF;
>
> 		if (count & RWSEM_LOCK_MASK) {
> +			if (has_handoff && wstate != WRITER_HANDOFF)
> +				return false;
> 			new |= RWSEM_FLAG_HANDOFF;
> 		}
>
> +		if (has_handoff && wstate == WRITER_NOT_FIRST)
> +			return false;
>
> 		if (list_is_singular(&sem->wait_list))
> 			new &= ~RWSEM_FLAG_WAITERS;
>
> 	} while (atomic_long_try_cmpxchg_acquire(&sem->count, &count, new));
>
> 	rwsem_set_owner(sem);
> 	return true;
>
> it almost looks like sensible code.

Yes, it looks much better. I don't like that piece of code myself. I am
sorry that I didn't spend the time to make the code more sane.

Thanks for your suggestion. Will modify it accordingly.

>>  
>> @@ -359,7 +425,7 @@ static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem)
>>  {
>>  	long count = atomic_long_read(&sem->count);
>>  
>> -	while (!RWSEM_COUNT_LOCKED(count)) {
>> +	while (!RWSEM_COUNT_LOCKED_OR_HANDOFF(count)) {
>>  		if (atomic_long_try_cmpxchg_acquire(&sem->count, &count,
>>  					count + RWSEM_WRITER_LOCKED)) {
> RWSEM_WRITER_LOCKED really should be RWSEM_FLAG_WRITER or something like
> that, and since it is a flag, that really should've been | not +.

Sure.

>>  			rwsem_set_owner(sem);
>> @@ -498,6 +564,16 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
>>  }
>>  #endif
>>  
>> +/*
>> + * This is safe to be called without holding the wait_lock.
>> + */
>> +static inline bool
>> +rwsem_waiter_is_first(struct rw_semaphore *sem, struct rwsem_waiter *waiter)
>> +{
>> +	return list_first_entry(&sem->wait_list, struct rwsem_waiter, list)
>> +			== waiter;
> Just bust the line limit on that, this is silly. If you feel strongly
> about the 80 char thing, we could do:
>
> #define rwsem_first_waiter(sem) \
> 	list_first_entry(&sem->wait_list, struct rwsem_waiter, list)
>
> and use that in both locations. (and one could even write the
> list_for_each_entry_safe() loop in the form:
>
> 	while (!list_empty(&sem->wait_list)) {
> 		entry = rwsem_first_waiter(sem);
>
> 		...
>
> 		list_del();
>
> 		...
> 	}
>
> Although I suppose that gets you confused later on where you want to
> wake more readers still... I'll get there,.. eventually.

Yes, it is a good idea.

>> +}
>> +
>>  /*
>>   * Wait for the read lock to be granted
>>   */
>> @@ -510,16 +586,18 @@ __rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
>>  
>>  	waiter.task = current;
>>  	waiter.type = RWSEM_WAITING_FOR_READ;
>> +	waiter.timeout = jiffies + RWSEM_WAIT_TIMEOUT;
>>  
>>  	raw_spin_lock_irq(&sem->wait_lock);
>>  	if (list_empty(&sem->wait_list)) {
>>  		/*
>>  		 * In case the wait queue is empty and the lock isn't owned
>> +		 * by a writer or has the handoff bit set, this reader can
>> +		 * exit the slowpath and return immediately as its
>> +		 * RWSEM_READER_BIAS has already been set in the count.
>>  		 */
>> +		if (!(atomic_long_read(&sem->count) &
>> +		     (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
>>  			raw_spin_unlock_irq(&sem->wait_lock);
>>  			rwsem_set_reader_owned(sem);
>>  			lockevent_inc(rwsem_rlock_fast);
>> @@ -567,7 +645,8 @@ __rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
>>  out_nolock:
>>  	list_del(&waiter.list);
>>  	if (list_empty(&sem->wait_list))
>> +		atomic_long_andnot(RWSEM_FLAG_WAITERS|RWSEM_FLAG_HANDOFF,
>> +				   &sem->count);
> If you split the line, this wants { }.

OK.

>>  	raw_spin_unlock_irq(&sem->wait_lock);
>>  	__set_current_state(TASK_RUNNING);
>>  	lockevent_inc(rwsem_rlock_fail);
>> @@ -593,7 +672,7 @@ static inline struct rw_semaphore *
>>  __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
>>  {
>>  	long count;
>> +	enum writer_wait_state wstate;
>>  	struct rwsem_waiter waiter;
>>  	struct rw_semaphore *ret = sem;
>>  	DEFINE_WAKE_Q(wake_q);
>> @@ -608,56 +687,63 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
>>  	 */
>>  	waiter.task = current;
>>  	waiter.type = RWSEM_WAITING_FOR_WRITE;
>> +	waiter.timeout = jiffies + RWSEM_WAIT_TIMEOUT;
>>  
>>  	raw_spin_lock_irq(&sem->wait_lock);
>>  
>>  	/* account for this before adding a new element to the list */
>> +	wstate = list_empty(&sem->wait_list) ? WRITER_FIRST : WRITER_NOT_FIRST;
>>  
>>  	list_add_tail(&waiter.list, &sem->wait_list);
>>  
>>  	/* we're now waiting on the lock */
>> +	if (wstate == WRITER_NOT_FIRST) {
>>  		count = atomic_long_read(&sem->count);
>>  
>>  		/*
>> +		 * If there were already threads queued before us and:
>> +		 *  1) there are no active locks, wake the front
>> +		 *     queued process(es) as the handoff bit might be set.
>> +		 *  2) there are no active writers and some readers, the lock
>> +		 *     must be read owned; so we try to wake any read lock
>> +		 *     waiters that were queued ahead of us.
>>  		 */
>> +		if (!RWSEM_COUNT_LOCKED(count))
>> +			__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
>> +		else if (!(count & RWSEM_WRITER_MASK) &&
>> +			  (count & RWSEM_READER_MASK))
>>  			__rwsem_mark_wake(sem, RWSEM_WAKE_READERS, &wake_q);
> That RWSEM_WRITER_MASK is another layer of obfustaction we can do
> without.

The RWSEM_WRITER_MASK macro is added to prepare for the later patch that
merges the owner into the count, where RWSEM_WRITER_LOCK will be different.

> Does the above want to be something like:
>
> 		if (!(count & RWSEM_WRITER_LOCKED)) {
> 			__rwsem_mark_wake(sem, (count & RWSEM_READER_MASK) ?
> 					       RWSEM_WAKE_READERS :
> 					       RWSEM_WAKE_ANY, &wake_q);
> 		}

Yes.

>> +		else
>> +			goto wait;
>>  
>> +		/*
>> +		 * The wakeup is normally called _after_ the wait_lock
>> +		 * is released, but given that we are proactively waking
>> +		 * readers we can deal with the wake_q overhead as it is
>> +		 * similar to releasing and taking the wait_lock again
>> +		 * for attempting rwsem_try_write_lock().
>> +		 */
>> +		wake_up_q(&wake_q);
> Hurmph.. the reason we do wake_up_q() outside of wait_lock is such that
> those tasks don't bounce on wait_lock. Also, it removes a great deal of
> hold-time from wait_lock.
>
> So I'm not sure I buy your argument here.
>

Actually, we don't want to release the wait_lock, do wake_up_q() and
acquire the wait_lock again as the state would have been changed. I
didn't change the comment on this patch, but will reword it to discuss that.

>> +		/*
>> +		 * Reinitialize wake_q after use.
>> +		 */
> Or:
> 		/* we need wake_q again below, reinitialize */
>

Sure.

>> +		wake_q_init(&wake_q);
>>  	} else {
>>  		count = atomic_long_add_return(RWSEM_FLAG_WAITERS, &sem->count);
>>  	}
>>  
>> +wait:
>>  	/* wait until we successfully acquire the lock */
>>  	set_current_state(state);
>>  	while (true) {
>> +		if (rwsem_try_write_lock(count, sem, wstate))
>>  			break;
>> +
>>  		raw_spin_unlock_irq(&sem->wait_lock);
>>  
>>  		/* Block until there are no active lockers. */
>> +		for (;;) {
>>  			if (signal_pending_state(state, current))
>>  				goto out_nolock;
>>  
>> @@ -665,9 +751,34 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
>>  			lockevent_inc(rwsem_sleep_writer);
>>  			set_current_state(state);
>>  			count = atomic_long_read(&sem->count);
>> +
>> +			if ((wstate == WRITER_NOT_FIRST) &&
>> +			    rwsem_waiter_is_first(sem, &waiter))
>> +				wstate = WRITER_FIRST;
>> +
>> +			if (!RWSEM_COUNT_LOCKED(count))
>> +				break;
>> +
>> +			/*
>> +			 * An RT task sets the HANDOFF bit immediately.
>> +			 * Non-RT task will wait a while before doing so.
> Again, this describes what we already read the code to do; but doesn't
> add anything.

Will remove that.

>> +			 *
>> +			 * The setting of the handoff bit is deferred
>> +			 * until rwsem_try_write_lock() is called.
>> +			 */
>> +			if ((wstate == WRITER_FIRST) && (rt_task(current) ||
>> +			    time_after(jiffies, waiter.timeout))) {
>> +				wstate = WRITER_HANDOFF;
>> +				lockevent_inc(rwsem_wlock_handoff);
>> +				/*
>> +				 * Break out to call rwsem_try_write_lock().
>> +				 */
> Another exceedingly useful comment.
>
>> +				break;
>> +			}
>> +		}
>>  
>>  		raw_spin_lock_irq(&sem->wait_lock);
>> +		count = atomic_long_read(&sem->count);
>>  	}
>>  	__set_current_state(TASK_RUNNING);
>>  	list_del(&waiter.list);
>> @@ -680,6 +791,12 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
>>  	__set_current_state(TASK_RUNNING);
>>  	raw_spin_lock_irq(&sem->wait_lock);
>>  	list_del(&waiter.list);
>> +	/*
>> +	 * If handoff bit has been set by this waiter, make sure that the
>> +	 * clearing of it is seen by others before proceeding.
>> +	 */
>> +	if (unlikely(wstate == WRITER_HANDOFF))
>> +		atomic_long_add_return(-RWSEM_FLAG_HANDOFF,  &sem->count);
> _AGAIN_ no explanation what so ff'ing ever.
>
> And why add_return() if you ignore the return value.
>

OK, will remove those.

>>  	if (list_empty(&sem->wait_list))
>>  		atomic_long_andnot(RWSEM_FLAG_WAITERS, &sem->count);
> And you could've easily combined the two flags in a single andnot op.

That is true, but the nolock case is rarely executed. That is why I opted
for simplicity rather than more complicated but faster code.
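
For reference, the combined form would be something like this (a sketch
only, not what I plan to send):

	long clear = 0;

	if (unlikely(wstate == WRITER_HANDOFF))
		clear |= RWSEM_FLAG_HANDOFF;
	if (list_empty(&sem->wait_list))
		clear |= RWSEM_FLAG_WAITERS;
	if (clear)
		atomic_long_andnot(clear, &sem->count);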

Cheers,
Longman


^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 07/16] locking/rwsem: Implement lock handoff to prevent lock starvation
  2019-04-16 18:16     ` Waiman Long
@ 2019-04-16 18:32       ` Peter Zijlstra
  2019-04-17  7:35       ` Peter Zijlstra
  2019-04-17  8:05       ` Peter Zijlstra
  2 siblings, 0 replies; 112+ messages in thread
From: Peter Zijlstra @ 2019-04-16 18:32 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On Tue, Apr 16, 2019 at 02:16:11PM -0400, Waiman Long wrote:

> >> @@ -665,9 +751,34 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
> >>  			lockevent_inc(rwsem_sleep_writer);
> >>  			set_current_state(state);
> >>  			count = atomic_long_read(&sem->count);
> >> +
> >> +			if ((wstate == WRITER_NOT_FIRST) &&
> >> +			    rwsem_waiter_is_first(sem, &waiter))
> >> +				wstate = WRITER_FIRST;
> >> +
> >> +			if (!RWSEM_COUNT_LOCKED(count))
> >> +				break;
> >> +
> >> +			/*
> >> +			 * An RT task sets the HANDOFF bit immediately.
> >> +			 * Non-RT task will wait a while before doing so.
> > Again, this describes what we already read the code to do; but doesn't
> > add anything.
> 
> Will remove that.
> 
> >> +			 *
> >> +			 * The setting of the handoff bit is deferred
> >> +			 * until rwsem_try_write_lock() is called.
> >> +			 */
> >> +			if ((wstate == WRITER_FIRST) && (rt_task(current) ||
> >> +			    time_after(jiffies, waiter.timeout))) {
> >> +				wstate = WRITER_HANDOFF;
> >> +				lockevent_inc(rwsem_wlock_handoff);
> >> +				/*
> >> +				 * Break out to call rwsem_try_write_lock().
> >> +				 */
> > Another exceedingly useful comment.
> >
> >> +				break;
> >> +			}
> >> +		}
> >>  
> >>  		raw_spin_lock_irq(&sem->wait_lock);
> >> +		count = atomic_long_read(&sem->count);
> >>  	}
> >>  	__set_current_state(TASK_RUNNING);
> >>  	list_del(&waiter.list);
> >> @@ -680,6 +791,12 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
> >>  	__set_current_state(TASK_RUNNING);
> >>  	raw_spin_lock_irq(&sem->wait_lock);
> >>  	list_del(&waiter.list);
> >> +	/*
> >> +	 * If handoff bit has been set by this waiter, make sure that the
> >> +	 * clearing of it is seen by others before proceeding.
> >> +	 */
> >> +	if (unlikely(wstate == WRITER_HANDOFF))
> >> +		atomic_long_add_return(-RWSEM_FLAG_HANDOFF,  &sem->count);
> > _AGAIN_ no explanation what so ff'ing ever.
> >
> > And why add_return() if you ignore the return value.
> >
> 
> OK, will remove those.

I'm not saying to remove them, although for at least the break one that
is fine. But do try and write comments that _add_ something to the code,
explain _why_ instead of state what (which we can trivially see from the
code).

Locking code in general is tricky enough, and comments are good, and
more comments are more good, but only if they explain why.
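
For example, for the HANDOFF clearing in the out_nolock path, something
along these lines would actually tell the reader something (illustration
only, the exact wording is yours to pick):

	/*
	 * We set HANDOFF but are bailing out without ever taking the
	 * lock (signal). Clear it, otherwise everybody else keeps
	 * deferring to a handoff that is never going to happen.
	 */
	if (unlikely(wstate == WRITER_HANDOFF))
		atomic_long_andnot(RWSEM_FLAG_HANDOFF, &sem->count);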

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 07/16] locking/rwsem: Implement lock handoff to prevent lock starvation
  2019-04-16 16:15     ` Peter Zijlstra
@ 2019-04-16 18:41       ` Waiman Long
  0 siblings, 0 replies; 112+ messages in thread
From: Waiman Long @ 2019-04-16 18:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On 04/16/2019 12:15 PM, Peter Zijlstra wrote:
> On Tue, Apr 16, 2019 at 05:49:37PM +0200, Peter Zijlstra wrote:
>> See, if you first write that function in the form:
>>
>> 	long new;
>>
>> 	do {
>> 		new = count | RWSEM_WRITER_LOCKED;
>>
>> 		if (count & RWSEM_LOCK_MASK)
>> 			return false;
>>
>> 		if (list_is_singular(&sem->wait_list))
>> 			new &= ~RWSEM_FLAG_WAITERS;
>>
>> 	} while (atomic_long_try_cmpxchg_acquire(&sem->count, &count, new));
>>
>> 	rwsem_set_owner(sem);
>> 	return true;
>>
>> And then add the HANDOFF bits like:
>>
>> 	long new;
>>
>> 	do {
>> +		bool has_handoff = !!(count & RWSEM_FLAG_HANDOFF);
>>
>> +		new = (count | RWSEM_WRITER_LOCKED) & ~RWSEM_FLAG_HANDOFF;
>>
>> 		if (count & RWSEM_LOCK_MASK) {
>> +			if (has_handoff && wstate != WRITER_HANDOFF)
>> +				return false;
>> 			new |= RWSEM_FLAG_HANDOFF;
>> 		}
>>
>> +		if (has_handoff && wstate == WRITER_NOT_FIRST)
>> +			return false;
>>
>> 		if (list_is_singular(&sem->wait_list))
>> 			new &= ~RWSEM_FLAG_WAITERS;
>>
>> 	} while (atomic_long_try_cmpxchg_acquire(&sem->count, &count, new));
> obviously that should be !
>

Right.

-Longman

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 06/16] locking/rwsem: Code cleanup after files merging
  2019-04-16 16:17     ` Peter Zijlstra
@ 2019-04-16 19:45       ` Waiman Long
  0 siblings, 0 replies; 112+ messages in thread
From: Waiman Long @ 2019-04-16 19:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On 04/16/2019 12:17 PM, Peter Zijlstra wrote:
> On Tue, Apr 16, 2019 at 06:01:13PM +0200, Peter Zijlstra wrote:
>> @@ -794,34 +770,38 @@ static inline int __down_read_trylock(st
>>   */
>>  static inline void __down_write(struct rw_semaphore *sem)
>>  {
>> -	if (unlikely(atomic_long_cmpxchg_acquire(&sem->count, 0,
>> -						 RWSEM_WRITER_LOCKED)))
>> -		rwsem_down_write_failed(sem);
>> +	long tmp = RWSEM_UNLOCKED_VALUE;
>> +
>> +	if (unlikely(atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
>> +						     RWSEM_WRITER_LOCKED)))
> !
>
>> +		rwsem_down_write_slow(sem, TASK_UNINTERRUPTIBLE);
>>  	rwsem_set_owner(sem);
>>  }
>>  
>>  static inline int __down_write_killable(struct rw_semaphore *sem)
>>  {
>> -	if (unlikely(atomic_long_cmpxchg_acquire(&sem->count, 0,
>> -						 RWSEM_WRITER_LOCKED)))
>> -		if (IS_ERR(rwsem_down_write_failed_killable(sem)))
>> +	long tmp = RWSEM_UNLOCKED_VALUE;
>> +
>> +	if (unlikely(atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
>> +						     RWSEM_WRITER_LOCKED))) {
> also !
>
>> +		if (IS_ERR(rwsem_down_write_slow(sem, TASK_KILLABLE)))
>>  			return -EINTR;
>> +	}
>>  	rwsem_set_owner(sem);
>>  	return 0;
>>  }
> I'm having a great day it seems, it's like back in uni, trying to find
> all the missing - signs in this page-long DE.

I am really grateful that you have spent the time to review my rwsem
patches. The last few days have also been very intense for me,
concentrating on investigating the regression and finding useful fixes,
so I haven't had time to deal with other stuff.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 07/16] locking/rwsem: Implement lock handoff to prevent lock starvation
  2019-04-16 14:12   ` Peter Zijlstra
@ 2019-04-16 20:26     ` Waiman Long
  2019-04-16 21:07       ` Waiman Long
  0 siblings, 1 reply; 112+ messages in thread
From: Waiman Long @ 2019-04-16 20:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On 04/16/2019 10:12 AM, Peter Zijlstra wrote:
> On Sat, Apr 13, 2019 at 01:22:50PM -0400, Waiman Long wrote:
>> +/*
>> + * The typical HZ value is either 250 or 1000. So set the minimum waiting
>> + * time to 4ms in the wait queue before initiating the handoff protocol.
>> + */
>> +#define RWSEM_WAIT_TIMEOUT	(HZ/250)
> That seems equally unfortunate. For HZ=100 that results in 0ms, and for
> HZ=300 that results in 3 1/3-rd ms.
>
> (and this is not considering Alpha,ARM and MIPS, who all have various
> other 'creative' HZ values)
>
> In general aiming for sub 10ms timing using jiffies seems 'optimistic'.

I see your point. I will change it to use sched_clock() instead.

Thanks,
Longman


^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 07/16] locking/rwsem: Implement lock handoff to prevent lock starvation
  2019-04-16 20:26     ` Waiman Long
@ 2019-04-16 21:07       ` Waiman Long
  2019-04-17  7:13         ` Peter Zijlstra
  0 siblings, 1 reply; 112+ messages in thread
From: Waiman Long @ 2019-04-16 21:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On 04/16/2019 04:26 PM, Waiman Long wrote:
> On 04/16/2019 10:12 AM, Peter Zijlstra wrote:
>> On Sat, Apr 13, 2019 at 01:22:50PM -0400, Waiman Long wrote:
>>> +/*
>>> + * The typical HZ value is either 250 or 1000. So set the minimum waiting
>>> + * time to 4ms in the wait queue before initiating the handoff protocol.
>>> + */
>>> +#define RWSEM_WAIT_TIMEOUT	(HZ/250)
>> That seems equally unfortunate. For HZ=100 that results in 0ms, and for
>> HZ=300 that results in 3 1/3-rd ms.
>>
>> (and this is not considering Alpha,ARM and MIPS, who all have various
>> other 'creative' HZ values)
>>
>> In general aiming for sub 10ms timing using jiffies seems 'optimistic'.
> I see your point. I will change it to use sched_clock() instead.
>

Thinking about it again. I think I will just change its definition to
"((HZ + 249)/250)" for now to make sure that it is at least 1. The
handoff waiting period isn't as important in the overall scheme. Using
sched_clock() will definitely have a higher overhead than reading
jiffies. I want to minimize delay before the waiter can attempt to steal
the lock in the slowpath. That is the main reason I use this simple
scheme. We can certainly change it later on if we choose to, but I would
like to focus on other more important things first.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 07/16] locking/rwsem: Implement lock handoff to prevent lock starvation
  2019-04-16 21:07       ` Waiman Long
@ 2019-04-17  7:13         ` Peter Zijlstra
  2019-04-17 16:22           ` Waiman Long
  0 siblings, 1 reply; 112+ messages in thread
From: Peter Zijlstra @ 2019-04-17  7:13 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On Tue, Apr 16, 2019 at 05:07:26PM -0400, Waiman Long wrote:

> Thinking about it again. I think I will just change its definition to
> "((HZ + 249)/250)" for now to make sure that it is at least 1. The

DIV_ROUND_UP()
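
That is, something like (just spelling it out, assuming the 4ms target
stays):

#define RWSEM_WAIT_TIMEOUT	DIV_ROUND_UP(HZ, 250)	/* never less than 1 jiffy */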

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 07/16] locking/rwsem: Implement lock handoff to prevent lock starvation
  2019-04-16 18:16     ` Waiman Long
  2019-04-16 18:32       ` Peter Zijlstra
@ 2019-04-17  7:35       ` Peter Zijlstra
  2019-04-17 16:35         ` Waiman Long
  2019-04-17  8:05       ` Peter Zijlstra
  2 siblings, 1 reply; 112+ messages in thread
From: Peter Zijlstra @ 2019-04-17  7:35 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On Tue, Apr 16, 2019 at 02:16:11PM -0400, Waiman Long wrote:

> >> @@ -324,6 +364,12 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
> >>  		adjustment -= RWSEM_FLAG_WAITERS;
> >>  	}
> >>  
> >> +	/*
> >> +	 * Clear the handoff flag
> >> +	 */
> > Right, but that is a trivial comment in the 'increment i' style, it
> > clearly states what the code does, but completely fails to elucidate the
> > code.
> >
> > Maybe:
> >
> > 	/*
> > 	 * When we've woken a reader, we no longer need to force writers
> > 	 * to give up the lock and we can clear HANDOFF.
> > 	 */
> >
> > And I suppose this is required if we were the pickup of the handoff set
> > above, but is there a guarantee that the HANDOFF was not set by a
> > writer?
> 
> I can change the comment. The handoff bit is always cleared in
> rwsem_try_write_lock() when the lock is successfully acquired. Will add a
> comment to document that.

That doesn't help much, because it drops ->wait_lock between setting it
and acquiring it. So the read-acquire can interleave.

I _think_ it works, but I'm having trouble explaining how exactly. I
think because readers don't spin yet and thus wakeups abide by queue
order.

And the other way around should have (write) spinners terminate the
> moment they see HANDOFF set by a reader, but I'm not immediately seeing
that either.

I'll continue staring at that.


^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 07/16] locking/rwsem: Implement lock handoff to prevent lock starvation
  2019-04-16 18:16     ` Waiman Long
  2019-04-16 18:32       ` Peter Zijlstra
  2019-04-17  7:35       ` Peter Zijlstra
@ 2019-04-17  8:05       ` Peter Zijlstra
  2019-04-17 16:39         ` Waiman Long
  2 siblings, 1 reply; 112+ messages in thread
From: Peter Zijlstra @ 2019-04-17  8:05 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On Tue, Apr 16, 2019 at 02:16:11PM -0400, Waiman Long wrote:

> >> @@ -608,56 +687,63 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
> >>  	 */
> >>  	waiter.task = current;
> >>  	waiter.type = RWSEM_WAITING_FOR_WRITE;
> >> +	waiter.timeout = jiffies + RWSEM_WAIT_TIMEOUT;
> >>  
> >>  	raw_spin_lock_irq(&sem->wait_lock);
> >>  
> >>  	/* account for this before adding a new element to the list */
> >> +	wstate = list_empty(&sem->wait_list) ? WRITER_FIRST : WRITER_NOT_FIRST;
> >>  
> >>  	list_add_tail(&waiter.list, &sem->wait_list);
> >>  
> >>  	/* we're now waiting on the lock */
> >> +	if (wstate == WRITER_NOT_FIRST) {
> >>  		count = atomic_long_read(&sem->count);
> >>  
> >>  		/*
> >> +		 * If there were already threads queued before us and:
> >> +		 *  1) there are no active locks, wake the front
> >> +		 *     queued process(es) as the handoff bit might be set.
> >> +		 *  2) there are no active writers and some readers, the lock
> >> +		 *     must be read owned; so we try to wake any read lock
> >> +		 *     waiters that were queued ahead of us.
> >>  		 */
> >> +		if (!RWSEM_COUNT_LOCKED(count))
> >> +			__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
> >> +		else if (!(count & RWSEM_WRITER_MASK) &&
> >> +			  (count & RWSEM_READER_MASK))
> >>  			__rwsem_mark_wake(sem, RWSEM_WAKE_READERS, &wake_q);

> > Does the above want to be something like:
> >
> > 		if (!(count & RWSEM_WRITER_LOCKED)) {
> > 			__rwsem_mark_wake(sem, (count & RWSEM_READER_MASK) ?
> > 					       RWSEM_WAKE_READERS :
> > 					       RWSEM_WAKE_ANY, &wake_q);
> > 		}
> 
> Yes.
> 
> >> +		else
> >> +			goto wait;
> >>  
> >> +		/*
> >> +		 * The wakeup is normally called _after_ the wait_lock
> >> +		 * is released, but given that we are proactively waking
> >> +		 * readers we can deal with the wake_q overhead as it is
> >> +		 * similar to releasing and taking the wait_lock again
> >> +		 * for attempting rwsem_try_write_lock().
> >> +		 */
> >> +		wake_up_q(&wake_q);
> > Hurmph.. the reason we do wake_up_q() outside of wait_lock is such that
> > those tasks don't bounce on wait_lock. Also, it removes a great deal of
> > hold-time from wait_lock.
> >
> > So I'm not sure I buy your argument here.
> >
> 
> Actually, we don't want to release the wait_lock, do wake_up_q() and
> acquire the wait_lock again as the state would have been changed. I
> didn't change the comment on this patch, but will reword it to discuss that.

I don't understand, we've queued ourselves, we're on the list, we're not
first. How would dropping the lock to try and kick waiters before us be
a problem?

> Sure, once we re-acquire the lock we have to re-evaluate @wstate to see
if we're first now or not, but we need to do that anyway.

So what is wrong with the below?

--- a/include/linux/sched/wake_q.h
+++ b/include/linux/sched/wake_q.h
@@ -51,6 +51,11 @@ static inline void wake_q_init(struct wa
 	head->lastp = &head->first;
 }
 
+static inline bool wake_q_empty(struct wake_q_head *head)
+{
+	return head->first == WAKE_Q_TAIL;
+}
+
 extern void wake_q_add(struct wake_q_head *head, struct task_struct *task);
 extern void wake_q_add_safe(struct wake_q_head *head, struct task_struct *task);
 extern void wake_up_q(struct wake_q_head *head);
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -700,25 +700,22 @@ __rwsem_down_write_failed_common(struct
 		 *     must be read owned; so we try to wake any read lock
 		 *     waiters that were queued ahead of us.
 		 */
-		if (!(count & RWSEM_LOCKED_MASK))
-			__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
-		else if (!(count & RWSEM_WRITER_MASK) &&
-				(count & RWSEM_READER_MASK))
-			__rwsem_mark_wake(sem, RWSEM_WAKE_READERS, &wake_q);
-		else
+		if (count & RWSEM_WRITER_LOCKED)
 			goto wait;
-		/*
-		 * The wakeup is normally called _after_ the wait_lock
-		 * is released, but given that we are proactively waking
-		 * readers we can deal with the wake_q overhead as it is
-		 * similar to releasing and taking the wait_lock again
-		 * for attempting rwsem_try_write_lock().
-		 */
-		wake_up_q(&wake_q);
-		/*
-		 * Reinitialize wake_q after use.
-		 */
-		wake_q_init(&wake_q);
+
+		__rwsem_mark_wake(sem, (count & RWSEM_READER_MASK) ?
+				RWSEM_WAKE_READERS :
+				RWSEM_WAKE_ANY, &wake_q);
+
+		if (!wake_q_empty(&wake_q)) {
+			raw_spin_unlock_irq(&sem->wait_lock);
+			wake_up_q(&wake_q);
+			/* used again, reinit */
+			wake_q_init(&wake_q);
+			raw_spin_lock_irq(&sem->wait_lock);
+			if (rwsem_waiter_is_first(sem, &waiter))
+				wstate = WRITER_FIRST;
+		}
 	} else {
 		count = atomic_long_add_return(RWSEM_FLAG_WAITERS, &sem->count);
 	}

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 07/16] locking/rwsem: Implement lock handoff to prevent lock starvation
  2019-04-13 17:22 ` [PATCH v4 07/16] locking/rwsem: Implement lock handoff to prevent lock starvation Waiman Long
  2019-04-16 14:12   ` Peter Zijlstra
  2019-04-16 15:49   ` Peter Zijlstra
@ 2019-04-17  8:17   ` Peter Zijlstra
  2 siblings, 0 replies; 112+ messages in thread
From: Peter Zijlstra @ 2019-04-17  8:17 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On Sat, Apr 13, 2019 at 01:22:50PM -0400, Waiman Long wrote:
> +/*
> + * This is safe to be called without holding the wait_lock.

Because.... @waiter is *our* waiter and it's not going anywhere. So when
it's first, it stays first until we do something about it.

> + */
> +static inline bool
> +rwsem_waiter_is_first(struct rw_semaphore *sem, struct rwsem_waiter *waiter)
> +{
> +	return list_first_entry(&sem->wait_list, struct rwsem_waiter, list)
> +			== waiter;
> +}

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 08/16] locking/rwsem: Make rwsem_spin_on_owner() return owner state
  2019-04-13 17:22 ` [PATCH v4 08/16] locking/rwsem: Make rwsem_spin_on_owner() return owner state Waiman Long
@ 2019-04-17  9:00   ` Peter Zijlstra
  2019-04-17 16:42     ` Waiman Long
  2019-04-17 10:19   ` Peter Zijlstra
  2019-04-17 12:41   ` Peter Zijlstra
  2 siblings, 1 reply; 112+ messages in thread
From: Peter Zijlstra @ 2019-04-17  9:00 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On Sat, Apr 13, 2019 at 01:22:51PM -0400, Waiman Long wrote:
> This patch modifies rwsem_spin_on_owner() to return four possible
> values to better reflect the state of lock holder which enables us to
> make a better decision of what to do next.
> 
> In the special case that there is no active lock and the handoff bit
> is set, optimistic spinning has to be stopped.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  kernel/locking/rwsem.c | 45 +++++++++++++++++++++++++++++++++++-------
>  1 file changed, 38 insertions(+), 7 deletions(-)
> 
> diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
> index aaab546a890d..2d6850c3e77b 100644
> --- a/kernel/locking/rwsem.c
> +++ b/kernel/locking/rwsem.c
> @@ -156,6 +156,11 @@ static inline bool is_rwsem_owner_spinnable(struct task_struct *owner)
>  	return !((unsigned long)owner & RWSEM_ANONYMOUSLY_OWNED);
>  }
>  
> +static inline bool is_rwsem_owner_reader(struct task_struct *owner)
> +{
> +	return (unsigned long)owner & RWSEM_READER_OWNED;
> +}

Move this and the surrounding helpers into the RWSEM_SPIN_ON_OWNER
block, it is only used there and that way all the code is together.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 08/16] locking/rwsem: Make rwsem_spin_on_owner() return owner state
  2019-04-13 17:22 ` [PATCH v4 08/16] locking/rwsem: Make rwsem_spin_on_owner() return owner state Waiman Long
  2019-04-17  9:00   ` Peter Zijlstra
@ 2019-04-17 10:19   ` Peter Zijlstra
  2019-04-17 16:53     ` Waiman Long
  2019-04-17 12:41   ` Peter Zijlstra
  2 siblings, 1 reply; 112+ messages in thread
From: Peter Zijlstra @ 2019-04-17 10:19 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On Sat, Apr 13, 2019 at 01:22:51PM -0400, Waiman Long wrote:
> In the special case that there is no active lock and the handoff bit
> is set, optimistic spinning has to be stopped.

This makes me think this should've been _before_ you added the handoff
bit. So that when you introduce the handoff, everything is solid.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 08/16] locking/rwsem: Make rwsem_spin_on_owner() return owner state
  2019-04-13 17:22 ` [PATCH v4 08/16] locking/rwsem: Make rwsem_spin_on_owner() return owner state Waiman Long
  2019-04-17  9:00   ` Peter Zijlstra
  2019-04-17 10:19   ` Peter Zijlstra
@ 2019-04-17 12:41   ` Peter Zijlstra
  2019-04-17 12:47     ` Peter Zijlstra
  2019-04-17 13:00     ` Peter Zijlstra
  2 siblings, 2 replies; 112+ messages in thread
From: Peter Zijlstra @ 2019-04-17 12:41 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On Sat, Apr 13, 2019 at 01:22:51PM -0400, Waiman Long wrote:
> In the special case that there is no active lock and the handoff bit
> is set, optimistic spinning has to be stopped.

> @@ -500,9 +521,19 @@ static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem)
>  
>  	/*
>  	 * If there is a new owner or the owner is not set, we continue
> -	 * spinning.
> +	 * spinning except when here is no active locks and the handoff bit
> +	 * is set. In this case, we have to stop spinning.
>  	 */
> -	return is_rwsem_owner_spinnable(READ_ONCE(sem->owner));
> +	owner = READ_ONCE(sem->owner);
> +	if (!is_rwsem_owner_spinnable(owner))
> +		return OWNER_NONSPINNABLE;
> +	if (owner && !is_rwsem_owner_reader(owner))
> +		return OWNER_WRITER;
> +
> +	count = atomic_long_read(&sem->count);
> +	if (RWSEM_COUNT_HANDOFF(count) && !RWSEM_COUNT_LOCKED(count))
> +		return OWNER_NONSPINNABLE;
> +	return !owner ? OWNER_NULL : OWNER_READER;
>  }

So this fixes a straight up bug in the last patch (and thus should be
done before so the bug never exists), and creates unreadable code while
at it.

Also, I think only checking HANDOFF after the loop is wrong; the moment
HANDOFF happens you have to terminate the loop, irrespective of what
@owner does.

Does something like so work?

---

enum owner_state {
	OWNER_NULL		= 1 << 0,
	OWNER_WRITER		= 1 << 1,
	OWNER_READER		= 1 << 2,
	OWNER_NONSPINNABLE	= 1 << 3,
};
#define OWNER_SPINNABLE		(OWNER_NULL | OWNER_WRITER)

static inline enum owner_state rwsem_owner_state(unsigned long owner)
{
	if (!owner)
		return OWNER_NULL;

	if (owner & RWSEM_ANONYMOUSLY_OWNED)
		return OWNER_NONSPINNABLE;

	if (owner & RWSEM_READER_OWNED)
		return OWNER_READER;

	return OWNER_WRITER;
}

static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
{
	struct task_struct *tmp, *owner = READ_ONCE(sem->owner);
	enum owner_state state;

	rcu_read_lock();
	for (;;) {
		state = rwsem_owner_state((unsigned long)owner);
		if (!(state & OWNER_SPINNABLE))
			break;

		if (atomic_long_read(&sem->count) & RWSEM_FLAG_HANDOFF) {
			state = OWNER_NONSPINNABLE;
			break;
		}

		tmp = READ_ONCE(sem->owner);
		if (tmp != owner) {
			state = rwsem_owner_state((unsigned long)tmp);
			break;
		}

		/*
		 * Ensure we emit the owner->on_cpu, dereference _after_
		 * checking sem->owner still matches owner, if that fails,
		 * owner might point to free()d memory, if it still matches,
		 * the rcu_read_lock() ensures the memory stays valid.
		 */
		barrier();

		if (need_resched() || !owner_on_cpu(owner)) {
			state = OWNER_NONSPINNABLE;
			break;
		}

		cpu_relax();
	}
	rcu_read_unlock();

	return state;
}

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 08/16] locking/rwsem: Make rwsem_spin_on_owner() return owner state
  2019-04-17 12:41   ` Peter Zijlstra
@ 2019-04-17 12:47     ` Peter Zijlstra
  2019-04-17 18:29       ` Waiman Long
  2019-04-17 13:00     ` Peter Zijlstra
  1 sibling, 1 reply; 112+ messages in thread
From: Peter Zijlstra @ 2019-04-17 12:47 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On Wed, Apr 17, 2019 at 02:41:01PM +0200, Peter Zijlstra wrote:
> On Sat, Apr 13, 2019 at 01:22:51PM -0400, Waiman Long wrote:
> > In the special case that there is no active lock and the handoff bit
> > is set, optimistic spinning has to be stopped.
> 
> > @@ -500,9 +521,19 @@ static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem)
> >  
> >  	/*
> >  	 * If there is a new owner or the owner is not set, we continue
> > -	 * spinning.
> > +	 * spinning except when there are no active locks and the handoff bit
> > +	 * is set. In this case, we have to stop spinning.
> >  	 */
> > -	return is_rwsem_owner_spinnable(READ_ONCE(sem->owner));
> > +	owner = READ_ONCE(sem->owner);
> > +	if (!is_rwsem_owner_spinnable(owner))
> > +		return OWNER_NONSPINNABLE;
> > +	if (owner && !is_rwsem_owner_reader(owner))
> > +		return OWNER_WRITER;
> > +
> > +	count = atomic_long_read(&sem->count);
> > +	if (RWSEM_COUNT_HANDOFF(count) && !RWSEM_COUNT_LOCKED(count))
> > +		return OWNER_NONSPINNABLE;
> > +	return !owner ? OWNER_NULL : OWNER_READER;
> >  }
> 
> So this fixes a straight up bug in the last patch (and thus should be
> done before so the bug never exists), and creates unreadable code while
> at it.
> 
> Also, I think only checking HANDOFF after the loop is wrong; the moment
> HANDOFF happens you have to terminate the loop, irrespective of what
> @owner does.
> 
> Does something like so work?
> 
> ---
> 
> enum owner_state {
> 	OWNER_NULL		= 1 << 0,
> 	OWNER_WRITER		= 1 << 1,
> 	OWNER_READER		= 1 << 2,
> 	OWNER_NONSPINNABLE	= 1 << 3,
> };
> #define OWNER_SPINNABLE		(OWNER_NULL | OWNER_WRITER)

Hmm, we should not spin on OWNER_NULL. Or at least not mixed in with the
patch that changes the shape of all this. That should go in the RT
thingy patch, which comes after this.

> static inline enum owner_state rwsem_owner_state(unsigned long owner)
> {
> 	if (!owner)
> 		return OWNER_NULL;
> 
> 	if (owner & RWSEM_ANONYMOUSLY_OWNED)
> 		return OWNER_NONSPINNABLE;
> 
> 	if (owner & RWSEM_READER_OWNED)
> 		return OWNER_READER;
> 
> 	return OWNER_WRITER;
> }
> 
> static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
> {
> 	struct task_struct *tmp, *owner = READ_ONCE(sem->owner);
> 	enum owner_state state;
> 
> 	rcu_read_lock();
> 	for (;;) {
> 		state = rwsem_owner_state((unsigned long)owner);
> 		if (!(state & OWNER_SPINNABLE))
> 			break;
> 
> 		if (atomic_long_read(&sem->count) & RWSEM_FLAG_HANDOFF) {
> 			state = OWNER_NONSPINNABLE;
> 			break;
> 		}
> 
> 		tmp = READ_ONCE(sem->owner);
> 		if (tmp != owner) {
> 			state = rwsem_owner_state((unsigned long)tmp);
> 			break;
> 		}
> 
> 		/*
> 		 * Ensure we emit the owner->on_cpu, dereference _after_
> 		 * checking sem->owner still matches owner, if that fails,
> 		 * owner might point to free()d memory, if it still matches,
> 		 * the rcu_read_lock() ensures the memory stays valid.
> 		 */
> 		barrier();
> 
> 		if (need_resched() || !owner_on_cpu(owner)) {
> 			state = OWNER_NONSPINNABLE;
> 			break;
> 		}
> 
> 		cpu_relax();
> 	}
> 	rcu_read_unlock();
> 
> 	return state;
> }

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 08/16] locking/rwsem: Make rwsem_spin_on_owner() return owner state
  2019-04-17 12:41   ` Peter Zijlstra
  2019-04-17 12:47     ` Peter Zijlstra
@ 2019-04-17 13:00     ` Peter Zijlstra
  2019-04-17 18:50       ` Waiman Long
  1 sibling, 1 reply; 112+ messages in thread
From: Peter Zijlstra @ 2019-04-17 13:00 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On Wed, Apr 17, 2019 at 02:41:01PM +0200, Peter Zijlstra wrote:
> On Sat, Apr 13, 2019 at 01:22:51PM -0400, Waiman Long wrote:
> > In the special case that there is no active lock and the handoff bit
> > is set, optimistic spinning has to be stopped.
> 
> > @@ -500,9 +521,19 @@ static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem)
> >  
> >  	/*
> >  	 * If there is a new owner or the owner is not set, we continue
> > -	 * spinning.
> > +	 * spinning except when there are no active locks and the handoff bit
> > +	 * is set. In this case, we have to stop spinning.
> >  	 */
> > -	return is_rwsem_owner_spinnable(READ_ONCE(sem->owner));
> > +	owner = READ_ONCE(sem->owner);
> > +	if (!is_rwsem_owner_spinnable(owner))
> > +		return OWNER_NONSPINNABLE;
> > +	if (owner && !is_rwsem_owner_reader(owner))
> > +		return OWNER_WRITER;
> > +
> > +	count = atomic_long_read(&sem->count);
> > +	if (RWSEM_COUNT_HANDOFF(count) && !RWSEM_COUNT_LOCKED(count))
> > +		return OWNER_NONSPINNABLE;
> > +	return !owner ? OWNER_NULL : OWNER_READER;
> >  }
> 
> So this fixes a straight up bug in the last patch (and thus should be
> done before so the bug never exists), and creates unreadable code while
> at it.
> 
> Also, I think only checking HANDOFF after the loop is wrong; the moment
> HANDOFF happens you have to terminate the loop, irrespective of what
> @owner does.
> 
> Does something like so work?
> 
> ---

enum owner_state {
	OWNER_NULL		= 1 << 0,
	OWNER_WRITER		= 1 << 1,
	OWNER_READER		= 1 << 2,
	OWNER_NONSPINNABLE	= 1 << 3,
};
#define OWNER_SPINNABLE		(OWNER_NULL | OWNER_WRITER)

static inline enum owner_state rwsem_owner_state(unsigned long owner)
{
	if (!owner)
		return OWNER_NULL;

	if (owner & RWSEM_ANONYMOUSLY_OWNED)
		return OWNER_NONSPINNABLE;

	if (owner & RWSEM_READER_OWNED)
		return OWNER_READER;

	return OWNER_WRITER;
}

static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
{
	struct task_struct *tmp, *owner = READ_ONCE(sem->owner);
	enum owner_state state = rwsem_owner_state((unsigned long)owner);

	if (state != OWNER_WRITER)
		return state;

	rcu_read_lock();
	for (;;) {
		if (atomic_long_read(&sem->count) & RWSEM_FLAG_HANDOFF) {
			state = OWNER_NONSPINNABLE;
			break;
		}

		tmp = READ_ONCE(sem->owner);
		if (tmp != owner) {
			state = rwsem_owner_state((unsigned long)tmp);
			break;
		}

		/*
		 * Ensure we emit the owner->on_cpu, dereference _after_
		 * checking sem->owner still matches owner, if that fails,
		 * owner might point to free()d memory, if it still matches,
		 * the rcu_read_lock() ensures the memory stays valid.
		 */
		barrier();

		if (need_resched() || !owner_on_cpu(owner)) {
			state = OWNER_NONSPINNABLE;
			break;
		}

		cpu_relax();
	}
	rcu_read_unlock();

	return state;
}

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 09/16] locking/rwsem: Ensure an RT task will not spin on reader
  2019-04-13 17:22 ` [PATCH v4 09/16] locking/rwsem: Ensure an RT task will not spin on reader Waiman Long
@ 2019-04-17 13:18   ` Peter Zijlstra
  2019-04-17 18:47     ` Waiman Long
  0 siblings, 1 reply; 112+ messages in thread
From: Peter Zijlstra @ 2019-04-17 13:18 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On Sat, Apr 13, 2019 at 01:22:52PM -0400, Waiman Long wrote:
> An RT task can do optimistic spinning only if the lock holder is
> actually running. If the state of the lock holder isn't known, there
> is a possibility that the high priority of the RT task may block forward
> progress of the lock holder if it happens to reside on the same CPU.
> This will lead to deadlock. So we have to make sure that an RT task
> will not spin on a reader-owned rwsem.
> 
> When the owner is temporarily set to NULL, it is more tricky to decide
> if an RT task should stop spinning as it may be a temporary state
> where another writer may have just stolen the lock which then failed
> the task's trylock attempt. So one more retry is allowed to make sure
> that the lock is not spinnable by an RT task.
> 
> When testing on a 8-socket IvyBridge-EX system, the one additional retry
> seems to improve locking performance of RT write locking threads under
> heavy contentions. The table below shows the locking rates (in kops/s)
> with various write locking threads before and after the patch.
> 
>     Locking threads     Pre-patch     Post-patch
>     ---------------     ---------     -----------
>             4             2,753          2,608
>             8             2,529          2,520
>            16             1,727          1,918
>            32             1,263          1,956
>            64               889          1,343
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  kernel/locking/rwsem.c | 36 +++++++++++++++++++++++++++++-------
>  1 file changed, 29 insertions(+), 7 deletions(-)
> 
> diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
> index 2d6850c3e77b..8e19b5141595 100644
> --- a/kernel/locking/rwsem.c
> +++ b/kernel/locking/rwsem.c
> @@ -539,6 +539,8 @@ static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
>  static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
>  {
>  	bool taken = false;
> +	bool is_rt_task = rt_task(current);

Arguably this is wrong; a remote CPU could change the scheduling
attributes of this task while it is spinning. In practice I don't think
we do that without forcing a reschedule, but in theory we could if we
find the task is current anyway.
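
Just to illustrate, re-evaluating it every iteration instead of caching
it at the top would look something like this (sketch):

		if (owner_state != OWNER_WRITER) {
			if (need_resched())
				break;
			/* re-check rt_task(); sched attributes can change */
			if (rt_task(current) &&
			    (prev_owner_state != OWNER_WRITER))
				break;
		}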

> +	int prev_owner_state = OWNER_NULL;
>  
>  	preempt_disable();
>  
> @@ -556,7 +558,12 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
>  	 *  2) readers own the lock as we can't determine if they are
>  	 *     actively running or not.
>  	 */
> -	while (rwsem_spin_on_owner(sem) & OWNER_SPINNABLE) {
> +	for (;;) {
> +		enum owner_state owner_state = rwsem_spin_on_owner(sem);
> +
> +		if (!(owner_state & OWNER_SPINNABLE))
> +			break;
> +
>  		/*
>  		 * Try to acquire the lock
>  		 */
> @@ -566,13 +573,28 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
>  		}
>  
>  		/*
> -		 * When there's no owner, we might have preempted between the
> -		 * owner acquiring the lock and setting the owner field. If
> -		 * we're an RT task that will live-lock because we won't let
> -		 * the owner complete.
> +		 * An RT task cannot do optimistic spinning if it cannot
> +		 * be sure the lock holder is running or live-lock may
> +		 * happen if the current task and the lock holder happen
> +		 * to run in the same CPU.
> +		 *
> +		 * When there's no owner or is reader-owned, an RT task
> +		 * will stop spinning if the owner state is not a writer
> +		 * at the previous iteration of the loop. This allows the
> +		 * RT task to recheck if the task that steals the lock is
> +		 * a spinnable writer. If so, it can keeps on spinning.
> +		 *
> +		 * If the owner is a writer, the need_resched() check is
> +		 * done inside rwsem_spin_on_owner(). If the owner is not
> +		 * a writer, need_resched() check needs to be done here.
>  		 */
> -		if (!sem->owner && (need_resched() || rt_task(current)))
> -			break;
> +		if (owner_state != OWNER_WRITER) {
> +			if (need_resched())
> +				break;
> +			if (is_rt_task && (prev_owner_state != OWNER_WRITER))
> +				break;
> +		}
> +		prev_owner_state = owner_state;
>  
>  		/*
>  		 * The cpu_relax() call is a compiler barrier which forces

This patch confuses me mightily. I mean, I see what it does, but I can't
figure out why. The Changelog is just one big source of confusion.

If you want one extra trylock attempt, why make it conditional on RT,
why not something simple like this?

--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -556,7 +556,7 @@ static bool rwsem_optimistic_spin(struct
 		 */
 		if (rwsem_try_write_lock_unqueued(sem)) {
 			taken = true;
-			break;
+			goto unlock;
 		}
 
 		/*
@@ -576,6 +576,11 @@ static bool rwsem_optimistic_spin(struct
 		 */
 		cpu_relax();
 	}
+
+	if (rwsem_try_write_lock_unqueued(sem))
+		taken = true;
+
+unlock:
 	osq_unlock(&sem->osq);
 done:
 	preempt_enable();

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 10/16] locking/rwsem: Wake up almost all readers in wait queue
  2019-04-13 17:22 ` [PATCH v4 10/16] locking/rwsem: Wake up almost all readers in wait queue Waiman Long
  2019-04-16 16:50   ` Davidlohr Bueso
@ 2019-04-17 13:39   ` Peter Zijlstra
  2019-04-17 17:16     ` Waiman Long
  1 sibling, 1 reply; 112+ messages in thread
From: Peter Zijlstra @ 2019-04-17 13:39 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On Sat, Apr 13, 2019 at 01:22:53PM -0400, Waiman Long wrote:
> When the front of the wait queue is a reader, other readers
> immediately following the first reader will also be woken up at the
> same time. However, if there is a writer in between, those readers
> behind the writer will not be woken up.
> 
> Because of optimistic spinning, the lock acquisition order is not FIFO
> anyway. The lock handoff mechanism will ensure that lock starvation
> will not happen.
> 
> Assuming that the lock hold times of the other readers still in the
> queue will be about the same as the readers that are being woken up,
> there is really not much additional cost other than the additional
> latency due to the wakeup of additional tasks by the waker. Therefore
> all the readers up to a maximum of 256 in the queue are woken up when
> the first waiter is a reader to improve reader throughput.
> 
> With a locking microbenchmark running on 5.1 based kernel, the total
> locking rates (in kops/s) on a 8-socket IvyBridge-EX system with
> equal numbers of readers and writers before and after this patch were
> as follows:
> 
>    # of Threads  Pre-Patch   Post-patch
>    ------------  ---------   ----------
>         4          1,641        1,674
>         8            731        1,062
>        16            564          924
>        32             78          300
>        64             38          195
>       240             50          149
> 
> There is no performance gain at low contention level. At high contention
> level, however, this patch gives a pretty decent performance boost.

Right, so this basically completes the convertion from task-fair (FIFO)
to phase-fair.

https://cs.unc.edu/~anderson/papers/rtsj10-for-web.pdf

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 11/16] locking/rwsem: Enable readers spinning on writer
  2019-04-13 17:22 ` [PATCH v4 11/16] locking/rwsem: Enable readers spinning on writer Waiman Long
@ 2019-04-17 13:56   ` Peter Zijlstra
  2019-04-17 17:34     ` Waiman Long
  2019-04-17 13:58   ` Peter Zijlstra
  2019-04-17 14:05   ` Peter Zijlstra
  2 siblings, 1 reply; 112+ messages in thread
From: Peter Zijlstra @ 2019-04-17 13:56 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On Sat, Apr 13, 2019 at 01:22:54PM -0400, Waiman Long wrote:
> @@ -549,7 +582,7 @@ static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
>  	return !owner ? OWNER_NULL : OWNER_READER;
>  }
>  
> -static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
> +static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
>  {
>  	bool taken = false;
>  	bool is_rt_task = rt_task(current);
> @@ -558,9 +591,6 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
>  	preempt_disable();
>  
>  	/* sem->wait_lock should not be held when doing optimistic spinning */
> -	if (!rwsem_can_spin_on_owner(sem))
> -		goto done;
> -
>  	if (!osq_lock(&sem->osq))
>  		goto done;
>  
> @@ -580,10 +610,11 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
>  		/*
>  		 * Try to acquire the lock
>  		 */
> -		if (rwsem_try_write_lock_unqueued(sem)) {
> -			taken = true;
> +		taken = wlock ? rwsem_try_write_lock_unqueued(sem)
> +			      : rwsem_try_read_lock_unqueued(sem);
> +
> +		if (taken)
>  			break;
> -		}
>  
>  		/*
>  		 * An RT task cannot do optimistic spinning if it cannot

Alternatively you pass the trylock function as an argument:

static bool rwsem_optimistic_spin(struct rw_semaphore *sem,
				  bool (*trylock)(struct rw_semaphore *sem))
{
	...
		if (trylock(sem)) {
			taken = true;
			goto unlock;
		}
	...
}


^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 11/16] locking/rwsem: Enable readers spinning on writer
  2019-04-13 17:22 ` [PATCH v4 11/16] locking/rwsem: Enable readers spinning on writer Waiman Long
  2019-04-17 13:56   ` Peter Zijlstra
@ 2019-04-17 13:58   ` Peter Zijlstra
  2019-04-17 17:45     ` Waiman Long
  2019-04-17 14:05   ` Peter Zijlstra
  2 siblings, 1 reply; 112+ messages in thread
From: Peter Zijlstra @ 2019-04-17 13:58 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On Sat, Apr 13, 2019 at 01:22:54PM -0400, Waiman Long wrote:
> +/*
> + * Try to acquire read lock before the reader is put on wait queue.
> + * Lock acquisition isn't allowed if the rwsem is locked or a writer handoff
> + * is ongoing.
> + */
> +static inline bool rwsem_try_read_lock_unqueued(struct rw_semaphore *sem)
> +{
> +	long count = atomic_long_read(&sem->count);
> +
> +	if (RWSEM_COUNT_WLOCKED_OR_HANDOFF(count))
> +		return false;
> +
> +	count = atomic_long_fetch_add_acquire(RWSEM_READER_BIAS, &sem->count);
> +	if (!RWSEM_COUNT_WLOCKED_OR_HANDOFF(count)) {
> +		rwsem_set_reader_owned(sem);
> +		lockevent_inc(rwsem_opt_rlock);
> +		return true;
> +	}
> +
> +	/* Back out the change */
> +	atomic_long_add(-RWSEM_READER_BIAS, &sem->count);
> +	return false;
> +}

Doesn't a cmpxchg 'loop' make more sense here?
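
Something like this, perhaps -- a sketch only, reusing the macros from
the patch:

static inline bool rwsem_try_read_lock_unqueued(struct rw_semaphore *sem)
{
	long count = atomic_long_read(&sem->count);

	do {
		if (RWSEM_COUNT_WLOCKED_OR_HANDOFF(count))
			return false;
	} while (!atomic_long_try_cmpxchg_acquire(&sem->count, &count,
						  count + RWSEM_READER_BIAS));

	rwsem_set_reader_owned(sem);
	lockevent_inc(rwsem_opt_rlock);
	return true;
}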

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 11/16] locking/rwsem: Enable readers spinning on writer
  2019-04-13 17:22 ` [PATCH v4 11/16] locking/rwsem: Enable readers spinning on writer Waiman Long
  2019-04-17 13:56   ` Peter Zijlstra
  2019-04-17 13:58   ` Peter Zijlstra
@ 2019-04-17 14:05   ` Peter Zijlstra
  2019-04-17 17:51     ` Waiman Long
  2 siblings, 1 reply; 112+ messages in thread
From: Peter Zijlstra @ 2019-04-17 14:05 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On Sat, Apr 13, 2019 at 01:22:54PM -0400, Waiman Long wrote:
> @@ -650,6 +686,33 @@ __rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
>  	struct rwsem_waiter waiter;
>  	DEFINE_WAKE_Q(wake_q);
>  
> +	if (!rwsem_can_spin_on_owner(sem))
> +		goto queue;
> +
> +	/*
> +	 * Undo read bias from down_read() and do optimistic spinning.
> +	 */
> +	atomic_long_add(-RWSEM_READER_BIAS, &sem->count);
> +	adjustment = 0;
> +	if (rwsem_optimistic_spin(sem, false)) {
> +		unsigned long flags;
> +
> +		/*
> +		 * Opportunistically wake up other readers in the wait queue.
> +		 * It has another chance of wakeup at unlock time.
> +		 */
> +		if ((atomic_long_read(&sem->count) & RWSEM_FLAG_WAITERS) &&
> +		    raw_spin_trylock_irqsave(&sem->wait_lock, flags)) {

why trylock? Also the rest of this function uses _irq(). Having had to
define @flags should've been a clue.

> You simply cannot do down_*() with IRQs disabled.

> +			if (!list_empty(&sem->wait_list))
> +				__rwsem_mark_wake(sem, RWSEM_WAKE_READ_OWNED,
> +						  &wake_q);
> +			raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
> +			wake_up_q(&wake_q);
> +		}
> +		return sem;
> +	}
> +
> +queue:
>  	waiter.task = current;
>  	waiter.type = RWSEM_WAITING_FOR_READ;
>  	waiter.timeout = jiffies + RWSEM_WAIT_TIMEOUT;

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 07/16] locking/rwsem: Implement lock handoff to prevent lock starvation
  2019-04-17  7:13         ` Peter Zijlstra
@ 2019-04-17 16:22           ` Waiman Long
  0 siblings, 0 replies; 112+ messages in thread
From: Waiman Long @ 2019-04-17 16:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On 04/17/2019 03:13 AM, Peter Zijlstra wrote:
> On Tue, Apr 16, 2019 at 05:07:26PM -0400, Waiman Long wrote:
>
>> Thinking about it again. I think I will just change its definition to
>> "((HZ + 249)/250)" for now to make sure that it is at least 1. The
> DIV_ROUND_UP()

Sure.

Thanks,
Longman


^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 07/16] locking/rwsem: Implement lock handoff to prevent lock starvation
  2019-04-17  7:35       ` Peter Zijlstra
@ 2019-04-17 16:35         ` Waiman Long
  0 siblings, 0 replies; 112+ messages in thread
From: Waiman Long @ 2019-04-17 16:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On 04/17/2019 03:35 AM, Peter Zijlstra wrote:
> On Tue, Apr 16, 2019 at 02:16:11PM -0400, Waiman Long wrote:
>
>>>> @@ -324,6 +364,12 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
>>>>  		adjustment -= RWSEM_FLAG_WAITERS;
>>>>  	}
>>>>  
>>>> +	/*
>>>> +	 * Clear the handoff flag
>>>> +	 */
>>> Right, but that is a trivial comment in the 'increment i' style, it
>>> clearly states what the code does, but completely fails to elucidate the
>>> code.
>>>
>>> Maybe:
>>>
>>> 	/*
>>> 	 * When we've woken a reader, we no longer need to force writers
>>> 	 * to give up the lock and we can clear HANDOFF.
>>> 	 */
>>>
>>> And I suppose this is required if we were the pickup of the handoff set
>>> above, but is there a guarantee that the HANDOFF was not set by a
>>> writer?
>> I can change the comment. The handoff bit is always cleared in
>> rwsem_try_write_lock() when the lock is successfully acquired. Will add a
>> comment to document that.
> That doesn't help much, because it drops ->wait_lock between setting it
> and acquiring it. So the read-acquire can interleave.
>
> I _think_ it works, but I'm having trouble explaining how exactly. I
> think because readers don't spin yet and thus wakeups abide by queue
> order.
>
> And the other way around should have (write) spinners terminate the
> moment they see HANDOFF set by a reader, but I'm not immediately seeing
> that either.
>
> I'll continue staring at that.
>
All writers acquire the lock by cmpxchg, and they check for the
handoff bit before attempting the acquire. So there is no way for write
spinners to acquire the lock after they see the handoff bit.
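
That is, the spinner's trylock is basically (a sketch of the relevant
check, with the '|' change you suggested):

	while (!RWSEM_COUNT_LOCKED_OR_HANDOFF(count)) {
		if (atomic_long_try_cmpxchg_acquire(&sem->count, &count,
					count | RWSEM_WRITER_LOCKED)) {
			rwsem_set_owner(sem);
			return true;
		}
	}
	/* locked, or HANDOFF set by someone else - give up */
	return false;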

Cheers,
Longman


^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 07/16] locking/rwsem: Implement lock handoff to prevent lock starvation
  2019-04-17  8:05       ` Peter Zijlstra
@ 2019-04-17 16:39         ` Waiman Long
  2019-04-18  8:22           ` Peter Zijlstra
  0 siblings, 1 reply; 112+ messages in thread
From: Waiman Long @ 2019-04-17 16:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On 04/17/2019 04:05 AM, Peter Zijlstra wrote:
> On Tue, Apr 16, 2019 at 02:16:11PM -0400, Waiman Long wrote:
>
>>>> @@ -608,56 +687,63 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
>>>>  	 */
>>>>  	waiter.task = current;
>>>>  	waiter.type = RWSEM_WAITING_FOR_WRITE;
>>>> +	waiter.timeout = jiffies + RWSEM_WAIT_TIMEOUT;
>>>>  
>>>>  	raw_spin_lock_irq(&sem->wait_lock);
>>>>  
>>>>  	/* account for this before adding a new element to the list */
>>>> +	wstate = list_empty(&sem->wait_list) ? WRITER_FIRST : WRITER_NOT_FIRST;
>>>>  
>>>>  	list_add_tail(&waiter.list, &sem->wait_list);
>>>>  
>>>>  	/* we're now waiting on the lock */
>>>> +	if (wstate == WRITER_NOT_FIRST) {
>>>>  		count = atomic_long_read(&sem->count);
>>>>  
>>>>  		/*
>>>> +		 * If there were already threads queued before us and:
>>>> +		 *  1) there are no active locks, wake the front
>>>> +		 *     queued process(es) as the handoff bit might be set.
>>>> +		 *  2) there are no active writers and some readers, the lock
>>>> +		 *     must be read owned; so we try to wake any read lock
>>>> +		 *     waiters that were queued ahead of us.
>>>>  		 */
>>>> +		if (!RWSEM_COUNT_LOCKED(count))
>>>> +			__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
>>>> +		else if (!(count & RWSEM_WRITER_MASK) &&
>>>> +			  (count & RWSEM_READER_MASK))
>>>>  			__rwsem_mark_wake(sem, RWSEM_WAKE_READERS, &wake_q);
>>> Does the above want to be something like:
>>>
>>> 		if (!(count & RWSEM_WRITER_LOCKED)) {
>>> 			__rwsem_mark_wake(sem, (count & RWSEM_READER_MASK) ?
>>> 					       RWSEM_WAKE_READERS :
>>> 					       RWSEM_WAKE_ANY, &wake_q);
>>> 		}
>> Yes.
>>
>>>> +		else
>>>> +			goto wait;
>>>>  
>>>> +		/*
>>>> +		 * The wakeup is normally called _after_ the wait_lock
>>>> +		 * is released, but given that we are proactively waking
>>>> +		 * readers we can deal with the wake_q overhead as it is
>>>> +		 * similar to releasing and taking the wait_lock again
>>>> +		 * for attempting rwsem_try_write_lock().
>>>> +		 */
>>>> +		wake_up_q(&wake_q);
>>> Hurmph.. the reason we do wake_up_q() outside of wait_lock is such that
>>> those tasks don't bounce on wait_lock. Also, it removes a great deal of
>>> hold-time from wait_lock.
>>>
>>> So I'm not sure I buy your argument here.
>>>
>> Actually, we don't want to release the wait_lock, do wake_up_q() and
>> acquire the wait_lock again as the state would have been changed. I
>> didn't change the comment on this patch, but will reword it to discuss that.
> I don't understand, we've queued ourselves, we're on the list, we're not
> first. How would dropping the lock to try and kick waiters before us be
> a problem?
>
> Sure, once we re-acquire the lock we have to re-evaluate @wstate to see
> if we're first now or not, but we need to do that anyway.
>
> So what is wrong with the below?
>
> --- a/include/linux/sched/wake_q.h
> +++ b/include/linux/sched/wake_q.h
> @@ -51,6 +51,11 @@ static inline void wake_q_init(struct wa
>  	head->lastp = &head->first;
>  }
>  
> +static inline bool wake_q_empty(struct wake_q_head *head)
> +{
> +	return head->first == WAKE_Q_TAIL;
> +}
> +
>  extern void wake_q_add(struct wake_q_head *head, struct task_struct *task);
>  extern void wake_q_add_safe(struct wake_q_head *head, struct task_struct *task);
>  extern void wake_up_q(struct wake_q_head *head);
> --- a/kernel/locking/rwsem.c
> +++ b/kernel/locking/rwsem.c
> @@ -700,25 +700,22 @@ __rwsem_down_write_failed_common(struct
>  		 *     must be read owned; so we try to wake any read lock
>  		 *     waiters that were queued ahead of us.
>  		 */
> -		if (!(count & RWSEM_LOCKED_MASK))
> -			__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
> -		else if (!(count & RWSEM_WRITER_MASK) &&
> -				(count & RWSEM_READER_MASK))
> -			__rwsem_mark_wake(sem, RWSEM_WAKE_READERS, &wake_q);
> -		else
> +		if (count & RWSEM_WRITER_LOCKED)
>  			goto wait;
> -		/*
> -		 * The wakeup is normally called _after_ the wait_lock
> -		 * is released, but given that we are proactively waking
> -		 * readers we can deal with the wake_q overhead as it is
> -		 * similar to releasing and taking the wait_lock again
> -		 * for attempting rwsem_try_write_lock().
> -		 */
> -		wake_up_q(&wake_q);
> -		/*
> -		 * Reinitialize wake_q after use.
> -		 */
> -		wake_q_init(&wake_q);
> +
> +		__rwsem_mark_wake(sem, (count & RWSEM_READER_MASK) ?
> +				RWSEM_WAKE_READERS :
> +				RWSEM_WAKE_ANY, &wake_q);
> +
> +		if (!wake_q_empty(&wake_q)) {
> +			raw_spin_unlock_irq(&sem->wait_lock);
> +			wake_up_q(&wake_q);
> +			/* used again, reinit */
> +			wake_q_init(&wake_q);
> +			raw_spin_lock_irq(&sem->wait_lock);
> +			if (rwsem_waiter_is_first(sem, &waiter))
> +				wstate = WRITER_FIRST;
> +		}
>  	} else {
>  		count = atomic_long_add_return(RWSEM_FLAG_WAITERS, &sem->count);
>  	}

Yes, we can certainly do that. My point is that I haven't changed the
existing logic regarding that wakeup; I only moved it around in the
patch. As it is not related to lock handoff, we can do it as a separate
patch.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 08/16] locking/rwsem: Make rwsem_spin_on_owner() return owner state
  2019-04-17  9:00   ` Peter Zijlstra
@ 2019-04-17 16:42     ` Waiman Long
  0 siblings, 0 replies; 112+ messages in thread
From: Waiman Long @ 2019-04-17 16:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On 04/17/2019 05:00 AM, Peter Zijlstra wrote:
> On Sat, Apr 13, 2019 at 01:22:51PM -0400, Waiman Long wrote:
>> This patch modifies rwsem_spin_on_owner() to return four possible
>> values to better reflect the state of lock holder which enables us to
>> make a better decision of what to do next.
>>
>> In the special case that there is no active lock and the handoff bit
>> is set, optimistic spinning has to be stopped.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>>  kernel/locking/rwsem.c | 45 +++++++++++++++++++++++++++++++++++-------
>>  1 file changed, 38 insertions(+), 7 deletions(-)
>>
>> diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
>> index aaab546a890d..2d6850c3e77b 100644
>> --- a/kernel/locking/rwsem.c
>> +++ b/kernel/locking/rwsem.c
>> @@ -156,6 +156,11 @@ static inline bool is_rwsem_owner_spinnable(struct task_struct *owner)
>>  	return !((unsigned long)owner & RWSEM_ANONYMOUSLY_OWNED);
>>  }
>>  
>> +static inline bool is_rwsem_owner_reader(struct task_struct *owner)
>> +{
>> +	return (unsigned long)owner & RWSEM_READER_OWNED;
>> +}
> Move this and the surrounding helpers into the RWSEM_SPIN_ON_OWNER
> block, it is only used there and that way all the code is together.

OK, will do that.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 08/16] locking/rwsem: Make rwsem_spin_on_owner() return owner state
  2019-04-17 10:19   ` Peter Zijlstra
@ 2019-04-17 16:53     ` Waiman Long
  0 siblings, 0 replies; 112+ messages in thread
From: Waiman Long @ 2019-04-17 16:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On 04/17/2019 06:19 AM, Peter Zijlstra wrote:
> On Sat, Apr 13, 2019 at 01:22:51PM -0400, Waiman Long wrote:
>> In the special case that there is no active lock and the handoff bit
>> is set, optimistic spinning has to be stopped.
> This makes me think this should've been _before_ you added the handoff
> bit. So that when you introduce the handoff, everything is solid.

I can move it forward if you think it is the right move.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 10/16] locking/rwsem: Wake up almost all readers in wait queue
  2019-04-17 13:39   ` Peter Zijlstra
@ 2019-04-17 17:16     ` Waiman Long
  0 siblings, 0 replies; 112+ messages in thread
From: Waiman Long @ 2019-04-17 17:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On 04/17/2019 09:39 AM, Peter Zijlstra wrote:
> On Sat, Apr 13, 2019 at 01:22:53PM -0400, Waiman Long wrote:
>> When the front of the wait queue is a reader, other readers
>> immediately following the first reader will also be woken up at the
>> same time. However, if there is a writer in between, those readers
>> behind the writer will not be woken up.
>>
>> Because of optimistic spinning, the lock acquisition order is not FIFO
>> anyway. The lock handoff mechanism will ensure that lock starvation
>> will not happen.
>>
>> Assuming that the lock hold times of the other readers still in the
>> queue will be about the same as the readers that are being woken up,
>> there is really not much additional cost other than the additional
>> latency due to the wakeup of additional tasks by the waker. Therefore
>> all the readers up to a maximum of 256 in the queue are woken up when
>> the first waiter is a reader to improve reader throughput.
>>
>> With a locking microbenchmark running on 5.1 based kernel, the total
>> locking rates (in kops/s) on a 8-socket IvyBridge-EX system with
>> equal numbers of readers and writers before and after this patch were
>> as follows:
>>
>>    # of Threads  Pre-Patch   Post-patch
>>    ------------  ---------   ----------
>>         4          1,641        1,674
>>         8            731        1,062
>>        16            564          924
>>        32             78          300
>>        64             38          195
>>       240             50          149
>>
>> There is no performance gain at low contention level. At high contention
>> level, however, this patch gives a pretty decent performance boost.
> Right, so this basically completes the convertion from task-fair (FIFO)
> to phase-fair.
>
> https://cs.unc.edu/~anderson/papers/rtsj10-for-web.pdf

Right, the changes that I am making are similar in concept to the
phase-fair rwlock described in the article. It is an interesting read,
though I was not aware of it before you brought it up.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 11/16] locking/rwsem: Enable readers spinning on writer
  2019-04-17 13:56   ` Peter Zijlstra
@ 2019-04-17 17:34     ` Waiman Long
  2019-04-18  8:57       ` Peter Zijlstra
  0 siblings, 1 reply; 112+ messages in thread
From: Waiman Long @ 2019-04-17 17:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On 04/17/2019 09:56 AM, Peter Zijlstra wrote:
> On Sat, Apr 13, 2019 at 01:22:54PM -0400, Waiman Long wrote:
>> @@ -549,7 +582,7 @@ static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
>>  	return !owner ? OWNER_NULL : OWNER_READER;
>>  }
>>  
>> -static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
>> +static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
>>  {
>>  	bool taken = false;
>>  	bool is_rt_task = rt_task(current);
>> @@ -558,9 +591,6 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
>>  	preempt_disable();
>>  
>>  	/* sem->wait_lock should not be held when doing optimistic spinning */
>> -	if (!rwsem_can_spin_on_owner(sem))
>> -		goto done;
>> -
>>  	if (!osq_lock(&sem->osq))
>>  		goto done;
>>  
>> @@ -580,10 +610,11 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
>>  		/*
>>  		 * Try to acquire the lock
>>  		 */
>> -		if (rwsem_try_write_lock_unqueued(sem)) {
>> -			taken = true;
>> +		taken = wlock ? rwsem_try_write_lock_unqueued(sem)
>> +			      : rwsem_try_read_lock_unqueued(sem);
>> +
>> +		if (taken)
>>  			break;
>> -		}
>>  
>>  		/*
>>  		 * An RT task cannot do optimistic spinning if it cannot
> Alternatively you pass the trylock function as an argument:
>
> static bool rwsem_optimistic_spin(struct rw_semaphore *sem,
> 				  bool (*trylock)(struct rw_semaphore *sem))
> {
> 	...
> 		if (trylock(sem)) {
> 			taken = true;
> 			goto unlock;
> 		}
> 	...
> }
>
With retpoline, an indirect function call will be slower.

Cheers,
Longman



^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 11/16] locking/rwsem: Enable readers spinning on writer
  2019-04-17 13:58   ` Peter Zijlstra
@ 2019-04-17 17:45     ` Waiman Long
  2019-04-18  9:00       ` Peter Zijlstra
  0 siblings, 1 reply; 112+ messages in thread
From: Waiman Long @ 2019-04-17 17:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On 04/17/2019 09:58 AM, Peter Zijlstra wrote:
> On Sat, Apr 13, 2019 at 01:22:54PM -0400, Waiman Long wrote:
>> +/*
>> + * Try to acquire read lock before the reader is put on wait queue.
>> + * Lock acquisition isn't allowed if the rwsem is locked or a writer handoff
>> + * is ongoing.
>> + */
>> +static inline bool rwsem_try_read_lock_unqueued(struct rw_semaphore *sem)
>> +{
>> +	long count = atomic_long_read(&sem->count);
>> +
>> +	if (RWSEM_COUNT_WLOCKED_OR_HANDOFF(count))
>> +		return false;
>> +
>> +	count = atomic_long_fetch_add_acquire(RWSEM_READER_BIAS, &sem->count);
>> +	if (!RWSEM_COUNT_WLOCKED_OR_HANDOFF(count)) {
>> +		rwsem_set_reader_owned(sem);
>> +		lockevent_inc(rwsem_opt_rlock);
>> +		return true;
>> +	}
>> +
>> +	/* Back out the change */
>> +	atomic_long_add(-RWSEM_READER_BIAS, &sem->count);
>> +	return false;
>> +}
> Doesn't a cmpxchg 'loop' make more sense here?

Not really. A cmpxchg loop will have one more correctable failure mode -
a new reader acquiring the lock or a reader owner doing an unlock will
force a retry. Failures caused by the setting of the handoff bit or by a
writer acquiring the lock are the same for both variants. I don't see
any advantage in using a cmpxchg loop.
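
For reference, a cmpxchg-based variant would look roughly like the
sketch below (illustrative only, reusing the helpers from the patch; it
is not what the patch does). Every failed try_cmpxchg is another acquire
atomic, and any concurrent change of the reader count forces a retry:

static inline bool rwsem_try_read_lock_unqueued_cmpxchg(struct rw_semaphore *sem)
{
	long count = atomic_long_read(&sem->count);

	do {
		if (RWSEM_COUNT_WLOCKED_OR_HANDOFF(count))
			return false;
		/* count is refreshed by try_cmpxchg on failure */
	} while (!atomic_long_try_cmpxchg_acquire(&sem->count, &count,
						  count + RWSEM_READER_BIAS));

	rwsem_set_reader_owned(sem);
	lockevent_inc(rwsem_opt_rlock);
	return true;
}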

Cheers,
Longman



^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 11/16] locking/rwsem: Enable readers spinning on writer
  2019-04-17 14:05   ` Peter Zijlstra
@ 2019-04-17 17:51     ` Waiman Long
  2019-04-18  9:11       ` Peter Zijlstra
  0 siblings, 1 reply; 112+ messages in thread
From: Waiman Long @ 2019-04-17 17:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On 04/17/2019 10:05 AM, Peter Zijlstra wrote:
> On Sat, Apr 13, 2019 at 01:22:54PM -0400, Waiman Long wrote:
>> @@ -650,6 +686,33 @@ __rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
>>  	struct rwsem_waiter waiter;
>>  	DEFINE_WAKE_Q(wake_q);
>>  
>> +	if (!rwsem_can_spin_on_owner(sem))
>> +		goto queue;
>> +
>> +	/*
>> +	 * Undo read bias from down_read() and do optimistic spinning.
>> +	 */
>> +	atomic_long_add(-RWSEM_READER_BIAS, &sem->count);
>> +	adjustment = 0;
>> +	if (rwsem_optimistic_spin(sem, false)) {
>> +		unsigned long flags;
>> +
>> +		/*
>> +		 * Opportunistically wake up other readers in the wait queue.
>> +		 * It has another chance of wakeup at unlock time.
>> +		 */
>> +		if ((atomic_long_read(&sem->count) & RWSEM_FLAG_WAITERS) &&
>> +		    raw_spin_trylock_irqsave(&sem->wait_lock, flags)) {
> why trylock? Also the rest of this function uses _irq(). Having had to
> define @flags should've been a clue.
>
> You simply cnanot do down_*() with IRQs disabled.

I used trylock to avoid getting stuck in the spinlock while holding a
read lock on the rwsem.

You are right. I don't need to use the _irqsave version; just the _irq
version is enough.

Thanks,
Longman



^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 08/16] locking/rwsem: Make rwsem_spin_on_owner() return owner state
  2019-04-17 12:47     ` Peter Zijlstra
@ 2019-04-17 18:29       ` Waiman Long
  2019-04-18  8:39         ` Peter Zijlstra
  0 siblings, 1 reply; 112+ messages in thread
From: Waiman Long @ 2019-04-17 18:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On 04/17/2019 08:47 AM, Peter Zijlstra wrote:
> On Wed, Apr 17, 2019 at 02:41:01PM +0200, Peter Zijlstra wrote:
>> On Sat, Apr 13, 2019 at 01:22:51PM -0400, Waiman Long wrote:
>>> In the special case that there is no active lock and the handoff bit
>>> is set, optimistic spinning has to be stopped.
>>> @@ -500,9 +521,19 @@ static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem)
>>>  
>>>  	/*
>>>  	 * If there is a new owner or the owner is not set, we continue
>>> -	 * spinning.
>>> +	 * spinning except when there are no active locks and the handoff bit
>>> +	 * is set. In this case, we have to stop spinning.
>>>  	 */
>>> -	return is_rwsem_owner_spinnable(READ_ONCE(sem->owner));
>>> +	owner = READ_ONCE(sem->owner);
>>> +	if (!is_rwsem_owner_spinnable(owner))
>>> +		return OWNER_NONSPINNABLE;
>>> +	if (owner && !is_rwsem_owner_reader(owner))
>>> +		return OWNER_WRITER;
>>> +
>>> +	count = atomic_long_read(&sem->count);
>>> +	if (RWSEM_COUNT_HANDOFF(count) && !RWSEM_COUNT_LOCKED(count))
>>> +		return OWNER_NONSPINNABLE;
>>> +	return !owner ? OWNER_NULL : OWNER_READER;
>>>  }
>> So this fixes a straight up bug in the last patch (and thus should be
>> done before so the bug never exists), and creates unreadable code while
>> at it.
>>
>> Also, I think only checking HANDOFF after the loop is wrong; the moment
>> HANDOFF happens you have to terminate the loop, irrespective of what
>> @owner does.
>>
>> Does something like so work?
>>
>> ---
>>
>> enum owner_state {
>> 	OWNER_NULL		= 1 << 0,
>> 	OWNER_WRITER		= 1 << 1,
>> 	OWNER_READER		= 1 << 2,
>> 	OWNER_NONSPINNABLE	= 1 << 3,
>> };
>> #define OWNER_SPINNABLE		(OWNER_NULL | OWNER_WRITER)
> Hmm, we should not spin on OWNER_NULL. Or at least not mixed in with the
> patch that changes the shape of all this. That should go in the RT
> thingy patch, which comes after this.

We do spin on OWNER_NULL right now, not in rwsem_spin_on_owner() but in
the main rwsem_optimistic_spin() function.

An RT task will quit if the owner is NULL.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 09/16] locking/rwsem: Ensure an RT task will not spin on reader
  2019-04-17 13:18   ` Peter Zijlstra
@ 2019-04-17 18:47     ` Waiman Long
  2019-04-18  8:52       ` Peter Zijlstra
  0 siblings, 1 reply; 112+ messages in thread
From: Waiman Long @ 2019-04-17 18:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On 04/17/2019 09:18 AM, Peter Zijlstra wrote:
> On Sat, Apr 13, 2019 at 01:22:52PM -0400, Waiman Long wrote:
>> An RT task can do optimistic spinning only if the lock holder is
>> actually running. If the state of the lock holder isn't known, there
>> is a possibility that high priority of the RT task may block forward
>> progress of the lock holder if it happens to reside on the same CPU.
>> This will lead to deadlock. So we have to make sure that an RT task
>> will not spin on a reader-owned rwsem.
>>
>> When the owner is temporarily set to NULL, it is more tricky to decide
>> if an RT task should stop spinning as it may be a temporary state
>> where another writer may have just stolen the lock which then failed
>> the task's trylock attempt. So one more retry is allowed to make sure
>> that the lock is not spinnable by an RT task.
>>
>> When testing on a 8-socket IvyBridge-EX system, the one additional retry
>> seems to improve locking performance of RT write locking threads under
>> heavy contentions. The table below shows the locking rates (in kops/s)
>> with various write locking threads before and after the patch.
>>
>>     Locking threads     Pre-patch     Post-patch
>>     ---------------     ---------     -----------
>>             4             2,753          2,608
>>             8             2,529          2,520
>>            16             1,727          1,918
>>            32             1,263          1,956
>>            64               889          1,343
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>>  kernel/locking/rwsem.c | 36 +++++++++++++++++++++++++++++-------
>>  1 file changed, 29 insertions(+), 7 deletions(-)
>>
>> diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
>> index 2d6850c3e77b..8e19b5141595 100644
>> --- a/kernel/locking/rwsem.c
>> +++ b/kernel/locking/rwsem.c
>> @@ -539,6 +539,8 @@ static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
>>  static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
>>  {
>>  	bool taken = false;
>> +	bool is_rt_task = rt_task(current);
> Arguably this is wrong; a remote CPU could change the scheduling
> atributes of this task while it is spinning. In practise I don't think
> we do that without forcing a reschedule, but in theory we could if we
> find the task is current anyway.

Will move the check back to the main loop.

>> +	int prev_owner_state = OWNER_NULL;
>>  
>>  	preempt_disable();
>>  
>> @@ -556,7 +558,12 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
>>  	 *  2) readers own the lock as we can't determine if they are
>>  	 *     actively running or not.
>>  	 */
>> -	while (rwsem_spin_on_owner(sem) & OWNER_SPINNABLE) {
>> +	for (;;) {
>> +		enum owner_state owner_state = rwsem_spin_on_owner(sem);
>> +
>> +		if (!(owner_state & OWNER_SPINNABLE))
>> +			break;
>> +
>>  		/*
>>  		 * Try to acquire the lock
>>  		 */
>> @@ -566,13 +573,28 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
>>  		}
>>  
>>  		/*
>> -		 * When there's no owner, we might have preempted between the
>> -		 * owner acquiring the lock and setting the owner field. If
>> -		 * we're an RT task that will live-lock because we won't let
>> -		 * the owner complete.
>> +		 * An RT task cannot do optimistic spinning if it cannot
>> +		 * be sure the lock holder is running or live-lock may
>> +		 * happen if the current task and the lock holder happen
>> +		 * to run in the same CPU.
>> +		 *
>> +		 * When there's no owner or the lock is reader-owned, an RT task
>> +		 * will stop spinning if the owner state is not a writer
>> +		 * at the previous iteration of the loop. This allows the
>> +		 * RT task to recheck if the task that steals the lock is
>> +		 * a spinnable writer. If so, it can keep on spinning.
>> +		 *
>> +		 * If the owner is a writer, the need_resched() check is
>> +		 * done inside rwsem_spin_on_owner(). If the owner is not
>> +		 * a writer, need_resched() check needs to be done here.
>>  		 */
>> -		if (!sem->owner && (need_resched() || rt_task(current)))
>> -			break;
>> +		if (owner_state != OWNER_WRITER) {
>> +			if (need_resched())
>> +				break;
>> +			if (is_rt_task && (prev_owner_state != OWNER_WRITER))
>> +				break;
>> +		}
>> +		prev_owner_state = owner_state;
>>  
>>  		/*
>>  		 * The cpu_relax() call is a compiler barrier which forces
> This patch confuses me mightily. I mean, I see what it does, but I can't
> figure out why. The Changelog is just one big source of confusion.

Sorry for confusing you. If count and owner are separate, there is a
time lag where the owner is NULL, but the lock is not free yet.
Similarly, the lock could be free but another task may have stolen the
lock if the waiter bit isn't set. In the former case, an extra iteration
gives it more time for the lock holder to release the lock. In the
latter case, if the new lock owner is a writer and sets the owner in time,
the RT task can keep on spinning. Will clarify that in the commit log
and the comment.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 08/16] locking/rwsem: Make rwsem_spin_on_owner() return owner state
  2019-04-17 13:00     ` Peter Zijlstra
@ 2019-04-17 18:50       ` Waiman Long
  0 siblings, 0 replies; 112+ messages in thread
From: Waiman Long @ 2019-04-17 18:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On 04/17/2019 09:00 AM, Peter Zijlstra wrote:
> On Wed, Apr 17, 2019 at 02:41:01PM +0200, Peter Zijlstra wrote:
>> On Sat, Apr 13, 2019 at 01:22:51PM -0400, Waiman Long wrote:
>>> In the special case that there is no active lock and the handoff bit
>>> is set, optimistic spinning has to be stopped.
>>> @@ -500,9 +521,19 @@ static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem)
>>>  
>>>  	/*
>>>  	 * If there is a new owner or the owner is not set, we continue
>>> -	 * spinning.
>>> +	 * spinning except when there are no active locks and the handoff bit
>>> +	 * is set. In this case, we have to stop spinning.
>>>  	 */
>>> -	return is_rwsem_owner_spinnable(READ_ONCE(sem->owner));
>>> +	owner = READ_ONCE(sem->owner);
>>> +	if (!is_rwsem_owner_spinnable(owner))
>>> +		return OWNER_NONSPINNABLE;
>>> +	if (owner && !is_rwsem_owner_reader(owner))
>>> +		return OWNER_WRITER;
>>> +
>>> +	count = atomic_long_read(&sem->count);
>>> +	if (RWSEM_COUNT_HANDOFF(count) && !RWSEM_COUNT_LOCKED(count))
>>> +		return OWNER_NONSPINNABLE;
>>> +	return !owner ? OWNER_NULL : OWNER_READER;
>>>  }
>> So this fixes a straight up bug in the last patch (and thus should be
>> done before so the bug never exists), and creates unreadable code while
>> at it.
>>
>> Also, I think only checking HANDOFF after the loop is wrong; the moment
>> HANDOFF happens you have to terminate the loop, irrespective of what
>> @owner does.
>>
>> Does something like so work?
>>
>> ---
> enum owner_state {
> 	OWNER_NULL		= 1 << 0,
> 	OWNER_WRITER		= 1 << 1,
> 	OWNER_READER		= 1 << 2,
> 	OWNER_NONSPINNABLE	= 1 << 3,
> };
> #define OWNER_SPINNABLE		(OWNER_NULL | OWNER_WRITER)
>
> static inline enum owner_state rwsem_owner_state(unsigned long owner)
> {
> 	if (!owner)
> 		return OWNER_NULL;
>
> 	if (owner & RWSEM_ANONYMOUSLY_OWNED)
> 		return OWNER_NONSPINNABLE;
>
> 	if (owner & RWSEM_READER_OWNER)
> 		return OWNER_READER;
>
> 	return OWNER_WRITER;
> }
>
> static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
> {
> 	struct task_struct *tmp, *owner = READ_ONCE(sem->owner);
> 	enum owner_state state = rwsem_owner_state((unsigned long)owner);
>
> 	if (state != OWNER_WRITER)
> 		return state;
>
> 	rcu_read_lock();
> 	for (;;) {
> 		if (atomic_long_read(&sem->count) & RWSEM_FLAG_HANDOFF) {
> 			state = OWNER_NONSPINNABLE;
> 			break;
> 		}
>
> 		tmp = READ_ONCE(sem->owner);
> 		if (tmp != owner) {
> 			state = rwsem_owner_state((unsigned long)tmp);
> 			break;
> 		}
>
> 		/*
> 		 * Ensure we emit the owner->on_cpu, dereference _after_
> 		 * checking sem->owner still matches owner, if that fails,
> 		 * owner might point to free()d memory, if it still matches,
> 		 * the rcu_read_lock() ensures the memory stays valid.
> 		 */
> 		barrier();
>
> 		if (need_resched() || !owner_on_cpu(owner)) {
> 			state = OWNER_NONSPINNABLE;
> 			break;
> 		}
>
> 		cpu_relax();
> 	}
> 	rcu_read_unlock();
>
> 	return state;
> }

That code looks good to me. Thanks for the rewrite.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 112+ messages in thread

* [tip:locking/core] locking/rwsem: Prevent unneeded warning during locking selftest
  2019-04-13 17:22 ` [PATCH v4 01/16] locking/rwsem: Prevent unneeded warning during locking selftest Waiman Long
@ 2019-04-18  8:04   ` tip-bot for Waiman Long
  0 siblings, 0 replies; 112+ messages in thread
From: tip-bot for Waiman Long @ 2019-04-18  8:04 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tglx, huang.ying.caritas, linux-kernel, peterz, dave, torvalds,
	mingo, hpa, tim.c.chen, longman, will.deacon

Commit-ID:  26536e7c242e2b0f73c25c46fc50d2525ebe400b
Gitweb:     https://git.kernel.org/tip/26536e7c242e2b0f73c25c46fc50d2525ebe400b
Author:     Waiman Long <longman@redhat.com>
AuthorDate: Sat, 13 Apr 2019 13:22:44 -0400
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Sun, 14 Apr 2019 11:09:35 +0200

locking/rwsem: Prevent unneeded warning during locking selftest

Disable the DEBUG_RWSEMS check when locking selftest is running with
debug_locks_silent flag set.

Signed-off-by: Waiman Long <longman@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: http://lkml.kernel.org/r/20190413172259.2740-2-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/locking/rwsem.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/locking/rwsem.h b/kernel/locking/rwsem.h
index 37db17890e36..64877f5294e3 100644
--- a/kernel/locking/rwsem.h
+++ b/kernel/locking/rwsem.h
@@ -30,7 +30,8 @@
 
 #ifdef CONFIG_DEBUG_RWSEMS
 # define DEBUG_RWSEMS_WARN_ON(c, sem)	do {			\
-	if (WARN_ONCE(c, "DEBUG_RWSEMS_WARN_ON(%s): count = 0x%lx, owner = 0x%lx, curr 0x%lx, list %sempty\n",\
+	if (!debug_locks_silent &&				\
+	    WARN_ONCE(c, "DEBUG_RWSEMS_WARN_ON(%s): count = 0x%lx, owner = 0x%lx, curr 0x%lx, list %sempty\n",\
 		#c, atomic_long_read(&(sem)->count),		\
 		(long)((sem)->owner), (long)current,		\
 		list_empty(&(sem)->wait_list) ? "" : "not "))	\

^ permalink raw reply related	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 07/16] locking/rwsem: Implement lock handoff to prevent lock starvation
  2019-04-17 16:39         ` Waiman Long
@ 2019-04-18  8:22           ` Peter Zijlstra
  0 siblings, 0 replies; 112+ messages in thread
From: Peter Zijlstra @ 2019-04-18  8:22 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On Wed, Apr 17, 2019 at 12:39:19PM -0400, Waiman Long wrote:
> On 04/17/2019 04:05 AM, Peter Zijlstra wrote:
> > So what is wrong with the below?
> >
> > --- a/include/linux/sched/wake_q.h
> > +++ b/include/linux/sched/wake_q.h
> > @@ -51,6 +51,11 @@ static inline void wake_q_init(struct wa
> >  	head->lastp = &head->first;
> >  }
> >  
> > +static inline bool wake_q_empty(struct wake_q_head *head)
> > +{
> > +	return head->first == WAKE_Q_TAIL;
> > +}
> > +
> >  extern void wake_q_add(struct wake_q_head *head, struct task_struct *task);
> >  extern void wake_q_add_safe(struct wake_q_head *head, struct task_struct *task);
> >  extern void wake_up_q(struct wake_q_head *head);
> > --- a/kernel/locking/rwsem.c
> > +++ b/kernel/locking/rwsem.c
> > @@ -700,25 +700,22 @@ __rwsem_down_write_failed_common(struct
> >  		 *     must be read owned; so we try to wake any read lock
> >  		 *     waiters that were queued ahead of us.
> >  		 */
> > -		if (!(count & RWSEM_LOCKED_MASK))
> > -			__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
> > -		else if (!(count & RWSEM_WRITER_MASK) &&
> > -				(count & RWSEM_READER_MASK))
> > -			__rwsem_mark_wake(sem, RWSEM_WAKE_READERS, &wake_q);
> > -		else
> > +		if (count & RWSEM_WRITER_LOCKED)
> >  			goto wait;
> > -		/*
> > -		 * The wakeup is normally called _after_ the wait_lock
> > -		 * is released, but given that we are proactively waking
> > -		 * readers we can deal with the wake_q overhead as it is
> > -		 * similar to releasing and taking the wait_lock again
> > -		 * for attempting rwsem_try_write_lock().
> > -		 */
> > -		wake_up_q(&wake_q);
> > -		/*
> > -		 * Reinitialize wake_q after use.
> > -		 */
> > -		wake_q_init(&wake_q);
> > +
> > +		__rwsem_mark_wake(sem, (count & RWSEM_READER_MASK) ?
> > +				RWSEM_WAKE_READERS :
> > +				RWSEM_WAKE_ANY, &wake_q);
> > +
> > +		if (!wake_q_empty(&wake_q)) {
> > +			raw_spin_unlock_irq(&sem->wait_lock);
> > +			wake_up_q(&wake_q);
> > +			/* used again, reinit */
> > +			wake_q_init(&wake_q);
> > +			raw_spin_lock_irq(&sem->wait_lock);
> > +			if (rwsem_waiter_is_first(sem, &waiter))
> > +				wstate = WRITER_FIRST;
> > +		}
> >  	} else {
> >  		count = atomic_long_add_return(RWSEM_FLAG_WAITERS, &sem->count);
> >  	}
> 
> Yes, we can certainly do that. My point is that I haven't changed the
> existing logic regarding that wakeup; I only moved it around in the
> patch. As it is not related to lock handoff, we can do it as a separate
> patch.

Ah, I missed that the old code did that too (too much looking at the new
code I suppose). Then yes, a separate patch fixing this would be good.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 08/16] locking/rwsem: Make rwsem_spin_on_owner() return owner state
  2019-04-17 18:29       ` Waiman Long
@ 2019-04-18  8:39         ` Peter Zijlstra
  0 siblings, 0 replies; 112+ messages in thread
From: Peter Zijlstra @ 2019-04-18  8:39 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On Wed, Apr 17, 2019 at 02:29:02PM -0400, Waiman Long wrote:
> On 04/17/2019 08:47 AM, Peter Zijlstra wrote:
> >>
> >> enum owner_state {
> >> 	OWNER_NULL		= 1 << 0,
> >> 	OWNER_WRITER		= 1 << 1,
> >> 	OWNER_READER		= 1 << 2,
> >> 	OWNER_NONSPINNABLE	= 1 << 3,
> >> };
> >> #define OWNER_SPINNABLE		(OWNER_NULL | OWNER_WRITER)
> > Hmm, we should not spin on OWNER_NULL. Or at least not mixed in with the
> > patch that changes the shape of all this. That should go in the RT
> > thingy patch, which comes after this.
> 
> We do spin on OWNER_NULL right now, not in rwsem_spin_on_owner() but in
> the main rwsem_optimistic_spin() function.
> 
> An RT task will quit if the owner is NULL.

Yeah, I figured it out eventually :-)

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 09/16] locking/rwsem: Ensure an RT task will not spin on reader
  2019-04-17 18:47     ` Waiman Long
@ 2019-04-18  8:52       ` Peter Zijlstra
  2019-04-18 13:27         ` Waiman Long
  0 siblings, 1 reply; 112+ messages in thread
From: Peter Zijlstra @ 2019-04-18  8:52 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On Wed, Apr 17, 2019 at 02:47:07PM -0400, Waiman Long wrote:
> >> @@ -566,13 +573,28 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
> >>  		}
> >>  
> >>  		/*
> >> -		 * When there's no owner, we might have preempted between the
> >> -		 * owner acquiring the lock and setting the owner field. If
> >> -		 * we're an RT task that will live-lock because we won't let
> >> -		 * the owner complete.
> >> +		 * An RT task cannot do optimistic spinning if it cannot
> >> +		 * be sure the lock holder is running or live-lock may
> >> +		 * happen if the current task and the lock holder happen
> >> +		 * to run in the same CPU.
> >> +		 *
> >> +		 * When there's no owner or the lock is reader-owned, an RT task
> >> +		 * will stop spinning if the owner state is not a writer
> >> +		 * at the previous iteration of the loop. This allows the
> >> +		 * RT task to recheck if the task that steals the lock is
> >> +		 * a spinnable writer. If so, it can keep on spinning.
> >> +		 *
> >> +		 * If the owner is a writer, the need_resched() check is
> >> +		 * done inside rwsem_spin_on_owner(). If the owner is not
> >> +		 * a writer, need_resched() check needs to be done here.
> >>  		 */
> >> -		if (!sem->owner && (need_resched() || rt_task(current)))
> >> -			break;
> >> +		if (owner_state != OWNER_WRITER) {
> >> +			if (need_resched())
> >> +				break;
> >> +			if (is_rt_task && (prev_owner_state != OWNER_WRITER))
> >> +				break;
> >> +		}
> >> +		prev_owner_state = owner_state;
> >>  
> >>  		/*
> >>  		 * The cpu_relax() call is a compiler barrier which forces
> > This patch confuses me mightily. I mean, I see what it does, but I can't
> > figure out why. The Changelog is just one big source of confusion.
> 
> Sorry for confusing you. If count and owner are separate, there is a
> time lag where the owner is NULL, but the lock is not free yet.

Right.

> Similarly, the lock could be free but another task may have stolen the
> lock if the waiter bit isn't set.

> In the former case,

(free)

> an extra iteration gives it more time for the lock holder to release
> the lock.


> In the latter case,

(stolen)

> if the new lock owner is a writer and set owner in time,
> the RT task can keep on spinning. Will clarify that in the commit log
> and the comment.

Blergh.. so by going around once extra, you hope ->owner will be set
again and we keep spinning. And this is actually measurable.

Yuck yuck yuck. I much prefer getting rid of that hole, as you do later
on in the series, which would avoid this complexity. Let me continue
reading...




^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 11/16] locking/rwsem: Enable readers spinning on writer
  2019-04-17 17:34     ` Waiman Long
@ 2019-04-18  8:57       ` Peter Zijlstra
  2019-04-18 14:35         ` Waiman Long
  0 siblings, 1 reply; 112+ messages in thread
From: Peter Zijlstra @ 2019-04-18  8:57 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On Wed, Apr 17, 2019 at 01:34:01PM -0400, Waiman Long wrote:
> On 04/17/2019 09:56 AM, Peter Zijlstra wrote:
> > On Sat, Apr 13, 2019 at 01:22:54PM -0400, Waiman Long wrote:
> >> @@ -549,7 +582,7 @@ static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
> >>  	return !owner ? OWNER_NULL : OWNER_READER;
> >>  }
> >>  
> >> -static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
> >> +static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
> >>  {
> >>  	bool taken = false;
> >>  	bool is_rt_task = rt_task(current);
> >> @@ -558,9 +591,6 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
> >>  	preempt_disable();
> >>  
> >>  	/* sem->wait_lock should not be held when doing optimistic spinning */
> >> -	if (!rwsem_can_spin_on_owner(sem))
> >> -		goto done;
> >> -
> >>  	if (!osq_lock(&sem->osq))
> >>  		goto done;
> >>  
> >> @@ -580,10 +610,11 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
> >>  		/*
> >>  		 * Try to acquire the lock
> >>  		 */
> >> -		if (rwsem_try_write_lock_unqueued(sem)) {
> >> -			taken = true;
> >> +		taken = wlock ? rwsem_try_write_lock_unqueued(sem)
> >> +			      : rwsem_try_read_lock_unqueued(sem);
> >> +
> >> +		if (taken)
> >>  			break;
> >> -		}
> >>  
> >>  		/*
> >>  		 * An RT task cannot do optimistic spinning if it cannot
> > Alternatively you pass the trylock function as an argument:
> >
> > static bool rwsem_optimistic_spin(struct rw_semaphore *sem,
> > 				  bool (*trylock)(struct rw_semaphore *sem))
> > {
> > 	...
> > 		if (trylock(sem)) {
> > 			taken = true;
> > 			goto unlock;
> > 		}
> > 	...
> > }
> >
> With retpoline, an indirect function call will be slower.

With compiler optimization we can avoid that. Just mark the function as
__always_inline, there's only two call-sites, each with a different
trylock.

It might have already done that anyway, and used constant propagation
on your bool, but the function pointer one is far easier to read.
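
Something like the below (sketch only, reusing the helpers from your
series and leaving out the RT and handoff details):

static __always_inline bool
rwsem_optimistic_spin(struct rw_semaphore *sem,
		      bool (*trylock)(struct rw_semaphore *sem))
{
	bool taken = false;

	preempt_disable();
	if (!osq_lock(&sem->osq))
		goto done;

	while (rwsem_spin_on_owner(sem) & OWNER_SPINNABLE) {
		/* inlining + constant propagation makes this a direct call */
		if (trylock(sem)) {
			taken = true;
			break;
		}
		if (need_resched())
			break;
		cpu_relax();
	}
	osq_unlock(&sem->osq);
done:
	preempt_enable();
	return taken;
}

/*
 * The two call sites then become:
 *
 *	rwsem_optimistic_spin(sem, rwsem_try_write_lock_unqueued);
 *	rwsem_optimistic_spin(sem, rwsem_try_read_lock_unqueued);
 *
 * and after inlining no indirect call (hence no retpoline) is left.
 */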

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 11/16] locking/rwsem: Enable readers spinning on writer
  2019-04-17 17:45     ` Waiman Long
@ 2019-04-18  9:00       ` Peter Zijlstra
  2019-04-18 13:40         ` Waiman Long
  0 siblings, 1 reply; 112+ messages in thread
From: Peter Zijlstra @ 2019-04-18  9:00 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On Wed, Apr 17, 2019 at 01:45:10PM -0400, Waiman Long wrote:
> On 04/17/2019 09:58 AM, Peter Zijlstra wrote:
> > On Sat, Apr 13, 2019 at 01:22:54PM -0400, Waiman Long wrote:
> >> +/*
> >> + * Try to acquire read lock before the reader is put on wait queue.
> >> + * Lock acquisition isn't allowed if the rwsem is locked or a writer handoff
> >> + * is ongoing.
> >> + */
> >> +static inline bool rwsem_try_read_lock_unqueued(struct rw_semaphore *sem)
> >> +{
> >> +	long count = atomic_long_read(&sem->count);
> >> +
> >> +	if (RWSEM_COUNT_WLOCKED_OR_HANDOFF(count))
> >> +		return false;
> >> +
> >> +	count = atomic_long_fetch_add_acquire(RWSEM_READER_BIAS, &sem->count);
> >> +	if (!RWSEM_COUNT_WLOCKED_OR_HANDOFF(count)) {
> >> +		rwsem_set_reader_owned(sem);
> >> +		lockevent_inc(rwsem_opt_rlock);
> >> +		return true;
> >> +	}
> >> +
> >> +	/* Back out the change */
> >> +	atomic_long_add(-RWSEM_READER_BIAS, &sem->count);
> >> +	return false;
> >> +}
> > Doesn't a cmpxchg 'loop' make more sense here?
> 
>> Not really. A cmpxchg loop will have one more correctable failure mode -
>> a new reader acquiring the lock or a reader owner doing an unlock will
>> force a retry. Failures caused by the setting of the handoff bit or by a
>> writer acquiring the lock are the same for both variants. I don't see
>> any advantage in using a cmpxchg loop.

It depends on how many failures vs successes you have. I was expecting
failure to be the most common case, and then you go from 2 atomics to 1.



^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 11/16] locking/rwsem: Enable readers spinning on writer
  2019-04-17 17:51     ` Waiman Long
@ 2019-04-18  9:11       ` Peter Zijlstra
  2019-04-18 14:37         ` Waiman Long
  0 siblings, 1 reply; 112+ messages in thread
From: Peter Zijlstra @ 2019-04-18  9:11 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On Wed, Apr 17, 2019 at 01:51:24PM -0400, Waiman Long wrote:
> On 04/17/2019 10:05 AM, Peter Zijlstra wrote:
> > On Sat, Apr 13, 2019 at 01:22:54PM -0400, Waiman Long wrote:
> >> @@ -650,6 +686,33 @@ __rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
> >>  	struct rwsem_waiter waiter;
> >>  	DEFINE_WAKE_Q(wake_q);
> >>  
> >> +	if (!rwsem_can_spin_on_owner(sem))
> >> +		goto queue;
> >> +
> >> +	/*
> >> +	 * Undo read bias from down_read() and do optimistic spinning.
> >> +	 */
> >> +	atomic_long_add(-RWSEM_READER_BIAS, &sem->count);
> >> +	adjustment = 0;
> >> +	if (rwsem_optimistic_spin(sem, false)) {
> >> +		unsigned long flags;
> >> +
> >> +		/*
> >> +		 * Opportunistically wake up other readers in the wait queue.
> >> +		 * It has another chance of wakeup at unlock time.
> >> +		 */
> >> +		if ((atomic_long_read(&sem->count) & RWSEM_FLAG_WAITERS) &&
> >> +		    raw_spin_trylock_irqsave(&sem->wait_lock, flags)) {
> > why trylock?

> I used trylock to avoid getting stuck in the spinlock while holding a
> read lock on the rwsem.

Is that a real concern? I would think that not waking further readers
would, esp. under high contention, be a bigger deal.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 12/16] locking/rwsem: Enable time-based spinning on reader-owned rwsem
  2019-04-13 17:22 ` [PATCH v4 12/16] locking/rwsem: Enable time-based spinning on reader-owned rwsem Waiman Long
@ 2019-04-18 13:06   ` Peter Zijlstra
  2019-04-18 15:15     ` Waiman Long
  0 siblings, 1 reply; 112+ messages in thread
From: Peter Zijlstra @ 2019-04-18 13:06 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying


So I really dislike time based spinning, and we've always rejected it
before.

On Sat, Apr 13, 2019 at 01:22:55PM -0400, Waiman Long wrote:

> +static inline u64 rwsem_rspin_threshold(struct rw_semaphore *sem)
> +{
> +	long count = atomic_long_read(&sem->count);
> +	int reader_cnt = atomic_long_read(&sem->count) >> RWSEM_READER_SHIFT;
> +
> +	if (reader_cnt > 30)
> +		reader_cnt = 30;
> +	return sched_clock() + ((count & RWSEM_FLAG_WAITERS)
> +		? 10 * NSEC_PER_USEC + reader_cnt * NSEC_PER_USEC/2
> +		: 25 * NSEC_PER_USEC);
> +}

Urgh, why do you _have_ to write unreadable code :-(

static inline u64 rwsem_rspin_threshold(struct rw_semaphore *sem)
{
	long count = atomic_long_read(&sem->count);
	u64 delta = 25 * NSEC_PER_USEC;

	if (count & RWSEM_FLAG_WAITERS) {
		int readers = count >> RWSEM_READER_SHIFT;

		if (readers > 30)
			readers = 30;

		delta = (20 + readers) * NSEC_PER_USEC / 2;
	}

	return sched_clock() + delta;
}

I don't get it though; the number of current read-owners is independent
of WAITERS, while the hold time does correspond to it.

So why do we have that WAITERS check in there?

> @@ -616,6 +678,35 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
>  		if (taken)
>  			break;
>  
> +		/*
> +		 * Time-based reader-owned rwsem optimistic spinning
> +		 */

This relies on rwsem_spin_on_owner() not actually spinning for
read-owned.

> +		if (wlock && (owner_state == OWNER_READER)) {
> +			/*
> +			 * Initialize rspin_threshold when the owner
> +			 * state changes from non-reader to reader.
> +			 */
> +			if (prev_owner_state != OWNER_READER) {
> +				if (!is_rwsem_spinnable(sem))
> +					break;
> +				rspin_threshold = rwsem_rspin_threshold(sem);
> +				loop = 0;
> +			}

> This seems fragile; why not do the rspin_threshold thing _once_ at the
start of this function?

This way it can be reset.

> +			/*
> +			 * Check time threshold every 16 iterations to
> +			 * avoid calling sched_clock() too frequently.
> +			 * This will make the actual spinning time a
> +			 * bit more than that specified in the threshold.
> +			 */
> +			else if (!(++loop & 0xf) &&
> +				 (sched_clock() > rspin_threshold)) {

Why is calling sched_clock() lots a problem?

> +				rwsem_set_nonspinnable(sem);
> +				lockevent_inc(rwsem_opt_nospin);
> +				break;
> +			}
> +		}
> +
>  		/*
>  		 * An RT task cannot do optimistic spinning if it cannot
>  		 * be sure the lock holder is running or live-lock may



^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 09/16] locking/rwsem: Ensure an RT task will not spin on reader
  2019-04-18  8:52       ` Peter Zijlstra
@ 2019-04-18 13:27         ` Waiman Long
  0 siblings, 0 replies; 112+ messages in thread
From: Waiman Long @ 2019-04-18 13:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On 04/18/2019 04:52 AM, Peter Zijlstra wrote:
> On Wed, Apr 17, 2019 at 02:47:07PM -0400, Waiman Long wrote:
>>>> @@ -566,13 +573,28 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
>>>>  		}
>>>>  
>>>>  		/*
>>>> -		 * When there's no owner, we might have preempted between the
>>>> -		 * owner acquiring the lock and setting the owner field. If
>>>> -		 * we're an RT task that will live-lock because we won't let
>>>> -		 * the owner complete.
>>>> +		 * An RT task cannot do optimistic spinning if it cannot
>>>> +		 * be sure the lock holder is running or live-lock may
>>>> +		 * happen if the current task and the lock holder happen
>>>> +		 * to run in the same CPU.
>>>> +		 *
>>>> +		 * When there's no owner or the lock is reader-owned, an RT task
>>>> +		 * will stop spinning if the owner state is not a writer
>>>> +		 * at the previous iteration of the loop. This allows the
>>>> +		 * RT task to recheck if the task that steals the lock is
>>>> +		 * a spinnable writer. If so, it can keep on spinning.
>>>> +		 *
>>>> +		 * If the owner is a writer, the need_resched() check is
>>>> +		 * done inside rwsem_spin_on_owner(). If the owner is not
>>>> +		 * a writer, need_resched() check needs to be done here.
>>>>  		 */
>>>> -		if (!sem->owner && (need_resched() || rt_task(current)))
>>>> -			break;
>>>> +		if (owner_state != OWNER_WRITER) {
>>>> +			if (need_resched())
>>>> +				break;
>>>> +			if (is_rt_task && (prev_owner_state != OWNER_WRITER))
>>>> +				break;
>>>> +		}
>>>> +		prev_owner_state = owner_state;
>>>>  
>>>>  		/*
>>>>  		 * The cpu_relax() call is a compiler barrier which forces
>>> This patch confuses me mightily. I mean, I see what it does, but I can't
>>> figure out why. The Changelog is just one big source of confusion.
>> Sorry for confusing you. If count and owner are separate, there is a
>> time lag where the owner is NULL, but the lock is not free yet.
> Right.
>
>> Similarly, the lock could be free but another task may have stolen the
>> lock if the waiter bit isn't set.
>> In the former case,
> (free)
>
>> an extra iteration gives it more time for the lock holder to release
>> the lock.
>
>> In the latter case,
> (stolen)
>
>> if the new lock owner is a writer and sets the owner in time,
>> the RT task can keep on spinning. Will clarify that in the commit log
>> and the comment.
> Blergh.. so by going around once extra, you hope ->owner will be set
> again and we keep spinning. And this is actually measurable.

Right. That is the plan.

>
> Yuck yuck yuck. I much prefer getting rid of that hole, as you do later
> on in the series, which would avoid this complexity. Let me continue
> reading...

Well, there are limitations to merging the owner into the rwsem count.
First of all, we can't do that for 32-bit. Right now owner merging is
enabled for x86-64 only. I will need to study the maximum physical
address bits of the other architectures once I am done with this
patchset. So doing an extra loop will still be helpful.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 11/16] locking/rwsem: Enable readers spinning on writer
  2019-04-18  9:00       ` Peter Zijlstra
@ 2019-04-18 13:40         ` Waiman Long
  0 siblings, 0 replies; 112+ messages in thread
From: Waiman Long @ 2019-04-18 13:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On 04/18/2019 05:00 AM, Peter Zijlstra wrote:
> On Wed, Apr 17, 2019 at 01:45:10PM -0400, Waiman Long wrote:
>> On 04/17/2019 09:58 AM, Peter Zijlstra wrote:
>>> On Sat, Apr 13, 2019 at 01:22:54PM -0400, Waiman Long wrote:
>>>> +/*
>>>> + * Try to acquire read lock before the reader is put on wait queue.
>>>> + * Lock acquisition isn't allowed if the rwsem is locked or a writer handoff
>>>> + * is ongoing.
>>>> + */
>>>> +static inline bool rwsem_try_read_lock_unqueued(struct rw_semaphore *sem)
>>>> +{
>>>> +	long count = atomic_long_read(&sem->count);
>>>> +
>>>> +	if (RWSEM_COUNT_WLOCKED_OR_HANDOFF(count))
>>>> +		return false;
>>>> +
>>>> +	count = atomic_long_fetch_add_acquire(RWSEM_READER_BIAS, &sem->count);
>>>> +	if (!RWSEM_COUNT_WLOCKED_OR_HANDOFF(count)) {
>>>> +		rwsem_set_reader_owned(sem);
>>>> +		lockevent_inc(rwsem_opt_rlock);
>>>> +		return true;
>>>> +	}
>>>> +
>>>> +	/* Back out the change */
>>>> +	atomic_long_add(-RWSEM_READER_BIAS, &sem->count);
>>>> +	return false;
>>>> +}
>>> Doesn't a cmpxchg 'loop' make more sense here?
>> Not really. A cmpxchg loop will have one more correctable failure mode -
>> a new reader acquiring the lock or a reader owner doing an unlock will
>> force a retry. Failures caused by the setting of the handoff bit or by a
>> writer acquiring the lock are the same for both variants. I don't see
>> any advantage in using a cmpxchg loop.
> It depends on how many failures vs successes you have. I was expecting
> failure to be the most common case, and then you go from 2 atomics to 1.

Well, it really depends on the workload. Note that an atomic trylock is
only issued when the lock looks ready to be acquired, and I don't see
handoff as a likely scenario. So a failure means either that a writer
has just acquired the lock (not possible if there are existing owning
readers) or that the reader count has changed. For ll/sc architectures,
the failure path above has one acquire atomic and one relaxed atomic,
while a cmpxchg loop has two acquire atomics. If a large number of
readers are trying to acquire the lock, we may need multiple iterations
of the cmpxchg loop to actually acquire it, so it can have a far worse
worst-case behavior.

Cheers,
Longman



^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 14/16] locking/rwsem: Guard against making count negative
  2019-04-13 17:22 ` [PATCH v4 14/16] locking/rwsem: Guard against making count negative Waiman Long
@ 2019-04-18 13:51   ` Peter Zijlstra
  2019-04-18 14:08     ` Waiman Long
  0 siblings, 1 reply; 112+ messages in thread
From: Peter Zijlstra @ 2019-04-18 13:51 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On Sat, Apr 13, 2019 at 01:22:57PM -0400, Waiman Long wrote:
>  inline void __down_read(struct rw_semaphore *sem)
>  {
> +	long count = atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
> +						   &sem->count);
> +
> +	if (unlikely(count & RWSEM_READ_FAILED_MASK)) {
> +		rwsem_down_read_failed(sem, count);
>  		DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
>  	} else {
>  		rwsem_set_reader_owned(sem);

*groan*, that is not provably correct. It is entirely possible to get
enough fetch_add()s piled on top of one another to overflow regardless.

Unlikely, yes, impossible, no.

This makes me nervious as heck, I really don't want to ever have to
debug something like that :-(

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 14/16] locking/rwsem: Guard against making count negative
  2019-04-18 13:51   ` Peter Zijlstra
@ 2019-04-18 14:08     ` Waiman Long
  2019-04-18 14:30       ` Peter Zijlstra
  2019-04-18 14:40       ` Peter Zijlstra
  0 siblings, 2 replies; 112+ messages in thread
From: Waiman Long @ 2019-04-18 14:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On 04/18/2019 09:51 AM, Peter Zijlstra wrote:
> On Sat, Apr 13, 2019 at 01:22:57PM -0400, Waiman Long wrote:
>>  inline void __down_read(struct rw_semaphore *sem)
>>  {
>> +	long count = atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
>> +						   &sem->count);
>> +
>> +	if (unlikely(count & RWSEM_READ_FAILED_MASK)) {
>> +		rwsem_down_read_failed(sem, count);
>>  		DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
>>  	} else {
>>  		rwsem_set_reader_owned(sem);
> *groan*, that is not provably correct. It is entirely possible to get
> enough fetch_add()s piled on top of one another to overflow regardless.
>
> Unlikely, yes, impossible, no.
>
> This makes me nervious as heck, I really don't want to ever have to
> debug something like that :-(

The number of fetch_add() that can pile up is limited by the number of
CPUs available in the system. Yes, if you have a 32k-processor system
with all of its CPUs trying to acquire the same read lock, we will have
a problem. Or, as Linus said, if tasks can be kept preempted right after
doing the fetch_add while newly scheduled tasks do the fetch_add on the
same lock again, we could overflow with fewer CPUs. How about disabling
preemption before the fetch_add and re-enabling it afterward to address
the latter concern? I have no solution for the first case, though.
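
Something like the sketch below would be a literal reading of that idea
(it is not in the posted series). To really bound the pile-up,
preemption would also have to stay disabled until the slowpath backs the
reader bias out again; this only shows the fast-path shape:

inline void __down_read(struct rw_semaphore *sem)
{
	long count;

	preempt_disable();
	count = atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
					      &sem->count);
	if (likely(!(count & RWSEM_READ_FAILED_MASK))) {
		rwsem_set_reader_owned(sem);
		preempt_enable();
		return;
	}
	preempt_enable();
	rwsem_down_read_failed(sem, count);
	DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
}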

Cheers,
Longman



^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 15/16] locking/rwsem: Merge owner into count on x86-64
  2019-04-13 17:22 ` [PATCH v4 15/16] locking/rwsem: Merge owner into count on x86-64 Waiman Long
@ 2019-04-18 14:28   ` Peter Zijlstra
  2019-04-18 14:40     ` Waiman Long
  0 siblings, 1 reply; 112+ messages in thread
From: Peter Zijlstra @ 2019-04-18 14:28 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On Sat, Apr 13, 2019 at 01:22:58PM -0400, Waiman Long wrote:
> +#ifdef CONFIG_X86_64
> +#define RWSEM_MERGE_OWNER_TO_COUNT
> +#endif

> +#ifdef RWSEM_MERGE_OWNER_TO_COUNT
> +
> +#ifdef __PHYSICAL_MASK_SHIFT
> +#define RWSEM_PA_MASK_SHIFT	__PHYSICAL_MASK_SHIFT
> +#else
> +#define RWSEM_PA_MASK_SHIFT	52
> +#endif

I really dislike how this hardcodes x86_64.

It would be much better to have a CONFIG variable that gives us the
PA_BITS. Then all an architecture needs to do is set that right.

FWIW:

arch/arm64/Kconfig:config ARM64_PA_BITS_48
arch/arm64/Kconfig:config ARM64_PA_BITS_52

So ARM64 could also use this -- provided we get that overflow thing
sorted.
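
Something like this (sketch only; CONFIG_ARCH_PA_BITS is a made-up
symbol that each architecture would provide):

/* generic, instead of keying on CONFIG_X86_64 */
#ifdef CONFIG_ARCH_PA_BITS
#define RWSEM_MERGE_OWNER_TO_COUNT
#define RWSEM_PA_MASK_SHIFT	CONFIG_ARCH_PA_BITS
#endif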

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 14/16] locking/rwsem: Guard against making count negative
  2019-04-18 14:08     ` Waiman Long
@ 2019-04-18 14:30       ` Peter Zijlstra
  2019-04-18 14:40       ` Peter Zijlstra
  1 sibling, 0 replies; 112+ messages in thread
From: Peter Zijlstra @ 2019-04-18 14:30 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On Thu, Apr 18, 2019 at 10:08:28AM -0400, Waiman Long wrote:
> On 04/18/2019 09:51 AM, Peter Zijlstra wrote:
> > On Sat, Apr 13, 2019 at 01:22:57PM -0400, Waiman Long wrote:
> >>  inline void __down_read(struct rw_semaphore *sem)
> >>  {
> >> +	long count = atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
> >> +						   &sem->count);
> >> +
> >> +	if (unlikely(count & RWSEM_READ_FAILED_MASK)) {
> >> +		rwsem_down_read_failed(sem, count);
> >>  		DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
> >>  	} else {
> >>  		rwsem_set_reader_owned(sem);
> > *groan*, that is not provably correct. It is entirely possible to get
> > enough fetch_add()s piled on top of one another to overflow regardless.
> >
> > Unlikely, yes, impossible, no.
> >
> > This makes me nervious as heck, I really don't want to ever have to
> > debug something like that :-(
> 
> The number of fetch_add() that can pile up is limited by the number of
> CPUs available in the system.

Uhhn, no. There is no preempt_disable() anywhere here. So even UP can
overflow if it has enough tasks.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 11/16] locking/rwsem: Enable readers spinning on writer
  2019-04-18  8:57       ` Peter Zijlstra
@ 2019-04-18 14:35         ` Waiman Long
  0 siblings, 0 replies; 112+ messages in thread
From: Waiman Long @ 2019-04-18 14:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On 04/18/2019 04:57 AM, Peter Zijlstra wrote:
> On Wed, Apr 17, 2019 at 01:34:01PM -0400, Waiman Long wrote:
>> On 04/17/2019 09:56 AM, Peter Zijlstra wrote:
>>> On Sat, Apr 13, 2019 at 01:22:54PM -0400, Waiman Long wrote:
>>>> @@ -549,7 +582,7 @@ static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
>>>>  	return !owner ? OWNER_NULL : OWNER_READER;
>>>>  }
>>>>  
>>>> -static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
>>>> +static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
>>>>  {
>>>>  	bool taken = false;
>>>>  	bool is_rt_task = rt_task(current);
>>>> @@ -558,9 +591,6 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
>>>>  	preempt_disable();
>>>>  
>>>>  	/* sem->wait_lock should not be held when doing optimistic spinning */
>>>> -	if (!rwsem_can_spin_on_owner(sem))
>>>> -		goto done;
>>>> -
>>>>  	if (!osq_lock(&sem->osq))
>>>>  		goto done;
>>>>  
>>>> @@ -580,10 +610,11 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
>>>>  		/*
>>>>  		 * Try to acquire the lock
>>>>  		 */
>>>> -		if (rwsem_try_write_lock_unqueued(sem)) {
>>>> -			taken = true;
>>>> +		taken = wlock ? rwsem_try_write_lock_unqueued(sem)
>>>> +			      : rwsem_try_read_lock_unqueued(sem);
>>>> +
>>>> +		if (taken)
>>>>  			break;
>>>> -		}
>>>>  
>>>>  		/*
>>>>  		 * An RT task cannot do optimistic spinning if it cannot
>>> Alternatively you pass the trylock function as an argument:
>>>
>>> static bool rwsem_optimistic_spin(struct rw_semaphore *sem,
>>> 				  bool (*trylock)(struct rw_semaphore *sem))
>>> {
>>> 	...
>>> 		if (trylock(sem)) {
>>> 			taken = true;
>>> 			goto unlock;
>>> 		}
>>> 	...
>>> }
>>>
>> With retpoline, an indirect function call will be slower.
> With compiler optimization we can avoid that. Just mark the function as
> __always_inline, there's only two call-sites, each with a different
> trylock.
>
> It might have already done that anyway, and used constant propagation
> on your bool, but the function pointer one is far easier to read.

The bool was extended to an "unsigned long" and the trylock functions will
have different argument lists in the last patch.
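
For reference, the pattern being suggested looks roughly like this (a
sketch only, not code from the series; the spin-exit conditions are
reduced to a need_resched() check here):

	static __always_inline bool
	rwsem_optimistic_spin(struct rw_semaphore *sem,
			      bool (*trylock)(struct rw_semaphore *sem))
	{
		bool taken = false;

		preempt_disable();
		if (!osq_lock(&sem->osq))
			goto done;

		for (;;) {
			/*
			 * Each call site passes a compile-time-constant
			 * trylock, so after forced inlining the compiler
			 * can turn this into a direct call -- no retpoline
			 * indirect branch is emitted.
			 */
			if (trylock(sem)) {
				taken = true;
				break;
			}
			if (need_resched())
				break;
			cpu_relax();
		}
		osq_unlock(&sem->osq);
	done:
		preempt_enable();
		return taken;
	}

The two call sites would then be rwsem_optimistic_spin(sem,
rwsem_try_write_lock_unqueued) and rwsem_optimistic_spin(sem,
rwsem_try_read_lock_unqueued).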

-Longman



^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 11/16] locking/rwsem: Enable readers spinning on writer
  2019-04-18  9:11       ` Peter Zijlstra
@ 2019-04-18 14:37         ` Waiman Long
  0 siblings, 0 replies; 112+ messages in thread
From: Waiman Long @ 2019-04-18 14:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On 04/18/2019 05:11 AM, Peter Zijlstra wrote:
> On Wed, Apr 17, 2019 at 01:51:24PM -0400, Waiman Long wrote:
>> On 04/17/2019 10:05 AM, Peter Zijlstra wrote:
>>> On Sat, Apr 13, 2019 at 01:22:54PM -0400, Waiman Long wrote:
>>>> @@ -650,6 +686,33 @@ __rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
>>>>  	struct rwsem_waiter waiter;
>>>>  	DEFINE_WAKE_Q(wake_q);
>>>>  
>>>> +	if (!rwsem_can_spin_on_owner(sem))
>>>> +		goto queue;
>>>> +
>>>> +	/*
>>>> +	 * Undo read bias from down_read() and do optimistic spinning.
>>>> +	 */
>>>> +	atomic_long_add(-RWSEM_READER_BIAS, &sem->count);
>>>> +	adjustment = 0;
>>>> +	if (rwsem_optimistic_spin(sem, false)) {
>>>> +		unsigned long flags;
>>>> +
>>>> +		/*
>>>> +		 * Opportunistically wake up other readers in the wait queue.
>>>> +		 * It has another chance of wakeup at unlock time.
>>>> +		 */
>>>> +		if ((atomic_long_read(&sem->count) & RWSEM_FLAG_WAITERS) &&
>>>> +		    raw_spin_trylock_irqsave(&sem->wait_lock, flags)) {
>>> why trylock?
>> I used trylock to avoid getting stuck in the spinlock while holding a
>> read lock on the rwsem.
> Is that a real concern? I would think that not waking further readers
> would, esp. under high contention, be a bigger deal.

Yes, I can certainly change it to a regular spin_lock().
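
For reference, the agreed change would make the quoted hunk look roughly
like this (a sketch only; the wakeup call inside the lock is assumed from
the surrounding slowpath code, not copied from the final patch):

	unsigned long flags;
	DEFINE_WAKE_Q(wake_q);

	if (atomic_long_read(&sem->count) & RWSEM_FLAG_WAITERS) {
		raw_spin_lock_irqsave(&sem->wait_lock, flags);
		if (!list_empty(&sem->wait_list))
			__rwsem_mark_wake(sem, RWSEM_WAKE_READ_OWNED, &wake_q);
		raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
		wake_up_q(&wake_q);
	}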

Cheers,
Longman


^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 15/16] locking/rwsem: Merge owner into count on x86-64
  2019-04-18 14:28   ` Peter Zijlstra
@ 2019-04-18 14:40     ` Waiman Long
  0 siblings, 0 replies; 112+ messages in thread
From: Waiman Long @ 2019-04-18 14:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On 04/18/2019 10:28 AM, Peter Zijlstra wrote:
> On Sat, Apr 13, 2019 at 01:22:58PM -0400, Waiman Long wrote:
>> +#ifdef CONFIG_X86_64
>> +#define RWSEM_MERGE_OWNER_TO_COUNT
>> +#endif
>> +#ifdef RWSEM_MERGE_OWNER_TO_COUNT
>> +
>> +#ifdef __PHYSICAL_MASK_SHIFT
>> +#define RWSEM_PA_MASK_SHIFT	__PHYSICAL_MASK_SHIFT
>> +#else
>> +#define RWSEM_PA_MASK_SHIFT	52
>> +#endif
> I really dislike how this hardcodes x86_64.
>
> It would be much better to have a CONFIG variable that gives us the
> PA_BITS. Then all an architecture needs to do is set that right.
>
> FWIW:
>
> arch/arm64/Kconfig:config ARM64_PA_BITS_48
> arch/arm64/Kconfig:config ARM64_PA_BITS_52
>
> So ARM64 could also use this -- provided we get that overflow thing
> sorted.

That will work too. I am just a bit hesitant to add a new config variable.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 14/16] locking/rwsem: Guard against making count negative
  2019-04-18 14:08     ` Waiman Long
  2019-04-18 14:30       ` Peter Zijlstra
@ 2019-04-18 14:40       ` Peter Zijlstra
  2019-04-18 14:54         ` Waiman Long
  1 sibling, 1 reply; 112+ messages in thread
From: Peter Zijlstra @ 2019-04-18 14:40 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On Thu, Apr 18, 2019 at 10:08:28AM -0400, Waiman Long wrote:
> On 04/18/2019 09:51 AM, Peter Zijlstra wrote:
> > On Sat, Apr 13, 2019 at 01:22:57PM -0400, Waiman Long wrote:
> >>  inline void __down_read(struct rw_semaphore *sem)
> >>  {
> >> +	long count = atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
> >> +						   &sem->count);
> >> +
> >> +	if (unlikely(count & RWSEM_READ_FAILED_MASK)) {
> >> +		rwsem_down_read_failed(sem, count);
> >>  		DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
> >>  	} else {
> >>  		rwsem_set_reader_owned(sem);
> > *groan*, that is not provably correct. It is entirely possible to get
> > enough fetch_add()s piled on top of one another to overflow regardless.
> >
> > Unlikely, yes, impossible, no.
> >
> > This makes me nervous as heck, I really don't want to ever have to
> > debug something like that :-(
> 
> The number of fetch_add() that can pile up is limited by the number of
> CPUs available in the system.

> Yes, if you have a 32k processor system that has all the CPUs trying
> to acquire the same read-lock, we will have a problem.

Having more CPUs than that is not impossible these days.

> Or as Linus had said that if we could have tasks kept
> preempted right after doing the fetch_add with newly scheduled tasks
> doing the fetch_add at the same lock again, we could have overflow with
> less CPUs.

That.

> How about disabling preemption before fetch_add and re-enabling
> it afterward to address the latter concern?

Performance might be an issue, look at what preempt_disable() +
preempt_enable() generate for ARM64 for example. That's not particularly
pretty.

> I have no solution for the first case, though.

A cmpxchg() loop can fix this, but that again has performance
implications like you mentioned a while back.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 14/16] locking/rwsem: Guard against making count negative
  2019-04-18 14:40       ` Peter Zijlstra
@ 2019-04-18 14:54         ` Waiman Long
  2019-04-19 10:26           ` Peter Zijlstra
  0 siblings, 1 reply; 112+ messages in thread
From: Waiman Long @ 2019-04-18 14:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On 04/18/2019 10:40 AM, Peter Zijlstra wrote:
> On Thu, Apr 18, 2019 at 10:08:28AM -0400, Waiman Long wrote:
>> On 04/18/2019 09:51 AM, Peter Zijlstra wrote:
>>> On Sat, Apr 13, 2019 at 01:22:57PM -0400, Waiman Long wrote:
>>>>  inline void __down_read(struct rw_semaphore *sem)
>>>>  {
>>>> +	long count = atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
>>>> +						   &sem->count);
>>>> +
>>>> +	if (unlikely(count & RWSEM_READ_FAILED_MASK)) {
>>>> +		rwsem_down_read_failed(sem, count);
>>>>  		DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
>>>>  	} else {
>>>>  		rwsem_set_reader_owned(sem);
>>> *groan*, that is not provably correct. It is entirely possible to get
>>> enough fetch_add()s piled on top of one another to overflow regardless.
>>>
>>> Unlikely, yes, impossible, no.
>>>
>>> This makes me nervous as heck, I really don't want to ever have to
>>> debug something like that :-(
>> The number of fetch_add() that can pile up is limited by the number of
>> CPUs available in the system.
>> Yes, if you have a 32k processor system that has all the CPUs trying
>> to acquire the same read-lock, we will have a problem.
> Having more CPUs than that is not impossible these days.
>

Having more than 32k CPUs contending for the same cacheline will be
horribly slow.

>> Or as Linus had said that if we could have tasks kept
>> preempted right after doing the fetch_add with newly scheduled tasks
>> doing the fetch_add at the same lock again, we could have overflow with
>> less CPUs.
> That.
>
>> How about disabling preemption before fetch_add and re-enabling
>> it afterward to address the latter concern?
> Performance might be an issue, look at what preempt_disable() +
> preempt_enable() generate for ARM64 for example. That's not particularly
> pretty.

That is just for the preempt kernel. Right? Thinking about it some more,
the above scenario is less likely to happen for a CONFIG_PREEMPT_VOLUNTARY
kernel and the preempt_disable cost will be lower. A preempt RT kernel
is less likely to run on systems with many CPUs anyway. We could make
that a config option as well in a follow-on patch and let the
distributors decide.

>> I have no solution for the first case, though.
> A cmpxchg() loop can fix this, but that again has performance
> implications like you mentioned a while back.

Exactly.

Cheers,
Longman



^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 12/16] locking/rwsem: Enable time-based spinning on reader-owned rwsem
  2019-04-18 13:06   ` Peter Zijlstra
@ 2019-04-18 15:15     ` Waiman Long
  2019-04-19  7:56       ` Peter Zijlstra
  0 siblings, 1 reply; 112+ messages in thread
From: Waiman Long @ 2019-04-18 15:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On 04/18/2019 09:06 AM, Peter Zijlstra wrote:
> So I really dislike time based spinning, and we've always rejected it
> before.
>
> On Sat, Apr 13, 2019 at 01:22:55PM -0400, Waiman Long wrote:
>
>> +static inline u64 rwsem_rspin_threshold(struct rw_semaphore *sem)
>> +{
>> +	long count = atomic_long_read(&sem->count);
>> +	int reader_cnt = atomic_long_read(&sem->count) >> RWSEM_READER_SHIFT;
>> +
>> +	if (reader_cnt > 30)
>> +		reader_cnt = 30;
>> +	return sched_clock() + ((count & RWSEM_FLAG_WAITERS)
>> +		? 10 * NSEC_PER_USEC + reader_cnt * NSEC_PER_USEC/2
>> +		: 25 * NSEC_PER_USEC);
>> +}
> Urgh, why do you _have_ to write unreadable code :-(

I guess my code writing style is less readable to others. I will try to
write simpler code that will be more readable in the future :-)

>
> static inline u64 rwsem_rspin_threshold(struct rw_semaphore *sem)
> {
> 	long count = atomic_long_read(&sem->count);
> 	u64 delta = 25 * NSEC_PER_USEC;
>
> 	if (count & RWSEM_FLAG_WAITERS) {
> 		int readers = count >> RWSEM_READER_SHIFT;
>
> 		if (readers > 30)
> 			readers = 30;
>
> 		delta = (20 + readers) * NSEC_PER_USEC / 2;
> 	}
>
> 	return sched_clock() + delta;
> }
>
> I don't get it though; the number of current read-owners is independent
> of WAITERS, while the hold time does correspond to it.
>
> So why do we have that WAITERS check in there?

It is not a waiter check; it is checking the number of readers that are
holding the lock. My thinking was that in the wakeup process done by
__rwsem_mark_wake(), the wakeup is done one-by-one. So the more readers
you have, the more time it takes for the last reader to actually wake up
and run its critical section. That is the main reason for that logic.

>
>> @@ -616,6 +678,35 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
>>  		if (taken)
>>  			break;
>>  
>> +		/*
>> +		 * Time-based reader-owned rwsem optimistic spinning
>> +		 */
> This relies on rwsem_spin_on_owner() not actually spinning for
> read-owned.
>

Yes, because there is no task structure to spin on.

>> +		if (wlock && (owner_state == OWNER_READER)) {
>> +			/*
>> +			 * Initialize rspin_threshold when the owner
>> +			 * state changes from non-reader to reader.
>> +			 */
>> +			if (prev_owner_state != OWNER_READER) {
>> +				if (!is_rwsem_spinnable(sem))
>> +					break;
>> +				rspin_threshold = rwsem_rspin_threshold(sem);
>> +				loop = 0;
>> +			}
> This seems fragile, why not do the rspin_threshold thing _once_ at the
> start of this function?
>
> This way it can be reset.

You can have a situation as follows:

Lock owner: R [W] R

So a writer comes in and gets the lock before the spinner. You can then
actually spin on that writer. After that, a reader comes in and steals the
lock; the code above allows us to reset the timeout period for the
new reader phase.

>> +			/*
>> +			 * Check time threshold every 16 iterations to
>> +			 * avoid calling sched_clock() too frequently.
>> +			 * This will make the actual spinning time a
>> +			 * bit more than that specified in the threshold.
>> +			 */
>> +			else if (!(++loop & 0xf) &&
>> +				 (sched_clock() > rspin_threshold)) {
> Why is calling sched_clock() lots a problem?

Actually I am more concerned about the latency introduced by the
sched_clock() call. BTW, I haven't done any measurement myself. Do you
know how much the sched_clock() call costs?

If the cost is relatively high, the average latency between the lock
becoming free and the spinner being ready to do a trylock will increase.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 12/16] locking/rwsem: Enable time-based spinning on reader-owned rwsem
  2019-04-18 15:15     ` Waiman Long
@ 2019-04-19  7:56       ` Peter Zijlstra
  2019-04-19 14:33         ` Waiman Long
  0 siblings, 1 reply; 112+ messages in thread
From: Peter Zijlstra @ 2019-04-19  7:56 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On Thu, Apr 18, 2019 at 11:15:33AM -0400, Waiman Long wrote:
> On 04/18/2019 09:06 AM, Peter Zijlstra wrote:

> >> +			/*
> >> +			 * Check time threshold every 16 iterations to
> >> +			 * avoid calling sched_clock() too frequently.
> >> +			 * This will make the actual spinning time a
> >> +			 * bit more than that specified in the threshold.
> >> +			 */
> >> +			else if (!(++loop & 0xf) &&
> >> +				 (sched_clock() > rspin_threshold)) {
> > Why is calling sched_clock() lots a problem?
> 
> Actually I am more concerned about the latency introduced by the
> sched_clock() call. BTW, I haven't done any measurement myself. Do you
> know how much the sched_clock() call costs?
> 
> If the cost is relatively high, the average latency between the lock
> becoming free and the spinner being ready to do a trylock will increase.

Totally depends on the arch of course :/ For 'sane' x86 it is: RDTSC,
MUL; SHRD; SHR; ADD, which is plenty fast.
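
For reference, that instruction sequence corresponds to scaling the TSC
value into nanoseconds with a per-CPU multiplier/shift/offset, roughly
(a simplified sketch; mul, shift and offset stand in for the per-CPU
cyc2ns data, they are not the real variable names):

	u64 cyc = rdtsc();				/* RDTSC          */
	u64 ns  = mul_u64_u32_shr(cyc, mul, shift)	/* MUL; SHRD; SHR */
		  + offset;				/* ADD            */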

I know we have poll loops with sched_clock/local_clock in them, I just
can't seem to find any atm.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 14/16] locking/rwsem: Guard against making count negative
  2019-04-18 14:54         ` Waiman Long
@ 2019-04-19 10:26           ` Peter Zijlstra
  2019-04-19 12:02             ` Peter Zijlstra
  0 siblings, 1 reply; 112+ messages in thread
From: Peter Zijlstra @ 2019-04-19 10:26 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On Thu, Apr 18, 2019 at 10:54:19AM -0400, Waiman Long wrote:
> On 04/18/2019 10:40 AM, Peter Zijlstra wrote:

> > Having more CPUs than that is not impossible these days.
> >
> 
> Having more than 32k CPUs contending for the same cacheline will be
> horribly slow.

No question about that.

> >> How about disabling preemption before fetch_all and re-enable
> >> it afterward to address the latter concern? 
> > Performance might be an issue, look at what preempt_disable() +
> > preempt_enable() generate for ARM64 for example. That's not particularly
> > pretty.
> 
> That is just for the preempt kernel. Right? Thinking about it some more,
> the above scenario is less likely to happen for a CONFIG_PREEMPT_VOLUNTARY
> kernel and the preempt_disable cost will be lower.

Depends a bit on what specific CONFIG knobs are used. IIRC something
like NOHZ_FULL will also select PREEMPT_COUNT, it will just not have the
actual preemption calls in.

> A preempt RT kernel is less likely to run on systems with many CPUs
> anyway. We could make that a config option as well in a follow-on
> patch and let the distributors decide.

RT has a whole different rwsem implementation anyway, so we don't need
to worry about them.

> >> I have no solution for the first case, though.

> > A cmpxchg() loop can fix this, but that again has performance
> > implications like you mentioned a while back.

I thought of a horrible horrible alternative:

union rwsem_count {
	struct { /* assuming LP64-LE */
		unsigned short  other[3];
		unsigned short  readers;
	};
	unsigned long	value;
};

void down_read(struct rw_semaphore *sem)
{
	union rwsem_count c;
	unsigned short o;

	c.value = atomic_long_read(&sem->count);
	c.readers++;

	if (!c.readers || (c.value & RWSEM_FLAG_WRITER))
		goto slow;

	o = xchg(&((union rwsem_count *)sem)->readers, c.readers);
	if (o != c.readers-1) {
		c.value = atomic_long_fetch_add(&sem->count, o-(c.readers-1));
	} else {
		c.value = atomic_long_read(&sem->count);
		c.readers = o + 1;
	}

	if (!(c.value & RWSEM_FLAG_WRITER))
		return;

slow:
	rwsem_down_read_slow(sem, c.value);
}

It is deterministic in that it has at most 2 unconditional atomic ops,
no cmpxchg loop, and a best case of a single op.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 14/16] locking/rwsem: Guard against making count negative
  2019-04-19 10:26           ` Peter Zijlstra
@ 2019-04-19 12:02             ` Peter Zijlstra
  2019-04-19 13:03               ` Peter Zijlstra
  0 siblings, 1 reply; 112+ messages in thread
From: Peter Zijlstra @ 2019-04-19 12:02 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On Fri, Apr 19, 2019 at 12:26:47PM +0200, Peter Zijlstra wrote:
> I thought of a horrible horrible alternative:

Hurm, that's broken as heck. Let me try again.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 14/16] locking/rwsem: Guard against making count negative
  2019-04-19 12:02             ` Peter Zijlstra
@ 2019-04-19 13:03               ` Peter Zijlstra
  2019-04-19 13:15                 ` Peter Zijlstra
  0 siblings, 1 reply; 112+ messages in thread
From: Peter Zijlstra @ 2019-04-19 13:03 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On Fri, Apr 19, 2019 at 02:02:07PM +0200, Peter Zijlstra wrote:
> On Fri, Apr 19, 2019 at 12:26:47PM +0200, Peter Zijlstra wrote:
> > I thought of a horrible horrible alternative:
> 
> Hurm, that's broken as heck. Let me try again.

So I can't make that scheme work, it all ends up wanting to have
cmpxchg().

Do we have a performance comparison somewhere of xadd vs cmpxchg
readers? I tried looking in the old threads, but I can't seem to locate
it.

We need new instructions :/ Or more clever than I can muster just now.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 14/16] locking/rwsem: Guard against making count negative
  2019-04-19 13:03               ` Peter Zijlstra
@ 2019-04-19 13:15                 ` Peter Zijlstra
  2019-04-19 19:39                   ` Waiman Long
  0 siblings, 1 reply; 112+ messages in thread
From: Peter Zijlstra @ 2019-04-19 13:15 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On Fri, Apr 19, 2019 at 03:03:04PM +0200, Peter Zijlstra wrote:
> On Fri, Apr 19, 2019 at 02:02:07PM +0200, Peter Zijlstra wrote:
> > On Fri, Apr 19, 2019 at 12:26:47PM +0200, Peter Zijlstra wrote:
> > > I thought of a horrible horrible alternative:
> > 
> > Hurm, that's broken as heck. Let me try again.
> 
> So I can't make that scheme work, it all ends up wanting to have
> cmpxchg().
> 
> Do we have a performance comparison somewhere of xadd vs cmpxchg
> readers? I tried looking in the old threads, but I can't seem to locate
> it.
> 
> We need new instructions :/ Or more clever than I can muster just now.

In particular, an (unsigned) saturation arithmetic variant of XADD would
be very nice to have at this point.
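
To make the wish concrete: the wanted semantics are roughly those of the
sketch below, emulated here with the very cmpxchg() loop whose cost is the
open question (the names are illustrative, not from the series):

	static inline long atomic_long_fetch_add_sat(atomic_long_t *v,
						     long delta, long max)
	{
		long old = atomic_long_read(v);

		for (;;) {
			long tmp;

			if (old >= max)		/* saturated: do not add */
				return old;
			tmp = atomic_long_cmpxchg_acquire(v, old, old + delta);
			if (tmp == old)		/* add succeeded */
				return old;
			old = tmp;		/* lost the race, retry */
		}
	}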

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 12/16] locking/rwsem: Enable time-based spinning on reader-owned rwsem
  2019-04-19  7:56       ` Peter Zijlstra
@ 2019-04-19 14:33         ` Waiman Long
  2019-04-19 15:36           ` Waiman Long
  0 siblings, 1 reply; 112+ messages in thread
From: Waiman Long @ 2019-04-19 14:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On 04/19/2019 03:56 AM, Peter Zijlstra wrote:
> On Thu, Apr 18, 2019 at 11:15:33AM -0400, Waiman Long wrote:
>> On 04/18/2019 09:06 AM, Peter Zijlstra wrote:
>>>> +			/*
>>>> +			 * Check time threshold every 16 iterations to
>>>> +			 * avoid calling sched_clock() too frequently.
>>>> +			 * This will make the actual spinning time a
>>>> +			 * bit more than that specified in the threshold.
>>>> +			 */
>>>> +			else if (!(++loop & 0xf) &&
>>>> +				 (sched_clock() > rspin_threshold)) {
>>> Why is calling sched_clock() lots a problem?
>> Actually I am more concerned about the latency introduced by the
>> sched_clock() call. BTW, I haven't done any measurement myself. Do you
>> know how much the sched_clock() call costs?
>>
>> If the cost is relatively high, the average latency between the lock
>> becoming free and the spinner being ready to do a trylock will increase.
> Totally depends on the arch of course :/ For 'sane' x86 it is: RDTSC,
> MUL; SHRD; SHR; ADD, which is plenty fast.
>
> I know we have poll loops with sched_clock/local_clock in them, I just
> can't seem to find any atm.

Thanks, I will do some time measurement myself. If it is fast enough, I
can change the code to do sched_clock on every iteration.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 12/16] locking/rwsem: Enable time-based spinning on reader-owned rwsem
  2019-04-19 14:33         ` Waiman Long
@ 2019-04-19 15:36           ` Waiman Long
  0 siblings, 0 replies; 112+ messages in thread
From: Waiman Long @ 2019-04-19 15:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On 04/19/2019 10:33 AM, Waiman Long wrote:
> On 04/19/2019 03:56 AM, Peter Zijlstra wrote:
>> On Thu, Apr 18, 2019 at 11:15:33AM -0400, Waiman Long wrote:
>>> On 04/18/2019 09:06 AM, Peter Zijlstra wrote:
>>>>> +			/*
>>>>> +			 * Check time threshold every 16 iterations to
>>>>> +			 * avoid calling sched_clock() too frequently.
>>>>> +			 * This will make the actual spinning time a
>>>>> +			 * bit more than that specified in the threshold.
>>>>> +			 */
>>>>> +			else if (!(++loop & 0xf) &&
>>>>> +				 (sched_clock() > rspin_threshold)) {
>>>> Why is calling sched_clock() lots a problem?
>>> Actually I am more concerned about the latency introduced by the
>>> sched_clock() call. BTW, I haven't done any measurement myself. Do you
>>> know how much the sched_clock() call costs?
>>>
>>> If the cost is relatively high, the average latency between the lock
>>> becoming free and the spinner being ready to do a trylock will increase.
>> Totally depends on the arch of course :/ For 'sane' x86 it is: RDTSC,
>> MUL; SHRD; SHR; ADD, which is plenty fast.
>>
>> I know we have poll loops with sched_clock/local_clock in them, I just
>> can't seem to find any atm.
> Thanks, I will do some time measurement myself. If it is fast enough, I
> can change the code to do sched_clock on every iteration.
>
> Cheers,
> Longman
>
I measured the time of doing 10 sched_clock() calls. On a 2.1GHz
Skylake system, it was 83ns (~18 cycles per call). On a 2.5GHz ThunderX2
ARM system, it was 860ns (~215 cycles per call). On a 2.2GHz AMD EPYC
system, it was 200ns (~44 cycles per call). Intel is fastest, followed
by AMD and then the ARM64 chip.
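
For what it is worth, a measurement of that kind can be as simple as the
sketch below (illustrative only; this is not the harness used for the
numbers above):

	u64 start, delta;
	int i;

	start = sched_clock();
	for (i = 0; i < 10; i++)
		(void)sched_clock();
	delta = sched_clock() - start;	/* includes the end timestamp read */
	pr_info("10 sched_clock() calls took %llu ns\n", delta);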

Cheers,
Longman




^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 14/16] locking/rwsem: Guard against making count negative
  2019-04-19 13:15                 ` Peter Zijlstra
@ 2019-04-19 19:39                   ` Waiman Long
  2019-04-21 21:07                     ` Waiman Long
  0 siblings, 1 reply; 112+ messages in thread
From: Waiman Long @ 2019-04-19 19:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On 04/19/2019 09:15 AM, Peter Zijlstra wrote:
> On Fri, Apr 19, 2019 at 03:03:04PM +0200, Peter Zijlstra wrote:
>> On Fri, Apr 19, 2019 at 02:02:07PM +0200, Peter Zijlstra wrote:
>>> On Fri, Apr 19, 2019 at 12:26:47PM +0200, Peter Zijlstra wrote:
>>>> I thought of a horrible horrible alternative:
>>> Hurm, that's broken as heck. Let me try again.
>> So I can't make that scheme work, it all ends up wanting to have
>> cmpxchg().
>>
>> Do we have a performance comparison somewhere of xadd vs cmpxchg
>> readers? I tried looking in the old threads, but I can't seem to locate
>> it.
>>
>> We need new instructions :/ Or more clever than I can muster just now.
> In particular, an (unsigned) saturation arithmetic variant of XADD would
> be very nice to have at this point.

I just want to be clear about my current scheme. There will be 16 bits
allocated for the reader count. I use the MS bit for signaling that there
are too many readers. So the fast path will fail and the readers will be
put into the wait list. This effectively limits readers to 32k-1, but it
doesn't mean the actual reader count cannot go over that. As long as the
actual count is less than 64k, everything should still work perfectly.
IOW, even though we have reached the limit of 32k, we need to pile on an
additional 32k readers to really overflow the count and cause a problem.
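
In count-layout terms the scheme is roughly the following (a sketch; the
shift value is illustrative, the real layout in the series depends on the
merged owner bits):

	/* top 16 bits of sem->count hold the reader count (x86-64 layout) */
	#define RWSEM_READER_SHIFT	48			/* illustrative */
	#define RWSEM_READER_BIAS	(1UL << RWSEM_READER_SHIFT)

	count = atomic_long_fetch_add_acquire(RWSEM_READER_BIAS, &sem->count);
	if (unlikely(count < 0)) {
		/*
		 * The sign bit doubles as the "too many readers" flag: it is
		 * set once 32k readers are accounted.  Back the bias out and
		 * queue.  The field only truly overflows if a further 32k
		 * increments pile up before any of them are backed out.
		 */
		atomic_long_add(-RWSEM_READER_BIAS, &sem->count);
	}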

Cheers,
Longman




^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 14/16] locking/rwsem: Guard against making count negative
  2019-04-19 19:39                   ` Waiman Long
@ 2019-04-21 21:07                     ` Waiman Long
  2019-04-23 14:17                       ` Peter Zijlstra
  0 siblings, 1 reply; 112+ messages in thread
From: Waiman Long @ 2019-04-21 21:07 UTC (permalink / raw)
  To: Waiman Long, Peter Zijlstra
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, linux-kernel, x86,
	Davidlohr Bueso, Linus Torvalds, Tim Chen, huang ying

On 4/19/19 3:39 PM, Waiman Long wrote:
> On 04/19/2019 09:15 AM, Peter Zijlstra wrote:
>> On Fri, Apr 19, 2019 at 03:03:04PM +0200, Peter Zijlstra wrote:
>>> On Fri, Apr 19, 2019 at 02:02:07PM +0200, Peter Zijlstra wrote:
>>>> On Fri, Apr 19, 2019 at 12:26:47PM +0200, Peter Zijlstra wrote:
>>>>> I thought of a horrible horrible alternative:
>>>> Hurm, that's broken as heck. Let me try again.
>>> So I can't make that scheme work, it all ends up wanting to have
>>> cmpxchg().
>>>
>>> Do we have a performance comparison somewhere of xadd vs cmpxchg
>>> readers? I tried looking in the old threads, but I can't seem to locate
>>> it.
>>>
>>> We need new instructions :/ Or more clever than I can muster just now.
>> In particular, an (unsigned) saturation arithmetic variant of XADD would
>> be very nice to have at this point.
> I just want to be clear about my current scheme. There will be 16 bits
> allocated for the reader count. I use the MS bit for signaling that there
> are too many readers. So the fast path will fail and the readers will be
> put into the wait list. This effectively limits readers to 32k-1, but it
> doesn't mean the actual reader count cannot go over that. As long as the
> actual count is less than 64k, everything should still work perfectly.
> IOW, even though we have reached the limit of 32k, we need to pile on an
> additional 32k readers to really overflow the count and cause a problem.

How about the following chunks to disable preemption temporarily for the
increment-check-decrement sequence?

diff --git a/include/linux/preempt.h b/include/linux/preempt.h
index dd92b1a93919..4cc03ac66e13 100644
--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -250,6 +250,8 @@ do { \
 #define preempt_enable_notrace()               barrier()
 #define preemptible()                          0
 
+#define __preempt_disable_nop  /* preempt_disable() is nop */
+
 #endif /* CONFIG_PREEMPT_COUNT */
 
 #ifdef MODULE
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 043fd29b7534..54029e6af17b 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -256,11 +256,64 @@ static inline struct task_struct *rwsem_get_owner(struct r
        return (struct task_struct *) (cowner
                ? cowner | (sowner & RWSEM_NONSPINNABLE) : sowner);
 }
+
+/*
+ * If __preempt_disable_nop is defined, calling preempt_disable() and
+ * preempt_enable() directly is the most efficient way. Otherwise, it may
+ * be more efficient to disable and enable interrupts instead for disabling
+ * preemption temporarily.
+ */
+#ifdef __preempt_disable_nop
+#define disable_preemption()   preempt_disable()
+#define enable_preemption()    preempt_enable()
+#else
+#define disable_preemption()   local_irq_disable()
+#define enable_preemption()    local_irq_enable()
+#endif
+
+/*
+ * When the owner task structure pointer is merged into count, fewer bits
+ * will be available for readers. Therefore, there is a very slight chance
+ * that the reader count may overflow. We try to prevent that from happening
+ * by checking for the MS bit of the count and failing the trylock attempt
+ * if this bit is set.
+ *
+ * With preemption enabled, there is a remote possibility that preemption
+ * can happen in the narrow timing window between incrementing and
+ * decrementing the reader count and the task is put to sleep for a
+ * considerable amount of time. If sufficient number of such unfortunate
+ * sequence of events happen, we may still overflow the reader count.
+ * To avoid such possibility, we have to disable preemption for the
+ * whole increment-check-decrement sequence.
+ *
+ * The function returns true if there are too many readers and the count
+ * has already been properly decremented so the reader must go directly
+ * into the wait list.
+ */
+static inline bool rwsem_read_trylock(struct rw_semaphore *sem, long *cnt)
+{
+       bool wait = false;      /* Wait now flag */
+
+       disable_preemption();
+       *cnt = atomic_long_fetch_add_acquire(RWSEM_READER_BIAS, &sem->count);
+       if (unlikely(*cnt < 0)) {
+               atomic_long_add(-RWSEM_READER_BIAS, &sem->count);
+               wait = true;
+       }
+       enable_preemption();
+       return wait;
+}
 #else /* !CONFIG_RWSEM_OWNER_COUNT */
 static inline struct task_struct *rwsem_get_owner(struct rw_semaphore *sem)
 {
        return READ_ONCE(sem->owner);
 }
+
+static inline bool rwsem_read_trylock(struct rw_semaphore *sem, long *cnt)
+{
+       *cnt = atomic_long_fetch_add_acquire(RWSEM_READER_BIAS, &sem->count);
+       return false;
+}
 #endif /* CONFIG_RWSEM_OWNER_COUNT */
 
 /*
@@ -981,32 +1034,18 @@ static inline void clear_wr_nonspinnable(struct rw_semaph
  * Wait for the read lock to be granted
  */
 static struct rw_semaphore __sched *
-rwsem_down_read_slowpath(struct rw_semaphore *sem, int state, long count)
+rwsem_down_read_slowpath(struct rw_semaphore *sem, int state, const bool wait)
 {
-       long adjustment = -RWSEM_READER_BIAS;
+       long count, adjustment = -RWSEM_READER_BIAS;
        bool wake = false;
        struct rwsem_waiter waiter;
        DEFINE_WAKE_Q(wake_q);
 
-       if (unlikely(count < 0)) {
+       if (unlikely(wait)) {
                /*
-                * The sign bit has been set meaning that too many active
-                * readers are present. We need to decrement reader count &
-                * enter wait queue immediately to avoid overflowing the
-                * reader count.
-                *
-                * As preemption is not disabled, there is a remote
-                * possibility that preemption can happen in the narrow
-                * timing window between incrementing and decrementing
-                * the reader count and the task is put to sleep for a
-                * considerable amount of time. If sufficient number
-                * of such unfortunate sequence of events happen, we
-                * may still overflow the reader count. It is extremely
-                * unlikey, though. If this is a concern, we should consider
-                * disable preemption during this timing window to make
-                * sure that such unfortunate event will not happen.
+                * The reader count has already been decremented and the
+                * reader should go directly into the wait list now.
                 */
-               atomic_long_add(-RWSEM_READER_BIAS, &sem->count);
                adjustment = 0;
                goto queue;
        }
@@ -1358,11 +1397,12 @@ static struct rw_semaphore *rwsem_downgrade_wake(struct
  */
 inline void __down_read(struct rw_semaphore *sem)
 {
-       long tmp = atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
-                                                &sem->count);
+       long tmp;
+       bool wait;
 
+       wait = rwsem_read_trylock(sem, &tmp);
        if (unlikely(tmp & RWSEM_READ_FAILED_MASK)) {
-               rwsem_down_read_slowpath(sem, TASK_UNINTERRUPTIBLE, tmp);
+               rwsem_down_read_slowpath(sem, TASK_UNINTERRUPTIBLE, wait);
                DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
        } else {
                rwsem_set_reader_owned(sem);
@@ -1371,11 +1411,12 @@ inline void __down_read(struct rw_semaphore *sem)
 
 static inline int __down_read_killable(struct rw_semaphore *sem)
 {
-       long tmp = atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
-                                                &sem->count);
+       long tmp;
+       bool wait;
 
+       wait = rwsem_read_trylock(sem, &tmp);
        if (unlikely(tmp & RWSEM_READ_FAILED_MASK)) {
-               if (IS_ERR(rwsem_down_read_slowpath(sem, TASK_KILLABLE, tmp)))
+               if (IS_ERR(rwsem_down_read_slowpath(sem, TASK_KILLABLE, wait)))
                        return -EINTR;
                DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
        } else {


^ permalink raw reply related	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 14/16] locking/rwsem: Guard against making count negative
  2019-04-21 21:07                     ` Waiman Long
@ 2019-04-23 14:17                       ` Peter Zijlstra
  2019-04-23 14:31                         ` Waiman Long
  2019-04-23 16:27                         ` Linus Torvalds
  0 siblings, 2 replies; 112+ messages in thread
From: Peter Zijlstra @ 2019-04-23 14:17 UTC (permalink / raw)
  To: Waiman Long
  Cc: Waiman Long, Ingo Molnar, Will Deacon, Thomas Gleixner,
	linux-kernel, x86, Davidlohr Bueso, Linus Torvalds, Tim Chen,
	huang ying

On Sun, Apr 21, 2019 at 05:07:56PM -0400, Waiman Long wrote:

> How about the following chunks to disable preemption temporarily for the
> increment-check-decrement sequence?
> 
> diff --git a/include/linux/preempt.h b/include/linux/preempt.h
> index dd92b1a93919..4cc03ac66e13 100644
> --- a/include/linux/preempt.h
> +++ b/include/linux/preempt.h
> @@ -250,6 +250,8 @@ do { \
>  #define preempt_enable_notrace()               barrier()
>  #define preemptible()                          0
>  
> +#define __preempt_disable_nop  /* preempt_disable() is nop */
> +
>  #endif /* CONFIG_PREEMPT_COUNT */
>  
>  #ifdef MODULE
> diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
> index 043fd29b7534..54029e6af17b 100644
> --- a/kernel/locking/rwsem.c
> +++ b/kernel/locking/rwsem.c
> @@ -256,11 +256,64 @@ static inline struct task_struct
> *rwsem_get_owner(struct r
>         return (struct task_struct *) (cowner
>                 ? cowner | (sowner & RWSEM_NONSPINNABLE) : sowner);
>  }
> +
> +/*
> + * If __preempt_disable_nop is defined, calling preempt_disable() and
> + * preempt_enable() directly is the most efficient way. Otherwise, it may
> + * be more efficient to disable and enable interrupt instead for disabling
> + * preemption tempoarily.
> + */
> +#ifdef __preempt_disable_nop
> +#define disable_preemption()   preempt_disable()
> +#define enable_preemption()    preempt_enable()
> +#else
> +#define disable_preemption()   local_irq_disable()
> +#define enable_preemption()    local_irq_enable()
> +#endif

I'm not aware of an architecture where disabling interrupts is faster
than disabling preemption.

> +/*
> + * When the owner task structure pointer is merged into couunt, less bits
> + * will be available for readers. Therefore, there is a very slight chance
> + * that the reader count may overflow. We try to prevent that from
> happening
> + * by checking for the MS bit of the count and failing the trylock attempt
> + * if this bit is set.
> + *
> + * With preemption enabled, there is a remote possibility that preemption
> + * can happen in the narrow timing window between incrementing and
> + * decrementing the reader count and the task is put to sleep for a
> + * considerable amount of time. If sufficient number of such unfortunate
> + * sequence of events happen, we may still overflow the reader count.
> + * To avoid such possibility, we have to disable preemption for the
> + * whole increment-check-decrement sequence.
> + *
> + * The function returns true if there are too many readers and the count
> + * has already been properly decremented so the reader must go directly
> + * into the wait list.
> + */
> +static inline bool rwsem_read_trylock(struct rw_semaphore *sem, long *cnt)
> +{
> +       bool wait = false;      /* Wait now flag */
> +
> +       disable_preemption();
> +       *cnt = atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
> &sem->count);
> +       if (unlikely(*cnt < 0)) {
> +               atomic_long_add(-RWSEM_READER_BIAS, &sem->count);
> +               wait = true;
> +       }
> +       enable_preemption();
> +       return wait;
> +}
>  #else /* !CONFIG_RWSEM_OWNER_COUNT */

This also means you have to ensure CONFIG_NR_CPUS < 32K for
RWSEM_OWNER_COUNT.

>  static inline struct task_struct *rwsem_get_owner(struct rw_semaphore *sem)
>  {
>         return READ_ONCE(sem->owner);
>  }
> +
> +static inline bool rwsem_read_trylock(struct rw_semaphore *sem, long *cnt)
> +{
> +       *cnt = atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
> &sem->count);
> +       return false;
> +}
>  #endif /* CONFIG_RWSEM_OWNER_COUNT */
>  
>  /*
> @@ -981,32 +1034,18 @@ static inline void clear_wr_nonspinnable(struct
> rw_semaph
>   * Wait for the read lock to be granted
>   */
>  static struct rw_semaphore __sched *
> -rwsem_down_read_slowpath(struct rw_semaphore *sem, int state, long count)
> +rwsem_down_read_slowpath(struct rw_semaphore *sem, int state, const
> bool wait)
>  {
> -       long adjustment = -RWSEM_READER_BIAS;
> +       long count, adjustment = -RWSEM_READER_BIAS;
>         bool wake = false;
>         struct rwsem_waiter waiter;
>         DEFINE_WAKE_Q(wake_q);
>  
> -       if (unlikely(count < 0)) {
> +       if (unlikely(wait)) {
>                 /*
> -                * The sign bit has been set meaning that too many active
> -                * readers are present. We need to decrement reader count &
> -                * enter wait queue immediately to avoid overflowing the
> -                * reader count.
> -                *
> -                * As preemption is not disabled, there is a remote
> -                * possibility that preemption can happen in the narrow
> -                * timing window between incrementing and decrementing
> -                * the reader count and the task is put to sleep for a
> -                * considerable amount of time. If sufficient number
> -                * of such unfortunate sequence of events happen, we
> -                * may still overflow the reader count. It is extremely
> -                * unlikey, though. If this is a concern, we should consider
> -                * disable preemption during this timing window to make
> -                * sure that such unfortunate event will not happen.
> +                * The reader count has already been decremented and the
> +                * reader should go directly into the wait list now.
>                  */
> -               atomic_long_add(-RWSEM_READER_BIAS, &sem->count);
>                 adjustment = 0;
>                 goto queue;
>         }
> @@ -1358,11 +1397,12 @@ static struct rw_semaphore
> *rwsem_downgrade_wake(struct
>   */
>  inline void __down_read(struct rw_semaphore *sem)
>  {
> -       long tmp = atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
> -                                                &sem->count);
> +       long tmp;
> +       bool wait;
>  
> +       wait = rwsem_read_trylock(sem, &tmp);
>         if (unlikely(tmp & RWSEM_READ_FAILED_MASK)) {
> -               rwsem_down_read_slowpath(sem, TASK_UNINTERRUPTIBLE, tmp);
> +               rwsem_down_read_slowpath(sem, TASK_UNINTERRUPTIBLE, wait);
>                 DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
>         } else {
>                 rwsem_set_reader_owned(sem);

I think I prefer that function returning/taking the bias/adjustment
value instead of a bool, if it is all the same.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 14/16] locking/rwsem: Guard against making count negative
  2019-04-23 14:17                       ` Peter Zijlstra
@ 2019-04-23 14:31                         ` Waiman Long
  2019-04-23 16:27                         ` Linus Torvalds
  1 sibling, 0 replies; 112+ messages in thread
From: Waiman Long @ 2019-04-23 14:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Waiman Long, Ingo Molnar, Will Deacon, Thomas Gleixner,
	linux-kernel, x86, Davidlohr Bueso, Linus Torvalds, Tim Chen,
	huang ying

On 4/23/19 10:17 AM, Peter Zijlstra wrote:
> On Sun, Apr 21, 2019 at 05:07:56PM -0400, Waiman Long wrote:
>
>> How about the following chunks to disable preemption temporarily for the
>> increment-check-decrement sequence?
>>
>> diff --git a/include/linux/preempt.h b/include/linux/preempt.h
>> index dd92b1a93919..4cc03ac66e13 100644
>> --- a/include/linux/preempt.h
>> +++ b/include/linux/preempt.h
>> @@ -250,6 +250,8 @@ do { \
>>  #define preempt_enable_notrace()               barrier()
>>  #define preemptible()                          0
>>  
>> +#define __preempt_disable_nop  /* preempt_disable() is nop */
>> +
>>  #endif /* CONFIG_PREEMPT_COUNT */
>>  
>>  #ifdef MODULE
>> diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
>> index 043fd29b7534..54029e6af17b 100644
>> --- a/kernel/locking/rwsem.c
>> +++ b/kernel/locking/rwsem.c
>> @@ -256,11 +256,64 @@ static inline struct task_struct
>> *rwsem_get_owner(struct r
>>         return (struct task_struct *) (cowner
>>                 ? cowner | (sowner & RWSEM_NONSPINNABLE) : sowner);
>>  }
>> +
>> +/*
>> + * If __preempt_disable_nop is defined, calling preempt_disable() and
>> + * preempt_enable() directly is the most efficient way. Otherwise, it may
>> + * be more efficient to disable and enable interrupt instead for disabling
>> + * preemption tempoarily.
>> + */
>> +#ifdef __preempt_disable_nop
>> +#define disable_preemption()   preempt_disable()
>> +#define enable_preemption()    preempt_enable()
>> +#else
>> +#define disable_preemption()   local_irq_disable()
>> +#define enable_preemption()    local_irq_enable()
>> +#endif
> I'm not aware of an architecture where disabling interrupts is faster
> than disabling preemption.

I have actually done some performance tests measuring the effects of
disabling interrupts and preemption on readers (on an x86-64 system).

  Threads    Before patch    Disable irq    Disable preemption
  -------    ------------    -----------    ------------------
     1          9,088          8,766           9,172
     2          9,296          9,169           8,707
     4         11,192         11,205          10,712
     8         11,329         11,332          11,213

For the uncontended case, disabling interrupts is slower. The slowdown is
gone once the rwsem becomes contended. So it may not be a good idea to
disable interrupts as a proxy for disabling preemption.

BTW, preemption count is not enabled in typical distro production
kernels like RHEL. So preempt_disable() is just a barrier. It is turned
on in the debug kernel, though.
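
For reference, with !CONFIG_PREEMPT_COUNT the relevant definitions collapse
to compiler barriers, along the lines of the block the patch above extends:

	#ifndef CONFIG_PREEMPT_COUNT
	#define preempt_disable()		barrier()
	#define preempt_enable()		barrier()
	#define preempt_enable_no_resched()	barrier()
	#endif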


>> +/*
>> + * When the owner task structure pointer is merged into couunt, less bits
>> + * will be available for readers. Therefore, there is a very slight chance
>> + * that the reader count may overflow. We try to prevent that from
>> happening
>> + * by checking for the MS bit of the count and failing the trylock attempt
>> + * if this bit is set.
>> + *
>> + * With preemption enabled, there is a remote possibility that preemption
>> + * can happen in the narrow timing window between incrementing and
>> + * decrementing the reader count and the task is put to sleep for a
>> + * considerable amount of time. If sufficient number of such unfortunate
>> + * sequence of events happen, we may still overflow the reader count.
>> + * To avoid such possibility, we have to disable preemption for the
>> + * whole increment-check-decrement sequence.
>> + *
>> + * The function returns true if there are too many readers and the count
>> + * has already been properly decremented so the reader must go directly
>> + * into the wait list.
>> + */
>> +static inline bool rwsem_read_trylock(struct rw_semaphore *sem, long *cnt)
>> +{
>> +       bool wait = false;      /* Wait now flag */
>> +
>> +       disable_preemption();
>> +       *cnt = atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
>> &sem->count);
>> +       if (unlikely(*cnt < 0)) {
>> +               atomic_long_add(-RWSEM_READER_BIAS, &sem->count);
>> +               wait = true;
>> +       }
>> +       enable_preemption();
>> +       return wait;
>> +}
>>  #else /* !CONFIG_RWSEM_OWNER_COUNT */
> This also means you have to ensure CONFIG_NR_CPUS < 32K for
> RWSEM_OWNER_COUNT.


Yes, that can be done.


>
>>  static inline struct task_struct *rwsem_get_owner(struct rw_semaphore *sem)
>>  {
>>         return READ_ONCE(sem->owner);
>>  }
>> +
>> +static inline bool rwsem_read_trylock(struct rw_semaphore *sem, long *cnt)
>> +{
>> +       *cnt = atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
>> &sem->count);
>> +       return false;
>> +}
>>  #endif /* CONFIG_RWSEM_OWNER_COUNT */
>>  
>>  /*
>> @@ -981,32 +1034,18 @@ static inline void clear_wr_nonspinnable(struct
>> rw_semaph
>>   * Wait for the read lock to be granted
>>   */
>>  static struct rw_semaphore __sched *
>> -rwsem_down_read_slowpath(struct rw_semaphore *sem, int state, long count)
>> +rwsem_down_read_slowpath(struct rw_semaphore *sem, int state, const
>> bool wait)
>>  {
>> -       long adjustment = -RWSEM_READER_BIAS;
>> +       long count, adjustment = -RWSEM_READER_BIAS;
>>         bool wake = false;
>>         struct rwsem_waiter waiter;
>>         DEFINE_WAKE_Q(wake_q);
>>  
>> -       if (unlikely(count < 0)) {
>> +       if (unlikely(wait)) {
>>                 /*
>> -                * The sign bit has been set meaning that too many active
>> -                * readers are present. We need to decrement reader count &
>> -                * enter wait queue immediately to avoid overflowing the
>> -                * reader count.
>> -                *
>> -                * As preemption is not disabled, there is a remote
>> -                * possibility that preemption can happen in the narrow
>> -                * timing window between incrementing and decrementing
>> -                * the reader count and the task is put to sleep for a
>> -                * considerable amount of time. If sufficient number
>> -                * of such unfortunate sequence of events happen, we
>> -                * may still overflow the reader count. It is extremely
>> -                * unlikey, though. If this is a concern, we should consider
>> -                * disable preemption during this timing window to make
>> -                * sure that such unfortunate event will not happen.
>> +                * The reader count has already been decremented and the
>> +                * reader should go directly into the wait list now.
>>                  */
>> -               atomic_long_add(-RWSEM_READER_BIAS, &sem->count);
>>                 adjustment = 0;
>>                 goto queue;
>>         }
>> @@ -1358,11 +1397,12 @@ static struct rw_semaphore
>> *rwsem_downgrade_wake(struct
>>   */
>>  inline void __down_read(struct rw_semaphore *sem)
>>  {
>> -       long tmp = atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
>> -                                                &sem->count);
>> +       long tmp;
>> +       bool wait;
>>  
>> +       wait = rwsem_read_trylock(sem, &tmp);
>>         if (unlikely(tmp & RWSEM_READ_FAILED_MASK)) {
>> -               rwsem_down_read_slowpath(sem, TASK_UNINTERRUPTIBLE, tmp);
>> +               rwsem_down_read_slowpath(sem, TASK_UNINTERRUPTIBLE, wait);
>>                 DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
>>         } else {
>>                 rwsem_set_reader_owned(sem);
> I think I prefer that function returning/taking the bias/adjustment
> value instead of a bool, if it is all the same.

Sure, I can do that.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 14/16] locking/rwsem: Guard against making count negative
  2019-04-23 14:17                       ` Peter Zijlstra
  2019-04-23 14:31                         ` Waiman Long
@ 2019-04-23 16:27                         ` Linus Torvalds
  2019-04-23 19:12                           ` Waiman Long
  1 sibling, 1 reply; 112+ messages in thread
From: Linus Torvalds @ 2019-04-23 16:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Waiman Long, Waiman Long, Ingo Molnar, Will Deacon,
	Thomas Gleixner, Linux List Kernel Mailing,
	the arch/x86 maintainers, Davidlohr Bueso, Tim Chen, huang ying

On Tue, Apr 23, 2019 at 7:17 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> I'm not aware of an architecture where disabling interrupts is faster
> than disabling preemption.

I don't think it ever is, but I'd worry a bit about the
preempt_enable() just because it also checks if need_resched() is true
when re-enabling preemption.

So doing preempt_enable() as part of rwsem_read_trylock() might cause
us to schedule in *exactly* the wrong place.

So if we play preemption games, I wonder if we should make them more
explicit than hiding them in that helper function, because
particularly for the slow path case, I think we'd be much better off
just avoiding the busy-loop in the slow path, rather than first
scheduling due to preempt_enable(), and then starting to look at the
slow path only afterwards.

IOW, I get the feeling that the preemption-off area might be better
off being potentially much bigger, and covering the whole (or a large
portion) of the semaphore operation, rather than just the
rwsem_read_trylock() fastpath.

Hmm?
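
The shape being suggested is roughly this (a sketch only, not a patch from
the thread; the slow path would still have to re-enable preemption before
it actually sleeps):

	inline void __down_read(struct rw_semaphore *sem)
	{
		long tmp;

		preempt_disable();
		tmp = atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
						    &sem->count);
		if (unlikely(tmp & RWSEM_READ_FAILED_MASK)) {
			/* stay non-preemptible across the early slow path */
			rwsem_down_read_slowpath(sem, TASK_UNINTERRUPTIBLE, tmp);
			DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
		} else {
			rwsem_set_reader_owned(sem);
		}
		preempt_enable();
	}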

                   Linus

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 14/16] locking/rwsem: Guard against making count negative
  2019-04-23 16:27                         ` Linus Torvalds
@ 2019-04-23 19:12                           ` Waiman Long
  2019-04-23 19:34                             ` Peter Zijlstra
  2019-04-24  7:09                             ` [PATCH v4 14/16] locking/rwsem: Guard against making count negative Peter Zijlstra
  0 siblings, 2 replies; 112+ messages in thread
From: Waiman Long @ 2019-04-23 19:12 UTC (permalink / raw)
  To: Linus Torvalds, Peter Zijlstra
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner,
	Linux List Kernel Mailing, the arch/x86 maintainers,
	Davidlohr Bueso, Tim Chen, huang ying

On 4/23/19 12:27 PM, Linus Torvalds wrote:
> On Tue, Apr 23, 2019 at 7:17 AM Peter Zijlstra <peterz@infradead.org> wrote:
>> I'm not aware of an architecture where disabling interrupts is faster
>> than disabling preemption.
> I don't think it ever is, but I'd worry a bit about the
> preempt_enable() just because it also checks if need_resched() is true
> when re-enabling preemption.
>
> So doing preempt_enable() as part of rwsem_read_trylock() might cause
> us to schedule in *exactly* the wrong place,

You are right on that. However, there is a variant called
preempt_enable_no_resched() that doesn't have this side effect. So I am
going to use that one instead.

> So if we play preemption games, I wonder if we should make them more
> explicit than hiding them in that helper function, because
> particularly for the slow path case, I think we'd be much better off
> just avoiding the busy-loop in the slow path, rather than first
> scheduling due to preempt_enable(), and then starting to look at the
> slow path only afterwards.
>
> IOW, I get the feeling that the preemption-off area might be better
> off being potentially much bigger, and covering the whole (or a large
> portion) of the semaphore operation, rather than just the
> rwsem_read_trylock() fastpath.
>
> Hmm?

That is true in general, but doing preempt_disable/enable across a
function boundary is ugly and prone to further problems down the road.

Cheers,
Longman

>                    Linus



^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 14/16] locking/rwsem: Guard against making count negative
  2019-04-23 19:12                           ` Waiman Long
@ 2019-04-23 19:34                             ` Peter Zijlstra
  2019-04-23 19:41                               ` Waiman Long
  2019-04-24  7:09                             ` [PATCH v4 14/16] locking/rwsem: Guard against making count negative Peter Zijlstra
  1 sibling, 1 reply; 112+ messages in thread
From: Peter Zijlstra @ 2019-04-23 19:34 UTC (permalink / raw)
  To: Waiman Long
  Cc: Linus Torvalds, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Linux List Kernel Mailing, the arch/x86 maintainers,
	Davidlohr Bueso, Tim Chen, huang ying

On Tue, Apr 23, 2019 at 03:12:16PM -0400, Waiman Long wrote:
> On 4/23/19 12:27 PM, Linus Torvalds wrote:
> > On Tue, Apr 23, 2019 at 7:17 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >> I'm not aware of an architecture where disabling interrupts is faster
> >> than disabling preemption.
> > I don't think it ever is, but I'd worry a bit about the
> > preempt_enable() just because it also checks if need_resched() is true
> > when re-enabling preemption.
> >
> > So doing preempt_enable() as part of rwsem_read_trylock() might cause
> > us to schedule in *exactly* the wrong place,
> 
> You are right on that. However, there is a variant called
> preempt_enable_no_resched() that doesn't have this side effect. So I am
> going to use that one instead.

Only if the very next line is schedule(). Otherwise you're very much not
going to use that function.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 14/16] locking/rwsem: Guard against making count negative
  2019-04-23 19:34                             ` Peter Zijlstra
@ 2019-04-23 19:41                               ` Waiman Long
  2019-04-23 19:55                                 ` [PATCH] bpf: Fix preempt_enable_no_resched() abuse Peter Zijlstra
  0 siblings, 1 reply; 112+ messages in thread
From: Waiman Long @ 2019-04-23 19:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Linux List Kernel Mailing, the arch/x86 maintainers,
	Davidlohr Bueso, Tim Chen, huang ying

On 4/23/19 3:34 PM, Peter Zijlstra wrote:
> On Tue, Apr 23, 2019 at 03:12:16PM -0400, Waiman Long wrote:
>> On 4/23/19 12:27 PM, Linus Torvalds wrote:
>>> On Tue, Apr 23, 2019 at 7:17 AM Peter Zijlstra <peterz@infradead.org> wrote:
>>>> I'm not aware of an architecture where disabling interrupts is faster
>>>> than disabling preemption.
>>> I don't think it ever is, but I'd worry a bit about the
>>> preempt_enable() just because it also checks if need_resched() is true
>>> when re-enabling preemption.
>>>
>>> So doing preempt_enable() as part of rwsem_read_trylock() might cause
>>> us to schedule in *exactly* the wrong place,
>> You are right on that. However, there is a variant called
>> preempt_enable_no_resched() that doesn't have this side effect. So I am
>> going to use that one instead.
> Only if the very next line is schedule(). Otherwise you're very much not
> going to use that function.

May I know the reason why? I saw a number of instances of
preempt_enable_no_resched() without a schedule() right next to them.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 112+ messages in thread

* [PATCH] bpf: Fix preempt_enable_no_resched() abuse
  2019-04-23 19:41                               ` Waiman Long
@ 2019-04-23 19:55                                 ` Peter Zijlstra
  2019-04-23 20:03                                   ` [PATCH] trace: " Peter Zijlstra
                                                     ` (2 more replies)
  0 siblings, 3 replies; 112+ messages in thread
From: Peter Zijlstra @ 2019-04-23 19:55 UTC (permalink / raw)
  To: Waiman Long
  Cc: Linus Torvalds, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Linux List Kernel Mailing, the arch/x86 maintainers,
	Davidlohr Bueso, Tim Chen, huang ying, Roman Gushchin,
	Alexei Starovoitov, Daniel Borkmann

On Tue, Apr 23, 2019 at 03:41:32PM -0400, Waiman Long wrote:
> On 4/23/19 3:34 PM, Peter Zijlstra wrote:
> > On Tue, Apr 23, 2019 at 03:12:16PM -0400, Waiman Long wrote:

> >> You are right on that. However, there is a variant called
> >> preempt_enable_no_resched() that doesn't have this side effect. So I am
> >> going to use that one instead.
> > Only if the very next line is schedule(). Otherwise you're very much not
> > going to use that function.
> 
> May I know the reason why?

Because it can 'consume' a need_resched and introduces arbitrary delays
before the schedule() eventually happens, breaking the very notion of
PREEMPT=y (and the fundamentals RT relies on).
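
For illustration, here is a minimal sketch (not from any patch in this
thread; do_nonpreemptible_work() is a made-up helper) of the one shape
where preempt_enable_no_resched() is fine. The pending need_resched, if
any, is consumed by the schedule() on the very next line, so no
preemption point is lost:

static void yield_after_work(void)
{
	preempt_disable();
	do_nonpreemptible_work();	/* hypothetical critical section */
	preempt_enable_no_resched();
	schedule();			/* the "very next line" requirement */
}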

> I saw a number of instances of
> preempt_enable_no_resched() without a schedule() right next to them.

Look more closely.. and let me know, if true, those are bugs that need
fixing.

Argghhh.. BPF...

Also, with the recent RCU rework, we can probably drop that
rcu_read_lock()/rcu_read_unlock() from there if we're disabling
preemption anyway.

---
Subject: bpf: Fix preempt_enable_no_resched() abuse

Unless the very next line is schedule(), or implies it, one must not use
preempt_enable_no_resched(). It can cause a preemption to go missing and
thereby cause arbitrary delays, breaking the PREEMPT=y invariant.

Cc: Roman Gushchin <guro@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index f02367faa58d..944ccc310201 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -510,7 +510,7 @@ int bpf_prog_array_copy(struct bpf_prog_array __rcu *old_array,
 		}					\
 _out:							\
 		rcu_read_unlock();			\
-		preempt_enable_no_resched();		\
+		preempt_enable();			\
 		_ret;					\
 	 })
 

^ permalink raw reply related	[flat|nested] 112+ messages in thread

* [PATCH] trace: Fix preempt_enable_no_resched() abuse
  2019-04-23 19:55                                 ` [PATCH] bpf: Fix preempt_enable_no_resched() abuse Peter Zijlstra
@ 2019-04-23 20:03                                   ` Peter Zijlstra
  2019-04-23 23:58                                     ` Steven Rostedt
  2019-04-29  6:39                                     ` [tip:sched/core] " tip-bot for Peter Zijlstra
  2019-04-23 20:27                                   ` [PATCH] bpf: " Linus Torvalds
  2019-04-25 21:23                                   ` Alexei Starovoitov
  2 siblings, 2 replies; 112+ messages in thread
From: Peter Zijlstra @ 2019-04-23 20:03 UTC (permalink / raw)
  To: Waiman Long
  Cc: Linus Torvalds, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Linux List Kernel Mailing, the arch/x86 maintainers,
	Davidlohr Bueso, Tim Chen, huang ying, Roman Gushchin,
	Alexei Starovoitov, Daniel Borkmann, Steven Rostedt

On Tue, Apr 23, 2019 at 09:55:59PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 23, 2019 at 03:41:32PM -0400, Waiman Long wrote:

> > I saw a number of instances of
> > preempt_enable_no_resched() without a schedule() right next to them.
> 
> Look more closely.. and let me know, if true, those are bugs that need
> fixing.
> 
> Argghhh.. BPF...

/me shakes head, Steve...

---
Subject: trace: Fix preempt_enable_no_resched() abuse

Unless the very next line is schedule(), or implies it, one must not use
preempt_enable_no_resched(). It can cause a preemption to go missing and
thereby cause arbitrary delays, breaking the PREEMPT=y invariant.

Cc: Steven Rostedt <rostedt@goodmis.org>
Fixes: 37886f6a9f62 ("ring-buffer: add api to allow a tracer to change clock source")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/trace/ring_buffer.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 41b6f96e5366..4ee8d8aa3d0f 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -762,7 +762,7 @@ u64 ring_buffer_time_stamp(struct ring_buffer *buffer, int cpu)
 
 	preempt_disable_notrace();
 	time = rb_time_stamp(buffer);
-	preempt_enable_no_resched_notrace();
+	preempt_enable_notrace();
 
 	return time;
 }

^ permalink raw reply related	[flat|nested] 112+ messages in thread

* Re: [PATCH] bpf: Fix preempt_enable_no_resched() abuse
  2019-04-23 19:55                                 ` [PATCH] bpf: Fix preempt_enable_no_resched() abuse Peter Zijlstra
  2019-04-23 20:03                                   ` [PATCH] trace: " Peter Zijlstra
@ 2019-04-23 20:27                                   ` Linus Torvalds
  2019-04-23 20:35                                     ` Peter Zijlstra
  2019-04-25 21:23                                   ` Alexei Starovoitov
  2 siblings, 1 reply; 112+ messages in thread
From: Linus Torvalds @ 2019-04-23 20:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Waiman Long, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Linux List Kernel Mailing, the arch/x86 maintainers,
	Davidlohr Bueso, Tim Chen, huang ying, Roman Gushchin,
	Alexei Starovoitov, Daniel Borkmann

On Tue, Apr 23, 2019 at 12:56 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> Unless the very next line is schedule(), or implies it, one must not use
> preempt_enable_no_resched(). It can cause a preemption to go missing and
> thereby cause arbitrary delays, breaking the PREEMPT=y invariant.

That language may be a bit strong, or maybe the "implies it" might at
least be expanded on.

It doesn't need to be "schedule()" per se, it can be any of the things
that check if we _need_ to be scheduled.

IOW, various variations of "if (need_resched())" exiting a loop, and
then outside the loop there's a cond_resched() or similar.
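
A hypothetical sketch of that shape (process_one() is a made-up helper;
note Peter's point below that cond_resched() is compiled out on
PREEMPT=y, so it alone does not restore the lost preemption point there):

static void drain_queue(void)
{
	preempt_disable();
	while (process_one()) {
		if (need_resched())
			break;		/* leave the preempt-off region */
	}
	preempt_enable_no_resched();
	cond_resched();			/* no-op on PREEMPT=y, see below */
}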

                    Linus

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH] bpf: Fix preempt_enable_no_resched() abuse
  2019-04-23 20:27                                   ` [PATCH] bpf: " Linus Torvalds
@ 2019-04-23 20:35                                     ` Peter Zijlstra
  2019-04-23 20:45                                       ` Linus Torvalds
  2019-04-24 13:19                                       ` Peter Zijlstra
  0 siblings, 2 replies; 112+ messages in thread
From: Peter Zijlstra @ 2019-04-23 20:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Waiman Long, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Linux List Kernel Mailing, the arch/x86 maintainers,
	Davidlohr Bueso, Tim Chen, huang ying, Roman Gushchin,
	Alexei Starovoitov, Daniel Borkmann

On Tue, Apr 23, 2019 at 01:27:29PM -0700, Linus Torvalds wrote:
> On Tue, Apr 23, 2019 at 12:56 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > Unless the very next line is schedule(), or implies it, one must not use
> > preempt_enable_no_resched(). It can cause a preemption to go missing and
> > thereby cause arbitrary delays, breaking the PREEMPT=y invariant.
> 
> That language may be a bit strong, or maybe the "implies it" might at
> least be expanded on.
> 
> It doesn't need to be "schedule()" per se, it can be any of the things
> that check if we _need_ to be scheduled.

I'll try and word-smith that tomorrow, brain is fried. But yes,
something that ends up in schedule() 'soon'.

The usage in ist_exit() is particularly 'fun'. That really should've had
a comment. That relies on the return-from-interrupt path.

> IOW, various variations of "if (need_resched())" exiting a loop, and
> then outside the loop there's a cond_resched() or similar.

That one 'funnily' doesn't actually work; cond_resched() is a no-op on
PREEMPT=y.
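
For reference, a simplified sketch of why that is so, following the
shape of the definitions in include/linux/sched.h around this time
(treat the exact spelling as an assumption, not a quote of the header):

#ifndef CONFIG_PREEMPT
extern int _cond_resched(void);			/* may reschedule */
#else
static inline int _cond_resched(void) { return 0; }	/* compiled away */
#endif

#define cond_resched() ({			\
	___might_sleep(__FILE__, __LINE__, 0);	\
	_cond_resched();			\
})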

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH] bpf: Fix preempt_enable_no_resched() abuse
  2019-04-23 20:35                                     ` Peter Zijlstra
@ 2019-04-23 20:45                                       ` Linus Torvalds
  2019-04-24 13:19                                       ` Peter Zijlstra
  1 sibling, 0 replies; 112+ messages in thread
From: Linus Torvalds @ 2019-04-23 20:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Waiman Long, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Linux List Kernel Mailing, the arch/x86 maintainers,
	Davidlohr Bueso, Tim Chen, huang ying, Roman Gushchin,
	Alexei Starovoitov, Daniel Borkmann

On Tue, Apr 23, 2019 at 1:35 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> That one 'funnily' doesn't actually work; cond_resched() is a no-op on
> PREEMPT=y.

Uhhuh. I "knew" that, but it's one of those bitrotting things.

Which does make your argument stronger, of course. This is way too
easy to get wrong even if you think you are being careful.

I guess it could be another thing to check for with objtool, since you
have now gotten the experience with it ;)

                   Linus

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH] trace: Fix preempt_enable_no_resched() abuse
  2019-04-23 20:03                                   ` [PATCH] trace: " Peter Zijlstra
@ 2019-04-23 23:58                                     ` Steven Rostedt
  2019-04-29  6:39                                     ` [tip:sched/core] " tip-bot for Peter Zijlstra
  1 sibling, 0 replies; 112+ messages in thread
From: Steven Rostedt @ 2019-04-23 23:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Waiman Long, Linus Torvalds, Ingo Molnar, Will Deacon,
	Thomas Gleixner, Linux List Kernel Mailing,
	the arch/x86 maintainers, Davidlohr Bueso, Tim Chen, huang ying,
	Roman Gushchin, Alexei Starovoitov, Daniel Borkmann

On Tue, 23 Apr 2019 22:03:18 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Tue, Apr 23, 2019 at 09:55:59PM +0200, Peter Zijlstra wrote:
> > On Tue, Apr 23, 2019 at 03:41:32PM -0400, Waiman Long wrote:  
> 
> > > I saw a number of instances of
> > > preempt_enable_no_resched() without a schedule() right next to them.
> > 
> > Look more closely.. and let me know, if true, those are bugs that need
> > fixing.
> > 
> > Argghhh.. BPF...  
> 
> /me shakes head, Steve...

/me points finger to Frederic ;-)


> 
> ---
> Subject: trace: Fix preempt_enable_no_resched() abuse
> 
> Unless the very next line is schedule(), or implies it, one must not use
> preempt_enable_no_resched(). It can cause a preemption to go missing and
> thereby cause arbitrary delays, breaking the PREEMPT=y invariant.
> 
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Fixes: 37886f6a9f62 ("ring-buffer: add api to allow a tracer to change clock source")

That commit just moved the buggy code. That tag should be:

Fixes: 2c2d7329d8af ("tracing/ftrace: use preempt_enable_no_resched_notrace in ring_buffer_time_stamp()")

OK, it isn't quite fair to pin all the blame on Frederic, because
that commit did fix a bug. But the real fix for that bug was your fix here:

499d79559ffe4b ("sched/core: More notrace annotations")

-- Steve


> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/trace/ring_buffer.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index 41b6f96e5366..4ee8d8aa3d0f 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -762,7 +762,7 @@ u64 ring_buffer_time_stamp(struct ring_buffer *buffer, int cpu)
>  
>  	preempt_disable_notrace();
>  	time = rb_time_stamp(buffer);
> -	preempt_enable_no_resched_notrace();
> +	preempt_enable_notrace();
>  
>  	return time;
>  }


^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 14/16] locking/rwsem: Guard against making count negative
  2019-04-23 19:12                           ` Waiman Long
  2019-04-23 19:34                             ` Peter Zijlstra
@ 2019-04-24  7:09                             ` Peter Zijlstra
  2019-04-24 16:49                               ` Waiman Long
  1 sibling, 1 reply; 112+ messages in thread
From: Peter Zijlstra @ 2019-04-24  7:09 UTC (permalink / raw)
  To: Waiman Long
  Cc: Linus Torvalds, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Linux List Kernel Mailing, the arch/x86 maintainers,
	Davidlohr Bueso, Tim Chen, huang ying

On Tue, Apr 23, 2019 at 03:12:16PM -0400, Waiman Long wrote:
> That is true in general, but doing preempt_disable/enable across
> function boundary is ugly and prone to further problems down the road.

We do worse things in this code, and the thing Linus proposes is
actually quite simple, something like so:

---
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -912,7 +904,7 @@ rwsem_down_read_slowpath(struct rw_semap
 			raw_spin_unlock_irq(&sem->wait_lock);
 			break;
 		}
-		schedule();
+		schedule_preempt_disabled();
 		lockevent_inc(rwsem_sleep_reader);
 	}
 
@@ -1121,6 +1113,7 @@ static struct rw_semaphore *rwsem_downgr
  */
 inline void __down_read(struct rw_semaphore *sem)
 {
+	preempt_disable();
 	if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
 			&sem->count) & RWSEM_READ_FAILED_MASK)) {
 		rwsem_down_read_slowpath(sem, TASK_UNINTERRUPTIBLE);
@@ -1129,10 +1122,12 @@ inline void __down_read(struct rw_semaph
 	} else {
 		rwsem_set_reader_owned(sem);
 	}
+	preempt_enable();
 }
 
 static inline int __down_read_killable(struct rw_semaphore *sem)
 {
+	preempt_disable();
 	if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
 			&sem->count) & RWSEM_READ_FAILED_MASK)) {
 		if (IS_ERR(rwsem_down_read_slowpath(sem, TASK_KILLABLE)))
@@ -1142,6 +1137,7 @@ static inline int __down_read_killable(s
 	} else {
 		rwsem_set_reader_owned(sem);
 	}
+	preempt_enable();
 	return 0;
 }
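
For reference, schedule_preempt_disabled() used in the hunk above is the
sanctioned preempt_enable_no_resched() pattern wrapped in a helper; a
simplified sketch of kernel/sched/core.c (treat exact details as an
assumption):

void __sched schedule_preempt_disabled(void)
{
	sched_preempt_enable_no_resched();
	schedule();
	preempt_disable();
}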
 

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH] bpf: Fix preempt_enable_no_resched() abuse
  2019-04-23 20:35                                     ` Peter Zijlstra
  2019-04-23 20:45                                       ` Linus Torvalds
@ 2019-04-24 13:19                                       ` Peter Zijlstra
  1 sibling, 0 replies; 112+ messages in thread
From: Peter Zijlstra @ 2019-04-24 13:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Waiman Long, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Linux List Kernel Mailing, the arch/x86 maintainers,
	Davidlohr Bueso, Tim Chen, huang ying, Roman Gushchin,
	Alexei Starovoitov, Daniel Borkmann

On Tue, Apr 23, 2019 at 10:35:12PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 23, 2019 at 01:27:29PM -0700, Linus Torvalds wrote:
> > On Tue, Apr 23, 2019 at 12:56 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > Unless the very next line is schedule(), or implies it, one must not use
> > > preempt_enable_no_resched(). It can cause a preemption to go missing and
> > > thereby cause arbitrary delays, breaking the PREEMPT=y invariant.
> > 
> > That language may be a bit strong, or maybe the "implies it" might at
> > least be expanded on.
> > 
> > It doesn't need to be "schedule()" per se, it can be any of the things
> > that check if we _need_ to be scheduled.
> 
> I'll try and word-smith that tomorrow, brain is fried. But yes,
> something that ends up in schedule() 'soon'.

I've made that:

Unless there is a call into schedule() in the immediate
(deterministic) future, one must not use preempt_enable_no_resched().
It can cause a preemption to go missing and thereby cause arbitrary
delays, breaking the PREEMPT=y invariant.


^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 14/16] locking/rwsem: Guard against making count negative
  2019-04-24  7:09                             ` [PATCH v4 14/16] locking/rwsem: Guard against making count negative Peter Zijlstra
@ 2019-04-24 16:49                               ` Waiman Long
  2019-04-24 17:01                                 ` Peter Zijlstra
  0 siblings, 1 reply; 112+ messages in thread
From: Waiman Long @ 2019-04-24 16:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Linux List Kernel Mailing, the arch/x86 maintainers,
	Davidlohr Bueso, Tim Chen, huang ying

On 4/24/19 3:09 AM, Peter Zijlstra wrote:
> On Tue, Apr 23, 2019 at 03:12:16PM -0400, Waiman Long wrote:
>> That is true in general, but doing preempt_disable/enable across
>> function boundary is ugly and prone to further problems down the road.
> We do worse things in this code, and the thing Linus proposes is
> actually quite simple, something like so:
>
> ---
> --- a/kernel/locking/rwsem.c
> +++ b/kernel/locking/rwsem.c
> @@ -912,7 +904,7 @@ rwsem_down_read_slowpath(struct rw_semap
>  			raw_spin_unlock_irq(&sem->wait_lock);
>  			break;
>  		}
> -		schedule();
> +		schedule_preempt_disabled();
>  		lockevent_inc(rwsem_sleep_reader);
>  	}
>  
> @@ -1121,6 +1113,7 @@ static struct rw_semaphore *rwsem_downgr
>   */
>  inline void __down_read(struct rw_semaphore *sem)
>  {
> +	preempt_disable();
>  	if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
>  			&sem->count) & RWSEM_READ_FAILED_MASK)) {
>  		rwsem_down_read_slowpath(sem, TASK_UNINTERRUPTIBLE);
> @@ -1129,10 +1122,12 @@ inline void __down_read(struct rw_semaph
>  	} else {
>  		rwsem_set_reader_owned(sem);
>  	}
> +	preempt_enable();
>  }
>  
>  static inline int __down_read_killable(struct rw_semaphore *sem)
>  {
> +	preempt_disable();
>  	if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
>  			&sem->count) & RWSEM_READ_FAILED_MASK)) {
>  		if (IS_ERR(rwsem_down_read_slowpath(sem, TASK_KILLABLE)))
> @@ -1142,6 +1137,7 @@ static inline int __down_read_killable(s
>  	} else {
>  		rwsem_set_reader_owned(sem);
>  	}
> +	preempt_enable();
>  	return 0;
>  }
>  

Making that change will help the slowpath to have fewer preemption points.
For an uncontended rwsem, this offers no real benefit. Adding
preempt_disable() is more complicated than I originally thought.

Maybe we are too paranoid about the possibility of a large number of
preemptions happening just at the right moment. If p is the probability of
a preemption in the middle of the inc-check-dec sequence, which I have
already moved as close to each other as possible, we are talking about a
probability of p^32768. Since p will be really small, the compound
probability will be infinitesimally small.
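
For reference, a hypothetical sketch of the inc-check-dec window being
discussed (the names follow this patchset's count layout; where exactly
the slowpath backs out the bias is simplified here):

static inline void down_read_sketch(struct rw_semaphore *sem)
{
	long cnt = atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
						 &sem->count);

	if (unlikely(cnt & RWSEM_READ_FAILED_MASK)) {
		/*
		 * A preemption right here leaves the transient
		 * READER_BIAS in the count; roughly 32k tasks would
		 * have to pile up at this exact spot to push the
		 * count negative.
		 */
		atomic_long_add(-RWSEM_READER_BIAS, &sem->count);
		/* ... then fall back to rwsem_down_read_slowpath() ... */
	}
}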

So I would like to not do preemption now for the current patchset. We
can restart the discussion later on if there is a real concern that it
may actually happen. Please let me know if you still want to add
preempt_disable() for the read lock.

Cheers,
Longman



^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 14/16] locking/rwsem: Guard against making count negative
  2019-04-24 16:49                               ` Waiman Long
@ 2019-04-24 17:01                                 ` Peter Zijlstra
  2019-04-24 17:10                                   ` Waiman Long
  2019-04-24 17:56                                   ` Linus Torvalds
  0 siblings, 2 replies; 112+ messages in thread
From: Peter Zijlstra @ 2019-04-24 17:01 UTC (permalink / raw)
  To: Waiman Long
  Cc: Linus Torvalds, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Linux List Kernel Mailing, the arch/x86 maintainers,
	Davidlohr Bueso, Tim Chen, huang ying

On Wed, Apr 24, 2019 at 12:49:05PM -0400, Waiman Long wrote:
> On 4/24/19 3:09 AM, Peter Zijlstra wrote:
> > On Tue, Apr 23, 2019 at 03:12:16PM -0400, Waiman Long wrote:
> >> That is true in general, but doing preempt_disable/enable across
> >> function boundary is ugly and prone to further problems down the road.
> > We do worse things in this code, and the thing Linus proposes is
> > actually quite simple, something like so:
> >
> > ---
> > --- a/kernel/locking/rwsem.c
> > +++ b/kernel/locking/rwsem.c
> > @@ -912,7 +904,7 @@ rwsem_down_read_slowpath(struct rw_semap
> >  			raw_spin_unlock_irq(&sem->wait_lock);
> >  			break;
> >  		}
> > -		schedule();
> > +		schedule_preempt_disabled();
> >  		lockevent_inc(rwsem_sleep_reader);
> >  	}
> >  
> > @@ -1121,6 +1113,7 @@ static struct rw_semaphore *rwsem_downgr
> >   */
> >  inline void __down_read(struct rw_semaphore *sem)
> >  {
> > +	preempt_disable();
> >  	if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
> >  			&sem->count) & RWSEM_READ_FAILED_MASK)) {
> >  		rwsem_down_read_slowpath(sem, TASK_UNINTERRUPTIBLE);
> > @@ -1129,10 +1122,12 @@ inline void __down_read(struct rw_semaph
> >  	} else {
> >  		rwsem_set_reader_owned(sem);
> >  	}
> > +	preempt_enable();
> >  }
> >  
> >  static inline int __down_read_killable(struct rw_semaphore *sem)
> >  {
> > +	preempt_disable();
> >  	if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
> >  			&sem->count) & RWSEM_READ_FAILED_MASK)) {
> >  		if (IS_ERR(rwsem_down_read_slowpath(sem, TASK_KILLABLE)))
> > @@ -1142,6 +1137,7 @@ static inline int __down_read_killable(s
> >  	} else {
> >  		rwsem_set_reader_owned(sem);
> >  	}
> > +	preempt_enable();
> >  	return 0;
> >  }
> >  
> 
> Making that change will help the slowpath to have fewer preemption points.

That doesn't matter, right? Either it blocks or it goes through quickly.

If you're worried about a particular spot we can easily put in explicit
preemption points.

> For an uncontended rwsem, this offers no real benefit. Adding
> preempt_disable() is more complicated than I originally thought.

I'm not sure I get your objection?

> Maybe we are too paranoid about the possibility of a large number of
> preemptions happening just at the right moment. If p is the probability of
> a preemption in the middle of the inc-check-dec sequence, which I have
> already moved as close to each other as possible, we are talking about a
> probability of p^32768. Since p will be really small, the compound
> probability will be infinitesimally small.

Sure; but we run on many millions of machines every second, so the
actual accumulated chance of it happening eventually is still fairly
significant.

> So I would like to not do preemption now for the current patchset. We
> can restart the discussion later on if there is a real concern that it
> may actually happen. Please let me know if you still want to add
> preempt_disable() for the read lock.

I like provably correct schemes over prayers.

As you noted, distros don't usually ship with PREEMPT=y and therefore
will not be bothered much by any of this.

The old scheme basically worked by the fact that the total supported
reader count was higher than the number of addressable pages in the
system and therefore the overflow could not happen.

We now transition to number of CPUs, and for that we pay a little price
with PREEMPT=y kernels. Either that or cmpxchg.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 14/16] locking/rwsem: Guard against making count negative
  2019-04-24 17:01                                 ` Peter Zijlstra
@ 2019-04-24 17:10                                   ` Waiman Long
  2019-04-24 17:56                                   ` Linus Torvalds
  1 sibling, 0 replies; 112+ messages in thread
From: Waiman Long @ 2019-04-24 17:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Linux List Kernel Mailing, the arch/x86 maintainers,
	Davidlohr Bueso, Tim Chen, huang ying

On 4/24/19 1:01 PM, Peter Zijlstra wrote:
> On Wed, Apr 24, 2019 at 12:49:05PM -0400, Waiman Long wrote:
>> On 4/24/19 3:09 AM, Peter Zijlstra wrote:
>>> On Tue, Apr 23, 2019 at 03:12:16PM -0400, Waiman Long wrote:
>>>> That is true in general, but doing preempt_disable/enable across
>>>> function boundary is ugly and prone to further problems down the road.
>>> We do worse things in this code, and the thing Linus proposes is
>>> actually quite simple, something like so:
>>>
>>> ---
>>> --- a/kernel/locking/rwsem.c
>>> +++ b/kernel/locking/rwsem.c
>>> @@ -912,7 +904,7 @@ rwsem_down_read_slowpath(struct rw_semap
>>>  			raw_spin_unlock_irq(&sem->wait_lock);
>>>  			break;
>>>  		}
>>> -		schedule();
>>> +		schedule_preempt_disabled();
>>>  		lockevent_inc(rwsem_sleep_reader);
>>>  	}
>>>  
>>> @@ -1121,6 +1113,7 @@ static struct rw_semaphore *rwsem_downgr
>>>   */
>>>  inline void __down_read(struct rw_semaphore *sem)
>>>  {
>>> +	preempt_disable();
>>>  	if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
>>>  			&sem->count) & RWSEM_READ_FAILED_MASK)) {
>>>  		rwsem_down_read_slowpath(sem, TASK_UNINTERRUPTIBLE);
>>> @@ -1129,10 +1122,12 @@ inline void __down_read(struct rw_semaph
>>>  	} else {
>>>  		rwsem_set_reader_owned(sem);
>>>  	}
>>> +	preempt_enable();
>>>  }
>>>  
>>>  static inline int __down_read_killable(struct rw_semaphore *sem)
>>>  {
>>> +	preempt_disable();
>>>  	if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
>>>  			&sem->count) & RWSEM_READ_FAILED_MASK)) {
>>>  		if (IS_ERR(rwsem_down_read_slowpath(sem, TASK_KILLABLE)))
>>> @@ -1142,6 +1137,7 @@ static inline int __down_read_killable(s
>>>  	} else {
>>>  		rwsem_set_reader_owned(sem);
>>>  	}
>>> +	preempt_enable();
>>>  	return 0;
>>>  }
>>>  
>> Making that change will help the slowpath to have fewer preemption points.
> That doesn't matter, right? Either it blocks or it goes through quickly.
>
> If you're worried about a particular spot we can easily put in explicit
> preemption points.
>
>> For an uncontended rwsem, this offers no real benefit. Adding
>> preempt_disable() is more complicated than I originally thought.
> I'm not sure I get your objection?
>
>> Maybe we are too paranoid about the possibility of a large number of
>> preemptions happening just at the right moment. If p is the probability of
>> a preemption in the middle of the inc-check-dec sequence, which I have
>> already moved as close to each other as possible, we are talking about a
>> probability of p^32768. Since p will be really small, the compound
>> probability will be infinitesimally small.
> Sure; but we run on many millions of machines every second, so the
> actual accumulated chance of it happening eventually is still fairly
> significant.
>
>> So I would like to not do preemption now for the current patchset. We
>> can restart the discussion later on if there is a real concern that it
>> may actually happen. Please let me know if you still want to add
>> preempt_disable() for the read lock.
> I like provably correct schemes over prayers.


I am fine with adding preempt_disable(). I just want confirmation that
you want to have that.


>
> As you noted, distros don't usually ship with PREEMPT=y and therefore
> will not be bothered much by any of this.
>
> The old scheme basically worked by the fact that the total supported
> reader count was higher than the number of addressable pages in the
> system and therefore the overflow could not happen.
>
> We now transition to number of CPUs, and for that we pay a little price
> with PREEMPT=y kernels. Either that or cmpxchg.

I also thought about switching to a cmpxchg loop for PREEMPT=y kernels.
Let's start with just preempt_disable() for now. We can evaluate the
cmpxchg-loop alternative later on.
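
A minimal sketch of that cmpxchg-loop alternative (names follow this
patchset's count layout; rwsem_read_trylock_cmpxchg() is a made-up
helper, not an actual patch). The reader bias is only committed when
the count still allows readers, so no transient over-count can
accumulate and no preempt_disable() window is needed:

static inline bool rwsem_read_trylock_cmpxchg(struct rw_semaphore *sem)
{
	long cnt = atomic_long_read(&sem->count);

	do {
		if (cnt & RWSEM_READ_FAILED_MASK)
			return false;	/* take the slowpath instead */
	} while (!atomic_long_try_cmpxchg_acquire(&sem->count, &cnt,
						  cnt + RWSEM_READER_BIAS));
	return true;
}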

Cheers,
Longman


^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v4 14/16] locking/rwsem: Guard against making count negative
  2019-04-24 17:01                                 ` Peter Zijlstra
  2019-04-24 17:10                                   ` Waiman Long
@ 2019-04-24 17:56                                   ` Linus Torvalds
  1 sibling, 0 replies; 112+ messages in thread
From: Linus Torvalds @ 2019-04-24 17:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Waiman Long, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Linux List Kernel Mailing, the arch/x86 maintainers,
	Davidlohr Bueso, Tim Chen, huang ying

On Wed, Apr 24, 2019 at 10:02 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> > For an uncontended rwsem, this offers no real benefit. Adding
> > preempt_disable() is more complicated than I originally thought.
>
> I'm not sure I get your objection?

I'm not sure it's an objection, but I do think that it's sad if we
have to do the preempt_enable/disable around the fastpath.

Is the *only* reason for the preempt-disable to avoid the (very
unlikely) case of unbounded preemption in between the "increment
reader counts" and "decrement it again because we noticed it turned
negative"?

If that's the only reason, then I think we should just accept the
race. You still have a "slop" of 15 bits (so 16k processes) hitting
the same mutex, and they'd all have to be preempted in that same
"small handful instructions" window.

Even if the likelihood of *one* process hitting that race is 90% (and
no, it really isn't), then the likelihood of having 16k processes
hitting that race is 0.9**16384.

We call numbers like that "we'll hit it some time long after the heat
death of the universe" numbers.
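
For scale, the arithmetic behind that claim, using the (wildly
pessimistic) 90% per-task figure quoted above:

  0.9^{16384} = 10^{16384 \cdot \log_{10} 0.9} \approx 10^{-750}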

     Linus

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH] bpf: Fix preempt_enable_no_resched() abuse
  2019-04-23 19:55                                 ` [PATCH] bpf: Fix preempt_enable_no_resched() abuse Peter Zijlstra
  2019-04-23 20:03                                   ` [PATCH] trace: " Peter Zijlstra
  2019-04-23 20:27                                   ` [PATCH] bpf: " Linus Torvalds
@ 2019-04-25 21:23                                   ` Alexei Starovoitov
  2019-04-26  7:14                                     ` Peter Zijlstra
  2 siblings, 1 reply; 112+ messages in thread
From: Alexei Starovoitov @ 2019-04-25 21:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Waiman Long, Linus Torvalds, Ingo Molnar, Will Deacon,
	Thomas Gleixner, Linux List Kernel Mailing,
	the arch/x86 maintainers, Davidlohr Bueso, Tim Chen, huang ying,
	Roman Gushchin, Alexei Starovoitov, Daniel Borkmann

On Tue, Apr 23, 2019 at 09:55:59PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 23, 2019 at 03:41:32PM -0400, Waiman Long wrote:
> > On 4/23/19 3:34 PM, Peter Zijlstra wrote:
> > > On Tue, Apr 23, 2019 at 03:12:16PM -0400, Waiman Long wrote:
> 
> > >> You are right on that. However, there is a variant called
> > >> preempt_enable_no_resched() that doesn't have this side effect. So I am
> > >> going to use that one instead.
> > > Only if the very next line is schedule(). Otherwise you're very much not
> > > going to use that function.
> > 
> > May I know the reason why?
> 
> Because it can 'consume' a need_resched and introduces arbitrary delays
> before the schedule() eventually happens, breaking the very notion of
> PREEMPT=y (and the fundamentals RT relies on).
> 
> > I saw a number of instances of
> > preempt_enable_no_resched() without a schedule() right next to them.
> 
> Look more closely.. and let me know, if true, those are bugs that need
> fixing.
> 
> Argghhh.. BPF...
> 
> Also, with the recent RCU rework, we can probably drop that
> rcu_read_lock()/rcu_read_unlock() from there if we're disabling
> preemption anyway.
> 
> ---
> Subject: bpf: Fix preempt_enable_no_resched() abuse
> 
> Unless the very next line is schedule(), or implies it, one must not use
> preempt_enable_no_resched(). It can cause a preemption to go missing and
> thereby cause arbitrary delays, breaking the PREEMPT=y invariant.
> 
> Cc: Roman Gushchin <guro@fb.com>
> Cc: Alexei Starovoitov <ast@kernel.org>
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index f02367faa58d..944ccc310201 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -510,7 +510,7 @@ int bpf_prog_array_copy(struct bpf_prog_array __rcu *old_array,
>  		}					\
>  _out:							\
>  		rcu_read_unlock();			\
> -		preempt_enable_no_resched();		\
> +		preempt_enable();			\
>  		_ret;					\

Applied to bpf tree. Thanks!
It should have been fixed long ago. Not sure how we kept forgetting about it.


^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH] bpf: Fix preempt_enable_no_resched() abuse
  2019-04-25 21:23                                   ` Alexei Starovoitov
@ 2019-04-26  7:14                                     ` Peter Zijlstra
  0 siblings, 0 replies; 112+ messages in thread
From: Peter Zijlstra @ 2019-04-26  7:14 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Waiman Long, Linus Torvalds, Ingo Molnar, Will Deacon,
	Thomas Gleixner, Linux List Kernel Mailing,
	the arch/x86 maintainers, Davidlohr Bueso, Tim Chen, huang ying,
	Roman Gushchin, Alexei Starovoitov, Daniel Borkmann

On Thu, Apr 25, 2019 at 02:23:40PM -0700, Alexei Starovoitov wrote:
> Applied to bpf tree. Thanks!
> It should have been fixed long ago. Not sure how we kept forgetting about it.

Thanks!

^ permalink raw reply	[flat|nested] 112+ messages in thread

* [tip:sched/core] trace: Fix preempt_enable_no_resched() abuse
  2019-04-23 20:03                                   ` [PATCH] trace: " Peter Zijlstra
  2019-04-23 23:58                                     ` Steven Rostedt
@ 2019-04-29  6:39                                     ` tip-bot for Peter Zijlstra
  2019-04-29 13:31                                       ` Steven Rostedt
  1 sibling, 1 reply; 112+ messages in thread
From: tip-bot for Peter Zijlstra @ 2019-04-29  6:39 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: ast, rostedt, tim.c.chen, dave, hpa, peterz, guro, linux-kernel,
	mingo, tglx, daniel, huang.ying.caritas, will.deacon, longman,
	torvalds

Commit-ID:  e8bd5814989b994cf1b0cb179e1c777e40c0f02c
Gitweb:     https://git.kernel.org/tip/e8bd5814989b994cf1b0cb179e1c777e40c0f02c
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Tue, 23 Apr 2019 22:03:18 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 29 Apr 2019 08:27:09 +0200

trace: Fix preempt_enable_no_resched() abuse

Unless there is a call into schedule() in the immediate
(deterministic) future, one must not use preempt_enable_no_resched().
It can cause a preemption to go missing and thereby cause arbitrary
delays, breaking the PREEMPT=y invariant.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Fixes: 2c2d7329d8af ("tracing/ftrace: use preempt_enable_no_resched_notrace in ring_buffer_time_stamp()")
Link: https://lkml.kernel.org/r/20190423200318.GY14281@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/trace/ring_buffer.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 41b6f96e5366..4ee8d8aa3d0f 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -762,7 +762,7 @@ u64 ring_buffer_time_stamp(struct ring_buffer *buffer, int cpu)
 
 	preempt_disable_notrace();
 	time = rb_time_stamp(buffer);
-	preempt_enable_no_resched_notrace();
+	preempt_enable_notrace();
 
 	return time;
 }

^ permalink raw reply related	[flat|nested] 112+ messages in thread

* Re: [tip:sched/core] trace: Fix preempt_enable_no_resched() abuse
  2019-04-29  6:39                                     ` [tip:sched/core] " tip-bot for Peter Zijlstra
@ 2019-04-29 13:31                                       ` Steven Rostedt
  2019-04-29 14:08                                         ` Ingo Molnar
  0 siblings, 1 reply; 112+ messages in thread
From: Steven Rostedt @ 2019-04-29 13:31 UTC (permalink / raw)
  To: tip-bot for Peter Zijlstra
  Cc: linux-kernel, mingo, guro, peterz, hpa, dave, tim.c.chen,
	rostedt, ast, torvalds, longman, will.deacon, huang.ying.caritas,
	daniel, tglx, linux-tip-commits

On Sun, 28 Apr 2019 23:39:03 -0700
tip-bot for Peter Zijlstra <tipbot@zytor.com> wrote:

> Commit-ID:  e8bd5814989b994cf1b0cb179e1c777e40c0f02c
> Gitweb:     https://git.kernel.org/tip/e8bd5814989b994cf1b0cb179e1c777e40c0f02c
> Author:     Peter Zijlstra <peterz@infradead.org>
> AuthorDate: Tue, 23 Apr 2019 22:03:18 +0200
> Committer:  Ingo Molnar <mingo@kernel.org>
> CommitDate: Mon, 29 Apr 2019 08:27:09 +0200
> 
> trace: Fix preempt_enable_no_resched() abuse

Hi Ingo,

I already sent this fix to Linus, and it's been pulled in to his tree.

Commit: d6097c9e4454adf1f8f2c9547c2fa6060d55d952

-- Steve

> 
> Unless there is a call into schedule() in the immediate
> (deterministic) future, one must not use preempt_enable_no_resched().
> It can cause a preemption to go missing and thereby cause arbitrary
> delays, breaking the PREEMPT=y invariant.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Cc: Alexei Starovoitov <ast@kernel.org>
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> Cc: Davidlohr Bueso <dave@stgolabs.net>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Roman Gushchin <guro@fb.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Tim Chen <tim.c.chen@linux.intel.com>
> Cc: Waiman Long <longman@redhat.com>
> Cc: Will Deacon <will.deacon@arm.com>
> Cc: huang ying <huang.ying.caritas@gmail.com>
> Fixes: 2c2d7329d8af ("tracing/ftrace: use preempt_enable_no_resched_notrace in ring_buffer_time_stamp()")
> Link: https://lkml.kernel.org/r/20190423200318.GY14281@hirez.programming.kicks-ass.net
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> ---
>  kernel/trace/ring_buffer.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index 41b6f96e5366..4ee8d8aa3d0f 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -762,7 +762,7 @@ u64 ring_buffer_time_stamp(struct ring_buffer *buffer, int cpu)
>  
>  	preempt_disable_notrace();
>  	time = rb_time_stamp(buffer);
> -	preempt_enable_no_resched_notrace();
> +	preempt_enable_notrace();
>  
>  	return time;
>  }


^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [tip:sched/core] trace: Fix preempt_enable_no_resched() abuse
  2019-04-29 13:31                                       ` Steven Rostedt
@ 2019-04-29 14:08                                         ` Ingo Molnar
  0 siblings, 0 replies; 112+ messages in thread
From: Ingo Molnar @ 2019-04-29 14:08 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: tip-bot for Peter Zijlstra, linux-kernel, guro, peterz, hpa,
	dave, tim.c.chen, ast, torvalds, longman, will.deacon,
	huang.ying.caritas, daniel, tglx, linux-tip-commits


* Steven Rostedt <rostedt@goodmis.org> wrote:

> On Sun, 28 Apr 2019 23:39:03 -0700
> tip-bot for Peter Zijlstra <tipbot@zytor.com> wrote:
> 
> > Commit-ID:  e8bd5814989b994cf1b0cb179e1c777e40c0f02c
> > Gitweb:     https://git.kernel.org/tip/e8bd5814989b994cf1b0cb179e1c777e40c0f02c
> > Author:     Peter Zijlstra <peterz@infradead.org>
> > AuthorDate: Tue, 23 Apr 2019 22:03:18 +0200
> > Committer:  Ingo Molnar <mingo@kernel.org>
> > CommitDate: Mon, 29 Apr 2019 08:27:09 +0200
> > 
> > trace: Fix preempt_enable_no_resched() abuse
> 
> Hi Ingo,
> 
> I already sent this fix to Linus, and it's been pulled in to his tree.
> 
> Commit: d6097c9e4454adf1f8f2c9547c2fa6060d55d952

Thanks, missed that - I've zapped it from tip:sched/core.

	Ingo

^ permalink raw reply	[flat|nested] 112+ messages in thread

end of thread, other threads:[~2019-04-29 14:08 UTC | newest]

Thread overview: 112+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-04-13 17:22 [PATCH v4 00/16] locking/rwsem: Rwsem rearchitecture part 2 Waiman Long
2019-04-13 17:22 ` [PATCH v4 01/16] locking/rwsem: Prevent unneeded warning during locking selftest Waiman Long
2019-04-18  8:04   ` [tip:locking/core] " tip-bot for Waiman Long
2019-04-13 17:22 ` [PATCH v4 02/16] locking/rwsem: Make owner available even if !CONFIG_RWSEM_SPIN_ON_OWNER Waiman Long
2019-04-13 17:22 ` [PATCH v4 03/16] locking/rwsem: Remove rwsem_wake() wakeup optimization Waiman Long
2019-04-13 17:22 ` [PATCH v4 04/16] locking/rwsem: Implement a new locking scheme Waiman Long
2019-04-16 13:22   ` Peter Zijlstra
2019-04-16 13:32     ` Waiman Long
2019-04-16 14:18       ` Peter Zijlstra
2019-04-16 14:42         ` Peter Zijlstra
2019-04-13 17:22 ` [PATCH v4 05/16] locking/rwsem: Merge rwsem.h and rwsem-xadd.c into rwsem.c Waiman Long
2019-04-13 17:22 ` [PATCH v4 06/16] locking/rwsem: Code cleanup after files merging Waiman Long
2019-04-16 16:01   ` Peter Zijlstra
2019-04-16 16:17     ` Peter Zijlstra
2019-04-16 19:45       ` Waiman Long
2019-04-13 17:22 ` [PATCH v4 07/16] locking/rwsem: Implement lock handoff to prevent lock starvation Waiman Long
2019-04-16 14:12   ` Peter Zijlstra
2019-04-16 20:26     ` Waiman Long
2019-04-16 21:07       ` Waiman Long
2019-04-17  7:13         ` Peter Zijlstra
2019-04-17 16:22           ` Waiman Long
2019-04-16 15:49   ` Peter Zijlstra
2019-04-16 16:15     ` Peter Zijlstra
2019-04-16 18:41       ` Waiman Long
2019-04-16 18:16     ` Waiman Long
2019-04-16 18:32       ` Peter Zijlstra
2019-04-17  7:35       ` Peter Zijlstra
2019-04-17 16:35         ` Waiman Long
2019-04-17  8:05       ` Peter Zijlstra
2019-04-17 16:39         ` Waiman Long
2019-04-18  8:22           ` Peter Zijlstra
2019-04-17  8:17   ` Peter Zijlstra
2019-04-13 17:22 ` [PATCH v4 08/16] locking/rwsem: Make rwsem_spin_on_owner() return owner state Waiman Long
2019-04-17  9:00   ` Peter Zijlstra
2019-04-17 16:42     ` Waiman Long
2019-04-17 10:19   ` Peter Zijlstra
2019-04-17 16:53     ` Waiman Long
2019-04-17 12:41   ` Peter Zijlstra
2019-04-17 12:47     ` Peter Zijlstra
2019-04-17 18:29       ` Waiman Long
2019-04-18  8:39         ` Peter Zijlstra
2019-04-17 13:00     ` Peter Zijlstra
2019-04-17 18:50       ` Waiman Long
2019-04-13 17:22 ` [PATCH v4 09/16] locking/rwsem: Ensure an RT task will not spin on reader Waiman Long
2019-04-17 13:18   ` Peter Zijlstra
2019-04-17 18:47     ` Waiman Long
2019-04-18  8:52       ` Peter Zijlstra
2019-04-18 13:27         ` Waiman Long
2019-04-13 17:22 ` [PATCH v4 10/16] locking/rwsem: Wake up almost all readers in wait queue Waiman Long
2019-04-16 16:50   ` Davidlohr Bueso
2019-04-16 17:37     ` Waiman Long
2019-04-17 13:39   ` Peter Zijlstra
2019-04-17 17:16     ` Waiman Long
2019-04-13 17:22 ` [PATCH v4 11/16] locking/rwsem: Enable readers spinning on writer Waiman Long
2019-04-17 13:56   ` Peter Zijlstra
2019-04-17 17:34     ` Waiman Long
2019-04-18  8:57       ` Peter Zijlstra
2019-04-18 14:35         ` Waiman Long
2019-04-17 13:58   ` Peter Zijlstra
2019-04-17 17:45     ` Waiman Long
2019-04-18  9:00       ` Peter Zijlstra
2019-04-18 13:40         ` Waiman Long
2019-04-17 14:05   ` Peter Zijlstra
2019-04-17 17:51     ` Waiman Long
2019-04-18  9:11       ` Peter Zijlstra
2019-04-18 14:37         ` Waiman Long
2019-04-13 17:22 ` [PATCH v4 12/16] locking/rwsem: Enable time-based spinning on reader-owned rwsem Waiman Long
2019-04-18 13:06   ` Peter Zijlstra
2019-04-18 15:15     ` Waiman Long
2019-04-19  7:56       ` Peter Zijlstra
2019-04-19 14:33         ` Waiman Long
2019-04-19 15:36           ` Waiman Long
2019-04-13 17:22 ` [PATCH v4 13/16] locking/rwsem: Add more rwsem owner access helpers Waiman Long
2019-04-13 17:22 ` [PATCH v4 14/16] locking/rwsem: Guard against making count negative Waiman Long
2019-04-18 13:51   ` Peter Zijlstra
2019-04-18 14:08     ` Waiman Long
2019-04-18 14:30       ` Peter Zijlstra
2019-04-18 14:40       ` Peter Zijlstra
2019-04-18 14:54         ` Waiman Long
2019-04-19 10:26           ` Peter Zijlstra
2019-04-19 12:02             ` Peter Zijlstra
2019-04-19 13:03               ` Peter Zijlstra
2019-04-19 13:15                 ` Peter Zijlstra
2019-04-19 19:39                   ` Waiman Long
2019-04-21 21:07                     ` Waiman Long
2019-04-23 14:17                       ` Peter Zijlstra
2019-04-23 14:31                         ` Waiman Long
2019-04-23 16:27                         ` Linus Torvalds
2019-04-23 19:12                           ` Waiman Long
2019-04-23 19:34                             ` Peter Zijlstra
2019-04-23 19:41                               ` Waiman Long
2019-04-23 19:55                                 ` [PATCH] bpf: Fix preempt_enable_no_resched() abuse Peter Zijlstra
2019-04-23 20:03                                   ` [PATCH] trace: " Peter Zijlstra
2019-04-23 23:58                                     ` Steven Rostedt
2019-04-29  6:39                                     ` [tip:sched/core] " tip-bot for Peter Zijlstra
2019-04-29 13:31                                       ` Steven Rostedt
2019-04-29 14:08                                         ` Ingo Molnar
2019-04-23 20:27                                   ` [PATCH] bpf: " Linus Torvalds
2019-04-23 20:35                                     ` Peter Zijlstra
2019-04-23 20:45                                       ` Linus Torvalds
2019-04-24 13:19                                       ` Peter Zijlstra
2019-04-25 21:23                                   ` Alexei Starovoitov
2019-04-26  7:14                                     ` Peter Zijlstra
2019-04-24  7:09                             ` [PATCH v4 14/16] locking/rwsem: Guard against making count negative Peter Zijlstra
2019-04-24 16:49                               ` Waiman Long
2019-04-24 17:01                                 ` Peter Zijlstra
2019-04-24 17:10                                   ` Waiman Long
2019-04-24 17:56                                   ` Linus Torvalds
2019-04-13 17:22 ` [PATCH v4 15/16] locking/rwsem: Merge owner into count on x86-64 Waiman Long
2019-04-18 14:28   ` Peter Zijlstra
2019-04-18 14:40     ` Waiman Long
2019-04-13 17:22 ` [PATCH v4 16/16] locking/rwsem: Remove redundant computation of writer lock word Waiman Long
