* [PATCH v2 0/5] Switch arm64 over to qrwlock
@ 2017-10-06 13:34 ` Will Deacon
  0 siblings, 0 replies; 40+ messages in thread
From: Will Deacon @ 2017-10-06 13:34 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-arm-kernel, Jeremy.Linton, peterz, mingo, longman,
	boqun.feng, paulmck, Will Deacon

Hi all,

This is version two of the patches I posted yesterday:

  http://lists.infradead.org/pipermail/linux-arm-kernel/2017-October/534666.html

I'd normally leave it longer before posting again, but Peter had a good
suggestion to rework the layout of the lock word, so I wanted to post a
version that follows that approach.

I've updated my branch if you're after the full patch stack:

  git://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git qrwlock

As before, all comments (particularly related to testing and performance)
welcome!

Cheers,

Will

--->8

Will Deacon (5):
  kernel/locking: Use struct qrwlock instead of struct __qrwlock
  locking/atomic: Add atomic_cond_read_acquire
  kernel/locking: Use atomic_cond_read_acquire when spinning in qrwlock
  arm64: locking: Move rwlock implementation over to qrwlocks
  kernel/locking: Prevent slowpath writers getting held up by fastpath

 arch/arm64/Kconfig                      |  17 ++++
 arch/arm64/include/asm/Kbuild           |   1 +
 arch/arm64/include/asm/spinlock.h       | 164 +-------------------------------
 arch/arm64/include/asm/spinlock_types.h |   6 +-
 include/asm-generic/atomic-long.h       |   3 +
 include/asm-generic/qrwlock.h           |  20 +---
 include/asm-generic/qrwlock_types.h     |  15 ++-
 include/linux/atomic.h                  |   4 +
 kernel/locking/qrwlock.c                |  83 +++-------------
 9 files changed, 58 insertions(+), 255 deletions(-)

-- 
2.1.4

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v2 1/5] kernel/locking: Use struct qrwlock instead of struct __qrwlock
  2017-10-06 13:34 ` Will Deacon
@ 2017-10-06 13:34   ` Will Deacon
  -1 siblings, 0 replies; 40+ messages in thread
From: Will Deacon @ 2017-10-06 13:34 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-arm-kernel, Jeremy.Linton, peterz, mingo, longman,
	boqun.feng, paulmck, Will Deacon

There's no good reason to keep the internal structure of struct qrwlock
hidden from qrwlock.h, particularly as it's actually needed for unlock
and ends up being abstracted independently behind the __qrwlock_write_byte
function.

Stop pretending we can hide this stuff, and move the __qrwlock definition
into struct qrwlock itself, removing the __qrwlock_write_byte nastiness
and using the same struct definition everywhere instead.
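
As a rough illustration (not part of the patch; qrwlock_assert_layout()
is a made-up helper), the union gives &lock->wmode the same address that
__qrwlock_write_byte() used to compute by hand:

#include <linux/bug.h>
#include <linux/kconfig.h>
#include <linux/types.h>
#include <asm-generic/qrwlock_types.h>

/*
 * Hypothetical sanity check: the new union field and the old pointer
 * arithmetic must name the same byte of the lock word, i.e. byte 0 of
 * cnts on little-endian and byte 3 on big-endian.
 */
static inline void qrwlock_assert_layout(struct qrwlock *lock)
{
	u8 *old = (u8 *)lock + 3 * IS_BUILTIN(CONFIG_CPU_BIG_ENDIAN);

	WARN_ON_ONCE(old != &lock->wmode);
}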

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 include/asm-generic/qrwlock.h       | 12 +-----------
 include/asm-generic/qrwlock_types.h | 15 +++++++++++++--
 kernel/locking/qrwlock.c            | 26 ++------------------------
 3 files changed, 16 insertions(+), 37 deletions(-)

diff --git a/include/asm-generic/qrwlock.h b/include/asm-generic/qrwlock.h
index 50925327b0a8..02c0a768e6b0 100644
--- a/include/asm-generic/qrwlock.h
+++ b/include/asm-generic/qrwlock.h
@@ -129,22 +129,12 @@ static inline void queued_read_unlock(struct qrwlock *lock)
 }
 
 /**
- * __qrwlock_write_byte - retrieve the write byte address of a queue rwlock
- * @lock : Pointer to queue rwlock structure
- * Return: the write byte address of a queue rwlock
- */
-static inline u8 *__qrwlock_write_byte(struct qrwlock *lock)
-{
-	return (u8 *)lock + 3 * IS_BUILTIN(CONFIG_CPU_BIG_ENDIAN);
-}
-
-/**
  * queued_write_unlock - release write lock of a queue rwlock
  * @lock : Pointer to queue rwlock structure
  */
 static inline void queued_write_unlock(struct qrwlock *lock)
 {
-	smp_store_release(__qrwlock_write_byte(lock), 0);
+	smp_store_release(&lock->wmode, 0);
 }
 
 /*
diff --git a/include/asm-generic/qrwlock_types.h b/include/asm-generic/qrwlock_types.h
index 0abc6b6062fb..507f2dc51bba 100644
--- a/include/asm-generic/qrwlock_types.h
+++ b/include/asm-generic/qrwlock_types.h
@@ -9,12 +9,23 @@
  */
 
 typedef struct qrwlock {
-	atomic_t		cnts;
+	union {
+		atomic_t cnts;
+		struct {
+#ifdef __LITTLE_ENDIAN
+			u8 wmode;	/* Writer mode   */
+			u8 rcnts[3];	/* Reader counts */
+#else
+			u8 rcnts[3];	/* Reader counts */
+			u8 wmode;	/* Writer mode   */
+#endif
+		};
+	};
 	arch_spinlock_t		wait_lock;
 } arch_rwlock_t;
 
 #define	__ARCH_RW_LOCK_UNLOCKED {		\
-	.cnts = ATOMIC_INIT(0),			\
+	{ .cnts = ATOMIC_INIT(0), },		\
 	.wait_lock = __ARCH_SPIN_LOCK_UNLOCKED,	\
 }
 
diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
index 2655f26ec882..1af791e37348 100644
--- a/kernel/locking/qrwlock.c
+++ b/kernel/locking/qrwlock.c
@@ -23,26 +23,6 @@
 #include <linux/spinlock.h>
 #include <asm/qrwlock.h>
 
-/*
- * This internal data structure is used for optimizing access to some of
- * the subfields within the atomic_t cnts.
- */
-struct __qrwlock {
-	union {
-		atomic_t cnts;
-		struct {
-#ifdef __LITTLE_ENDIAN
-			u8 wmode;	/* Writer mode   */
-			u8 rcnts[3];	/* Reader counts */
-#else
-			u8 rcnts[3];	/* Reader counts */
-			u8 wmode;	/* Writer mode   */
-#endif
-		};
-	};
-	arch_spinlock_t	lock;
-};
-
 /**
  * rspin_until_writer_unlock - inc reader count & spin until writer is gone
  * @lock  : Pointer to queue rwlock structure
@@ -125,10 +105,8 @@ void queued_write_lock_slowpath(struct qrwlock *lock)
 	 * or wait for a previous writer to go away.
 	 */
 	for (;;) {
-		struct __qrwlock *l = (struct __qrwlock *)lock;
-
-		if (!READ_ONCE(l->wmode) &&
-		   (cmpxchg_relaxed(&l->wmode, 0, _QW_WAITING) == 0))
+		if (!READ_ONCE(lock->wmode) &&
+		   (cmpxchg_relaxed(&lock->wmode, 0, _QW_WAITING) == 0))
 			break;
 
 		cpu_relax();
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v2 2/5] locking/atomic: Add atomic_cond_read_acquire
  2017-10-06 13:34 ` Will Deacon
@ 2017-10-06 13:34   ` Will Deacon
  -1 siblings, 0 replies; 40+ messages in thread
From: Will Deacon @ 2017-10-06 13:34 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-arm-kernel, Jeremy.Linton, peterz, mingo, longman,
	boqun.feng, paulmck, Will Deacon

smp_cond_load_acquire provides a way to spin on a variable with acquire
semantics until some conditional expression involving the variable is
satisfied. Architectures such as arm64 can potentially enter a low-power
state, waking up only when the value of the variable changes, which
reduces the system impact of tight polling loops.

This patch makes the same interface available to users of atomic_t,
atomic64_t and atomic_long_t, rather than requiring messy accesses to the
structure internals.
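
As a rough usage sketch (illustrative only: wait_for_clear() is a made-up
helper, and VAL is the magic variable that smp_cond_load_acquire binds to
the most recently loaded value):

#include <linux/atomic.h>

/*
 * Hypothetical helper: spin until none of the bits in @mask are set in
 * @v, with ACQUIRE semantics on the load that satisfies the condition,
 * and return that value. On arm64 the loop body can wait in WFE rather
 * than burning cycles.
 */
static inline int wait_for_clear(atomic_t *v, int mask)
{
	return atomic_cond_read_acquire(v, !(VAL & mask));
}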

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 include/asm-generic/atomic-long.h | 3 +++
 include/linux/atomic.h            | 4 ++++
 2 files changed, 7 insertions(+)

diff --git a/include/asm-generic/atomic-long.h b/include/asm-generic/atomic-long.h
index 288cc9e96395..f2d97b782031 100644
--- a/include/asm-generic/atomic-long.h
+++ b/include/asm-generic/atomic-long.h
@@ -243,4 +243,7 @@ static inline long atomic_long_add_unless(atomic_long_t *l, long a, long u)
 #define atomic_long_inc_not_zero(l) \
 	ATOMIC_LONG_PFX(_inc_not_zero)((ATOMIC_LONG_PFX(_t) *)(l))
 
+#define atomic_long_cond_read_acquire(v, c) \
+	ATOMIC_LONG_PFX(_cond_read_acquire)((ATOMIC_LONG_PFX(_t) *)(v), (c))
+
 #endif  /*  _ASM_GENERIC_ATOMIC_LONG_H  */
diff --git a/include/linux/atomic.h b/include/linux/atomic.h
index 40d6bfec0e0d..0aeb2b3f4578 100644
--- a/include/linux/atomic.h
+++ b/include/linux/atomic.h
@@ -653,6 +653,8 @@ static inline int atomic_dec_if_positive(atomic_t *v)
 }
 #endif
 
+#define atomic_cond_read_acquire(v, c)	smp_cond_load_acquire(&(v)->counter, (c))
+
 #ifdef CONFIG_GENERIC_ATOMIC64
 #include <asm-generic/atomic64.h>
 #endif
@@ -1072,6 +1074,8 @@ static inline long long atomic64_fetch_andnot_release(long long i, atomic64_t *v
 }
 #endif
 
+#define atomic64_cond_read_acquire(v, c)	smp_cond_load_acquire(&(v)->counter, (c))
+
 #include <asm-generic/atomic-long.h>
 
 #endif /* _LINUX_ATOMIC_H */
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v2 3/5] kernel/locking: Use atomic_cond_read_acquire when spinning in qrwlock
  2017-10-06 13:34 ` Will Deacon
@ 2017-10-06 13:34   ` Will Deacon
  -1 siblings, 0 replies; 40+ messages in thread
From: Will Deacon @ 2017-10-06 13:34 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-arm-kernel, Jeremy.Linton, peterz, mingo, longman,
	boqun.feng, paulmck, Will Deacon

The qrwlock slowpaths involve spinning when either a prospective reader
is waiting for a concurrent writer to drain, or a prospective writer is
waiting for concurrent readers to drain. In both of these situations,
atomic_cond_read_acquire can be used to avoid busy-waiting and make use
of any backoff functionality provided by the architecture.

This patch replaces the open-coded loops and the rspin_until_writer_unlock
implementation with atomic_cond_read_acquire. The write-mode transition
from zero to _QW_WAITING is left alone, since (a) it doesn't need acquire
semantics and (b) it should be fast.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 kernel/locking/qrwlock.c | 47 +++++++++++------------------------------------
 1 file changed, 11 insertions(+), 36 deletions(-)

diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
index 1af791e37348..b7ea4647c74d 100644
--- a/kernel/locking/qrwlock.c
+++ b/kernel/locking/qrwlock.c
@@ -24,23 +24,6 @@
 #include <asm/qrwlock.h>
 
 /**
- * rspin_until_writer_unlock - inc reader count & spin until writer is gone
- * @lock  : Pointer to queue rwlock structure
- * @writer: Current queue rwlock writer status byte
- *
- * In interrupt context or at the head of the queue, the reader will just
- * increment the reader count & wait until the writer releases the lock.
- */
-static __always_inline void
-rspin_until_writer_unlock(struct qrwlock *lock, u32 cnts)
-{
-	while ((cnts & _QW_WMASK) == _QW_LOCKED) {
-		cpu_relax();
-		cnts = atomic_read_acquire(&lock->cnts);
-	}
-}
-
-/**
  * queued_read_lock_slowpath - acquire read lock of a queue rwlock
  * @lock: Pointer to queue rwlock structure
  * @cnts: Current qrwlock lock value
@@ -53,13 +36,12 @@ void queued_read_lock_slowpath(struct qrwlock *lock, u32 cnts)
 	if (unlikely(in_interrupt())) {
 		/*
 		 * Readers in interrupt context will get the lock immediately
-		 * if the writer is just waiting (not holding the lock yet).
-		 * The rspin_until_writer_unlock() function returns immediately
-		 * in this case. Otherwise, they will spin (with ACQUIRE
-		 * semantics) until the lock is available without waiting in
-		 * the queue.
+		 * if the writer is just waiting (not holding the lock yet),
+		 * so spin with ACQUIRE semantics until the lock is available
+		 * without waiting in the queue.
 		 */
-		rspin_until_writer_unlock(lock, cnts);
+		atomic_cond_read_acquire(&lock->cnts, (VAL & _QW_WMASK)
+					 != _QW_LOCKED);
 		return;
 	}
 	atomic_sub(_QR_BIAS, &lock->cnts);
@@ -68,14 +50,14 @@ void queued_read_lock_slowpath(struct qrwlock *lock, u32 cnts)
 	 * Put the reader into the wait queue
 	 */
 	arch_spin_lock(&lock->wait_lock);
+	atomic_add(_QR_BIAS, &lock->cnts);
 
 	/*
 	 * The ACQUIRE semantics of the following spinning code ensure
 	 * that accesses can't leak upwards out of our subsequent critical
 	 * section in the case that the lock is currently held for write.
 	 */
-	cnts = atomic_fetch_add_acquire(_QR_BIAS, &lock->cnts);
-	rspin_until_writer_unlock(lock, cnts);
+	atomic_cond_read_acquire(&lock->cnts, (VAL & _QW_WMASK) != _QW_LOCKED);
 
 	/*
 	 * Signal the next one in queue to become queue head
@@ -90,8 +72,6 @@ EXPORT_SYMBOL(queued_read_lock_slowpath);
  */
 void queued_write_lock_slowpath(struct qrwlock *lock)
 {
-	u32 cnts;
-
 	/* Put the writer into the wait queue */
 	arch_spin_lock(&lock->wait_lock);
 
@@ -113,15 +93,10 @@ void queued_write_lock_slowpath(struct qrwlock *lock)
 	}
 
 	/* When no more readers, set the locked flag */
-	for (;;) {
-		cnts = atomic_read(&lock->cnts);
-		if ((cnts == _QW_WAITING) &&
-		    (atomic_cmpxchg_acquire(&lock->cnts, _QW_WAITING,
-					    _QW_LOCKED) == _QW_WAITING))
-			break;
-
-		cpu_relax();
-	}
+	do {
+		atomic_cond_read_acquire(&lock->cnts, VAL == _QW_WAITING);
+	} while (atomic_cmpxchg_relaxed(&lock->cnts, _QW_WAITING,
+					_QW_LOCKED) != _QW_WAITING);
 unlock:
 	arch_spin_unlock(&lock->wait_lock);
 }
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v2 4/5] arm64: locking: Move rwlock implementation over to qrwlocks
  2017-10-06 13:34 ` Will Deacon
@ 2017-10-06 13:34   ` Will Deacon
  -1 siblings, 0 replies; 40+ messages in thread
From: Will Deacon @ 2017-10-06 13:34 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-arm-kernel, Jeremy.Linton, peterz, mingo, longman,
	boqun.feng, paulmck, Will Deacon

Now that the qrwlock can make use of WFE, remove our homebrew rwlock
code in favour of the generic queued implementation.
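
For background, this works because arm64 overrides smp_cond_load_acquire
in asm/barrier.h so that the spin body parks in WFE behind an exclusive
load. Roughly (a simplified sketch of the existing arm64 code, not
something added by this patch):

#define smp_cond_load_acquire(ptr, cond_expr)				\
({									\
	typeof(ptr) __PTR = (ptr);					\
	typeof(*ptr) VAL;						\
	for (;;) {							\
		VAL = smp_load_acquire(__PTR);				\
		if (cond_expr)						\
			break;						\
		/* LDXR arms the exclusive monitor; WFE waits on it */	\
		__cmpwait_relaxed(__PTR, VAL);				\
	}								\
	VAL;								\
})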

Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/arm64/Kconfig                      |  17 ++++
 arch/arm64/include/asm/Kbuild           |   1 +
 arch/arm64/include/asm/spinlock.h       | 164 +-------------------------------
 arch/arm64/include/asm/spinlock_types.h |   6 +-
 4 files changed, 20 insertions(+), 168 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 0df64a6a56d4..6d32c9b0d4bb 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -22,7 +22,24 @@ config ARM64
 	select ARCH_HAS_STRICT_MODULE_RWX
 	select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
 	select ARCH_HAVE_NMI_SAFE_CMPXCHG if ACPI_APEI_SEA
+	select ARCH_INLINE_READ_LOCK if !PREEMPT
+	select ARCH_INLINE_READ_LOCK_BH if !PREEMPT
+	select ARCH_INLINE_READ_LOCK_IRQ if !PREEMPT
+	select ARCH_INLINE_READ_LOCK_IRQSAVE if !PREEMPT
+	select ARCH_INLINE_READ_UNLOCK if !PREEMPT
+	select ARCH_INLINE_READ_UNLOCK_BH if !PREEMPT
+	select ARCH_INLINE_READ_UNLOCK_IRQ if !PREEMPT
+	select ARCH_INLINE_READ_UNLOCK_IRQSAVE if !PREEMPT
+	select ARCH_INLINE_WRITE_LOCK if !PREEMPT
+	select ARCH_INLINE_WRITE_LOCK_BH if !PREEMPT
+	select ARCH_INLINE_WRITE_LOCK_IRQ if !PREEMPT
+	select ARCH_INLINE_WRITE_LOCK_IRQSAVE if !PREEMPT
+	select ARCH_INLINE_WRITE_UNLOCK if !PREEMPT
+	select ARCH_INLINE_WRITE_UNLOCK_BH if !PREEMPT
+	select ARCH_INLINE_WRITE_UNLOCK_IRQ if !PREEMPT
+	select ARCH_INLINE_WRITE_UNLOCK_IRQSAVE if !PREEMPT
 	select ARCH_USE_CMPXCHG_LOCKREF
+	select ARCH_USE_QUEUED_RWLOCKS
 	select ARCH_SUPPORTS_MEMORY_FAILURE
 	select ARCH_SUPPORTS_ATOMIC_RMW
 	select ARCH_SUPPORTS_NUMA_BALANCING
diff --git a/arch/arm64/include/asm/Kbuild b/arch/arm64/include/asm/Kbuild
index 2326e39d5892..e63d0a8312de 100644
--- a/arch/arm64/include/asm/Kbuild
+++ b/arch/arm64/include/asm/Kbuild
@@ -16,6 +16,7 @@ generic-y += mcs_spinlock.h
 generic-y += mm-arch-hooks.h
 generic-y += msi.h
 generic-y += preempt.h
+generic-y += qrwlock.h
 generic-y += rwsem.h
 generic-y += segment.h
 generic-y += serial.h
diff --git a/arch/arm64/include/asm/spinlock.h b/arch/arm64/include/asm/spinlock.h
index aa51a38e46e4..fdb827c7832f 100644
--- a/arch/arm64/include/asm/spinlock.h
+++ b/arch/arm64/include/asm/spinlock.h
@@ -137,169 +137,7 @@ static inline int arch_spin_is_contended(arch_spinlock_t *lock)
 }
 #define arch_spin_is_contended	arch_spin_is_contended
 
-/*
- * Write lock implementation.
- *
- * Write locks set bit 31. Unlocking, is done by writing 0 since the lock is
- * exclusively held.
- *
- * The memory barriers are implicit with the load-acquire and store-release
- * instructions.
- */
-
-static inline void arch_write_lock(arch_rwlock_t *rw)
-{
-	unsigned int tmp;
-
-	asm volatile(ARM64_LSE_ATOMIC_INSN(
-	/* LL/SC */
-	"	sevl\n"
-	"1:	wfe\n"
-	"2:	ldaxr	%w0, %1\n"
-	"	cbnz	%w0, 1b\n"
-	"	stxr	%w0, %w2, %1\n"
-	"	cbnz	%w0, 2b\n"
-	__nops(1),
-	/* LSE atomics */
-	"1:	mov	%w0, wzr\n"
-	"2:	casa	%w0, %w2, %1\n"
-	"	cbz	%w0, 3f\n"
-	"	ldxr	%w0, %1\n"
-	"	cbz	%w0, 2b\n"
-	"	wfe\n"
-	"	b	1b\n"
-	"3:")
-	: "=&r" (tmp), "+Q" (rw->lock)
-	: "r" (0x80000000)
-	: "memory");
-}
-
-static inline int arch_write_trylock(arch_rwlock_t *rw)
-{
-	unsigned int tmp;
-
-	asm volatile(ARM64_LSE_ATOMIC_INSN(
-	/* LL/SC */
-	"1:	ldaxr	%w0, %1\n"
-	"	cbnz	%w0, 2f\n"
-	"	stxr	%w0, %w2, %1\n"
-	"	cbnz	%w0, 1b\n"
-	"2:",
-	/* LSE atomics */
-	"	mov	%w0, wzr\n"
-	"	casa	%w0, %w2, %1\n"
-	__nops(2))
-	: "=&r" (tmp), "+Q" (rw->lock)
-	: "r" (0x80000000)
-	: "memory");
-
-	return !tmp;
-}
-
-static inline void arch_write_unlock(arch_rwlock_t *rw)
-{
-	asm volatile(ARM64_LSE_ATOMIC_INSN(
-	"	stlr	wzr, %0",
-	"	swpl	wzr, wzr, %0")
-	: "=Q" (rw->lock) :: "memory");
-}
-
-/* write_can_lock - would write_trylock() succeed? */
-#define arch_write_can_lock(x)		((x)->lock == 0)
-
-/*
- * Read lock implementation.
- *
- * It exclusively loads the lock value, increments it and stores the new value
- * back if positive and the CPU still exclusively owns the location. If the
- * value is negative, the lock is already held.
- *
- * During unlocking there may be multiple active read locks but no write lock.
- *
- * The memory barriers are implicit with the load-acquire and store-release
- * instructions.
- *
- * Note that in UNDEFINED cases, such as unlocking a lock twice, the LL/SC
- * and LSE implementations may exhibit different behaviour (although this
- * will have no effect on lockdep).
- */
-static inline void arch_read_lock(arch_rwlock_t *rw)
-{
-	unsigned int tmp, tmp2;
-
-	asm volatile(
-	"	sevl\n"
-	ARM64_LSE_ATOMIC_INSN(
-	/* LL/SC */
-	"1:	wfe\n"
-	"2:	ldaxr	%w0, %2\n"
-	"	add	%w0, %w0, #1\n"
-	"	tbnz	%w0, #31, 1b\n"
-	"	stxr	%w1, %w0, %2\n"
-	"	cbnz	%w1, 2b\n"
-	__nops(1),
-	/* LSE atomics */
-	"1:	wfe\n"
-	"2:	ldxr	%w0, %2\n"
-	"	adds	%w1, %w0, #1\n"
-	"	tbnz	%w1, #31, 1b\n"
-	"	casa	%w0, %w1, %2\n"
-	"	sbc	%w0, %w1, %w0\n"
-	"	cbnz	%w0, 2b")
-	: "=&r" (tmp), "=&r" (tmp2), "+Q" (rw->lock)
-	:
-	: "cc", "memory");
-}
-
-static inline void arch_read_unlock(arch_rwlock_t *rw)
-{
-	unsigned int tmp, tmp2;
-
-	asm volatile(ARM64_LSE_ATOMIC_INSN(
-	/* LL/SC */
-	"1:	ldxr	%w0, %2\n"
-	"	sub	%w0, %w0, #1\n"
-	"	stlxr	%w1, %w0, %2\n"
-	"	cbnz	%w1, 1b",
-	/* LSE atomics */
-	"	movn	%w0, #0\n"
-	"	staddl	%w0, %2\n"
-	__nops(2))
-	: "=&r" (tmp), "=&r" (tmp2), "+Q" (rw->lock)
-	:
-	: "memory");
-}
-
-static inline int arch_read_trylock(arch_rwlock_t *rw)
-{
-	unsigned int tmp, tmp2;
-
-	asm volatile(ARM64_LSE_ATOMIC_INSN(
-	/* LL/SC */
-	"	mov	%w1, #1\n"
-	"1:	ldaxr	%w0, %2\n"
-	"	add	%w0, %w0, #1\n"
-	"	tbnz	%w0, #31, 2f\n"
-	"	stxr	%w1, %w0, %2\n"
-	"	cbnz	%w1, 1b\n"
-	"2:",
-	/* LSE atomics */
-	"	ldr	%w0, %2\n"
-	"	adds	%w1, %w0, #1\n"
-	"	tbnz	%w1, #31, 1f\n"
-	"	casa	%w0, %w1, %2\n"
-	"	sbc	%w1, %w1, %w0\n"
-	__nops(1)
-	"1:")
-	: "=&r" (tmp), "=&r" (tmp2), "+Q" (rw->lock)
-	:
-	: "cc", "memory");
-
-	return !tmp2;
-}
-
-/* read_can_lock - would read_trylock() succeed? */
-#define arch_read_can_lock(x)		((x)->lock < 0x80000000)
+#include <asm/qrwlock.h>
 
 /* See include/linux/spinlock.h */
 #define smp_mb__after_spinlock()	smp_mb()
diff --git a/arch/arm64/include/asm/spinlock_types.h b/arch/arm64/include/asm/spinlock_types.h
index 55be59a35e3f..6b856012c51b 100644
--- a/arch/arm64/include/asm/spinlock_types.h
+++ b/arch/arm64/include/asm/spinlock_types.h
@@ -36,10 +36,6 @@ typedef struct {
 
 #define __ARCH_SPIN_LOCK_UNLOCKED	{ 0 , 0 }
 
-typedef struct {
-	volatile unsigned int lock;
-} arch_rwlock_t;
-
-#define __ARCH_RW_LOCK_UNLOCKED		{ 0 }
+#include <asm-generic/qrwlock_types.h>
 
 #endif
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v2 5/5] kernel/locking: Prevent slowpath writers getting held up by fastpath
  2017-10-06 13:34 ` Will Deacon
@ 2017-10-06 13:34   ` Will Deacon
  -1 siblings, 0 replies; 40+ messages in thread
From: Will Deacon @ 2017-10-06 13:34 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-arm-kernel, Jeremy.Linton, peterz, mingo, longman,
	boqun.feng, paulmck, Will Deacon

When a prospective writer takes the qrwlock locking slowpath due to the
lock being held, it attempts to cmpxchg the wmode field from 0 to
_QW_WAITING so that concurrent lockers also take the slowpath and queue
on the spinlock accordingly, allowing the lockers to drain.

Unfortunately, this isn't fair, because a fastpath writer that comes in
after the lock is made available but before the _QW_WAITING flag is set
can effectively jump the queue. If there is a steady stream of prospective
writers, then the waiter will be held off indefinitely.

This patch restores fairness by separating _QW_WAITING and _QW_LOCKED
into two distinct fields: _QW_LOCKED continues to occupy the bottom byte
of the lock word so that it can be cleared unconditionally when unlocking,
but _QW_WAITING now occupies what used to be the bottom bit of the reader
count. This forces concurrent lockers into the slowpath.
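
As a sketch of the lock word before and after this change (illustration
only; the authoritative values are the #defines in the diff below):

/*
 *   before:   31                  8   7         0
 *             +--------------------+------------+
 *             |    reader count    |   writer   |
 *             +--------------------+------------+
 *             _QW_WAITING = 0x001, _QW_LOCKED = 0x0ff
 *
 *   after:    31               9  8   7         0
 *             +------------------+--+------------+
 *             |   reader count   |w |  wlocked   |
 *             +------------------+--+------------+
 *             _QW_WAITING = 0x100, _QW_LOCKED = 0x0ff
 *
 * Unlock still clears the bottom byte unconditionally, but a writer
 * advertising itself with _QW_WAITING now sets bit 8, which both the
 * reader and writer fastpaths observe, forcing them into the slowpath.
 */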

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 include/asm-generic/qrwlock.h       | 10 +++++-----
 include/asm-generic/qrwlock_types.h |  8 ++++----
 kernel/locking/qrwlock.c            | 20 +++++---------------
 3 files changed, 14 insertions(+), 24 deletions(-)

diff --git a/include/asm-generic/qrwlock.h b/include/asm-generic/qrwlock.h
index 02c0a768e6b0..63cb7d347b25 100644
--- a/include/asm-generic/qrwlock.h
+++ b/include/asm-generic/qrwlock.h
@@ -40,10 +40,10 @@
  *       |      rd      | wr |
  *       +----+----+----+----+
  */
-#define	_QW_WAITING	1		/* A writer is waiting	   */
-#define	_QW_LOCKED	0xff		/* A writer holds the lock */
-#define	_QW_WMASK	0xff		/* Writer mask		   */
-#define	_QR_SHIFT	8		/* Reader count shift	   */
+#define	_QW_WAITING	0x100		/* A writer is waiting	   */
+#define	_QW_LOCKED	0x0ff		/* A writer holds the lock */
+#define	_QW_WMASK	0x1ff		/* Writer mask		   */
+#define	_QR_SHIFT	9		/* Reader count shift	   */
 #define _QR_BIAS	(1U << _QR_SHIFT)
 
 /*
@@ -134,7 +134,7 @@ static inline void queued_read_unlock(struct qrwlock *lock)
  */
 static inline void queued_write_unlock(struct qrwlock *lock)
 {
-	smp_store_release(&lock->wmode, 0);
+	smp_store_release(&lock->wlocked, 0);
 }
 
 /*
diff --git a/include/asm-generic/qrwlock_types.h b/include/asm-generic/qrwlock_types.h
index 507f2dc51bba..8af752acbdc0 100644
--- a/include/asm-generic/qrwlock_types.h
+++ b/include/asm-generic/qrwlock_types.h
@@ -13,11 +13,11 @@ typedef struct qrwlock {
 		atomic_t cnts;
 		struct {
 #ifdef __LITTLE_ENDIAN
-			u8 wmode;	/* Writer mode   */
-			u8 rcnts[3];	/* Reader counts */
+			u8 wlocked;	/* Locked for write? */
+			u8 __lstate[3];
 #else
-			u8 rcnts[3];	/* Reader counts */
-			u8 wmode;	/* Writer mode   */
+			u8 __lstate[3];
+			u8 wlocked;	/* Locked for write? */
 #endif
 		};
 	};
diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
index b7ea4647c74d..e940f2c2b4f2 100644
--- a/kernel/locking/qrwlock.c
+++ b/kernel/locking/qrwlock.c
@@ -40,8 +40,7 @@ void queued_read_lock_slowpath(struct qrwlock *lock, u32 cnts)
 		 * so spin with ACQUIRE semantics until the lock is available
 		 * without waiting in the queue.
 		 */
-		atomic_cond_read_acquire(&lock->cnts, (VAL & _QW_WMASK)
-					 != _QW_LOCKED);
+		atomic_cond_read_acquire(&lock->cnts, !(VAL & _QW_LOCKED));
 		return;
 	}
 	atomic_sub(_QR_BIAS, &lock->cnts);
@@ -57,7 +56,7 @@ void queued_read_lock_slowpath(struct qrwlock *lock, u32 cnts)
 	 * that accesses can't leak upwards out of our subsequent critical
 	 * section in the case that the lock is currently held for write.
 	 */
-	atomic_cond_read_acquire(&lock->cnts, (VAL & _QW_WMASK) != _QW_LOCKED);
+	atomic_cond_read_acquire(&lock->cnts, !(VAL & _QW_LOCKED));
 
 	/*
 	 * Signal the next one in queue to become queue head
@@ -80,19 +79,10 @@ void queued_write_lock_slowpath(struct qrwlock *lock)
 	    (atomic_cmpxchg_acquire(&lock->cnts, 0, _QW_LOCKED) == 0))
 		goto unlock;
 
-	/*
-	 * Set the waiting flag to notify readers that a writer is pending,
-	 * or wait for a previous writer to go away.
-	 */
-	for (;;) {
-		if (!READ_ONCE(lock->wmode) &&
-		   (cmpxchg_relaxed(&lock->wmode, 0, _QW_WAITING) == 0))
-			break;
-
-		cpu_relax();
-	}
+	/* Set the waiting flag to notify readers that a writer is pending */
+	atomic_add(_QW_WAITING, &lock->cnts);
 
-	/* When no more readers, set the locked flag */
+	/* When no more readers or writers, set the locked flag */
 	do {
 		atomic_cond_read_acquire(&lock->cnts, VAL == _QW_WAITING);
 	} while (atomic_cmpxchg_relaxed(&lock->cnts, _QW_WAITING,
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 3/5] kernel/locking: Use atomic_cond_read_acquire when spinning in qrwlock
  2017-10-06 13:34   ` Will Deacon
@ 2017-10-08  1:03     ` Boqun Feng
  -1 siblings, 0 replies; 40+ messages in thread
From: Boqun Feng @ 2017-10-08  1:03 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-kernel, linux-arm-kernel, Jeremy.Linton, peterz, mingo,
	longman, paulmck

On Fri, Oct 06, 2017 at 01:34:40PM +0000, Will Deacon wrote:
> The qrwlock slowpaths involve spinning when either a prospective reader
> is waiting for a concurrent writer to drain, or a prospective writer is
> waiting for concurrent readers to drain. In both of these situations,
> atomic_cond_read_acquire can be used to avoid busy-waiting and make use
> of any backoff functionality provided by the architecture.
> 
> This patch replaces the open-code loops and rspin_until_writer_unlock
> implementation with atomic_cond_read_acquire. The write mode transition
> zero to _QW_WAITING is left alone, since (a) this doesn't need acquire
> semantics and (b) should be fast.
> 
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Waiman Long <longman@redhat.com>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> Signed-off-by: Will Deacon <will.deacon@arm.com>
> ---
>  kernel/locking/qrwlock.c | 47 +++++++++++------------------------------------
>  1 file changed, 11 insertions(+), 36 deletions(-)
> 
> diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
> index 1af791e37348..b7ea4647c74d 100644
> --- a/kernel/locking/qrwlock.c
> +++ b/kernel/locking/qrwlock.c
> @@ -24,23 +24,6 @@
>  #include <asm/qrwlock.h>
>  
>  /**
> - * rspin_until_writer_unlock - inc reader count & spin until writer is gone
> - * @lock  : Pointer to queue rwlock structure
> - * @writer: Current queue rwlock writer status byte
> - *
> - * In interrupt context or at the head of the queue, the reader will just
> - * increment the reader count & wait until the writer releases the lock.
> - */
> -static __always_inline void
> -rspin_until_writer_unlock(struct qrwlock *lock, u32 cnts)
> -{
> -	while ((cnts & _QW_WMASK) == _QW_LOCKED) {
> -		cpu_relax();
> -		cnts = atomic_read_acquire(&lock->cnts);
> -	}
> -}
> -
> -/**
>   * queued_read_lock_slowpath - acquire read lock of a queue rwlock
>   * @lock: Pointer to queue rwlock structure
>   * @cnts: Current qrwlock lock value
> @@ -53,13 +36,12 @@ void queued_read_lock_slowpath(struct qrwlock *lock, u32 cnts)

So the second parameter (@cnts) could be removed entirely, right?
Any reason we still keep it?

Regards,
Boqun

>  	if (unlikely(in_interrupt())) {
>  		/*
>  		 * Readers in interrupt context will get the lock immediately
> -		 * if the writer is just waiting (not holding the lock yet).
> -		 * The rspin_until_writer_unlock() function returns immediately
> -		 * in this case. Otherwise, they will spin (with ACQUIRE
> -		 * semantics) until the lock is available without waiting in
> -		 * the queue.
> +		 * if the writer is just waiting (not holding the lock yet),
> +		 * so spin with ACQUIRE semantics until the lock is available
> +		 * without waiting in the queue.
>  		 */
> -		rspin_until_writer_unlock(lock, cnts);
> +		atomic_cond_read_acquire(&lock->cnts, (VAL & _QW_WMASK)
> +					 != _QW_LOCKED);
>  		return;
>  	}
>  	atomic_sub(_QR_BIAS, &lock->cnts);
> @@ -68,14 +50,14 @@ void queued_read_lock_slowpath(struct qrwlock *lock, u32 cnts)
>  	 * Put the reader into the wait queue
>  	 */
>  	arch_spin_lock(&lock->wait_lock);
> +	atomic_add(_QR_BIAS, &lock->cnts);
>  
>  	/*
>  	 * The ACQUIRE semantics of the following spinning code ensure
>  	 * that accesses can't leak upwards out of our subsequent critical
>  	 * section in the case that the lock is currently held for write.
>  	 */
> -	cnts = atomic_fetch_add_acquire(_QR_BIAS, &lock->cnts);
> -	rspin_until_writer_unlock(lock, cnts);
> +	atomic_cond_read_acquire(&lock->cnts, (VAL & _QW_WMASK) != _QW_LOCKED);
>  
>  	/*
>  	 * Signal the next one in queue to become queue head
[...]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 0/5] Switch arm64 over to qrwlock
  2017-10-06 13:34 ` Will Deacon
@ 2017-10-08 21:30   ` Yury Norov
  -1 siblings, 0 replies; 40+ messages in thread
From: Yury Norov @ 2017-10-08 21:30 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-kernel, linux-arm-kernel, Jeremy.Linton, peterz, mingo,
	longman, boqun.feng, paulmck

On Fri, Oct 06, 2017 at 02:34:37PM +0100, Will Deacon wrote:
> Hi all,
> 
> This is version two of the patches I posted yesterday:
> 
>   http://lists.infradead.org/pipermail/linux-arm-kernel/2017-October/534666.html
> 
> I'd normally leave it longer before posting again, but Peter had a good
> suggestion to rework the layout of the lock word, so I wanted to post a
> version that follows that approach.
> 
> I've updated my branch if you're after the full patch stack:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git qrwlock
> 
> As before, all comments (particularly related to testing and performance)
> welcome!
> 
> Cheers,
> 
> Will

Hi Will,

I tested your patches with locktorture and found a measurable performance
regression. I also respun Jan Glauber's patch [1], and additionally tried
Jan's patch together with patch 5 from this series. The numbers differ a
lot from my previous measurements, but since then I have changed
workstations and now use qemu with support for parallel threads.

                        Spinlock        Read-RW lock    Write-RW lock
Vanilla:                129804626       12340895        14716138
This series:            113718002       10982159        13068934
Jan patch:              117977108       11363462        13615449
Jan patch + #5:         121483176       11696728        13618967

The bottom line of discussion [1] was that queued locks are more
effective when the SoC has many CPUs, and 4 is not many. My measurement
was made on a 4-CPU machine, and it seems to confirm that. Does it make
sense to make queued locks the default only on many-CPU machines?

There were 2 preparatory patches in the series:
[PATCH 1/3] kernel/locking: #include <asm/spinlock.h> in qrwlock
and
[PATCH 2/3] asm-generic: don't #include <linux/atomic.h> in qspinlock_types.h

The 1st patch is not needed anymore because Babu Moger submitted a similar
patch that is already in mainline: 9ab6055f95903 ("kernel/locking: Fix
compile error with qrwlock.c"). Could you revisit the second patch?

[1] https://lkml.org/lkml/2017/5/3/330

Yury

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 0/5] Switch arm64 over to qrwlock
  2017-10-08 21:30   ` Yury Norov
@ 2017-10-09  6:52     ` Peter Zijlstra
  -1 siblings, 0 replies; 40+ messages in thread
From: Peter Zijlstra @ 2017-10-09  6:52 UTC (permalink / raw)
  To: Yury Norov
  Cc: Will Deacon, linux-kernel, linux-arm-kernel, Jeremy.Linton,
	mingo, longman, boqun.feng, paulmck

On Mon, Oct 09, 2017 at 12:30:52AM +0300, Yury Norov wrote:
> The bottomline of discussion [1] was that queued locks are more
> effective when SoC has many CPUs. And 4 is not many.

qspinlock, yes. qrwlock not, as it fully depends on arch_spinlock_t for
the queueing. qrwlock is just a generic rwlock_t implementation.
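
[ To make the dependency concrete, the generic lock structure is roughly
  the following (simplified from include/asm-generic/qrwlock_types.h; the
  exact layout is reworked by this series, so treat the field details as
  approximate). All of the queueing comes from the embedded arch spinlock,
  which is why the CPU-count argument for qspinlock doesn't carry over
  directly. ]

typedef struct qrwlock {
	atomic_t	cnts;		/* reader count plus writer state bits */
	arch_spinlock_t	wait_lock;	/* waiters queue up on the arch spinlock */
} arch_rwlock_t;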

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 0/5] Switch arm64 over to qrwlock
  2017-10-08 21:30   ` Yury Norov
@ 2017-10-09  9:59     ` Will Deacon
  -1 siblings, 0 replies; 40+ messages in thread
From: Will Deacon @ 2017-10-09  9:59 UTC (permalink / raw)
  To: Yury Norov
  Cc: linux-kernel, linux-arm-kernel, Jeremy.Linton, peterz, mingo,
	longman, boqun.feng, paulmck

Hi Yury,

On Mon, Oct 09, 2017 at 12:30:52AM +0300, Yury Norov wrote:
> On Fri, Oct 06, 2017 at 02:34:37PM +0100, Will Deacon wrote:
> > This is version two of the patches I posted yesterday:
> > 
> >   http://lists.infradead.org/pipermail/linux-arm-kernel/2017-October/534666.html
> > 
> > I'd normally leave it longer before posting again, but Peter had a good
> > suggestion to rework the layout of the lock word, so I wanted to post a
> > version that follows that approach.
> > 
> > I've updated my branch if you're after the full patch stack:
> > 
> >   git://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git qrwlock
> > 
> > As before, all comments (particularly related to testing and performance)
> > welcome!
> > 
> I tested your patches with locktorture and found measurable performance
> regression. I also respin the patch of Jan Glauber [1], and I also
> tried Jan's patch with patch 5 from this series. Numbers differ a lot
> from my previous measurements, but since that I changed working
> station and use qemu with the support of parallel threads.
>                         Spinlock        Read-RW lock    Write-RW lock
> Vanilla:                129804626       12340895        14716138
> This series:            113718002       10982159        13068934
> Jan patch:              117977108       11363462        13615449
> Jan patch + #5:         121483176       11696728        13618967
> 
> The bottomline of discussion [1] was that queued locks are more
> effective when SoC has many CPUs. And 4 is not many. My measurement
> was made on the 4-CPU machine, and it seems it confirms that. Does
> it make sense to make queued locks default for many-CPU machines only?

Just to confirm, you're running this under qemu on an x86 host, using full
AArch64 system emulation? If so, I really don't think we should base the
merits of qrwlocks on arm64 around this type of configuration. Given that
you work for a silicon vendor, could you try running on real arm64 hardware
instead, please? My measurements on 6-core and 8-core systems look a lot
better with qrwlock than what we currently have in mainline, and they
also fix a real starvation issue reported by Jeremy [1].

I'd also add that lock fairness comes at a cost, so I'd expect a small drop
in total throughput for some workloads. I encourage you to try passing
different arguments to locktorture to see this in action. For example, on
an 8-core machine:

# insmod ./locktorture.ko nwriters_stress=2 nreaders_stress=8 torture_type="rw_lock_irq" stat_interval=2

-rc3:

  Writes:  Total: 6612  Max/Min: 0/0   Fail: 0
  Reads :  Total: 1265230  Max/Min: 0/0   Fail: 0
  Writes:  Total: 6709  Max/Min: 0/0   Fail: 0
  Reads :  Total: 1916418  Max/Min: 0/0   Fail: 0
  Writes:  Total: 6725  Max/Min: 0/0   Fail: 0
  Reads :  Total: 5103727  Max/Min: 0/0   Fail: 0

Notice how the writers are really struggling here (you only have to tweak
things a bit more and you get RCU stalls, lost interrupts, etc.).

With the qrwlock:

  Writes:  Total: 47962  Max/Min: 0/0   Fail: 0
  Reads :  Total: 277903  Max/Min: 0/0   Fail: 0
  Writes:  Total: 100151  Max/Min: 0/0   Fail: 0
  Reads :  Total: 525781  Max/Min: 0/0   Fail: 0
  Writes:  Total: 155284  Max/Min: 0/0   Fail: 0
  Reads :  Total: 767703  Max/Min: 0/0   Fail: 0

which is an awful lot better for maximum latency and fairness, despite the
much lower reader count.

> There were 2 preparing patches in the series: 
> [PATCH 1/3] kernel/locking: #include <asm/spinlock.h> in qrwlock
> and
> [PATCH 2/3] asm-generic: don't #include <linux/atomic.h> in qspinlock_types.h
> 
> 1st patch is not needed anymore because Babu Moger submitted similar patch that
> is already in mainline: 9ab6055f95903 ("kernel/locking: Fix compile error with
> qrwlock.c"). Could you revisit second patch?

Sorry, not sure what you're asking me to do here.

Will

[1] http://lists.infradead.org/pipermail/linux-arm-kernel/2017-October/534299.html

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 0/5] Switch arm64 over to qrwlock
  2017-10-09  6:52     ` Peter Zijlstra
@ 2017-10-09 10:02       ` Will Deacon
  -1 siblings, 0 replies; 40+ messages in thread
From: Will Deacon @ 2017-10-09 10:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Yury Norov, linux-kernel, linux-arm-kernel, Jeremy.Linton, mingo,
	longman, boqun.feng, paulmck

On Mon, Oct 09, 2017 at 08:52:43AM +0200, Peter Zijlstra wrote:
> On Mon, Oct 09, 2017 at 12:30:52AM +0300, Yury Norov wrote:
> > The bottomline of discussion [1] was that queued locks are more
> > effective when SoC has many CPUs. And 4 is not many.
> 
> qspinlock, yes. qrwlock not, as it fully depends on arch_spinlock_t for
> the queueing. qrwlock is just a generic rwlock_t implementation.

Yup, and once I've knocked qrwlocks on the head I'll go take a look at
qspinlock. Either way, I'll keep our ticket implementation around because
(a) it's a tonne simpler, (b) I don't have data to suggest that it sucks, and
(c) there's been formal work to show that various parts of it are correct.

rwlock, on the other hand, has been shown to be broken; I know that it sucks
and there's been no formal work, so I'll be glad to see the back of it!

Will
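
[ For reference, the ticket scheme being kept is conceptually as simple as
  the sketch below. This is a generic C illustration only, with made-up
  type and function names; the real arm64 version lives in
  arch/arm64/include/asm/spinlock.h and waits with WFE rather than
  spinning. ]

struct ticket_lock {
	atomic_t next;	/* next ticket number to hand out */
	atomic_t owner;	/* ticket currently being served */
};

static inline void ticket_lock(struct ticket_lock *l)
{
	int ticket = atomic_fetch_add(1, &l->next);	/* take a ticket */

	/* Wait for our number to come up; ACQUIRE orders the critical section. */
	while (atomic_read_acquire(&l->owner) != ticket)
		cpu_relax();
}

static inline void ticket_unlock(struct ticket_lock *l)
{
	/* Only the holder writes owner, so a plain increment with RELEASE suffices. */
	atomic_set_release(&l->owner, atomic_read(&l->owner) + 1);
}

[ The appeal is exactly as listed above: tiny, trivially FIFO-fair, and
  amenable to formal reasoning. ]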

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 3/5] kernel/locking: Use atomic_cond_read_acquire when spinning in qrwlock
  2017-10-08  1:03     ` Boqun Feng
@ 2017-10-09 11:30       ` Will Deacon
  -1 siblings, 0 replies; 40+ messages in thread
From: Will Deacon @ 2017-10-09 11:30 UTC (permalink / raw)
  To: Boqun Feng
  Cc: linux-kernel, linux-arm-kernel, Jeremy.Linton, peterz, mingo,
	longman, paulmck

On Sun, Oct 08, 2017 at 09:03:34AM +0800, Boqun Feng wrote:
> On Fri, Oct 06, 2017 at 01:34:40PM +0000, Will Deacon wrote:
> > The qrwlock slowpaths involve spinning when either a prospective reader
> > is waiting for a concurrent writer to drain, or a prospective writer is
> > waiting for concurrent readers to drain. In both of these situations,
> > atomic_cond_read_acquire can be used to avoid busy-waiting and make use
> > of any backoff functionality provided by the architecture.
> > 
> > This patch replaces the open-code loops and rspin_until_writer_unlock
> > implementation with atomic_cond_read_acquire. The write mode transition
> > zero to _QW_WAITING is left alone, since (a) this doesn't need acquire
> > semantics and (b) should be fast.
> > 
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Ingo Molnar <mingo@redhat.com>
> > Cc: Waiman Long <longman@redhat.com>
> > Cc: Boqun Feng <boqun.feng@gmail.com>
> > Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> > Signed-off-by: Will Deacon <will.deacon@arm.com>
> > ---
> >  kernel/locking/qrwlock.c | 47 +++++++++++------------------------------------
> >  1 file changed, 11 insertions(+), 36 deletions(-)
> > 
> > diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
> > index 1af791e37348..b7ea4647c74d 100644
> > --- a/kernel/locking/qrwlock.c
> > +++ b/kernel/locking/qrwlock.c
> > @@ -24,23 +24,6 @@
> >  #include <asm/qrwlock.h>
> >  
> >  /**
> > - * rspin_until_writer_unlock - inc reader count & spin until writer is gone
> > - * @lock  : Pointer to queue rwlock structure
> > - * @writer: Current queue rwlock writer status byte
> > - *
> > - * In interrupt context or at the head of the queue, the reader will just
> > - * increment the reader count & wait until the writer releases the lock.
> > - */
> > -static __always_inline void
> > -rspin_until_writer_unlock(struct qrwlock *lock, u32 cnts)
> > -{
> > -	while ((cnts & _QW_WMASK) == _QW_LOCKED) {
> > -		cpu_relax();
> > -		cnts = atomic_read_acquire(&lock->cnts);
> > -	}
> > -}
> > -
> > -/**
> >   * queued_read_lock_slowpath - acquire read lock of a queue rwlock
> >   * @lock: Pointer to queue rwlock structure
> >   * @cnts: Current qrwlock lock value
> > @@ -53,13 +36,12 @@ void queued_read_lock_slowpath(struct qrwlock *lock, u32 cnts)
> 
> So the second parameter(@cnts) could be removed entirely, right?
> Any reason we still keep it?

Well spotted! I'll remove it.

Will
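
[ Pieced together from the hunks quoted above, the read slowpath ends up
  looking roughly like this once the unused @cnts parameter is dropped.
  Sketch only: in particular the final unlock of wait_lock is inferred from
  the "Signal the next one in queue" comment rather than shown in the
  quoted diff, so check the actual patch for the real thing. ]

void queued_read_lock_slowpath(struct qrwlock *lock)
{
	if (unlikely(in_interrupt())) {
		/* In IRQ context, only wait out a writer that already holds the lock. */
		atomic_cond_read_acquire(&lock->cnts,
					 (VAL & _QW_WMASK) != _QW_LOCKED);
		return;
	}

	/* Give back the fastpath's reader bias and join the queue instead. */
	atomic_sub(_QR_BIAS, &lock->cnts);
	arch_spin_lock(&lock->wait_lock);

	/* Head of the queue: take a reader reference and wait for the writer to go. */
	atomic_add(_QR_BIAS, &lock->cnts);
	atomic_cond_read_acquire(&lock->cnts, (VAL & _QW_WMASK) != _QW_LOCKED);

	/* Signal the next one in queue to become queue head. */
	arch_spin_unlock(&lock->wait_lock);
}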

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 0/5] Switch arm64 over to qrwlock
  2017-10-09  9:59     ` Will Deacon
@ 2017-10-09 12:49       ` Yury Norov
  -1 siblings, 0 replies; 40+ messages in thread
From: Yury Norov @ 2017-10-09 12:49 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-kernel, linux-arm-kernel, Jeremy.Linton, peterz, mingo,
	longman, boqun.feng, paulmck

On Mon, Oct 09, 2017 at 10:59:36AM +0100, Will Deacon wrote:
> Hi Yury,
> 
> On Mon, Oct 09, 2017 at 12:30:52AM +0300, Yury Norov wrote:
> > On Fri, Oct 06, 2017 at 02:34:37PM +0100, Will Deacon wrote:
> > > This is version two of the patches I posted yesterday:
> > > 
> > >   http://lists.infradead.org/pipermail/linux-arm-kernel/2017-October/534666.html
> > > 
> > > I'd normally leave it longer before posting again, but Peter had a good
> > > suggestion to rework the layout of the lock word, so I wanted to post a
> > > version that follows that approach.
> > > 
> > > I've updated my branch if you're after the full patch stack:
> > > 
> > >   git://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git qrwlock
> > > 
> > > As before, all comments (particularly related to testing and performance)
> > > welcome!
> > > 
> > I tested your patches with locktorture and found measurable performance
> > regression. I also respin the patch of Jan Glauber [1], and I also
> > tried Jan's patch with patch 5 from this series. Numbers differ a lot
> > from my previous measurements, but since that I changed working
> > station and use qemu with the support of parallel threads.
> >                         Spinlock        Read-RW lock    Write-RW lock
> > Vanilla:                129804626       12340895        14716138
> > This series:            113718002       10982159        13068934
> > Jan patch:              117977108       11363462        13615449
> > Jan patch + #5:         121483176       11696728        13618967
> > 
> > The bottomline of discussion [1] was that queued locks are more
> > effective when SoC has many CPUs. And 4 is not many. My measurement
> > was made on the 4-CPU machine, and it seems it confirms that. Does
> > it make sense to make queued locks default for many-CPU machines only?
> 
> Just to confirm, you're running this under qemu on an x86 host, using full
> AArch64 system emulation? If so, I really don't think we should base the
> merits of qrwlocks on arm64 around this type of configuration. Given that
> you work for a silicon vendor, could you try running on real arm64 hardware
> instead, please?

I don't have hardware access at the moment. I'll run the test when I get
it.

> My measurements on 6-core and 8-core systems look a lot
> better with qrwlock than what we currently have in mainline, and they
> also fix a real starvation issue reported by Jeremy [1].
> 
> I'd also add that lock fairness comes at a cost, so I'd expect a small drop
> in total throughput for some workloads. I encourage you to try passing
> different arguments to locktorture to see this in action. For example, on
> an 8-core machine:
> 
> # insmod ./locktorture.ko nwriters_stress=2 nreaders_stress=8 torture_type="rw_lock_irq" stat_interval=2
> 
> -rc3:
> 
>   Writes:  Total: 6612  Max/Min: 0/0   Fail: 0
>   Reads :  Total: 1265230  Max/Min: 0/0   Fail: 0
>   Writes:  Total: 6709  Max/Min: 0/0   Fail: 0
>   Reads :  Total: 1916418  Max/Min: 0/0   Fail: 0
>   Writes:  Total: 6725  Max/Min: 0/0   Fail: 0
>   Reads :  Total: 5103727  Max/Min: 0/0   Fail: 0
> 
> notice how the writers are really struggling here (you only have to tweak a
> bit more and you get RCU stalls, lose interrupts etc).
> 
> With the qrwlock:
> 
>   Writes:  Total: 47962  Max/Min: 0/0   Fail: 0
>   Reads :  Total: 277903  Max/Min: 0/0   Fail: 0
>   Writes:  Total: 100151  Max/Min: 0/0   Fail: 0
>   Reads :  Total: 525781  Max/Min: 0/0   Fail: 0
>   Writes:  Total: 155284  Max/Min: 0/0   Fail: 0
>   Reads :  Total: 767703  Max/Min: 0/0   Fail: 0
> 
> which is an awful lot better for maximum latency and fairness, despite the
> much lower reader count.
> 
> > There were 2 preparing patches in the series: 
> > [PATCH 1/3] kernel/locking: #include <asm/spinlock.h> in qrwlock
> > and
> > [PATCH 2/3] asm-generic: don't #include <linux/atomic.h> in qspinlock_types.h
> > 
> > 1st patch is not needed anymore because Babu Moger submitted similar patch that
> > is already in mainline: 9ab6055f95903 ("kernel/locking: Fix compile error with
> > qrwlock.c"). Could you revisit second patch?
> 
> Sorry, not sure what you're asking me to do here.

It removes an unneeded #include <linux/atomic.h> from
include/asm-generic/qspinlock_types.h. Could you or someone else take it
upstream?
 
> Will
> 
> [1] http://lists.infradead.org/pipermail/linux-arm-kernel/2017-October/534299.html

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 0/5] Switch arm64 over to qrwlock
  2017-10-09 12:49       ` Yury Norov
@ 2017-10-09 13:13         ` Will Deacon
  -1 siblings, 0 replies; 40+ messages in thread
From: Will Deacon @ 2017-10-09 13:13 UTC (permalink / raw)
  To: Yury Norov
  Cc: linux-kernel, linux-arm-kernel, Jeremy.Linton, peterz, mingo,
	longman, boqun.feng, paulmck

On Mon, Oct 09, 2017 at 03:49:21PM +0300, Yury Norov wrote:
> On Mon, Oct 09, 2017 at 10:59:36AM +0100, Will Deacon wrote:
> > On Mon, Oct 09, 2017 at 12:30:52AM +0300, Yury Norov wrote:
> > > There were 2 preparing patches in the series: 
> > > [PATCH 1/3] kernel/locking: #include <asm/spinlock.h> in qrwlock
> > > and
> > > [PATCH 2/3] asm-generic: don't #include <linux/atomic.h> in qspinlock_types.h
> > > 
> > > 1st patch is not needed anymore because Babu Moger submitted similar patch that
> > > is already in mainline: 9ab6055f95903 ("kernel/locking: Fix compile error with
> > > qrwlock.c"). Could you revisit second patch?
> > 
> > Sorry, not sure what you're asking me to do here.
> 
> It removes unneeded #include <linux/atomic.h> in
> include/asm-generic/qspinlock_types.h. Could you or someone else take
> it upstream?

My patch implements qrwlocks, not qspinlocks, so it's a bit weird to take
this random patch in the same series. Given that Arnd acked it, I'd suggest
either sending it through him, or leaving it until I get round to looking at
qspinlock for arm64 (see my reply to Peter).

Will

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 0/5] Switch arm64 over to qrwlock
  2017-10-06 13:34 ` Will Deacon
@ 2017-10-09 21:19   ` Waiman Long
  -1 siblings, 0 replies; 40+ messages in thread
From: Waiman Long @ 2017-10-09 21:19 UTC (permalink / raw)
  To: Will Deacon, linux-kernel
  Cc: linux-arm-kernel, Jeremy.Linton, peterz, mingo, boqun.feng,
	paulmck, Yury Norov

On 10/06/2017 09:34 AM, Will Deacon wrote:
> Hi all,
>
> This is version two of the patches I posted yesterday:
>
>   http://lists.infradead.org/pipermail/linux-arm-kernel/2017-October/534666.html
>
> I'd normally leave it longer before posting again, but Peter had a good
> suggestion to rework the layout of the lock word, so I wanted to post a
> version that follows that approach.
>
> I've updated my branch if you're after the full patch stack:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git qrwlock
>
> As before, all comments (particularly related to testing and performance)
> welcome!
>
> Cheers,
>
> Will
>
> --->8
>
> Will Deacon (5):
>   kernel/locking: Use struct qrwlock instead of struct __qrwlock
>   locking/atomic: Add atomic_cond_read_acquire
>   kernel/locking: Use atomic_cond_read_acquire when spinning in qrwlock
>   arm64: locking: Move rwlock implementation over to qrwlocks
>   kernel/locking: Prevent slowpath writers getting held up by fastpath
>
>  arch/arm64/Kconfig                      |  17 ++++
>  arch/arm64/include/asm/Kbuild           |   1 +
>  arch/arm64/include/asm/spinlock.h       | 164 +-------------------------------
>  arch/arm64/include/asm/spinlock_types.h |   6 +-
>  include/asm-generic/atomic-long.h       |   3 +
>  include/asm-generic/qrwlock.h           |  20 +---
>  include/asm-generic/qrwlock_types.h     |  15 ++-
>  include/linux/atomic.h                  |   4 +
>  kernel/locking/qrwlock.c                |  83 +++-------------
>  9 files changed, 58 insertions(+), 255 deletions(-)
>
I have done some performance testing of your patches on a 1-socket Cavium
CN8880 system with 32 cores. I used my locking stress test, which produced
the following results with 16 locking threads at various mixes of reader
and writer threads on 4.14-rc4 based kernels. The numbers are the
minimum/average/maximum locking operations done per locking thread in a
10-second period. A minimum number of 1 means there is at least 1 thread
that cannot acquire the lock during the test period.

                w/o qrwlock patch               with qrwlock patch
                -----------------               ------------------
16 readers   793,024/1,169,763/1,684,751  1,060,127/1,198,583/1,331,003
        
12 readers 1,162,760/1,641,714/2,162,939  1,685,334/2,099,088/2,338,461
 4 writers         1/        1/        1     25,540/  195,975/  392,232
 
 8 readers 2,135,670/2,391,612/2,737,564  2,985,686/3,359,048/3,870,423
 8 writers         1/   19,867/   88,173    119,078/  559,604/1,112,769
 
 4 readers 1,194,917/1,250,876/1,299,304  3,611,059/4,653,775/6,268,370
12 writers   176,156/1,088,513/2,594,534      7,664/  795,393/1,841,961

16 writers    35,007/1,094,608/1,954,457  1,618,915/1,633,077/1,645,637

It can be seen that qrwlock performed much better than the original rwlock
implementation.

Tested-by: Waiman Long <longman@redhat.com>

Cheers,
Longman

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 0/5] Switch arm64 over to qrwlock
  2017-10-06 13:34 ` Will Deacon
@ 2017-10-09 22:31   ` Jeremy Linton
  -1 siblings, 0 replies; 40+ messages in thread
From: Jeremy Linton @ 2017-10-09 22:31 UTC (permalink / raw)
  To: Will Deacon, linux-kernel
  Cc: linux-arm-kernel, peterz, mingo, longman, boqun.feng, paulmck

Hi,

On 10/06/2017 08:34 AM, Will Deacon wrote:
> Hi all,
> 
> This is version two of the patches I posted yesterday:
> 
>    http://lists.infradead.org/pipermail/linux-arm-kernel/2017-October/534666.html
> 
> I'd normally leave it longer before posting again, but Peter had a good
> suggestion to rework the layout of the lock word, so I wanted to post a
> version that follows that approach.
> 
> I've updated my branch if you're after the full patch stack:
> 
>    git://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git qrwlock
> 
> As before, all comments (particularly related to testing and performance)
> welcome!

I've been doing perf comparisons against the rwlock fairness patch I posted
last week and the baseline rwlock, on a single-socket ThunderX. For most
cases where the ratio of readers to writers is similar (uncontended
readers/writers, a single writer with lots of readers, etc.) the absolute
number of lock acquisitions is very similar (within a few percent one way
or the other).

In this regard both patches are light years ahead of the current arm64
rwlock. The unfairness of the current rwlock allows significantly higher
lock acquisition counts (say 4x at 30 readers:1 writer) at the expense of
complete writer starvation (a ~43k:1 ratio at 30R:1W per locktorture).
This is untenable.
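
[ To make the starvation mode concrete: with the old scheme the writer can
  only make progress by swinging the whole lock word from 0 to a writer bit
  in a single atomic step, so overlapping readers that keep the count
  non-zero hold it off indefinitely, whereas the qrwlock parks later
  readers behind the waiting writer. A conceptual sketch with made-up
  names, not the arm64 assembly: ]

#define NAIVE_WRITER	(1 << 30)	/* hypothetical writer bit; readers use the low bits */

static void naive_write_lock(atomic_t *lock)
{
	/* Only succeeds in a window where the word is completely idle. */
	while (atomic_cmpxchg_acquire(lock, 0, NAIVE_WRITER) != 0)
		cpu_relax();
}

static void naive_read_lock(atomic_t *lock)
{
	/* Readers pile on whenever no writer *holds* the lock, ignoring waiters. */
	for (;;) {
		int cnts = atomic_read(lock);

		if (!(cnts & NAIVE_WRITER) &&
		    atomic_cmpxchg_acquire(lock, cnts, cnts + 1) == cnts)
			return;
		cpu_relax();
	}
}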

The qrwlock does an excellent job of keeping the ratio of reader/writer
acquisitions proportional to the number of readers/writers until the total
number of lockers exceeds the number of cores, at which point the
acquisition ratios start to far exceed the reader/writer ratios (440:1
acquisitions @ 96R:1W).

In comparison, the other patch tends to favor the writers more, so at a
ratio of 48R:1W the readers are only grabbing the lock at a ratio of 15:1.
This flatter curve continues past the number of cores, with the readers
having a 48:1 advantage at 96R:1W. That said, the total lock acquisition
counts remain very similar (with maybe a slight advantage to the non-queued
patch with 1 writer and 12-30 readers) despite the writer acquiring the
lock at a higher frequency. OTOH, if the number of writers is closer to the
number of readers (24R:24W), then the writers have about a 1.5x bias over
the readers, independent of the total number of readers/writers. This bias
seems to be a constant multiplier for a given reader/writer ratio with
those patches, and it doesn't account for possible starvation of an
individual thread.

Of course, I've been running other tests as well, and the system seems to
be behaving as expected (likely better than the rwlock patches under
stress). I will continue to test this on a couple of other platforms.

In the meantime:

Tested-by: Jeremy Linton <jeremy.linton@arm.com>


> 
> Cheers,
> 
> Will
> 
> --->8
> 
> Will Deacon (5):
>    kernel/locking: Use struct qrwlock instead of struct __qrwlock
>    locking/atomic: Add atomic_cond_read_acquire
>    kernel/locking: Use atomic_cond_read_acquire when spinning in qrwlock
>    arm64: locking: Move rwlock implementation over to qrwlocks
>    kernel/locking: Prevent slowpath writers getting held up by fastpath
> 
>   arch/arm64/Kconfig                      |  17 ++++
>   arch/arm64/include/asm/Kbuild           |   1 +
>   arch/arm64/include/asm/spinlock.h       | 164 +-------------------------------
>   arch/arm64/include/asm/spinlock_types.h |   6 +-
>   include/asm-generic/atomic-long.h       |   3 +
>   include/asm-generic/qrwlock.h           |  20 +---
>   include/asm-generic/qrwlock_types.h     |  15 ++-
>   include/linux/atomic.h                  |   4 +
>   kernel/locking/qrwlock.c                |  83 +++-------------
>   9 files changed, 58 insertions(+), 255 deletions(-)
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 4/5] arm64: locking: Move rwlock implementation over to qrwlocks
  2017-10-06 13:34   ` Will Deacon
@ 2017-10-10  1:34     ` Waiman Long
  -1 siblings, 0 replies; 40+ messages in thread
From: Waiman Long @ 2017-10-10  1:34 UTC (permalink / raw)
  To: Will Deacon, linux-kernel
  Cc: linux-arm-kernel, Jeremy.Linton, peterz, mingo, boqun.feng, paulmck

On 10/06/2017 09:34 AM, Will Deacon wrote:
> Now that the qrwlock can make use of WFE, remove our homebrew rwlock
> code in favour of the generic queued implementation.
>
> Signed-off-by: Will Deacon <will.deacon@arm.com>
> ---
>  arch/arm64/Kconfig                      |  17 ++++
>  arch/arm64/include/asm/Kbuild           |   1 +
>  arch/arm64/include/asm/spinlock.h       | 164 +-------------------------------
>  arch/arm64/include/asm/spinlock_types.h |   6 +-
>  4 files changed, 20 insertions(+), 168 deletions(-)
>
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 0df64a6a56d4..6d32c9b0d4bb 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -22,7 +22,24 @@ config ARM64
>  	select ARCH_HAS_STRICT_MODULE_RWX
>  	select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
>  	select ARCH_HAVE_NMI_SAFE_CMPXCHG if ACPI_APEI_SEA
> +	select ARCH_INLINE_READ_LOCK if !PREEMPT
> +	select ARCH_INLINE_READ_LOCK_BH if !PREEMPT
> +	select ARCH_INLINE_READ_LOCK_IRQ if !PREEMPT
> +	select ARCH_INLINE_READ_LOCK_IRQSAVE if !PREEMPT
> +	select ARCH_INLINE_READ_UNLOCK if !PREEMPT
> +	select ARCH_INLINE_READ_UNLOCK_BH if !PREEMPT
> +	select ARCH_INLINE_READ_UNLOCK_IRQ if !PREEMPT
> +	select ARCH_INLINE_READ_UNLOCK_IRQSAVE if !PREEMPT
> +	select ARCH_INLINE_WRITE_LOCK if !PREEMPT
> +	select ARCH_INLINE_WRITE_LOCK_BH if !PREEMPT
> +	select ARCH_INLINE_WRITE_LOCK_IRQ if !PREEMPT
> +	select ARCH_INLINE_WRITE_LOCK_IRQSAVE if !PREEMPT
> +	select ARCH_INLINE_WRITE_UNLOCK if !PREEMPT
> +	select ARCH_INLINE_WRITE_UNLOCK_BH if !PREEMPT
> +	select ARCH_INLINE_WRITE_UNLOCK_IRQ if !PREEMPT
> +	select ARCH_INLINE_WRITE_UNLOCK_IRQSAVE if !PREEMPT
>  	select ARCH_USE_CMPXCHG_LOCKREF
> +	select ARCH_USE_QUEUED_RWLOCKS
>  	select ARCH_SUPPORTS_MEMORY_FAILURE
>  	select ARCH_SUPPORTS_ATOMIC_RMW
>  	select ARCH_SUPPORTS_NUMA_BALANCING

Inlining is good for performance, but it may come with an increase in
kernel text size. Inlining unlock and unlock_irq is OK, but the other
inlines will increase the text size of the kernel. Have you measured how
much the size increase will be? Is there any concern about the increased
text size?

Cheers,
Longman
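
[ For anyone wondering what those selects actually toggle: they only gate
  whether the generic lock wrappers are expanded inline at each call site
  or emitted once as out-of-line functions. A simplified sketch of the
  existing generic machinery, from memory (see
  include/linux/rwlock_api_smp.h and kernel/locking/spinlock.c for the
  real thing): ]

#ifdef CONFIG_INLINE_READ_LOCK	/* the arch opted in via ARCH_INLINE_READ_LOCK (here only if !PREEMPT) */
#define _raw_read_lock(lock)	__raw_read_lock(lock)	/* fastpath expanded in the caller */
#else
void __lockfunc _raw_read_lock(rwlock_t *lock);	/* one shared copy in kernel/locking/spinlock.c */
#endif

[ Which is why the unlock selects are cheap, while the lock selects add a
  fastpath's worth of code per call site. ]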

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 0/5] Switch arm64 over to qrwlock
  2017-10-06 13:34 ` Will Deacon
@ 2017-10-10 18:20   ` Adam Wallis
  -1 siblings, 0 replies; 40+ messages in thread
From: Adam Wallis @ 2017-10-10 18:20 UTC (permalink / raw)
  To: Will Deacon, linux-kernel
  Cc: peterz, boqun.feng, Jeremy.Linton, mingo, longman, paulmck,
	linux-arm-kernel

On 10/6/2017 9:34 AM, Will Deacon wrote:
> Hi all,
> 
> This is version two of the patches I posted yesterday:
> 
>   http://lists.infradead.org/pipermail/linux-arm-kernel/2017-October/534666.html
> 
> I'd normally leave it longer before posting again, but Peter had a good
> suggestion to rework the layout of the lock word, so I wanted to post a
> version that follows that approach.
> 
> I've updated my branch if you're after the full patch stack:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git qrwlock
> 
> As before, all comments (particularly related to testing and performance)
> welcome!
> 
> Cheers,
> 
> Will
> 
> --->8
> 
> Will Deacon (5):
>   kernel/locking: Use struct qrwlock instead of struct __qrwlock
>   locking/atomic: Add atomic_cond_read_acquire
>   kernel/locking: Use atomic_cond_read_acquire when spinning in qrwlock
>   arm64: locking: Move rwlock implementation over to qrwlocks
>   kernel/locking: Prevent slowpath writers getting held up by fastpath
> 
>  arch/arm64/Kconfig                      |  17 ++++
>  arch/arm64/include/asm/Kbuild           |   1 +
>  arch/arm64/include/asm/spinlock.h       | 164 +-------------------------------
>  arch/arm64/include/asm/spinlock_types.h |   6 +-
>  include/asm-generic/atomic-long.h       |   3 +
>  include/asm-generic/qrwlock.h           |  20 +---
>  include/asm-generic/qrwlock_types.h     |  15 ++-
>  include/linux/atomic.h                  |   4 +
>  kernel/locking/qrwlock.c                |  83 +++-------------
>  9 files changed, 58 insertions(+), 255 deletions(-)
> 

Applied on top of 4.14-rc4, I tested these patches with multiple
reader:writer combinations. These patches help prevent writer starvation
in every combination that I tested. Without these patches, when the
reader:writer ratio is 2:1, it's trivial for me to see acquisition
ratios of 250:1 (@ 2R:1W).

After applying the qrwlock patches, I see the acquisition ratios level
out to around 1.6:1 (@ 2R:1W), which is quite an improvement.

Thanks Will!

Tested-by: Adam Wallis <awallis@codeaurora.org>

-- 
Adam Wallis
Qualcomm Datacenter Technologies as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v2 0/5] Switch arm64 over to qrwlock
@ 2017-10-10 18:20   ` Adam Wallis
  0 siblings, 0 replies; 40+ messages in thread
From: Adam Wallis @ 2017-10-10 18:20 UTC (permalink / raw)
  To: linux-arm-kernel

On 10/6/2017 9:34 AM, Will Deacon wrote:
> Hi all,
> 
> This is version two of the patches I posted yesterday:
> 
>   http://lists.infradead.org/pipermail/linux-arm-kernel/2017-October/534666.html
> 
> I'd normally leave it longer before posting again, but Peter had a good
> suggestion to rework the layout of the lock word, so I wanted to post a
> version that follows that approach.
> 
> I've updated my branch if you're after the full patch stack:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git qrwlock
> 
> As before, all comments (particularly related to testing and performance)
> welcome!
> 
> Cheers,
> 
> Will
> 
> --->8
> 
> Will Deacon (5):
>   kernel/locking: Use struct qrwlock instead of struct __qrwlock
>   locking/atomic: Add atomic_cond_read_acquire
>   kernel/locking: Use atomic_cond_read_acquire when spinning in qrwlock
>   arm64: locking: Move rwlock implementation over to qrwlocks
>   kernel/locking: Prevent slowpath writers getting held up by fastpath
> 
>  arch/arm64/Kconfig                      |  17 ++++
>  arch/arm64/include/asm/Kbuild           |   1 +
>  arch/arm64/include/asm/spinlock.h       | 164 +-------------------------------
>  arch/arm64/include/asm/spinlock_types.h |   6 +-
>  include/asm-generic/atomic-long.h       |   3 +
>  include/asm-generic/qrwlock.h           |  20 +---
>  include/asm-generic/qrwlock_types.h     |  15 ++-
>  include/linux/atomic.h                  |   4 +
>  kernel/locking/qrwlock.c                |  83 +++-------------
>  9 files changed, 58 insertions(+), 255 deletions(-)
> 

Applied on top of 4.14-rc4, I tested these patches with multiple
reader:writer combinations. These patches help prevent writer starvation
in every combination that I tested. Without these patches, when the
reader:writer ratio is 2:1, it's trivial for me to see acquisition
ratios of 250:1 (@ 2R:1W).

After applying the qrwlock patches, I see the acquisition ratios level
out to around 1.6:1 (@ 2R:1W), which is quite an improvement.

Thanks Will!

Tested-by: Adam Wallis <awallis@codeaurora.org>

-- 
Adam Wallis
Qualcomm Datacenter Technologies as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 40+ messages in thread
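
The test Adam describes isn't included here, but the measurement itself
is easy to picture: N reader threads and M writer threads hammering a
shared lock for a fixed interval, each counting its successful
acquisitions, then reporting the read:write ratio. A minimal userspace
sketch of that methodology (using a pthread rwlock purely for
illustration -- it will not behave like the kernel's qrwlock, and all
names below are made up) might look like:

	/* ratio.c -- build with: gcc -O2 -pthread ratio.c -o ratio */
	#include <pthread.h>
	#include <stdatomic.h>
	#include <stdio.h>
	#include <unistd.h>

	#define NREADERS 2
	#define NWRITERS 1
	#define RUN_SECS 5

	static pthread_rwlock_t lock = PTHREAD_RWLOCK_INITIALIZER;
	static atomic_int stop;
	static atomic_long reads, writes;

	static void *reader(void *unused)
	{
		while (!atomic_load(&stop)) {
			pthread_rwlock_rdlock(&lock);
			atomic_fetch_add(&reads, 1);	/* count this acquisition */
			pthread_rwlock_unlock(&lock);
		}
		return NULL;
	}

	static void *writer(void *unused)
	{
		while (!atomic_load(&stop)) {
			pthread_rwlock_wrlock(&lock);
			atomic_fetch_add(&writes, 1);	/* count this acquisition */
			pthread_rwlock_unlock(&lock);
		}
		return NULL;
	}

	int main(void)
	{
		pthread_t t[NREADERS + NWRITERS];
		int i;

		for (i = 0; i < NREADERS; i++)
			pthread_create(&t[i], NULL, reader, NULL);
		for (; i < NREADERS + NWRITERS; i++)
			pthread_create(&t[i], NULL, writer, NULL);

		sleep(RUN_SECS);
		atomic_store(&stop, 1);
		for (i = 0; i < NREADERS + NWRITERS; i++)
			pthread_join(t[i], NULL);

		printf("reads=%ld writes=%ld ratio=%.1f:1\n",
		       atomic_load(&reads), atomic_load(&writes),
		       (double)atomic_load(&reads) / atomic_load(&writes));
		return 0;
	}

With 2 readers and 1 writer, a heavily reader-biased lock shows a very
large ratio; a fair queued lock keeps it close to the thread ratio.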

* Re: [PATCH v2 4/5] arm64: locking: Move rwlock implementation over to qrwlocks
  2017-10-10  1:34     ` Waiman Long
@ 2017-10-11 11:49       ` Will Deacon
  -1 siblings, 0 replies; 40+ messages in thread
From: Will Deacon @ 2017-10-11 11:49 UTC (permalink / raw)
  To: Waiman Long
  Cc: linux-kernel, linux-arm-kernel, Jeremy.Linton, peterz, mingo,
	boqun.feng, paulmck

Hi Waiman,

On Mon, Oct 09, 2017 at 09:34:08PM -0400, Waiman Long wrote:
> On 10/06/2017 09:34 AM, Will Deacon wrote:
> > Now that the qrwlock can make use of WFE, remove our homebrew rwlock
> > code in favour of the generic queued implementation.
> >
> > Signed-off-by: Will Deacon <will.deacon@arm.com>
> > ---
> >  arch/arm64/Kconfig                      |  17 ++++
> >  arch/arm64/include/asm/Kbuild           |   1 +
> >  arch/arm64/include/asm/spinlock.h       | 164 +-------------------------------
> >  arch/arm64/include/asm/spinlock_types.h |   6 +-
> >  4 files changed, 20 insertions(+), 168 deletions(-)
> >
> > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > index 0df64a6a56d4..6d32c9b0d4bb 100644
> > --- a/arch/arm64/Kconfig
> > +++ b/arch/arm64/Kconfig
> > @@ -22,7 +22,24 @@ config ARM64
> >  	select ARCH_HAS_STRICT_MODULE_RWX
> >  	select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
> >  	select ARCH_HAVE_NMI_SAFE_CMPXCHG if ACPI_APEI_SEA
> > +	select ARCH_INLINE_READ_LOCK if !PREEMPT
> > +	select ARCH_INLINE_READ_LOCK_BH if !PREEMPT
> > +	select ARCH_INLINE_READ_LOCK_IRQ if !PREEMPT
> > +	select ARCH_INLINE_READ_LOCK_IRQSAVE if !PREEMPT
> > +	select ARCH_INLINE_READ_UNLOCK if !PREEMPT
> > +	select ARCH_INLINE_READ_UNLOCK_BH if !PREEMPT
> > +	select ARCH_INLINE_READ_UNLOCK_IRQ if !PREEMPT
> > +	select ARCH_INLINE_READ_UNLOCK_IRQSAVE if !PREEMPT
> > +	select ARCH_INLINE_WRITE_LOCK if !PREEMPT
> > +	select ARCH_INLINE_WRITE_LOCK_BH if !PREEMPT
> > +	select ARCH_INLINE_WRITE_LOCK_IRQ if !PREEMPT
> > +	select ARCH_INLINE_WRITE_LOCK_IRQSAVE if !PREEMPT
> > +	select ARCH_INLINE_WRITE_UNLOCK if !PREEMPT
> > +	select ARCH_INLINE_WRITE_UNLOCK_BH if !PREEMPT
> > +	select ARCH_INLINE_WRITE_UNLOCK_IRQ if !PREEMPT
> > +	select ARCH_INLINE_WRITE_UNLOCK_IRQSAVE if !PREEMPT
> >  	select ARCH_USE_CMPXCHG_LOCKREF
> > +	select ARCH_USE_QUEUED_RWLOCKS
> >  	select ARCH_SUPPORTS_MEMORY_FAILURE
> >  	select ARCH_SUPPORTS_ATOMIC_RMW
> >  	select ARCH_SUPPORTS_NUMA_BALANCING
> 
> Inlining is good for performance, but it may come with an increase in
> kernel text size. Inlining unlock and unlock_irq is fine, but the other
> inlines will increase the text size of the kernel. Have you measured how
> much of a size increase that will be? Is there any concern about the
> increased text size?

Yes, I did look at the disassembly and bloat-o-meter results. Inlining
these functions means that the fastpath sits entirely within a 64-byte
cacheline and bloat-o-meter shows a relatively small increase in vmlinux
size for a defconfig build with CONFIG_PREEMPT disabled:

Total: Before=13800924, After=13812904, chg +0.09%

(I also just noticed my typos in ARCH_INLINE_{READ,WRITE}_UNLOCK_IRQSAVE
so I regenerated the numbers!)

Will

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v2 4/5] arm64: locking: Move rwlock implementation over to qrwlocks
@ 2017-10-11 11:49       ` Will Deacon
  0 siblings, 0 replies; 40+ messages in thread
From: Will Deacon @ 2017-10-11 11:49 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Waiman,

On Mon, Oct 09, 2017 at 09:34:08PM -0400, Waiman Long wrote:
> On 10/06/2017 09:34 AM, Will Deacon wrote:
> > Now that the qrwlock can make use of WFE, remove our homebrew rwlock
> > code in favour of the generic queued implementation.
> >
> > Signed-off-by: Will Deacon <will.deacon@arm.com>
> > ---
> >  arch/arm64/Kconfig                      |  17 ++++
> >  arch/arm64/include/asm/Kbuild           |   1 +
> >  arch/arm64/include/asm/spinlock.h       | 164 +-------------------------------
> >  arch/arm64/include/asm/spinlock_types.h |   6 +-
> >  4 files changed, 20 insertions(+), 168 deletions(-)
> >
> > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > index 0df64a6a56d4..6d32c9b0d4bb 100644
> > --- a/arch/arm64/Kconfig
> > +++ b/arch/arm64/Kconfig
> > @@ -22,7 +22,24 @@ config ARM64
> >  	select ARCH_HAS_STRICT_MODULE_RWX
> >  	select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
> >  	select ARCH_HAVE_NMI_SAFE_CMPXCHG if ACPI_APEI_SEA
> > +	select ARCH_INLINE_READ_LOCK if !PREEMPT
> > +	select ARCH_INLINE_READ_LOCK_BH if !PREEMPT
> > +	select ARCH_INLINE_READ_LOCK_IRQ if !PREEMPT
> > +	select ARCH_INLINE_READ_LOCK_IRQSAVE if !PREEMPT
> > +	select ARCH_INLINE_READ_UNLOCK if !PREEMPT
> > +	select ARCH_INLINE_READ_UNLOCK_BH if !PREEMPT
> > +	select ARCH_INLINE_READ_UNLOCK_IRQ if !PREEMPT
> > +	select ARCH_INLINE_READ_UNLOCK_IRQSAVE if !PREEMPT
> > +	select ARCH_INLINE_WRITE_LOCK if !PREEMPT
> > +	select ARCH_INLINE_WRITE_LOCK_BH if !PREEMPT
> > +	select ARCH_INLINE_WRITE_LOCK_IRQ if !PREEMPT
> > +	select ARCH_INLINE_WRITE_LOCK_IRQSAVE if !PREEMPT
> > +	select ARCH_INLINE_WRITE_UNLOCK if !PREEMPT
> > +	select ARCH_INLINE_WRITE_UNLOCK_BH if !PREEMPT
> > +	select ARCH_INLINE_WRITE_UNLOCK_IRQ if !PREEMPT
> > +	select ARCH_INLINE_WRITE_UNLOCK_IRQSAVE if !PREEMPT
> >  	select ARCH_USE_CMPXCHG_LOCKREF
> > +	select ARCH_USE_QUEUED_RWLOCKS
> >  	select ARCH_SUPPORTS_MEMORY_FAILURE
> >  	select ARCH_SUPPORTS_ATOMIC_RMW
> >  	select ARCH_SUPPORTS_NUMA_BALANCING
> 
> Inlining is good for performance, but it may come with an increase in
> kernel text size. Inlining unlock and unlock_irq is fine, but the other
> inlines will increase the text size of the kernel. Have you measured how
> much of a size increase that will be? Is there any concern about the
> increased text size?

Yes, I did look at the disassembly and bloat-o-meter results. Inlining
these functions means that the fastpath sits entirely within a 64-byte
cacheline and bloat-o-meter shows a relatively small increase in vmlinux
size for a defconfig build with CONFIG_PREEMPT disabled:

Total: Before=13800924, After=13812904, chg +0.09%

(I also just noticed my typos in ARCH_INLINE_{READ,WRITE}_UNLOCK_IRQSAVE
so I regenerated the numbers!)

Will

^ permalink raw reply	[flat|nested] 40+ messages in thread
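
The "Total: Before=..., After=..., chg +0.09%" line quoted above is the
summary line printed by the kernel's scripts/bloat-o-meter when
comparing two vmlinux images, e.g. (paths illustrative):

	./scripts/bloat-o-meter vmlinux.before vmlinux.after

along with a per-symbol breakdown of what grew and what shrank, which is
how the inlining cost can be attributed to individual functions.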

* Re: [PATCH v2 4/5] arm64: locking: Move rwlock implementation over to qrwlocks
  2017-10-11 11:49       ` Will Deacon
@ 2017-10-11 14:03         ` Waiman Long
  -1 siblings, 0 replies; 40+ messages in thread
From: Waiman Long @ 2017-10-11 14:03 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-kernel, linux-arm-kernel, Jeremy.Linton, peterz, mingo,
	boqun.feng, paulmck

On 10/11/2017 07:49 AM, Will Deacon wrote:
> Hi Waiman,
>
>>
>> Inlining is good for performance, but it may come with an increase in
>> kernel text size. Inlining unlock and unlock_irq is fine, but the other
>> inlines will increase the text size of the kernel. Have you measured how
>> much of a size increase that will be? Is there any concern about the
>> increased text size?
> Yes, I did look at the disassembly and bloat-o-meter results. Inlining
> these functions means that the fastpath sits entirely within a 64-byte
> cacheline and bloat-o-meter shows a relatively small increase in vmlinux
> size for a defconfig build with CONFIG_PREEMPT disabled:
>
> Total: Before=13800924, After=13812904, chg +0.09%
>
> (I also just noticed my typos in ARCH_INLINE_{READ,WRITE}_UNLOCK_IRQSAVE
> so I regenerated the numbers!)

The size increase looks small enough. That may be largely due to the
fact that rwlocks aren't used as frequently as spinlocks, so the number
of call sites is not that large. Anyway, I am OK with that; I just want
to make sure that people are aware of this.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v2 4/5] arm64: locking: Move rwlock implementation over to qrwlocks
@ 2017-10-11 14:03         ` Waiman Long
  0 siblings, 0 replies; 40+ messages in thread
From: Waiman Long @ 2017-10-11 14:03 UTC (permalink / raw)
  To: linux-arm-kernel

On 10/11/2017 07:49 AM, Will Deacon wrote:
> Hi Waiman,
>
>>
>> Inlining is good for performance, but it may come with an increase in
>> kernel text size. Inlining unlock and unlock_irq is fine, but the other
>> inlines will increase the text size of the kernel. Have you measured how
>> much of a size increase that will be? Is there any concern about the
>> increased text size?
> Yes, I did look at the disassembly and bloat-o-meter results. Inlining
> these functions means that the fastpath sits entirely within a 64-byte
> cacheline and bloat-o-meter shows a relatively small increase in vmlinux
> size for a defconfig build with CONFIG_PREEMPT disabled:
>
> Total: Before=13800924, After=13812904, chg +0.09%
>
> (I also just noticed my typos in ARCH_INLINE_{READ,WRITE}_UNLOCK_IRQSAVE
> so I regenerated the numbers!)

The size increase looks small enough. That may be largely due to the
fact that rwlocks aren't used as frequently as spinlocks, so the number
of call sites is not that large. Anyway, I am OK with that; I just want
to make sure that people are aware of this.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2017-10-11 14:03 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-10-06 13:34 [PATCH v2 0/5] Switch arm64 over to qrwlock Will Deacon
2017-10-06 13:34 ` Will Deacon
2017-10-06 13:34 ` [PATCH v2 1/5] kernel/locking: Use struct qrwlock instead of struct __qrwlock Will Deacon
2017-10-06 13:34   ` Will Deacon
2017-10-06 13:34 ` [PATCH v2 2/5] locking/atomic: Add atomic_cond_read_acquire Will Deacon
2017-10-06 13:34   ` Will Deacon
2017-10-06 13:34 ` [PATCH v2 3/5] kernel/locking: Use atomic_cond_read_acquire when spinning in qrwlock Will Deacon
2017-10-06 13:34   ` Will Deacon
2017-10-08  1:03   ` Boqun Feng
2017-10-08  1:03     ` Boqun Feng
2017-10-09 11:30     ` Will Deacon
2017-10-09 11:30       ` Will Deacon
2017-10-06 13:34 ` [PATCH v2 4/5] arm64: locking: Move rwlock implementation over to qrwlocks Will Deacon
2017-10-06 13:34   ` Will Deacon
2017-10-10  1:34   ` Waiman Long
2017-10-10  1:34     ` Waiman Long
2017-10-11 11:49     ` Will Deacon
2017-10-11 11:49       ` Will Deacon
2017-10-11 14:03       ` Waiman Long
2017-10-11 14:03         ` Waiman Long
2017-10-06 13:34 ` [PATCH v2 5/5] kernel/locking: Prevent slowpath writers getting held up by fastpath Will Deacon
2017-10-06 13:34   ` Will Deacon
2017-10-08 21:30 ` [PATCH v2 0/5] Switch arm64 over to qrwlock Yury Norov
2017-10-08 21:30   ` Yury Norov
2017-10-09  6:52   ` Peter Zijlstra
2017-10-09  6:52     ` Peter Zijlstra
2017-10-09 10:02     ` Will Deacon
2017-10-09 10:02       ` Will Deacon
2017-10-09  9:59   ` Will Deacon
2017-10-09  9:59     ` Will Deacon
2017-10-09 12:49     ` Yury Norov
2017-10-09 12:49       ` Yury Norov
2017-10-09 13:13       ` Will Deacon
2017-10-09 13:13         ` Will Deacon
2017-10-09 21:19 ` Waiman Long
2017-10-09 21:19   ` Waiman Long
2017-10-09 22:31 ` Jeremy Linton
2017-10-09 22:31   ` Jeremy Linton
2017-10-10 18:20 ` Adam Wallis
2017-10-10 18:20   ` Adam Wallis
